# Data Cleaning in Python

This dataset was obtained by web scraping the OpenRice website: https://sg.openrice.com/en/singapore

In [1]:
# Basic Libraries
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
sb.set()

---

### Checking the dataset

After importing the relevant libraries, we load the dataset and check it for missing values.

In [2]:
projectD = pd.read_csv('features.csv')
projectD.head()

Unnamed: 0,name,street_address,price,cuisine,rating,latitude,longitude
0,1-V:U,"The Outpost Hotel Sentosa, 10 Artillery Avenue...",$31 - $50,Asian Variety,,1.252299,103.820211
1,10 At Claymore,"Pan Pacific Orchard, 10 Claymore Road Level 2",$51 - $80,Multi-Cuisine,4.0,1.307401,103.829904
2,10 SCOTTS,"Grand Hyatt Singapore, 10 Scotts Road Lobby Level",$31 - $50,,3.5,1.306345,103.833283
3,100g Korean BBQ,93 Amoy Street,$21 - $30,Korean,3.5,1.281299,103.847092
4,10th Chocolate Street,"GSM Buidling, 141 Middle Road",$11 - $20,Belgian,,1.299438,103.852403


In [3]:
projectD.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3998 entries, 0 to 3997
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   name            3998 non-null   object 
 1   street_address  3998 non-null   object 
 2   price           3992 non-null   object 
 3   cuisine         2874 non-null   object 
 4   rating          2287 non-null   float64
 5   latitude        3965 non-null   float64
 6   longitude       3965 non-null   float64
dtypes: float64(3), object(4)
memory usage: 218.8+ KB
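
The non-null counts above can also be expressed directly as missing counts per column. A minimal sketch on a made-up frame (`toy` and its values are assumptions for illustration, not rows from `features.csv`):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the scraped data (illustrative values only).
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "price": ["$11 - $20", None, "$31 - $50", "$21 - $30"],
    "rating": [4.0, np.nan, np.nan, 3.5],
})

# Number of missing entries per column, largest first.
missing = toy.isna().sum().sort_values(ascending=False)
print(missing)

# Share of rows affected in each column, as a percentage.
missing_pct = (toy.isna().mean() * 100).round(1)
print(missing_pct)
```

Applied to `projectD`, the same two calls would list `rating` and `cuisine` as the columns with the most gaps, in line with the `info()` output.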


---

### Identifying missing values

The dataset has the following problems:

> **price** : missing values & currently in string format eg.\\$11-\\$30   
> **cuisine** : missing values & contains multiple values in a cell separated by "/" and "," eg. Italian, Japanese   
> **rating** : missing values & does not contain ratings lower than 2   
> **latitude** & **longitude** : missing values   

We begin by dropping the rows with missing values in price, latitude or longitude. This removes 39 of the 3998 rows.

In [4]:
projectD = projectD.dropna(subset=['price','latitude','longitude'])
projectD.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3959 entries, 0 to 3997
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   name            3959 non-null   object 
 1   street_address  3959 non-null   object 
 2   price           3959 non-null   object 
 3   cuisine         2846 non-null   object 
 4   rating          2268 non-null   float64
 5   latitude        3959 non-null   float64
 6   longitude       3959 non-null   float64
dtypes: float64(3), object(4)
memory usage: 247.4+ KB


---

#### Cleaning "Rating"

We fill the missing values with the ***mode* of ratings**, since the mode is the measure of central tendency that makes sense for discrete rating categories.
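
To illustrate the reasoning, here is a small sketch on made-up ratings (the values are assumptions, chosen to sit on a half-point scale like the ones shown earlier). The mean can land between two rating steps, while the mode is always a value that actually occurs:

```python
import numpy as np
import pandas as pd

# Hypothetical ratings with two gaps.
ratings = pd.Series([3.5, 3.5, 4.0, 4.5, np.nan, np.nan])

mode_value = ratings.mode().iloc[0]  # most frequent rating
mean_value = ratings.mean()          # NaNs are skipped by default

# Fill the gaps with the mode so every value stays on the rating scale.
filled = ratings.fillna(mode_value)
print(mode_value, mean_value)
```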

In [5]:
rate = projectD['rating']
rate.mode()

0    3.5
dtype: float64

In [6]:
# Fill missing ratings with the mode computed above rather than a hard-coded value
rate = rate.fillna(rate.mode().iloc[0])
projectD['rating'] = rate
projectD.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3959 entries, 0 to 3997
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   name            3959 non-null   object 
 1   street_address  3959 non-null   object 
 2   price           3959 non-null   object 
 3   cuisine         2846 non-null   object 
 4   rating          3959 non-null   float64
 5   latitude        3959 non-null   float64
 6   longitude       3959 non-null   float64
dtypes: float64(3), object(4)
memory usage: 247.4+ KB


---

#### Cleaning "Price"

Price contains string values such as "\\$11 - \\$20" and "Under \\$10".

We want to **treat price as numerical data** because the price categories are ordered: some are meaningfully "closer" to each other than others, and we want our clustering algorithm to take this into account. For example, "Under 10" is close to "11 - 20" and far from "31 - 50".

Steps taken to clean price:

1. Replace every price that is not a range (i.e. "Under \\$10") with "\\$0 - \\$9"
2. Split price into LowPrice and HighPrice
3. Slice off the leading \\$ and convert each value to an integer
4. Calculate the midpoint of LowPrice and HighPrice
5. Add a new column, price_mid, to the dataframe with the calculated midpoint
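
The steps above can also be sketched as one small function applied to a single label. This is an illustrative alternative, not the notebook's own code: `price_midpoint` is a hypothetical helper, it uses a regular expression instead of string slicing, and the "Under $10" label is an assumed spelling of the lowest bracket.

```python
import re


def price_midpoint(price: str) -> float:
    """Return the midpoint of a price label such as '$31 - $50'.

    Labels without a range are treated as '$0 - $9' (an assumption
    that mirrors the replacement used in this notebook).
    """
    if "-" not in price:
        price = "$0 - $9"
    # Pull out the two integers, ignoring '$' signs and spacing.
    low, high = (int(n) for n in re.findall(r"\d+", price))
    return (low + high) / 2


print(price_midpoint("$31 - $50"))
print(price_midpoint("Under $10"))
```

On a dataframe this could be used as `projectD['price'].apply(price_midpoint)`.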

In [7]:
projectD['price'] = np.where(projectD['price'].str.contains("-"), projectD['price'], '$0 - $9')
projectD.head(6)

Unnamed: 0,name,street_address,price,cuisine,rating,latitude,longitude
0,1-V:U,"The Outpost Hotel Sentosa, 10 Artillery Avenue...",$31 - $50,Asian Variety,3.5,1.252299,103.820211
1,10 At Claymore,"Pan Pacific Orchard, 10 Claymore Road Level 2",$51 - $80,Multi-Cuisine,4.0,1.307401,103.829904
2,10 SCOTTS,"Grand Hyatt Singapore, 10 Scotts Road Lobby Level",$31 - $50,,3.5,1.306345,103.833283
3,100g Korean BBQ,93 Amoy Street,$21 - $30,Korean,3.5,1.281299,103.847092
4,10th Chocolate Street,"GSM Buidling, 141 Middle Road",$11 - $20,Belgian,3.5,1.299438,103.852403
5,123 Seafood,"Chinatown Complex Market and Food Centre, 335 ...",$0 - $9,"Singaporean, Chinese",3.5,1.28236,103.843546


In [8]:
# Split "$31 - $50" into its lower and upper bound
priceData = projectD['price'].str.split(" - ", expand=True)
# Drop the leading "$" and convert to integers
LowPrice = priceData[0].str.slice(start=1).astype('int')
HighPrice = priceData[1].str.slice(start=1)
HighPrice = HighPrice.fillna(10)  # fallback if a row has no upper bound
HighPrice = HighPrice.astype('int')
# Midpoint of the price range
projectD['price_mid'] = (HighPrice + LowPrice).div(2)
projectD.head(6)

Unnamed: 0,name,street_address,price,cuisine,rating,latitude,longitude,price_mid
0,1-V:U,"The Outpost Hotel Sentosa, 10 Artillery Avenue...",$31 - $50,Asian Variety,3.5,1.252299,103.820211,40.5
1,10 At Claymore,"Pan Pacific Orchard, 10 Claymore Road Level 2",$51 - $80,Multi-Cuisine,4.0,1.307401,103.829904,65.5
2,10 SCOTTS,"Grand Hyatt Singapore, 10 Scotts Road Lobby Level",$31 - $50,,3.5,1.306345,103.833283,40.5
3,100g Korean BBQ,93 Amoy Street,$21 - $30,Korean,3.5,1.281299,103.847092,25.5
4,10th Chocolate Street,"GSM Buidling, 141 Middle Road",$11 - $20,Belgian,3.5,1.299438,103.852403,15.5
5,123 Seafood,"Chinatown Complex Market and Food Centre, 335 ...",$0 - $9,"Singaporean, Chinese",3.5,1.28236,103.843546,4.5


---

#### Cleaning "Cuisine"

Each restaurant may have multiple cuisines attached to it; the categories are not mutually exclusive, so we will have to think about how to deal with that before we run clustering.

Before that, we parse the cuisine column, in which multiple cuisines are separated by slashes and commas. We build a list of every individual cuisine and add a new column that holds, for each restaurant, a list of indices into that cuisines list.

We drop the rows with missing cuisine values before doing this.


In [9]:
projectD = projectD.dropna(subset=['cuisine'])
projectD.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2846 entries, 0 to 3997
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   name            2846 non-null   object 
 1   street_address  2846 non-null   object 
 2   price           2846 non-null   object 
 3   cuisine         2846 non-null   object 
 4   rating          2846 non-null   float64
 5   latitude        2846 non-null   float64
 6   longitude       2846 non-null   float64
 7   price_mid       2846 non-null   float64
dtypes: float64(4), object(4)
memory usage: 200.1+ KB


In [10]:
# What cuisine categories do we have?
cuisines = []
for idx, row in projectD.iterrows():
    vals = row['cuisine']
    if vals == vals: #test for NaNs
        vals = vals.replace('/', ', ')
        vals = vals.split(', ')
        
        for val in vals:
            if val not in cuisines:
                cuisines.append(val)
                
print(cuisines)

Cuisines: 52
['Asian Variety', 'Multi-Cuisine', 'Korean', 'Belgian', 'Singaporean', 'Chinese', 'Cantonese', 'Hong Kong', 'Teochew', 'Malay', 'Middle Eastern', 'Mediterranean', 'Thai', 'Peranakan ', ' Nonya', 'Singaporean Chinese', 'American', 'Taiwanese', 'French', 'Italian', 'Indian', 'Fusion', 'Hainanese', 'Singaporean Western', 'Japanese', 'Malaysian', 'Fujian', 'Heng Hwa', 'European', 'Australian ', ' New Zealand', 'English', 'Shanghainese', 'Indonesian', 'Spanish', 'Mexican', 'Portuguese', 'Sichuan', 'Western Variety', 'German', 'Latin American', 'South American', 'Swiss', 'Vietnamese', 'Beijing', 'Russian', 'Foochow', 'Hakka', 'Caribbean', 'Filipino', 'Irish', 'Dong Bei']
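
The printed list contains entries such as `'Peranakan '` and `' Nonya'` that carry stray whitespace, which suggests some cells put spaces around the slash. A hedged sketch of a stricter parser follows. `split_cuisines` is a hypothetical helper, and the sample strings are assumptions about the raw cell format rather than confirmed rows:

```python
import re


def split_cuisines(cell: str) -> list:
    # Split on '/' or ',' and discard whitespace around the separator.
    return [part for part in re.split(r"\s*[/,]\s*", cell.strip()) if part]


rows = ["Singaporean, Chinese", "Peranakan / Nonya", "Korean"]

# Vocabulary in first-seen order, plus a name -> index lookup.
vocab = list(dict.fromkeys(c for row in rows for c in split_cuisines(row)))
index_of = {name: i for i, name in enumerate(vocab)}

# One list of cuisine indices per row.
cats = [[index_of[c] for c in split_cuisines(row)] for row in rows]
print(vocab)
print(cats)
```

Using a dictionary for the lookup avoids scanning the list on every `cuisines.index(...)` call. Adopting this parser would change the cuisine names, and possibly the indices, shown in the outputs that follow.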


In [15]:
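# Map each restaurant's cuisine string to a list of indices into `cuisines`.
# The column is addressed by name rather than by position (x[3]).
projectD['cuisine_cats'] = projectD['cuisine'].apply(
    lambda s: [] if s != s
    else [cuisines.index(i) for i in s.replace(', ', '/').split('/')])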
projectD.head()

Unnamed: 0,name,street_address,price,cuisine,rating,latitude,longitude,price_mid,cuisine_cats
0,1-V:U,"The Outpost Hotel Sentosa, 10 Artillery Avenue...",$31 - $50,Asian Variety,3.5,1.252299,103.820211,40.5,[0]
1,10 At Claymore,"Pan Pacific Orchard, 10 Claymore Road Level 2",$51 - $80,Multi-Cuisine,4.0,1.307401,103.829904,65.5,[1]
3,100g Korean BBQ,93 Amoy Street,$21 - $30,Korean,3.5,1.281299,103.847092,25.5,[2]
4,10th Chocolate Street,"GSM Buidling, 141 Middle Road",$11 - $20,Belgian,3.5,1.299438,103.852403,15.5,[3]
5,123 Seafood,"Chinatown Complex Market and Food Centre, 335 ...",$0 - $9,"Singaporean, Chinese",3.5,1.28236,103.843546,4.5,"[4, 5]"


We observe that there are 52 distinct cuisines and that a restaurant has at most 6 cuisines.

In [27]:
print("Cuisines: {}".format(len(cuisines)))
print("The most cuisines in a restaurant: {}".format(projectD['cuisine_cats'].apply(len).max()))

Cuisines: 52
The most cuisines in a restaurant: 6


We will turn the rows with more than one cuisine into multiple rows, each with one cuisine, in order to make it easier to do clustering on our data. We will save this duplicated data as a separate dataframe.

In [63]:
dupedD = projectD.explode('cuisine_cats').rename(columns={'cuisine_cats':'cuisine_cat'})
dupedD.head()

Unnamed: 0,name,street_address,price,cuisine,rating,latitude,longitude,price_mid,cuisine_cat
0,1-V:U,"The Outpost Hotel Sentosa, 10 Artillery Avenue...",$31 - $50,Asian Variety,3.5,1.252299,103.820211,40.5,0
1,10 At Claymore,"Pan Pacific Orchard, 10 Claymore Road Level 2",$51 - $80,Multi-Cuisine,4.0,1.307401,103.829904,65.5,1
3,100g Korean BBQ,93 Amoy Street,$21 - $30,Korean,3.5,1.281299,103.847092,25.5,2
4,10th Chocolate Street,"GSM Buidling, 141 Middle Road",$11 - $20,Belgian,3.5,1.299438,103.852403,15.5,3
5,123 Seafood,"Chinatown Complex Market and Food Centre, 335 ...",$0 - $9,"Singaporean, Chinese",3.5,1.28236,103.843546,4.5,4
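
To make the effect of `explode` concrete, here is a minimal sketch on a two-row toy frame (the frame and its values are assumptions for illustration). Note that the copies of a row keep the same index label unless the index is reset:

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["A", "B"],
    "cuisine_cats": [[4, 5], [2]],
})

# One output row per list element; the other columns are repeated.
exploded = toy.explode("cuisine_cats").rename(
    columns={"cuisine_cats": "cuisine_cat"})
print(exploded)  # restaurant "A" appears twice, both rows with index label 0

# Optional: give every row its own index label.
exploded = exploded.reset_index(drop=True)
```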


---

#### Saving to CSV

In [65]:
projectD.to_csv("cleaned_data.csv")
dupedD.to_csv("duped_data.csv")
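
One thing to keep in mind when reloading these files: a column of Python lists such as `cuisine_cats` is written to CSV as text, and the index is written as an extra unnamed column by default. A minimal round-trip sketch, using an in-memory buffer and a toy frame as stand-ins for the real files (both are assumptions for illustration):

```python
import ast
import io

import pandas as pd

toy = pd.DataFrame({"name": ["A", "B"], "cuisine_cats": [[4, 5], [2]]})

buffer = io.StringIO()            # in-memory stand-in for a file on disk
toy.to_csv(buffer, index=False)   # index=False skips the extra index column
buffer.seek(0)

loaded = pd.read_csv(buffer)
# The lists come back as strings such as "[4, 5]"; parse them into lists again.
loaded["cuisine_cats"] = loaded["cuisine_cats"].apply(ast.literal_eval)
print(loaded)
```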