# Data Cleaning in Python

This dataset was obtained by web scraping the OpenRice website: https://sg.openrice.com/en/singapore

In [22]:
# Basic Libraries
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
sb.set()

---

### Checking the dataset

After importing the relevant libraries, we load the dataset and check it for missing values.

In [23]:
projectD = pd.read_csv('features2.csv')
projectD.head()

Unnamed: 0,name,street_address,price,cuisine,rating,latitude,longitude
0,1-V:U,"The Outpost Hotel Sentosa, 10 Artillery Avenue...",$31 - $50,Asian Variety,,1.252299,103.820211
1,10 At Claymore,"Pan Pacific Orchard, 10 Claymore Road Level 2",$51 - $80,Multi-Cuisine,4.0,1.307401,103.829904
2,10 SCOTTS,"Grand Hyatt Singapore, 10 Scotts Road Lobby Level",$31 - $50,,3.5,1.306345,103.833283
3,100g Korean BBQ,93 Amoy Street,$21 - $30,Korean,3.5,1.281299,103.847092
4,10th Chocolate Street,"GSM Buidling, 141 Middle Road",$11 - $20,Belgian,,1.299438,103.852403


In [24]:
projectD.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3998 entries, 0 to 3997
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   name            3998 non-null   object 
 1   street_address  3998 non-null   object 
 2   price           3992 non-null   object 
 3   cuisine         2874 non-null   object 
 4   rating          2287 non-null   float64
 5   latitude        3965 non-null   float64
 6   longitude       3965 non-null   float64
dtypes: float64(3), object(4)
memory usage: 218.8+ KB
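
The non-null counts above imply how many values are missing in each column (for example, 3998 - 2287 = 1711 missing ratings). As a minimal sketch, the same counts can be read directly with `isna().sum()`. The frame `toy` and its values are made up for illustration and are not taken from `features2.csv`:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset; values are made up for illustration
toy = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'price': ['$11 - $20', None, '$31 - $50', '$21 - $30'],
    'rating': [np.nan, 4.0, 3.5, np.nan],
})

# Number of missing values in each column
missing_counts = toy.isna().sum()
print(missing_counts)
```

On the real data, the equivalent call would presumably be `projectD.isna().sum()`.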


---

### Identifying missing values

The dataset has the following problems:

> **price** : missing values, and currently stored as strings, e.g. \\$11 - \\$20   
> **cuisine** : missing values, and some cells contain multiple values separated by "/" or ",", e.g. Italian, Japanese   
> **rating** : missing values, and no ratings lower than 2   
> **latitude** & **longitude** : missing values   
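
Observations like these could be surfaced with a few quick checks. The sketch below is one possible way to do it, run on a hypothetical frame `toy` whose rows only mimic the columns described above. It is not the procedure the notebook used.

```python
import numpy as np
import pandas as pd

# Hypothetical rows mimicking the columns described above
toy = pd.DataFrame({
    'price': ['$11 - $20', '$31 - $50', None],
    'cuisine': ['Italian, Japanese', 'Thai/Vietnamese', None],
    'rating': [3.5, np.nan, 2.0],
})

# Price is stored as text, so it cannot be averaged as a number yet
price_is_text = toy['price'].dropna().map(lambda v: isinstance(v, str)).all()

# Cells that hold more than one cuisine (separated by "," or "/")
multi_cuisine = toy['cuisine'].str.contains(r'[,/]', na=False)

# Lowest rating present in the data
lowest_rating = toy['rating'].min()

print(price_is_text, multi_cuisine.sum(), lowest_rating)
```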

We begin by dropping the rows with a missing price, latitude or longitude.

In [25]:
projectD = projectD.dropna(subset=['price','latitude','longitude'])
projectD.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3959 entries, 0 to 3997
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   name            3959 non-null   object 
 1   street_address  3959 non-null   object 
 2   price           3959 non-null   object 
 3   cuisine         2846 non-null   object 
 4   rating          2268 non-null   float64
 5   latitude        3959 non-null   float64
 6   longitude       3959 non-null   float64
dtypes: float64(3), object(4)
memory usage: 247.4+ KB


---

#### Cleaning "Rating"

We decided to fill the missing values with the **mode** of the ratings.

In [26]:
rate = projectD['rating']
rate.mode()

0    3.5
dtype: float64

In [27]:
# Fill missing ratings with the mode computed above (3.5) instead of hardcoding it
rate = rate.fillna(rate.mode()[0])
projectD['rating'] = rate
projectD.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3959 entries, 0 to 3997
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   name            3959 non-null   object 
 1   street_address  3959 non-null   object 
 2   price           3959 non-null   object 
 3   cuisine         2846 non-null   object 
 4   rating          3959 non-null   float64
 5   latitude        3959 non-null   float64
 6   longitude       3959 non-null   float64
dtypes: float64(3), object(4)
memory usage: 247.4+ KB


---

#### Cleaning "Cuisine"

Steps taken to clean cuisine:

> 1. Drop the rows with a missing cuisine   
> 2. Split each cuisine entry into its individual values, e.g. Teochew, Malay, Japanese   
> 3. Add a primary and a secondary cuisine column
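
The split step can be sketched on a few made-up strings. This assumes entries are separated by either ", " or "/", as described earlier. The series `cuisine` and the names `primary` and `secondary` are illustrative only.

```python
import pandas as pd

# Hypothetical cuisine strings using both separators
cuisine = pd.Series(['Singaporean, Chinese', 'Korean', 'Italian/Japanese'])

# The pattern is a regular expression: split on ", " or on "/"
parts = cuisine.str.split(r', |/', expand=True)

primary = parts[0]    # first listed cuisine, always present
secondary = parts[1]  # second listed cuisine, missing when only one is listed

print(primary.tolist())
print(secondary.isna().tolist())
```

With `expand=True`, each split position becomes its own column, so rows with fewer cuisines are padded with missing values.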

In [28]:
projectD = projectD.dropna(subset=['cuisine'])
projectD.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2846 entries, 0 to 3997
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   name            2846 non-null   object 
 1   street_address  2846 non-null   object 
 2   price           2846 non-null   object 
 3   cuisine         2846 non-null   object 
 4   rating          2846 non-null   float64
 5   latitude        2846 non-null   float64
 6   longitude       2846 non-null   float64
dtypes: float64(3), object(4)
memory usage: 177.9+ KB


In [29]:
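# Split on ", " or "/" (the pattern is a regex); one column per listed cuisine
cuisineData = projectD['cuisine'].str.split(", |/",expand=True)
cuisineData = cuisineData.fillna(np.nan)  # normalise missing entries to NaN
cuisineData1 = cuisineData[0]  # first listed cuisine
cuisineData2 = cuisineData[1]  # second listed cuisine, if any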

In [30]:
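#projectD = projectD.drop(columns=['cuisine'])
# Add one column per split position; the first two are the primary and secondary cuisine
ordinals = ['1st', '2nd', '3rd', '4th', '5th', '6th']
for label, col in zip(ordinals, cuisineData.columns):
    projectD[label + ' cuisine'] = cuisineData[col]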
projectD.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2846 entries, 0 to 3997
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   name            2846 non-null   object 
 1   street_address  2846 non-null   object 
 2   price           2846 non-null   object 
 3   cuisine         2846 non-null   object 
 4   rating          2846 non-null   float64
 5   latitude        2846 non-null   float64
 6   longitude       2846 non-null   float64
 7   1st cuisine     2846 non-null   object 
 8   2nd cuisine     759 non-null    object 
 9   3rd cuisine     129 non-null    object 
 10  4th cuisine     39 non-null     object 
 11  5th cuisine     4 non-null      object 
 12  6th cuisine     1 non-null      object 
dtypes: float64(3), object(10)
memory usage: 311.3+ KB


---

#### Cleaning "Price"

Price contains string values in the format \\$11 - \\$20 and "Under \\$10".

Steps taken to clean price:

> 1. Replace all "Under \\$10" with \\$11   
> 2. Split price into LowPrice and HighPrice   
> 3. Subtract 1 from LowPrice, as its values are formatted as \\$11, \\$31, etc.   
> 4. Slice the \\$ off each string and convert the value to an integer   
> 5. Calculate the midpoint of LowPrice and HighPrice   
> 6. Replace the price values in the dataframe with the newly calculated price value
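
As a worked example of the arithmetic in these steps, "\\$31 - \\$50" becomes (30 + 50) / 2 = 40. The sketch below applies the same rules to one label at a time. `price_midpoint` is a hypothetical helper written for illustration, not a function from the notebook, and it assumes every label without a "-" should be treated like "Under \\$10".

```python
import pandas as pd

def price_midpoint(label: str) -> float:
    """Midpoint of a price label such as '$31 - $50' (hypothetical helper)."""
    if '-' not in label:
        # Labels without a range (e.g. "Under $10") are treated as $11,
        # which becomes the range 10 to 10 below
        label = '$11'
    bounds = label.split(' - ')
    low = int(bounds[0].lstrip('$')) - 1  # $31 -> 30
    high = int(bounds[1].lstrip('$')) if len(bounds) > 1 else 10
    return (low + high) / 2

prices = pd.Series(['$31 - $50', '$11 - $20', 'Under $10'])
midpoints = prices.map(price_midpoint)
print(midpoints.tolist())  # [40.0, 15.0, 10.0]
```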

In [31]:
projectD['price'] = np.where(projectD['price'].str.contains("-"), projectD['price'], '$11')
projectD.head()

Unnamed: 0,name,street_address,price,cuisine,rating,latitude,longitude,1st cuisine,2nd cuisine,3rd cuisine,4th cuisine,5th cuisine,6th cuisine
0,1-V:U,"The Outpost Hotel Sentosa, 10 Artillery Avenue...",$31 - $50,Asian Variety,3.5,1.252299,103.820211,Asian Variety,,,,,
1,10 At Claymore,"Pan Pacific Orchard, 10 Claymore Road Level 2",$51 - $80,Multi-Cuisine,4.0,1.307401,103.829904,Multi-Cuisine,,,,,
3,100g Korean BBQ,93 Amoy Street,$21 - $30,Korean,3.5,1.281299,103.847092,Korean,,,,,
4,10th Chocolate Street,"GSM Buidling, 141 Middle Road",$11 - $20,Belgian,3.5,1.299438,103.852403,Belgian,,,,,
5,123 Seafood,"Chinatown Complex Market and Food Centre, 335 ...",$11,"Singaporean, Chinese",3.5,1.28236,103.843546,Singaporean,Chinese,,,,


In [32]:
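priceData = projectD['price'].str.split(" - ",expand=True)
# Lower bound: drop the leading "$", then subtract 1 ($31 -> 30)
LowPrice = priceData[0].str.slice(start=1)
LowPrice = LowPrice.astype('int') - 1
# Upper bound: drop the leading "$"; rows without a range ("$11") get 10
HighPrice = priceData[1]
HighPrice = HighPrice.str.slice(start=1)
HighPrice = HighPrice.fillna('10')
HighPrice = HighPrice.astype('int')
# Replace price with the midpoint of the two bounds
projectD['price'] = (HighPrice + LowPrice).div(2)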
projectD.head()

Unnamed: 0,name,street_address,price,cuisine,rating,latitude,longitude,1st cuisine,2nd cuisine,3rd cuisine,4th cuisine,5th cuisine,6th cuisine
0,1-V:U,"The Outpost Hotel Sentosa, 10 Artillery Avenue...",40.0,Asian Variety,3.5,1.252299,103.820211,Asian Variety,,,,,
1,10 At Claymore,"Pan Pacific Orchard, 10 Claymore Road Level 2",65.0,Multi-Cuisine,4.0,1.307401,103.829904,Multi-Cuisine,,,,,
3,100g Korean BBQ,93 Amoy Street,25.0,Korean,3.5,1.281299,103.847092,Korean,,,,,
4,10th Chocolate Street,"GSM Buidling, 141 Middle Road",15.0,Belgian,3.5,1.299438,103.852403,Belgian,,,,,
5,123 Seafood,"Chinatown Complex Market and Food Centre, 335 ...",10.0,"Singaporean, Chinese",3.5,1.28236,103.843546,Singaporean,Chinese,,,,


---

#### Save into a csv

In [153]:
projectD.to_csv("cleaned_data.csv")
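
By default `to_csv` also writes the row index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again. The sketch below shows the difference using an in-memory buffer and a made-up frame `toy` rather than `cleaned_data.csv`. Whether to keep the index is a judgment call that the notebook does not state.

```python
import io
import pandas as pd

# Hypothetical frame with a non-contiguous index, as left behind by dropna
toy = pd.DataFrame({'name': ['A', 'B'], 'price': [40.0, 15.0]}, index=[0, 3])

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()

# index=False leaves the index out of the file
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)
print(cols_without)
```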