## **Data Cleaning**
### This notebook implements the data cleaning steps. Each procedure is explained in the table below.

| Steps | Columns Affected | Justification |
|:--------|:--------|:--------|
|  Convert all string data to lowercase   |  All columns with categorical data in string format   |  Standardising the casing of categorical string data keeps labels consistent and simplifies downstream preprocessing   |
|  Remove duplicated labels   |  - `flat_type`   | The `flat_type` column contains the same labels in different forms. For an x-room flat, where x is 2, 3, 4 or 5 depending on the flat type, there is both an `x room` label and an `x-room` label. Since both mean the same thing, we standardise all such cases to the `x-room` label   |
|  Remove columns where there is only a single label  | - `furnished` <br> - `elevation`   |  Since each of these columns contains only one label, they provide no value in distinguishing between the rental prices of different houses, so we remove them  |
|  Remove columns deemed unhelpful to analysis | - `block` <br> - `town` <br> - `subzone` <br> - `street_name` | These location columns are deemed unhelpful to analysis because: <br> 1. The relationship between these data points and rental price is arbitrary, given the high variation in their values. For example, block numbers (257, 416a, 341b, etc.) have no clear impact on rental pricing. <br> 2. Location is already encoded in the `latitude` and `longitude` columns, which give a better indication of a house's location. <br> 3. `town` and `planning_area` provide similar information. Of the two, `town` was dropped because it separates the data poorly: its `kallang/whampoa` label clusters the kallang and whampoa areas together, whereas the `planning_area` column keeps them separate. |
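
The single-label criterion in the table can be checked programmatically. The snippet below is a minimal sketch: `single_label_columns` is a hypothetical helper (it is not part of `src.cleaning`), and the toy DataFrame is made up for illustration rather than taken from `data/train.csv`.

```python
import pandas as pd


def single_label_columns(df: pd.DataFrame) -> list:
    """Return the names of columns holding exactly one distinct non-null value."""
    return [col for col in df.columns if df[col].nunique(dropna=True) == 1]


# Toy data for illustration only; not the real training set.
toy = pd.DataFrame(
    {
        "furnished": ["yes", "yes", "yes"],
        "elevation": [0.0, 0.0, 0.0],
        "flat_type": ["3-room", "4-room", "3-room"],
    }
)

constant_cols = single_label_columns(toy)
print(constant_cols)  # ['furnished', 'elevation']
```

Running the same helper on the real DataFrame would be one way to confirm that `furnished` and `elevation` are the only constant columns.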

In [10]:
import pandas as pd
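# Import only the helpers used in this notebook instead of a wildcard import
from src.cleaning import (
    clean_flat_type_labels,
    convert_strings_to_lowercase,
    drop_data,
)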

%load_ext autoreload
%autoreload 2
%matplotlib inline

In [11]:
CSV_FILE = "data/train.csv"
NEW_CSV_FILE = "data/train_cleaned.csv"

In [12]:
df = pd.read_csv(CSV_FILE)
df.head()

Unnamed: 0,rent_approval_date,town,block,street_name,flat_type,flat_model,floor_area_sqm,furnished,lease_commence_date,latitude,longitude,elevation,subzone,planning_area,region,monthly_rent
0,2021-09,jurong east,257,Jurong East Street 24,3 room,new generation,67.0,yes,1983,1.344518,103.73863,0.0,yuhua east,jurong east,west region,1600
1,2022-05,bedok,119,bedok north road,4-room,new generation,92.0,yes,1978,1.330186,103.938717,0.0,bedok north,bedok,east region,2250
2,2022-10,toa payoh,157,lorong 1 toa payoh,3-room,improved,67.0,yes,1971,1.332242,103.845643,0.0,toa payoh central,toa payoh,central region,1900
3,2021-08,pasir ris,250,Pasir Ris Street 21,executive,apartment,149.0,yes,1993,1.370239,103.962894,0.0,pasir ris drive,pasir ris,east region,2850
4,2022-11,kallang/whampoa,34,Whampoa West,3-room,improved,68.0,yes,1972,1.320502,103.863341,0.0,bendemeer,kallang,central region,2100


#### Step 1: Convert all string data to lowercase

In [13]:
df = convert_strings_to_lowercase(df)
df.head()

Unnamed: 0,rent_approval_date,town,block,street_name,flat_type,flat_model,floor_area_sqm,furnished,lease_commence_date,latitude,longitude,elevation,subzone,planning_area,region,monthly_rent
0,2021-09,jurong east,257,jurong east street 24,3 room,new generation,67.0,yes,1983,1.344518,103.73863,0.0,yuhua east,jurong east,west region,1600
1,2022-05,bedok,119,bedok north road,4-room,new generation,92.0,yes,1978,1.330186,103.938717,0.0,bedok north,bedok,east region,2250
2,2022-10,toa payoh,157,lorong 1 toa payoh,3-room,improved,67.0,yes,1971,1.332242,103.845643,0.0,toa payoh central,toa payoh,central region,1900
3,2021-08,pasir ris,250,pasir ris street 21,executive,apartment,149.0,yes,1993,1.370239,103.962894,0.0,pasir ris drive,pasir ris,east region,2850
4,2022-11,kallang/whampoa,34,whampoa west,3-room,improved,68.0,yes,1972,1.320502,103.863341,0.0,bendemeer,kallang,central region,2100
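
The body of `convert_strings_to_lowercase` lives in `src/cleaning.py` and is not shown in this notebook. As a rough sketch of what such a step could look like, the hypothetical function below (`lowercase_string_columns`, an assumed name) lowercases every string value in the non-numeric columns and leaves numeric columns untouched. The toy rows reuse two street names from the output above.

```python
import pandas as pd


def lowercase_string_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with string values in non-numeric columns lowercased."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            continue  # numbers have no casing to normalise
        # Lowercase only real strings so missing values pass through unchanged.
        out[col] = out[col].map(lambda v: v.lower() if isinstance(v, str) else v)
    return out


toy = pd.DataFrame(
    {
        "street_name": ["Jurong East Street 24", "bedok north road"],
        "floor_area_sqm": [67.0, 92.0],
    }
)

lowered = lowercase_string_columns(toy)
print(lowered["street_name"].tolist())
```

Working on a copy keeps the input DataFrame unchanged, which makes the cell safe to re-run.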


#### Step 2: Standardise the labels for `flat_type` column

In [14]:
df = clean_flat_type_labels(df)
df['flat_type'].unique()

array(['3-room', '4-room', 'executive', '5-room', '2-room'], dtype=object)
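
The actual `clean_flat_type_labels` implementation is not shown here. One possible way to map `x room` labels to `x-room` is a regular-expression replacement, sketched below with a hypothetical function name (`standardise_flat_type`) and toy labels.

```python
import pandas as pd


def standardise_flat_type(df: pd.DataFrame, column: str = "flat_type") -> pd.DataFrame:
    """Return a copy of df where 'x room' labels are rewritten as 'x-room'."""
    out = df.copy()
    # Match a single digit, whitespace, then 'room'; keep the digit via \1.
    out[column] = out[column].str.replace(r"^(\d)\s+room$", r"\1-room", regex=True)
    return out


# Toy labels for illustration only.
toy = pd.DataFrame({"flat_type": ["3 room", "4-room", "executive", "5 room", "2-room"]})

cleaned = standardise_flat_type(toy)
print(cleaned["flat_type"].unique().tolist())
# ['3-room', '4-room', 'executive', '5-room', '2-room']
```

Labels that do not match the pattern, such as `executive`, are left as they are.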

#### Step 3: Remove unwanted columns from dataset

In [15]:
columns_to_remove = ['furnished', 'elevation', 'block', 'town', 'subzone', 'street_name']

In [16]:
df = drop_data(df, columns_to_remove)
df.columns

Index(['rent_approval_date', 'flat_type', 'flat_model', 'floor_area_sqm',
       'lease_commence_date', 'latitude', 'longitude', 'planning_area',
       'region', 'monthly_rent'],
      dtype='object')
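
`drop_data` is defined in `src/cleaning.py`, so its exact behaviour is not visible in this notebook. A minimal sketch of an equivalent step, assuming it simply wraps `DataFrame.drop`, is shown below. The name `drop_columns` and the toy frame are illustrative only.

```python
import pandas as pd


def drop_columns(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return a copy of df without the given columns.

    DataFrame.drop raises KeyError by default if any name is missing,
    which helps catch typos in the column list.
    """
    return df.drop(columns=columns)


# Toy frame for illustration only.
toy = pd.DataFrame(
    {
        "furnished": ["yes", "yes"],
        "elevation": [0.0, 0.0],
        "flat_type": ["3-room", "4-room"],
        "monthly_rent": [1600, 2250],
    }
)

trimmed = drop_columns(toy, ["furnished", "elevation"])
print(list(trimmed.columns))  # ['flat_type', 'monthly_rent']
```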

#### Step 4: Save the cleaned dataset as a new CSV file for further processing

In [17]:
df.to_csv(NEW_CSV_FILE)
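
One detail to keep in mind when the cleaned file is loaded later: `to_csv` writes the row index by default, so a plain `pd.read_csv` on the saved file gains an extra `Unnamed: 0` column. Whether the index is wanted downstream is not stated in this notebook, so the sketch below only demonstrates the difference on a toy DataFrame, using in-memory buffers instead of real files.

```python
import io

import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({"flat_type": ["3-room", "4-room"], "monthly_rent": [1600, 2250]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_default = pd.read_csv(with_index)

# index=False leaves the index out of the file.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded_default.columns))  # ['Unnamed: 0', 'flat_type', 'monthly_rent']
print(list(reloaded_clean.columns))    # ['flat_type', 'monthly_rent']
```

Either passing `index=False` when saving or `index_col=0` when loading avoids the extra column.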