## Data Cleaning & Preparation

### **Understanding null, undefined, and NaN**
**1. Null Value**

A null value represents a reference that points, generally intentionally, to a nonexistent or invalid object or address. Although `typeof null` returns `"object"`, null is not an object: it is one of JavaScript's primitive values.


In basic arithmetic operations, null is converted to 0.


**2. Undefined**

The global undefined property represents the primitive value undefined, which is one of JavaScript's primitive types. It tells us that something is not defined. You get it, for example, when you read a variable that has been declared but has no assigned value.

### **What's the difference? Null vs Undefined**

**Similarities**
- Both are falsy: negating either one gives true. However, neither of them equals true or false.
- Both represent something that does not exist.

**Differences**
- Null represents 'nothing', a value that does not exist at all; undefined represents something that has not been defined.
- Undefined has its own data type (`typeof undefined` is `"undefined"`), while `typeof null` reports `"object"`.
- Null is treated as 0 in basic arithmetic operations, whereas undefined yields NaN.

**3. NaN (Not a Number)**

The global NaN property is a value representing Not-A-Number.

JavaScript returns this value when the result we expected to be a number is not a number. For example, when you try to subtract "cucumber" from 10 or divide 12 by "R2D2".
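The definitions above come from JavaScript, but the rest of this notebook uses Python, where NaN behaves in a similar way. The following is a minimal sketch (the variable names are only illustrative) of three properties worth remembering: NaN is not equal to itself, it needs a dedicated check, and it propagates through arithmetic.

```python
import math
import numpy as np

# NaN is a float value that is not equal to anything, including itself
nan_value = float("nan")
is_equal_to_itself = nan_value == nan_value

# Use a dedicated check instead of the == operator
detected_with_math = math.isnan(nan_value)
detected_with_numpy = bool(np.isnan(np.nan))

# NaN propagates through arithmetic: any operation involving NaN gives NaN
propagated = 10 - np.nan

print(is_equal_to_itself, detected_with_math, detected_with_numpy, propagated)
```

Because the equality check fails, filtering with `== np.nan` never finds missing values, which is why pandas provides `isna()`.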

### **Why are NaN values dangerous?**

NaN values are dangerous in two ways:

- They change metrics such as the mean or median, and therefore give wrong information to scientists.
- Many algorithms implemented in scikit-learn cannot be fitted on datasets that contain such values (try fitting a `DecisionTreeClassifier` on the heart-disease dataset; older scikit-learn versions raise an error).
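To make the first point concrete, here is a small sketch with made-up numbers. It compares a plain NumPy mean, which returns NaN as soon as one value is missing, with the NaN-aware alternatives, which silently compute the result from the remaining values only.

```python
import numpy as np
import pandas as pd

# Made-up measurements with one missing value
values = [10.0, 20.0, np.nan, 30.0]

# A plain mean is poisoned by the single NaN
numpy_mean = np.mean(np.array(values))

# NaN-aware versions skip the missing value and average the other three
nan_aware_mean = np.nanmean(np.array(values))
pandas_mean = pd.Series(values).mean()  # pandas skips NaN by default

print(numpy_mean, nan_aware_mean, pandas_mean)
```

Both behaviours can mislead: the first hides the metric entirely, and the second reports a mean based on fewer observations than the dataset appears to contain.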

### **How to Detect Missing Values?**

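In Python, we can detect missing values using the pandas library. `isnull()` and `isna()` are aliases: called on a DataFrame `df`, each returns a Boolean mask, and chaining `.sum()` counts the missing values per column.

`df.isnull()`

`df.isna()`

`df.isnull().sum()`

`df.isna().sum()`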
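A short sketch of these calls on a toy DataFrame (the column names and values are invented for illustration) may help before applying them to real data:

```python
import numpy as np
import pandas as pd

# Toy data: one missing number and one missing string
toy = pd.DataFrame({
    "rooms": [2, 3, np.nan, 4],
    "suburb": ["Abbotsford", None, "Yarraville", "Newport"],
})

# Boolean mask: True wherever a value is missing
mask = toy.isna()

# Number of missing values per column, and in total
missing_per_column = toy.isna().sum()
total_missing = int(missing_per_column.sum())

print(mask)
print(missing_per_column)
print(total_missing)
```

Note that pandas treats both `np.nan` and `None` as missing.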

### **Common types of missing data:**

`1. Missing Completely at Random (MCAR)`

- Missing completely at random (MCAR) analysis assumes that missingness is unrelated to any observed or unobserved data (response and covariate), meaning that the probability of a value being missing is independent of any observation in the data set. When we say data are missing completely at random, we mean that the missingness has nothing to do with the person being studied.

`2. Missing At Random (MAR)`

- Missing at random (MAR) occurs when the missingness is not random, but can be fully accounted for by variables for which there is complete information. When we say data are missing at random, we mean that the missingness has to do with the person but can be predicted from other information about the person.

`3. Missing Not At Random (MNAR)`

- Missing not at random (MNAR) (also known as nonignorable nonresponse) is data that is neither MAR nor MCAR. When we say that the data are missing not at random, we mean that the missingness is specifically related to what is missing (the value of the variable that's missing is related to the reason it's missing).
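The three mechanisms can be simulated on synthetic data. The sketch below is only an illustration: the `age` and `income` columns, the probabilities, and the thresholds are all assumptions chosen to make the differences visible.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Synthetic people: age is always observed, income may go missing
people = pd.DataFrame({
    "age": rng.integers(18, 80, size=n),
    "income": rng.normal(50_000, 15_000, size=n),
})

# MCAR: every income has the same 20% chance of being missing
mcar_mask = rng.random(n) < 0.2

# MAR: missingness depends on an observed column (age), not on income itself
mar_mask = rng.random(n) < np.where(people["age"] > 60, 0.4, 0.1)

# MNAR: missingness depends on the missing value itself
# (here, high incomes are hidden more often)
high_income = people["income"] > people["income"].quantile(0.75)
mnar_mask = rng.random(n) < np.where(high_income, 0.5, 0.05)

# Apply the MNAR mask and compare the observed mean with the true mean
income_mnar = people["income"].mask(mnar_mask)
print(people["income"].mean(), income_mnar.mean())
```

Under the MNAR mask the observed mean is biased downward, because the values most likely to be missing are the large ones.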

In [None]:
import pandas as pd
import numpy as np

### Quality Checking

In [None]:
df = pd.read_csv('data/melb_data.csv')

In [None]:
# show top data
df.head()

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,3067.0,...,1.0,1.0,202.0,,,Yarra,-37.7996,144.9984,Northern Metropolitan,4019.0
1,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,3067.0,...,1.0,0.0,156.0,79.0,1900.0,Yarra,-37.8079,144.9934,Northern Metropolitan,4019.0
2,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,3067.0,...,2.0,0.0,134.0,150.0,1900.0,Yarra,-37.8093,144.9944,Northern Metropolitan,4019.0
3,Abbotsford,40 Federation La,3,h,850000.0,PI,Biggin,4/03/2017,2.5,3067.0,...,2.0,1.0,94.0,,,Yarra,-37.7969,144.9969,Northern Metropolitan,4019.0
4,Abbotsford,55a Park St,4,h,1600000.0,VB,Nelson,4/06/2016,2.5,3067.0,...,1.0,2.0,120.0,142.0,2014.0,Yarra,-37.8072,144.9941,Northern Metropolitan,4019.0


In [None]:
# show bottom data
df.tail()

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
13575,Wheelers Hill,12 Strada Cr,4,h,1245000.0,S,Barry,26/08/2017,16.7,3150.0,...,2.0,2.0,652.0,,1981.0,,-37.90562,145.16761,South-Eastern Metropolitan,7392.0
13576,Williamstown,77 Merrett Dr,3,h,1031000.0,SP,Williams,26/08/2017,6.8,3016.0,...,2.0,2.0,333.0,133.0,1995.0,,-37.85927,144.87904,Western Metropolitan,6380.0
13577,Williamstown,83 Power St,3,h,1170000.0,S,Raine,26/08/2017,6.8,3016.0,...,2.0,4.0,436.0,,1997.0,,-37.85274,144.88738,Western Metropolitan,6380.0
13578,Williamstown,96 Verdon St,4,h,2500000.0,PI,Sweeney,26/08/2017,6.8,3016.0,...,1.0,5.0,866.0,157.0,1920.0,,-37.85908,144.89299,Western Metropolitan,6380.0
13579,Yarraville,6 Agnes St,4,h,1285000.0,SP,Village,26/08/2017,6.3,3013.0,...,1.0,1.0,362.0,112.0,1920.0,,-37.81188,144.88449,Western Metropolitan,6543.0


In [None]:
# show info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13580 entries, 0 to 13579
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Suburb         13580 non-null  object 
 1   Address        13580 non-null  object 
 2   Rooms          13580 non-null  int64  
 3   Type           13580 non-null  object 
 4   Price          13580 non-null  float64
 5   Method         13580 non-null  object 
 6   SellerG        13580 non-null  object 
 7   Date           13580 non-null  object 
 8   Distance       13580 non-null  float64
 9   Postcode       13580 non-null  float64
 10  Bedroom2       13580 non-null  float64
 11  Bathroom       13580 non-null  float64
 12  Car            13518 non-null  float64
 13  Landsize       13580 non-null  float64
 14  BuildingArea   7130 non-null   float64
 15  YearBuilt      8205 non-null   float64
 16  CouncilArea    12211 non-null  object 
 17  Lattitude      13580 non-null  float64
 18  Longti

In [None]:
df["Method"].unique()

array(['S', 'SP', 'PI', 'VB', 'SA'], dtype=object)

In [None]:
# show missing values
df.isnull().sum()

Suburb              0
Address             0
Rooms               0
Type                0
Price               0
Method              0
SellerG             0
Date                0
Distance            0
Postcode            0
Bedroom2            0
Bathroom            0
Car                62
Landsize            0
BuildingArea     6450
YearBuilt        5375
CouncilArea      1369
Lattitude           0
Longtitude          0
Regionname          0
Propertycount       0
dtype: int64

In [None]:
# show descriptive stats
df.describe()

Unnamed: 0,Rooms,Price,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Lattitude,Longtitude,Propertycount
count,13580.0,13580.0,13580.0,13580.0,13580.0,13580.0,13518.0,13580.0,7130.0,8205.0,13580.0,13580.0,13580.0
mean,2.937997,1075684.0,10.137776,3105.301915,2.914728,1.534242,1.610075,558.416127,151.96765,1964.684217,-37.809203,144.995216,7454.417378
std,0.955748,639310.7,5.868725,90.676964,0.965921,0.691712,0.962634,3990.669241,541.014538,37.273762,0.07926,0.103916,4378.581772
min,1.0,85000.0,0.0,3000.0,0.0,0.0,0.0,0.0,0.0,1196.0,-38.18255,144.43181,249.0
25%,2.0,650000.0,6.1,3044.0,2.0,1.0,1.0,177.0,93.0,1940.0,-37.856822,144.9296,4380.0
50%,3.0,903000.0,9.2,3084.0,3.0,1.0,2.0,440.0,126.0,1970.0,-37.802355,145.0001,6555.0
75%,3.0,1330000.0,13.0,3148.0,3.0,2.0,2.0,651.0,174.0,1999.0,-37.7564,145.058305,10331.0
max,10.0,9000000.0,48.1,3977.0,20.0,8.0,10.0,433014.0,44515.0,2018.0,-37.40853,145.52635,21650.0


### **How to handle missing values?**

There are two main ways to handle missing values:
- Erase the rows that have NaN values. This is often not a good choice because we lose information, especially when we work with small datasets.
- Impute NaN values with specific methods or values. These methods are covered below.

There are many ways to impute these gaps, and data scientists, especially newcomers, often do not know them. Here are the main options:

- Impute with a specific value.
- Impute with a summary statistic, for example the mean or median.
- Impute using a method such as MICE or KNN.
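The notebook itself does not run KNN imputation, so the following is only a minimal sketch of what it could look like, assuming scikit-learn's `KNNImputer` and a made-up toy table. Each missing value is replaced by the average of that column over the nearest rows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data (invented): building_area is missing for two houses
toy = pd.DataFrame({
    "rooms": [2, 3, 3, 4, 2],
    "building_area": [80.0, 120.0, np.nan, 160.0, np.nan],
})

# Fill each gap from the 2 most similar rows (n_neighbors is an assumption)
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(imputed)
```

Unlike a single global mean, the filled value depends on the other features of the row, which tends to preserve relationships between columns.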

### Reference

1. https://codeburst.io/understanding-null-undefined-and-nan-b603cb74b44c
2. https://towardsdatascience.com/whats-the-best-way-to-handle-nan-values-62d50f738fc
3. https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data
4. https://www.kaggle.com/datasets/dansbecker/melbourne-housing-snapshot

**1. Dropping Missing Values/Delete**

> ### Listwise Dropping

In [None]:
# copy dataset
df1 = df.copy()

In [None]:
df1.shape

(13580, 21)

In [None]:
df1[df1["Car"].isna()]

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
12221,Ascot Vale,132 The Parade,3,h,985000.0,S,Brad,3/09/2017,4.3,3032.0,...,1.0,,245.0,91.0,1945.0,,-37.77215,144.91144,Western Metropolitan,6567.0
12247,Brunswick East,18 Ethel St,2,h,1023000.0,S,Domain,3/09/2017,4.0,3057.0,...,1.0,,154.0,76.0,1890.0,,-37.77221,144.97537,Northern Metropolitan,5533.0
12259,Clifton Hill,34 Fenwick St,3,h,1436000.0,S,Jellis,3/09/2017,3.6,3068.0,...,2.0,,123.0,128.0,1990.0,,-37.78888,145.00036,Northern Metropolitan,2954.0
12320,Glen Waverley,19 Diamond Av,3,h,1370000.0,S,Fletchers,3/09/2017,16.7,3150.0,...,1.0,,652.0,,,,-37.87170,145.17267,Eastern Metropolitan,15321.0
12362,Newport,11 Collingwood Rd,4,h,1180000.0,PI,Williams,3/09/2017,6.2,3015.0,...,1.0,,545.0,,,,-37.84399,144.89125,Western Metropolitan,5498.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13496,Moonee Ponds,46 Eglinton St,4,h,1525000.0,S,Nelson,26/08/2017,6.2,3039.0,...,3.0,,233.0,,,,-37.76884,144.91708,Western Metropolitan,6232.0
13508,North Melbourne,9 Erskine St,2,h,1080000.0,S,Jellis,26/08/2017,1.8,3051.0,...,1.0,,100.0,67.0,1890.0,,-37.79524,144.94642,Northern Metropolitan,6821.0
13522,Port Melbourne,201 Stokes St,2,h,1515000.0,SP,Marshall,26/08/2017,3.5,3207.0,...,2.0,,197.0,,,,-37.83754,144.93954,Southern Metropolitan,8648.0
13524,Prahran,17 Packington Pl,2,h,1365000.0,S,Jellis,26/08/2017,4.6,3181.0,...,1.0,,206.0,100.0,1900.0,,-37.85569,145.00522,Southern Metropolitan,7717.0


In [None]:
# dropping values
df1_dropped1 = df1.dropna()

In [None]:
# check shape after dropping
df1_dropped1.shape

(6196, 21)

In [None]:
# reset index
df1_dropped1 = df1_dropped1.reset_index(drop=True)

In [None]:
# checking null values after dropping
df1_dropped1.isna().sum()

Suburb           0
Address          0
Rooms            0
Type             0
Price            0
Method           0
SellerG          0
Date             0
Distance         0
Postcode         0
Bedroom2         0
Bathroom         0
Car              0
Landsize         0
BuildingArea     0
YearBuilt        0
CouncilArea      0
Lattitude        0
Longtitude       0
Regionname       0
Propertycount    0
dtype: int64
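The cost of listwise dropping can be worked out from the two shapes printed above. A small sketch of that arithmetic:

```python
# Shapes reported above: (13580, 21) before and (6196, 21) after dropna()
rows_before = 13580
rows_after = 6196

# How many rows were removed, and what share of the dataset that is
rows_lost = rows_before - rows_after
lost_fraction = rows_lost / rows_before

print(f"rows lost: {rows_lost} ({lost_fraction:.1%})")
```

More than half of the rows disappear, which is why listwise dropping is rarely a good first choice for this dataset.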

**2. Dropping Columns**

In [None]:
# copy dataset
df2 = df.copy()

In [None]:
df2.isna().sum()

Suburb              0
Address             0
Rooms               0
Type                0
Price               0
Method              0
SellerG             0
Date                0
Distance            0
Postcode            0
Bedroom2            0
Bathroom            0
Car                62
Landsize            0
BuildingArea     6450
YearBuilt        5375
CouncilArea      1369
Lattitude           0
Longtitude          0
Regionname          0
Propertycount       0
dtype: int64

In [None]:
# dropping entire column (BuildingArea)
df2_dropped = df2.drop('BuildingArea', axis=1)

In [None]:
# check data after dropping column
df2_dropped.isna().sum()

Suburb              0
Address             0
Rooms               0
Type                0
Price               0
Method              0
SellerG             0
Date                0
Distance            0
Postcode            0
Bedroom2            0
Bathroom            0
Car                62
Landsize            0
YearBuilt        5375
CouncilArea      1369
Lattitude           0
Longtitude          0
Regionname          0
Propertycount       0
dtype: int64

In [None]:
# dropping multiple columns (YearBuilt, CouncilArea) - Alternative
df2_dropped = df2_dropped.drop(columns = ['YearBuilt','CouncilArea'])

In [None]:
# check data after dropping column
df2_dropped.isna().sum()

Suburb            0
Address           0
Rooms             0
Type              0
Price             0
Method            0
SellerG           0
Date              0
Distance          0
Postcode          0
Bedroom2          0
Bathroom          0
Car              62
Landsize          0
Lattitude         0
Longtitude        0
Regionname        0
Propertycount     0
dtype: int64

In [None]:
# check shape
df2_dropped.shape

(13580, 18)

In [None]:
# we can use dropna once we see the data is suitable for it (null values make up less than 2% or 3% of the rows)
df2_dropped = df2_dropped.dropna()

In [None]:
# reset index
df2_dropped = df2_dropped.reset_index(drop = True)

In [None]:
# check shape after dropping
df2_dropped.shape

(13518, 18)

In [None]:
# check missing values after dropping rows
df2_dropped.isna().sum()

Suburb           0
Address          0
Rooms            0
Type             0
Price            0
Method           0
SellerG          0
Date             0
Distance         0
Postcode         0
Bedroom2         0
Bathroom         0
Car              0
Landsize         0
Lattitude        0
Longtitude       0
Regionname       0
Propertycount    0
dtype: int64
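The 2% to 3% rule of thumb mentioned in the comment above can be checked with the missing counts printed earlier. The sketch below hard-codes those counts; the 3% threshold is an assumption taken from that comment, not a fixed rule.

```python
import pandas as pd

# Missing counts and row count as reported by df.isnull().sum() and df.shape above
missing_counts = pd.Series({
    "Car": 62,
    "BuildingArea": 6450,
    "YearBuilt": 5375,
    "CouncilArea": 1369,
})
n_rows = 13580

# Percentage of rows with a missing value in each column
# (on the real DataFrame this is equivalent to df.isna().mean() * 100)
missing_pct = (missing_counts / n_rows * 100).round(2)

# Columns sparse enough in missing values that dropping rows is cheap
threshold_pct = 3.0  # assumption based on the comment above
drop_rows_for = missing_pct[missing_pct < threshold_pct].index.tolist()

print(missing_pct)
print(drop_rows_for)
```

Only `Car` falls under the threshold, which matches the choice made above: the heavily missing columns were dropped first, and `dropna()` was applied afterwards.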

**3. Pairwise Dropping**

In [None]:
df3 = df.copy()

In [None]:
# drop only the rows where 'Car' is missing
df3_dropped = df3.dropna(subset=['Car'])

In [None]:
# check shape
df3_dropped.shape

(13518, 21)

In [None]:
df3.isna().sum()

Suburb              0
Address             0
Rooms               0
Type                0
Price               0
Method              0
SellerG             0
Date                0
Distance            0
Postcode            0
Bedroom2            0
Bathroom            0
Car                62
Landsize            0
BuildingArea     6450
YearBuilt        5375
CouncilArea      1369
Lattitude           0
Longtitude          0
Regionname          0
Propertycount       0
dtype: int64
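The difference between dropping on every column and dropping on a subset is easier to see on a tiny example. This sketch uses an invented four-row table:

```python
import numpy as np
import pandas as pd

# Toy data (invented): gaps in two different columns
toy = pd.DataFrame({
    "car": [1.0, np.nan, 2.0, 0.0],
    "building_area": [np.nan, 90.0, 120.0, np.nan],
    "price": [500, 600, 700, 800],
})

# Listwise: drop a row if any column is missing
listwise = toy.dropna()

# Subset: drop a row only if 'car' is missing; other gaps are kept
subset_only = toy.dropna(subset=["car"])

print(len(listwise), len(subset_only))
print(subset_only.isna().sum())
```

Dropping on a subset keeps more rows, but the remaining columns can still contain missing values that need to be handled separately.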

### **Impute Missing Values**

There are many options we could consider when replacing a missing value, for example:
- A constant value that has meaning within the domain, such as 0, distinct from all other values.
- A value from another randomly selected record.
- A mean, median, or mode value for the column.
- A value estimated by another predictive model.
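The first two options in the list above are not demonstrated later in the notebook, so here is a hedged sketch on an invented `car` column. The sentinel value -1 and the random seed are assumptions made for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
car = pd.Series([1.0, np.nan, 2.0, np.nan, 0.0], name="car")

# Option 1: a constant that is distinct from all real values
# (0 is a legitimate number of car spaces here, so -1 is used instead)
constant_filled = car.fillna(-1)

# Option 2: copy a value from another, randomly selected record
observed = car.dropna()
donors = rng.choice(observed.to_numpy(), size=car.isna().sum())
random_filled = car.copy()
random_filled[car.isna()] = donors

print(constant_filled.tolist())
print(random_filled.tolist())
```

Random donor imputation preserves the spread of the observed values, whereas a constant makes the imputed rows easy to identify later.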

**Impute with Central Tendency**

In [None]:
df4 = df.copy()

In [None]:
# show the description stats (BuildingArea)
df4['BuildingArea'].describe()

count     7130.000000
mean       151.967650
std        541.014538
min          0.000000
25%         93.000000
50%        126.000000
75%        174.000000
max      44515.000000
Name: BuildingArea, dtype: float64
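The statistics above show a maximum (44515) far above the 75th percentile (174), so the mean is pulled upward by outliers. The sketch below uses a handful of values taken from that summary to show how differently the mean and the median behave as fill values.

```python
import numpy as np
import pandas as pd

# A few values from the summary above (quartiles and the maximum), plus one gap
area = pd.Series([93.0, 126.0, 174.0, 44515.0, np.nan])

# The mean is dominated by the single extreme value
mean_value = area.mean()

# The median ignores how extreme the outlier is
median_value = area.median()

mean_fill = area.fillna(mean_value)
median_fill = area.fillna(median_value)

print(mean_value, median_value)
```

When a column is this skewed, the median is usually the safer central value to impute with.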

In [None]:
df4[['BuildingArea']].fillna(0)

Unnamed: 0,BuildingArea
0,0.0
1,79.0
2,150.0
3,0.0
4,142.0
...,...
13575,0.0
13576,133.0
13577,0.0
13578,157.0


In [None]:
# fill missing values with the mean (rounded to 2 decimals)
df4[['BuildingArea']] = df4[['BuildingArea']].fillna(round(df4['BuildingArea'].mean(), 2))

In [None]:
# check data after imputing
df4.isna().sum()

Suburb              0
Address             0
Rooms               0
Type                0
Price               0
Method              0
SellerG             0
Date                0
Distance            0
Postcode            0
Bedroom2            0
Bathroom            0
Car                62
Landsize            0
BuildingArea        0
YearBuilt        5375
CouncilArea      1369
Lattitude           0
Longtitude          0
Regionname          0
Propertycount       0
dtype: int64

In [None]:
df4[['Car', 'YearBuilt']].describe()

Unnamed: 0,Car,YearBuilt
count,13518.0,8205.0
mean,1.610075,1964.684217
std,0.962634,37.273762
min,0.0,1196.0
25%,1.0,1940.0
50%,2.0,1970.0
75%,2.0,1999.0
max,10.0,2018.0


In [None]:
# fill every remaining missing numeric value with its column mean
df4 = df4.fillna(df4.mean(numeric_only=True))


In [None]:
df4.isna().sum()

Suburb              0
Address             0
Rooms               0
Type                0
Price               0
Method              0
SellerG             0
Date                0
Distance            0
Postcode            0
Bedroom2            0
Bathroom            0
Car                 0
Landsize            0
BuildingArea        0
YearBuilt           0
CouncilArea      1369
Lattitude           0
Longtitude          0
Regionname          0
Propertycount       0
dtype: int64

In [None]:
# `CouncilArea` still has nulls because it is categorical (string) data
# we can fill this one with the mode
df4['CouncilArea'].describe()

count        12211
unique          33
top       Moreland
freq          1163
Name: CouncilArea, dtype: object

In [None]:
# fill with the mode ('Moreland')
df4['CouncilArea'] = df4['CouncilArea'].fillna('Moreland')

In [None]:
# show data after fill / imputing
df4.isna().sum()

Suburb           0
Address          0
Rooms            0
Type             0
Price            0
Method           0
SellerG          0
Date             0
Distance         0
Postcode         0
Bedroom2         0
Bathroom         0
Car              0
Landsize         0
BuildingArea     0
YearBuilt        0
CouncilArea      0
Lattitude        0
Longtitude       0
Regionname       0
Propertycount    0
dtype: int64
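Instead of reading the most frequent value from `describe()` and typing it by hand, the mode can be computed directly. A minimal sketch on an invented column:

```python
import pandas as pd

# Toy categorical column (invented values) with two gaps
council = pd.Series(
    ["Moreland", "Yarra", None, "Moreland", None],
    name="CouncilArea",
)

# mode() ignores missing values and returns a Series; take the first entry
most_frequent = council.mode()[0]

# Fill the gaps with the most frequent category
filled = council.fillna(most_frequent)

print(most_frequent)
print(filled.tolist())
```

This keeps the code correct if the data changes and a different category becomes the most frequent one.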

