<h1> Preprocessing </h1>
Preprocessing is the set of steps that make the data more suitable for the next stage (machine learning or deep learning). <br>

Data preprocessing can involve several steps, depending on the dataset:<br>
<ul>
    <li> Handling missing values</li>
    <li> Handling categorical data</li>
    <li> Normalization </li>
    <li> Handling imbalanced datasets </li>
    <li> Splitting the data into training and test sets </li>
</ul>


<h2> Handling Missing Values </h2>
How can we deal with missing values?<br>
<ul>
    <li>Delete rows or columns with missing values</li>
    <li>Replace missing values with an arbitrary value</li>
    <li>Interpolation</li>
</ul>

<h3>Delete rows or columns with missing values</h3>
We can drop rows or columns that contain missing values.<br>
<ul>
    <li>Drop rows or columns in which all values are missing</li>
    <li>Drop rows or columns that contain at least one missing value</li>
    <li>Drop rows or columns in which the number of missing values exceeds a threshold</li>
</ul>
<img src="./images/table.png"></img>
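The three dropping strategies can be sketched with pandas `dropna`. This is a minimal illustration on a hypothetical toy DataFrame (the name `toy` and its values are made up for this example). Note that the `thresh` parameter counts the minimum number of non-missing values a row must have to be kept, which is the complement of a threshold on missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 2 is entirely missing and column "z" is entirely missing.
toy = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 4.0],
    "y": [np.nan, 5.0, np.nan, 7.0],
    "z": [np.nan, np.nan, np.nan, np.nan],
})

# Drop rows (axis=0) or columns (axis=1) in which every value is missing.
rows_all = toy.dropna(axis=0, how="all")
cols_all = toy.dropna(axis=1, how="all")

# Drop rows that contain at least one missing value
# (applied after removing the all-missing column, otherwise every row would go).
rows_any = cols_all.dropna(axis=0, how="any")

# Keep only rows that have at least 2 non-missing values.
rows_thresh = toy.dropna(axis=0, thresh=2)

print(rows_all.index.tolist())     # rows kept by how="all"
print(cols_all.columns.tolist())   # columns kept by how="all"
print(rows_any.index.tolist())     # rows kept by how="any"
print(rows_thresh.index.tolist())  # rows kept by thresh=2
```

Dropping is the simplest option, but it discards information, so it is usually worth checking how many rows or columns remain afterwards.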

<h3>Replacing with Arbitrary Values</h3>
<ul>
    <li>Replace with the previous value (forward fill)</li>
    <li>Replace with the next value (backward fill)</li>
    <li>Fill with a constant value</li>
    <li>Fill with an aggregate value (mean, median)</li>
</ul>
<img src="./images/table2.png"></img>
<img src="./images/table3.png"></img>
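The four filling strategies listed above can be sketched as follows. This is a minimal example on a hypothetical Series `s` whose values are chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical example values with two gaps.
s = pd.Series([1.0, np.nan, 3.0, np.nan, 8.0])

forward = s.ffill()                    # copy the previous known value downwards
backward = s.bfill()                   # copy the next known value upwards
constant = s.fillna(0)                 # fill with a fixed constant
mean_filled = s.fillna(s.mean())       # mean of the known values 1, 3, 8
median_filled = s.fillna(s.median())   # median of the known values 1, 3, 8

print(forward.tolist())
print(backward.tolist())
print(constant.tolist())
print(mean_filled.tolist())
print(median_filled.tolist())
```

The median is often preferred over the mean when a column contains outliers, since a single extreme value (such as 57 in a column of small numbers) pulls the mean much more than the median.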


<h3> Interpolation</h3>
Interpolation estimates unknown points that lie between known points.<br>
It is applicable when the data form an ordered sequence or series, such as a time series.<br>
Types of interpolation:<br>
<ul>
    <li>Linear</li>
    <li>Polynomial</li>
</ul>
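To make the linear case concrete, here is a small worked example. It assumes two known points, the value 7 at position 2 and the value 2 at position 4, and estimates the missing value at position 3 with the straight-line formula y = y0 + (x - x0) * (y1 - y0) / (x1 - x0). The variable names are illustrative; `np.interp` is used only as a cross-check.

```python
import numpy as np

# Known neighbors of the gap: (position, value)
x0, y0 = 2, 7.0
x1, y1 = 4, 2.0
x = 3  # position of the missing value

# Straight-line formula written out by hand.
manual = y0 + (x - x0) * (y1 - y0) / (x1 - x0)

# The same estimate using NumPy's 1-D linear interpolation.
via_numpy = np.interp(x, [x0, x1], [y0, y1])

print(manual, via_numpy)
```

Because the missing position sits exactly halfway between its neighbors, the estimate is simply their average.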

In [1]:
import pandas as pd

df = pd.DataFrame({"A":[12, 4, 7, None, 2],
                   "B":[None, 3, 57, 3, None],
                   "C":[20, 16, None, 3, 8],
                   "D":[14, 3, None, None, 6]})

In [2]:
df

Unnamed: 0,A,B,C,D
0,12.0,,20.0,14.0
1,4.0,3.0,16.0,3.0
2,7.0,57.0,,
3,,3.0,3.0,
4,2.0,,8.0,6.0
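Before filling anything, it helps to count how much is missing. A minimal sketch using the same example DataFrame follows; `has_missing` and `missing_per_column` are names introduced here for illustration.

```python
import pandas as pd

# Same example data as above.
df = pd.DataFrame({"A": [12, 4, 7, None, 2],
                   "B": [None, 3, 57, 3, None],
                   "C": [20, 16, None, 3, 8],
                   "D": [14, 3, None, None, 6]})

# True if at least one cell anywhere in the DataFrame is missing.
has_missing = df.isna().any().any()

# Number of missing values in each column.
missing_per_column = df.isna().sum()

print(has_missing)
print(missing_per_column.to_dict())
```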


In [4]:
# Linear interpolation: estimate each missing value on the straight line between its known neighbors
df_linear_fwd = df.interpolate(method='linear', limit_direction='forward')
df_linear_fwd

Unnamed: 0,A,B,C,D
0,12.0,,20.0,14.0
1,4.0,3.0,16.0,3.0
2,7.0,57.0,9.5,4.0
3,4.5,3.0,3.0,5.0
4,2.0,3.0,8.0,6.0


In [6]:
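# limit_direction controls the direction in which consecutive NaNs are filled:
# 'forward' fills NaNs that come after a known value (leading NaNs stay missing)
# 'backward' fills NaNs that come before a known value (trailing NaNs stay missing)

df_linear_bwd = df.interpolate(method='linear', limit_direction='backward')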
df_linear_bwd

Unnamed: 0,A,B,C,D
0,12.0,3.0,20.0,14.0
1,4.0,3.0,16.0,3.0
2,7.0,57.0,9.5,4.0
3,4.5,3.0,3.0,5.0
4,2.0,,8.0,6.0


In [8]:
# Interpolation using padding (forward fill):
# each NaN is replaced by the last known value above it in the same column.
# A NaN in the first row cannot be filled because there is no previous value.
# limit sets the maximum number of consecutive NaN values to fill.
# Note: newer pandas releases deprecate method='pad' here; df.ffill(limit=2) gives the same result.

df_linear_pad = df.interpolate(method='pad', limit=2)
df_linear_pad

Unnamed: 0,A,B,C,D
0,12.0,,20.0,14.0
1,4.0,3.0,16.0,3.0
2,7.0,57.0,16.0,3.0
3,7.0,3.0,3.0,3.0
4,2.0,3.0,8.0,6.0


In [11]:
# Polynomial interpolation of order 2 (requires SciPy to be installed)
df_poly = df.interpolate(method='polynomial', order=2)
df_poly

Unnamed: 0,A,B,C,D
0,12.0,,20.0,14.0
1,4.0,3.0,16.0,3.0
2,7.0,57.0,9.5,4.0
3,4.5,3.0,3.0,5.0
4,2.0,,8.0,6.0


In [None]:
# Exercise: missing values
# 1. Load the data
# 2. Check whether there are missing values
# 3. Print the total number of missing values for each column
# 4. Handle the missing values using several of the methods mentioned above
# 5. Compare your results and analyze them


# For the comparison, you can use the code below
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Function for comparing different approaches
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)  # MAE ranges from 0 to infinity; lower is better
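One possible way to use `score_dataset` for step 5 is sketched below. Everything except the function itself is an assumption made for illustration: the data are synthetic, the feature names `f1`, `f2`, `f3` are hypothetical, and only two approaches (dropping columns and mean imputation) are compared. With a real dataset, replace the synthetic part with your own loading code.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    return mean_absolute_error(y_valid, model.predict(X_valid))

# Synthetic data, for illustration only: the target depends mostly on f1.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["f1", "f2", "f3"])
y = 2 * X["f1"] - X["f2"] + rng.normal(scale=0.1, size=200)
X.loc[rng.random(200) < 0.2, "f1"] = np.nan  # remove roughly 20% of f1

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Approach 1: drop every column that has missing values in the training set.
cols_with_missing = [c for c in X_train.columns if X_train[c].isna().any()]
mae_drop = score_dataset(X_train.drop(columns=cols_with_missing),
                         X_valid.drop(columns=cols_with_missing),
                         y_train, y_valid)

# Approach 2: fill missing values with the training-set mean of each column.
train_means = X_train.mean()
mae_mean = score_dataset(X_train.fillna(train_means),
                         X_valid.fillna(train_means),
                         y_train, y_valid)

print("MAE (drop columns):", mae_drop)
print("MAE (mean imputation):", mae_mean)
```

The fill values are computed from the training set only and then reused for the validation set, so that no information from the validation data leaks into the preprocessing.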