## Machine Learning Process

1. Data Pre-processing
    * Import data
    * Clean data
    * Split into training & test sets
        * Usually 80% train, 20% test
    

2. Modelling
    * Build model
    * Train model
    * Make predictions
    
    
3. Evaluation
    * Calculate performance metrics
    * Make verdict

# Data Preprocessing

## Feature Scaling

Feature scaling is always applied per column (feature), **NEVER** across a row.

2 main types:
1. Normalization
    * Subtract the **minimum** of a column from **every** value in that column
    * Then, divide by the **difference between maximum & minimum**
    * End up with values between **0 & 1**
    
    
2. Standardization
    * Similar, but subtract the **average** of the column
    * Then, divide by the **standard deviation**
    * End up with values mostly between **-3 & 3** (not a hard bound: outliers can fall outside)
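
The two formulas above can be sketched with plain NumPy. The column values here are hypothetical and only serve as an illustration.

```python
import numpy as np

# Hypothetical salary-like column, used only for illustration
col = np.array([48000.0, 54000.0, 61000.0, 72000.0, 83000.0])

# Normalization (min-max): (x - min) / (max - min)
normalized = (col - col.min()) / (col.max() - col.min())

# Standardization (z-score): (x - mean) / standard deviation
standardized = (col - col.mean()) / col.std()

print(normalized)    # smallest value maps to 0, largest to 1
print(standardized)  # mean 0, standard deviation 1
```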

<img src='resources/preprocessing/feature_scaling.png' width='50%' />

Because a difference of 2000 looks far larger than a difference of 3, we might group purple & red together. **But we do not want this.**

**So, we need to scale the features**

The two differences come from two different things (2 columns, in different units), so they cannot be compared directly. It is **important to scale the features.**
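
A minimal sketch of this effect, assuming hypothetical (salary, age) rows that are not taken from the figure: before scaling, the nearest neighbour of the first row is decided almost entirely by salary, and after min-max normalization both columns contribute.

```python
import numpy as np

# Hypothetical (salary, age) rows, not taken from the figure
points = np.array([
    [50000.0, 25.0],  # a
    [52000.0, 60.0],  # b: close to a in salary, far in age
    [54000.0, 27.0],  # c: close to a in both columns
    [80000.0, 40.0],  # d
])

def distances_from_first(data):
    """Euclidean distance from row 0 to every other row."""
    return np.linalg.norm(data[1:] - data[0], axis=1)

raw = distances_from_first(points)

# Min-max normalization, column by column
col_min = points.min(axis=0)
col_max = points.max(axis=0)
scaled = distances_from_first((points - col_min) / (col_max - col_min))

print(raw)     # b looks nearest to a, because salary differences dominate
print(scaled)  # c is nearest to a once both columns are on the same scale
```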

<img src='resources/preprocessing/feature_scaling_example.png' width='50%' />

After normalization,

<img src='resources/preprocessing/normalization_example.png' width='50%' />

<hr/>

In [29]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### Importing data

Dependent variable = the variable that you want to predict

In [30]:
df = pd.read_csv('data/1.preprocessing/data.csv')
df

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


`loc` selects values by **index label**

`iloc` selects values by **integer position**, regardless of the index labels in the data set
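
The difference only shows when the index labels are not 0, 1, 2, ... A small sketch with a hypothetical frame (`demo` is not part of this notebook's data):

```python
import pandas as pd

# Hypothetical frame whose index labels differ from the row positions
demo = pd.DataFrame({'Age': [44, 27, 30]}, index=[10, 20, 30])

by_label = demo.loc[10, 'Age']   # row whose index LABEL is 10
by_position = demo.iloc[0, 0]    # row at POSITION 0, column at position 0
last_row = demo.iloc[-1, 0]      # negative positions count from the end

print(by_label, by_position, last_row)
```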

In [40]:
X = df.drop('Purchased', axis=1).values
y = df['Purchased'].values

# iloc - takes value of specified rows, then specified columns
# X = df.iloc[:, :-1].values # -> [rows, columns]
# y = df.iloc[:, -1].values

In [41]:
X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, nan],
       ['France', 35.0, 58000.0],
       ['Spain', nan, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

In [42]:
y

array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
      dtype=object)

### Dealing with missing data

1. Remove the **rows** that contain missing values, **IF** the data set is **large** (so that little information is lost).
2. Replace the missing values with the average/median/most frequent value of the column

The average is the classic choice, but the median is better when the column has **LARGE** outliers, as it is less affected by them.
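
A quick illustration of why the median is more robust, using hypothetical salary values and one deliberately extreme outlier:

```python
import numpy as np

# Hypothetical salaries, then the same values plus one extreme outlier
salaries = np.array([48000.0, 52000.0, 54000.0, 58000.0, 61000.0])
with_outlier = np.append(salaries, 1_000_000.0)

print(np.mean(salaries), np.median(salaries))          # 54600.0 54000.0
print(np.mean(with_outlier), np.median(with_outlier))  # the mean jumps, the median barely moves
```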

In [43]:
df

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


In [44]:
X[:]

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, nan],
       ['France', 35.0, 58000.0],
       ['Spain', nan, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

In [45]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

imputer.fit(X[:, 1:3]) # Learn the mean of each numerical column (Age, Salary); the text column is left out
X[:, 1:3] = imputer.transform(X[:, 1:3]) # Does the action: replace the missing values with those means

In [47]:
X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 63777.77777777778],
       ['France', 35.0, 58000.0],
       ['Spain', 38.77777777777778, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)
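
To see what `fit` actually learned, the fitted imputer exposes the per-column fill values in `statistics_`. A sketch that repeats the Age and Salary columns shown above in a separate array (`numeric` is a name introduced here for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and Salary columns from the table above, missing values as np.nan
numeric = np.array([
    [44.0, 72000.0], [27.0, 48000.0], [30.0, 54000.0], [38.0, 61000.0],
    [40.0, np.nan],  [35.0, 58000.0], [np.nan, 52000.0], [48.0, 79000.0],
    [50.0, 83000.0], [37.0, 67000.0],
])

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(numeric)

print(imputer.statistics_)            # column means, computed while ignoring the NaNs
filled = imputer.transform(numeric)   # NaNs replaced by those means
```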

### Encoding Categorical Data

As countries have no natural order, we cannot simply encode them as 0, 1, 2: a model could read those numbers as a ranking.
   * To avoid this, we can use `one hot encoding`.
        * Turn the country column into 3 columns (one per country, as there are 3 countries in total).
        * E.g. France -> vector (1,0,0), Germany -> (0,1,0) etc.

But for a binary outcome (y), 0 & 1 is OK.
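
The idea can be sketched by hand with NumPy, independent of scikit-learn. The `countries` array below is a short illustrative sample, not the full column.

```python
import numpy as np

countries = np.array(['France', 'Spain', 'Germany', 'Spain'])

# np.unique sorts the categories and returns each value's category index
categories, codes = np.unique(countries, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for category i
one_hot = np.eye(len(categories))[codes]

print(categories)  # column order of the one-hot matrix
print(one_hot)
```

Each row contains exactly one 1, so no country is numerically "larger" than another.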

**Encode Features Categorical Data**

In [48]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# ColumnTransformer: transformers=[(name, encoder instance, column indices of X)],
# remainder = what to do with the columns that are not encoded
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')

X = np.array(ct.fit_transform(X)) # Fit & transform in one step, then force the result to a NumPy array

In [51]:
X

array([[1.0, 0.0, 0.0, 44.0, 72000.0],
       [0.0, 0.0, 1.0, 27.0, 48000.0],
       [0.0, 1.0, 0.0, 30.0, 54000.0],
       [0.0, 0.0, 1.0, 38.0, 61000.0],
       [0.0, 1.0, 0.0, 40.0, 63777.77777777778],
       [1.0, 0.0, 0.0, 35.0, 58000.0],
       [0.0, 0.0, 1.0, 38.77777777777778, 52000.0],
       [1.0, 0.0, 0.0, 48.0, 79000.0],
       [0.0, 1.0, 0.0, 50.0, 83000.0],
       [1.0, 0.0, 0.0, 37.0, 67000.0]], dtype=object)

**Encode Labels Categorical Data**

In [56]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = le.fit_transform(y) # Dependent variable vector does not have to be numpy array

In [57]:
y

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1], dtype=int64)
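
To check which label became which number, the fitted encoder keeps the mapping in `classes_` and can reverse it with `inverse_transform`. A short sketch on a shortened label list:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
encoded = le.fit_transform(['No', 'Yes', 'No', 'No', 'Yes'])

print(le.classes_)                       # position in this array = encoded value
decoded = le.inverse_transform(encoded)  # back to the original strings
print(decoded)
```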

### Splitting dataset - Training & Test Set

Although this is much debated, `Feature Scaling` comes **after** splitting the dataset.

`Feature Scaling` makes sure all features are on the same scale, and prevents one feature from dominating the others.
   * The `Test Set` is supposed to be a **Brand New Set**; treat it as something we are **not supposed to work with yet**.
       * Otherwise, the scaling includes the test set: information leakage.
       * In fact, we need to transform the test set to the same scale using the mean & s.d. of the training set.

In [81]:
from sklearn.model_selection import train_test_split

np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) # Or pass random_state=1 as an argument instead of seeding NumPy

In [82]:
print(X_train)

[[1.0 0.0 0.0 35.0 58000.0]
 [1.0 0.0 0.0 44.0 72000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]]


In [83]:
print(X_test)

[[0.0 1.0 0.0 50.0 83000.0]
 [0.0 0.0 1.0 27.0 48000.0]]


In [84]:
print(y_train)

[1 0 1 0 1 1 0 0]


In [85]:
print(y_test)

[0 1]
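
A small sketch of the `random_state` alternative on hypothetical arrays (`X_demo` and `y_demo` are placeholders): the same seed gives the same split every time, and `test_size=0.2` on 10 rows leaves 8 for training and 2 for testing.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows, 2 features
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)  # 8 training rows, 2 test rows
```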


### Feature Scaling

Scaling prevents some features from dominating the others, to the point where the dominated features are not even considered by ML models.

We **DO NOT** need to apply feature scaling for all ML models, but only **some**.

<img src='resources/preprocessing/feature_scaling.png' width='50%' />

**Standardization**
   * Subtract the mean of the feature from each value of the feature, then divide by the standard deviation

**Normalization**
   * Subtract the minimum of the feature from each value, then divide by (difference between max & min of the feature)
   
**Which to apply?**
   * `Normalization` is recommended when most of the features follow a `normal distribution`. (specific situation)
   * `Standardization` works well `in most situations`. (go for standardization as the default, as it generally works well in the training process)
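
For comparison, normalization can be done with scikit-learn's `MinMaxScaler`. A sketch using a few Age and Salary rows like those in this notebook (`train` and `test` are names introduced here): the scaler is fitted on the training rows only, so test values may land outside 0 & 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# A few (Age, Salary) rows, used here only for illustration
train = np.array([[35.0, 58000.0], [44.0, 72000.0], [48.0, 79000.0], [30.0, 54000.0]])
test = np.array([[50.0, 83000.0]])

mm = MinMaxScaler()
train_scaled = mm.fit_transform(train)  # min & max come from the training rows
test_scaled = mm.transform(test)        # same min & max reused for the test row

print(train_scaled)
print(test_scaled)  # above 1, because the test row exceeds the training maximum
```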
   
**Do we have to apply feature scaling to dummy variables?**
   * No. The goal of feature scaling is to have the features in the same range, and dummy variables are already on the same scale.
   * We lose interpretability (which category a row belongs to) if we scale dummy variables, e.g. the matrix is already between 0 & 1.
   * **Only apply** feature scaling to numerical values.
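
A quick check of what standardization would do to a dummy column, taking a 0/1 column of eight rows as the example:

```python
import numpy as np

# A dummy (0/1) column, e.g. "is France" for eight rows
dummy = np.array([1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0])

standardized = (dummy - dummy.mean()) / dummy.std()
print(standardized)  # no longer 0/1, so the column stops reading as "is France"
```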

**For the test set, we ONLY transform, as we need to `scale` the test set with the training set's parameters, i.e. to the same scale**

**BUT, fit the scaler using only the training set, so that the test set is treated as a brand new set and `does not contribute` to the feature scaling formula**
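
What "fit on the training set, transform both" means can be written out by hand. This sketch repeats the Age and Salary columns of `X_train` and `X_test` as printed earlier in this notebook, in separate arrays (`train` and `test` are names introduced here), and assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np

# Age and Salary columns of X_train and X_test, as printed earlier in this notebook
train = np.array([
    [35.0, 58000.0], [44.0, 72000.0], [48.0, 79000.0], [30.0, 54000.0],
    [37.0, 67000.0], [40.0, 63777.77777777778], [38.0, 61000.0],
    [38.77777777777778, 52000.0],
])
test = np.array([[50.0, 83000.0], [27.0, 48000.0]])

# "fit": the statistics come from the training set only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

# "transform": the same formula and the same statistics for both sets
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled[0])
print(test_scaled)
```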

In [86]:
# Fit will just get the mean & s.d. of each of the features.
# Transform will apply the formula, and actually manipulate the values.

from sklearn.preprocessing import StandardScaler # Standardization

sc = StandardScaler()

X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])

# Apply the SAME scaler (fitted on the training set) to the test set
X_test[:, 3:] = sc.transform(X_test[:, 3:])

In [87]:
print(X_train)

[[1.0 0.0 0.0 -0.7529426005471072 -0.6260377781240918]
 [1.0 0.0 0.0 1.008453807952985 1.0130429500553495]
 [1.0 0.0 0.0 1.7912966561752484 1.8325833141450703]
 [0.0 1.0 0.0 -1.7314961608249362 -1.0943465576039322]
 [1.0 0.0 0.0 -0.3615211764359756 0.42765697570554906]
 [0.0 1.0 0.0 0.22561095973072184 0.05040823668012247]
 [0.0 0.0 1.0 -0.16581046438040975 -0.27480619351421154]
 [0.0 0.0 1.0 -0.013591021670525094 -1.3285009473438525]]


In [88]:
print(X_test)

[[0.0 1.0 0.0 2.1827180802863797 2.3008920936249107]
 [0.0 0.0 1.0 -2.3186282969916334 -1.7968097268236927]]
