# Data Preprocessing

In [1]:
# Importing Common Libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [2]:
# Importing Dataset
dataset = pd.read_csv('./Data.csv')
dataset.head(10)

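| | Country | Age | Salary | Purchased |
| --- | --- | --- | --- | --- |
| 0 | France | 44.0 | 72000.0 | No |
| 1 | Spain | 27.0 | 48000.0 | Yes |
| 2 | Germany | 30.0 | 54000.0 | No |
| 3 | Spain | 38.0 | 61000.0 | No |
| 4 | Germany | 40.0 | NaN | Yes |
| 5 | France | 35.0 | 58000.0 | Yes |
| 6 | Spain | NaN | 52000.0 | No |
| 7 | France | 48.0 | 79000.0 | Yes |
| 8 | Germany | 50.0 | 83000.0 | No |
| 9 | France | 37.0 | 67000.0 | Yes |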


In the table above, the last column, 'Purchased', is the dependent variable, and the other columns are the independent variables that lead to that observation. We therefore store the 'Purchased' column in 'y' and the remaining columns in 'X'.

In [3]:
# Preparing the Dataset
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values  # last column ('Purchased'), whatever its position

In [4]:
# Display the 'X' and 'y' matrices
print('X = \n', X)
print('y = \n', y)

X = 
 [['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
y = 
 ['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


## Missing Data

Missing values can cause errors during training, since many estimators cannot handle `NaN` inputs. There are several ways to deal with them:
- Delete the observations (rows) that contain a missing value
- Replace the missing value with the mean of its column
- Replace the missing value with the median of its column

Here we'll replace the missing values in the 'Age' and 'Salary' columns with the column mean. Note that the code specifies `X[:, 1:3]` rather than `X[:, 1:2]`, because the upper bound of a slice is excluded.

In [5]:
# Import and Setup Simple Imputer
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')

In [6]:
# Fit and Transform
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])

In [7]:
# View the transformed 'X' Matrix
print('X = \n', X)

X = 
 [['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
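
The notebook only uses the mean strategy. As a rough sketch of the other two options listed above (dropping rows and median imputation), the snippet below rebuilds the same ten rows inline with pandas. The names `df`, `dropped` and `median_filled` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# The same ten rows as the dataset above, rebuilt inline so the sketch is self-contained.
df = pd.DataFrame({
    'Country': ['France', 'Spain', 'Germany', 'Spain', 'Germany',
                'France', 'Spain', 'France', 'Germany', 'France'],
    'Age': [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    'Salary': [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
})

# Option 1: drop every row that contains at least one missing value.
dropped = df.dropna()

# Option 2: fill each numeric column with its own median.
numeric_cols = ['Age', 'Salary']
median_filled = df.copy()
median_filled[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

print(len(df), '->', len(dropped), 'rows after dropping')
print(median_filled.loc[[4, 6]])
```

Dropping rows is the simplest option but discards two of the ten observations here. The median is less sensitive to outliers than the mean, which can matter for skewed columns such as salaries.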


## Categorical Data

To handle categorical data, we need to encode it as numbers. One option is to assign an integer to each category; for the 'Country' column in our dataset we could use $France = 1$, $Germany = 2$ and $Spain = 3$. However, integers imply an order and a magnitude (Spain > Germany > France) that the countries do not have, and a model may pick up on that spurious ordering. Instead, we can give each country its own column and use boolean values to mark which country a row belongs to.

#### Before:
| Country | Age | Salary |
| --- | --- | --- |
| France | 44 | 72000 |
| Spain | 27 | 48000 |
| Germany | 30 | 54000 |


#### Wrong Approach
Using one integer per country, that is $France = 1$, $Germany = 2$ and $Spain = 3$

| Country | Age | Salary |
| --- | --- | --- |
| 1 | 44 | 72000 |
| 3 | 27 | 48000 |
| 2 | 30 | 54000 |


#### Correct Approach
Use a separate column for each country, with a boolean value marking which country the row belongs to (one-hot encoding)

| France | Spain | Germany | Age | Salary |
| --- | --- | --- | --- | --- |
| 1 | 0 | 0 | 44 | 72000 |
| 0 | 1 | 0 | 27 | 48000 |
| 0 | 0 | 1 | 30 | 54000 |

In [8]:
# Import Libraries
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

In [9]:
# Encoding Independent Variable
# One-hot encode column 0 ('Country') and pass the remaining columns through unchanged.
# (A LabelEncoder step is not needed here: OneHotEncoder handles string categories directly.)
columnTransformer = ColumnTransformer([('encoder', OneHotEncoder(), [0])], remainder = 'passthrough')
X = np.array(columnTransformer.fit_transform(X))

In [10]:
# View the transformed 'X' Matrix
print('X = \n', X)

X = 
 [[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
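
In the output above, the three new columns are ordered France, Germany, Spain, because the categories are sorted before encoding. To make the mechanics concrete, here is a minimal NumPy-only sketch of one-hot encoding. It illustrates the idea rather than how `OneHotEncoder` is implemented, and the `countries` array is a made-up sample.

```python
import numpy as np

# A small illustrative sample of the 'Country' column.
countries = np.array(['France', 'Spain', 'Germany', 'Spain'])

# np.unique returns the sorted categories and, for each row,
# the index of that row's category within the sorted list.
categories, codes = np.unique(countries, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for category i.
one_hot = np.eye(len(categories))[codes]

print(categories)
print(one_hot)
```

Each row contains exactly one `1.0`, in the column of its country, so no artificial ordering between countries is introduced.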


For the dependent variable, we can directly transform the labels into boolean values, that is, `Yes = 1` and `No = 0`.

In [11]:
# Encoding Dependent Variable
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)

In [12]:
# View the transformed 'y' Matrix
print('y = ', y)

y =  [0 1 0 0 1 1 0 1 0 1]
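
To check which label was mapped to which integer, the fitted encoder can be inspected. The sketch below uses a short made-up `labels` array and assumes only the standard `LabelEncoder` attributes `classes_` and `inverse_transform`.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# A short illustrative sample of the 'Purchased' column.
labels = np.array(['No', 'Yes', 'No', 'No', 'Yes'])

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ holds the labels in sorted order; a label's position is its integer code.
print(encoder.classes_)

# inverse_transform maps the integers back to the original labels.
decoded = encoder.inverse_transform(encoded)
print(encoded, decoded)
```

Because the classes are sorted, 'No' becomes 0 and 'Yes' becomes 1, which matches the output above.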


## Splitting Data

In [13]:
# Splitting the Dataset
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [14]:
# View the Sets
print('X_train = \n', X_train)
print('X_test = \n', X_test)
print('y_train = ', y_train)
print('y_test = ', y_test)

X_train = 
 [[0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 37.0 67000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [1.0 0.0 0.0 44.0 72000.0]
 [1.0 0.0 0.0 35.0 58000.0]]
X_test = 
 [[0.0 1.0 0.0 30.0 54000.0]
 [0.0 1.0 0.0 50.0 83000.0]]
y_train =  [1 1 1 0 1 0 0 1]
y_test =  [0 0]


## Feature Scaling

Feature scaling must be applied after splitting the data, because the test set is supposed to stand in for brand-new data that the model only sees after training. It should play no part in training. If we scale before splitting, the test observations contribute to the scaling statistics (such as the mean and standard deviation), so information about the test set leaks into training.

Many models already take care of this automatically, so we only need to scale manually in a few cases.
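
The effect of scaling before splitting can be seen with a small NumPy sketch. The `ages` values are hypothetical, and the last two entries play the role of the test set.

```python
import numpy as np

# Hypothetical 'Age' values; the last two rows act as the test set.
ages = np.array([40.0, 37.0, 27.0, 39.0, 48.0, 38.0, 44.0, 35.0, 30.0, 50.0])
train, test = ages[:8], ages[8:]

# Correct: the statistics come from the training rows only,
# and the same statistics are reused to scale the test rows.
train_mean, train_std = train.mean(), train.std()
train_scaled = (train - train_mean) / train_std
test_scaled = (test - train_mean) / train_std

# Leaky: statistics computed on all rows, so the test rows shift the mean and spread.
full_mean, full_std = ages.mean(), ages.std()

print('train-only mean/std:', train_mean, train_std)
print('all-rows mean/std:  ', full_mean, full_std)
```

This is the same pattern as calling `fit_transform` on the training set and only `transform` on the test set.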

There are multiple ways to scale, such as:
- Standardization
$$
x_{\text{standardized}} = \frac{x - \text{mean}(x)}{\text{std}(x)}
$$
- Normalization
$$
x_{\text{normalized}} = \frac{x - \text{min}(x)}{\text{max}(x) - \text{min}(x)}
$$
$$
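
As a quick sanity check of the two formulas, the sketch below applies them by hand to a small made-up column `x` and compares the results with scikit-learn's `StandardScaler` and `MinMaxScaler`. It assumes the population standard deviation, which is NumPy's default (`ddof=0`) and also what `StandardScaler` uses.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# A single illustrative feature column (scalers expect 2-D input).
x = np.array([[27.0], [35.0], [44.0], [48.0]])

# Standardization: zero mean, unit (population) standard deviation.
standardized = (x - x.mean(axis=0)) / x.std(axis=0)

# Normalization: rescale into the range [0, 1].
normalized = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

print(np.allclose(standardized, StandardScaler().fit_transform(x)))
print(np.allclose(normalized, MinMaxScaler().fit_transform(x)))
```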

In [15]:
# Import Library
from sklearn.preprocessing import StandardScaler

In [16]:
# Scaling Independent Variable
sc_X = StandardScaler()
X_train[:, 3:] = sc_X.fit_transform(X_train[:, 3:])
X_test[:, 3:] = sc_X.transform(X_test[:, 3:])

In [17]:
# View the Sets
print('X_train = \n', X_train)
print('X_test = \n', X_test)

X_train = 
 [[0.0 1.0 0.0 0.2630675731713538 0.1238147854838185]
 [1.0 0.0 0.0 -0.25350147960148617 0.4617563176278856]
 [0.0 0.0 1.0 -1.9753983221776195 -1.5309334063940294]
 [0.0 0.0 1.0 0.05261351463427101 -1.1114197802841526]
 [1.0 0.0 0.0 1.6405850472322605 1.7202971959575162]
 [0.0 0.0 1.0 -0.08131179534387283 -0.16751412153692966]
 [1.0 0.0 0.0 0.9518263102018072 0.9861483502652316]
 [1.0 0.0 0.0 -0.5978808481167128 -0.48214934111933727]]
X_test = 
 [[0.0 1.0 0.0 -1.4588292694047795 -0.9016629672292141]
 [0.0 1.0 0.0 1.984964415747487 2.139810822067393]]
