# Part 1 - Data Preprocessing

## Importing the libraries

The first step is to import the libraries we will use to create Machine Learning models.

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the dataset

In this preprocessing tutorial we will use the file `Data.csv`.

In [3]:
dataset = pd.read_csv('Data.csv')
dataset

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes
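
Before handling the gaps, it can help to count where they are. The following is a minimal sketch, assuming the file has the four columns shown above. The CSV text is inlined through `io.StringIO` (and the names `csv_text` and `sample` are illustrative) so that the snippet runs without `Data.csv`:

```python
import io

import pandas as pd

# Inline copy of the rows shown above (assumed layout of Data.csv)
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,40,,Yes
France,35,58000,Yes
Spain,,52000,No
France,48,79000,Yes
Germany,50,83000,No
France,37,67000,Yes
"""
sample = pd.read_csv(io.StringIO(csv_text))

# Count the missing values in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

Empty fields are read as `NaN`, which is why `Age` and `Salary` are parsed as floating point columns.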


## Missing data

Option 1: Drop all rows that contain a missing value. <br> Option 2: Replace each missing value with the mean of its column instead.

In [4]:
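# Taking care of missing data: fill numeric gaps with the column mean
# numeric_only=True skips the text columns, which have no mean
data = dataset.fillna(dataset.mean(numeric_only=True))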
data

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,63777.777778,Yes
5,France,35.0,58000.0,Yes
6,Spain,38.777778,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes
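
For comparison, the two options can be sketched side by side on a small illustrative frame. The frame `df` and its values are made up for this example, not read from `Data.csv`:

```python
import numpy as np
import pandas as pd

# Small illustrative frame with one gap in each column
df = pd.DataFrame({
    "Age": [44.0, 27.0, 40.0, np.nan],
    "Salary": [72000.0, 48000.0, np.nan, 52000.0],
})

# Option 1: drop every row that contains a missing value
dropped = df.dropna()

# Option 2: replace each missing value with the mean of its column
filled = df.fillna(df.mean(numeric_only=True))

print(len(df), len(dropped))
print(filled)
```

Dropping rows is simple but discards the valid values in those rows too, which matters on a dataset as small as this one.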


## Categorical data

We have two categorical variables: Country and Purchased. Since ML models work with numbers and not words, we have to encode these categorical variables as numbers.

In Python this can be done using a technique called label encoding. However, this would replace France with 0, Germany with 1, and Spain with 2.
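
As a rough illustration of that mapping, here is a sketch using `LabelEncoder` from `sklearn` on a short, made-up list of countries (the list `countries` is an assumption for this example):

```python
from sklearn.preprocessing import LabelEncoder

# A few country labels in the style of the dataset
countries = ["France", "Spain", "Germany", "Spain", "Germany"]

encoder = LabelEncoder()
codes = encoder.fit_transform(countries)

# Classes are stored in sorted order, and each label becomes its index
print(list(encoder.classes_))
print(list(codes))
```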

Label encoding turns the countries into numbers, which is what we need, since ML models are based on equations. However, because 1 is greater than 0 and 2 is greater than 1, the model may read the codes as an ordering: Spain above Germany, and Germany above France. This is not the case. If we had ordered categorical data such as size (small, medium, and large), then label encoding may make sense. We have to prevent the machine learning model from thinking one country is better than another. To achieve this we use one-hot encoding.

In [5]:
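# Encoding categorical data: one-hot encode the Country column
# dtype=int gives 0/1 columns (newer pandas versions default to booleans)
country_one_hot = pd.get_dummies(dataset['Country'], prefix='Country', dtype=int)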
data_one_hot = data.join(country_one_hot).drop(columns=['Country'])
data_one_hot

Unnamed: 0,Age,Salary,Purchased,Country_France,Country_Germany,Country_Spain
0,44.0,72000.0,No,1,0,0
1,27.0,48000.0,Yes,0,0,1
2,30.0,54000.0,No,0,1,0
3,38.0,61000.0,No,0,0,1
4,40.0,63777.777778,Yes,0,1,0
5,35.0,58000.0,Yes,1,0,0
6,38.777778,52000.0,No,0,0,1
7,48.0,79000.0,Yes,1,0,0
8,50.0,83000.0,No,0,1,0
9,37.0,67000.0,Yes,1,0,0
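
The same encoding could also be produced with `OneHotEncoder` from `sklearn`, which is convenient when the encoder has to be reused on new data. This is a sketch on a short, made-up array of countries, not the notebook's own method:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# OneHotEncoder expects a 2D input: one row per observation
countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])

encoder = OneHotEncoder()
# The default output is a sparse matrix, so convert it to a dense array
one_hot = encoder.fit_transform(countries).toarray()

# One output column per category, in sorted order
print(encoder.categories_[0])
print(one_hot)
```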


For the output column, in our case Purchased, there is no order between the two classes and the ML model treats it as a categorical variable, so we don't need one-hot encoding: label encoding is enough. The code below uses `LabelEncoder`; the pandas `map` function would be a quicker alternative.

In [16]:
from sklearn.preprocessing import LabelEncoder
# Select the target by name rather than by position
y = dataset['Purchased'].values
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
dataset_oh_label = data_one_hot.copy()
dataset_oh_label['Purchased'] = y
dataset_oh_label

Unnamed: 0,Age,Salary,Purchased,Country_France,Country_Germany,Country_Spain
0,44.0,72000.0,0,1,0,0
1,27.0,48000.0,1,0,0,1
2,30.0,54000.0,0,0,1,0
3,38.0,61000.0,0,0,0,1
4,40.0,63777.777778,1,0,1,0
5,35.0,58000.0,1,1,0,0
6,38.777778,52000.0,0,0,0,1
7,48.0,79000.0,1,1,0,0
8,50.0,83000.0,0,0,1,0
9,37.0,67000.0,1,1,0,0
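
For a two-class column like this, the pandas `map` function gives the same result with an explicit dictionary. A minimal sketch on a made-up series (the name `purchased` is illustrative):

```python
import pandas as pd

# A few labels in the style of the Purchased column
purchased = pd.Series(["No", "Yes", "No", "No", "Yes"])

# Map each label to the integer chosen for it
encoded = purchased.map({"No": 0, "Yes": 1})
print(encoded.tolist())
```

One design point: any label missing from the dictionary becomes `NaN`, so an unexpected value shows up as a gap instead of being silently assigned a code.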


## Splitting the dataset into the training set and test set

In [31]:
from sklearn.model_selection import train_test_split 
X = np.asarray(dataset_oh_label.drop(columns=['Purchased']))
y = np.asarray(dataset_oh_label['Purchased'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=0)
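
To see what the split produces, here is a sketch on placeholder arrays with the same shape as ours (10 rows, 5 features). The arrays `X_demo` and `y_demo` are made up for this example:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 10 observations with 5 features each
X_demo = np.arange(50).reshape(10, 5)
y_demo = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

# test_size=0.2 holds out 2 of the 10 rows for testing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(X_tr.shape, X_te.shape)

# Reusing the same random_state reproduces the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(np.array_equal(X_te, X_te2))
```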

## Feature scaling

Age and salary are two features that are not on the same scale. This can cause issues in our ML models, because many ML models are based on the *Euclidean distance* between two observations.

In this case the salary feature will dominate the age feature when it comes to computing the Euclidean distance.
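
A quick calculation makes this concrete. The sketch below takes the (age, salary) values of the first two rows of the dataset and compares the full distance with the salary difference alone:

```python
import numpy as np

# (age, salary) for the first two rows of the dataset
a = np.array([44.0, 72000.0])
b = np.array([27.0, 48000.0])

# Euclidean distance using both features
full_distance = np.linalg.norm(a - b)

# Contribution of each feature on its own
age_gap = abs(a[0] - b[0])       # 17 years
salary_gap = abs(a[1] - b[1])    # 24000

print(full_distance, salary_gap, age_gap)
```

The full distance is almost identical to the salary gap, so the age difference has practically no influence on it.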

There are two common ways of performing feature scaling.

The first is standardisation, where $s(x)$ is the standard deviation of $x$:

$$x_{\text{stand}} = \frac{x - \text{mean}(x)}{s(x)}$$

The second is normalisation:

$$x_{\text{norm}} = \frac{x - \min(x)}{\max(x) - \min(x)}$$
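
Both formulas can be written directly in NumPy and checked against the corresponding `sklearn` scalers. This is a sketch on a made-up single-feature array; note that `StandardScaler` uses the population standard deviation, which is also the default of `np.std`:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One feature (ages) as a column vector
x = np.array([[44.0], [27.0], [30.0], [38.0], [40.0]])

# Standardisation: subtract the mean, divide by the standard deviation
x_stand = (x - x.mean()) / x.std()

# Normalisation: rescale to the range [0, 1]
x_norm = (x - x.min()) / (x.max() - x.min())

# Compare with the sklearn implementations
same_stand = np.allclose(x_stand, StandardScaler().fit_transform(x))
same_norm = np.allclose(x_norm, MinMaxScaler().fit_transform(x))
print(same_stand, same_norm)
```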

In [33]:
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train_sc = sc_X.fit_transform(X_train)
X_test_sc = sc_X.transform(X_test)

Note it's very important to *fit* your scaler on the `X_train` data set only and then apply that same scaling to the `X_test` set, so that no information from the test set leaks into training.
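
The sketch below illustrates what fitting on the training set means in practice. The arrays `X_train_demo` and `X_test_demo` are made up for this example:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative (age, salary) rows
X_train_demo = np.array([
    [27.0, 48000.0],
    [30.0, 54000.0],
    [38.0, 61000.0],
    [44.0, 72000.0],
])
X_test_demo = np.array([[50.0, 83000.0]])

scaler = StandardScaler()
scaler.fit(X_train_demo)

# The stored statistics come from the training rows only
print(scaler.mean_)

# The test row is scaled with the training mean and standard deviation
X_test_scaled = scaler.transform(X_test_demo)
print(X_test_scaled)
```

Because the test row lies above the training mean in both features, its scaled values are positive rather than centred on zero.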

Do we need to scale the dummy variables? It depends on the context: how much do you want to keep your models interpretable? If interpretability matters, you can run your ML models without scaling the dummy variables, so they keep their 0/1 meaning. If it does not, scaling them as well may help some models fit better.
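
If you decide to leave the dummy variables unscaled, one possible approach is `ColumnTransformer` from `sklearn`, which applies the scaler to selected columns only. This is a sketch on a made-up array with the same column order as our features; the name `selective_scaler` is illustrative:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Columns: Age, Salary, Country_France, Country_Germany, Country_Spain
X_demo = np.array([
    [44.0, 72000.0, 1, 0, 0],
    [27.0, 48000.0, 0, 0, 1],
    [30.0, 54000.0, 0, 1, 0],
    [38.0, 61000.0, 0, 0, 1],
])

# Scale columns 0 and 1 and pass the dummy columns through unchanged
selective_scaler = ColumnTransformer(
    transformers=[("numeric", StandardScaler(), [0, 1])],
    remainder="passthrough",
)
X_scaled = selective_scaler.fit_transform(X_demo)
print(X_scaled)
```

The transformed columns come first in the output, followed by the passed-through ones, which here keeps the original column order.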

Do we need to apply feature scaling to the `y` vector?

For classification problems with categorical variables as output: **no**.

For regression problems: often **yes**, particularly when the target takes a wide range of values.
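
When the target of a regression model is scaled, predictions have to be mapped back to the original units afterwards. Here is a minimal sketch of that round trip, assuming a made-up salary target; the names `y_reg` and `sc_y` are illustrative:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative regression target (salaries)
y_reg = np.array([48000.0, 54000.0, 61000.0, 72000.0])

# StandardScaler expects a 2D array, so reshape to a single column
sc_y = StandardScaler()
y_scaled = sc_y.fit_transform(y_reg.reshape(-1, 1))

# inverse_transform maps scaled values back to the original units
y_back = sc_y.inverse_transform(y_scaled).ravel()
print(np.allclose(y_back, y_reg))
```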

**Note: feature scaling is built into some** `sklearn` **models. Check the documentation before scaling the data yourself.**