# Data Preprocessing Tools

## Importing the libraries

In [18]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the dataset

In [19]:
dataset = pd.read_csv('Data.csv')
x = dataset.iloc[:, :-1].values  # features: every column except the last
y = dataset.iloc[:, -1].values   # target: the last column only

In [20]:
print(x)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


In [21]:
print(y)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']
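
As a minimal sketch of what the two `iloc` slices select, the example below uses a hypothetical three-row stand-in for `Data.csv`. The column names are assumptions, since the real file's header is not shown in these notes.

```python
import pandas as pd

# Hypothetical stand-in for Data.csv; the column names are assumed.
toy = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

features = toy.iloc[:, :-1].values  # all rows, every column except the last
target = toy.iloc[:, -1].values     # all rows, last column only

print(features.shape)  # (3, 3)
print(target.shape)    # (3,)
```

The features come back as a 2-D array and the target as a 1-D array, which is the shape most scikit-learn estimators expect.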


## Taking care of missing data

In [22]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')
imputer.fit(x[:, 1:3])  # learns the mean of the age and salary columns (indices 1 and 2)
x[:, 1:3] = imputer.transform(x[:, 1:3])  # replaces each missing value with its column mean


In [23]:
print(x)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
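
To check where the filled-in age comes from, the mean strategy can be reproduced by hand with NumPy. This is only a sketch of the idea, not how `SimpleImputer` is implemented. The ages are copied from the printed output above.

```python
import numpy as np

# Age column from the dataset above, with the one missing entry as NaN.
ages = np.array([44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37], dtype=float)

age_mean = np.nanmean(ages)  # mean of the nine known ages, ignoring NaN
ages_filled = np.where(np.isnan(ages), age_mean, ages)  # fill only the gap

print(age_mean)        # about 38.78, the value shown for the Spain row
print(ages_filled[6])  # the previously missing entry
```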


## Encoding categorical data

Rather than giving the countries numbers such as 1, 2, 3, it is better to give them vector values such as [1, 0, 0], [0, 1, 0] and [0, 0, 1]. This way the model does not assume any order or numerical relationship between the countries.
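
The vector idea can be sketched by hand with NumPy before using scikit-learn. The variable names here are illustrative only. Categories are sorted alphabetically, which matches the column order in the encoded output further down (France, Germany, Spain).

```python
import numpy as np

countries = np.array(["France", "Spain", "Germany", "Spain"])

# Sorted unique categories, plus the integer index of each row's category.
categories, codes = np.unique(countries, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for category i.
one_hot = np.eye(len(categories))[codes]

print(categories)  # ['France' 'Germany' 'Spain']
print(one_hot)
```

Each row contains a single 1, so no country is numerically "larger" than another.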


### Encoding the Independent Variable

In [24]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(transformers = [('encoder', OneHotEncoder(), [0])], remainder = 'passthrough')

x = np.array(ct.fit_transform(x))  # convert the output to a NumPy array for later steps



In [25]:
print(x)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


### Encoding the Dependent Variable

In [26]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()  # y is a single vector, so the encoder needs no arguments
y = le.fit_transform(y)

In [27]:
print(y)

[0 1 0 0 1 1 0 1 0 1]
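
To see which label maps to which integer, the fitted encoder can be inspected. A small sketch on a three-label sample (the variable names are illustrative):

```python
from sklearn.preprocessing import LabelEncoder

labels = ["No", "Yes", "No"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

print(encoder.classes_)                    # sorted classes; the position is the integer code
print(encoded)                             # integer codes for each label
print(encoder.inverse_transform(encoded))  # back to the original strings
```

Because the classes are sorted, 'No' becomes 0 and 'Yes' becomes 1, which agrees with the output above.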


## Splitting the dataset into the Training set and Test set

In [30]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 1)

In [31]:
print(x_train)

[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]


In [32]:
print(x_test)

[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


In [33]:
print(y_train)  # labels for the same customers as in x_train

[0 1 0 0 1 1 0 1]


In [34]:
print(y_test)  # labels for the same customers as in x_test

[0 1]
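
The claim that the rows of `x` and `y` stay paired after the split can be checked on toy data. In this sketch each label is built as ten times its feature value, purely so the pairing is easy to verify. The arrays are hypothetical and not part of the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

features = np.arange(10).reshape(-1, 1)  # ten rows, one feature
labels = np.arange(10) * 10              # label i is 10 * feature i

f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.2, random_state=1
)

print(len(f_train), len(f_test))  # 8 2
print(np.array_equal(l_train, f_train[:, 0] * 10))  # True: rows still paired
```

With ten rows and `test_size = 0.2`, eight rows go to training and two to testing, as in the notebook output above.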


## Feature Scaling

Feature scaling puts all features on the same scale. This prevents a feature with large values from dominating the others, which the ML model would otherwise neglect.


Feature scaling is applied AFTER splitting the dataset, because the test set is supposed to be brand-new data on which the ML model is evaluated, and it must not influence training. Feature scaling computes the mean and SD of each feature in order to perform the scaling. If you scale before the split, the mean and SD are computed over all values, including those that end up in the test set, which leaks information from the test set into training.
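
A minimal NumPy sketch of the difference, using made-up age values (not the notebook's actual split): the scaling statistics are taken from the training values only and then reused on the test values.

```python
import numpy as np

# Hypothetical ages, already split into training and test parts.
train = np.array([27.0, 35.0, 38.0, 40.0, 44.0, 48.0, 50.0])
test = np.array([30.0, 37.0])

# Correct: statistics come from the training set only.
mu, sigma = train.mean(), train.std()
train_scaled = (train - mu) / sigma
test_scaled = (test - mu) / sigma  # reuses the training statistics

# Leaky: statistics computed before the split also see the test values.
leaky_mu = np.concatenate([train, test]).mean()

print(mu, leaky_mu)  # the two means differ, so the test set changed the scaling
```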

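Standardisation rescales each feature to a mean of 0 and an SD of 1, so most values end up roughly between -3 and +3 (this is typical, not a hard limit). Normalisation rescales each feature to lie between 0 and 1. As a rule of thumb in these notes, normalisation is suggested when the features are normally distributed, while standardisation works in almost all cases.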
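
Both scalings can be written out by hand as a sketch. The ages below are the training-set ages from the `x_train` output above. Standardisation is (value - mean) / SD and normalisation is (value - min) / (max - min). The population SD (NumPy's default) is assumed here because it reproduces the scaled values printed further down.

```python
import numpy as np

# Training-set ages, in the order printed for x_train above.
ages = np.array([349 / 9, 40.0, 44.0, 38.0, 27.0, 48.0, 50.0, 35.0])

standardised = (ages - ages.mean()) / ages.std()               # mean 0, SD 1
normalised = (ages - ages.min()) / (ages.max() - ages.min())   # range 0 to 1

print(standardised[0])                     # about -0.1916
print(normalised.min(), normalised.max())  # 0.0 1.0
```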

Bear in mind that this is not necessary for all models.

In [35]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train[:, 3:] = sc.fit_transform(x_train[:, 3:])  # fit learns each feature's mean and SD from the training set; transform applies the scaling
x_test[:, 3:] = sc.transform(x_test[:, 3:])  # reuses the training-set mean and SD; the scaler is never fitted on the test set

In [36]:
print(x_train)  # age and salary are now on a comparable scale (roughly between -3 and 3)

[[0.0 0.0 1.0 -0.19159184384578545 -1.0781259408412425]
 [0.0 1.0 0.0 -0.014117293757057777 -0.07013167641635372]
 [1.0 0.0 0.0 0.566708506533324 0.633562432710455]
 [0.0 0.0 1.0 -0.30453019390224867 -0.30786617274297867]
 [0.0 0.0 1.0 -1.9018011447007988 -1.420463615551582]
 [1.0 0.0 0.0 1.1475343068237058 1.232653363453549]
 [0.0 1.0 0.0 1.4379472069688968 1.5749910381638885]
 [1.0 0.0 0.0 -0.7401495441200351 -0.5646194287757332]]


In [37]:
print(x_test) 

[[0.0 1.0 0.0 -1.4661817944830124 -0.9069571034860727]
 [1.0 0.0 0.0 -0.44973664397484414 0.2056403393225306]]
