 # Data Preprocessing Template
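Data preprocessing is an essential step in machine learning: raw data usually contains missing values, categorical text, and features on very different scales, all of which must be handled before a model is trained.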

In [1]:
# Importing the libraries 
import numpy as np # Numerical arrays and mathematics (linear algebra)
import matplotlib.pyplot as plt # For plotting and viewing graphs from datasets
import pandas as pd # Importing and managing datasets

In [2]:
# Importing the dataset
dataset = pd.read_csv('Data.csv') # Read the dataset and store into the 'dataset' variable
print(dataset) # Let's take a look at the dataset
%store dataset

   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes
5   France  35.0  58000.0       Yes
6    Spain   NaN  52000.0        No
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes
Stored 'dataset' (DataFrame)


In [3]:
# Create Matrix of Features
#%store -r dataset
X = dataset.iloc[:, :-1].values # All columns except the last one
print("\nThe X variable (dataset) predictor variables \n")
print(X)
y = dataset.iloc[:, 3].values # Target variable: the 'Purchased' column (index 3)
print("\nThe Y variable (dataset) \n")
print(y)


The X variable (dataset) predictor variables 

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]

The Y variable (dataset) 

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


In [4]:
# Fix missing values
from sklearn.impute import SimpleImputer # Replaces the Imputer class, which was removed in scikit-learn 0.22
# help(SimpleImputer) # Imputation transformer for completing missing values
# Here, we replace NaN (missing values) with the column mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean') # 'mean' is the default strategy; it is applied column by column
imputer.fit(X[:, 1:3]) # Learn the means of the Age and Salary columns
X[:, 1:3] = imputer.transform(X[:, 1:3]) # Fill in the missing entries
X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 63777.77777777778],
       ['France', 35.0, 58000.0],
       ['Spain', 38.77777777777778, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)
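
As a sanity check on the imputed values in the output above (about 38.78 for the missing age and 63777.78 for the missing salary), mean imputation can be reproduced by hand. This is a minimal sketch using plain NumPy rather than scikit-learn; the helper name `impute_mean` is made up for illustration, and the arrays are copied from the dataset printed earlier.

```python
import numpy as np

# Age and Salary columns copied from the dataset, with NaN for missing entries
age = np.array([44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37], dtype=float)
salary = np.array([72000, 48000, 54000, 61000, np.nan,
                   58000, 52000, 79000, 83000, 67000], dtype=float)

def impute_mean(column):
    """Return a copy of `column` with NaN replaced by the mean of the known values.

    Hypothetical helper for illustration only, not part of scikit-learn.
    """
    filled = column.copy()
    filled[np.isnan(filled)] = np.nanmean(column)  # nanmean ignores the NaN entries
    return filled

age_filled = impute_mean(age)
salary_filled = impute_mean(salary)

print(age_filled[6])     # the age that was missing in row 6
print(salary_filled[4])  # the salary that was missing in row 4
```

The mean is taken over the nine known values in each column, which is why a single missing entry does not distort the other rows.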

In [8]:
# Convert Categorical Data
from sklearn.preprocessing import LabelEncoder
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0]) # Encode the Country column as integers
print("This is X after Label Encoding\n")
print(X)
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y) # Use the y encoder here, so labelencoder_X keeps its Country mapping
print("\n\nThis is y after Label Encoding\n")
print(y)

This is X after Label Encoding

[[  1.00000000e+00   0.00000000e+00   0.00000000e+00   4.40000000e+01
    7.20000000e+04]
 [  0.00000000e+00   0.00000000e+00   1.00000000e+00   2.70000000e+01
    4.80000000e+04]
 [  0.00000000e+00   1.00000000e+00   0.00000000e+00   3.00000000e+01
    5.40000000e+04]
 [  0.00000000e+00   0.00000000e+00   1.00000000e+00   3.80000000e+01
    6.10000000e+04]
 [  0.00000000e+00   1.00000000e+00   0.00000000e+00   4.00000000e+01
    6.37777778e+04]
 [  1.00000000e+00   0.00000000e+00   0.00000000e+00   3.50000000e+01
    5.80000000e+04]
 [  0.00000000e+00   0.00000000e+00   1.00000000e+00   3.87777778e+01
    5.20000000e+04]
 [  1.00000000e+00   0.00000000e+00   0.00000000e+00   4.80000000e+01
    7.90000000e+04]
 [  0.00000000e+00   1.00000000e+00   0.00000000e+00   5.00000000e+01
    8.30000000e+04]
 [  1.00000000e+00   0.00000000e+00   0.00000000e+00   3.70000000e+01
    6.70000000e+04]]


This is y after Label Encoding

[0 1 0 0 1 1 0 1 0 1]


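Machine learning models operate on numbers, so the categorical `Country` column has to be encoded numerically. The catch with label encoding is that it assigns each category an integer, and a model can read those integers as an order or magnitude that does not exist. For example, the encoding

* 0 for `France`
* 1 for `Germany`
* 2 for `Spain`

could lead a model to treat `Spain` as "greater than" `France` or `Germany`, even though the countries have no natural order, and this could produce an invalid model. One-hot encoding avoids this by giving each country its own 0/1 column.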
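
To make the ordering problem concrete, the sketch below compares the pairwise distances between the three countries under integer codes and under one-hot vectors. It uses plain NumPy instead of the scikit-learn encoders, and the variable names are illustrative assumptions rather than anything defined in this notebook.

```python
import numpy as np

countries = np.array(["France", "Spain", "Germany", "Spain", "Germany",
                      "France", "Spain", "France", "Germany", "France"])

# np.unique sorts the categories alphabetically, giving France=0, Germany=1, Spain=2
labels, codes = np.unique(countries, return_inverse=True)

# One row per category: integer code vs. one-hot vector
category_codes = np.arange(len(labels))
category_one_hot = np.eye(len(labels))

# Pairwise distances between categories under each encoding
label_dist = np.abs(category_codes[:, None] - category_codes[None, :])
one_hot_dist = np.linalg.norm(
    category_one_hot[:, None, :] - category_one_hot[None, :, :], axis=2
)

print(labels)
print(label_dist)    # France is "twice as far" from Spain as from Germany
print(one_hot_dist)  # every pair of distinct countries is equally far apart
```

With integer codes, the gap between `France` and `Spain` is twice the gap between `France` and `Germany`, which is an artifact of alphabetical order. With one-hot vectors, every pair of distinct countries is the same distance apart, so no spurious ranking is introduced.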

In [6]:
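from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# The categorical_features argument was removed in scikit-learn 0.22; ColumnTransformer selects the column instead
ct = ColumnTransformer([('country', OneHotEncoder(), [0])], remainder='passthrough', sparse_threshold=0) # One-hot encode column 0, keep Age and Salary
X = np.asarray(ct.fit_transform(X), dtype=float) # Dense float array: three dummy columns, then Age and Salary
X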

array([[  1.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          4.40000000e+01,   7.20000000e+04],
       [  0.00000000e+00,   0.00000000e+00,   1.00000000e+00,
          2.70000000e+01,   4.80000000e+04],
       [  0.00000000e+00,   1.00000000e+00,   0.00000000e+00,
          3.00000000e+01,   5.40000000e+04],
       [  0.00000000e+00,   0.00000000e+00,   1.00000000e+00,
          3.80000000e+01,   6.10000000e+04],
       [  0.00000000e+00,   1.00000000e+00,   0.00000000e+00,
          4.00000000e+01,   6.37777778e+04],
       [  1.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          3.50000000e+01,   5.80000000e+04],
       [  0.00000000e+00,   0.00000000e+00,   1.00000000e+00,
          3.87777778e+01,   5.20000000e+04],
       [  1.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          4.80000000e+01,   7.90000000e+04],
       [  0.00000000e+00,   1.00000000e+00,   0.00000000e+00,
          5.00000000e+01,   8.30000000e+04],
       [  1.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          3.70000000e+01,   6.70000000e+04]])

In [10]:
# Splitting the dataset into a training set and a test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) # Hold out 20% of the rows (2 of 10) for testing

### Feature Scaling

Feature scaling is an important preprocessing step. If the variables are not on the same scale, the machine learning model can give misleading results. Why is that?

Many machine learning models rely on the **Euclidean Distance** between observations, which sums the squared differences of each feature. If the features are on different scales, the feature with the larger range dominates that sum. Here, `Salary` differences are in the thousands while `Age` differences are in the tens, so without scaling the distance would be determined almost entirely by `Salary`. Thus, scaling is needed.
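
The effect can be checked numerically. The sketch below, which assumes only NumPy and uses the eight complete rows of the dataset, measures how much of the squared Euclidean distance between the first two rows comes from `Salary`, before and after standardizing the columns. The helper `salary_share` is a made-up name for illustration.

```python
import numpy as np

# Age and Salary for the eight rows of the dataset that have no missing values
data = np.array([
    [44, 72000], [27, 48000], [30, 54000], [38, 61000],
    [35, 58000], [48, 79000], [50, 83000], [37, 67000],
], dtype=float)

def salary_share(a, b):
    """Fraction of the squared Euclidean distance between a and b due to Salary."""
    squared_diff = (a - b) ** 2
    return squared_diff[1] / squared_diff.sum()

# Unscaled: compare the first two rows (age 44 vs. 27, salary 72000 vs. 48000)
raw_share = salary_share(data[0], data[1])

# Standardized: subtract each column's mean and divide by its standard deviation
scaled = (data - data.mean(axis=0)) / data.std(axis=0)
scaled_share = salary_share(scaled[0], scaled[1])

print(f"Salary share of squared distance, raw:    {raw_share:.6f}")
print(f"Salary share of squared distance, scaled: {scaled_share:.6f}")
```

On the raw data, `Salary` accounts for practically all of the squared distance, so the 17-year age gap is invisible to the model. After standardization, both features contribute on a comparable footing.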

There are two common types of feature scaling:
1. **Standardization**

$$x_{stand}=\frac{x-\bar{x}}{\sigma_x}$$

where $\sigma_x$ is the standard deviation and $\bar{x}$ is the mean of $x$.

2. **Normalization**

$$x_{norm}=\frac{x-\min(x)}{\max(x)-\min(x)}$$
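
Both formulas translate directly into NumPy. The following is a minimal sketch applied to the nine known ages from the dataset; the function names `standardize` and `normalize` are illustrative and not part of any library. Note that `np.std` computes the population standard deviation by default, which, as far as I know, is also what scikit-learn's `StandardScaler` uses.

```python
import numpy as np

def standardize(x):
    """Standardization: zero mean and unit standard deviation."""
    return (x - x.mean()) / x.std()

def normalize(x):
    """Normalization (min-max scaling): values mapped onto [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

# The nine known ages from the dataset
age = np.array([44, 27, 30, 38, 40, 35, 48, 50, 37], dtype=float)

age_stand = standardize(age)
age_norm = normalize(age)

print(age_stand.round(3))
print(age_norm.round(3))
```

After standardization the values are centered on 0 and can be negative, while after normalization the smallest age (27) maps to 0 and the largest (50) maps to 1.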

In [None]:
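# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train) # Fit the scaler on the training set only, then scale it
X_test = sc_X.transform(X_test) # Reuse the training mean and standard deviation on the test set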