Data Pre-Processing Guide
======================

In this guide, I apply several data preprocessing techniques that prepare data for machine learning, using scikit-learn, pandas, and NumPy.


### Step 1: Importing required Libraries and modules

In [184]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split

### Step 2: Reading the dataset into Pandas dataframe

In [185]:
dframe = pd.read_csv("./data.csv")
dframe

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


### Step 3: Separating X (independent variables) and Y (dependent variable) from the dataframe

In [186]:
x_data = dframe.iloc[:, :-1]
x_data

Unnamed: 0,Country,Age,Salary
0,France,44.0,72000.0
1,Spain,27.0,48000.0
2,Germany,30.0,54000.0
3,Spain,38.0,61000.0
4,Germany,40.0,
5,France,35.0,58000.0
6,Spain,,52000.0
7,France,48.0,79000.0
8,Germany,50.0,83000.0
9,France,37.0,67000.0


In [187]:
y_data = dframe.iloc[:,-1]
y_data

0     No
1    Yes
2     No
3     No
4    Yes
5    Yes
6     No
7    Yes
8     No
9    Yes
Name: Purchased, dtype: object

### Step 4: Dealing with missing Data

To deal with missing data, we can use either `SimpleImputer` from `sklearn.impute` (the older `Imputer` class in `sklearn.preprocessing` was removed in scikit-learn 0.22) or pandas methods. Here, each missing value is filled with the mean of its column, using pandas for convenience. Alternatively, for a NumPy feature array `X` whose columns 1 and 2 are numeric, `SimpleImputer` can be used as follows:<br>
> `imputer = SimpleImputer(missing_values=np.nan, strategy='mean')` <br>
> `imputer = imputer.fit(X[:, 1:3])` <br>
> `X[:, 1:3] = imputer.transform(X[:, 1:3])`

In [188]:
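# Fill NaNs in each numeric column with that column's mean (reassign instead of editing a slice in place)
x_data = x_data.fillna(x_data.mean(numeric_only=True))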
x_data

Unnamed: 0,Country,Age,Salary
0,France,44.0,72000.0
1,Spain,27.0,48000.0
2,Germany,30.0,54000.0
3,Spain,38.0,61000.0
4,Germany,40.0,63777.777778
5,France,35.0,58000.0
6,Spain,38.777778,52000.0
7,France,48.0,79000.0
8,Germany,50.0,83000.0
9,France,37.0,67000.0
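
The `SimpleImputer` alternative can be sketched on the same data. This is a minimal illustration, not the method used in the rest of the guide: the frame `x_demo` is rebuilt by hand from the table shown in Step 2, and the names `x_demo`, `numeric_cols`, and `x_imputed` are introduced here only for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Feature columns re-entered from the table in Step 2 (rows 4 and 6 have gaps).
x_demo = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44.0, 27.0, 30.0, 38.0, 40.0, 35.0, np.nan, 48.0, 50.0, 37.0],
    "Salary": [72000.0, 48000.0, 54000.0, 61000.0, np.nan,
               58000.0, 52000.0, 79000.0, 83000.0, 67000.0],
})

numeric_cols = ["Age", "Salary"]

# Mean imputation: each NaN is replaced by the mean of its own column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
x_imputed = x_demo.copy()
x_imputed[numeric_cols] = imputer.fit_transform(x_demo[numeric_cols])

# Show the two rows that originally contained missing values.
print(x_imputed.loc[[4, 6]])
```

One reason to prefer the imputer over `fillna` is that a fitted imputer remembers the training means and can apply them later to unseen data with `transform`.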


### Step 5: Converting Categorical Data to Numerical
#### Label Encoder vs One-Hot Encoding: A Concept

`LabelEncoder` turns [dog, cat, dog, mouse, cat] into [1, 0, 1, 2, 0] (classes are numbered from 0 in sorted order), but the imposed ordinality then implies that the average of cat and mouse is dog. Still, there are algorithms, such as decision trees and random forests, that can work with label-encoded categorical variables just fine, and the integer codes take less disk space than the original strings.
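
The ordinality problem is easy to reproduce. scikit-learn's `LabelEncoder` numbers the classes from 0 in sorted order, so the exact integers depend on the labels; the short sketch below prints the codes for the animal list and shows that averaging two codes lands on a third, unrelated class. The variable names are chosen only for this illustration.

```python
from sklearn.preprocessing import LabelEncoder

animals = ["dog", "cat", "dog", "mouse", "cat"]

encoder = LabelEncoder()
codes = encoder.fit_transform(animals)

# Classes are stored in sorted order; the code of a class is its position.
print(list(encoder.classes_))
print(codes.tolist())

# Averaging the codes of "cat" and "mouse" gives the code of a different class.
cat_code, mouse_code = encoder.transform(["cat", "mouse"])
mean_code = (cat_code + mouse_code) // 2
mean_label = encoder.inverse_transform([mean_code])[0]
print(mean_label)
```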

One-hot encoding has the advantage that the result is binary rather than ordinal and that everything sits in an orthogonal vector space. The disadvantage is that for high cardinality, the feature space can blow up quickly and you start fighting the curse of dimensionality. In these cases, I typically employ one-hot encoding followed by PCA for dimensionality reduction. I find that the judicious combination of one-hot encoding plus PCA can seldom be beaten by other encoding schemes. PCA finds the linear overlap, so it naturally tends to group similar features into the same component. <br><br>
Reference: https://datascience.stackexchange.com/questions/9443/when-to-use-one-hot-encoding-vs-labelencoder-vs-dictvectorizor
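
As a rough sketch of the one-hot-plus-PCA idea, the snippet below one-hot encodes a synthetic column with 50 distinct labels and reduces the result to 10 components. The data, the label names, and the choice of 10 components are assumptions made purely for illustration; on real data the number of components would need to be tuned.

```python
import pandas as pd
from sklearn.decomposition import PCA

# Synthetic high-cardinality column: 200 rows cycling through 50 labels.
labels = pd.Series([f"cat_{i % 50}" for i in range(200)], name="category")

# One-hot encoding produces one binary column per distinct label.
one_hot = pd.get_dummies(labels, dtype=float)

# PCA projects the 50 binary columns onto 10 linear components.
pca = PCA(n_components=10)
reduced = pca.fit_transform(one_hot)

print(one_hot.shape, reduced.shape)
```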

In [189]:
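from sklearn.compose import ColumnTransformer

# OneHotEncoder's categorical_features argument was removed in scikit-learn 0.22;
# ColumnTransformer selects the column to encode and passes the others through.
column_transformer = ColumnTransformer(
    [("country", OneHotEncoder(), ["Country"])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
x = column_transformer.fit_transform(x_data)
x_hot_encoded = x.astype(np.int64)
x_hot_encoded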
x_hot_encoded

array([[    1,     0,     0,    44, 72000],
       [    0,     0,     1,    27, 48000],
       [    0,     1,     0,    30, 54000],
       [    0,     0,     1,    38, 61000],
       [    0,     1,     0,    40, 63777],
       [    1,     0,     0,    35, 58000],
       [    0,     0,     1,    38, 52000],
       [    1,     0,     0,    48, 79000],
       [    0,     1,     0,    50, 83000],
       [    1,     0,     0,    37, 67000]], dtype=int64)

**Note:** Country is dummy (one-hot) encoded so that each country remains a separate, interpretable feature.

In [190]:
label_encoder_y = LabelEncoder()
y_label_encoded = label_encoder_y.fit_transform(y_data)
y_label_encoded

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1], dtype=int64)

### Step 6: Splitting Train and Test Data

In [191]:
x_train, x_test, y_train, y_test = train_test_split(x_hot_encoded, y_label_encoded, test_size=0.2, random_state=0)

### Step 7: Feature Scaling

The dummy-encoded Country columns (the first three) are exempted from scaling to preserve their interpretability; only Age and Salary are scaled.

In [192]:
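feature_scaler = StandardScaler()
# Cast to float first: writing scaled values into an int64 array would truncate them
x_train = x_train.astype(np.float64)
x_test = x_test.astype(np.float64)
x_train[:,[-2,-1]] = feature_scaler.fit_transform(x_train[:,[-2,-1]])
x_test[:,[-2,-1]] = feature_scaler.transform(x_test[:,[-2,-1]])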
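
To make explicit what the scaler does, here is a small sketch comparing `StandardScaler` with the manual formula z = (x - mean) / std, where both statistics come from the training rows only. The `train` and `test` arrays are hypothetical Age/Salary values picked for the example, not the actual split produced above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical Age and Salary values standing in for a train/test split.
train = np.array([[44.0, 72000.0],
                  [27.0, 48000.0],
                  [30.0, 54000.0],
                  [38.0, 61000.0]])
test = np.array([[35.0, 58000.0]])

# Fit on the training rows only, then reuse the same statistics for the test rows.
scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)
test_scaled = scaler.transform(test)

# Manual equivalent using the training mean and population std (ddof=0).
mean = train.mean(axis=0)
std = train.std(axis=0)
manual_test = (test - mean) / std

print(np.allclose(test_scaled, manual_test))
```

Fitting on the training data alone keeps information about the test set from leaking into the scaling parameters.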



So far, we have successfully implemented the data preprocessing steps.