<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Data-Preprocessing" data-toc-modified-id="Data-Preprocessing-1">Data Preprocessing</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Import-the-dataset" data-toc-modified-id="Import-the-dataset-1.0.1">Import the dataset</a></span></li></ul></li></ul></li><li><span><a href="#Handling-the-Missing-Data" data-toc-modified-id="Handling-the-Missing-Data-2">Handling the Missing Data</a></span></li><li><span><a href="#Handling-the-Categorical-Data" data-toc-modified-id="Handling-the-Categorical-Data-3">Handling the Categorical Data</a></span><ul class="toc-item"><li><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Encoding-categorical-data" data-toc-modified-id="Encoding-categorical-data-3.0.0.1">Encoding categorical data</a></span></li></ul></li></ul></li></ul></li><li><span><a href="#Splitting-the-dataset-into-the-Training-Set-and-Test-Set" data-toc-modified-id="Splitting-the-dataset-into-the-Training-Set-and-Test-Set-4">Splitting the dataset into the Training Set and Test Set</a></span></li><li><span><a href="#Feature-Scaling" data-toc-modified-id="Feature-Scaling-5">Feature Scaling</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Do-we-need-to-apply-feature-scaling-to-dependent-variable-Y?" data-toc-modified-id="Do-we-need-to-apply-feature-scaling-to-dependent-variable-Y?-5.0.1">Do we need to apply feature scaling to dependent variable Y?</a></span></li></ul></li></ul></li></ul></div>

# Data Preprocessing

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn import preprocessing
from sklearn.preprocessing import Imputer  # removed in scikit-learn 0.22; newer releases provide sklearn.impute.SimpleImputer instead
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.preprocessing import StandardScaler



### Import the dataset

In [2]:
dataset = pd.read_csv("/Users/ankitsharma/Documents/Documents/Stuff/Machine Learning/Udemy Course/Part 1 - Data Preprocessing/Data.csv")


# Matrix of features

X = dataset.iloc[:, :-1].values
# The first ":" selects all rows; ":-1" selects all columns except the last one

Y = dataset.iloc[:, 3].values  # dependent variable: the last column (Purchased) has index 3

In [3]:
dataset.head()

   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes


In [4]:
print(X)

print("\n")

print(Y)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


In [5]:
print(dataset)

   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes
5   France  35.0  58000.0       Yes
6    Spain   NaN  52000.0        No
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes
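
The CSV path above is specific to the author's machine. As a minimal sketch, assuming you only want to follow along without the file, the same ten rows printed above can be rebuilt in memory and sliced the same way (using `-1` instead of `3` for the last column is a choice made here, not the notebook's):

```python
import numpy as np
import pandas as pd

# Rebuild the ten rows shown in the printed dataset above
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
    "Purchased": ["No", "Yes", "No", "No", "Yes",
                  "Yes", "No", "Yes", "No", "Yes"],
})

X = dataset.iloc[:, :-1].values  # all rows, every column except the last
Y = dataset.iloc[:, -1].values   # all rows, last column only

print(X.shape, Y.shape)
```

Indexing the last column as `-1` keeps working if more feature columns are added later.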


# Handling the Missing Data

As you can see, `Age` is missing in the seventh row (index 6) and `Salary` is missing in the fifth row (index 4).

> We will replace each missing value with the mean of the remaining values in that column.

In [6]:
?Imputer  # Shows the docstring and parameters of a function or class; the online documentation has the same information

In [7]:
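imputer = Imputer(missing_values = "NaN", strategy = "mean", axis = 0)  # axis = 0: take the mean of each column

imputer = imputer.fit(X[:, 1:3])  # columns 1 and 2 (Age and Salary); the upper bound 3 is excluded

# Replace the missing values with the column means

X[:, 1:3] = imputer.transform(X[:, 1:3])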

In [8]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
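
The filled-in values can be checked by hand: the nine known ages sum to 349, so the mean is 349 / 9 ≈ 38.78, and the nine known salaries sum to 574000, giving 574000 / 9 ≈ 63777.78. The sketch below repeats the step on recent scikit-learn releases, assuming `SimpleImputer` from `sklearn.impute` as the replacement for `Imputer`; the array name `numeric` is only illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and Salary columns from the dataset, with the two missing values as NaN
numeric = np.array([
    [44, 72000], [27, 48000], [30, 54000], [38, 61000], [40, np.nan],
    [35, 58000], [np.nan, 52000], [48, 79000], [50, 83000], [37, 67000],
], dtype=float)

# SimpleImputer always works column by column, so there is no axis argument
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(numeric)

# Cross-check against the column means computed while ignoring NaN
column_means = np.nanmean(numeric, axis=0)
print(column_means)
print(filled[6, 0], filled[4, 1])
```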


# Handling the Categorical Data

We have two categorical columns, `Country` and `Purchased`.

* Reason?
> Simply because they contain categories rather than numeric values.

* `Country` contains 3 categories : **France, Spain** and **Germany**  

* `Purchased` contains 2 categories : **Yes, No**

#### Encoding categorical data

In [9]:
# Create the encoder object

labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])  # Still a problem: the integer codes imply an order, so an ML algorithm may treat a country with a higher code as "greater" than the others

# To prevent this we use dummy (one-hot) encoding: one 0/1 column per country
# (the categorical_features argument was removed in scikit-learn 0.22)
onehotencoder = OneHotEncoder(categorical_features = [0])

X = onehotencoder.fit_transform(X).toarray()

In [10]:
print(X)

[[  1.00000000e+00   0.00000000e+00   0.00000000e+00   4.40000000e+01
    7.20000000e+04]
 [  0.00000000e+00   0.00000000e+00   1.00000000e+00   2.70000000e+01
    4.80000000e+04]
 [  0.00000000e+00   1.00000000e+00   0.00000000e+00   3.00000000e+01
    5.40000000e+04]
 [  0.00000000e+00   0.00000000e+00   1.00000000e+00   3.80000000e+01
    6.10000000e+04]
 [  0.00000000e+00   1.00000000e+00   0.00000000e+00   4.00000000e+01
    6.37777778e+04]
 [  1.00000000e+00   0.00000000e+00   0.00000000e+00   3.50000000e+01
    5.80000000e+04]
 [  0.00000000e+00   0.00000000e+00   1.00000000e+00   3.87777778e+01
    5.20000000e+04]
 [  1.00000000e+00   0.00000000e+00   0.00000000e+00   4.80000000e+01
    7.90000000e+04]
 [  0.00000000e+00   1.00000000e+00   0.00000000e+00   5.00000000e+01
    8.30000000e+04]
 [  1.00000000e+00   0.00000000e+00   0.00000000e+00   3.70000000e+01
    6.70000000e+04]]
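
In the output above, the first three columns are the dummy variables for France, Germany and Spain (in alphabetical order), followed by Age and Salary. A rough sketch of the same encoding on recent scikit-learn releases, assuming `ColumnTransformer` from `sklearn.compose` as the stand-in for the removed `categorical_features` argument, could look like this; the names `features` and `transformer` are illustrative and only the first three rows of the dataset are used.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# First three rows of the feature matrix
features = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
})

# One-hot encode Country, pass Age and Salary through unchanged.
# sparse_threshold=0 asks for a dense array as the result.
transformer = ColumnTransformer(
    transformers=[("country", OneHotEncoder(), ["Country"])],
    remainder="passthrough",
    sparse_threshold=0,
)

encoded = np.asarray(transformer.fit_transform(features), dtype=float)
print(encoded)
```

With this approach the separate `LabelEncoder` step for `Country` is not needed, since `OneHotEncoder` accepts string categories directly.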


In [11]:
# Now encode the Purchased column

labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)


In [12]:
print(Y)  # 0 = No, 1 = Yes

[0 1 0 0 1 1 0 1 0 1]
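
The mapping between labels and integers does not have to be remembered: a fitted `LabelEncoder` stores the labels in sorted order in `classes_`, and `inverse_transform` converts codes back. A small sketch, using the first five `Purchased` values as sample input (the variable names are illustrative):

```python
from sklearn.preprocessing import LabelEncoder

purchased = ["No", "Yes", "No", "No", "Yes"]

encoder = LabelEncoder()
codes = encoder.fit_transform(purchased)

# classes_ holds the labels in sorted order; the position is the integer code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
decoded = encoder.inverse_transform(codes)

print(mapping)
print(list(decoded))
```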


# Splitting the dataset into the Training Set and Test Set

* Need?
> The model learns patterns from one part of the data (the training set).
  We then need to check it on data it has not seen (the test set).
  The test set comes from the same dataset, so it should not differ much from the training data, and a model that has learned well should perform similarly on both.

In [13]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)  # test_size is a fraction: 0.2 holds out 20% of the rows for the test set

In [14]:
X_test

array([[  0.00000000e+00,   1.00000000e+00,   0.00000000e+00,
          3.00000000e+01,   5.40000000e+04],
       [  0.00000000e+00,   1.00000000e+00,   0.00000000e+00,
          5.00000000e+01,   8.30000000e+04]])
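
With 10 rows and `test_size = 0.2`, two rows go to the test set and eight stay in the training set. The sketch below checks this on placeholder features (the `np.arange` matrix is hypothetical; only the labels come from the notebook) and confirms that no row lands in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder feature matrix with 10 distinct rows; labels as printed earlier
X = np.arange(20).reshape(10, 2)
Y = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0
)

# Every row ends up in exactly one of the two sets
train_rows = {tuple(row) for row in X_train}
test_rows = {tuple(row) for row in X_test}

print(X_train.shape, X_test.shape)
print(train_rows.isdisjoint(test_rows))
```

Fixing `random_state` makes the split reproducible from run to run.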

# Feature Scaling

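* Why do we need this?

> `Age` and `Salary` are on very different scales. Without scaling, the feature with the larger values (`Salary`) would dominate the other, for example in distance calculations.


* **Standardisation**
> $X_{\text{stand}} = \frac{X - \text{mean}(X)}{\text{standard deviation}(X)}$

* **Normalisation**
> $X_{\text{norm}} = \frac{X - \min(X)}{\max(X) - \min(X)}$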
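
Both formulas are one line each in NumPy. The sketch below applies them to the first five ages from the dataset; the variable names are illustrative. Note that `ages.std()` is the population standard deviation (dividing by n), which is also what `StandardScaler` uses.

```python
import numpy as np

ages = np.array([44.0, 27.0, 30.0, 38.0, 40.0])

# Standardisation: subtract the mean, divide by the standard deviation
standardised = (ages - ages.mean()) / ages.std()

# Normalisation: rescale so the minimum maps to 0 and the maximum to 1
normalised = (ages - ages.min()) / (ages.max() - ages.min())

print(standardised)
print(normalised)
```

After standardisation the values have mean 0 and standard deviation 1; after normalisation they lie between 0 and 1.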

In [15]:
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)  # no fit here: reuse the mean and standard deviation already fitted on the training set

# Do we need to scale the dummy variables as well?
# We can; here every column, including the dummy variables, is scaled.

In [16]:
print(X_train)

print("\n Now the X test \n")

print(X_test)

[[-1.          2.64575131 -0.77459667  0.26306757  0.12381479]
 [ 1.         -0.37796447 -0.77459667 -0.25350148  0.46175632]
 [-1.         -0.37796447  1.29099445 -1.97539832 -1.53093341]
 [-1.         -0.37796447  1.29099445  0.05261351 -1.11141978]
 [ 1.         -0.37796447 -0.77459667  1.64058505  1.7202972 ]
 [-1.         -0.37796447  1.29099445 -0.0813118  -0.16751412]
 [ 1.         -0.37796447 -0.77459667  0.95182631  0.98614835]
 [ 1.         -0.37796447 -0.77459667 -0.59788085 -0.48214934]]

 Now the X test 

[[-1.          2.64575131 -0.77459667 -1.45882927 -0.90166297]
 [-1.          2.64575131 -0.77459667  1.98496442  2.13981082]]
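
This is why some test values fall outside the range seen in the training output: the test rows are scaled with the training set's mean and standard deviation, not their own. A minimal sketch on a handful of Age and Salary rows from the dataset (the particular choice of training rows is illustrative, not the notebook's actual split):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A few Age/Salary rows used as a stand-in training set
X_train = np.array([[44.0, 72000.0], [27.0, 48000.0],
                    [38.0, 61000.0], [35.0, 58000.0]])
# The two rows that appear in the notebook's test set
X_test = np.array([[30.0, 54000.0], [50.0, 83000.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean_ and scale_ here
X_test_scaled = scaler.transform(X_test)        # reuses them, no refit

# transform() is just the standardisation formula with training statistics
manual = (X_test - scaler.mean_) / scaler.scale_
print(np.allclose(X_test_scaled, manual))
```

Fitting the scaler on the test set as well would let information from the test data leak into preprocessing.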


### Do we need to apply feature scaling to dependent variable Y?

> No, because here Y is a categorical variable with only two values (0 and 1).