## Data Preprocessing

1. Importing Python Libraries
2. Reading Data
3. Missing Data
4. Handling Categorical Data
5. Splitting Data
6. Normalizing Data

In [1]:
import numpy as np
import pandas as pd

In [2]:
data = pd.read_csv('2.1 Data.csv')

In [3]:
data

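| | City | Age | Sex | Smoke | HappinessIndex | Healthy |
|---|---|---|---|---|---|---|
| 0 | Mumbai | 24.0 | Male | Yes | 241.0 | Yes |
| 1 | London | 80.0 | Female | No | 928.0 | No |
| 2 | NewYork | 38.0 | Male | Yes | NaN | Yes |
| 3 | NewYork | 22.0 | Female | Yes | 786.0 | Yes |
| 4 | NewYork | 36.0 | Male | Yes | 967.0 | Yes |
| 5 | London | NaN | Female | Yes | 665.0 | Yes |
| 6 | Mumbai | 17.0 | Female | No | 293.0 | No |
| 7 | NewYork | 28.0 | Female | No | 494.0 | Yes |
| 8 | Mumbai | 45.0 | Female | No | 707.0 | No |
| 9 | London | 29.0 | Male | Yes | 599.0 | No |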


In [4]:
X = data.iloc[:, 0:5].values  # features; .values returns a NumPy array, without it iloc returns a DataFrame
y = data.iloc[:, 5].values    # target column (Healthy)

In [5]:
X

array([['Mumbai', 24.0, 'Male', 'Yes', 241.0],
       ['London', 80.0, 'Female', 'No', 928.0],
       ['NewYork', 38.0, 'Male', 'Yes', nan],
       ['NewYork', 22.0, 'Female', 'Yes', 786.0],
       ['NewYork', 36.0, 'Male', 'Yes', 967.0],
       ['London', nan, 'Female', 'Yes', 665.0],
       ['Mumbai', 17.0, 'Female', 'No', 293.0],
       ['NewYork', 28.0, 'Female', 'No', 494.0],
       ['Mumbai', 45.0, 'Female', 'No', 707.0],
       ['London', 29.0, 'Male', 'Yes', 599.0]], dtype=object)
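Before imputing, it can help to confirm which columns actually contain missing values. The sketch below is illustrative only: `sample` is a hypothetical three-row frame built from rows shown above, not the full dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical subset: a few rows copied from the table above
sample = pd.DataFrame({
    'City': ['Mumbai', 'NewYork', 'London'],
    'Age': [24.0, 38.0, np.nan],
    'HappinessIndex': [241.0, np.nan, 665.0],
})

# isnull() marks missing cells; sum() counts them per column
missing_per_column = sample.isnull().sum()
print(missing_per_column)
```

Running the same call on the full `data` frame would show which columns need imputation.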

## Missing Data

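In [6]:
from sklearn.impute import SimpleImputer  # replaces sklearn.preprocessing.Imputer, removed in scikit-learn 0.22

In [7]:
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')  # works column-wise, like the old axis=0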

In [8]:
X[:,1:2]

array([[24.0],
       [80.0],
       [38.0],
       [22.0],
       [36.0],
       [nan],
       [17.0],
       [28.0],
       [45.0],
       [29.0]], dtype=object)

In [9]:
X[:,1:2] = imputer.fit_transform(X[:,1:2])

X[:,4:5] = imputer.fit_transform(X[:,4:5])

# .fit computes the mean of the non-missing values in each column; .transform then fills the missing entries with that mean
# .fit_transform does both fit and transform in one call
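To make the difference between `fit`, `transform` and `fit_transform` concrete, here is a minimal sketch on a toy column. It assumes scikit-learn 0.20 or later, where `SimpleImputer` is available; the array `ages` is made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy column with one missing value
ages = np.array([[20.0], [30.0], [np.nan], [40.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# fit only learns the column mean from the non-missing entries
imputer.fit(ages)
learned_mean = imputer.statistics_[0]

# transform fills the missing entries with the learned mean
filled = imputer.transform(ages)

# fit_transform performs both steps in a single call
filled_one_step = SimpleImputer(strategy='mean').fit_transform(ages)
```

Keeping `fit` and `transform` separate matters when the statistics should be learned on one dataset and applied to another.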

In [10]:
X

array([['Mumbai', 24.0, 'Male', 'Yes', 241.0],
       ['London', 80.0, 'Female', 'No', 928.0],
       ['NewYork', 38.0, 'Male', 'Yes', 631.1111111111111],
       ['NewYork', 22.0, 'Female', 'Yes', 786.0],
       ['NewYork', 36.0, 'Male', 'Yes', 967.0],
       ['London', 35.44444444444444, 'Female', 'Yes', 665.0],
       ['Mumbai', 17.0, 'Female', 'No', 293.0],
       ['NewYork', 28.0, 'Female', 'No', 494.0],
       ['Mumbai', 45.0, 'Female', 'No', 707.0],
       ['London', 29.0, 'Male', 'Yes', 599.0]], dtype=object)

In [13]:
y

array(['Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'No'],
      dtype=object)

## Categorical Data

Most machine learning models work only with numeric data. They cannot directly handle non-numeric values such as strings.

In this section, we will see how to convert categorical data into a numeric form that our model can use.

In [12]:
X # here City, Sex and Smoke (Yes/No) are still strings

array([['Mumbai', 24.0, 'Male', 'Yes', 241.0],
       ['London', 80.0, 'Female', 'No', 928.0],
       ['NewYork', 38.0, 'Male', 'Yes', 631.1111111111111],
       ['NewYork', 22.0, 'Female', 'Yes', 786.0],
       ['NewYork', 36.0, 'Male', 'Yes', 967.0],
       ['London', 35.44444444444444, 'Female', 'Yes', 665.0],
       ['Mumbai', 17.0, 'Female', 'No', 293.0],
       ['NewYork', 28.0, 'Female', 'No', 494.0],
       ['Mumbai', 45.0, 'Female', 'No', 707.0],
       ['London', 29.0, 'Male', 'Yes', 599.0]], dtype=object)

----
If a column has only two categories (Yes/No), we can convert Yes -> 1 and No -> 0 using **LabelEncoder**.

With more than two categories, integer codes such as (Mum - 0, Lon - 1, New - 2) are a problem, because an ML model may treat them as ordered (New > Lon > Mum). That is not our intention. The solution is **One Hot Encoder**, which gives each category its own 0/1 column:

Mum  Lon  New <br>
1     0    0 <br>
0     1    0 <br>
0     0    1
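The two encodings can be compared side by side with a short sketch. The `cities` array is a made-up example, and the snippet assumes a scikit-learn version (0.20 or later) in which `OneHotEncoder` accepts string input directly.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical example column
cities = np.array(['Mumbai', 'London', 'NewYork', 'London'])

# LabelEncoder assigns integers following the sorted order of the class names:
# London -> 0, Mumbai -> 1, NewYork -> 2
labels = LabelEncoder().fit_transform(cities)

# OneHotEncoder expects 2-D input, so reshape to a single column;
# the result is sparse by default, hence .toarray()
onehot_matrix = OneHotEncoder().fit_transform(cities.reshape(-1, 1)).toarray()

print(labels)
print(onehot_matrix)
```

Note that the encoder orders the one-hot columns alphabetically (London, Mumbai, NewYork), which may differ from the order in which the categories first appear in the data.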

In [14]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

In [15]:
le_x = LabelEncoder()
le_y = LabelEncoder()

# LabelEncoder is used here for both the inputs (X) and the output (y)

In [16]:
X[:,0] = le_x.fit_transform(X[:,0])
X[:,2] = le_x.fit_transform(X[:,2])
X[:,3] = le_x.fit_transform(X[:,3])
y = le_y.fit_transform(y)

In [18]:
X # City, Sex and Smoke are now integers, but City has three categories, so it still needs OneHotEncoder

array([[1, 24.0, 1, 1, 241.0],
       [0, 80.0, 0, 0, 928.0],
       [2, 38.0, 1, 1, 631.1111111111111],
       [2, 22.0, 0, 1, 786.0],
       [2, 36.0, 1, 1, 967.0],
       [0, 35.44444444444444, 0, 1, 665.0],
       [1, 17.0, 0, 0, 293.0],
       [2, 28.0, 0, 0, 494.0],
       [1, 45.0, 0, 0, 707.0],
       [0, 29.0, 1, 1, 599.0]], dtype=object)

In [19]:
y

array([1, 0, 1, 1, 1, 1, 0, 1, 0, 0], dtype=int64)

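In [20]:
from sklearn.compose import ColumnTransformer

# OneHotEncoder(categorical_features=[0]) was removed in scikit-learn 0.22;
# ColumnTransformer applies the encoder to column 0 and passes the other columns through
onehot = ColumnTransformer([('city', OneHotEncoder(), [0])],
                           remainder='passthrough', sparse_threshold=0)

In [21]:
X = onehot.fit_transform(X).astype(float) # dense float array with the one-hot city columns first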

In [23]:
X # the first three columns are the one-hot encoding of the city

array([[  0.        ,   1.        ,   0.        ,  24.        ,
          1.        ,   1.        , 241.        ],
       [  1.        ,   0.        ,   0.        ,  80.        ,
          0.        ,   0.        , 928.        ],
       [  0.        ,   0.        ,   1.        ,  38.        ,
          1.        ,   1.        , 631.11111111],
       [  0.        ,   0.        ,   1.        ,  22.        ,
          0.        ,   1.        , 786.        ],
       [  0.        ,   0.        ,   1.        ,  36.        ,
          1.        ,   1.        , 967.        ],
       [  1.        ,   0.        ,   0.        ,  35.44444444,
          0.        ,   1.        , 665.        ],
       [  0.        ,   1.        ,   0.        ,  17.        ,
          0.        ,   0.        , 293.        ],
       [  0.        ,   0.        ,   1.        ,  28.        ,
          0.        ,   0.        , 494.        ],
       [  0.        ,   1.        ,   0.        ,  45.        ,
          0.        ,   0.        , 707.        ],
       [  1.        ,   0.        ,   0.        ,  29.        ,
          1.        ,   1.        , 599.        ]])

## Splitting train and test data

In [24]:
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20

In [25]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.2)

In [27]:
X # 10 input rows in total

array([[  0.        ,   1.        ,   0.        ,  24.        ,
          1.        ,   1.        , 241.        ],
       [  1.        ,   0.        ,   0.        ,  80.        ,
          0.        ,   0.        , 928.        ],
       [  0.        ,   0.        ,   1.        ,  38.        ,
          1.        ,   1.        , 631.11111111],
       [  0.        ,   0.        ,   1.        ,  22.        ,
          0.        ,   1.        , 786.        ],
       [  0.        ,   0.        ,   1.        ,  36.        ,
          1.        ,   1.        , 967.        ],
       [  1.        ,   0.        ,   0.        ,  35.44444444,
          0.        ,   1.        , 665.        ],
       [  0.        ,   1.        ,   0.        ,  17.        ,
          0.        ,   0.        , 293.        ],
       [  0.        ,   0.        ,   1.        ,  28.        ,
          0.        ,   0.        , 494.        ],
       [  0.        ,   1.        ,   0.        ,  45.        ,
          0.        ,   0.        , 707.        ],
       [  1.        ,   0.        ,   0.        ,  29.        ,
          1.        ,   1.        , 599.        ]])

In [29]:
y_train # test size is 20% of 10 rows (2), so the training set should have 8

array([1, 0, 0, 1, 1, 1, 0, 1], dtype=int64)

In [30]:
y_test

array([0, 1], dtype=int64)
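Because `train_test_split` shuffles the rows, each run of the cell above can produce a different split. A minimal sketch of a reproducible split follows; `X_demo` is a hypothetical stand-in array, and `random_state=0` is an arbitrary choice, not something the original notebook sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 10 rows, 2 features
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([1, 0, 1, 1, 1, 1, 0, 1, 0, 0])

# test_size=0.2 keeps 2 of the 10 rows for testing;
# a fixed random_state makes the shuffle repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

print(X_tr.shape, X_te.shape)
```

With a fixed seed, rerunning the notebook gives the same train and test rows, which makes results easier to compare.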

## Normalize Data

In [31]:
from sklearn.preprocessing import StandardScaler

In [32]:
# the columns of X are on very different scales (0/1 flags, ages, index values in the hundreds); y is already 0/1, so only X is scaled

In [33]:
ss_x = StandardScaler()

In [34]:
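X_train = ss_x.fit_transform(X_train)  # learn mean and std on the training set, then scale it
X_test = ss_x.transform(X_test)        # reuse the training statistics on the test set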

In [35]:
X_train

array([[ 1.29099445, -0.37796447, -1.        , -0.22280927, -0.77459667,
         0.77459667, -0.37533278],
       [ 1.29099445, -0.37796447, -1.        ,  2.43433619, -0.77459667,
        -1.29099445,  1.35225629],
       [-0.77459667,  2.64575131, -1.        ,  0.34705235, -0.77459667,
        -1.29099445, -0.09944403],
       [-0.77459667, -0.37796447,  1.        , -1.02459131, -0.77459667,
         0.77459667,  0.41948957],
       [-0.77459667, -0.37796447,  1.        , -0.66677123, -0.77459667,
        -1.29099445, -1.49859411],
       [-0.77459667, -0.37796447,  1.        , -0.18967778,  1.29099445,
         0.77459667,  1.6084387 ],
       [ 1.29099445, -0.37796447, -1.        , -0.60713454,  1.29099445,
         0.77459667, -0.80887224],
       [-0.77459667, -0.37796447,  1.        , -0.07040442,  1.29099445,
         0.77459667, -0.59794142]])
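What `StandardScaler` computes can be checked by hand: each column is shifted by its training mean and divided by its training standard deviation. The sketch below uses made-up `train` and `test` arrays and compares the scaler's output with the manual formula.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two features on very different scales
train = np.array([[10.0, 200.0], [20.0, 400.0], [30.0, 600.0]])
test = np.array([[40.0, 800.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
test_scaled = scaler.transform(test)        # applies the same statistics to test

# The same computation by hand; StandardScaler uses the population std (ddof=0)
mean = train.mean(axis=0)
std = train.std(axis=0)
manual_train = (train - mean) / std
manual_test = (test - mean) / std

print(np.allclose(train_scaled, manual_train))
print(np.allclose(test_scaled, manual_test))
```

Fitting only on the training set keeps information about the test rows out of the scaling statistics.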