# Data Preprocessing



Data preprocessing is an integral step in machine learning: the quality of the data, and of the information that can be derived from it, directly affects how well a model can learn. It is therefore important to preprocess the data before feeding it into a model.
This article covers the following concepts:
1. Handling Null Values
2. Standardization
3. Handling Categorical Variables
4. One-Hot Encoding
5. Multicollinearity

## Step 1: Importing the libraries

In [5]:

import numpy as np
import pandas as pd
from io import StringIO
# StringIO is used only for illustration: it lets a CSV string behave as if it were a file on disk.


## Step 2: Importing dataset

In [1]:
df=pd.read_csv('datasets/Data.csv')

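If `datasets/Data.csv` is not at hand, the `StringIO` import from Step 1 can stand in for a file on disk. The sketch below is only an illustration: the CSV text, its column names and the name `df_demo` are made up here and are not the contents of the real `Data.csv`.

```python
import pandas as pd
from io import StringIO

# Hypothetical CSV text; the real Data.csv may have different columns and values
csv_data = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,,48000,Yes
Germany,30,,No
"""

# StringIO wraps the string in a file-like object that read_csv accepts
df_demo = pd.read_csv(StringIO(csv_data))
print(df_demo)
```

Empty fields in the text are read as `NaN`, which is what the null checks in the next step look for.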

## Step 3: Check Null values

In [None]:

df.isnull()

In [None]:
df.isnull().sum()
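To see what these two calls return, here is a minimal sketch on a small made-up frame. The `toy` data and its values are assumptions for illustration, not the contents of `Data.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the article's Data.csv
toy = pd.DataFrame({
    "Age": [44.0, np.nan, 30.0, 38.0],
    "Salary": [72000.0, 48000.0, np.nan, np.nan],
})

null_mask = toy.isnull()          # Boolean frame: True where a value is missing
null_counts = toy.isnull().sum()  # Number of missing values per column
print(null_mask)
print(null_counts)
```

The per-column counts are usually the more useful view, since they show at a glance which columns need imputation.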

## Step 4: Handling the missing data using Imputer

Imputation is the process of replacing the missing values in a dataset with substitute values, here the mean of each column.

In [None]:
from sklearn.impute import SimpleImputer
# SimpleImputer replaces the old Imputer class; it always works column-wise, so there is no axis argument
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer = imputer.fit(df[["Age", "Salary"]])
df[["Age", "Salary"]] = imputer.transform(df[["Age", "Salary"]])
df
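A self-contained sketch of mean imputation with `SimpleImputer`, the class imported above. The `toy` frame and its numbers are made up for illustration; only the column names `Age` and `Salary` come from the article.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    "Age": [40.0, np.nan, 30.0, 50.0],
    "Salary": [60000.0, 50000.0, np.nan, 70000.0],
})

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = toy.copy()
filled[["Age", "Salary"]] = imputer.fit_transform(toy[["Age", "Salary"]])

# statistics_ holds the per-column means that were used as fill values
print(imputer.statistics_)
print(filled)
```

Other strategies such as `"median"` or `"most_frequent"` can be swapped in when the mean is distorted by outliers.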


## Step 5: Encoding categorical data

Handling categorical variables is another integral aspect of machine learning. Categorical variables take discrete rather than continuous values. For example, the color of an item is a discrete variable, whereas its price is a continuous variable.
Categorical variables are further divided into 2 types:
1. Ordinal categorical variables: these can be ordered. For example, the size of a T-shirt, where we can say that M < L < XL.
2. Nominal categorical variables: these can't be ordered. For example, the color of a T-shirt. We can't say that Blue < Green, because colors have no inherent order to compare.
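The distinction can be made concrete with pandas' own `Categorical` type. This is not a technique the article itself uses; it is a small sketch, with made-up values, showing that an ordered category supports comparisons while an unordered one does not carry any order.

```python
import pandas as pd

# Ordinal: the category order M < L < XL is stated explicitly
sizes = pd.Categorical(["M", "XL", "L", "M"],
                       categories=["M", "L", "XL"], ordered=True)

# Nominal: no order is declared, so pandas treats the colors as unordered
colors = pd.Categorical(["blue", "green", "blue", "white"])

print(sizes.min(), sizes.max())  # smallest and largest size under the declared order
print(sizes.codes)               # integer position of each value in the category list
print(colors.ordered)            # whether an order was declared for the colors
```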

#### Handling Ordinal Categorical Variables
```
df_cat = pd.DataFrame(data = 
                     [['green','M',10.1,'class1'],
                      ['blue','L',20.1,'class2'],
                      ['white','M',30.1,'class1']])
df_cat.columns = ['color','size','price','classlabel']


```
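Here the column ‘size’ is an ordinal categorical variable, whereas ‘color’ is a nominal categorical variable. ‘classlabel’ is the target column; the order of its integer codes carries no meaning, so it can be encoded as integers in the same way.
There are 2 simple techniques to transform ordinal categorical variables (CVs).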

1. Using map() function —

```
size_mapping = {'M':1,'L':2}
df_cat['size'] = df_cat['size'].map(size_mapping)
```

2. Using Label Encoder —

```
from sklearn.preprocessing import LabelEncoder
class_le = LabelEncoder()
df_cat['classlabel'] = class_le.fit_transform(df_cat['classlabel'].values)
```
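Both techniques can be put together and reversed, which is handy when predictions need to be turned back into readable labels. The sketch below reuses the article's `df_cat`, `size_mapping` and `class_le`; the names `inv_size_mapping`, `sizes_back` and `labels_back` are introduced here for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df_cat = pd.DataFrame(
    [['green', 'M', 10.1, 'class1'],
     ['blue', 'L', 20.1, 'class2'],
     ['white', 'M', 30.1, 'class1']],
    columns=['color', 'size', 'price', 'classlabel'])

# 1. map(): the dictionary states the order explicitly (M < L)
size_mapping = {'M': 1, 'L': 2}
df_cat['size'] = df_cat['size'].map(size_mapping)

# 2. LabelEncoder: integer codes are assigned in sorted order of the labels
class_le = LabelEncoder()
df_cat['classlabel'] = class_le.fit_transform(df_cat['classlabel'].values)

# Going back: invert the dictionary, or call inverse_transform on the encoder
inv_size_mapping = {v: k for k, v in size_mapping.items()}
sizes_back = df_cat['size'].map(inv_size_mapping)
labels_back = class_le.inverse_transform(df_cat['classlabel'].values)
print(df_cat)
```

Note that `map()` lets us choose the order ourselves, whereas `LabelEncoder` simply sorts the labels, so `map()` is the safer choice when the order matters.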

#### Incorrect Way of Handling Nominal Categorical Variables
A common mistake is failing to differentiate between ordinal and nominal CVs. If you use the same map() function or LabelEncoder on nominal variables, the model will assume that the resulting integers express an order between the categories.
For example, suppose we use map() to map the colors like this:

```
col_mapping = {'Blue':1,'Green':2}
```
Then, according to the model, Green > Blue, which is a meaningless assumption, and the model will produce results that take this relationship into account. So although this method does give results, they won’t be optimal.
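The same problem appears with LabelEncoder. The following sketch, using the three colors from `df_cat`, shows that the encoder assigns integers in alphabetical order, so the numeric order is an accident of spelling rather than a property of the colors. The names `color_le` and `codes` are made up for this example.

```python
from sklearn.preprocessing import LabelEncoder

colors = ['green', 'blue', 'white']

color_le = LabelEncoder()
codes = color_le.fit_transform(colors)

# classes_ lists the labels in sorted order; each label's index is its code
print(list(color_le.classes_))
print(list(codes))
```

A model that treats these codes as numbers would read white > green > blue, which says nothing true about the data.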

#### Correct Way of Handling Nominal Categorical Variables

The correct way of handling nominal CVs is to use One-Hot Encoding. The easiest way to use One-Hot Encoding is to use the get_dummies() function.

```
df_cat = pd.get_dummies(df_cat[['color','size','price']])
```
Here we have passed ‘size’ and ‘price’ along with ‘color’, but get_dummies() converts only string (object or category) columns and leaves numeric columns unchanged. So it will transform just the ‘color’ variable.
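A self-contained sketch of what the call produces, assuming ‘size’ has already been mapped to integers as in the ordinal section. The names `dummies` and `dummies_reduced` are introduced here for illustration.

```python
import pandas as pd

df_cat = pd.DataFrame(
    [['green', 1, 10.1],
     ['blue', 2, 20.1],
     ['white', 1, 30.1]],
    columns=['color', 'size', 'price'])

# Only the string column 'color' is expanded into indicator columns
dummies = pd.get_dummies(df_cat[['color', 'size', 'price']])
print(dummies.columns.tolist())

# drop_first=True drops one indicator per variable; the dropped category
# is implied when all remaining indicators are 0
dummies_reduced = pd.get_dummies(df_cat[['color', 'size', 'price']], drop_first=True)
print(dummies_reduced.columns.tolist())
```

Each row has exactly one of the `color_*` columns set, so the full set of indicators is perfectly collinear. Dropping one of them is a common way to avoid that when the model is sensitive to multicollinearity.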

In [None]:
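%matplotlib inline
import matplotlib.pyplot as plt
# Correlation matrix of the numeric columns; strongly correlated pairs hint at multicollinearity
corr = df.corr(numeric_only=True)
plt.matshow(corr)
plt.show()
corr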
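A correlation matrix can also be scanned programmatically for highly correlated feature pairs. The sketch below is one simple way to do it, on synthetic data: the `toy` frame, the column names and the 0.9 threshold are all assumptions chosen for this example, not values from the article.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
age = rng.normal(40, 10, size=200)

# Hypothetical data: age_months is almost a copy of age, salary is unrelated
toy = pd.DataFrame({
    "age": age,
    "age_months": age * 12 + rng.normal(0, 1, size=200),
    "salary": rng.normal(60000, 8000, size=200),
})

corr = toy.corr().abs()

# Keep only the upper triangle so that each pair is listed once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9  # arbitrary cut-off chosen for this sketch
high_pairs = [(r, c) for r in upper.index for c in upper.columns
              if pd.notna(upper.loc[r, c]) and upper.loc[r, c] > threshold]
print(high_pairs)
```

When a pair is flagged, a common remedy is to drop one of the two columns, since it adds little information that the other does not already carry.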

In [None]:
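from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# Assumption: X and Y are not defined earlier in this notebook. Here the last column of df
# is taken as the target Y and the remaining columns as features X (categorical column first).
X = df.iloc[:, :-1].values
Y = df.iloc[:, -1].values
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
X[:, 0]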


In [None]:
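from sklearn.compose import ColumnTransformer
# OneHotEncoder no longer accepts categorical_features; ColumnTransformer selects column 0 instead
onehotencoder = ColumnTransformer([("onehot", OneHotEncoder(), [0])], remainder="passthrough", sparse_threshold=0)
X = onehotencoder.fit_transform(X)
X
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)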


In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split( X , Y , test_size = 0.2, random_state = 0)
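A quick way to check a split is to look at the resulting shapes. The sketch below assumes a recent scikit-learn, where `train_test_split` lives in `sklearn.model_selection`, and uses made-up arrays in place of the article's `X` and `Y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each, and a binary target
X = np.arange(20).reshape(10, 2)
Y = np.array([0, 1] * 5)

# test_size=0.2 holds out 20% of the rows; random_state makes the split repeatable
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0)

print(X_train.shape, X_test.shape)
print(Y_train.shape, Y_test.shape)
```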


In [None]:
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
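# Reuse the mean and standard deviation learned from the training set instead of refitting on the test set
X_test = sc_X.transform(X_test)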
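Standardization rescales each feature to zero mean and unit variance, i.e. z = (x - mean) / std. A minimal sketch on made-up numbers: the scaler is fitted on the training split only, and the same mean and standard deviation are then applied to the test split. The arrays and the names `X_train_std` and `X_test_std` are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical splits: two features on very different scales
X_train = np.array([[1.0, 100.0],
                    [2.0, 200.0],
                    [3.0, 300.0],
                    [4.0, 400.0]])
X_test = np.array([[2.5, 250.0]])

sc_X = StandardScaler()
X_train_std = sc_X.fit_transform(X_train)  # learns mean_ and scale_ from the training data
X_test_std = sc_X.transform(X_test)        # applies the training statistics, no refitting

print(sc_X.mean_)
print(X_train_std.mean(axis=0), X_train_std.std(axis=0))
print(X_test_std)
```

Fitting the scaler again on the test set would scale the two splits differently and would let information from the test data influence preprocessing.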