# Data Pre-processing

Data preprocessing converts raw data into a clean data set. It is an integral step in machine learning because the quality of the data, and of the information that can be derived from it, directly affects the ability of a model to learn.

We consider the following techniques:

1. Handling Null Values
2. Standardisation, Rescaling
3. Handling Categorical Variables
4. One-Hot Encoding
5. Multicollinearity

In [4]:
import pandas as pd
from io import StringIO 

##  Handling Null Values

Real-world datasets almost always contain a few null values. Whether the task is regression, classification or any other kind of problem, most models cannot handle NULL or NaN values on their own, so we need to intervene.

In [12]:
# Small CSV with two missing entries: column C in row 1 and column D in row 2
csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
9.0,10.0,11.0,
'''

In [13]:
df = pd.read_csv(StringIO(csv_data))
df

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 1.0 | 2.0 | 3.0 | 4.0 |
| 1 | 5.0 | 6.0 | NaN | 8.0 |
| 2 | 9.0 | 10.0 | 11.0 | NaN |


Check whether the data contains null values.

In [14]:
# Boolean mask: True where a value is NaN, False otherwise
df.isnull()
# Number of NaN values in each column
df.isnull().sum()

A    0
B    0
C    1
D    1
dtype: int64
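
Beyond per-column counts, it can help to see what fraction of each column is missing and which rows are affected. The following is a minimal sketch that rebuilds the same small frame so it runs on its own; the variable names `missing_fraction` and `rows_with_nan` are chosen here for illustration.

```python
import pandas as pd
from io import StringIO

csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
9.0,10.0,11.0,
'''
df = pd.read_csv(StringIO(csv_data))

# Fraction of missing values per column (mean of the boolean mask)
missing_fraction = df.isnull().mean()

# Rows that contain at least one missing value
rows_with_nan = df[df.isnull().any(axis=1)]
```

With this data, columns C and D each have one third of their values missing, and the affected rows are those with index 1 and 2.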

There are several ways to handle missing values. The simplest is to drop the rows or columns that contain them.

In [17]:
df.dropna(subset=['D'])

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 1.0 | 2.0 | 3.0 | 4.0 |
| 1 | 5.0 | 6.0 | NaN | 8.0 |
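
The call above only drops rows where column D is missing. As a rough sketch of the other common variants of ```dropna()```, again on the same small frame (the result names are made up for this example):

```python
import pandas as pd
from io import StringIO

csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
9.0,10.0,11.0,
'''
df = pd.read_csv(StringIO(csv_data))

rows_dropped = df.dropna()            # drop every row with at least one NaN
cols_dropped = df.dropna(axis=1)      # drop every column with at least one NaN
all_nan_rows = df.dropna(how='all')   # drop only rows in which all values are NaN
min_four = df.dropna(thresh=4)        # keep rows with at least 4 non-NaN values
```

None of these calls modifies `df` itself; each returns a new DataFrame.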


However, dropping data is rarely the best option unless you have plenty of it, because it can discard valuable information. We will use data <b>imputation</b> instead. Imputation is the process of replacing the missing values in a dataset with substitute values, here the mean of each column.

In [21]:
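import numpy as np
# SimpleImputer replaces the older sklearn.preprocessing.Imputer class (removed in scikit-learn 0.22)
from sklearn.impute import SimpleImputer
# Replace each NaN with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(df[['C','D']])
df[['C','D']] = imputer.transform(df[['C','D']])
df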



|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 1.0 | 2.0 | 3.0 | 4.0 |
| 1 | 5.0 | 6.0 | 7.0 | 8.0 |
| 2 | 9.0 | 10.0 | 11.0 | 6.0 |
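
For a quick check, the same mean imputation can be sketched with pandas alone. This is an alternative illustration, not part of the original walkthrough; `mean_filled` is a name chosen here.

```python
import pandas as pd
from io import StringIO

csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
9.0,10.0,11.0,
'''
df = pd.read_csv(StringIO(csv_data))

# df.mean() ignores NaN by default, so each gap is filled
# with the mean of the observed values in its column
mean_filled = df.fillna(df.mean())
```

The missing entry in C becomes (3.0 + 11.0) / 2 = 7.0 and the one in D becomes (4.0 + 8.0) / 2 = 6.0, matching the table above.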


##  Standardisation, rescaling

In <b>rescaling</b> (min-max scaling) one maps the values of each feature to a common interval, typically [0, 1].
One can use the ```MinMaxScaler``` class for rescaling.

In <b>standardisation</b> one transforms the values of each feature so that their mean is 0 and their standard deviation is 1. One can use ```StandardScaler``` for standardisation.
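
To make the two definitions concrete, the sketch below applies the formulas by hand with NumPy to a made-up array. With their default settings, ```MinMaxScaler``` and ```StandardScaler``` perform the same column-wise arithmetic.

```python
import numpy as np

# Made-up data: 3 samples, 2 features on different scales
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

# Rescaling (min-max): map each column to the interval [0, 1]
X_rescaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardisation: subtract the column mean and divide by the column
# standard deviation (population standard deviation, ddof=0)
X_standardised = (X - X.mean(axis=0)) / X.std(axis=0)
```

After rescaling, both columns contain 0, 0.5 and 1. After standardisation, both columns have mean 0 and standard deviation 1.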

<b>Task:</b> Download the Pima Indian diabetes dataset or any other dataset from, e.g., the <a href="https://archive.ics.uci.edu/ml/index.php">UCI Machine Learning Repository</a>. Try to rescale and standardise the data.

<b>Important</b>: remember to apply the same transformation to both training and testing data. Fit the scaler on the training data only, then use that fitted scaler to transform the test data.
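
One way this could look in practice is sketched below. The data is randomly generated purely for illustration, and the split sizes and seeds are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up data: 10 samples, 2 features on different scales
rng = np.random.RandomState(0)
X = rng.normal(loc=[5.0, 100.0], scale=[1.0, 20.0], size=(10, 2))

X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # learn mean and std from the training data only
X_test_std = scaler.transform(X_test)        # reuse the training mean and std
```

The test set is transformed with the statistics of the training set, so its own mean and standard deviation will generally not be exactly 0 and 1.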

##  Categorical data

Categorical variables (CVs) take values from a discrete set of categories rather than from a continuous range. There are two types of CVs:

1. Ordinal categorical variables: these variables can be ordered. Example: size of a T-shirt, M<L<XL.
2. Nominal categorical variables: these variables can’t be ordered. Example: color of a T-shirt. We can’t say that Blue<Green, because colors have no natural order and comparing them makes no sense.

In [34]:
df_cat = pd.DataFrame(data = 
                     [['green','M',10.1,'class1'],
                      ['blue','L',20.1,'class2'],
                      ['white','M',30.1,'class1']])
df_cat.columns = ['color','size','price','classlabel']
df_cat

|   | color | size | price | classlabel |
|---|---|---|---|---|
| 0 | green | M | 10.1 | class1 |
| 1 | blue | L | 20.1 | class2 |
| 2 | white | M | 30.1 | class1 |


<b>Task:</b> Which columns are CVs here? Which of them are ordinal and which are nominal?

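For ordinal CVs, use ```map()``` with an explicit dictionary to assign numerical values that respect the order. ```LabelEncoder``` also produces integers, but it numbers the classes in sorted order rather than in their natural order, so it is best suited to class labels. For nominal variables one has to use another strategy, such as one-hot encoding.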

### Mapping Ordinal Features

In [35]:
size_mapping = {'M':1,'L':2}
df_cat['size'] = df_cat['size'].map(size_mapping)
df_cat

|   | color | size | price | classlabel |
|---|---|---|---|---|
| 0 | green | 1 | 10.1 | class1 |
| 1 | blue | 2 | 20.1 | class2 |
| 2 | white | 1 | 30.1 | class1 |
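
If the original labels are needed again later, the mapping can be reversed. A minimal sketch, assuming the same `size_mapping` as above; `inv_size_mapping` and `original_sizes` are names introduced here for illustration.

```python
import pandas as pd

df_cat = pd.DataFrame(
    [['green', 'M', 10.1, 'class1'],
     ['blue', 'L', 20.1, 'class2'],
     ['white', 'M', 30.1, 'class1']],
    columns=['color', 'size', 'price', 'classlabel'])

size_mapping = {'M': 1, 'L': 2}
df_cat['size'] = df_cat['size'].map(size_mapping)

# Swap keys and values to get the reverse mapping
inv_size_mapping = {v: k for k, v in size_mapping.items()}
original_sizes = df_cat['size'].map(inv_size_mapping)
```

Note that ```map()``` returns NaN for any value that is missing from the dictionary, so the mapping should cover every category that occurs in the column.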


### Using LabelEncoder

In [36]:
from sklearn.preprocessing import LabelEncoder
class_le = LabelEncoder()
df_cat['classlabel'] = class_le.fit_transform(df_cat['classlabel'].values)
df_cat

|   | color | size | price | classlabel |
|---|---|---|---|---|
| 0 | green | 1 | 10.1 | 0 |
| 1 | blue | 2 | 20.1 | 1 |
| 2 | white | 1 | 30.1 | 0 |
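
The fitted encoder remembers which integer stands for which label, so the encoding can be inspected and undone. A small sketch on a standalone array with the same three labels:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(['class1', 'class2', 'class1'])

class_le = LabelEncoder()
encoded = class_le.fit_transform(labels)        # integers assigned in sorted label order
decoded = class_le.inverse_transform(encoded)   # back to the original strings
known_classes = list(class_le.classes_)         # position in this list is the integer code
```

Here `known_classes` is `['class1', 'class2']`, so 'class1' is encoded as 0 and 'class2' as 1.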


### Handling nominal categorical variables 

If we used the same mapping strategy as for an ordinal feature such as 'size', we would mislead the model into believing that the colors are ordered.

For example, if we set blue = 0 and green = 1, the model would treat this as a relationship such as green > blue, which makes no sense.

The correct way of handling nominal CVs is to use one-hot encoding, which creates one binary column per category. One can use the ```get_dummies()``` function for one-hot encoding.

In [37]:
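# dtype=int gives the 0/1 columns shown in the output; recent pandas versions return booleans by default
pd.get_dummies(df_cat[['color','size','price']], dtype=int)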

|   | size | price | color_blue | color_green | color_white |
|---|---|---|---|---|---|
| 0 | 1 | 10.1 | 0 | 1 | 0 |
| 1 | 2 | 20.1 | 1 | 0 | 0 |
| 2 | 1 | 30.1 | 0 | 0 | 1 |


Note that we passed 'size' and 'price' along with 'color', but ```get_dummies()``` converts only string-valued columns and leaves numeric ones untouched. Since 'size' was already mapped to integers, only the 'color' variable is transformed.
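
Instead of relying on this dtype detection, the columns to encode can be named explicitly through the ```columns``` parameter. A minimal sketch on a frame shaped like the one above, with 'size' already numeric:

```python
import pandas as pd

df_cat = pd.DataFrame({
    'color': ['green', 'blue', 'white'],
    'size': [1, 2, 1],
    'price': [10.1, 20.1, 30.1],
})

# Encode only 'color'; dtype=int requests 0/1 integers instead of booleans
encoded = pd.get_dummies(df_cat, columns=['color'], dtype=int)
```

The result keeps 'size' and 'price' unchanged and appends `color_blue`, `color_green` and `color_white`.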

One-hot encoding can lead to multicollinearity. <b>Multicollinearity</b> occurs when features in the dataset are strongly dependent on each other. Here the three color columns always sum to 1, so any one of them can be computed from the other two.

To identify multicollinearity, one can draw a pairplot and inspect the relationships between pairs of features. If two features show a linear relationship, they are strongly correlated and there is multicollinearity in the dataset. One can also use the correlation matrix to check how closely related the features are.
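
As a small illustration of both checks on one-hot columns, the sketch below encodes the three colors on their own and looks at the row sums and the correlation matrix. The variable names are chosen for this example only.

```python
import pandas as pd

colors = pd.Series(['green', 'blue', 'white'], name='color')

# For a Series input the dummy columns are named after the categories
dummies = pd.get_dummies(colors, dtype=int)

# Every row has exactly one 1, so the columns always add up to 1
row_sums = dummies.sum(axis=1)

# Pairwise (Pearson) correlations between the dummy columns
corr = dummies.corr()
```

On this tiny example each pair of dummy columns has a correlation of -0.5, even though the three columns together are exactly linearly dependent. A dependency that involves more than two features may therefore not stand out in pairwise correlations alone.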

To avoid this multicollinearity we can pass ```drop_first=True```, which drops the first dummy column of color (color_blue). The important thing to note is that we don’t lose any information: if color_green and color_white are both 0, the color must be blue.

In [38]:
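# dtype=int gives the 0/1 columns shown in the output; recent pandas versions return booleans by default
pd.get_dummies(df_cat[['color','size','price']], drop_first=True, dtype=int)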

|   | size | price | color_green | color_white |
|---|---|---|---|---|
| 0 | 1 | 10.1 | 1 | 0 |
| 1 | 2 | 20.1 | 0 | 0 |
| 2 | 1 | 30.1 | 0 | 1 |
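
That no information is lost can be checked directly: the dropped column is fully determined by the remaining ones. A minimal sketch on the color column alone:

```python
import pandas as pd

df_color = pd.DataFrame({'color': ['green', 'blue', 'white']})

full = pd.get_dummies(df_color, dtype=int)
reduced = pd.get_dummies(df_color, drop_first=True, dtype=int)

# Reconstruct the dropped column: blue is 1 exactly when green and white are both 0
color_blue = 1 - reduced['color_green'] - reduced['color_white']
```

The reconstructed `color_blue` is identical to the `color_blue` column of the full encoding.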


<b> Source:</b> D. Kumar, Towards Data Science.