# Data Preprocessing

Explains the procedure of data preprocessing in Python, including removing or replacing undefined or missing values, encoding categorical variables, and splitting and scaling the data.

## Template

### Importing Libraries

In [1]:
# For Basic Operations

import numpy as np                     # Computer
import matplotlib.pyplot as plt        # Plotter
import pandas as pd                    # Data Handler
from sklearn.preprocessing import *    # Data Preprocessor
from sklearn.model_selection import *  # Data CrossValidator (named sklearn.cross_validation before scikit-learn 0.18; the old name was removed in 0.20)

# For Displaying Dataset

from IPython.display import Markdown, display

# utility functions

# line print
def ln(): print("\n")
    
# markdown print
def md(string): display(Markdown(str(string)))
def bi(string): return "***"+str(string)+"***"
def bo(string): return "**"+str(string)+"**"
def it(string): return "*"+str(string)+"*"

# table print
def tab(data): display(pd.DataFrame(data))
def table(data, names=[]): display(pd.DataFrame(data, columns = names))



### Importing Dataset

In [2]:
dataset = pd.read_csv('./Data_Preprocessing/Data.csv')
tab(dataset)

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


### Splitting the variables

This is required because a model is fitted on the independent variables in order to *predict* the corresponding dependent variable, so the two have to be held in separate arrays.

#### Independent Variables

In [3]:
X = dataset.iloc[:, :-1].values # Independent Variables
table(X,['Country','Age','Salary'])

Unnamed: 0,Country,Age,Salary
0,France,44.0,72000.0
1,Spain,27.0,48000.0
2,Germany,30.0,54000.0
3,Spain,38.0,61000.0
4,Germany,40.0,
5,France,35.0,58000.0
6,Spain,,52000.0
7,France,48.0,79000.0
8,Germany,50.0,83000.0
9,France,37.0,67000.0


As you can see, the above command retrieves a NumPy array of all rows (```:```) and all columns except the last (```:-1```).
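
The slicing can be checked on a tiny frame. The following is a minimal sketch that assumes a made-up two-row ```demo``` frame (a stand-in for illustration, not the notebook's ```Data.csv```):

```python
import pandas as pd

# Hypothetical miniature frame with the same column layout as Data.csv
demo = pd.DataFrame({
    "Country": ["France", "Spain"],
    "Age": [44.0, 27.0],
    "Salary": [72000.0, 48000.0],
    "Purchased": ["No", "Yes"],
})

features = demo.iloc[:, :-1].values  # all rows, every column except the last
target = demo.iloc[:, -1].values     # all rows, last column only

print(features.shape)  # (2, 3)
print(target.shape)    # (2,)
```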

#### Dependent Variables

In this case, there is only one dependent column, *"Purchased"*, which depends on the other 3 (independent) variables.

In [4]:
Y = dataset.iloc[:, 3].values # Dependent Variables
table(Y,['Purchased'])

Unnamed: 0,Purchased
0,No
1,Yes
2,No
3,No
4,Yes
5,Yes
6,No
7,Yes
8,No
9,Yes


### Check Missing Data

In [5]:
md("Does the dataset have missing data:" + bo(dataset.isnull().values.any()))
ln()
md(bi("Col-wise Missing Values"))
table(dataset.isnull().sum(),['NaN Count'])
ln()
md("Total number of missing values in the dataset is " + bo(dataset.isnull().sum().sum()))

Does the dataset have missing data:**True**





***Col-wise Missing Values***

Unnamed: 0,NaN Count
Country,0
Age,1
Salary,1
Purchased,0






Total number of missing values in the dataset is **2**

### Handling Missing Data

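Missing data can be handled in Python using the **Imputer** class from the *sklearn.preprocessing* module. Note that scikit-learn 0.22 removed this class in favour of *sklearn.impute.SimpleImputer*; the parameters listed below describe the older **Imputer**.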


***missing_values : integer or “NaN”, optional (default=”NaN”)***

The placeholder for the missing values. All occurrences of missing_values will be imputed. For missing values encoded as np.nan, use the string value “NaN”.

***strategy : string, optional (default=”mean”)***

The imputation strategy.

- If “mean”, then replace missing values using the mean along the axis.
- If “median”, then replace missing values using the median along the axis.
- If “most_frequent”, then replace missing using the most frequent value along the axis.

***axis : integer, optional (default=0)***

The axis along which to impute.

- If axis=0, then impute along columns.
- If axis=1, then impute along rows.

***verbose : integer, optional (default=0)***

Controls the verbosity of the imputer.

***copy : boolean, optional (default=True)***

If True, a copy of X will be created. If False, imputation will be done in-place whenever possible.
Note that, in the following cases, a new copy will always be made, even if copy=False:

- If X is not an array of floating values
- If X is sparse and missing_values=0
- If axis=0 and X is encoded as a CSR matrix
- If axis=1 and X is encoded as a CSC matrix

In [6]:
# Creating imputer objects to fit data with

def imputeBy (strata, dataset, missing='NaN'):
    return Imputer(missing_values = missing, strategy = strata, axis =0, copy = False).fit(dataset).transform(dataset)


In [7]:
# Simple imputer function call

X[:, 1:3] = imputeBy ('mean', X[:, 1:3])
table(X,['Country','Age','Salary'])

Unnamed: 0,Country,Age,Salary
0,France,44.0,72000.0
1,Spain,27.0,48000.0
2,Germany,30.0,54000.0
3,Spain,38.0,61000.0
4,Germany,40.0,63777.8
5,France,35.0,58000.0
6,Spain,38.7778,52000.0
7,France,48.0,79000.0
8,Germany,50.0,83000.0
9,France,37.0,67000.0
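
On scikit-learn 0.22 and later, where ```Imputer``` is no longer available, the same column-wise mean imputation can be sketched with ```SimpleImputer```. This is an illustrative equivalent rather than the notebook's own cell; the ```numeric``` array simply re-types the Age and Salary columns from the table above.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and Salary columns from the dataset, with the two gaps as np.nan
numeric = np.array([
    [44.0, 72000.0], [27.0, 48000.0], [30.0, 54000.0], [38.0, 61000.0],
    [40.0, np.nan], [35.0, 58000.0], [np.nan, 52000.0], [48.0, 79000.0],
    [50.0, 83000.0], [37.0, 67000.0],
])

# SimpleImputer always imputes column-wise, so it has no axis argument
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(numeric)

print(filled[6, 0])  # imputed Age (mean of the nine known ages)
print(filled[4, 1])  # imputed Salary (mean of the nine known salaries)
```

The two imputed values should agree with the 38.7778 and 63777.8 shown in the table above.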


### Obtaining the Training and Test sets from the Dataset

#### Using ```train_test_split``` from scikit learn

In [8]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)

In [9]:
md(bi("X from Training Set"))
tab (X_train)
ln()
md(bi("X from Test Set"))
tab (X_test)
ln()
md(bi("Y from Training Set"))
tab (Y_train)
ln()
md(bi("Y from Test Set"))
tab (Y_test)

***X from Training Set***

Unnamed: 0,0,1,2
0,Germany,40.0,63777.8
1,France,37.0,67000.0
2,Spain,27.0,48000.0
3,Spain,38.7778,52000.0
4,France,48.0,79000.0
5,Spain,38.0,61000.0
6,France,44.0,72000.0
7,France,35.0,58000.0






***X from Test Set***

Unnamed: 0,0,1,2
0,Germany,30,54000
1,Germany,50,83000






***Y from Training Set***

Unnamed: 0,0
0,Yes
1,Yes
2,Yes
3,No
4,Yes
5,No
6,No
7,Yes






***Y from Test Set***

Unnamed: 0,0
0,No
1,No
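
On scikit-learn 0.20 and later, ```train_test_split``` has to be imported from ```sklearn.model_selection```. A minimal sketch of the same 80/20 split follows; ```X_demo``` and ```Y_demo``` are hypothetical stand-ins with the same number of rows as the dataset above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for X and Y: 10 rows, 3 feature columns
X_demo = np.arange(30).reshape(10, 3)
Y_demo = np.array(["No", "Yes"] * 5)

# test_size=0.2 keeps 2 of the 10 rows for testing; random_state fixes the shuffle
X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_demo, Y_demo, test_size=0.2, random_state=0
)

print(X_tr.shape, X_te.shape)  # (8, 3) (2, 3)
print(Y_tr.shape, Y_te.shape)  # (8,) (2,)
```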


### How to Convert Categorical Data to Numerical Data?

This involves two steps:

- Integer Encoding (done by LabelEncoder)
- One-Hot Encoding (done by OneHotEncoder)

1. ***Integer Encoding***

As a first step, each unique category value is assigned an integer value.

For example, ```“red” is 1, “green” is 2, and “blue” is 3.```

This is called a label encoding or an integer encoding and is easily reversible.

For some variables, this may be enough.

The integer values have a natural ordered relationship between each other and machine learning algorithms may be able to understand and harness this relationship.

For example, an ordinal variable such as “place” (with values “first”, “second” and “third”) is a case where a label encoding would be sufficient.

2. ***One-Hot Encoding***

For categorical variables where no such ordinal relationship exists, the integer encoding is not enough.

In fact, using this encoding and allowing the model to assume a natural ordering between categories may result in poor performance or unexpected results (predictions halfway between categories).

In this case, a one-hot encoding can be applied to the integer representation. This is where the integer encoded variable is removed and a new binary variable is added for each unique integer value.

In the “color” variable example, there are 3 categories and therefore 3 binary variables are needed. A “1” value is placed in the binary variable for the color and “0” values for the other colors.

For example:

**in Label Encoding or Integer Encoding**
```
1
2
3
```
**in One Hot Encoding**
```
red, green, blue
1,   0,     0
0,   1,     0
0,   0,     1
```
The binary variables are often called ***“dummy variables”*** in other fields, such as statistics.
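
Both steps can be tried on the colour example. The sketch below is illustrative only: it uses ```LabelEncoder``` for the integer step and, as a swapped-in shortcut, ```pandas.get_dummies``` for the one-hot step. Note that ```LabelEncoder``` numbers the categories from 0 in sorted order, so the codes differ from the 1, 2, 3 used in the illustration above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

colors = ["red", "green", "blue", "green"]

# Step 1: integer encoding (reversible via inverse_transform)
encoder = LabelEncoder()
codes = encoder.fit_transform(colors)
restored = encoder.inverse_transform(codes)

# Step 2: one-hot encoding, one binary column per category
one_hot = pd.get_dummies(pd.Series(colors)).astype(int)

print(list(encoder.classes_))  # ['blue', 'green', 'red']
print(codes.tolist())          # [2, 1, 0, 1]
print(list(restored))          # the original strings again
print(one_hot)
```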

### Encoding Categorical Variables

In [10]:
# Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

def labelEncode (dataset, column):
    return LabelEncoder().fit_transform(dataset[:,column])

def oneHotEncode (dataset, column):
    # Encode the dataset that was passed in, not the global X
    return OneHotEncoder(categorical_features = [column]).fit_transform(dataset).toarray()

In [11]:
Y = X  # Note: Y is now an alias of X (not a copy) and no longer holds the "Purchased" column from In [4]

# Label Encoding
X[:, 0] = labelEncode (X,0)
table(X,['Country','Age','Salary'])

Unnamed: 0,Country,Age,Salary
0,0,44.0,72000.0
1,2,27.0,48000.0
2,1,30.0,54000.0
3,2,38.0,61000.0
4,1,40.0,63777.8
5,0,35.0,58000.0
6,2,38.7778,52000.0
7,0,48.0,79000.0
8,1,50.0,83000.0
9,0,37.0,67000.0


In [18]:
# One Hot Encoding
Y = oneHotEncode (Y,0)
table(Y.astype(int),['France','Germany','Spain','Age','Salary'])

Unnamed: 0,France,Germany,Spain,Age,Salary
0,1,0,0,44,72000
1,0,0,1,27,48000
2,0,1,0,30,54000
3,0,0,1,38,61000
4,0,1,0,40,63777
5,1,0,0,35,58000
6,0,0,1,38,52000
7,1,0,0,48,79000
8,0,1,0,50,83000
9,1,0,0,37,67000


***Note:***

- **fit_transform(y)**: Fit label encoder and return encoded labels.
- **transform(y)**: Transform labels to normalized encoding.
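
The ```categorical_features``` argument used in ```oneHotEncode``` is not available in scikit-learn 0.22 and later. On those versions, one way to get the same result is a ```ColumnTransformer``` wrapping ```OneHotEncoder```, which also accepts string categories directly, so the label-encoding step is not needed first. The sketch below is an assumed modern equivalent, not the notebook's original cell; ```frame``` re-types the first four rows of the dataset.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# First four rows of the feature table
frame = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Age": [44.0, 27.0, 30.0, 38.0],
    "Salary": [72000.0, 48000.0, 54000.0, 61000.0],
})

# One-hot encode Country and pass Age and Salary through unchanged
transformer = ColumnTransformer(
    [("country", OneHotEncoder(), ["Country"])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
encoded = transformer.fit_transform(frame)

# Columns: France, Germany, Spain, Age, Salary
print(encoded)
```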

### Feature Scaling ```(StandardScaler)```

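Feature scaling brings the features onto a comparable scale, so that those with large magnitudes (here *Salary*, in the tens of thousands) do not dominate those with small ones (*Age*). **StandardScaler** rescales each column to zero mean and unit variance. This matters most for models that rely on distances or gradient-based optimisation; tree-based models are largely unaffected.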

In [None]:
def featureScale(dataset):
    return StandardScaler().fit_transform(dataset)
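
When the data has already been split, a common practice is to fit the scaler on the training set only and reuse its statistics on the test set. A minimal sketch, assuming a handful of Age and Salary rows taken from the tables above (the ```train``` and ```test``` names are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age and Salary for a few training rows and one test row
train = np.array([[37.0, 67000.0], [27.0, 48000.0], [48.0, 79000.0], [44.0, 72000.0]])
test = np.array([[30.0, 54000.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learn mean and std from the training data
test_scaled = scaler.transform(test)        # apply the same mean and std to the test data

print(train_scaled.mean(axis=0))  # approximately 0 for each column
print(train_scaled.std(axis=0))   # approximately 1 for each column
print(test_scaled)
```

Fitting on the training set alone keeps information about the test set out of the scaling step.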