# Lesson 23 - Data Preprocessing

### The following topics are discussed in this notebook:
* One-Hot Encoding of Categorical Variables
* Feature Scaling

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

## Categorical Predictors

A variable is **categorical**, or **qualitative**, if it takes on values from a finite set of categories or classes. Such values are often recorded as strings, but might also be encoded using numerical labels. Such numerical labels do not typically have any numerical or quantitative significance. The possible values that a categorical variable can assume are referred to as its **levels**.

## One-Hot Encoding

The Scikit-Learn implementations of supervised learning models require all of the feature information to be provided in a numerical format. If we wish to use a categorical variable as a feature in a Scikit-Learn model, we can do so, but we must first find a way to numerically encode the variable. The most common technique for numerically encoding qualitative information is to use **one-hot encoding**. 

To perform one-hot encoding on a categorical variable, we must introduce new variables, referred to as **dummy variables**. The encoding will have one dummy variable for each level found within the categorical variable. The values found within a dummy variable will be either 0 or 1. A particular dummy variable is equal to one for observations with the associated level, and is zero otherwise.

For example, assume that we have a categorical variable `Z` with four levels: `a`, `b`, `c`, and `d`. A one-hot encoding for `Z` will create four new variables: `Za`, `Zb`, `Zc`, and `Zd`. The value of these dummy variables for each possible level of `Z` is shown in the table below.

| Z | - | Za | Zb | Zc | Zd |
|---|---|----|----|----|----|
| a | - | 1  | 0  | 0  | 0  |
| b | - | 0  | 1  | 0  | 0  |
| c | - | 0  | 0  | 1  | 0  |
| d | - | 0  | 0  | 0  | 1  |
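
As a minimal sketch of the table above, the dummy variables for `Z` can be built with the pandas function `get_dummies`. The sample values of `Z` below are made up for illustration, and the `prefix_sep=''` argument is chosen only so that the column names match `Za` through `Zd`.

```python
import pandas as pd

# Hypothetical sample of Z containing one observation for each level.
Z = pd.Series(['a', 'b', 'c', 'd'], name='Z')

# get_dummies creates one 0/1 column per level found in Z.
# prefix='Z' with prefix_sep='' produces the names Za, Zb, Zc, Zd.
Z_dummies = pd.get_dummies(Z, prefix='Z', prefix_sep='', dtype=int)

# Show the original variable next to its dummy variables.
print(pd.concat([Z, Z_dummies], axis=1))
```

Each row contains exactly one 1, located in the column for that row's level.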

We could perform one-hot encoding manually using tools provided by numpy and pandas. However, sklearn provides a class `OneHotEncoder` for performing one-hot encoding. This class can be imported as follows:

In [2]:
from sklearn.preprocessing import OneHotEncoder

We will now consider a simple example to demonstrate how to perform one-hot encoding.

## Example: One-Hot Encoding

Assume that we have a data set of customers for an online retailer, and we wish to build a model to help us make marketing decisions. Suppose the data set contains three categorical features:
* **`sex`** - The sex of the customer. This has levels "F" and "M".
* **`emp`** - The customer's employment status. This has levels "FT" (full time), "PT" (part time), and "U" (unemployed).
* **`reg`** - The region in which the customer lives. This is encoded with levels 1 (Americas), 2 (Asia), and 3 (Europe). 

A DataFrame containing 6 observations from the dataset has been created below.

In [3]:
X_cat = pd.DataFrame({
    'sex':['M', 'F', 'F', 'M', 'M', 'F'],
    'emp':['PT', 'FT', 'U', 'FT', 'PT', 'PT'],
    'reg':[1, 3, 3, 2, 1, 2]
})

X_cat

Unnamed: 0,sex,emp,reg
0,M,PT,1
1,F,FT,3
2,F,U,3
3,M,FT,2
4,M,PT,1
5,F,PT,2


Notice that `sex` has two levels, while `emp` and `reg` each have three levels. If we apply one-hot encoding to this DataFrame, the encoding will consist of 2 + 3 + 3 = 8 columns, one for each level of each variable.
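
The column count can also be checked directly from the data. The following sketch rebuilds the same six observations and counts the distinct levels in each column using the pandas method `nunique()`.

```python
import pandas as pd

# The same six observations used in this example.
X_cat = pd.DataFrame({
    'sex': ['M', 'F', 'F', 'M', 'M', 'F'],
    'emp': ['PT', 'FT', 'U', 'FT', 'PT', 'PT'],
    'reg': [1, 3, 3, 2, 1, 2]
})

# Number of distinct levels observed in each categorical column.
levels_per_column = X_cat.nunique()

# One dummy variable per level, so the encoded width is the sum.
n_encoded_columns = int(levels_per_column.sum())

print(levels_per_column)
print(n_encoded_columns)
```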

In the cell below, we will use `OneHotEncoder` to perform the encoding. This is a three-step process:
1. Create an instance of the encoder object. 
2. Fit the encoder to the data using the **`fit()`** method. This does not actually perform the encoding. It simply learns the levels for each categorical variable.
3. We then perform the encoding using the **`transform()`** method of the encoder object.

In [4]:
# scikit-learn 1.2 renamed this parameter; older releases use sparse=False.
encoder = OneHotEncoder(sparse_output=False)
encoder.fit(X_cat)
X_enc = encoder.transform(X_cat)

print(X_enc)

[[0. 1. 0. 1. 0. 1. 0. 0.]
 [1. 0. 1. 0. 0. 0. 0. 1.]
 [1. 0. 0. 0. 1. 0. 0. 1.]
 [0. 1. 1. 0. 0. 0. 1. 0.]
 [0. 1. 0. 1. 0. 1. 0. 0.]
 [1. 0. 0. 1. 0. 0. 1. 0.]]


The output of the encoder is an array. Since the columns are not labeled, it can be difficult to tell what the new columns refer to. The columns of the original DataFrame are encoded from left to right, and the encoded columns for a particular variable are arranged in sorted order by level (alphabetical for strings, ascending for numbers).

It follows that the first two columns of the encoded array encode the `sex` variable. The first column refers to `F`, and the second to `M`. The next three columns are associated with `emp`, and are arranged in the order `FT`, `PT`, and `U`. The final three columns encode the `reg` variable, and refer to levels `1`, `2`, and `3`, in that order.

We will make this clearer by converting the encoded array into a DataFrame and adding column names.

In [5]:
X_enc_df = pd.DataFrame(X_enc.astype('int'))
X_enc_df.columns = ['sex_F', 'sex_M', 'emp_FT', 'emp_PT', 
                  'emp_U', 'reg_1', 'reg_2', 'reg_3']
X_enc_df

Unnamed: 0,sex_F,sex_M,emp_FT,emp_PT,emp_U,reg_1,reg_2,reg_3
0,0,1,0,1,0,1,0,0
1,1,0,1,0,0,0,0,1
2,1,0,0,0,1,0,0,1
3,0,1,1,0,0,0,1,0
4,0,1,0,1,0,1,0,0
5,1,0,0,1,0,0,1,0
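
The column names above were typed by hand. As a sketch of an alternative, the same names can be derived from the `categories_` attribute of the fitted encoder, which stores the sorted levels learned for each input column. The naming pattern `column_level` is simply the convention used in this example. To avoid depending on the name of the dense-output parameter, this sketch leaves the encoder at its default settings and converts the sparse result with `toarray()`.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# The same six observations used in this example.
X_cat = pd.DataFrame({
    'sex': ['M', 'F', 'F', 'M', 'M', 'F'],
    'emp': ['PT', 'FT', 'U', 'FT', 'PT', 'PT'],
    'reg': [1, 3, 3, 2, 1, 2]
})

# With default settings the encoder returns a sparse matrix,
# so toarray() is used to obtain a regular NumPy array.
encoder = OneHotEncoder()
X_enc = encoder.fit_transform(X_cat).toarray()

# categories_ holds one array of sorted levels per input column.
col_names = [
    f'{col}_{level}'
    for col, levels in zip(X_cat.columns, encoder.categories_)
    for level in levels
]

X_enc_df = pd.DataFrame(X_enc.astype(int), columns=col_names)
print(X_enc_df)
```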


To help illustrate how the encoding was performed, we will now join the original DataFrame with the encoded DataFrame.

In [6]:
sep = pd.DataFrame({'---':['---']*6})
pd.concat([X_cat, sep, X_enc_df], axis=1)

Unnamed: 0,sex,emp,reg,---,sex_F,sex_M,emp_FT,emp_PT,emp_U,reg_1,reg_2,reg_3
0,M,PT,1,---,0,1,0,1,0,1,0,0
1,F,FT,3,---,1,0,1,0,0,0,0,1
2,F,U,3,---,1,0,0,0,1,0,0,1
3,M,FT,2,---,0,1,1,0,0,0,1,0
4,M,PT,1,---,0,1,0,1,0,1,0,0
5,F,PT,2,---,1,0,0,1,0,0,1,0


## Feature Scaling

For some supervised learning models, it is important for all of the numerical features to be on roughly the same scale. When working with these models, it is typically necessary to scale any numerical features prior to using them to create a supervised learning model. 
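
One reason scale matters is that distance-based models, such as KNN, compare observations using distances between feature vectors. The sketch below uses two hypothetical houses described by a room count, a living area in square feet, and a lot size in acres. It shows how the feature with the largest numerical range can dominate the Euclidean distance when no scaling is applied.

```python
import numpy as np

# Two hypothetical houses: (rooms, sqft, lot size in acres).
house_a = np.array([10, 2100, 0.34])
house_b = np.array([12, 3400, 0.45])

# Squared difference contributed by each feature.
sq_diff = (house_a - house_b) ** 2

# Euclidean distance between the two unscaled feature vectors.
distance = np.sqrt(sq_diff.sum())

# Fraction of the squared distance contributed by each feature.
share = sq_diff / sq_diff.sum()

print(distance)
print(share)
```

Nearly all of the distance comes from the square footage, so the other two features have almost no influence on which houses are considered close to one another.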

We will discuss two common scaling techniques: **Min-Max Scaling** and **Standard Scaling** (or **Standardization**).

## Min-Max Scaling

Let $X$ refer to a feature in a dataset. Let $x_{min}$ be the smallest value of $X$ for observations in the set, and let $x_{max}$ be the largest value of $X$ for observations in the set. 

If $x_0$ is a value of $X$ for an observation, its scaled value is given by: $\large w_0 = \frac{x_0 ~-~ x_{min}}{x_{max} ~-~ x_{min}}$

When using min-max scaling to scale features, each feature in the scaled set will have values ranging between 0 and 1.

We can perform min-max scaling using the `MinMaxScaler` class from `sklearn.preprocessing`.

In [7]:
from sklearn.preprocessing import MinMaxScaler

We will now see an example of applying min-max scaling to numerical features.

## Example: Min-Max Scaling

Assume that we are creating a supervised learning model to predict the prices of houses. Suppose our dataset contains the following three numerical features:

* **`rooms`** - The number of rooms in the house.
* **`sqft`** - The size of the living area of the house, in square feet.
* **`lot`** - The size of the lot, in acres. 

Suppose our dataset consists of the following 6 observations:

In [8]:
X_num = pd.DataFrame({
    'rooms':[10, 12, 8, 11, 10, 9], 
    'sqft':[2100, 3400, 1800, 2900, 2800, 2300],
    'lot':[0.34, 0.45, 0.24, 0.49, 0.37, 0.28]
})

X_num

Unnamed: 0,rooms,sqft,lot
0,10,2100,0.34
1,12,3400,0.45
2,8,1800,0.24
3,11,2900,0.49
4,10,2800,0.37
5,9,2300,0.28


Using `MinMaxScaler` to perform feature scaling is a three-step process:
1. Create an instance of the scaler object. 
2. Fit the scaler to the data using the **`fit()`** method. This does not actually perform the scaling. It simply learns the min and max values for each feature.
3. We then perform the scaling using the **`transform()`** method of the scaler object.

In [9]:
mm_scaler = MinMaxScaler()
mm_scaler.fit(X_num.astype('float'))
X_mm = mm_scaler.transform(X_num.astype('float'))
print(X_mm)

[[0.5    0.1875 0.4   ]
 [1.     1.     0.84  ]
 [0.     0.     0.    ]
 [0.75   0.6875 1.    ]
 [0.5    0.625  0.52  ]
 [0.25   0.3125 0.16  ]]


In the cell below, we display the scaled values next to the original feature values.

In [10]:
X_mm_df = pd.DataFrame(X_mm)
X_mm_df.columns = ['rooms_mm', 'sqft_mm', 'lot_mm']
pd.concat([X_num, sep, X_mm_df], axis=1)

Unnamed: 0,rooms,sqft,lot,---,rooms_mm,sqft_mm,lot_mm
0,10,2100,0.34,---,0.5,0.1875,0.4
1,12,3400,0.45,---,1.0,1.0,0.84
2,8,1800,0.24,---,0.0,0.0,0.0
3,11,2900,0.49,---,0.75,0.6875,1.0
4,10,2800,0.37,---,0.5,0.625,0.52
5,9,2300,0.28,---,0.25,0.3125,0.16
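
As a check on the formula for $w_0$, the following sketch applies min-max scaling by hand with pandas rather than `MinMaxScaler`. The result should agree with the scaled values shown above.

```python
import numpy as np
import pandas as pd

# The same six housing observations used in this example.
X_num = pd.DataFrame({
    'rooms': [10, 12, 8, 11, 10, 9],
    'sqft': [2100, 3400, 1800, 2900, 2800, 2300],
    'lot': [0.34, 0.45, 0.24, 0.49, 0.37, 0.28]
})

# Column-wise minimum and maximum, i.e. x_min and x_max per feature.
x_min = X_num.min()
x_max = X_num.max()

# Apply w = (x - x_min) / (x_max - x_min) to every column at once.
X_mm_manual = (X_num - x_min) / (x_max - x_min)

print(X_mm_manual)

# Every scaled column should have a minimum of 0 and a maximum of 1.
print(np.allclose(X_mm_manual.min(), 0), np.allclose(X_mm_manual.max(), 1))
```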


## Standard Scaling

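Let $X$ refer to a feature in a dataset. Let $\bar x$ be the mean of $X$ values for observations in the set, and let $s_X$ be the standard deviation of $X$ for observations in the set. Note that `StandardScaler` uses the population standard deviation, which divides by $n$ rather than $n - 1$.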

If $x_0$ is a value of $X$ for an observation, its standard scaled value is given by: $\large z_0 = \frac{x_0 ~-~ \bar x}{s_X}$

When using standardization to scale features, each feature in the scaled set will have a mean of 0 and a standard deviation of 1.

We can perform standardization using the `StandardScaler` class from `sklearn.preprocessing`.

In [11]:
from sklearn.preprocessing import StandardScaler

We will now see an example of standard scaling of numerical features.

## Example: Standard Scaling

We will demonstrate standard scaling using the housing data from the previous example. 

In [12]:
X_num

Unnamed: 0,rooms,sqft,lot
0,10,2100,0.34
1,12,3400,0.45
2,8,1800,0.24
3,11,2900,0.49
4,10,2800,0.37
5,9,2300,0.28


Using `StandardScaler` to perform feature scaling is a three-step process:
1. Create an instance of the scaler object. 
2. Fit the scaler to the data using the **`fit()`** method. This does not actually perform the scaling. It simply learns the mean and standard deviation for each feature.
3. We then perform the scaling using the **`transform()`** method of the scaler object.

In [13]:
ss_scaler = StandardScaler()
ss_scaler.fit(X_num.astype('float'))
X_ss = ss_scaler.transform(X_num.astype('float'))
print(X_ss)

[[ 0.         -0.83683223 -0.2466922 ]
 [ 1.54919334  1.5806831   1.00574511]
 [-1.54919334 -1.39472039 -1.38527157]
 [ 0.77459667  0.65086951  1.46117686]
 [ 0.          0.4649068   0.09488161]
 [-0.77459667 -0.4649068  -0.92983982]]


In the cell below, we display the scaled values next to the original feature values.

In [14]:
X_ss_df = pd.DataFrame(X_ss)
X_ss_df.columns = ['rooms_ss', 'sqft_ss', 'lot_ss']
pd.concat([X_num, sep, X_ss_df], axis=1)

Unnamed: 0,rooms,sqft,lot,---,rooms_ss,sqft_ss,lot_ss
0,10,2100,0.34,---,0.0,-0.836832,-0.246692
1,12,3400,0.45,---,1.549193,1.580683,1.005745
2,8,1800,0.24,---,-1.549193,-1.39472,-1.385272
3,11,2900,0.49,---,0.774597,0.65087,1.461177
4,10,2800,0.37,---,0.0,0.464907,0.094882
5,9,2300,0.28,---,-0.774597,-0.464907,-0.92984
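
As a check on the formula for $z_0$, the following sketch standardizes the housing data by hand with pandas. To match the values above, the standard deviation is computed with `ddof=0` (dividing by $n$). The pandas default of `ddof=1` would give slightly different numbers.

```python
import numpy as np
import pandas as pd

# The same six housing observations used in this example.
X_num = pd.DataFrame({
    'rooms': [10, 12, 8, 11, 10, 9],
    'sqft': [2100, 3400, 1800, 2900, 2800, 2300],
    'lot': [0.34, 0.45, 0.24, 0.49, 0.37, 0.28]
})

# Column-wise mean and population standard deviation (ddof=0).
x_bar = X_num.mean()
s_x = X_num.std(ddof=0)

# Apply z = (x - x_bar) / s_x to every column at once.
X_ss_manual = (X_num - x_bar) / s_x

print(X_ss_manual.round(6))

# Each scaled column should have mean 0 and standard deviation 1.
print(np.allclose(X_ss_manual.mean(), 0), np.allclose(X_ss_manual.std(ddof=0), 1))
```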


## Feature Scaling and Training, Validation, and Test Sets

When using scaled features to create a supervised learning model, the feature scaler must be fit on exactly the data that the model is trained on. In `sklearn`, this means that the `fit()` method of a scaler object should only ever be called on the training set. Once the scaler has been fit to the training set, its `transform()` method can be used to scale the training set, the validation set, the test set, or a set of newly collected data.

## Example: Scaling a Test Set

Returning to our housing example, assume that we have used the previous six observations to train a model. 

In [15]:
X_num_train = X_num
X_num_train

Unnamed: 0,rooms,sqft,lot
0,10,2100,0.34
1,12,3400,0.45
2,8,1800,0.24
3,11,2900,0.49
4,10,2800,0.37
5,9,2300,0.28


Suppose we ultimately wish to test our model using the following three observations.

In [16]:
X_num_test = pd.DataFrame({
    'rooms':[11, 8, 10], 
    'sqft':[3200, 1600, 2700],
    'lot':[0.56, 0.31, 0.42]
})

X_num_test

Unnamed: 0,rooms,sqft,lot
0,11,3200,0.56
1,8,1600,0.31
2,10,2700,0.42


If we were to apply min-max scaling, we would need to first fit the scaler to the training set, and then apply it to both the training and test sets. 

In [17]:
mm_scaler = MinMaxScaler()
mm_scaler.fit(X_num_train.astype('float'))

X_mm_train = mm_scaler.transform(X_num_train.astype('float'))
X_mm_test = mm_scaler.transform(X_num_test.astype('float'))

In [18]:
print(X_mm_train)

[[0.5    0.1875 0.4   ]
 [1.     1.     0.84  ]
 [0.     0.     0.    ]
 [0.75   0.6875 1.    ]
 [0.5    0.625  0.52  ]
 [0.25   0.3125 0.16  ]]


In [19]:
print(X_mm_test)

[[ 0.75    0.875   1.28  ]
 [ 0.     -0.125   0.28  ]
 [ 0.5     0.5625  0.72  ]]


Note that the values in the scaled version of the training set are all between 0 and 1. That is not the case for the scaled version of the test set. If an observation in the test set contains a feature value that is less than the minimum value of that feature in the training set, then its scaled value will be negative. Likewise, if a feature value in the test set is larger than the maximum of that feature in the training set, then its scaled value will be greater than 1. 
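
The sketch below locates the scaled test values that fall outside the interval from 0 to 1. The final step, which clips those values with `np.clip`, is only one possible way of handling them and is an assumption made for illustration. It is not part of the workflow used in this lesson.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Training and test observations from this example.
X_num_train = pd.DataFrame({
    'rooms': [10, 12, 8, 11, 10, 9],
    'sqft': [2100, 3400, 1800, 2900, 2800, 2300],
    'lot': [0.34, 0.45, 0.24, 0.49, 0.37, 0.28]
})
X_num_test = pd.DataFrame({
    'rooms': [11, 8, 10],
    'sqft': [3200, 1600, 2700],
    'lot': [0.56, 0.31, 0.42]
})

# Learn the min and max of each feature from the training set only.
mm_scaler = MinMaxScaler()
mm_scaler.fit(X_num_train)
X_mm_test = mm_scaler.transform(X_num_test)

# True where a test value lies below the training min or above the training max.
outside = (X_mm_test < 0) | (X_mm_test > 1)
print(outside)

# Optional (an assumption, not part of the lesson): force values into [0, 1].
X_mm_test_clipped = np.clip(X_mm_test, 0, 1)
print(X_mm_test_clipped)
```

Two entries are flagged: the lot size of the first test house exceeds the training maximum, and the square footage of the second test house is below the training minimum.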

## Example: Titanic Dataset

We will now provide an example illustrating how to use feature scaling, one-hot encoding, and data splitting to prepare data for use in creating a model. 

For this example, we will be using the Titanic Dataset. This dataset contains the following pieces of information for each of the 887 passengers included in the dataset.

* **`Survived`** - Binary variable indicating whether or not the passenger survived the voyage. 
* **`Pclass`** - Categorical variable indicating the passenger class. 
* **`Name`** - The Passenger's name.
* **`Sex`** - Categorical variable indicating the sex of the passenger. 
* **`Age`** - Age of the passenger.

In [20]:
titanic = pd.read_csv('data/titanic.txt', sep='\t')
titanic.head(10)

Unnamed: 0,Survived,Pclass,Name,Sex,Age
0,0,3,Mr. Owen Harris Braund,male,22.0
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0
2,1,3,Miss. Laina Heikkinen,female,26.0
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0
4,0,3,Mr. William Henry Allen,male,35.0
5,0,3,Mr. James Moran,male,27.0
6,0,1,Mr. Timothy J McCarthy,male,54.0
7,0,3,Master. Gosta Leonard Palsson,male,2.0
8,1,3,Mrs. Oscar W (Elisabeth Vilhelmina Berg) Johnson,female,27.0
9,1,2,Mrs. Nicholas (Adele Achem) Nasser,female,14.0


### Extracting Labels

Suppose we are interested in creating a model to predict whether or not a passenger would survive based on their passenger class, sex, and age. We will start by extracting the labels from the DataFrame.

In [21]:
y = titanic.loc[:, 'Survived']
print(y.shape)

(887,)


### Separating Categorical and Numerical Features

Since `Pclass` and `Sex` are categorical variables, we will need to apply one-hot encoding to these. Suppose that we wish to apply min-max scaling to the single numerical variable, `Age`. Since these features require different types of processing, we will split these into two different DataFrames. 

In [22]:
X_cat = titanic.loc[:, ['Pclass', 'Sex']]
print(X_cat.shape)

(887, 2)


In [23]:
X_num = titanic.loc[:, ['Age']]
print(X_num.shape)

(887, 1)


### One-Hot Encoding of Categorical Variables

We will apply one-hot encoding to the categorical variables.

In [24]:
# scikit-learn 1.2 renamed this parameter; older releases use sparse=False.
encoder = OneHotEncoder(sparse_output=False)
encoder.fit(X_cat)
X_enc = encoder.transform(X_cat)
print(X_enc.shape)

(887, 5)


### Splitting the Data

Recall that a feature scaler must be fit only on the training data. As a result, we must perform our train/validation/test split before performing scaling. Suppose we wish to create a 70/15/15 split.

In [25]:
X_num_train, X_num_hold, y_train, y_hold =\
    train_test_split(X_num, y, test_size = 0.3, random_state=1, stratify=y)

X_num_valid, X_num_test, y_valid, y_test =\
    train_test_split(X_num_hold, y_hold, test_size = 0.5, random_state=1, stratify=y_hold)

print(X_num_train.shape)
print(X_num_valid.shape)
print(X_num_test.shape)

(620, 1)
(133, 1)
(134, 1)


Since we will eventually want to merge our encoded categorical features back with the scaled numerical features, we should also perform the same split on our encoded categorical feature array. 

In [26]:
X_enc_train, X_enc_hold, y_train, y_hold =\
    train_test_split(X_enc, y, test_size = 0.3, random_state=1, stratify=y)

X_enc_valid, X_enc_test, y_valid, y_test =\
    train_test_split(X_enc_hold, y_hold, test_size = 0.5, random_state=1, stratify=y_hold)

print(X_enc_train.shape)
print(X_enc_valid.shape)
print(X_enc_test.shape)

(620, 5)
(133, 5)
(134, 5)
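
The two pairs of calls above produce matching splits because they share the same `random_state` and stratification labels. As a sketch of an alternative, `train_test_split` accepts any number of arrays of equal length and splits them all in the same way in one call. The arrays below are small made-up stand-ins for `X_num`, `X_enc`, and `y`. Each stores a row id so that the alignment of the pieces can be checked.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with 20 rows. Each array stores a row id
# (or a multiple of it) so that alignment can be verified afterwards.
row_id = np.arange(20)
X_num = row_id.reshape(-1, 1).astype(float)
X_enc = (row_id * 10).reshape(-1, 1).astype(float)
y = row_id % 2

# Split all three arrays at once: 70% training, 30% held out.
X_num_train, X_num_hold, X_enc_train, X_enc_hold, y_train, y_hold = \
    train_test_split(X_num, X_enc, y, test_size=0.3, random_state=1, stratify=y)

# Split the held-out rows evenly into validation and test sets.
X_num_valid, X_num_test, X_enc_valid, X_enc_test, y_valid, y_test = \
    train_test_split(X_num_hold, X_enc_hold, y_hold, test_size=0.5,
                     random_state=1, stratify=y_hold)

print(X_num_train.shape, X_num_valid.shape, X_num_test.shape)

# The same rows end up in the corresponding pieces of each array.
print(np.array_equal(X_enc_train, X_num_train * 10))
print(np.array_equal(X_enc_test, X_num_test * 10))
```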


### Scaling Numerical Features

We will now apply min-max scaling to the single numerical feature.

In [27]:
mm_scaler = MinMaxScaler()
mm_scaler.fit(X_num_train.astype('float'))

X_mm_train = mm_scaler.transform(X_num_train.astype('float'))
X_mm_valid = mm_scaler.transform(X_num_valid.astype('float'))
X_mm_test = mm_scaler.transform(X_num_test.astype('float'))

print(X_mm_train.shape)
print(X_mm_valid.shape)
print(X_mm_test.shape)

(620, 1)
(133, 1)
(134, 1)


### Recombining Feature Arrays

We have completed the desired encoding and scaling of features. We must now combine the encoded categorical features with the scaled numerical features. We will need to do this for the training set, the validation set, and the test set. 

In [28]:
X_train = np.hstack([X_enc_train, X_mm_train])
X_valid = np.hstack([X_enc_valid, X_mm_valid])
X_test = np.hstack([X_enc_test, X_mm_test])

print(X_train.shape)
print(X_valid.shape)
print(X_test.shape)

(620, 6)
(133, 6)
(134, 6)
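
The encode, scale, and recombine steps can also be expressed with the `ColumnTransformer` class from `sklearn.compose`, which applies a different transformer to each group of columns and stacks the results side by side. The sketch below is an alternative to the manual approach above, not the method used in this lesson. The rows are made-up stand-ins that share the column names of the Titanic features. Unlike the steps above, the encoder here is fit on the training rows only, so every level must appear in the training data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical stand-in rows with the same columns as the Titanic features.
train = pd.DataFrame({
    'Pclass': [3, 1, 3, 1, 2, 2],
    'Sex': ['male', 'female', 'female', 'female', 'male', 'male'],
    'Age': [22.0, 38.0, 26.0, 35.0, 54.0, 2.0]
})
test = pd.DataFrame({
    'Pclass': [3, 2],
    'Sex': ['female', 'male'],
    'Age': [27.0, 14.0]
})

# One-hot encode the categorical columns and min-max scale Age.
# sparse_threshold=0 makes the combined output a dense array.
preprocessor = ColumnTransformer(
    [('cat', OneHotEncoder(), ['Pclass', 'Sex']),
     ('num', MinMaxScaler(), ['Age'])],
    sparse_threshold=0
)

# Fit on the training rows only, then apply the same transformation to the test rows.
X_train = preprocessor.fit_transform(train)
X_test = preprocessor.transform(test)

print(X_train.shape)
print(X_test.shape)
```

The output columns follow the order of the transformer list: three columns for `Pclass`, two for `Sex`, and one for the scaled `Age`.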


### Moving on to the Modeling Step

We are now ready to use the processed data to create supervised learning models. We will discuss this process in more detail in future lectures. However, to provide a quick example of the modeling process, we will train three models on the training set. In particular, we will train a logistic regression model, a decision tree model, and a KNN model. We will score each model on the validation set. 

In [29]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

model_1 = LogisticRegression(solver='lbfgs')
model_1.fit(X_train, y_train)

model_2 = DecisionTreeClassifier(max_depth=4)
model_2.fit(X_train, y_train)

model_3 = KNeighborsClassifier(n_neighbors=8)
model_3.fit(X_train, y_train)

print("Model 1 Training Accuracy:  ", model_1.score(X_train, y_train))
print("Model 1 Validation Accuracy:", model_1.score(X_valid, y_valid))
print()
print("Model 2 Training Accuracy:  ", model_2.score(X_train, y_train))
print("Model 2 Validation Accuracy:", model_2.score(X_valid, y_valid))
print()
print("Model 3 Training Accuracy:  ", model_3.score(X_train, y_train))
print("Model 3 Validation Accuracy:", model_3.score(X_valid, y_valid))

Model 1 Training Accuracy:   0.8
Model 1 Validation Accuracy: 0.7593984962406015

Model 2 Training Accuracy:   0.8274193548387097
Model 2 Validation Accuracy: 0.7669172932330827

Model 3 Training Accuracy:   0.8419354838709677
Model 3 Validation Accuracy: 0.7894736842105263


Model 3 has the best performance on the validation set. We will select Model 3 to be our final model and score it one last time on the test set. 

In [30]:
print("Model 3 Testing Accuracy:  ", model_3.score(X_test, y_test))

Model 3 Testing Accuracy:   0.7835820895522388
