# Building Good Training Sets – Data Preprocessing

Main topics covered in this notebook:
* Removing and imputing missing values from the dataset
* Getting categorical data into shape for machine learning algorithms
* Selecting relevant features for the model construction

In [44]:
%load_ext watermark
%watermark -a "Ankit Kumar" -u -d -p numpy,pandas,matplotlib,sklearn

The watermark extension is already loaded. To reload it, use:
  %reload_ext watermark
Ankit Kumar 
last updated: 2018-10-01 

numpy 1.14.5
pandas 0.23.3
matplotlib 2.2.2
sklearn 0.20.0


In [51]:
#Importing libraries
from IPython.display import Image
%matplotlib inline

import pandas as pd
import numpy as np
from io import StringIO
import sys

from sklearn.preprocessing import Imputer, LabelEncoder, OneHotEncoder, MinMaxScaler, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

%config InlineBackend.figure_format = 'retina'
import matplotlib.pyplot as plt

## Dealing with missing data

### Identifying missing values in tabular data

In [3]:
csv_data = \
'''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''

In [4]:
df = pd.read_csv(StringIO(csv_data))
df

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0
1,5.0,6.0,,8.0
2,10.0,11.0,12.0,


In [5]:
df.isnull().sum()

A    0
B    0
C    1
D    1
dtype: int64

In [6]:
#access the underlying numpy array via the values attribute
df.values

array([[ 1.,  2.,  3.,  4.],
       [ 5.,  6., nan,  8.],
       [10., 11., 12., nan]])

### Eliminating samples or features with missing values

In [7]:
#remove rows that contain missing values

df.dropna(axis = 0) #axis = 0 drops rows, axis = 1 drops columns

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0


In [8]:
#remove columns that contain missing values

df.dropna(axis = 1)

Unnamed: 0,A,B
0,1.0,2.0
1,5.0,6.0
2,10.0,11.0


In [9]:
#only drop rows where all columns are NaN

df.dropna(how = 'all')

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0
1,5.0,6.0,,8.0
2,10.0,11.0,12.0,


In [10]:
#drop rows that have fewer than 4 non-NaN values

df.dropna(thresh=4)

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0


In [11]:
#drop rows where NaN appears in specific columns (for example: C)

df.dropna(subset=['C'])

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0
2,10.0,11.0,12.0,


**Removing samples or features is not always a good idea**

* If we remove too many samples, a reliable analysis may become impossible.

* If we remove too many feature columns, we run the risk of losing valuable information that our classifier needs to discriminate between classes.

A commonly used alternative is to estimate the missing values instead:

### Imputing missing values

* We can use different interpolation techniques to estimate the missing values from the other training samples in our dataset.

* One of the most common interpolation techniques is `mean imputation`, where we simply replace each missing value with the mean value of its feature column.

In [12]:
df.values

array([[ 1.,  2.,  3.,  4.],
       [ 5.,  6., nan,  8.],
       [10., 11., 12., nan]])

In [13]:
imr = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
imr = imr.fit(df.values)

imputed_data = imr.transform(df.values)
imputed_data



array([[ 1. ,  2. ,  3. ,  4. ],
       [ 5. ,  6. ,  7.5,  8. ],
       [10. , 11. , 12. ,  6. ]])
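The `Imputer` class used above belongs to the scikit-learn version listed at the top of this notebook (0.20.0); it was deprecated in that release and removed in 0.22. A minimal sketch of the same mean imputation with its replacement, `SimpleImputer`, assuming a scikit-learn version that ships `sklearn.impute` (the array `X_missing` simply re-creates the values of `df.values`):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Same values as df.values above, re-created so the sketch is self-contained
X_missing = np.array([[1.0, 2.0, 3.0, 4.0],
                      [5.0, 6.0, np.nan, 8.0],
                      [10.0, 11.0, 12.0, np.nan]])

# SimpleImputer always works column-wise, so there is no axis argument
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X_imputed = imputer.fit_transform(X_missing)

print(X_imputed)
```

Other values for `strategy`, such as `'median'` or `'most_frequent'`, follow the same pattern.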

## Handling Categorical Data

### Nominal and Ordinal features

**Ordinal**: Categorical values that can be sorted or ordered.

**Nominal**: Nominal features don't imply any order.

In [14]:
df = pd.DataFrame([['green', 'M', 10.1, 'class2'],
                   ['red', 'L', 13.5, 'class1'],
                   ['blue', 'XL', 15.3, 'class2']
                  ])

df.columns = ['color', 'size', 'price', 'classlabel']

df

Unnamed: 0,color,size,price,classlabel
0,green,M,10.1,class2
1,red,L,13.5,class1
2,blue,XL,15.3,class2


As we can see, this DataFrame contains a `nominal feature` (color), an `ordinal feature` (size), and a `numerical feature` (price).

The class labels (assuming that we created a dataset for a supervised learning task) are stored in the last column.

### Mapping ordinal features

In the following simple example, let's assume that we know the numerical difference between the sizes, for example, $XL = L + 1 = M + 2$.

In [15]:
size_mapping = {'XL': 3,
                'L': 2,
                'M': 1
                   }

df['size'] = df['size'].map(size_mapping)

df

Unnamed: 0,color,size,price,classlabel
0,green,1,10.1,class2
1,red,2,13.5,class1
2,blue,3,15.3,class2


If we want to transform the integer values back to the original string representation at a later stage, we can simply define a reverse-mapping dictionary `inv_size_mapping = {v: k for k, v in size_mapping.items()}`. It can then be used via the pandas `map` method on the transformed feature column, just like the `size_mapping` dictionary that we used previously.
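A small sketch of that round trip, using a standalone `Series` (the name `sizes` is only for illustration) so that the notebook's `df` is left untouched:

```python
import pandas as pd

size_mapping = {'XL': 3, 'L': 2, 'M': 1}
# Swap keys and values to get the reverse mapping
inv_size_mapping = {v: k for k, v in size_mapping.items()}

sizes = pd.Series(['M', 'L', 'XL'])
encoded = sizes.map(size_mapping)        # strings -> integers
decoded = encoded.map(inv_size_mapping)  # integers -> strings

print(encoded.tolist())
print(decoded.tolist())
```

The reverse mapping only works because every size maps to a distinct integer.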


### Encoding class labels

* Many machine learning libraries require that class labels are encoded as integer values. 

* Although most estimators for classification in scikit-learn convert class labels to integers internally, it is considered good practice to provide class labels as integer arrays to avoid technical glitches.

In [16]:
#create a mapping dictionary to convert the class labels from string to integers

class_mapping = {label: idx for idx, label in enumerate(np.unique(df['classlabel']))}

class_mapping

{'class1': 0, 'class2': 1}

In [17]:
#to convert class labels from strings to integers
df['classlabel'] = df['classlabel'].map(class_mapping)
df

Unnamed: 0,color,size,price,classlabel
0,green,1,10.1,1
1,red,2,13.5,0
2,blue,3,15.3,1


In [18]:
#reverse the mapping to recover the original string labels

inv_class_mapping = {v:k for k, v in class_mapping.items()}
df['classlabel'] = df['classlabel'].map(inv_class_mapping)
df

Unnamed: 0,color,size,price,classlabel
0,green,1,10.1,class2
1,red,2,13.5,class1
2,blue,3,15.3,class2


Alternatively, scikit-learn provides the `LabelEncoder` class, which does the same:

In [19]:
#Label encoding with sklearn's LabelEncoder

class_le = LabelEncoder()
y = class_le.fit_transform(df['classlabel'].values)
y

array([1, 0, 1])

In [20]:
#reverse mapping

class_le.inverse_transform(y)

array(['class2', 'class1', 'class2'], dtype=object)

### Performing one-hot encoding on nominal features

Since scikit-learn's estimators treat class labels without any order, we used the convenient `LabelEncoder` class to encode the string labels into integers. It may appear that we could use a similar approach to transform the nominal color column of our dataset, as follows:


In [21]:
X = df[['color', 'size', 'price']].values

color_le = LabelEncoder()
X[:, 0] = color_le.fit_transform(X[:, 0])
X

array([[1, 1, 10.1],
       [2, 2, 13.5],
       [0, 3, 15.3]], dtype=object)

blue -> 0

green -> 1

red -> 2

* If we stop at this point and feed the array to our classifier, we will make one of the most common mistakes in dealing with categorical data. 

* A learning algorithm will now assume that green is larger than blue, and red is larger than green. 

* A common workaround for this problem is to use a technique called `one-hot encoding`.

* The idea behind this approach is to create a new `dummy feature` for each unique value in the nominal feature column.

Here, we would convert the color feature into three new features: blue, green, and red. Binary values can then be used to indicate the particular color of a sample; for example, a blue sample can be encoded as `blue=1, green=0, red=0`. 

In [23]:
ohe = OneHotEncoder(categorical_features=[0])
ohe.fit_transform(X).toarray()

In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.


array([[ 0. ,  1. ,  0. ,  1. , 10.1],
       [ 0. ,  0. ,  1. ,  2. , 13.5],
       [ 1. ,  0. ,  0. ,  3. , 15.3]])

* By default, the `OneHotEncoder` returns a sparse matrix when we use the `transform` method.

* We converted the sparse matrix representation into a regular (dense) NumPy array for the purposes of visualization via the `toarray` method.

* To omit the `toarray` step, we could initialize the encoder as `OneHotEncoder(..., sparse=False)` to return a regular NumPy array.
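The `categorical_features` argument used above is specific to older scikit-learn releases (it was deprecated in 0.20). A hedged sketch of the same encoding with `ColumnTransformer`, assuming a scikit-learn version that provides `sklearn.compose`; the frame `df_demo` is a hypothetical copy of the example data with `size` already mapped to integers:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the notebook's df (size already mapped to integers)
df_demo = pd.DataFrame({'color': ['green', 'red', 'blue'],
                        'size': [1, 2, 3],
                        'price': [10.1, 13.5, 15.3]})

ct = ColumnTransformer(
    [('onehot', OneHotEncoder(), ['color'])],  # one-hot encode the string column directly
    remainder='passthrough',                   # keep size and price unchanged
    sparse_threshold=0)                        # always return a dense array

X_encoded = ct.fit_transform(df_demo)
print(X_encoded)
```

The resulting columns are ordered as blue, green, red, size, price, so no prior `LabelEncoder` step is needed for the color column.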

An even more convenient way to create those dummy features via `one-hot encoding` is to use the `get_dummies` function implemented in pandas. Applied to a DataFrame, `get_dummies` will only convert string columns and leave all other columns unchanged, as the `pd.get_dummies` cell further below shows.

In [25]:
#returns dense array so that we can skip the toarray part
ohe = OneHotEncoder(categorical_features=[0], sparse=False)
ohe.fit_transform(X)


array([[ 0. ,  1. ,  0. ,  1. , 10.1],
       [ 0. ,  0. ,  1. ,  2. , 13.5],
       [ 1. ,  0. ,  0. ,  3. , 15.3]])

In [26]:
#one hot encoding via pandas

pd.get_dummies(df[['price', 'color', 'size']])

Unnamed: 0,price,size,color_blue,color_green,color_red
0,10.1,1,0,1,0
1,13.5,2,0,0,1
2,15.3,3,1,0,0


In [27]:
#multicollinearity guard in get_dummies
pd.get_dummies(df[['price', 'color', 'size']], drop_first=True)

Unnamed: 0,price,size,color_green,color_red
0,10.1,1,1,0
1,13.5,2,0,1
2,15.3,3,0,0


In [28]:
#multicollinearity guard for the OneHotEncoder

ohe = OneHotEncoder(categorical_features=[0])

ohe.fit_transform(X).toarray()[:, 1:]


array([[ 1. ,  0. ,  1. , 10.1],
       [ 0. ,  1. ,  2. , 13.5],
       [ 0. ,  0. ,  3. , 15.3]])
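Slicing off the first column by hand works, but it depends on knowing where the dummy columns sit. As a sketch, and assuming a scikit-learn version in which `OneHotEncoder` accepts the `drop` parameter (it is not available in 0.20.0), the encoder can drop the first category itself; `colors` is an illustrative array holding only the color column:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Only the nominal column, as a 2D array of strings
colors = np.array([['green'], ['red'], ['blue']])

# drop='first' removes the first category (blue, in sorted order)
ohe_drop = OneHotEncoder(drop='first')
dummies = ohe_drop.fit_transform(colors).toarray()

print(ohe_drop.categories_)
print(dummies)
```

A row of all zeros then stands for the dropped category, here blue.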

## Partitioning dataset into training set and test set

In [29]:
df_wine = pd.read_csv('https://archive.ics.uci.edu/'
                      'ml/machine-learning-databases/wine/wine.data', header = None)

In [30]:
df_wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash',
                   'Alcalinity of ash', 'Magnesium', 'Total phenols',
                   'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
                   'Color intensity', 'Hue', 'OD280/OD315 of diluted wines',
                   'Proline']

print('Class labels', np.unique(df_wine['Class label']))

df_wine.head()

Class labels [1 2 3]


Unnamed: 0,Class label,Alcohol,Malic acid,Ash,Alcalinity of ash,Magnesium,Total phenols,Flavanoids,Nonflavanoid phenols,Proanthocyanins,Color intensity,Hue,OD280/OD315 of diluted wines,Proline
0,1,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050
2,1,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185
3,1,14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480
4,1,13.24,2.59,2.87,21.0,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735


In [32]:
#the first column (index 0) holds the class label; the remaining columns are the features

X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values

In [36]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0, stratify = y)


In [37]:
X_train?

Type:            ndarray
String form:    
[[1.362e+01 4.950e+00 2.350e+00 ... 9.100e-01 2.050e+00 5.500e+02]
           [1.376e+01 1.530e+00 2.700e+0 <...> .900e-01 3.130e+00 8.860e+02]
           [1.270e+01 3.870e+00 2.400e+00 ... 1.190e+00 3.130e+00 4.630e+02]]
Length:          124
File:            /Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/numpy/__init__.py
Docstring:       <no docstring>
Class docstring:
ndarray(shape, dtype=float, buffer=None, offset=0,
        strides=None, order=None)

An array object represents a multidimensional, homogeneous array
of fixed-size items.  An associated data-type object describes the
format of each element in the array (its byte-order, how many bytes it
occupies in memory, whether it is an integer, a floating point number,
or something else, etc.)

Arrays should be constructed using `array`, `zeros` or `empty` (refer
to the See Also section below).

## Bringing features onto the same scale

* Feature scaling is a crucial step in our preprocessing pipeline that can easily be forgotten.

* Decision trees and random forests are two of the very few machine learning algorithms where we don't need to worry about feature scaling.

* Let's assume that we have two features, where one feature is measured on a scale from 1 to 10 and the second feature is measured on a scale from 1 to 100,000.

* If we use the `K-nearest neighbors (KNN)` algorithm with a Euclidean distance measure, the computed distances between samples will be dominated by the second feature axis.

* There are two common approaches to bringing different features onto the same scale: `normalization` and `standardization`.

* Most often, normalization refers to the rescaling of the features to a range of [0, 1], which is a special case of `min-max` scaling.

$x^{(i)}_{norm} = \frac{x^{(i)}-x_{min}}{x_{max} - x_{min}}$

Here, $x^{(i)}$ is a particular sample, $x_{min}$ is the smallest value in the feature column and $x_{max}$ is the largest value.
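To make the scale problem concrete, here is a small sketch with three made-up samples. The feature ranges (1 to 10 and 1 to 100,000) come from the example above; the sample values themselves and the helper `euclidean` are assumptions for illustration only:

```python
import numpy as np

# Made-up samples: feature 1 lies in [1, 10], feature 2 in [1, 100000]
X = np.array([[1.0, 50000.0],
              [10.0, 50100.0],
              [1.5, 60000.0]])

def euclidean(a, b):
    """Euclidean distance between two 1D arrays."""
    return np.sqrt(np.sum((a - b) ** 2))

# Raw distances: almost entirely determined by the second feature
raw_01 = euclidean(X[0], X[1])
raw_02 = euclidean(X[0], X[2])

# Min-max scaling with the assumed feature ranges
x_min = np.array([1.0, 1.0])
x_max = np.array([10.0, 100000.0])
X_norm = (X - x_min) / (x_max - x_min)

norm_01 = euclidean(X_norm[0], X_norm[1])
norm_02 = euclidean(X_norm[0], X_norm[2])

print('raw distances:   ', raw_01, raw_02)
print('scaled distances:', norm_01, norm_02)
```

On the raw data, sample 1 looks like the nearest neighbor of sample 0 even though it sits at the opposite end of the first feature's range; after scaling, sample 2 is closer.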

In [40]:
mms = MinMaxScaler()
X_train_norm = mms.fit_transform(X_train)
X_test_norm = mms.transform(X_test)

In [41]:
stdsc = StandardScaler()
X_train_std = stdsc.fit_transform(X_train)
X_test_std = stdsc.transform(X_test)

Example:

In [43]:
ex = np.array([0, 1, 2, 3, 4, 5])

print('standardized:', (ex - ex.mean()) / ex.std())

print('normalized:', (ex-ex.min()) / (ex.max() - ex.min()))

standardized: [-1.46385011 -0.87831007 -0.29277002  0.29277002  0.87831007  1.46385011]
normalized: [0.  0.2 0.4 0.6 0.8 1. ]
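Standardization centers each feature column at mean 0 with standard deviation 1. As a quick check, the sketch below compares the manual formulas with `StandardScaler` and `MinMaxScaler` on the same toy array (reshaped into a single feature column, since the scalers expect 2D input):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One feature column with six samples
ex = np.array([0, 1, 2, 3, 4, 5], dtype=float).reshape(-1, 1)

# Manual versions; np.std uses the population standard deviation by default
manual_std = (ex - ex.mean()) / ex.std()
manual_norm = (ex - ex.min()) / (ex.max() - ex.min())

# scikit-learn versions
sk_std = StandardScaler().fit_transform(ex)
sk_norm = MinMaxScaler().fit_transform(ex)

print('standardization matches:', np.allclose(manual_std, sk_std))
print('normalization matches:  ', np.allclose(manual_norm, sk_norm))
```

Note that the scalers are fitted on the training data only and then reused to transform the test data, as in the cells above.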


## Selecting Meaningful Features

* If we notice that a model performs much better on the training dataset than on the test dataset, this observation is a strong indicator of `overfitting`.

* Overfitting means that the model fits its parameters too closely to the particular observations in the training dataset and does not generalize well to new data; we say that the model has `high variance`.

* Common solutions to reduce the generalization error are listed as follows:
  * Collect more training data
  * Introduce a penalty for complexity via regularization
  * Choose a simpler model with fewer parameters
  * Reduce the dimensionality of the data

### Sparse solutions with L1 regularization

* L2 regularization is one approach to reduce the complexity of a model by penalizing large individual weights, where we defined the L2 norm of our weight vector w as follows:

    $L2: ||w||^2_2 = \sum^m_{j=1}w^2_j$
    
* Another approach to reduce the model complexity is the related L1 regularization:
    
    $L1: ||w||_1 = \sum^m_{j=1}|w_j|$
    
* In contrast to L2 regularization, L1 regularization usually yields sparse weight vectors; many feature weights will be exactly zero.

* Sparsity can be useful in practice if we have a high-dimensional dataset with many features that are irrelevant, especially in cases where we have more irrelevant dimensions than samples.

* By increasing the regularization strength via the regularization parameter $\lambda$, we shrink the weights towards zero and decrease the dependence of our model on the training data.
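As a small numeric illustration of the two penalty terms, the sketch below evaluates both for an arbitrary weight vector `w` (the values are made up for the example):

```python
import numpy as np

# Arbitrary example weights; one of them is already zero
w = np.array([0.5, -2.0, 0.0, 1.5])

l2_penalty = np.sum(w ** 2)      # squared L2 norm: sum of squared weights
l1_penalty = np.sum(np.abs(w))   # L1 norm: sum of absolute weights

print('L2 penalty:', l2_penalty)
print('L1 penalty:', l1_penalty)
```

The squared term makes the L2 penalty grow fastest for large weights, while the L1 penalty grows at the same rate for every non-zero weight.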

For regularized models in scikit-learn that support L1 regularization, we can simply set the `penalty` parameter to `'l1'` to obtain a sparse solution:

In [46]:
LogisticRegression(penalty='l1')

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l1', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [47]:
lr = LogisticRegression(penalty='l1', C = 1.0)
#C = 1.0 is the default. C is the inverse of the regularization strength:
#decrease it to make the regularization effect stronger, increase it to make it weaker

lr.fit(X_train_std, y_train)

print('Training accuracy:', lr.score(X_train_std, y_train))
print('Test accuracy:', lr.score(X_test_std, y_test))

Training accuracy: 1.0
Test accuracy: 1.0




In [48]:
lr.intercept_

array([-1.26389476, -1.21596124, -2.36998624])

In [49]:
np.set_printoptions(8)

lr.coef_[lr.coef_!=0].shape

(23,)

In [50]:
lr.coef_

array([[ 1.24636871,  0.17996748,  0.74614654, -1.16369832,  0.        ,
         0.        ,  1.15896787,  0.        ,  0.        ,  0.        ,
         0.        ,  0.55733313,  2.50886608],
       [-1.53667392, -0.38785074, -0.99490211,  0.36516046, -0.05983875,
         0.        ,  0.66820857,  0.        ,  0.        , -1.93480065,
         1.23286534,  0.        , -2.23175868],
       [ 0.13565138,  0.1684093 ,  0.35717469,  0.        ,  0.        ,
         0.        , -2.43831278,  0.        ,  0.        ,  1.56378977,
        -0.81919538, -0.49207541,  0.        ]])
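To see how `C` controls sparsity, the sketch below counts the non-zero weights for several values of `C`. It rests on a few assumptions that are not part of the cells above: the Wine data is loaded through scikit-learn's bundled `load_wine` instead of the UCI URL, the `liblinear` solver is requested explicitly, and the one-vs-rest scheme is made explicit with `OneVsRestClassifier`. The helper `count_nonzero_weights` is hypothetical.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import StandardScaler

# Bundled copy of the Wine data (assumption: same data as the UCI file)
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

X_train_std = StandardScaler().fit_transform(X_train)

def count_nonzero_weights(C):
    """Fit one L1-penalized binary model per class and count non-zero weights."""
    base = LogisticRegression(penalty='l1', C=C, solver='liblinear')
    ovr = OneVsRestClassifier(base).fit(X_train_std, y_train)
    coef = np.vstack([est.coef_ for est in ovr.estimators_])  # shape (3, 13)
    return int(np.count_nonzero(coef))

# Smaller C means stronger regularization
nonzero_by_C = {C: count_nonzero_weights(C) for C in (0.01, 0.1, 1.0, 10.0)}
print(nonzero_by_C)
```

The exact counts depend on the scikit-learn version and solver, so they may differ from the 23 non-zero weights reported above.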

In [None]:
fig = plt.figure