# Building Good Training Sets

## Dealing with missing data

Let's generate a CSV with some missing values to see how to deal with them.

In [2]:
import pandas as pd
from io import StringIO

csv_data = """A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,"""
df = pd.read_csv(StringIO(csv_data))
df

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0
1,5.0,6.0,,8.0
2,10.0,11.0,12.0,


In [3]:
df.isnull().sum()

A    0
B    0
C    1
D    1
dtype: int64

### Eliminating missing data

One of the easiest ways to deal with missing data is to simply drop the rows (`dropna()`) or the columns (`dropna(axis=1)`) that contain missing values.

In [4]:
df.dropna()

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0


In [5]:
df.dropna(axis=1)

Unnamed: 0,A,B
0,1.0,2.0
1,5.0,6.0
2,10.0,11.0
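
`dropna` accepts a few more arguments that control which rows are discarded. The sketch below rebuilds the same small DataFrame and tries three of them; the variable names are only illustrative.

```python
import pandas as pd
from io import StringIO

csv_data = """A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,"""
df = pd.read_csv(StringIO(csv_data))

# drop only rows in which every value is NaN (none here, so all rows stay)
all_nan_dropped = df.dropna(how="all")

# keep only rows that have at least 4 non-NaN values
at_least_four = df.dropna(thresh=4)

# drop rows only when NaN appears in a specific column (here: 'C')
c_present = df.dropna(subset=["C"])

print(all_nan_dropped.shape)      # (3, 4)
print(list(at_least_four.index))  # [0]
print(list(c_present.index))      # [0, 2]
```

Dropping is convenient, but every removed row or column may take useful information with it, which is why imputation is often preferred.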


### Imputing missing values

One of the most commonly used techniques is **mean imputation**, in which we replace each missing value with the mean of its feature column.

In [6]:
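import numpy as np
from sklearn.impute import SimpleImputer

# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22;
# SimpleImputer always imputes column-wise, so there is no axis argument
imr = SimpleImputer(missing_values=np.nan, strategy="mean")
imr = imr.fit(df.values)
imputed_data = imr.transform(df.values)
imputed_data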

array([[  1. ,   2. ,   3. ,   4. ],
       [  5. ,   6. ,   7.5,   8. ],
       [ 10. ,  11. ,  12. ,   6. ]])

Other available strategies are **median** and **most_frequent**; the latter replaces each missing value with the most common value in the column.
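
For a quick check without scikit-learn, the same column-wise imputation can be sketched in plain pandas. This is only an illustration on the toy DataFrame from above, not a replacement for a fitted imputer that can be reused on new data.

```python
import pandas as pd
from io import StringIO

csv_data = """A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,"""
df = pd.read_csv(StringIO(csv_data))

# df.mean() skips NaN by default, so each mean uses only the observed values
col_means = df.mean()

# passing a Series to fillna fills every column with the value at its label
df_mean_imputed = df.fillna(col_means)

# the median strategy works the same way
df_median_imputed = df.fillna(df.median())

print(df_mean_imputed.loc[1, "C"])  # 7.5
print(df_mean_imputed.loc[2, "D"])  # 6.0
```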

## Handling Categorical Data

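Categorical features hold category names rather than numbers, and they come in two kinds: **ordinal** and **nominal**. Ordinal features can be sorted or ordered (for example, T-shirt sizes, where XL > L > M), whereas nominal features have no inherent order (for example, colors).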

Let's generate a dataset with categorical data.

In [7]:
df = pd.DataFrame([["green", "M", 10.1, "class1"], 
                   ["red", "L", 13.5, "class2"],
                   ["blue", "XL", 15.3, "class1"]])
df.columns = ["color", "size", "price", "classlabel"]
df

Unnamed: 0,color,size,price,classlabel
0,green,M,10.1,class1
1,red,L,13.5,class2
2,blue,XL,15.3,class1


### Mapping Ordinal Features

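To make sure that a model interprets ordinal features correctly, we have to convert their string values to integers. No function can derive the order of the sizes automatically, so we define the mapping by hand: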

In [9]:
size_mapping = {'XL': 3, 'L': 2, 'M': 1}
df['size'] = df['size'].map(size_mapping)
df

Unnamed: 0,color,size,price,classlabel
0,green,1,10.1,class1
1,red,2,13.5,class2
2,blue,3,15.3,class1


To turn the integer values back into the original labels, we can define a reverse mapping dictionary:

In [10]:
inv_size_mapping = {v: k for k, v in size_mapping.items()}
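
As a small usage sketch, the reverse dictionary can be passed to `map` in the same way as the forward one. The standalone `sizes` Series is a stand-in for the `size` column used above.

```python
import pandas as pd

size_mapping = {"XL": 3, "L": 2, "M": 1}
inv_size_mapping = {v: k for k, v in size_mapping.items()}

sizes = pd.Series(["M", "L", "XL"])      # stand-in for df['size']
encoded = sizes.map(size_mapping)        # strings -> integers
decoded = encoded.map(inv_size_mapping)  # integers -> strings

print(list(encoded))  # [1, 2, 3]
print(list(decoded))  # ['M', 'L', 'XL']
```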

### Encoding Class Labels

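Many machine learning libraries require class labels to be encoded as integers. Most scikit-learn estimators convert class labels to integers internally, but it is good practice to encode them ourselves to avoid any possible issue. Class labels are not ordinal, so it does not matter which integer we assign to which label.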

In [11]:
import numpy as np

class_mapping = {label: idx for idx, label in enumerate(np.unique(df['classlabel']))}
class_mapping

{'class1': 0, 'class2': 1}

In [12]:
df['classlabel'] = df['classlabel'].map(class_mapping)
df

Unnamed: 0,color,size,price,classlabel
0,green,1,10.1,0
1,red,2,13.5,1
2,blue,3,15.3,0


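Alternatively, we can use the `LabelEncoder` class. Note that `classlabel` was already mapped to integers in the previous step, so the encoder receives integers below and `inverse_transform` returns integers as well.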

In [13]:
from sklearn.preprocessing import LabelEncoder

class_le = LabelEncoder()
y = class_le.fit_transform(df['classlabel'].values)
y

array([0, 1, 0], dtype=int64)

In [14]:
class_le.inverse_transform(y)

array([0, 1, 0], dtype=int64)
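
Applied to the original string labels rather than to already-encoded integers, the round trip is easier to see. The following sketch uses a small hand-written `labels` array in place of the DataFrame column.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(["class1", "class2", "class1"])

class_le = LabelEncoder()
# the sorted unique labels are assigned the integers 0, 1, ...
y = class_le.fit_transform(labels)
# inverse_transform maps the integers back to the original strings
restored = class_le.inverse_transform(y)

print(y.tolist())                  # [0, 1, 0]
print(class_le.classes_.tolist())  # ['class1', 'class2']
print(restored.tolist())           # ['class1', 'class2', 'class1']
```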

### One-hot encoding

Integer mapping works well for ordinal features but not for nominal ones. If we encoded the colors as 0, 1, 2, a learning algorithm would assume that the values are ordered and differ in magnitude, for example that one color is larger than another. To avoid this pitfall, we use **one-hot encoding**: we create one dummy feature for every unique value of the nominal feature.

In [17]:
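from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

X = df[['color', 'size', 'price']].values
# OneHotEncoder's categorical_features argument was removed in scikit-learn 0.22;
# ColumnTransformer encodes column 0 and passes the other columns through
ct = ColumnTransformer([('onehot', OneHotEncoder(), [0])], remainder='passthrough')
ct.fit_transform(X).astype(float)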

array([[  0. ,   1. ,   0. ,   1. ,  10.1],
       [  0. ,   0. ,   1. ,   2. ,  13.5],
       [  1. ,   0. ,   0. ,   3. ,  15.3]])
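
To see what the encoder does under the hood, here is a minimal NumPy sketch of the same idea. It is not how scikit-learn implements it, just the "one dummy column per unique value" rule written out.

```python
import numpy as np

colors = np.array(["green", "red", "blue"])

# categories are the sorted unique values; codes index into that array
categories, codes = np.unique(colors, return_inverse=True)

# row i of the identity matrix is the one-hot vector for category i
one_hot = np.eye(len(categories))[codes]

print(categories.tolist())  # ['blue', 'green', 'red']
print(one_hot.tolist())     # [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
```

The three dummy columns correspond to blue, green and red, in the same order as the first three columns of the array above.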

An even more convenient way to get a one-hot encoding is the `get_dummies` function in pandas, which converts only the string columns and leaves numeric columns unchanged.

In [18]:
pd.get_dummies(df[['price', 'color', 'size']])

Unnamed: 0,price,size,color_blue,color_green,color_red
0,10.1,1,0,1,0
1,13.5,2,0,0,1
2,15.3,3,1,0,0
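
One caveat, sketched below on a hand-built copy of the data: recent pandas releases return boolean dummy columns by default, so passing `dtype=int` is one way to get the 0/1 values shown above.

```python
import pandas as pd

df = pd.DataFrame({
    "price": [10.1, 13.5, 15.3],
    "color": ["green", "red", "blue"],
    "size": [1, 2, 3],
})

# dtype=int forces 0/1 dummy columns instead of True/False
dummies = pd.get_dummies(df, dtype=int)

# numeric columns come first, followed by one column per color
print(list(dummies.columns))
# ['price', 'size', 'color_blue', 'color_green', 'color_red']
print(dummies["color_green"].tolist())  # [1, 0, 0]
```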


## Partitioning a dataset into training and test sets

We will use the **Wine** dataset to demonstrate various feature selection techniques.

In [19]:
df_wine = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data', header=None)
df_wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash', 
                   'Alcalinity of ash', 'Magnesium', 'Total phenols',
                   'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins', 
                   'Color intensity', 'Hue', 'OD280/OD315 of diluted wines',
                   'Proline']
print('Class labels', np.unique(df_wine['Class label']))

Class labels [1 2 3]


In [20]:
df_wine.head()

Unnamed: 0,Class label,Alcohol,Malic acid,Ash,Alcalinity of ash,Magnesium,Total phenols,Flavanoids,Nonflavanoid phenols,Proanthocyanins,Color intensity,Hue,OD280/OD315 of diluted wines,Proline
0,1,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050
2,1,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185
3,1,14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480
4,1,13.24,2.59,2.87,21.0,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735


The samples belong to three different classes referring to three different cultivars in Italy.

To split the dataset into training and test sets, we can use the `train_test_split` function.

In [21]:
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split

X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
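
The effect of `test_size` and `random_state` is easy to check on a toy array. This sketch assumes a scikit-learn release in which `train_test_split` lives in `sklearn.model_selection`, and the ten-sample data is made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
y = np.array([0, 1] * 5)

# test_size=0.3 puts 30% of the samples into the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)

# the same random_state reproduces exactly the same split
X_train2, X_test2, _, _ = train_test_split(
    X, y, test_size=0.3, random_state=0)
print(np.array_equal(X_test, X_test2))  # True
```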

## Feature Scaling

There are two common approaches to feature scaling: **normalization** and **standardization**. Normalization refers to the rescaling of the feature to a range from 0 to 1. To normalize data we would apply the following logic:

$$
x_{norm}^{(i)}=\frac{x^{(i)}-x_{min}}{x_{max}-x_{min}}
$$

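Here $x^{(i)}$ is a particular sample, $x_{min}$ is the smallest value in the feature column, and $x_{max}$ is the largest. We can take advantage of the `MinMaxScaler` class from scikit-learn: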

In [22]:
from sklearn.preprocessing import MinMaxScaler

mms = MinMaxScaler()
X_train_norm = mms.fit_transform(X_train)
X_test_norm = mms.transform(X_test)
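
The min-max formula can also be written out directly in NumPy. The tiny arrays below are invented for illustration; the point is that the minimum and maximum come from the training data only and are then reused for the test data.

```python
import numpy as np

X_train = np.array([[1.0, 10.0],
                    [2.0, 20.0],
                    [3.0, 30.0]])
X_test = np.array([[4.0, 15.0]])

# column-wise minimum and maximum, estimated on the training set only
x_min = X_train.min(axis=0)
x_max = X_train.max(axis=0)

X_train_norm = (X_train - x_min) / (x_max - x_min)
X_test_norm = (X_test - x_min) / (x_max - x_min)

print(X_train_norm)  # every column now spans exactly 0 to 1
print(X_test_norm)   # values 1.5 and 0.25
```

Test values that lie outside the training range, such as the 4.0 here, end up outside the interval from 0 to 1.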

Normalization is useful when we need values in a bounded interval, but standardization can be more practical for many machine learning algorithms. The reason is that many linear models initialize the weights to 0 or to small random values close to 0. Centering each feature column at 0 with a standard deviation of 1 makes it easier to learn the weights.

Standardization can be explained with the following equation:

$$
x_{std}^{(i)}=\frac{x^{(i)}-\mu_x}{\sigma_x}
$$

Here $\mu_x$ is the sample mean of a feature and $\sigma_x$ the corresponding standard deviation.

In [23]:
from sklearn.preprocessing import StandardScaler

stdsc = StandardScaler()
X_train_std = stdsc.fit_transform(X_train)
X_test_std = stdsc.transform(X_test)

**N.B.:** We fit the `StandardScaler` only once, on the training data, and then use the same parameters to transform the test set and any new data point.
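
Written out by hand, standardization with training-set parameters looks roughly like this. The one-column arrays are made up, and the population standard deviation (NumPy's default, `ddof=0`) is used because that is what `StandardScaler` computes.

```python
import numpy as np

X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
X_test = np.array([[6.0]])

# parameters are estimated on the training set only
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)  # population standard deviation (ddof=0)

X_train_std = (X_train - mu) / sigma
# the test set is transformed with the training mean and deviation
X_test_std = (X_test - mu) / sigma

print(X_train_std.mean(axis=0), X_train_std.std(axis=0))  # approximately 0 and 1
print(X_test_std)  # (6 - 3) / sqrt(2), roughly 2.12
```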

## Selecting features

