## Common preprocessing steps

Most classifiers in scikit-learn expect the training and test data tables to contain only numbers, with no missing values. Raw data is seldom in this format. We will look at a few pandas functions that are handy for converting a raw table to the required format. For illustration, we use a toy dataset on T-shirts. The table has four features: Brand, Size, Color, and Price.

In [1]:
import pandas as pd

In [2]:
master_data = pd.read_csv('misc/tshirts.csv')

We see that the data has 19 rows. Columns `Brand` and `Price` have 1 and 2 null values, respectively.

In [3]:
master_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19 entries, 0 to 18
Data columns (total 4 columns):
Brand    18 non-null object
Size     19 non-null object
Color    19 non-null object
Price    17 non-null float64
dtypes: float64(1), object(3)
memory usage: 688.0+ bytes


In [4]:
master_data.sample(5)

| | Brand | Size | Color | Price |
|---|---|---|---|---|
| 13 | Puma | M | Blue | 2400.0 |
| 8 | Arrow | M | Blue | 2400.0 |
| 16 | Puma | XXL | Black | 2450.0 |
| 3 | Arrow | M | Blue | NaN |
| 11 | Arrow | XL | Red | 1200.0 |


### Coping with missing values

#### Method 1: Drop them

The easiest way to deal with missing values is to drop the rows that contain them. This is a viable option if the number of such rows is small. In the code below, we drop the rows where either the `Brand` or the `Price` information is missing.

In [5]:
df = master_data.copy()

In [6]:
df.shape

(19, 4)

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19 entries, 0 to 18
Data columns (total 4 columns):
Brand    18 non-null object
Size     19 non-null object
Color    19 non-null object
Price    17 non-null float64
dtypes: float64(1), object(3)
memory usage: 688.0+ bytes


In [8]:
df.dropna( subset=['Brand','Price'] , inplace = True)

In [9]:
df

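| | Brand | Size | Color | Price |
|---|---|---|---|---|
| 0 | Adidas | XXL | Blue | 1400.0 |
| 1 | Adidas | XL | Red | 1200.0 |
| 2 | Arrow | XXL | Black | 2450.0 |
| 5 | Adidas | L | Green | 1200.0 |
| 6 | Arrow | XL | Blue | 2400.0 |
| 7 | Arrow | M | Blue | 1400.0 |
| 8 | Arrow | M | Blue | 2400.0 |
| 9 | Puma | XL | Blue | 2400.0 |
| 10 | Puma | M | Blue | 1400.0 |
| 11 | Arrow | XL | Red | 1200.0 |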
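
Before dropping rows, it can help to check how many rows would actually be lost. The following is a minimal sketch on a made-up four-row table (the frame `toy` and its values are hypothetical, not taken from the T-shirt file). It also uses `dropna` without `inplace=True`, which returns a new table instead of modifying the original.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature table; the values are made up for illustration.
toy = pd.DataFrame({
    "Brand": ["Adidas", None, "Puma", "Arrow"],
    "Size": ["M", "L", "XL", "M"],
    "Price": [1200.0, 500.0, np.nan, 2400.0],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Number of rows with a missing value in Brand or Price.
rows_with_missing = toy[["Brand", "Price"]].isna().any(axis=1).sum()

# Without inplace=True, dropna returns a new table and leaves toy untouched.
cleaned = toy.dropna(subset=["Brand", "Price"])

print(missing_per_column)
print(rows_with_missing, "of", len(toy), "rows would be dropped")
print(cleaned)
```

If the count of affected rows is a large fraction of the table, one of the filling methods below is probably a better choice than dropping.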


#### Method 2: Replace with mean

A common way to deal with missing numerical values is to fill them with the mean of the column. We do so using two pandas methods, `fillna` and `mean`. If `f` is a Series object, then `f.fillna(x)` returns a copy of `f` in which the missing values are replaced by `x`. We choose `x` to be `f.mean()`, which skips missing values when computing the mean.

In [10]:
df = master_data.copy()

In [11]:
def fill_missing(f):
    f_mean = f.mean()
    return f.fillna( f_mean )

In [12]:
fill_missing( df['Price'] )

0     1400.000000
1     1200.000000
2     2450.000000
3     1788.235294
4      500.000000
5     1200.000000
6     2400.000000
7     1400.000000
8     2400.000000
9     2400.000000
10    1400.000000
11    1200.000000
12    1200.000000
13    2400.000000
14    2200.000000
15    1788.235294
16    2450.000000
17    1000.000000
18    3200.000000
Name: Price, dtype: float64

In [13]:
df['Price'] = fill_missing(df['Price'])

In [14]:
df['Price']

0     1400.000000
1     1200.000000
2     2450.000000
3     1788.235294
4      500.000000
5     1200.000000
6     2400.000000
7     1400.000000
8     2400.000000
9     2400.000000
10    1400.000000
11    1200.000000
12    1200.000000
13    2400.000000
14    2200.000000
15    1788.235294
16    2450.000000
17    1000.000000
18    3200.000000
Name: Price, dtype: float64

##### Comment about inplace argument

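The function <code>fill_missing</code> does not change the data table <code>df</code>: <code>fillna</code> returns a new Series and leaves the original untouched. To change the price column in <code>df</code>, we explicitly assign the result back with <code>df['Price'] = fill_missing(df['Price'])</code>. The method <code>fillna</code> also accepts the option <code>inplace=True</code>, but when it is called on a column selected from a table, recent pandas versions may not propagate the change to the table, so explicit assignment is the safer choice.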
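
The difference between returning a filled copy and updating the table can be checked directly. This is a small sketch on a hypothetical three-row table (`toy` is made up for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical table with one missing price.
toy = pd.DataFrame({"Price": [1000.0, np.nan, 2000.0]})

# fillna returns a new Series; the mean of the two known prices is 1500.
filled = toy["Price"].fillna(toy["Price"].mean())

# The returned Series has no missing values, but the table still has one.
missing_in_result = filled.isna().sum()
missing_before_assignment = toy["Price"].isna().sum()
print(missing_in_result, missing_before_assignment)

# Assigning the result back is what updates the table.
toy["Price"] = filled
print(toy["Price"].tolist())
```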

#### Method 3: Replace with mean value of a group

A better idea is to fill the missing price of a T-shirt with the mean price of its brand. We use two pandas methods to do this: `groupby` and `transform`. See the explanation below for how `transform` works under the hood.

In [15]:
df = master_data.copy()

In [16]:
df.dropna(subset=['Brand'], inplace=True)

In [17]:
g = df.groupby(['Brand'])['Price']

In [18]:
type(g)

pandas.core.groupby.generic.SeriesGroupBy

In [19]:
g.mean()

Brand
Adidas    1266.666667
Arrow     1841.666667
Puma      2150.000000
Name: Price, dtype: float64

In [20]:
g.transform( fill_missing ) 

0     1400.000000
1     1200.000000
2     2450.000000
3     1841.666667
5     1200.000000
6     2400.000000
7     1400.000000
8     2400.000000
9     2400.000000
10    1400.000000
11    1200.000000
12    1200.000000
13    2400.000000
14    2200.000000
15    1266.666667
16    2450.000000
17    1000.000000
18    3200.000000
Name: Price, dtype: float64

**Explanation:** In the code above, `g` is a `SeriesGroupBy` object: the column `Price` grouped by `Brand`. The function `transform` effectively works as follows:

1. It splits the grouped series `g` into three Series objects, one for each brand.
2. It passes each of the three Series objects, one by one, to the function `fill_missing`.
3. The function `fill_missing` is the same one we wrote earlier. It fills the missing values with the mean and returns a Series object. It is called three times, once for each brand, and each time it receives only the prices of that brand. So `fill_missing` computes the mean for that brand and uses it to fill the missing prices of that brand.
4. The three Series objects returned by `fill_missing` are combined and put back in the original order given by `df['Price']`. The result is a single Series object, which is the output of the line `g.transform(..)`.
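
The four steps above can be imitated by hand. This is only a sketch of the idea on a hypothetical five-row table (`toy` is made up), not how pandas implements `transform` internally.

```python
import numpy as np
import pandas as pd

def fill_missing(f):
    # Same helper as in the text: fill missing values with the mean.
    return f.fillna(f.mean())

# Hypothetical table; each brand has one missing price.
toy = pd.DataFrame({
    "Brand": ["Adidas", "Arrow", "Adidas", "Arrow", "Arrow"],
    "Price": [1000.0, 2000.0, np.nan, np.nan, 3000.0],
})

g = toy.groupby("Brand")["Price"]

# Steps 1 to 3: iterate over the groups and fill each one separately.
pieces = [fill_missing(prices) for _, prices in g]

# Step 4: glue the pieces together and restore the original row order.
by_hand = pd.concat(pieces).reindex(toy.index)

# The built-in route used in the text.
with_transform = g.transform(fill_missing)

print(by_hand.tolist())
print(with_transform.tolist())
```

Both routes fill the Adidas gap with 1000 (the only known Adidas price) and the Arrow gap with 2500 (the mean of 2000 and 3000).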

#### Method 4: Replace with the most frequent item

If the missing values are categorical, a reasonable choice is to replace them with the most frequently occurring value, called the *mode*. The `mode()` method returns a Series containing every value that occurs the maximum number of times. Here the brands Arrow and Puma are tied, each occurring 7 times. Even when the mode is unique, `mode()` still returns a Series, so we use `[0]` to pick the first value. Note that `df` still has the row with the missing brand dropped (from Method 3), so `fillna` has nothing to replace in the output below.

In [21]:
df['Brand'].mode()

0    Arrow
1     Puma
dtype: object

In [22]:
most_frequent = df['Brand'].mode()[0]
df['Brand'].fillna( most_frequent )

0     Adidas
1     Adidas
2      Arrow
3      Arrow
5     Adidas
6      Arrow
7      Arrow
8      Arrow
9       Puma
10      Puma
11     Arrow
12     Arrow
13      Puma
14      Puma
15    Adidas
16      Puma
17      Puma
18      Puma
Name: Brand, dtype: object
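
To see the fill take effect, here is a minimal sketch on a hypothetical Series that does contain a missing brand (the values are made up). It assumes pandas' default behaviour that `mode()` ignores missing values and returns tied modes in sorted order.

```python
import pandas as pd

# Hypothetical brand column with one missing entry and a two-way tie.
brands = pd.Series(["Arrow", "Puma", None, "Puma", "Arrow", "Adidas"])

# mode() returns a Series holding every most-frequent value.
modes = brands.mode()

# Pick the first of the tied modes by position.
most_frequent = modes.iloc[0]

# Replace the missing brand with that value.
filled = brands.fillna(most_frequent)

print(modes.tolist())
print(filled.tolist())
```

When there is a tie, picking the first mode is an arbitrary choice; it is worth checking `len(modes)` before relying on it.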

### How to convert categories to numbers

As noted earlier, scikit-learn classifiers generally expect the data tables to contain only numbers, so we need a numerical representation of categorical data. Functions that help us do this are discussed below.

#### Method 1: Suitable for ordinal features

It is meaningful to order the sizes of T-shirts as $M < L < XL < XXL$. However, we cannot impose any natural ordering on the colors of the T-shirt. So we say 'Size' is an *ordinal* feature and 'Color' is a *nominal* feature. These two types are treated differently. Let us look at how to convert ordinal categories to numbers first.  

In [23]:
df = master_data.copy()

In [24]:
size_num = {'M':1,
            'L':2,
            'XL':3,
            'XXL':4}

The `apply` method takes a function as its argument and runs this function on each value in `Size`.

In [25]:
df['Size'] = df['Size'].apply( lambda x: size_num[x] )

In [26]:
df['Size']

0     4
1     3
2     4
3     1
4     1
5     2
6     3
7     1
8     1
9     3
10    1
11    3
12    2
13    1
14    3
15    4
16    4
17    1
18    2
Name: Size, dtype: int64
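
An alternative to `apply` with a lambda is `Series.map`, which accepts the dictionary directly. The sketch below uses a hypothetical Series that includes a size (`'S'`) missing from the dictionary. With `map`, such a value becomes NaN, whereas the lambda version would raise a `KeyError`.

```python
import pandas as pd

size_num = {"M": 1, "L": 2, "XL": 3, "XXL": 4}

# Hypothetical sizes; "S" is deliberately absent from size_num.
sizes = pd.Series(["M", "XXL", "L", "S"])

# map looks each value up in the dictionary; unknown values become NaN.
encoded = sizes.map(size_num)

# List the labels that could not be converted.
unmapped = sizes[encoded.isna()].unique()

print(encoded.tolist())
print(list(unmapped))
```

Checking `unmapped` after the conversion is a cheap way to catch labels that the dictionary does not cover.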

#### Method 2: Suitable for nominal features

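For a nominal feature like color, we represent each unique color as a (dummy) feature of its own. In the full representation, known as **one-hot encoding**, exactly one of these features has the value 1 for every data point and the rest are 0. In the code below we pass `drop_first=True`, which drops the first category (here Black). A Black T-shirt is then represented by zeros in all the remaining color columns, so no information is lost.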

In [27]:
df = master_data.copy()

In [28]:
df.head()

| | Brand | Size | Color | Price |
|---|---|---|---|---|
| 0 | Adidas | XXL | Blue | 1400.0 |
| 1 | Adidas | XL | Red | 1200.0 |
| 2 | Arrow | XXL | Black | 2450.0 |
| 3 | Arrow | M | Blue | NaN |
| 4 | NaN | M | Blue | 500.0 |


In [29]:
df = pd.get_dummies(df, columns=['Color'] , drop_first=True )

In [30]:
df

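| | Brand | Size | Price | Color_Blue | Color_Green | Color_Red |
|---|---|---|---|---|---|---|
| 0 | Adidas | XXL | 1400.0 | 1 | 0 | 0 |
| 1 | Adidas | XL | 1200.0 | 0 | 0 | 1 |
| 2 | Arrow | XXL | 2450.0 | 0 | 0 | 0 |
| 3 | Arrow | M | NaN | 1 | 0 | 0 |
| 4 | NaN | M | 500.0 | 1 | 0 | 0 |
| 5 | Adidas | L | 1200.0 | 0 | 1 | 0 |
| 6 | Arrow | XL | 2400.0 | 1 | 0 | 0 |
| 7 | Arrow | M | 1400.0 | 1 | 0 | 0 |
| 8 | Arrow | M | 2400.0 | 1 | 0 | 0 |
| 9 | Puma | XL | 2400.0 | 1 | 0 | 0 |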
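
The effect of `drop_first` is easiest to see by comparing the two encodings side by side. This is a minimal sketch on a hypothetical four-row table (`toy` is made up). Passing `dtype=int` is an assumption made here to get 0/1 columns, since newer pandas versions produce boolean columns by default.

```python
import pandas as pd

# Hypothetical table with one T-shirt of each color.
toy = pd.DataFrame({"Color": ["Blue", "Red", "Black", "Green"]})

# Full one-hot encoding: one column per color.
full = pd.get_dummies(toy, columns=["Color"], dtype=int)

# Reduced encoding: the first category (Black) is dropped.
reduced = pd.get_dummies(toy, columns=["Color"], drop_first=True, dtype=int)

print(full.columns.tolist())
print(full.sum(axis=1).tolist())     # every row has exactly one 1
print(reduced.columns.tolist())
print(reduced.sum(axis=1).tolist())  # the Black row has no 1 at all
```

In the reduced table, the Black row sums to 0 while every other row sums to 1, which matches row 2 of the output above.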
