## Chapter 8 Principles of Feature Engineering and Selection

# 8.1 Transforming non-numerical features

Before any machine learning paradigm can 

In this brief Section we discuss:

- Categorical feature one-hot encoding

- Filling in NaN values

In [6]:
import pandas as pd
import numpy as np
import sys
import csv

## 8.1.1  Handling categorical features

Let's assume we want to classify individuals into two classes, affluent and not-affluent, using data that has, among many other features, the *most commonly used method of transportation* as an input feature with four possible outcomes: **walking**, **biking**, **driving**, and **public transportation**. How can we translate these outcomes into numbers a computer can work with? The first approach anyone might guess is to assign a distinct number to each outcome, e.g., 1 to walking, 2 to biking, 3 to driving, and 4 to public transportation. Seems easy enough!

<img src="../../mlrefined_images/superlearn_images/dummy_1.png" width=650 height=450/>

There is, however, one issue with this approach. Imagine there are two individuals in our dataset, both belonging to the not-affluent class, who differ from each other only in their most commonly used method of transportation. Let's call them Trey and Matt. Trey walks to work every day while Matt takes the bus. Recall from our discussion of histogram features for real data types in [Section 4.6 of the book](http://media.wix.com/ugd/f09e45_cc1cba3852eb40da8395636f34c755fa.pdf) that

<img src="../../mlrefined_images/superlearn_images/quote.png" width=550 height=450/>

That is, we generally want instances from the same class to stay close to each other in the feature space and far away from instances of the other class(es). The current encoding of the transportation feature does not satisfy this desire, at least not for Trey and Matt, who are (with the current encoding) maximally distant from one another: their codes, 1 and 4, are the two extremes of the range. But also remember that we assigned the numbers 1 through 4 to the four outcomes arbitrarily. We could have instead encoded the outcomes as follows

<img src="../../mlrefined_images/superlearn_images/dummy_2.png" width=650 height=450/>

which, one could argue, better represents the data since *"not-affluent individuals are more likely to walk or use public transportation and less likely to drive their own vehicle"*. Regardless of how true or trustworthy this statement is, one thing is clear: we need a better and more general way of encoding categorical features, one that does not rely on our intuition or preconceived biases.
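To make the arbitrariness concrete, here is a minimal sketch comparing two integer encodings. The first is the 1-through-4 assignment described above; the second, `encoding_b`, is a hypothetical reordering made up for illustration and is not necessarily the one shown in the figure.

```python
# Two arbitrary integer encodings of the same four outcomes.
# encoding_a is the assignment from the text; encoding_b is a hypothetical
# alternative ordering chosen only for illustration.
encoding_a = {"walking": 1, "biking": 2, "driving": 3, "public transportation": 4}
encoding_b = {"walking": 1, "public transportation": 2, "biking": 3, "driving": 4}


def distance(encoding, first, second):
    """Absolute difference between the integer codes of two outcomes."""
    return abs(encoding[first] - encoding[second])


# Trey walks, Matt takes public transportation.
dist_a = distance(encoding_a, "walking", "public transportation")
dist_b = distance(encoding_b, "walking", "public transportation")
print(dist_a, dist_b)  # 3 1
```

The distance between the same two individuals changes from 3 to 1 purely because the labels were renumbered, which is exactly the dependence on arbitrary choices we would like to avoid.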

Fortunately there is a simple way of fixing this issue. Instead of assigning a unique integer to each of the four outcomes, we can replace the transportation feature with four new 'dummy' features:

* Is the most commonly used method of transportation, **walking**? 1 for yes, &nbsp; 0 for no

* Is the most commonly used method of transportation, **biking**? &nbsp;&nbsp; 1 for yes, &nbsp; 0 for no

* Is the most commonly used method of transportation, **driving**? &nbsp; 1 for yes, &nbsp; 0 for no

* Is the most commonly used method of transportation, **public transportation**? &nbsp; 1 for yes,&nbsp; 0 for no

This way the transportation feature will be encoded using these dummy features as a binary string of length four, with exactly one '1' and three '0's:  


<img src="../../mlrefined_images/superlearn_images/Trey_Matt.png" width=475 align=left height=450/>

This method of encoding categorical features is sometimes referred to as *one-hot encoding*. Note that with this approach all possible outcomes are equidistant from one another, at the cost of replacing the original feature with several dummy variables. Now that we know how to handle categorical features, let's use this method to prepare a real dataset for classification.
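The equidistance claim can be checked numerically. The sketch below one-hot encodes the four transportation outcomes with `pandas.get_dummies` and computes all pairwise Euclidean distances; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

outcomes = pd.Series(["walking", "biking", "driving", "public transportation"])

# One row per outcome, one 0/1 column per category
one_hot = pd.get_dummies(outcomes, dtype=int).to_numpy()

# Pairwise Euclidean distances between the four one-hot vectors
diffs = one_hot[:, None, :] - one_hot[None, :, :]
pairwise = np.sqrt((diffs ** 2).sum(axis=-1))

# Every pair of distinct outcomes differs in exactly two coordinates,
# so every off-diagonal distance equals sqrt(2).
off_diagonal = pairwise[~np.eye(4, dtype=bool)]
print(np.allclose(off_diagonal, np.sqrt(2)))  # True
```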

## Census income data

The [census income dataset](https://archive.ics.uci.edu/ml/datasets/Census+Income) comprises 6 numerical and 8 categorical features (listed below) on 32561 individuals, along with a binary label: '**<=50K**' indicating that the individual makes at most $50K annually, and '**>50K**' indicating otherwise.

* feature 0: **age** (numerical)
* feature 1: **type of work** (categorical)
* feature 2: **final weight determined by census** (numerical)
* feature 3: **education level** (categorical)
* feature 4: **number of years of education** (numerical)
* feature 5: **marital status** (categorical)
* feature 6: **occupation** (categorical) 
* feature 7: **relationship** (categorical)
* feature 8: **race** (categorical)
* feature 9: **sex** (categorical)
* feature 10: **capital gain** (numerical)
* feature 11: **capital loss** (numerical)
* feature 12: **work hours per week** (numerical) 
* feature 13: **native country** (categorical)

In [7]:

datapath = '../../mlrefined_datasets/superlearn_datasets/'

csvname = datapath + 'adult.data'
data = pd.read_csv(csvname, header=None)
data.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


In [8]:
data.shape

(32561, 15)
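The numerical/categorical split listed earlier can also be recovered from the column dtypes rather than typed in by hand. The following is a minimal sketch on a toy frame holding a few values from the first two rows of the dataset; `sample`, `categorical_cols`, and `numerical_cols` are illustrative names, and the same two `select_dtypes` calls should apply to the full `data` frame assuming pandas parsed the text columns as non-numeric.

```python
import pandas as pd

# A few values from the first two rows of the dataset, limited to four columns
sample = pd.DataFrame({
    0: [39, 50],                            # age (numerical)
    1: ["State-gov", "Self-emp-not-inc"],   # type of work (categorical)
    4: [13, 13],                            # number of years of education (numerical)
    8: ["White", "White"],                  # race (categorical)
})

# Columns that are not numeric are treated as categorical
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()
numerical_cols = sample.select_dtypes(include="number").columns.tolist()
print(categorical_cols, numerical_cols)  # [1, 8] [0, 4]
```

This heuristic would miss a categorical feature that happens to be stored as integers, so it is a convenience check rather than a replacement for reading the dataset description.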

The last column in the data contains the labels, which we can store as a binary variable in y:

In [9]:
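y = (data[14] == ' >50K')*1  # binary label: 1 if '>50K', else 0 (raw values carry a leading space)
data = data.drop(columns=[14])  # remove the label column from the features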
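The leading space in `' >50K'` comes from the file separating fields with a comma followed by a space. As a hedged alternative, `pandas.read_csv` accepts `skipinitialspace=True`, which strips that space at load time. The sketch below uses two made-up rows in the same layout, not actual records from the dataset.

```python
import io

import pandas as pd

# Two made-up rows in the same ", "-separated layout as the data file
raw = "39, State-gov, <=50K\n52, Private, >50K\n"

as_is = pd.read_csv(io.StringIO(raw), header=None)
stripped = pd.read_csv(io.StringIO(raw), header=None, skipinitialspace=True)

print(repr(as_is[2][1]))     # ' >50K'  (leading space kept)
print(repr(stripped[2][1]))  # '>50K'   (leading space removed)

# With the space stripped, the label comparison no longer needs it
y_toy = (stripped[2] == ">50K").astype(int)
```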

[pandas.get_dummies](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html) makes it very easy to create dummy variables from categorical features:  

In [10]:
dummies = pd.get_dummies(data[8]) # getting dummies for the race feature 
dummies.head()

| | Amer-Indian-Eskimo | Asian-Pac-Islander | Black | Other | White |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 1 |
| 2 | 0 | 0 | 0 | 0 | 1 |
| 3 | 0 | 0 | 1 | 0 | 0 |
| 4 | 0 | 0 | 1 | 0 | 0 |
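One way to read a dummy table like this back into category names is to take, for each row, the column holding the single 1. The sketch below rebuilds the race values of the five rows shown and decodes them with `idxmax`; only two of the five categories occur in this toy series, so only two dummy columns are created.

```python
import pandas as pd

# Race values of the first five individuals, as shown in the table above
race = pd.Series(["White", "White", "White", "Black", "Black"])
dummies = pd.get_dummies(race, dtype=int)

# Each row of a one-hot encoding contains exactly one 1
rows_sum_to_one = bool((dummies.sum(axis=1) == 1).all())

# The column label of that 1 is the original category
decoded = dummies.idxmax(axis=1)
print(decoded.tolist())  # ['White', 'White', 'White', 'Black', 'Black']
```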


As the dataframe above shows, the first three individuals in the data are White while the next two are Black. We now repeat this procedure, this time for all categorical features in the data:

In [11]:
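data.columns = data.columns.astype(str)  # uniform string labels: avoids mixing int and str column names
for i in [1, 3, 5, 6, 7, 8, 9, 13]:      # indices of categorical features
    col = str(i)
    # dtype=int yields 0/1 columns; the prefix keeps dummy names unique across features
    dummies = pd.get_dummies(data[col], prefix='dummy_' + col, dtype=int)
    data = pd.concat([data.drop(columns=[col]), dummies], axis=1)
X = np.asarray(data, dtype=float)        # numeric 2-D array of all features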
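As an alternative sketch, `pandas.get_dummies` can encode several columns in a single call through its `columns` argument, which sidesteps the loop entirely. The toy frame below uses a few values from the rows shown earlier; the column names `age`, `race`, and `sex` are introduced here for readability and are not the integer labels of the loaded data.

```python
import numpy as np
import pandas as pd

# Toy frame with one numerical and two categorical columns
toy = pd.DataFrame({
    "age": [39, 50, 53],
    "race": ["White", "White", "Black"],
    "sex": ["Male", "Male", "Male"],
})

# Encode both categorical columns at once; numerical columns pass through unchanged.
# By default each dummy column is named "<original column>_<category>".
encoded = pd.get_dummies(toy, columns=["race", "sex"], dtype=int)
print(list(encoded.columns))  # ['age', 'race_Black', 'race_White', 'sex_Male']

X_toy = encoded.to_numpy(dtype=float)
print(X_toy.shape)  # (3, 4)
```

Because the default prefix is the original column name, categories that share a label across different features do not collide.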

The 2-D array X now contains all the features and is ready to be passed to a classifier such as logistic regression:

In [12]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X, y)


## 8.1.2  Handling missing feature information