In [5]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import mglearn
from IPython.display import display
%matplotlib inline

# Chapter 4. Representing Data and Engineering Features

So far, we’ve assumed that our data comes in as a two-dimensional array of floating-point numbers, where each column is a [continuous feature](https://www.mathsisfun.com/data/data-discrete-continuous.html) that describes the data points.  
For many applications, this is not how the data is collected.  
A particularly common type of feature is the [categorical feature](http://www.dummies.com/education/math/statistics/types-of-statistical-data-numerical-categorical-and-ordinal/).  
Also known as [discrete features](https://www.mathsisfun.com/data/data-discrete-continuous.html), these are usually not numeric.  
The distinction between categorical features and continuous features is analogous to the distinction between classification and regression, only on the input side rather than the output side.  
Examples of continuous features that we have seen are pixel brightnesses and size measurements of plant flowers.  
Examples of categorical features are the brand of a product, the color of a product, or the department (books, clothing, hardware) it is sold in.  
These are all properties that can describe a product, but they don’t vary in a continuous way.  
A product belongs either in the clothing department or in the books department.  
There is no middle ground between books and clothing, and no natural order for the different categories (books is not greater or less than clothing, hardware is not between books and clothing, etc.).

Regardless of the types of features your data consists of, how you represent them can have an enormous effect on the performance of machine learning models.  
We saw in Chapters 2 and 3 that scaling of the data is important.  
In other words, if you don’t rescale your data (say, to unit variance), then it makes a difference whether you represent a measurement in centimeters or inches.  
We also saw in Chapter 2 that it can be helpful to *augment* your data with additional features, like adding interactions (products) of features or more general polynomials.
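
To make the point about units concrete, here is a minimal sketch. The height measurements are synthetic and the variable names are made up for illustration. It shows that the same measurements expressed in centimeters and in inches differ as raw numbers, but become identical once each is rescaled to unit variance:

```python
import numpy as np

# Synthetic heights in centimeters (illustrative data, not from the book)
rng = np.random.default_rng(0)
heights_cm = rng.normal(loc=170.0, scale=10.0, size=100)
# The same measurements expressed in inches
heights_in = heights_cm / 2.54

# Rescale each representation to unit variance
scaled_cm = heights_cm / heights_cm.std()
scaled_in = heights_in / heights_in.std()

# The raw values differ, the rescaled values do not
print(np.allclose(heights_cm, heights_in))
print(np.allclose(scaled_cm, scaled_in))
```

A model that is sensitive to feature scale would therefore see two different datasets before rescaling and the same dataset afterwards.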

The question of how to represent your data best for a particular application is known as feature engineering, and it is one of the main tasks of data scientists and machine learning practitioners trying to solve real-world problems.  
Representing your data in the right way can have a bigger influence on the performance of a supervised model than the exact parameters you choose.  
In this chapter, we will first go over the important and very common case of categorical features, and then give some examples of helpful transformations for specific combinations of features and models.

## Categorical Variables

As an example, we will use the dataset of adult incomes in the United States, derived from the 1994 census database.  
The task of the adult dataset is to predict whether a worker has an income of over \$50,000 or under \$50,000.  
The features in this dataset include the workers’ ages, how they are employed (self employed, private industry employee, government employee, etc.), their education, their gender, their working hours per week, occupation, and more.  
Table 4-1 shows the first few entries in the dataset:

![title](images/adult_incomes_data.png)

The task is phrased as a classification task with the two classes being income `<=50K` and `>50K`.  
It would also be possible to predict the exact income, and make this a regression task.  
However, that would be much more difficult, and the 50K division is interesting to understand on its own.  
In this dataset, `age` and `hours-per-week` are continuous features, which we know how to treat.  
The `workclass`, `education`, `gender`, and `occupation` features are categorical, however.  
All of them come from a fixed list of possible values, as opposed to a range, and denote a qualitative property, as opposed to a quantity.  
As a starting point, let’s say we want to learn a logistic regression classifier on this data.  
We know from Chapter 2 that a logistic regression makes predictions, $ŷ$, using the following formula:

$ŷ = w[0] * x[0] +w[1] * x[1] + \ldots + w[p] * x[p] + b > 0$

where $w[i]$ and $b$ are coefficients learned from the training set and $x[i]$ are the input features.  
This formula makes sense when $x[i]$ are numbers, but not when $x[2]$ is `"Masters"` or `"Bachelors"`.  
Clearly we need to represent our data in some different way when applying logistic regression.  
The next section will explain how we can overcome this problem.
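
To see the problem in code, here is a minimal sketch of the decision rule above. The coefficients, the intercept, and the `predict` helper are hypothetical values chosen for illustration, not learned from the adult dataset:

```python
def predict(w, x, b):
    """Return True if w[0]*x[0] + ... + w[p]*x[p] + b > 0."""
    decision = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return decision > 0

# Hypothetical coefficients and intercept
w = [0.5, -1.0, 2.0]
b = -0.25

# With purely numeric features the formula is well defined
x_numeric = [1.0, 0.5, 0.2]
print(predict(w, x_numeric, b))

# With a string-valued feature the multiplication is not defined
x_mixed = [39, 40, "Masters"]
try:
    predict(w, x_mixed, b)
except TypeError as error:
    print("Cannot evaluate the formula:", error)
```

The second call fails because there is no meaningful way to multiply a coefficient by the string `"Masters"`.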

### One-Hot-Encoding (Dummy Variables)

By far the most common way to represent categorical variables is using the [one-hot-encoding](https://www.quora.com/What-is-one-hot-encoding-and-when-is-it-used-in-data-science) or *one-out-of-N encoding*, also known as [dummy variables](https://en.wikipedia.org/wiki/Dummy_variable_%28statistics%29).  
The idea behind dummy variables is to replace a categorical variable with one or more new features that can have the values 0 and 1.  
The values 0 and 1 make sense in the formula for linear binary classification (and for all other models in `scikit-learn`), and we can represent any number of categories by introducing one new feature per category, as described here.  
Let’s say for the `workclass` feature we have possible values of `"Government Employee"`, `"Private Employee"`, `"Self Employed"`, and `"Self Employed Incorporated"`.  
To encode these four possible values, we create four new features, called `"Government Employee"`, `"Private Employee"`, `"Self Employed"`, and `"Self Employed Incorporated"`.  
A feature is 1 if `workclass` for this person has the corresponding value and 0 otherwise, so exactly one of the four new features will be 1 for each data point.  
This is why this is called *one-hot* or *one-out-of-N* encoding.

The principle is illustrated in Table 4-2.  
A single feature is encoded using four new features.  
When using this data in a machine learning algorithm, we would drop the original `workclass` feature and only keep the 0-1 features.

![title](images/Table_4-2.png)
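
The encoding rule can be sketched by hand in a few lines. This is an illustrative toy example on a made-up five-row `workclass` column, building each 0-1 feature directly from its definition rather than using a library helper:

```python
import pandas as pd

# Toy column with the four categories from the text (made-up rows)
workclass = pd.Series(
    ["Government Employee", "Private Employee", "Self Employed",
     "Self Employed Incorporated", "Private Employee"],
    name="workclass")

# One new feature per category: 1 where the row has that value, else 0
categories = sorted(workclass.unique())
one_hot = pd.DataFrame(
    {category: (workclass == category).astype(int)
     for category in categories})

print(one_hot)
# Exactly one of the new features is 1 in every row
print((one_hot.sum(axis=1) == 1).all())
```

The original `workclass` column is not part of `one_hot`, which matches the idea of keeping only the 0-1 features.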

**NOTE**  
The one-hot encoding we use is quite similar, but not identical, to the dummy coding used in statistics.  
For simplicity, we encode each category with a different binary feature.  
In statistics, it is common to encode a categorical feature with k different possible values into k–1 features (the last one is represented as all zeros).  
This is done to simplify the analysis (more technically, this will avoid making the data matrix rank-deficient).
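
The rank argument can be checked numerically. The sketch below assumes the model also includes an intercept (a column of ones), which is where the redundancy comes from, and uses a made-up toy column. Note that `pd.get_dummies(..., drop_first=True)` drops the first category rather than the last, which is equivalent for this purpose:

```python
import numpy as np
import pandas as pd

# Toy categorical column (made-up rows)
workclass = pd.Series(
    ["Government Employee", "Private Employee", "Self Employed",
     "Self Employed Incorporated", "Private Employee", "Self Employed"])

# k features (one-hot) versus k-1 features (dummy coding)
full = pd.get_dummies(workclass, dtype=float)
reduced = pd.get_dummies(workclass, drop_first=True, dtype=float)

def with_intercept(features):
    """Prepend a column of ones to the feature matrix."""
    ones = np.ones((len(features), 1))
    return np.hstack([ones, features.to_numpy()])

X_full = with_intercept(full)
X_reduced = with_intercept(reduced)

# The k one-hot columns sum to the intercept column, so one column is redundant
print(X_full.shape[1], np.linalg.matrix_rank(X_full))
# With k-1 columns, the matrix has full column rank
print(X_reduced.shape[1], np.linalg.matrix_rank(X_reduced))
```

With all four one-hot columns plus the intercept, the matrix has five columns but only rank four. Dropping one category removes the redundancy.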

There are two ways to convert your data to a one-hot encoding of categorical variables, using either [pandas](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html) or [scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).  
At the time of writing, using pandas is slightly easier, so let’s go this route.  
First we load the data using pandas from a comma-separated values (CSV) file:

In [6]:
import os
# The file has no headers naming the columns, so we pass
# header=None and provide the column names explicitly in "names".
adult_path = os.path.join(mglearn.datasets.DATA_PATH, "adult.data")
print(adult_path)

/Users/benjamingrove/.pyenv/versions/3.6.1/lib/python3.6/site-packages/mglearn/data/adult.data


In [7]:
data = pd.read_csv(adult_path, header=None, index_col=False,
       names=['age', 'workclass', 'fnlwgt', 'education',
              'education-num', 'marital-status', 'occupation',
              'relationship', 'race', 'gender', 'capital-gain',
              'capital-loss', 'hours-per-week', 'native-country',
              'income'])
# For illustrative purposes, we'll only select some of the columns:
data = data[['age', 'workclass', 'education', 'gender',
             'hours-per-week', 'occupation', 'income']]
# IPython.display allows nice output formatting within the 
# Jupyter notebook:
display(data.head())

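|   | age | workclass | education | gender | hours-per-week | occupation | income |
|---|-----|-----------|-----------|--------|----------------|------------|--------|
| 0 | 39 | State-gov | Bachelors | Male | 40 | Adm-clerical | <=50K |
| 1 | 50 | Self-emp-not-inc | Bachelors | Male | 13 | Exec-managerial | <=50K |
| 2 | 38 | Private | HS-grad | Male | 40 | Handlers-cleaners | <=50K |
| 3 | 53 | Private | 11th | Male | 40 | Handlers-cleaners | <=50K |
| 4 | 28 | Private | Bachelors | Female | 40 | Prof-specialty | <=50K |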


#### Checking string-encoded categorical data