In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import matplotlib

In [2]:
import os
if not os.path.exists('figures'):
    os.makedirs('figures')

# Feature Engineering

The previous sections outline the fundamental ideas of machine learning, but all of the examples assume that you have numerical data in a tidy, `[n_samples, n_features]` format. In the real world, data rarely comes in such a form. With this in mind, one of the more important steps in using machine learning in practice is **feature engineering** - that is, taking whatever information you have about your problem and turning it into numbers that you can use to build your feature matrix.

In this section, we will cover a few common examples of feature engineering tasks: features for representing **categorical data**, features for representing **text**, and features for representing **images**. Additionally, we will discuss **derived features** for increasing model complexity and **imputation** of missing data. Often this process is known as **vectorization**, as it involves converting arbitrary data into well-behaved vectors.

## Categorical Features

One common type of non-numerical data is **categorical** data. For example, imagine you are exploring some data on housing prices, and along with numerical features like **price** and **rooms**, you also have **neighborhood** information. For example, your data might look something like this:

In [4]:
data = [
    {'price': 850000, 'rooms': 4, 'neighborhood': 'Queen Anne'},
    {'price': 700000, 'rooms': 3, 'neighborhood': 'Fremont'},
    {'price': 650000, 'rooms': 3, 'neighborhood': 'Wallingford'},
    {'price': 600000, 'rooms': 2, 'neighborhood': 'Fremont'}
]

You might be tempted to encode this data with a straightforward numerical mapping:

In [5]:
{'Queen Anne': 1, 'Fremont': 2, 'Wallingford': 3}

{'Queen Anne': 1, 'Fremont': 2, 'Wallingford': 3}

It turns out that this is not generally a useful approach in Scikit-Learn: the package's models make the fundamental assumption that numerical features reflect algebraic quantities. Thus such a mapping would imply, for example, that *Queen Anne < Fremont < Wallingford*, or even that *Wallingford - Queen Anne = Fremont*, which (niche demographic jokes aside) does not make much sense.

In this case, one proven technique is to use *one-hot encoding*, which effectively creates extra columns indicating the presence or absence of a category with a value of 1 or 0, respectively. When your data comes as a list of dictionaries, Scikit-Learn's `DictVectorizer` will do this for you:

In [6]:
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer(sparse=False, dtype=int)
vec.fit_transform(data)

array([[     0,      1,      0, 850000,      4],
       [     1,      0,      0, 700000,      3],
       [     0,      0,      1, 650000,      3],
       [     1,      0,      0, 600000,      2]])

Notice that the *neighborhood* column has been expanded into three separate columns, representing the three neighborhood labels, and that each row has a 1 in the column associated with its neighborhood. With these categorical features thus encoded, you can proceed as normal with fitting a Scikit-Learn model.

To see the meaning of each column, you can inspect the feature names:

In [8]:
vec.get_feature_names()  # removed in scikit-learn 1.2; use vec.get_feature_names_out() there

['neighborhood=Fremont',
 'neighborhood=Queen Anne',
 'neighborhood=Wallingford',
 'price',
 'rooms']
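To make the encoding concrete, the same expansion can be sketched by hand in plain Python. This is only an illustration of the idea, not how `DictVectorizer` is implemented; the helper name `one_hot_row` is hypothetical, and the column order is fixed by sorting the labels so that it lines up with the feature names listed above.

```python
# Hand-rolled one-hot encoding of the 'neighborhood' field (illustrative only).
data = [
    {'price': 850000, 'rooms': 4, 'neighborhood': 'Queen Anne'},
    {'price': 700000, 'rooms': 3, 'neighborhood': 'Fremont'},
    {'price': 650000, 'rooms': 3, 'neighborhood': 'Wallingford'},
    {'price': 600000, 'rooms': 2, 'neighborhood': 'Fremont'},
]

# Sorted unique labels determine the order of the indicator columns.
categories = sorted({row['neighborhood'] for row in data})

def one_hot_row(row):
    """Hypothetical helper: indicator columns first, then the numerical features."""
    indicators = [int(row['neighborhood'] == c) for c in categories]
    return indicators + [row['price'], row['rooms']]

encoded = [one_hot_row(row) for row in data]
for row in encoded:
    print(row)
```

Each printed row contains exactly one 1 among its first three entries, which is the defining property of a one-hot encoding.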

There is one clear disadvantage of this approach: if your category has many possible values, this can *greatly* increase the size of your dataset. However, because the encoded data contains mostly zeros, a sparse output can be a very efficient solution:

In [12]:
vec = DictVectorizer(sparse=True, dtype=int)
vec.fit_transform(data)

<4x5 sparse matrix of type '<class 'numpy.int32'>'
	with 12 stored elements in Compressed Sparse Row format>

Many (though not yet all) of the Scikit-Learn estimators accept such sparse inputs when fitting and evaluating models. `sklearn.preprocessing.OneHotEncoder` and `sklearn.feature_extraction.FeatureHasher` are two additional tools that Scikit-Learn includes to support this type of encoding.
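As a minimal sketch of the first of these tools, `OneHotEncoder` can be applied to the neighborhood labels alone. The array `neighborhoods` is an assumption made for this example: it simply restates the neighborhood column from the data above as a 2D array. The encoder is left at its defaults, so it returns a sparse matrix, which is converted to a dense array here purely for inspection.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Categorical column as a 2D array of shape [n_samples, 1]
# (restated from the housing example for illustration).
neighborhoods = np.array([['Queen Anne'], ['Fremont'], ['Wallingford'], ['Fremont']])

enc = OneHotEncoder()                        # sparse output by default
encoded = enc.fit_transform(neighborhoods)

print(enc.categories_)                       # learned labels, one array per input column
print(encoded.toarray())                     # dense view, for inspection only
```

Unlike `DictVectorizer`, this encoder operates on array-like input rather than on a list of dictionaries, so it may be the more natural choice when the categorical data already lives in a column of a table.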

## Text Features

## Image Features

Another common need is to suitably encode **images** for machine learning analysis. The simplest approach is what we used for the digits data in the **Introducing Scikit-Learn** section: simply using the pixel values themselves. But depending on the application, such approaches may not be optimal.
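The pixel-value approach amounts to flattening each image into one row of the feature matrix. The following is a minimal sketch using a randomly generated stand-in for a stack of small grayscale images; the array `images`, its 8×8 shape, and its value range are assumptions chosen for illustration rather than real data.

```python
import numpy as np

# Hypothetical stand-in for 5 grayscale images of 8x8 pixels each.
rng = np.random.default_rng(0)
images = rng.integers(0, 17, size=(5, 8, 8))

# Flatten each image so that every pixel becomes one feature,
# giving the usual [n_samples, n_features] layout.
X = images.reshape(len(images), -1)
print(X.shape)  # (5, 64)
```

Note that flattening discards the explicit two-dimensional arrangement of the pixels, which is one reason this simple representation may not be optimal for every application.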

A comprehensive summary of feature extraction techniques for images is well beyond the scope of this section, but you can find excellent implementations of many of the standard approaches in the Scikit-Image project (http://scikit-image.org). For one example of using Scikit-Learn and Scikit-Image together, see the **Application: A Face Detection Pipeline** section.

## Derived Features

## Imputation of Missing Data

## Feature Pipelines

With any of the preceding examples, it can quickly become tedious to do the transformations by hand, especially if you wish to string together multiple steps. For example, we might want a processing pipeline that looks something like this: