<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/PDSH-cover-small.png">

*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*

*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*

# Feature Engineering

Most of the machine learning examples assume that you have numerical data in a tidy, ``[n_samples, n_features]`` format.

* In the **real world**, data rarely comes in such a form.
* One of the more important steps in using machine learning in practice is **feature engineering**: that is, taking whatever information you have about your problem and **turning it into numbers** that you can use to build your feature matrix.

In this section, we will cover a few common examples of feature engineering tasks:
* features for representing **categorical data**, and
* features for representing <span style="color:blue"> **text**</span>.

Additionally, we will discuss *derived features* for increasing model complexity and *imputation* of **missing data**.

Often this process is known as **vectorization**, as it involves converting arbitrary data into well-behaved vectors.

## Categorical Features

One common type of non-numerical data is *categorical* data. 

For **example**, imagine you are exploring some data on **housing prices**, and along with **numerical features** like "price" and "rooms", you also have "**neighborhood**" information.

Your data might look something like this:

In [1]:
data = [
    {'price': 850000, 'rooms': 4, 'neighborhood': 'Queen Anne'},
    {'price': 700000, 'rooms': 3, 'neighborhood': 'Fremont'},
    {'price': 650000, 'rooms': 3, 'neighborhood': 'Wallingford'},
    {'price': 600000, 'rooms': 2, 'neighborhood': 'Fremont'}
]

#### Wrong solution
You might be tempted to encode this data with a straightforward numerical mapping:

In [2]:
{'Queen Anne': 1, 'Fremont': 2, 'Wallingford': 3};

It turns out that this is not generally a useful approach in Scikit-Learn: the package's models make the fundamental **assumption** that **numerical features reflect algebraic quantities**.  
Thus such a mapping would imply, for example, that **Queen Anne < Fremont < Wallingford**, or even that **Wallingford - Queen Anne = Fremont**, which (niche demographic jokes aside) does not make much sense.
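
To make the problem concrete, here is a minimal sketch of the arithmetic that such an integer mapping silently asserts. The variable names are illustrative and not part of the original example:

```python
# Integer encoding of the three neighborhoods, as in the mapping above
mapping = {'Queen Anne': 1, 'Fremont': 2, 'Wallingford': 3}

# A model that treats these codes as quantities sees an ordering...
implied_order = sorted(mapping, key=mapping.get)
print(implied_order)  # ['Queen Anne', 'Fremont', 'Wallingford']

# ...and treats differences between categories as meaningful
difference = mapping['Wallingford'] - mapping['Queen Anne']
print(difference == mapping['Fremont'])  # True
```

Neither the ordering nor the difference corresponds to anything real about the neighborhoods, which is why a different encoding is needed.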



#### One-hot encoding
In this case, one proven technique is to use **one-hot encoding**, which effectively creates extra columns indicating the presence or absence of a category with a value of 1 or 0, respectively.

When your data comes as a list of dictionaries, Scikit-Learn's ``DictVectorizer`` will do this for you:

In [3]:
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer(sparse=False, dtype=int)
vec.fit_transform(data)

array([[     0,      1,      0, 850000,      4],
       [     1,      0,      0, 700000,      3],
       [     0,      0,      1, 650000,      3],
       [     1,      0,      0, 600000,      2]], dtype=int64)

Notice that 
* the 'neighborhood' column has been expanded into **three separate columns**, representing the three neighborhood labels, and
* **each row has a 1** in the column associated with its neighborhood.

With these categorical features thus encoded, you can proceed as normal with fitting a Scikit-Learn model.

#### Inspect the feature names
To see the meaning of each column, you can inspect the feature names:

In [4]:
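# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
vec.get_feature_names_out()

array(['neighborhood=Fremont', 'neighborhood=Queen Anne',
       'neighborhood=Wallingford', 'price', 'rooms'], dtype=object)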

#### Disadvantage
There is one clear disadvantage of this approach: if your **category has many possible values**, this can *greatly* increase the size of your dataset.

However, because the encoded data contains mostly zeros, a **sparse output** can be a very efficient solution:

In [6]:
vec = DictVectorizer(sparse=True, dtype=int)
vec.fit_transform(data)

<4x5 sparse matrix of type '<class 'numpy.int64'>'
	with 12 stored elements in Compressed Sparse Row format>

Many (though not yet all) of the Scikit-Learn estimators accept such sparse inputs when fitting and evaluating models. ``sklearn.preprocessing.OneHotEncoder`` and ``sklearn.feature_extraction.FeatureHasher`` are two additional tools that Scikit-Learn includes to support this type of encoding.
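
As a rough sketch of the first of these tools, ``OneHotEncoder`` can encode the neighborhood column on its own. The array below is retyped from the housing example, and the variable names are assumptions made for illustration:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# The neighborhood column from the housing example, as a 2D array
# (OneHotEncoder expects input of shape [n_samples, n_features])
neighborhoods = np.array([['Queen Anne'], ['Fremont'],
                          ['Wallingford'], ['Fremont']])

enc = OneHotEncoder()                      # returns a sparse matrix by default
onehot = enc.fit_transform(neighborhoods)

print(enc.categories_[0])   # the categories found for the column, in sorted order
print(onehot.toarray())     # dense view, for inspection only
```

Unlike ``DictVectorizer``, this works on array-like columns rather than a list of dictionaries, so it may be the more natural choice when the data already lives in a NumPy array or a ``DataFrame``.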

## Text Features

Another common need in feature engineering is to **convert text to** a set of representative **numerical values**.

For example, most automatic mining of social media data relies on some form of encoding the text as numbers.
One of the **simplest methods** of encoding data is by **word counts**: you take each snippet of text, count the occurrences of each word within it, and put the results in a table.

#### Example
For example, consider the following set of three phrases:

In [7]:
sample = ['problem of evil',
          'evil queen',
          'horizon problem']

For a vectorization of this data based on word count, we could **construct a column** representing the word "**problem**," the word "**evil**," the word "**horizon**," and so on.  
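
A by-hand version of this idea might look like the sketch below. It assumes that words are separated by whitespace and ignores case and punctuation handling; the variable names are illustrative:

```python
from collections import Counter

sample = ['problem of evil',
          'evil queen',
          'horizon problem']

# Vocabulary: every distinct word, sorted alphabetically
vocabulary = sorted({word for phrase in sample for word in phrase.split()})

# One row per phrase, one column per vocabulary word
counts = []
for phrase in sample:
    word_counts = Counter(phrase.split())
    counts.append([word_counts[word] for word in vocabulary])

print(vocabulary)
for row in counts:
    print(row)
```

Each row of ``counts`` is the word-count vector for one phrase, with columns in the order given by ``vocabulary``.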

#### CountVectorizer
While doing this by hand would be possible, the tedium can be avoided by using Scikit-Learn's ``CountVectorizer``:

In [11]:
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
X = vec.fit_transform(sample)
X

<3x5 sparse matrix of type '<class 'numpy.int64'>'
	with 7 stored elements in Compressed Sparse Row format>

#### Result
The result is a **sparse matrix** recording the number of times each word appears;   
it is easier to inspect if we convert this to a ``DataFrame`` with labeled columns:

In [12]:
import pandas as pd
pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())

   evil  horizon  of  problem  queen
0     1        0   1        1      0
1     1        0   0        0      1
2     0        1   0        1      0


#### Problem
There are some issues with this approach, however: the raw word counts lead to features which put **too much weight on words that appear very frequently**, and this can be sub-optimal in some classification algorithms.


#### Solution
One approach to fix this is known as **term frequency-inverse document frequency** (**TF–IDF**), which weights the word counts by a measure of how often they appear in the documents.


### Term frequency
The **simplest choice** is to use the **raw count** of a term $t$ in a document $d$, i.e.,   
> tf = the number of times that term $t$ occurs in document $d$.

### Inverse document frequency
The inverse document frequency is a measure of **how much information the word provides**, i.e., if it's **common or rare** across all documents.

$$ \mathrm{idf}(t, D) =  \log \frac{N}{1 + |\{d \in D: t \in d\}|}$$

with
* $N$: total number of documents in the corpus $N = {|D|}$
* $|\{d \in D: t \in d\}|$  : number of documents where the term $t$ appears.
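
The formula can be evaluated directly for the three sample phrases. This is a minimal sketch that assumes the natural logarithm (the text does not fix a base) and whitespace tokenization; the helper name ``idf`` is hypothetical:

```python
import math

sample = ['problem of evil',
          'evil queen',
          'horizon problem']

# Each document as a set of its distinct words
documents = [set(phrase.split()) for phrase in sample]
N = len(documents)  # total number of documents in the corpus

def idf(term):
    """idf(t, D) = log(N / (1 + number of documents containing t))."""
    doc_freq = sum(term in doc for doc in documents)
    return math.log(N / (1 + doc_freq))

for term in ['evil', 'of']:
    print(term, round(idf(term), 3))
```

Here "evil" appears in two of the three documents and gets an idf of 0, while "of" appears in only one and gets a positive weight: rarer words are treated as more informative.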

#### TF-IDF in Scikit-Learn
The syntax for computing these features is similar to the previous example:

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer()
X = vec.fit_transform(sample)
pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())

       evil   horizon        of   problem     queen
0  0.517856  0.000000  0.680919  0.517856  0.000000
1  0.605349  0.000000  0.000000  0.000000  0.795961
2  0.000000  0.795961  0.000000  0.605349  0.000000
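
These values do not follow directly from the idf formula given earlier. With its default settings, ``TfidfVectorizer`` uses a smoothed variant, $\ln\frac{1 + N}{1 + \mathrm{df}(t)} + 1$, and then rescales each row to unit Euclidean length. The sketch below reproduces the numbers with plain NumPy, starting from the word counts of the three sample phrases; the variable names are illustrative:

```python
import numpy as np

# Word counts for the three sample phrases
# columns: evil, horizon, of, problem, queen
counts = np.array([[1, 0, 1, 1, 0],
                   [1, 0, 0, 0, 1],
                   [0, 1, 0, 1, 0]])

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)   # number of documents containing each term

# Smoothed idf, as used by TfidfVectorizer with default settings
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

tfidf = counts * idf
tfidf = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)  # L2-normalize rows

print(tfidf.round(6))
```

Because of the smoothing and the added constant, a word that appears in most documents (such as "evil" here) still receives a nonzero weight, only a smaller one than rarer words in the same phrase.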


## Imputation of Missing Data

Another common need in feature engineering is handling of missing data.  
We discussed the handling of missing data in ``DataFrame``s, and saw that often the ``NaN`` value is used to mark missing values.  

For example, we might have a dataset that looks like this:

In [17]:
import numpy as np
from numpy import nan
X = np.array([[ nan, 0,   3  ],
              [ 3,   7,   9  ],
              [ 3,   5,   2  ],
              [ 4,   nan, 6  ],
              [ 8,   8,   1  ]])
y = np.array([14, 16, -1,  8, -5])

When applying a typical machine learning model to such data, we will need to first **replace such missing data** with some **appropriate fill value**.


#### Strategies
This is known as <span style="color:blue">*imputation*</span> of missing values, and strategies range from 
* **simple** (e.g., replacing missing values with the **mean of the column**) to
* **sophisticated** (e.g., using a **robust model** to handle such data).

The sophisticated approaches tend to be very application-specific, and we won't dive into them here.
For a baseline imputation approach, using the mean, median, or most frequent value, Scikit-Learn provides the ``SimpleImputer`` class:

In [20]:
from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy='mean')
X2 = imp.fit_transform(X)
X2

array([[4.5, 0. , 3. ],
       [3. , 7. , 9. ],
       [3. , 5. , 2. ],
       [4. , 5. , 6. ],
       [8. , 8. , 1. ]])

We see that in the resulting data, the two **missing values** have been **replaced with the mean** of the remaining values in the column. 
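
The same fill values can be checked by hand. The following sketch uses ``np.nanmean`` to compute the per-column means while ignoring ``NaN`` entries; it is meant as a sanity check on the idea, not as a replacement for the imputer:

```python
import numpy as np
from numpy import nan

X = np.array([[nan, 0,   3],
              [3,   7,   9],
              [3,   5,   2],
              [4,   nan, 6],
              [8,   8,   1]])

# Column means computed over the non-missing entries only
col_means = np.nanmean(X, axis=0)

# Replace each NaN with the mean of its own column (broadcast across rows)
X_filled = np.where(np.isnan(X), col_means, X)

print(col_means)
print(X_filled)
```

The first column has four observed values (3, 3, 4, 8) with mean 4.5, and the second has four observed values (0, 7, 5, 8) with mean 5, matching the entries filled in above.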

#### Use transformed data
This imputed data can then be fed directly into, for example, a ``LinearRegression`` estimator:

In [21]:
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(X2, y)
model.predict(X2)

array([13.14869292, 14.3784627 , -1.15539732, 10.96606197, -5.33782027])