## Chapter 9: Principles of Feature Engineering and Selection

# 9.1 Data cleaning 

Every machine learning paradigm requires that the data we deal with consist strictly of *numerical values*. However, raw data does not always come pre-packaged in this manner. In this Section we briefly describe two of the most common ways input features of a dataset may violate this numerical necessity: either when a dataset has *missing input values*, or when it consists of *categorical features*.

import pandas as pd
import numpy as np
import sys
import csv

## 9.1.1  Filtering out irrelevant features

- functionally, if all elements of a given input feature are constant the feature has no discriminative power


- mathematically, this simply results in a constant $c$ being added to any model that uses the feature

\begin{equation}
\text{model} + x_{p,n}w_n = \text{model} + c \approx y_p
\end{equation}


- thus such input features can be removed from a dataset
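
The filtering step described above can be sketched in a few lines of pandas. This is a minimal illustration rather than a definitive implementation: the toy dataset and its column names (`sensor_a`, `sensor_b`, `sensor_c`) are made up for the example.

```python
import pandas as pd

# Toy dataset, made up for illustration: 'sensor_b' takes the same
# value for every datapoint and so has no discriminative power.
df = pd.DataFrame({
    "sensor_a": [0.5, 1.2, 3.1, 2.4],
    "sensor_b": [7.0, 7.0, 7.0, 7.0],
    "sensor_c": [1, 0, 1, 1],
})

# A feature is constant when it has at most one distinct value.
n_unique = df.nunique()
constant_features = n_unique[n_unique <= 1].index.tolist()

# Remove the constant features from the dataset.
df_filtered = df.drop(columns=constant_features)

print(constant_features)
print(df_filtered.columns.tolist())
```

Counting distinct values per column is one simple way to detect constant features; checking for zero variance would work equally well for purely numerical columns.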

## 9.1.2  Handling missing feature information

- Real world data can contain *missing values* due to human error in collection, storage issues, faulty sensors, etc.


- If a supervised learning datapoint is missing its *output* value - e.g., if a classification datapoint is missing its *label* - there is little we can do to salvage the datapoint, and we usually throw it away


- However a datapoint with missing *input (feature) values* can be salvaged


- When data is a scarce resource, which can occur in business, statistics, and social science applications, we do not want to just throw away data with missing *input* entries


- If we have a datapoint $\mathbf{x}_p$ with missing entries, we want our machine learning model to (naturally) *ignore* these missing entries since they cannot possibly contribute to learning (they are missing after all)


- However, in order to *feed this datapoint* into a machine learning model we must set these missing entries to some numerical value(s), but what values should we set them to?

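- Values closer to the mean of an input feature are *less* indicative than those further away, so a natural choice is to set each missing entry to the mean of that feature, computed over the datapoints where it is present


- Why? Once a feature is mean-centered, an entry equal to the mean becomes zero, so its term $x_{p,n}w_n$ vanishes from a linear model and the entry is effectively ignored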
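
One way to fill in missing input entries, consistent with the bullets above, is to replace each one with the mean of its feature. The sketch below assumes missing entries are stored as `NaN`; the toy dataset and the column names `height` and `weight` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy dataset, made up for illustration: NaN marks a missing input entry.
df = pd.DataFrame({
    "height": [1.0, np.nan, 3.0, 4.0],
    "weight": [10.0, 20.0, np.nan, 30.0],
})

# Mean of each feature over the entries that are present
# (DataFrame.mean skips NaN values by default).
feature_means = df.mean()

# Replace each missing entry with the mean of its own feature.
df_imputed = df.fillna(feature_means)

# After mean-centering, the imputed entries sit at (numerically) zero,
# so they add nothing to a weighted sum of the input features.
df_centered = df_imputed - df_imputed.mean()

print(df_imputed)
print(df_centered)
```

Note that filling with the feature mean leaves that mean unchanged, which is why the imputed entries land at zero after centering.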

## 9.1.3  Handling categorical features

Let's assume we want to classify individuals into two classes, affluent and not-affluent, using data that has, among many other features, the *most commonly used method of transportation* as an input feature with four possible outcomes: **walking**, **biking**, **driving**, and **public transportation**. How can we translate these outcomes into numbers decipherable by computers? Well, the first approach anyone might guess is to assign a distinct number to each outcome, e.g., 1 to walking, 2 to biking, 3 to driving, and 4 to public transportation. Seems easy enough!

<img src="../../mlrefined_images/superlearn_images/dummy_1.png" width=650 height=450/>

There is however one issue with this approach. Imagine there are two individuals in our dataset, both belonging to the not-affluent class, who differ from each other only in terms of their most commonly used method of transportation. Let's call them Trey and Matt. Trey walks to work every day while Matt takes the bus, so under this encoding Trey's transportation feature is 1 and Matt's is 4.

We generally want instances from the same class to stay close to each other in the feature space and far away from instances of the other class(es). The current encoding of the transportation feature does not satisfy this desire, at least not for Trey and Matt who are - with the current encoding - maximally distant from one another! But also remember we assigned numbers 1 through 4 to the four outcomes arbitrarily. We could have instead encoded the outcomes as follows

<img src="../../mlrefined_images/superlearn_images/dummy_2.png" width=650 height=450/>

which, one could argue, better represents the data since *"not-affluent individuals are more likely to walk or use public transportation and less likely to drive their own vehicle"*. Regardless of how true or trustworthy this statement is, one thing is clear: we need a better and more general way of encoding categorical features that does not rely on our intuition or preconceived biases.
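
To see how much the integer encoding depends on an arbitrary choice, the short sketch below compares the distance between Trey (walking) and Matt (public transportation) under two assignments. The first is the 1 through 4 assignment from the text; the second is a hypothetical alternative chosen only for illustration and is not necessarily the one shown in the figure.

```python
# Integer encoding from the text.
encoding_a = {"walking": 1, "biking": 2, "driving": 3, "public transportation": 4}

# Hypothetical alternative assignment, chosen only for illustration.
encoding_b = {"walking": 1, "public transportation": 2, "biking": 3, "driving": 4}


def distance(encoding, first, second):
    """Distance between two outcomes along the encoded feature axis."""
    return abs(encoding[first] - encoding[second])


# Trey walks, Matt takes public transportation.
dist_a = distance(encoding_a, "walking", "public transportation")
dist_b = distance(encoding_b, "walking", "public transportation")

print(dist_a, dist_b)
```

The same two individuals end up far apart under one assignment and adjacent under the other, even though nothing about the data has changed.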

Fortunately there is a simple way of fixing this issue. Instead of assigning a unique integer to each of the four outcomes, we can replace the transportation feature with four new 'dummy' features:

* Is the most commonly used method of transportation **walking**? 1 for yes, 0 for no

* Is the most commonly used method of transportation **biking**? 1 for yes, 0 for no

* Is the most commonly used method of transportation **driving**? 1 for yes, 0 for no

* Is the most commonly used method of transportation **public transportation**? 1 for yes, 0 for no

This way the transportation feature will be encoded using these dummy features as a binary string of length four, with exactly one '1' and three '0's:  


<img src="../../mlrefined_images/superlearn_images/Trey_Matt.png" width=475 align=left height=450/>

This method of encoding categorical features is sometimes referred to as *one-hot encoding*. Note that, using this approach, all possible outcomes will be equidistant from one another - at the cost of replacing the original feature with several dummy variables. Now that we know how to handle categorical features, let's use this approach to prepare a real dataset for classification.
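
Before moving on to a real dataset, here is a minimal sketch of one-hot encoding on a toy transportation feature, using `pandas.get_dummies`. The four sample entries are made up for illustration (the first two play the roles of Trey and Matt), and the final check confirms that every pair of distinct outcomes is equally far apart.

```python
import numpy as np
import pandas as pd

categories = ["walking", "biking", "driving", "public transportation"]

# Toy feature column, made up for illustration.
# Row 0 plays the role of Trey (walking), row 1 of Matt (public transportation).
transport = pd.Series(
    pd.Categorical(
        ["walking", "public transportation", "driving", "biking"],
        categories=categories,
    ),
    name="transportation",
)

# Replace the single categorical feature with four 0/1 dummy features,
# one column per category, in the order given by `categories`.
one_hot = pd.get_dummies(transport, dtype=int)
print(one_hot)

# Pairwise Euclidean distances between the four encoded datapoints.
X = one_hot.to_numpy().astype(float)
dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

# Distances between distinct outcomes (everything off the diagonal).
off_diagonal = dists[~np.eye(len(categories), dtype=bool)]
print(np.allclose(off_diagonal, np.sqrt(2)))
```

Declaring the column as a `Categorical` with an explicit category list fixes the order of the dummy columns, and guarantees a column for every outcome even if one of them happens not to appear in the data.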