# **Feature Engineering**

One major caveat of our linear models up to this point: they have only used the input variables in their raw, untransformed form
* In other words, we've assumed that our response / predictions could be expressed as a linear combination of the raw variables

We define feature engineering as the process of *transforming* raw features into *more informative features* that can be used in modeling or EDA tasks to improve model performance

### **Feature Functions**
* A feature function describes the transformations we apply to raw features in a dataset in order to create a design matrix of transformed features

We denote this feature function as $\Phi$
* So $\Phi(\mathbb{X})$ is the transformed design matrix ready to be used in modeling 

Here is an example: 


<img src="https://ds100.org/course-notes/feature_engineering/images/phi.png" alt="Image Alt Text" width="700" height="320">

The new features introduced by $\Phi$ can then be used in modeling 
* We often use $\phi_i$ to denote the $i$-th transformed feature produced by feature engineering

$$\begin{align}
\hat{y} &= \theta_0 + \theta_1 x + \theta_2 x^2 \\
\hat{y} &= \theta_0 + \theta_1 \phi_1 + \theta_2 \phi_2
\end{align}$$
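
As a minimal sketch of the idea, the feature function can be written as an ordinary function that maps a raw design matrix to a transformed one. The function name `phi` and the toy data are assumptions made for illustration; they are not part of the original notes.

```python
import numpy as np

def phi(X):
    """Hypothetical feature function: map a single raw feature x to [x, x^2]."""
    x = X[:, 0]
    return np.column_stack([x, x ** 2])

# Made-up raw design matrix with one feature (one row per observation)
X = np.array([[1.0], [2.0], [3.0]])

# Transformed design matrix: first column is phi_1 = x, second is phi_2 = x^2
Phi = phi(X)
print(Phi)
```

Each column of `Phi` plays the role of one $\phi_i$ in the second equation above.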


### **One Hot Encoding**

How do we quantify *non-numeric* features? We can make use of **one hot encoding**, which generates numeric features from categorical data, allowing us to use our usual methods to fit a regression model on the data 

For example, take a `"day"` column in a table. 
* Each entry corresponds to the day of the week, such as Sunday, Monday, etc. 
* For each unique value in `"day"`, we create a new feature column; in each row, we put a $1$ in the column matching that row's day and a $0$ in all the others

<img src="https://ds100.org/course-notes/feature_engineering/images/ohe.png" alt="Image Alt Text" width="700" height="300">

So each category of a categorical variable gets its own feature 
* Value = $1$ if a row belongs to a category 
* Value = $0$ otherwise 
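
The encoding above can be sketched with plain NumPy. The `day` values here are made up for illustration, and this is only one of several ways to build the columns.

```python
import numpy as np

# Made-up "day" column (illustrative data, not from the notes)
day = np.array(["Sun", "Mon", "Sun", "Sat", "Mon"])

# np.unique returns the sorted categories and, for each row,
# the index of that row's category within the sorted list
categories, codes = np.unique(day, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for category i,
# so indexing by `codes` gives one one-hot row per observation
one_hot = np.eye(len(categories), dtype=int)[codes]

print(categories)
print(one_hot)
```

In practice, `pandas.get_dummies` or scikit-learn's `OneHotEncoder` perform the same job on a DataFrame column.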

One thing to keep in mind about one-hot encoded columns is that, in every row, exactly one of them equals $1$, so adding the one-hot encoded columns together gives a column of all $1$s, which is exactly the bias column 
* The bias column is a linear combination of the one hot encoded columns 

There are two options to remedy this: 
1) We can omit a feature column 
2) We can get rid of our bias column 

<img src="https://ds100.org/course-notes/feature_engineering/images/remove.png" alt="Image Alt Text" width="700" height="160">

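Had we not done anything, the columns of $\mathbb{X}$ would be linearly dependent, so $\mathbb{X}^{\top}\mathbb{X}$ would not be invertible and our OLS estimate $\hat{\theta} = (\mathbb{X}^{\top}\mathbb{X})^{-1}\mathbb{X}^{\top}\mathbb{Y}$ would fail 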
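
A small check of this linear dependence, using a made-up one-hot matrix with three categories (the data and variable names are assumptions for illustration):

```python
import numpy as np

# One-hot columns for three categories (made-up rows)
one_hot = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
], dtype=float)
bias = np.ones((one_hot.shape[0], 1))

X_full = np.hstack([bias, one_hot])             # bias + every one-hot column
X_dropped = np.hstack([bias, one_hot[:, 1:]])   # option 1: omit one feature column
X_no_bias = one_hot                             # option 2: omit the bias column

# The one-hot columns add up to the bias column
print(np.allclose(one_hot.sum(axis=1, keepdims=True), bias))  # True

# A design matrix has linearly independent columns only if rank == number of columns
print(np.linalg.matrix_rank(X_full), X_full.shape[1])         # 3 4
print(np.linalg.matrix_rank(X_dropped), X_dropped.shape[1])   # 3 3
print(np.linalg.matrix_rank(X_no_bias), X_no_bias.shape[1])   # 3 3
```

With both the bias and all one-hot columns present, the rank is smaller than the number of columns; either remedy restores full column rank.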

### **Polynomial Features**

Consider the following scatterplot: 

<img src="https://ds100.org/course-notes/feature_engineering/feature_engineering_files/figure-html/cell-5-output-2.png" alt="Image Alt Text" width="500" height="350">

We can see that the data follows a curved line rather than a straight line, indicating a **non-linear** relationship between the two variables 
* One potential remedy is to introduce a **polynomial** term, such that we have: 

$$\hat{y} = \theta_0 + \theta_1 (\text{hp}) + \theta_2 (\text{hp}^2)$$

$$\hat{y} = \theta_0 + \theta_1 \phi_1 + \theta_2 \phi_2$$

How can we fit a model with non-linear features?
* We can still do the same thing! Our model is still technically **linear**
* Although it contains non-linear features, it is linear with respect to the model *parameters* 

<img src="https://ds100.org/course-notes/feature_engineering/feature_engineering_files/figure-html/cell-6-output-2.png" alt="Image Alt Text" width="500" height="350">
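
As a minimal sketch of fitting such a model, the example below uses synthetic data in place of the dataset in the plots (the data, coefficients, and variable names are assumptions). Because the model is linear in $\theta$, ordinary least squares applies once the squared column has been added to the design matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from a known quadratic plus a little noise (illustrative only)
x = np.linspace(-3, 3, 50)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(scale=0.1, size=x.size)

# Design matrix with a bias column, phi_1 = x, and phi_2 = x^2
X = np.column_stack([np.ones_like(x), x, x**2])

# Ordinary least squares: same procedure as for a model with only raw features
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ theta

print(theta)  # estimates of theta_0, theta_1, theta_2
```

`np.linalg.lstsq` is used here instead of forming $(\mathbb{X}^{\top}\mathbb{X})^{-1}$ explicitly, which is a common numerical choice rather than something the notes prescribe.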

### **Complexity and Overfitting** 

* Feature engineering allows us to build all sorts of features to improve our model, substantially expanding the model's ability to capture non-linear relationships 

What happens if we keep adding features to our model?

Let's plot models as complexity increases from $0$ to $7$ and observe the RMSE: 


<img src="https://ds100.org/course-notes/feature_engineering/images/degree_comparison2.png" alt="Image Alt Text" width="700" height="280">

* As we use our model to make predictions on the same data that was used to fit the model, we find that the MSE, and with it the RMSE, decreases with each additional polynomial term 

* The **training error**, or the model's error evaluated on the training set, seems to go down as complexity increases 

<img src="https://ds100.org/course-notes/feature_engineering/images/train_error.png" alt="Image Alt Text" width="500" height="300">
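
The trend in training error can be reproduced on synthetic data. This is a sketch under assumed data (a noisy sine curve) and an assumed helper name `training_mse`; it does not use the dataset shown in the figures.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic training data (illustrative only)
x = np.linspace(-1, 1, 30)
y = np.sin(3 * x) + rng.normal(scale=0.2, size=x.size)

def training_mse(degree):
    """Fit a polynomial of the given degree by least squares; return its training MSE."""
    # Columns are x^0, x^1, ..., x^degree; the x^0 column is the bias column
    X = np.vander(x, degree + 1, increasing=True)
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.mean((y - X @ theta) ** 2)

mses = [training_mse(d) for d in range(8)]
for d, mse in enumerate(mses):
    print(f"degree {d}: training MSE = {mse:.4f}")
```

Each higher-degree model contains the lower-degree one as a special case, so the least squares training error cannot go up when a term is added.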

One huge drawback of creating a complicated model is that if we make it *too* complex, we'll start to overfit to our training data 
* **Overfitting** is the phenomenon where the model starts to "memorize" our training data, leaving it unable to **generalize** well to data it's never seen before. 

We say that complex models have high **variance**, as their fitted predictions tend to change dramatically in response to small changes in the training data 

This leaves us with a dilemma: we can keep *decreasing* training error by increasing model complexity, but models that are *too* complex start to overfit and can't be reapplied to new datasets due to **high variance**

<img src="https://ds100.org/course-notes/feature_engineering/images/bvt.png" alt="Image Alt Text" width="500" height="300">


The key point here: We need to strike a balance in the complexity of our models; we want models that are capable of "generalizing" to unseen data 
* A model too simple won't capture the true relationships between our variables of interest 
* A model too complex runs the risk of overfitting
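
One common way to look for this balance is to compare the error on the training set with the error on data held out from fitting. The sketch below uses synthetic data and hypothetical helper names (`make_data`, `fit_and_score`); the specific degrees and sample sizes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Draw n synthetic points from a noisy sine curve (illustrative only)."""
    x = rng.uniform(-1, 1, size=n)
    y = np.sin(3 * x) + rng.normal(scale=0.2, size=n)
    return x, y

x_train, y_train = make_data(15)    # small training set
x_test, y_test = make_data(200)     # unseen data from the same process

def fit_and_score(degree):
    """Fit a polynomial on the training set; return (training MSE, held-out MSE)."""
    X_train = np.vander(x_train, degree + 1, increasing=True)
    X_test = np.vander(x_test, degree + 1, increasing=True)
    theta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    train_mse = np.mean((y_train - X_train @ theta) ** 2)
    test_mse = np.mean((y_test - X_test @ theta) ** 2)
    return train_mse, test_mse

scores = {degree: fit_and_score(degree) for degree in (1, 3, 12)}
for degree, (train_mse, test_mse) in scores.items():
    print(f"degree {degree:2d}: train MSE = {train_mse:.4f}, held-out MSE = {test_mse:.4f}")
```

The training error is guaranteed not to increase with the degree, while the held-out error is not; for very high degrees it typically grows, although the exact numbers depend on the random draw.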