<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Learning-Objectives" data-toc-modified-id="Learning-Objectives-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Learning Objectives</a></span></li><li><span><a href="#Model-Selection" data-toc-modified-id="Model-Selection-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Model Selection</a></span><ul class="toc-item"><li><span><a href="#Baseline-Model" data-toc-modified-id="Baseline-Model-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Baseline Model</a></span></li></ul></li><li><span><a href="#Decisions,-Decisions,-Decisions..." data-toc-modified-id="Decisions,-Decisions,-Decisions...-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Decisions, Decisions, Decisions...</a></span></li><li><span><a href="#Correlation" data-toc-modified-id="Correlation-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Correlation</a></span></li><li><span><a href="#Feature-Engineering" data-toc-modified-id="Feature-Engineering-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Feature Engineering</a></span></li><li><span><a href="#Distribution-Transformations" data-toc-modified-id="Distribution-Transformations-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Distribution Transformations</a></span><ul class="toc-item"><li><span><a href="#Log-Scaling" data-toc-modified-id="Log-Scaling-6.1"><span class="toc-item-num">6.1&nbsp;&nbsp;</span>Log Scaling</a></span></li><li><span><a href="#Build-model" data-toc-modified-id="Build-model-6.2"><span class="toc-item-num">6.2&nbsp;&nbsp;</span>Build model</a></span></li><li><span><a href="#Check-distribution-of-target" data-toc-modified-id="Check-distribution-of-target-6.3"><span class="toc-item-num">6.3&nbsp;&nbsp;</span>Check distribution of target</a></span></li><li><span><a href="#Build-model-with-log-scaled-target" data-toc-modified-id="Build-model-with-log-scaled-target-6.4"><span class="toc-item-num">6.4&nbsp;&nbsp;</span>Build model with log-scaled 
target</a></span></li></ul></li><li><span><a href="#Binning" data-toc-modified-id="Binning-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Binning</a></span><ul class="toc-item"><li><span><a href="#Volatile-Acidity" data-toc-modified-id="Volatile-Acidity-7.1"><span class="toc-item-num">7.1&nbsp;&nbsp;</span>Volatile Acidity</a></span></li><li><span><a href="#$\bf{SO_2}$" data-toc-modified-id="$\bf{SO_2}$-7.2"><span class="toc-item-num">7.2&nbsp;&nbsp;</span>$\bf{SO_2}$</a></span></li></ul></li><li><span><a href="#Products-of-Features" data-toc-modified-id="Products-of-Features-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Products of Features</a></span></li><li><span><a href="#Polynomial-Features" data-toc-modified-id="Polynomial-Features-9"><span class="toc-item-num">9&nbsp;&nbsp;</span>Polynomial Features</a></span></li><li><span><a href="#Exercise" data-toc-modified-id="Exercise-10"><span class="toc-item-num">10&nbsp;&nbsp;</span>Exercise</a></span></li></ul></div>

In [None]:
import pandas as pd
import seaborn as sns
import numpy as np
from matplotlib import pyplot as plt
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.dummy import DummyRegressor
import sklearn.metrics as metrics
import statsmodels.api as sm
from scipy import stats

## Learning Objectives

- Use correlations and other algorithms to inform feature selection
- Create new features for use in modeling
    - Use binning to turn numerical into categorical features
    - Use `PolynomialFeatures` to build compound features

## Model Selection

Let's imagine that I'm going to try to predict wine quality based on the other features.

In [None]:
wine = pd.read_csv('data/wine.csv')

In [None]:
wine.head(10)

### Baseline Model

A baseline model for regression predicts the mean of the target for every observation (row), regardless of the values of the features/predictors. Any model worth keeping should beat it.

Enter `DummyRegressor`!
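
As a minimal sketch of the baseline idea, the block below fits a `DummyRegressor` on made-up data. The names `X_demo` and `y_demo` and the values in them are assumptions for illustration only; in this notebook the real `X` and `y` come from `wine.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Hypothetical stand-in data; the notebook's real X and y come from wine.csv.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "alcohol": rng.normal(10.5, 1.2, size=200),
    "density": rng.normal(0.995, 0.003, size=200),
})
y_demo = pd.Series(rng.integers(3, 10, size=200), name="quality")

# strategy="mean" (the default) ignores X and always predicts mean(y).
baseline = DummyRegressor(strategy="mean")
baseline.fit(X_demo, y_demo)

baseline_preds = baseline.predict(X_demo)
baseline_r2 = baseline.score(X_demo, y_demo)  # R^2 on the training data
baseline_mse = mean_squared_error(y_demo, baseline_preds)

print(baseline_preds[:3], baseline_r2, baseline_mse)
```

On the data it was fit on, the mean predictor has an $R^2$ of zero and an MSE equal to the (population) variance of the target, which is why it makes a natural reference point.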

In [None]:
X = wine.drop('quality', axis=1)
y = wine.quality

In [None]:
# Instantiate


In [None]:
# Score it!


In [None]:
# Predict!


In [None]:
# MSE


## Decisions, Decisions, Decisions...

Now: Which columns (predictors) should I choose? 

There are 12 predictors I could choose from. For each of these predictors, I could either use it or not use it in my model, which means that there are $2^{12} = 4096$ _different_ models I could construct! Well, okay, one of these is the "empty model" with no predictors in it. But there are still 4095 models from which I can choose.
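
The count above can be checked directly. This short sketch uses placeholder predictor names (`x0` to `x11`, not the real column names) and enumerates every non-empty subset.

```python
from itertools import combinations

n_predictors = 12
predictors = [f"x{i}" for i in range(n_predictors)]  # placeholder names

# Each predictor is either in or out, so there are 2**n subsets in total.
n_subsets = 2 ** n_predictors

# Enumerate the non-empty subsets explicitly to confirm the count.
nonempty_subsets = [
    subset
    for k in range(1, n_predictors + 1)
    for subset in combinations(predictors, k)
]

print(n_subsets, len(nonempty_subsets))  # 4096 4095
```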

How can I decide which predictors to use in my model?

We'll explore a few methods in the sections below.

## Correlation

Our first attempt might be just to see which features are _correlated_ with the target.

We can use the correlation metric in making a decision.

In [None]:
# Use the .corr() DataFrame method to find out about the
# correlation values between all pairs of variables!

wine.corr()

In [None]:
sns.set(rc={'figure.figsize':(12, 12)})

# Use the .heatmap function to depict the relationships visually!


In [None]:
# Let's look at the correlations with 'quality'
# (our dependent variable) in particular.

wine_corrs = wine.corr()['quality'].map(abs).sort_values(ascending=False)
wine_corrs

The features differ in how strongly they correlate with the target. The larger the absolute correlation, the more we'd expect a feature to be a useful (linear) predictor on its own.

Let's try using only a subset of the strongest correlated features to make our model.
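
Here is one way this could look, sketched on synthetic data rather than the wine dataset. The frame `demo`, its column names, and the relationship between columns are all assumptions made up for the example, and `LinearRegression` is used here in place of a `statsmodels` OLS fit.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the wine data (column names are illustrative only).
rng = np.random.default_rng(42)
n = 500
demo = pd.DataFrame({
    "alcohol": rng.normal(size=n),
    "density": rng.normal(size=n),
    "pH": rng.normal(size=n),
})
# By construction: strong effect of 'alcohol', weaker effect of 'density', none of 'pH'.
demo["quality"] = 2.0 * demo["alcohol"] - 1.0 * demo["density"] + rng.normal(size=n)

# Rank predictors by absolute correlation with the target.
corrs = demo.corr()["quality"].drop("quality").abs().sort_values(ascending=False)
top_two = list(corrs.index[:2])

# Fit a linear model on the two most strongly correlated predictors only.
subset_model = LinearRegression().fit(demo[top_two], demo["quality"])
subset_r2 = subset_model.score(demo[top_two], demo["quality"])

print(top_two, round(subset_r2, 3))
```

Ranking by absolute value matters: a strongly negative correlation is as informative for a linear model as a strongly positive one.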

In [None]:
# Let's choose 'alcohol' and 'density'.


In [None]:
# ols


## Feature Engineering

> Domain knowledge can be helpful here! 🧠

In practice this aspect of data preparation can constitute a huge part of the data scientist's work. As we move into data modeling, much of the goal will be a matter of finding––**or creating**––features that are predictive of the targets we are trying to model.

There are infinitely many ways of transforming and combining a starting set of features. Good data scientists will have a nose for which engineering operations will be likely to yield fruit and for which operations won't. And part of the game here may be getting someone else on your team who understands what the data represent better than you!

Let's try this ourselves! Since I don't know much about wine, I'm really just guessing.

In [None]:
wine.head(10)

## Distribution Transformations

### Log Scaling

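Linear regression can work better if the predictors and target are roughly normally distributed. (Strictly speaking, the normality assumption concerns the model's residuals, but heavily skewed variables often lead to skewed residuals.)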

**Log-scaling** can be a good tool to make *right-skewed* data more normal.

(For *left-skewed* data, which is rarer, we can try transforming our data by raising it to an exponent greater than 1.)
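
To make the effect concrete, this sketch draws right-skewed values from a log-normal distribution (an assumption chosen because its logarithm is exactly normal) and compares the sample skewness before and after the transform.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Log-normal draws are right-skewed by construction.
x_skewed = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=5000))
x_logged = np.log(x_skewed)

# pandas' .skew() is close to 0 for symmetric data and positive for right skew.
skew_before = x_skewed.skew()
skew_after = x_logged.skew()

print(round(skew_before, 2), round(skew_after, 2))
```

Real data will rarely be exactly log-normal, so the transformed skewness usually shrinks toward zero without reaching it. Note also that `np.log` requires strictly positive values.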

Suppose e.g. a kde plot of my predictor $X$ looks like this:

![original](./images/skewplot.png)

In that case, the kde plot of a log-transformed version of $X$ could look like this:

![log](./images/logplot.png)

Let's set up a problem like this.

In [None]:
diamonds = sns.load_dataset('diamonds')

In [None]:
X = diamonds.select_dtypes(include=float)
y = diamonds['price']

### Build model

In [None]:
sm.OLS(endog=y, exog=X).fit().summary()

### Check distribution of target

In [None]:
y.hist();

In [None]:
y_scld = np.log(y)
y_scld.hist();

### Build model with log-scaled target

In [None]:
model_diam = sm.OLS(y_scld, X).fit()
model_diam.summary()

But with this transformed target, how do I now interpret my LR coefficients?

Before the transformation, I would have read the coefficient on, say, depth as the average change in price for a one-unit increase in depth. With the log-scaled target, a depth coefficient of 0.0319 means that a one-unit increase in depth corresponds on average to a 0.0319 increase *in the logarithm of price*, i.e. price is multiplied by a factor of $e^{0.0319} \approx 1.032$, holding the other predictors fixed.
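
The reasoning is that if $\log y = \beta_0 + \beta x$, then $y(x+1) / y(x) = e^{\beta}$. The sketch below checks this numerically on synthetic data; the predictor range, the true slope of 0.03, and the noise level are assumptions for illustration, not values taken from the diamonds fit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x = rng.uniform(55, 70, size=(1000, 1))  # hypothetical predictor values
log_y = 0.03 * x[:, 0] + 5 + rng.normal(scale=0.1, size=1000)

# Fit a linear model to the log of the target.
log_model = LinearRegression().fit(x, log_y)
beta = log_model.coef_[0]

# Predicted target on the original scale at x0 and at x0 + 1.
x0 = np.array([[60.0]])
y_at_x0 = np.exp(log_model.predict(x0))[0]
y_at_x0_plus_1 = np.exp(log_model.predict(x0 + 1))[0]

# The ratio of the two predictions equals exp(beta), whatever x0 is.
ratio = y_at_x0_plus_1 / y_at_x0
pct_change = 100 * (np.exp(beta) - 1)

print(ratio, np.exp(beta), pct_change)
```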

In [None]:
print(f"""
A one-unit increase in the depth variable corresponds
to an increase in price by a factor of {round(np.exp(0.0319), 3)},
or {round(100 * (np.exp(0.0319) - 1), 2)}%.
""")

## Binning

To start we'll look at some `seaborn` Pair Plots. We'll do this in two halves so that we can see things a bit more clearly:

In [None]:
# This will show the first six predictors and 'quality'
columns = [j < 6 or j == 11 for j in range(13)]

sns.pairplot(data=wine.iloc[:, columns]);

In [None]:
sns.pairplot(data=wine.iloc[:, 6:]);

### Volatile Acidity

Let's look at the distribution of the volatile acidity feature:

In [None]:
# Default Histogram

Suppose we add more bins:

In [None]:
# more bins


In [None]:
# better: sns


In [None]:
# even better: sns with kde and better binning


Now the distribution looks quite different. There seems to be a small second peak around 0.6. And we can reproduce this if we check out `seaborn`'s kernel density plot.

In [None]:
sns.kdeplot(wine['volatile acidity']);

So suppose we build a new feature that records whether a wine's volatile acidity is above 0.5.
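
A thresholded feature of this kind might be built as follows. The data here are synthetic: the two-group shape of `volatile acidity`, the link to `quality`, and the column name `high_va` are all assumptions made up for the sketch.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 1000

# Synthetic stand-in: a two-group 'volatile acidity' column.
group_high = rng.random(n) < 0.3
acidity = np.where(group_high, rng.normal(0.65, 0.08, n), rng.normal(0.30, 0.08, n))
demo = pd.DataFrame({"volatile acidity": acidity})

# Assumed relationship for illustration: the high group scores lower on average.
demo["quality"] = 6.0 - 1.0 * group_high + rng.normal(scale=0.8, size=n)

# Binary bin: True when volatile acidity exceeds the 0.5 threshold.
demo["high_va"] = demo["volatile acidity"] > 0.5

# Cast the booleans to 0/1 before computing the correlation.
va_corr = demo["high_va"].astype(int).corr(demo["quality"])

print(round(va_corr, 3))
```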

In [None]:
# new feature!


In [None]:
# Correlation?


Not bad! We don't seem to have stumbled onto a huge connection here, but this correlation value suggests that this new feature may be helpful in a final model.

### $\bf{SO_2}$

Next we'll take a look at the distribution of the sulfur dioxide feature:

In [None]:
sns.histplot(wine['total sulfur dioxide'], kde=True)

Let's try separating our wines into those with sulfur dioxide higher than 80 and those with less:

In [None]:
wine['high_so2'] = wine['total sulfur dioxide'] > 80

In [None]:
wine.corr()['quality']['high_so2']

Not great. Perhaps this is a modeling dead end.

## Products of Features

Another engineering strategy we might try is **multiplying features together**.

Let's try these two features: `residual sugar` and `total sulfur dioxide`. Note that without domain knowledge or exploration, this is really a guess that this combination will predict `quality` well.
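
The mechanics can be sketched on synthetic data where, by assumption, the target really is driven by an interaction. The columns `a` and `b` are hypothetical and do not correspond to the wine features, so this shows a best case rather than what to expect in general.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 2000
demo = pd.DataFrame({
    "a": rng.uniform(1, 5, n),  # hypothetical feature names
    "b": rng.uniform(1, 5, n),
})

# Assumption for illustration: the target is driven by the interaction a * b.
demo["target"] = demo["a"] * demo["b"] + rng.normal(scale=1.0, size=n)

# Engineered feature: the elementwise product of the two columns.
demo["a_x_b"] = demo["a"] * demo["b"]

# Correlation of each feature, and of the product, with the target.
product_corrs = demo.corr()["target"].drop("target")

print(product_corrs.round(3))
```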

In [None]:
# multiply?


In [None]:
# check it


In [None]:
wine.corr()['quality']['residual sugar']

In [None]:
wine.corr()['quality']['total sulfur dioxide']

We can see that the product of these two features has a stronger correlation with `quality` than either feature by itself!

## Polynomial Features

Instead of just multiplying features at random, we might consider trying **every possible product of features** up to a chosen degree. That's what `PolynomialFeatures` does.

In [None]:
# Polynomials!


In [None]:
# Create Dataframe


In [None]:
pdf.shape

In [None]:
# Get example

matching = [s for s in pdf.columns if "x10" in s]

In [None]:
matching

In [None]:
# Model it!


In [None]:
# Score it!


So: Is this a good idea? What are the potential dangers here?
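
One danger worth seeing in action is overfitting. In this sketch the target is pure noise, an assumption chosen so that there is nothing real to learn, yet a polynomial expansion with more columns than training rows still fits the training data almost perfectly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X_noise = rng.normal(size=(80, 5))
y_noise = rng.normal(size=80)  # target is pure noise: nothing to learn

X_tr, X_te, y_tr, y_te = train_test_split(
    X_noise, y_noise, test_size=0.5, random_state=0
)

# Degree 3 on 5 features gives 56 columns, more than the 40 training rows.
poly = PolynomialFeatures(degree=3)
X_tr_poly = poly.fit_transform(X_tr)
X_te_poly = poly.transform(X_te)

overfit = LinearRegression().fit(X_tr_poly, y_tr)
train_r2 = overfit.score(X_tr_poly, y_tr)
test_r2 = overfit.score(X_te_poly, y_te)

print(round(train_r2, 3), round(test_r2, 3))
```

A large gap between the training and test scores is the warning sign; scoring only on the data used for fitting would hide it.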

## Exercise

Consider the following dataset:

In [None]:
sales = pd.read_csv('data/Advertising.csv', index_col=0)

sales.head()

We'd like to try to understand sales as a function of spending on various media (TV, radio, newspaper).

In [None]:
sales.corr()['Sales']

**Try to find the best multiplicative combination of features.**

You may use `PolynomialFeatures` or just multiply by hand.

In practice, it's not easy to tell in advance when such products of features will be this fruitful. Moreover, a product is usually correlated with the features it is built from, so there is room for concern about multicollinearity, i.e. violating regression's demand for feature independence. At the very least, we would probably not want to include a product *and the individual features themselves* in a final model, not if our goal is to understand what's really responsible for fluctuations in our target variable.