# Feature Selection and Feature Engineering

Featuring an exploration of polynomial and interaction terms (postponed from last Thursday!)

## Objectives

- Use correlations and other algorithms to inform feature selection
- Address the problem of multicollinearity in regression problems
- Create new features for use in modeling
    - Use `PolynomialFeatures` to build compound features

## Set Up

Insurance costs data (from https://www.kaggle.com/mirichoi0218/insurance)

In [None]:
# Initial imports
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

In [None]:
df = pd.read_csv('data/insurance.csv')

In [None]:
# explore the data
df.head()

In [None]:
df.info()

In [None]:
df.describe()

Let's quickly encode our categorical variables so we can use them!

In [None]:
# set our X and y
X = df.drop(columns='charges')
y = df['charges']

In [None]:
# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.25, 
                                                    random_state=42)

In [None]:
cat_cols = ['sex', 'smoker', 'region']

# create an encoder object
encoder = OneHotEncoder(handle_unknown='error',
                        drop='first',
                        categories='auto')

# Create a ColumnTransformer object
ct = ColumnTransformer(transformers=[('ohe', encoder, cat_cols)],
                       remainder='passthrough')
ct.fit(X_train)
X_train_enc = ct.transform(X_train)
X_test_enc = ct.transform(X_test)

In [None]:
# We can also go ahead and scale - let's use a MinMaxScaler because binaries!
scaler = MinMaxScaler()

# train on train data
scaler.fit(X_train_enc)

# transform both train and test data
X_train_scaled = scaler.transform(X_train_enc)
X_test_scaled = scaler.transform(X_test_enc)

In [None]:
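# Note: get_feature_names() was removed in scikit-learn 1.2; newer versions
# provide get_feature_names_out(), which names the columns differently
X_train_sc_df = pd.DataFrame(X_train_scaled, columns= ct.get_feature_names())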
X_train_sc_df.head()

# Correlation and Multicollinearity

Our first attempt might be to just see which features are _correlated_ with the target, and use those to make a prediction.

We can use the correlation metric in making a decision.
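
Correlation between the features themselves matters too: predictors that are strongly correlated with each other are the usual symptom of multicollinearity. Here is a minimal sketch of flagging such pairs on made-up data. The column names, the helper `correlated_pairs`, and the 0.8 cutoff are all assumptions for illustration, not part of the insurance data.

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame, threshold=0.8):
    """Return (feature_a, feature_b, |r|) for pairs whose absolute correlation exceeds threshold."""
    corr = frame.corr().abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold:
                pairs.append((cols[i], cols[j], round(float(corr.iloc[i, j]), 3)))
    return pairs

rng = np.random.default_rng(0)
height_cm = rng.normal(170, 10, 200)
toy = pd.DataFrame({
    'height_cm': height_cm,
    'height_in': height_cm / 2.54 + rng.normal(0, 0.1, 200),  # nearly a copy of height_cm
    'shoe_size': rng.normal(40, 2, 200),                      # generated independently
})

flagged = correlated_pairs(toy)
print(flagged)
```

A flagged pair is a prompt to look closer (and perhaps drop or combine one of the two), not an automatic rule.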

In [None]:
# want to create a full train df with both X and y variables to explore
train_df = pd.DataFrame(X_train_enc, columns= ct.get_feature_names())
# assign by position: y_train keeps its shuffled original index, so a plain
# Series assignment would align on index labels and mismatch rows with targets
train_df['target'] = y_train.values

In [None]:
# Making the visual bigger so we can read it
sns.set(rc={'figure.figsize':(8, 8)})

sns.heatmap(train_df.corr(), annot=True);

In [None]:
# Let's zoom in on the correlations with 'charges' (target)
train_df.corr()['target'].map(abs).sort_values(ascending=False)

Take note of which features are only weakly correlated with our target: it's not surprising if a model built on just one of those doesn't perform particularly well!

Let's try to model, first with just a single feature (`children`) and then with all features, and see how they perform.

In [None]:
# Instantiate our simple model
lr_simple = LinearRegression()

# Run with a single feature
lr_simple.fit(X_train_sc_df[['children']], y_train)

# Score on train
print(f"Train R2: {lr_simple.score(X_train_sc_df[['children']], y_train):.4f}")

# Make a df version of test to score it too
X_test_sc_df = pd.DataFrame(X_test_scaled, columns= ct.get_feature_names())
print(f"Test R2: {lr_simple.score(X_test_sc_df[['children']], y_test):.4f}")

#### Evaluate 

- 


In [None]:
# Instantiate our model
lr_all = LinearRegression()

# Run with all features
lr_all.fit(X_train_scaled, y_train)

# Score on train and test
print(f"Train R2: {lr_all.score(X_train_scaled, y_train):.4f}")
print(f"Test R2: {lr_all.score(X_test_scaled, y_test):.4f}")

#### Evaluate 

- 


### Explore Coefficients

Our simple model with just one variable performed quite poorly, but our more complicated model performed much better. 

Let's explore the coefficients of that model:


In [None]:
dict(zip(ct.get_feature_names(), lr_all.coef_))

BECAUSE our data is all on the same scale, we can use coefficients to decide which are more important in this model!
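
To see why the scale matters here, a small sketch on synthetic data (the variables `income_dollars` and `years` are invented for illustration). The same relationship gives very different coefficient rankings before and after min-max scaling.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(1)
n = 500
income_dollars = rng.uniform(20_000, 120_000, n)   # large units
years = rng.uniform(0, 10, n)                      # small units

# Across its range, income moves the target by about 100_000 * 0.001 = 100,
# while years moves it by about 10 * 2 = 20
target = 0.001 * income_dollars + 2.0 * years + rng.normal(0, 1, n)

X_raw = np.column_stack([income_dollars, years])
raw_coefs = LinearRegression().fit(X_raw, target).coef_

X_mm = MinMaxScaler().fit_transform(X_raw)
scaled_coefs = LinearRegression().fit(X_mm, target).coef_

print("raw:   ", raw_coefs)     # the income coefficient looks tiny only because of its units
print("scaled:", scaled_coefs)  # after scaling, the income coefficient is the larger one
```

On the raw data, `years` appears to be the more important feature; once both columns share a 0-to-1 range, the ordering flips to match how much each feature actually moves the target.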

Let's run another model with only the top 4 features with the largest coefficients (by absolute value)

In [None]:
# Define our top four features for train and test
top4 = None
# Easiest to do this with a dataframe
X_train_top4 = X_train_sc_df[top4]
X_test_top4 = X_test_sc_df[top4]
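
Rather than picking the largest coefficients by eye, one option is to rank them programmatically. A minimal sketch, using hypothetical coefficient values and feature names in place of the real `dict(zip(...))` output:

```python
import pandas as pd

# Hypothetical coefficients, standing in for dict(zip(feature_names, model.coef_))
toy_coefs = pd.Series({'feat_a': -950.0, 'feat_b': 120.0, 'feat_c': 3100.0,
                       'feat_d': -40.0, 'feat_e': 2200.0, 'feat_f': 610.0})

# Rank by magnitude so large negative coefficients count as much as large positive ones
top_k = toy_coefs.abs().sort_values(ascending=False).head(4).index.tolist()
print(top_k)  # ['feat_c', 'feat_e', 'feat_a', 'feat_f']
```

The resulting list of names can be used directly to index a DataFrame of features.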

In [None]:
# Instantiate our model
lr_top4 = LinearRegression()

# Run with only the top four features
lr_top4.fit(X_train_top4, y_train)

# Score on train and test
print(f"Train R2: {lr_top4.score(X_train_top4, y_train):.4f}")
print(f"Test R2: {lr_top4.score(X_test_top4, y_test):.4f}")

#### Evaluate

- 


## Recursive Feature Elimination

The idea behind recursive feature elimination is to start with all predictive features and then build down to a small set of features slowly, by eliminating the features with the smallest coefficients (in absolute value) one at a time.

That is:

1. Start with a model with _all_ $n$ predictors
2. Find the predictor with the smallest effect (the coefficient closest to zero)
3. Throw that predictor out and build a model with the remaining $n-1$ predictors
4. Set $n = n-1$ and repeat until $n$ has the value you want!
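
The steps above can be sketched by hand in a few lines. This is an illustration on synthetic data, not scikit-learn's implementation; the helper name `manual_rfe` is made up, and it assumes the columns are already on a comparable scale.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def manual_rfe(X, y, n_keep):
    """Drop the feature with the smallest |coefficient| until n_keep remain.

    Returns the surviving column indices and the order in which columns were eliminated.
    """
    remaining = list(range(X.shape[1]))
    eliminated = []
    while len(remaining) > n_keep:
        # Refit on the surviving columns only
        model = LinearRegression().fit(X[:, remaining], y)
        # Position (within `remaining`) of the coefficient closest to zero
        weakest = int(np.argmin(np.abs(model.coef_)))
        eliminated.append(remaining.pop(weakest))
    return remaining, eliminated

rng = np.random.default_rng(2)
X_toy = rng.uniform(0, 1, size=(300, 5))
# Only columns 0 and 3 drive the target; the others are noise
y_toy = 10 * X_toy[:, 0] - 6 * X_toy[:, 3] + rng.normal(0, 0.5, 300)

kept, dropped = manual_rfe(X_toy, y_toy, n_keep=2)
print("kept:", kept, "dropped in order:", dropped)
```

Note that the model is refit after every elimination, which is what makes the procedure "recursive": coefficients can shift once a feature is removed.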

### Recursive Feature Elimination in Scikit-Learn

Note: MUST use on scaled data, since features are ranked by the size of their coefficients!

Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html

In [None]:
# import RFE
from sklearn.feature_selection import RFE

In [None]:
lr_rfe = LinearRegression()
select = RFE(lr_rfe, n_features_to_select=3)

In [None]:
select.fit(X=X_train_scaled, y=y_train)

In [None]:
select.support_

In [None]:
select.ranking_
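
`support_` and `ranking_` are plain arrays in column order, so they are easier to read next to the feature names. A small sketch on synthetic data (the names `f0` to `f4` are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
names = ['f0', 'f1', 'f2', 'f3', 'f4']  # hypothetical feature names
X_toy = rng.uniform(0, 1, size=(300, 5))
# Only f1 and f4 drive the target
y_toy = 8 * X_toy[:, 1] + 5 * X_toy[:, 4] + rng.normal(0, 0.5, 300)

selector = RFE(LinearRegression(), n_features_to_select=2).fit(X_toy, y_toy)

# support_ is a boolean mask of kept features; ranking_ is 1 for kept features,
# with larger numbers for features that were eliminated earlier
summary = pd.DataFrame({'feature': names,
                        'selected': selector.support_,
                        'ranking': selector.ranking_}).sort_values('ranking')
print(summary)
```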

There are more options built out in scikit-learn - check out their [user guide section on feature selection](https://scikit-learn.org/stable/modules/feature_selection.html)!

-----

# Feature Engineering

## Polynomial Terms - Simple Linear Regression

Demonstrating this on a toy example, with a single x variable predicting y.

In [None]:
# 150 samples from uniform distribution between -2pi and 2pi

x = np.random.uniform(-2*np.pi, 2*np.pi, 150)

# Creating target (y) - so we know the true relationship between x and y
# But - adding some noise (error) with 'np.random'

y = np.sin(x) + np.random.normal(loc=0, scale=0.4, size=len(x))

In [None]:
# Visualize it
plt.scatter(x, y)

# raw strings so that '\s' and '\p' are not treated as escape sequences
plt.ylabel(r'$\sin(x)$ plus noise')
plt.xlabel(r'x values are randomly chosen from $[-2\pi, 2\pi]$')
plt.show()

In [None]:
# Fitting a linear model
lr = LinearRegression()
lr.fit(x.reshape(-1, 1), y)

In [None]:
# Grabbing the predicted values
y_pred = lr.predict(x.reshape(-1, 1))

In [None]:
# Scoring our model
print(f"R2 Score: {r2_score(y, y_pred)}")

In [None]:
# Visualize it
plt.scatter(x, y) # original data

plt.plot(x, y_pred, c='red') # predicted values

# raw strings so that '\s' and '\p' are not treated as escape sequences
plt.ylabel(r'$\sin(x)$ + noise')
plt.xlabel(r'x values randomly chosen between $-2\pi$ and $2\pi$')
plt.title("Simple Linear Regression")

plt.show()

Is this a good model? Well - of course not. It's definitely **underfit** - it is not complex enough to accurately capture the pattern and predict the target.

Let's try again, but now with polynomials!

In [None]:
# For this, we'll need some helper functions
# Shoutout to Andy for sending me these
from sklearn.preprocessing import PolynomialFeatures

def create_poly_dataset(x, degree):
    """
    returning dataset with the given polynomial degree
    """
    # Instantiate the PolynomialFeatures object with given 'degree'
    poly = PolynomialFeatures(degree=degree)

    # Now transform data to create higher order features
    new_data = poly.fit_transform(x.reshape(-1, 1))
    return new_data

def fit_linear_model(data, y):
    """
    fitting a linear model and printing model details
    """
    np.set_printoptions(precision=4, suppress=True)

    if data.ndim == 1:
        data = data.reshape(-1, 1)

    lr = LinearRegression(fit_intercept=False)
    lr.fit(data, y)
    print("-"*13)
    print("Coefficients: ", lr.coef_)
    print(f"R-Squared: {lr.score(data, y):.3f}")
    return lr

def plot_predict(x, y, model):
    """
    plotting predictions against true values
    """
    plt.scatter(x, y, label='true')
    x_pred = np.linspace(x.min(), x.max(), 100)
    
    # visualize beyond this x range by uncommenting below
    # (np.ptp is used because the ndarray.ptp method was removed in NumPy 2.0):
    # extra = np.ptp(x) * .2
    # x_pred = np.linspace(x.min() - extra, x.max() + extra, 100)

    plt.plot(x_pred, model.predict(create_poly_dataset(x_pred, len(model.coef_)-1)),
             label='predicted', c='red')

    if len(model.coef_) == 1:
        plt.title(f"{len(model.coef_) - 1} Polynomial Terms \n (no slope)")
    elif (len(model.coef_) - 1) == 1:
        plt.title(f"{len(model.coef_) - 1} Polynomial Term")
    else:
        plt.title(f"{len(model.coef_) - 1} Polynomial Terms")

    plt.legend()
    plt.show()
    return

In [None]:
# visualizing an assortment of polynomial degrees
# can visualize each sequential polynomial with `range(n)`
for i in [0, 1, 2, 3, 5, 7, 9, 13, 18]:
    xi = create_poly_dataset(x, i)
    plot_predict(x, y, fit_linear_model(xi, y))
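
The plots above only show the fit on the data the model was trained on. One way to compare degrees more fairly is to hold out part of the data and score on it. A sketch under assumptions: fresh synthetic data from the same sine-plus-noise recipe, an arbitrary 70/30 split, and a `make_pipeline` wrapper in place of the helper functions above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(4)
x_toy = rng.uniform(-2 * np.pi, 2 * np.pi, 150)
y_toy = np.sin(x_toy) + rng.normal(0, 0.4, 150)

x_tr, x_te, y_tr, y_te = train_test_split(
    x_toy.reshape(-1, 1), y_toy, test_size=0.3, random_state=0)

scores = {}
for degree in [1, 3, 5, 7, 9]:
    # Expand to polynomial terms, then fit ordinary least squares on them
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(x_tr, y_tr)
    scores[degree] = (model.score(x_tr, y_tr), model.score(x_te, y_te))
    print(f"degree {degree}: train R2 = {scores[degree][0]:.3f}, "
          f"held-out R2 = {scores[degree][1]:.3f}")
```

Train R2 can only go up as terms are added, so the held-out column is the one to watch when deciding which degree is "best".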

Evaluate: which of these is the best?

- 


Evaluate: so what?

- 


## Interaction Terms

When do we need interaction terms? And how do we check for them?

Well, first things first - what interactions do _you_ think would make sense? That's the easiest way to incorporate interaction terms - use domain knowledge to think through what usefully could be combined into an interaction.
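
Once you have a candidate pair, the interaction term itself is just the product of the two columns. A minimal sketch on synthetic data (the columns `dose`, `treated`, and `outcome` are invented, and the data is built so that the interaction genuinely exists):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
n = 400
toy = pd.DataFrame({'dose': rng.uniform(0, 10, n),
                    'treated': rng.integers(0, 2, n)})
# By construction, dose only matters for the treated group: slope 0 vs slope 3
toy['outcome'] = 3 * toy['dose'] * toy['treated'] + rng.normal(0, 1, n)

# Interaction term: the product of the two features
toy['dose_x_treated'] = toy['dose'] * toy['treated']

base_cols = ['dose', 'treated']
int_cols = ['dose', 'treated', 'dose_x_treated']

base = LinearRegression().fit(toy[base_cols], toy['outcome'])
with_int = LinearRegression().fit(toy[int_cols], toy['outcome'])

r2_base = base.score(toy[base_cols], toy['outcome'])
r2_int = with_int.score(toy[int_cols], toy['outcome'])
print(f"without interaction: {r2_base:.3f}")
print(f"with interaction:    {r2_int:.3f}")
```

Without the product column, the model is forced to give both groups the same slope for `dose`; the interaction term lets the slope differ by group.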

As for how to check if something might be better captured as an interaction...

In [None]:
# Quick set up
df_ohe = pd.get_dummies(df, columns=cat_cols, drop_first=True)

In [None]:
# an example of no interaction term...
sns.lmplot(x='age', y='charges', hue='smoker_yes', data=df_ohe, scatter=False)
plt.show()

How do I know these two variables, `age` and `smoker_yes`, aren't interacting? 

- 


In [None]:
# now let's look at something else...
sns.lmplot(x='bmi', y='charges', hue='smoker_yes', data=df_ohe, scatter=False)
plt.show()

What do you think?

- 


## Implementing Interaction and Polynomials in Sklearn

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html
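
Before running this on the full dataset, it can help to see what the transformer produces for a single tiny row. A quick sketch with two made-up feature values, a=2 and b=3:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

row = np.array([[2, 3]])  # one sample with two features, a=2 and b=3

# degree=2 with interaction_only=False: 1, a, b, a^2, a*b, b^2
full = PolynomialFeatures(degree=2, interaction_only=False).fit_transform(row)
print(full)   # [[1. 2. 3. 4. 6. 9.]]

# degree=2 with interaction_only=True: 1, a, b, a*b (no squared terms)
inter = PolynomialFeatures(degree=2, interaction_only=True).fit_transform(row)
print(inter)  # [[1. 2. 3. 6.]]
```

The leading 1 is the bias column, which `PolynomialFeatures` includes by default.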

In [None]:
# There's a single sklearn class that handles both!
from sklearn.preprocessing import PolynomialFeatures

Let's first do Polynomials, to the 3rd degree:

In [None]:
# Set up our PolynomialFeatures with degree=3 and interaction_only=False
poly = PolynomialFeatures(degree=3, interaction_only=False)

In [None]:
poly.fit(X_train_scaled)

In [None]:
X_train_poly = poly.transform(X_train_scaled)
X_test_poly = poly.transform(X_test_scaled)

In [None]:
X_train_poly = pd.DataFrame(X_train_poly, columns = poly.get_feature_names())
X_train_poly.head()

In [None]:
X_train_poly.columns

In [None]:
X_train_poly.info()
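
The number of generated columns grows quickly with the degree. A sketch of the arithmetic, checked against `PolynomialFeatures` itself; the count of 8 input features is an assumption (5 one-hot columns plus `age`, `bmi`, and `children`).

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features = 8  # assumed: 5 one-hot columns plus age, bmi, and children

# All monomials of total degree <= d in n variables: C(n + d, d), bias column included
expected_cubic = comb(n_features + 3, 3)
# Bias + original columns + every pairwise product
expected_pairs = 1 + n_features + comb(n_features, 2)

# The values don't matter for counting columns, only the shape does
dummy = np.zeros((1, n_features))
cubic = PolynomialFeatures(degree=3).fit(dummy)
pairs = PolynomialFeatures(degree=2, interaction_only=True).fit(dummy)

print(expected_cubic, cubic.n_output_features_)  # 165 165
print(expected_pairs, pairs.n_output_features_)  # 37 37
```

That many columns from only 8 inputs is worth keeping in mind when comparing train and test scores.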

In [None]:
# Model!
lr_poly = LinearRegression()

lr_poly.fit(X_train_poly, y_train)

train_poly_preds = lr_poly.predict(X_train_poly)
test_poly_preds = lr_poly.predict(X_test_poly)

In [None]:
# evaluate
print(f"Train R2 Score: {r2_score(y_train, train_poly_preds):.3f}")
print(f"Test R2 Score: {r2_score(y_test, test_poly_preds):.3f}")

In [None]:
# visualize residuals, for the model that now has many polynomial cols
plt.scatter(train_poly_preds, y_train-train_poly_preds, label='Train')
plt.scatter(test_poly_preds, y_test-test_poly_preds, label='Test')

plt.axhline(y=0, color = 'red', label = '0')
plt.xlabel('predictions')
plt.ylabel('residuals')
plt.legend()
plt.show()

----

In [None]:
# Now let's set up interactions: degree=2, interaction_only=True
interactions = PolynomialFeatures(degree=2, interaction_only=True)

interactions.fit(X_train_scaled)

In [None]:
X_train_ints = interactions.transform(X_train_scaled)
X_test_ints = interactions.transform(X_test_scaled)

In [None]:
X_train_ints = pd.DataFrame(X_train_ints, columns = interactions.get_feature_names())
X_train_ints.head()

In [None]:
# Model!
lr_int = LinearRegression()

lr_int.fit(X_train_ints, y_train)

train_ints_preds = lr_int.predict(X_train_ints)
test_ints_preds = lr_int.predict(X_test_ints)

In [None]:
# evaluate
print(f"Train R2 Score: {r2_score(y_train, train_ints_preds):.3f}")
print(f"Test R2 Score: {r2_score(y_test, test_ints_preds):.3f}")

In [None]:
# visualize residuals, for the model that now has interaction cols
plt.scatter(train_ints_preds, y_train-train_ints_preds, label='Train')
plt.scatter(test_ints_preds, y_test-test_ints_preds, label='Test')

plt.axhline(y=0, color = 'red', label = '0')
plt.xlabel('predictions')
plt.ylabel('residuals')
plt.legend()
plt.show()

Evaluate: What do you think? Is this blanket way of approaching polynomial or interaction terms useful?

- 


## Resources:

[Feature Engineering and Selection: A Practical Approach for Predictive Models](https://bookdown.org/max/FES/) (computing done in R, but book focuses mostly on discussing the hows and whys rather than focusing on implementation)

- And their chapter on [Detecting Interaction Effects](https://bookdown.org/max/FES/detecting-interaction-effects.html)