# Subset Selection
* the approach identifies a subset of p of the k available features (predictors) that we believe are related to the response

* we then fit a model by OLS on this reduced set of variables

In [5]:
import pandas as pd
import numpy as np
import patsy
import itertools
import time
import statsmodels.api as sm

# import dataset 
hprice2 = pd.read_stata('http://fmwww.bc.edu/ec-p/data/wooldridge/hprice2.dta')

# write specification
f = 'lprice ~ lnox + lproptax + crime + rooms + dist + radial + stratio + lowstat'

# create design matrices
y1, X1 = patsy.dmatrices(f, data=hprice2, return_type='dataframe')

In [6]:
# pre-processing: demean outcome and features so all models can be fitted without an intercept
y = y1.sub(y1.mean())
X = X1.sub(X1.mean()).drop('Intercept',axis=1)
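
The demeaning step relies on a standard OLS result: regressing a demeaned outcome on demeaned features without an intercept gives the same slopes and the same RSS as regressing the raw outcome on the raw features with an intercept. The sketch below checks this on simulated data. All names in it (`X_raw`, `y_raw`, `X_dm`, `y_dm`) and the simulated coefficients are illustrative assumptions, and it uses `np.linalg.lstsq` rather than statsmodels to stay self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# hypothetical features with non-zero means, and an outcome with an intercept
X_raw = rng.normal(loc=5.0, scale=2.0, size=(n, 3))
y_raw = 1.5 + X_raw @ np.array([0.8, -0.3, 0.0]) + rng.normal(size=n)

# (a) least squares WITH an intercept on the raw data
X_const = np.column_stack([np.ones(n), X_raw])
beta_const, *_ = np.linalg.lstsq(X_const, y_raw, rcond=None)
rss_const = np.sum((y_raw - X_const @ beta_const) ** 2)

# (b) least squares WITHOUT an intercept on the demeaned data
X_dm = X_raw - X_raw.mean(axis=0)
y_dm = y_raw - y_raw.mean()
beta_dm, *_ = np.linalg.lstsq(X_dm, y_dm, rcond=None)
rss_dm = np.sum((y_dm - X_dm @ beta_dm) ** 2)

# slopes and RSS agree up to floating-point error
slopes_match = np.allclose(beta_const[1:], beta_dm)
rss_match = np.isclose(rss_const, rss_dm)
print(slopes_match, rss_match)
```

Because the RSS is unchanged, ranking subsets by RSS on the demeaned data should give the same ranking as fitting each subset with an intercept on the raw data.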

## Best Subset Selection
* fit a separate OLS regression for each possible combination of the k predictors

* fit all k models that contain exactly one predictor, then all ${k \choose 2} = \frac{k(k-1)}{2}$ models that contain exactly two predictors, etc.

* look at resulting models to identify the best one

* the number of models to consider grows rapidly with k: there are $2^k$ candidate models in total
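
To make the growth concrete, the short sketch below counts the models of each size with `math.comb`. The helper name `n_models` is made up for illustration; k = 8 matches the eight predictors in the specification `f` used in this notebook.

```python
from math import comb

def n_models(k):
    """Number of candidate models with p = 0, 1, ..., k predictors."""
    return [comb(k, p) for p in range(k + 1)]

k = 8  # eight predictors in the specification above
counts = n_models(k)
total = sum(counts)

print(counts)  # [1, 8, 28, 56, 70, 56, 28, 8, 1]
print(total)   # 256, i.e. 2**8 including the null model

# the total doubles with every extra predictor
print(sum(n_models(20)))  # 1048576
```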

### Algorithm: Best Subset Selection

1. Let $\mathcal{M}_0$ denote the null model (i.e., no predictors), which predicts the sample mean for each observation

2. For p = 1,2,..., k:

a) Fit all ${k \choose p}$ models that contain exactly p predictors

b) Pick the best among these ${k \choose p}$ models and call it $\mathcal{M}_p$; here, best means having the smallest RSS or, equivalently, the largest $R^2$
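
The equivalence in step 2b follows from the definition of $R^2$. As a short sketch, with $\hat{y}_i$ denoting the fitted values of a given candidate model (notation introduced here for illustration):

$$
\text{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad
\text{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad
R^2 = 1 - \frac{\text{RSS}}{\text{TSS}}
$$

TSS depends only on the response, so it is the same for every candidate model. Among models fitted to the same data, the one with the smallest RSS therefore has the largest $R^2$.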

In [7]:
# define a function that fits OLS on the selected columns of X and returns the fitted model and its RSS
def processSubset(feature_set):
    # Fit model on feature_set
    model = sm.OLS(y,X[list(feature_set)])
    regr = model.fit()

    # calculate RSS
    RSS = regr.ssr
    return {"model":regr, "RSS":RSS}

# define a function that selects the best model with p number of predictors
def getBest(p):
    
    # empty list to collect model and RSS 
    results = []
    
    # iterate over every combination of exactly p predictors
    for combo in itertools.combinations(X.columns, p):
        results.append(processSubset(combo))
    
    # Create a dataframe of the results
    models = pd.DataFrame(results)
    
    # Choose the model with the lowest RSS
    best_model = models.loc[models['RSS'].idxmin()]
    
    # Return the best row: the fitted model and its RSS
    return best_model

# dataframe where best models will be collected
models_best = pd.DataFrame(columns=["RSS", "model"])

# for loop that collects the best model for each number of predictors
for i in range(1, len(X.columns) + 1):
    models_best.loc[i] = getBest(i)

In [14]:
# make graphs of results - will do later

## Stepwise Selection: Forward Stepwise Selection
* for computational reasons, best subset selection cannot be applied when k is very large

* this algorithm begins with a model containing no predictors, then adds predictors one at a time until all predictors are in the model; at each step, the predictor that gives the greatest additional improvement in fit is added
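
A minimal sketch of this procedure on simulated data is given below. It assumes demeaned inputs and no intercept, as in the pre-processing earlier in the notebook, and uses `np.linalg.lstsq` in place of statsmodels so that it runs on its own. The names `rss`, `forward_stepwise`, `X_sim` and `y_sim` and the simulated coefficients are hypothetical. Forward stepwise fits $1 + k(k+1)/2$ models in total (37 for k = 8), compared with $2^k$ (256 for k = 8) under best subset selection.

```python
import numpy as np

def rss(X, y, cols):
    """RSS of a no-intercept least-squares fit on the given column indices."""
    if not cols:
        return float(y @ y)  # null model: demeaned y is predicted by zero
    beta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
    resid = y - X[:, cols] @ beta
    return float(resid @ resid)

def forward_stepwise(X, y):
    """Return the order in which predictors enter and the RSS after each step."""
    remaining = list(range(X.shape[1]))
    selected, path = [], []
    while remaining:
        # try each remaining predictor and keep the one giving the lowest RSS
        best = min(remaining, key=lambda j: rss(X, y, selected + [j]))
        selected.append(best)
        remaining.remove(best)
        path.append(rss(X, y, selected))
    return selected, path

# simulated data: only columns 2 and 0 affect the outcome
rng = np.random.default_rng(1)
n, k = 300, 5
X_sim = rng.normal(size=(n, k))
y_sim = 3.0 * X_sim[:, 2] + 1.0 * X_sim[:, 0] + rng.normal(size=n)

# demean so that no intercept is needed
X_sim = X_sim - X_sim.mean(axis=0)
y_sim = y_sim - y_sim.mean()

order, rss_path = forward_stepwise(X_sim, y_sim)
print(order)     # order in which the columns enter the model
print(rss_path)  # RSS after each addition
```

Each step only searches over the predictors not yet in the model, so the selected models are nested. Unlike best subset selection, the model of size p found this way is not guaranteed to be the lowest-RSS model of that size.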