# 6.5 Lab 1: Subset Selection Methods
## 6.5.1 Best Subset Selection
We first apply best subset selection to the **Hitters** data, with the goal of predicting a baseball player's **Salary** from various statistics associated with performance in the previous year.

First, we import the required packages and load the data:

In [1]:
import pandas as pd
import numpy as np
import patsy

In [2]:
# Treat '?' as missing, use the player name as the index, drop incomplete rows
hitters = pd.read_csv('../data/Hitters.csv',
                      na_values='?',
                      index_col=0).dropna()
hitters.head()

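| | AtBat | Hits | HmRun | Runs | RBI | Walks | Years | CAtBat | CHits | CHmRun | CRuns | CRBI | CWalks | League | Division | PutOuts | Assists | Errors | Salary | NewLeague |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| -Alan Ashby | 315 | 81 | 7 | 24 | 38 | 39 | 14 | 3449 | 835 | 69 | 321 | 414 | 375 | N | W | 632 | 43 | 10 | 475.0 | N |
| -Alvin Davis | 479 | 130 | 18 | 66 | 72 | 76 | 3 | 1624 | 457 | 63 | 224 | 266 | 263 | A | W | 880 | 82 | 14 | 480.0 | A |
| -Andre Dawson | 496 | 141 | 20 | 65 | 78 | 37 | 11 | 5628 | 1575 | 225 | 828 | 838 | 354 | N | E | 200 | 11 | 3 | 500.0 | N |
| -Andres Galarraga | 321 | 87 | 10 | 39 | 42 | 30 | 2 | 396 | 101 | 12 | 48 | 46 | 33 | N | E | 805 | 40 | 4 | 91.5 | N |
| -Alfredo Griffin | 594 | 169 | 4 | 74 | 51 | 35 | 11 | 4408 | 1133 | 19 | 501 | 336 | 194 | A | W | 282 | 421 | 25 | 750.0 | A |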


In [3]:
hitters.shape

(263, 20)
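To make the effect of `na_values` and `dropna()` in the loading cell concrete, here is a minimal sketch on an inline CSV. The three players and their values are made up for illustration and are not taken from the Hitters file.

```python
import io
import pandas as pd

# Hypothetical three-player file; '?' marks a missing salary.
csv_text = """Player,AtBat,Salary
-Player A,293,?
-Player B,315,475.0
-Player C,479,480.0
"""

# na_values='?' turns the marker into NaN, so Salary is parsed as a float column.
raw = pd.read_csv(io.StringIO(csv_text), na_values='?', index_col=0)

# dropna() keeps only the rows that have no missing value in any column.
clean = raw.dropna()

print(raw.shape, clean.shape)
print(clean['Salary'].dtype)
```

Without `na_values='?'`, the `Salary` column would be read as strings and `dropna()` would not remove anything.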

Some of the predictors are categorical, so we generate dummy variables for them and separate the response from the predictors. The `patsy` package builds both matrices from a formula; the trailing `-1` in the formula omits the intercept column.

In [4]:
pred_sum = '+'.join(hitters.columns.difference(['Salary']))
formula = f'Salary~{pred_sum}-1'
y, X = patsy.dmatrices(formula, hitters, return_type='matrix')
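The formula interface can be easier to follow on a tiny example. The sketch below uses a made-up four-row frame (the values are illustrative only) and asks for `return_type='dataframe'` instead of `'matrix'` so that the generated column names are visible.

```python
import pandas as pd
import patsy

# Hypothetical toy data: one numeric and one categorical predictor.
toy = pd.DataFrame({
    'Salary': [475.0, 480.0, 500.0, 91.5],
    'Hits':   [81, 130, 141, 87],
    'League': ['N', 'A', 'N', 'N'],
})

# '-1' drops the intercept column; return_type='dataframe' keeps column names.
y_toy, X_toy = patsy.dmatrices('Salary ~ Hits + League - 1', toy,
                               return_type='dataframe')

print(list(X_toy.columns))
print(X_toy.shape, y_toy.shape)
```

Because the intercept is removed, `patsy` gives the first categorical predictor one indicator column per level rather than dropping a reference level, so `League` contributes two columns here.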

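Next we define a `BestSubset` class. For each subset size it keeps the subset with the lowest training RSS, and it then chooses among those candidates by cross-validated mean squared error.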

In [57]:
from itertools import combinations

from sklearn.model_selection import cross_val_score


class BestSubset:
    """Best subset selection: lowest training RSS per size, then CV across sizes."""

    def __init__(self, model, X, y, assess):
        self.model = model
        self.X = X
        self.y = y
        self.assess = assess  # stored but not used yet; only CV is implemented
        self.num_features = np.shape(self.X)[1]

    def get_subset(self, subsets):
        """Return the column-index tuple in `subsets` with the lowest training RSS."""
        def compute_training_rss():
            for idx in subsets:
                model_fit = self.model.fit(self.X[:, idx], self.y)
                rss = ((model_fit.predict(self.X[:, idx]) - self.y) ** 2).sum()
                yield (idx, rss)
        return min(compute_training_rss(), key=lambda x: x[1])[0]

    def best_subset_cv(self, cv=5):
        """Return the per-size winner with the lowest cross-validated MSE."""
        def compute_test_score():
            for k in range(1, self.num_features + 1):
                subset = self.get_subset(combinations(range(self.num_features), k))
                # scikit-learn reports negated MSE, so flip the sign back
                score = -cross_val_score(self.model,
                                         self.X[:, subset], self.y,
                                         scoring='neg_mean_squared_error',
                                         cv=cv).mean()
                yield (subset, score)
        return min(compute_test_score(), key=lambda x: x[1])[0]

    def best_subset(self):
        # Currently always uses 2-fold cross-validation
        return self.best_subset_cv(cv=2)
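The first stage of the class (lowest training RSS for each subset size) can also be sketched with NumPy alone. This is an illustrative version under stated assumptions, not the class above: `rss_of` and `best_per_size` are hypothetical helper names, the fit uses `np.linalg.lstsq` with an explicit intercept column instead of scikit-learn, and the data are synthetic.

```python
from itertools import combinations

import numpy as np


def rss_of(X, y, cols):
    """Training RSS of a least-squares fit (with intercept) on the given columns."""
    design = np.column_stack([np.ones(len(y)), X[:, list(cols)]])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(resid @ resid)


def best_per_size(X, y):
    """For each size k, the column subset with the lowest training RSS."""
    p = X.shape[1]
    return {k: min(combinations(range(p), k), key=lambda c: rss_of(X, y, c))
            for k in range(1, p + 1)}


# Synthetic data (an assumption for illustration): y depends on columns 0 and 2 only.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 4))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 2] + rng.normal(scale=0.1, size=50)

winners = best_per_size(X_toy, y_toy)
for k, cols in winners.items():
    print(k, cols, round(rss_of(X_toy, y_toy, cols), 3))
```

Training RSS can only decrease as columns are added, so it cannot be used to compare subsets of different sizes; that is why the class switches to cross-validated error for the second stage. The search fits one model per subset, which is 2^p - 1 fits for p columns, so it is practical only for a small number of predictors.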

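We try the class on a small toy problem. The cells are listed in the order they need to run: import the model, build the data, fit, then select.

In [6]:
from sklearn import linear_model

In [62]:
A, b = np.array([[0,0,5],[1,1,3],[2,2,8]]), np.array([10,12,83])

In [63]:
reg = linear_model.LinearRegression()
reg_fit = reg.fit(A, b)

In [64]:
reg_fit.coef_

array([10.85714286, 10.85714286,  9.85714286])

In [65]:
select_model = BestSubset(reg, A, b, 'cv')
select_model.best_subset()

(0, 1, 2)

The remaining cells are quick checks of the idioms the class relies on. `min` with a `key` function returns the element whose key is smallest:

In [41]:
min([(1,8),(4,3)], key=lambda x:x[1])

(4, 3)

`np.shape` returns `(rows, columns)`, so index 1 gives the number of features:

In [46]:
np.shape([[1,2,3],[4,5,6]])

(2, 3)

The last two cells carry lower execution counts than cell 62, so their outputs come from an earlier state of the notebook: `reg_fit.rank_` refers to a previous fit, and the `TypeError` shows that `A` was still a plain Python list at that point.

In [24]:
reg_fit.rank_

1

In [52]:
A[[0,2],:]

TypeError: list indices must be integers or slices, not tuple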
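The `TypeError` above is what a plain Python list raises when indexed with `[rows, columns]`. A minimal sketch of the difference, reusing the toy values from cell 62 (the variable names `rows`, `arr`, and `picked` are introduced here for illustration):

```python
import numpy as np

rows = [[0, 0, 5], [1, 1, 3], [2, 2, 8]]  # nested list, same values as `A` in cell 62

# A plain list only accepts integers or slices, so [rows, cols] indexing fails.
try:
    rows[[0, 2], :]
    list_error = None
except TypeError as exc:
    list_error = str(exc)

# Converting to an ndarray enables selecting rows by a list of indices.
arr = np.array(rows)
picked = arr[[0, 2], :]

print(list_error)
print(picked)
```

This is why `BestSubset` expects `X` to be a NumPy array (or a subclass such as the matrix returned by `patsy`): its column selection `self.X[:, idx]` relies on the same kind of indexing.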