# 6.5 Lab 1: Subset Selection Methods
## 6.5.1 Best Subset Selection
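We first apply the best subset selection approach. The lab text refers to the **Hitters** data, where the goal is to predict a baseball player's **Salary** from statistics associated with performance in the previous year. Note that the cells below actually load `Credit.csv` and use **Balance** as the response, although the data frame keeps the name `hitters`.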

First, we load the data:

In [1]:
import pandas as pd
import numpy as np
import patsy

In [4]:
hitters = pd.read_csv('../data/Credit.csv',
                      na_values='?',
                      index_col=0).dropna()
hitters.head()

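| | Income | Limit | Rating | Cards | Age | Education | Gender | Student | Married | Ethnicity | Balance |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 14.891 | 3606 | 283 | 2 | 34 | 11 | Male | No | Yes | Caucasian | 333 |
| 2 | 106.025 | 6645 | 483 | 3 | 82 | 15 | Female | Yes | Yes | Asian | 903 |
| 3 | 104.593 | 7075 | 514 | 4 | 71 | 11 | Male | No | No | Asian | 580 |
| 4 | 148.924 | 9504 | 681 | 3 | 36 | 11 | Female | No | No | Asian | 964 |
| 5 | 55.882 | 4897 | 357 | 2 | 68 | 16 | Male | No | Yes | Caucasian | 331 |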


In [5]:
hitters.shape

(400, 11)

Some of the predictors are categorical, so we generate dummy variables for them and separate the response variable from the predictors. The `patsy` package builds both matrices from a single formula.

In [6]:
pred_sum = '+'.join(hitters.columns.difference(['Balance']))
formula = f'Balance~{pred_sum}-1'
y, X = patsy.dmatrices(formula, hitters, return_type='matrix')
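The formula above ends in `-1`, which drops the intercept. The following is a minimal NumPy-only sketch, not `patsy` itself, of why that matters for dummy coding: a categorical variable coded with one column per level is collinear with an intercept column, so a design can hold either the intercept or the full set of level columns without losing rank, but not both. The variable names `group`, `levels`, and `one_hot` are hypothetical toy data.

```python
import numpy as np

# Toy categorical variable with three levels (hypothetical data).
levels = np.array(["a", "b", "c"])
group = np.array(["a", "b", "c", "a", "b", "c"])

# One column per level: each row has exactly one 1.
one_hot = (group[:, None] == levels[None, :]).astype(float)

# Adding an intercept makes the columns linearly dependent,
# because the three level columns already sum to a column of ones.
intercept = np.ones((len(group), 1))
with_intercept = np.hstack([intercept, one_hot])

rank_one_hot = np.linalg.matrix_rank(one_hot)
rank_with_intercept = np.linalg.matrix_rank(with_intercept)

print(one_hot.shape, rank_one_hot)
print(with_intercept.shape, rank_with_intercept)
```

The four-column design has the same rank as the three-column one, which is why a formula with an intercept would need to drop one level of each categorical predictor.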

Next, we define a `BestSubset` class that selects the best subset.

In [23]:
class BestSubset:
    """Best subset selection: the best subset of each size is found by
    training RSS, and the size is then chosen by cross-validated MSE."""

    def __init__(self, model, X, y, assess='cv'):
        self.model = model
        # Plain arrays make column indexing with a list of indices reliable.
        self.X = np.asarray(X)
        self.y = np.asarray(y)
        self.assess = assess  # stored only; cross-validation is the sole criterion implemented
        self.num_features = self.X.shape[1]

    def get_subset(self, subsets):
        # Return the candidate column subset with the lowest training RSS.
        def compute_training_rss():
            for idx in subsets:
                cols = list(idx)
                model_fit = self.model.fit(self.X[:, cols], self.y)
                rss = ((model_fit.predict(self.X[:, cols]) - self.y) ** 2).sum()
                yield (cols, rss)
        return min(compute_training_rss(), key=lambda x: x[1])[0]

    def best_subset_cv(self, cv=5, features=None):
        # features: largest subset size to consider (defaults to all columns).
        from sklearn.model_selection import cross_val_score
        from itertools import combinations
        num = self.num_features if features is None else features

        def compute_test_score():
            for k in range(num):
                subset = self.get_subset(combinations(range(self.num_features), k + 1))
                # scikit-learn reports negated MSE, so flip the sign back.
                score = -cross_val_score(self.model,
                                         self.X[:, subset], self.y,
                                         scoring='neg_mean_squared_error',
                                         cv=cv).mean()
                yield (subset, score)
        return min(compute_test_score(), key=lambda x: x[1])[0]

    def best_subset(self, cv=5, features=None):
        # Defaults allow calling best_subset() with no arguments.
        return self.best_subset_cv(cv=cv, features=features)

In [27]:
# reg is the LinearRegression estimator created in cell In [16]
select_model = BestSubset(reg, X, y, 'cv')
select_model.best_subset()
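The two-stage procedure that `BestSubset` is built around can also be sketched without scikit-learn. The block below is an illustrative version under stated assumptions, not the class above: it uses synthetic data, `np.linalg.lstsq` with an explicit intercept column, and a simple unshuffled K-fold split. The helper names `rss`, `best_of_each_size`, and `cv_mse` are hypothetical.

```python
from itertools import combinations
import numpy as np

def rss(X, y, cols):
    """Training RSS of a least-squares fit (with intercept) on the given columns."""
    design = np.column_stack([np.ones(len(y)), X[:, list(cols)]])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(resid @ resid)

def best_of_each_size(X, y):
    """Stage 1: for each size k, keep the subset with the lowest training RSS."""
    p = X.shape[1]
    return {k: min(combinations(range(p), k), key=lambda c: rss(X, y, c))
            for k in range(1, p + 1)}

def cv_mse(X, y, cols, n_folds=5):
    """Stage 2 criterion: K-fold cross-validated MSE for a fixed subset."""
    folds = np.array_split(np.arange(len(y)), n_folds)
    errors = []
    for test_idx in folds:
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False
        d_train = np.column_stack([np.ones(train.sum()), X[train][:, list(cols)]])
        d_test = np.column_stack([np.ones(len(test_idx)), X[test_idx][:, list(cols)]])
        coef, *_ = np.linalg.lstsq(d_train, y[train], rcond=None)
        errors.append(np.mean((y[test_idx] - d_test @ coef) ** 2))
    return float(np.mean(errors))

# Synthetic data (assumption): only columns 0 and 2 carry signal.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 2] + rng.normal(scale=0.1, size=200)

candidates = best_of_each_size(X_demo, y_demo)
chosen = min(candidates.values(), key=lambda c: cv_mse(X_demo, y_demo, c))
print(candidates[2])
print(chosen)
```

Training RSS can only compare subsets of the same size, since it never increases when a column is added. That is why the choice between sizes is left to a held-out criterion such as cross-validated MSE.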

In [9]:
from sklearn import linear_model

In [16]:
reg = linear_model.LinearRegression()
reg_fit = reg.fit(X, y)

In [19]:
X

DesignMatrix with shape (400, 12)
  Columns:
    ['Ethnicity[African American]',
     'Ethnicity[Asian]',
     'Ethnicity[Caucasian]',
     'Gender[T.Male]',
     'Married[T.Yes]',
     'Student[T.Yes]',
     'Age',
     'Cards',
     'Education',
     'Income',
     'Limit',
     'Rating']
  Terms:
    'Ethnicity' (columns 0:3)
    'Gender' (column 3)
    'Married' (column 4)
    'Student' (column 5)
    'Age' (column 6)
    'Cards' (column 7)
    'Education' (column 8)
    'Income' (column 9)
    'Limit' (column 10)
    'Rating' (column 11)
  (to view full data, use np.asarray(this_obj))

In [14]:
?linear_model.LinearRegression

Init signature:
linear_model.LinearRegression(
    *,
    fit_intercept=True,
    normalize=False,
    copy_X=True,
    n_jobs=None,
    positive=False,
)
Docstring:
Ordinary least squares Linear Regression.

LinearRegression fits a linear model with coefficients w = (w1, ..., wp)
to minimize the residual sum of squares between the observed targets in
the dataset, and the targets predicted by the linear approximation.

Parameters
----------
fit_intercept : bool, default=True
    Whether to calculate the intercept for this model. If set
    to False, no intercept will be used in calculati

In [62]:
A, b = np.array([[0,0,5],[1,1,3],[2,2,8]]), np.array([10,12,83])

In [24]:
reg_fit.rank_

1

In [46]:
np.shape([[1,2,3],[4,5,6]])

(2, 3)

In [52]:
A[[0,2],:]

TypeError: list indices must be integers or slices, not tuple
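The error above is what Python raises when a plain nested list is indexed with `[rows, :]`. The cell numbers suggest that `A` was still a list when cell In [52] ran, and was only turned into an array in cell In [62]; that reading is an assumption based on the execution order. The sketch below uses hypothetical names `nested` and `A_demo` to contrast the two cases, including the column selection by a tuple of indices that `BestSubset.get_subset` relies on.

```python
import numpy as np

nested = [[0, 0, 5], [1, 1, 3], [2, 2, 8]]

# A plain list does not support (rows, columns) indexing.
try:
    nested[[0, 2], :]
    list_error = None
except TypeError as exc:
    list_error = str(exc)

# Converting to an array enables fancy indexing on either axis.
A_demo = np.array(nested)
rows = A_demo[[0, 2], :]    # rows 0 and 2, all columns
cols = A_demo[:, (0, 2)]    # all rows, columns 0 and 2

print(list_error)
print(rows)
print(cols)
```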