# Homework #4: Subset Selection and Shrinkage Methods

## Background

In car sales, one of the most critical metrics is the number of days a vehicle spends on the lot. Some estimates suggest that each day on the lot costs the dealership about $10 per vehicle in depreciation and maintenance. Multiply that by the hundreds (or thousands) of vehicles a dealership may hold in inventory, and this quickly becomes one of its largest costs. A dataset provided by DriveTime contains vehicle information as well as the number of days each vehicle spent on the lot. Our task is to find any relationships that may explain an increase or decrease in days to sell.

### Relevant Datasets

`drive_time_sedans.csv`

Source: https://github.com/Fumanguyen/drivetime-sedans-used-vehicle-market/blob/master/drive_time_sedans.csv

## Task 1: Import the dataset and convert the categorical variables to dummy variables.

**Important Note**: The tasks below can be very computationally intensive. If you don't want to wait a long time for things to run, or you don't feel your computer is powerful enough to complete these tasks in a reasonable time, I suggest dropping the `make.model`, `state`, and/or `makex` variables. Your grade will not be based on the inclusion or exclusion of any variables; I'm more interested in the methods. If you have the resources and are curious to explore more, feel free to use all variables.

In [4]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.preprocessing import scale
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import ElasticNetCV
from sklearn.metrics import mean_squared_error

In [5]:
df = pd.read_csv('drive_time_sedans.csv')
df = df.drop(['make.model', 'state', 'makex'], axis=1)
df = pd.get_dummies(df, columns=['vehicle.type', 'domestic.import', 'vehicle.age.group', 'color.set','overage'])
df.head()

Unnamed: 0,data.set,total.cost,lot.sale.days,mileage,vehicle.age,vehicle.type_ECONOMY,vehicle.type_FAMILY.LARGE,vehicle.type_FAMILY.MEDIUM,vehicle.type_FAMILY.SMALL,vehicle.type_LUXURY,...,color.set_BLACK,color.set_BLUE,color.set_GOLD,color.set_GREEN,color.set_PURPLE,color.set_RED,color.set_SILVER,color.set_WHITE,overage_NO,overage_YES
0,TRAIN,4037,135,67341,8,False,True,False,False,False,...,False,False,False,False,False,False,True,False,False,True
1,TRAIN,4662,18,69384,4,False,False,False,True,False,...,False,False,False,False,False,False,True,False,True,False
2,TRAIN,4459,65,58239,4,True,False,False,False,False,...,False,False,False,False,False,True,False,False,True,False
3,TRAIN,4279,1,58999,3,True,False,False,False,False,...,False,False,False,False,False,True,False,False,True,False
4,TRAIN,4472,37,47234,6,False,False,True,False,False,...,False,True,False,False,False,False,False,False,True,False


## Task 2: This dataset specifies which observations to use as train/test/validate. Split it into three dataframes based on these values.

If you've already converted those to dummy variables, you may have to subset slightly differently. Search "*conditional subset pandas dataframe*" for a starting point or reach out to me (before the soft deadline) for guidance.
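
If `data.set` had been passed to `get_dummies`, the split labels would appear as indicator columns named `data.set_TRAIN`, `data.set_TEST`, and `data.set_VALIDATE`. The sketch below shows one way to subset in that case. The toy frame and the helper `subset_by_indicator` are made up for illustration and are not part of the assignment code.

```python
import pandas as pd

# Toy frame standing in for the real data; the values are invented.
toy = pd.DataFrame({
    "data.set": ["TRAIN", "TEST", "VALIDATE", "TRAIN"],
    "lot.sale.days": [135, 18, 65, 1],
})

# Converting the split column creates one indicator column per level.
toy_dummies = pd.get_dummies(toy, columns=["data.set"])
split_cols = [c for c in toy_dummies.columns if c.startswith("data.set_")]

def subset_by_indicator(frame, indicator, drop_cols):
    """Keep rows where the indicator is set, then drop all split indicators."""
    mask = frame[indicator].astype(bool)
    return frame.loc[mask].drop(columns=drop_cols)

toy_train = subset_by_indicator(toy_dummies, "data.set_TRAIN", split_cols)
toy_test = subset_by_indicator(toy_dummies, "data.set_TEST", split_cols)
toy_val = subset_by_indicator(toy_dummies, "data.set_VALIDATE", split_cols)

print(toy_train.shape, toy_test.shape, toy_val.shape)
```

Dropping the indicator columns after the split keeps them from being picked up as predictors later.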

In [7]:
df_train = df[df['data.set'] == 'TRAIN'].drop('data.set', axis=1)
df_test = df[df['data.set'] == 'TEST'].drop('data.set', axis=1)
df_val = df[df['data.set'] == 'VALIDATE'].drop('data.set', axis=1)

print('Train shape:', df_train.shape)
print('Test shape:', df_test.shape)
print('Validate shape:', df_val.shape)

Train shape: (8753, 26)
Test shape: (4376, 26)
Validate shape: (4377, 26)


## Task 3: Normalize `total.cost`, `mileage`, and `vehicle.age`

In [9]:
cols_to_normalize = ['total.cost', 'mileage', 'vehicle.age']
scaler = StandardScaler()
df_train[cols_to_normalize] = scaler.fit_transform(df_train[cols_to_normalize])
df_test[cols_to_normalize] = scaler.transform(df_test[cols_to_normalize])
df_val[cols_to_normalize] = scaler.transform(df_val[cols_to_normalize])
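
The scaler is fit on the training rows only, and the same training mean and standard deviation are then reused for the test and validation rows. A small NumPy sketch of the equivalent arithmetic, using made-up mileage values:

```python
import numpy as np

# Made-up mileage values, for illustration only.
train = np.array([60000.0, 70000.0, 80000.0])
val = np.array([65000.0, 90000.0])

# StandardScaler divides by the population standard deviation (ddof=0).
mu = train.mean()
sigma = train.std(ddof=0)

train_z = (train - mu) / sigma
val_z = (val - mu) / sigma  # reuse the training statistics; do not refit on val

print(train_z.mean(), train_z.std())
print(val_z)
```

The standardized training column has mean 0 and standard deviation 1 by construction. The validation column generally does not, since it is scaled with statistics it did not contribute to.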

## Task 4: Use the code from the applied lecture to perform forward stepwise selection, with the single validation set from before (as opposed to cross-validation). Return not only the AIC, BIC, and Adjusted $R^2$, as was shown in the lecture, but also the MSE on the validation set. 

In [11]:
def processSubset(X, y, predictor_variables, response_variable):
    # Fit an OLS model on the given predictors and compute its training RSS
    X_subset = X[list(predictor_variables)]
    regr = sm.OLS(y, X_subset).fit()
    RSS = ((regr.predict(X_subset) - y[response_variable]) ** 2).sum()
    return {"model": regr, "RSS": RSS}

def forward(X, y, predictors, response_variable):
    remaining_predictors = [p for p in X.columns if p not in predictors]
    results = []

    for p in remaining_predictors:
        results.append( processSubset(X, y, predictors + [p], response_variable))

    models = pd.DataFrame(results)
    best_model = models.loc[models['RSS'].argmin()]

    return best_model

In [12]:
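# Cast to float: get_dummies returns boolean columns, and mixing them with
# float columns produces an object array that statsmodels may reject.
X_train = df_train.drop('lot.sale.days', axis=1).astype(float)
y_train = df_train[['lot.sale.days']]
X_val = df_val.drop('lot.sale.days', axis=1).astype(float)
y_val = df_val[['lot.sale.days']]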

models_fwd = pd.DataFrame(columns=["RSS", "model", "AIC", "BIC", "AdjR2", "MSE"])
models_fwd.drop(models_fwd.index, inplace=True)

predictors = []

In [13]:
for i in range(1, len(X_train.columns) + 1):
    # Add the single predictor that gives the lowest training RSS
    models_fwd.loc[i] = forward(X_train, y_train, predictors, 'lot.sale.days')
    best = models_fwd.loc[i, 'model']
    predictors = best.model.exog_names

    models_fwd.loc[i, 'AIC'] = best.aic
    models_fwd.loc[i, 'BIC'] = best.bic
    models_fwd.loc[i, 'AdjR2'] = best.rsquared_adj

    # Predict on the validation set using only the predictors selected so far;
    # passing all of X_val causes the "shapes ... not aligned" ValueError.
    val_pred = best.predict(X_val[predictors])
    models_fwd.loc[i, 'MSE'] = mean_squared_error(y_val, val_pred)

print(models_fwd)

## Task 5: Using the code from the shrinkage methods lecture, find the optimal $\alpha$ and $\lambda$ for an Elastic Net regression using Cross-Validation.

Note: Remember that $\lambda$ is the argument `alpha` in scikit-learn and $\alpha$ is the `l1_ratio` argument. Sorry that nobody can settle on terminology.
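
A minimal sketch of one way to approach this with `ElasticNetCV`, which is already imported above. It runs on synthetic stand-in data so that it is self-contained. The names `X_demo`, `y_demo`, and the candidate `l1_ratios` grid are assumptions for illustration, and in the notebook the training and validation frames from the earlier tasks would take their place.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.metrics import mean_squared_error

# Synthetic stand-in data (not the DriveTime data): 10 features, 3 of them active.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
true_coef = np.array([3.0, -2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.5])
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=200)
X_fit, X_hold = X_demo[:150], X_demo[150:]
y_fit, y_hold = y_demo[:150], y_demo[150:]

# Candidate mixing values: the lecture's alpha, scikit-learn's l1_ratio.
l1_ratios = [0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1.0]

# For each l1_ratio, ElasticNetCV builds a grid of penalty strengths
# (the lecture's lambda, scikit-learn's alpha) and scores it with 5-fold CV.
enet = ElasticNetCV(l1_ratio=l1_ratios, cv=5, max_iter=10000)
enet.fit(X_fit, y_fit)

holdout_mse = mean_squared_error(y_hold, enet.predict(X_hold))
print("best l1_ratio (lecture alpha):", enet.l1_ratio_)
print("best alpha (lecture lambda):", enet.alpha_)
print("hold-out MSE:", holdout_mse)
```

Computing the hold-out MSE the same way as the validation MSE in Task 4 puts the Elastic Net and the forward stepwise models on a common footing for the final question.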

## Question: Given all of the results you've found, which model would you choose and why? Hint: There is no right answer but you will need to justify any answer you give.