# 02: Model Exploration
---

## 1. Imports

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer


from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV, Lasso, Ridge, ElasticNetCV
from sklearn.metrics import r2_score, mean_squared_error

import sys

if not sys.warnoptions:
    import warnings
    warnings.simplefilter("ignore")

---
## 2. Data

In [2]:
ames_train = pd.read_csv('../data/train_clean.csv')
ames_test = pd.read_csv('../data/test.csv')

---
## 3. Pre-Processing

### 3.1. Train/Validation Split

In [3]:
X = ames_train.drop(columns = ['pid','saleprice'])
y = ames_train['saleprice']

In [4]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.2, random_state = 42)

### 3.2. One-Hot-Encoding, Scaling & Transforming

In [5]:
# Creating lists of the numerical and categorical features to use in the transformer

# Numeric (and boolean) columns, selected through the public select_dtypes API
numerical = X_train.select_dtypes(include=['number', 'bool']).columns.tolist()
categorical = [col for col in X_train.columns if col not in numerical]

In [6]:
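# Creating a categorical feature pipeline that will one hot encode the data
# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; newer versions need sparse_output=False
cat_pipe = Pipeline([('ohe', OneHotEncoder(handle_unknown='ignore', sparse=False))])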

# Creating a numerical feature pipeline that will force the different unit data into a common scale
num_pipe = Pipeline([('ss', StandardScaler())])


# Creating a ColumnTransformer that will be used to apply the pipelines to their respective features
transformer = ColumnTransformer([
    ('cat', cat_pipe, categorical),
    ('num', num_pipe, numerical)
])

In [7]:
# Applying the transformer to the X_train and converting back to a DataFrame

Xt_train = transformer.fit_transform(X_train)
Xt_train = pd.DataFrame(Xt_train, columns = transformer.get_feature_names_out(X_train.columns))

In [8]:
# Applying the transformer to the X_val and converting back to a DataFrame

Xt_val = transformer.transform(X_val)
Xt_val = pd.DataFrame(Xt_val, columns = transformer.get_feature_names_out(X_val.columns))

In [186]:
print('X_train features: ' + str(X_train.shape[1]))
print('Xt_train features: ' + str(Xt_train.shape[1]))

X_train features: 79
Xt_train features: 301


Originally, there were 79 features in the dataset. After one-hot-encoding the categorical features, there are now 301 features.
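To make the jump from 79 to 301 columns concrete, here is a minimal pure-Python sketch of what one-hot encoding does to a single categorical column. The helper `one_hot` and the sample values are hypothetical; they only mimic the behavior of `OneHotEncoder(handle_unknown='ignore')` and are not scikit-learn's implementation.

```python
def one_hot(train_values, new_values):
    """One-hot encode new_values using the categories seen in train_values."""
    # Categories are learned from the training data only, in sorted order
    categories = sorted(set(train_values))
    encoded = []
    for value in new_values:
        # An unseen category yields an all-zero row (like handle_unknown='ignore')
        encoded.append([1 if value == cat else 0 for cat in categories])
    return categories, encoded


# Hypothetical sample values for one categorical feature
train_col = ["NAmes", "OldTown", "NAmes", "Gilbert"]
val_col = ["Gilbert", "Somerst"]  # "Somerst" was not seen during training

categories, encoded = one_hot(train_col, val_col)
print(categories)  # one output column per training category
print(encoded)     # the unseen category becomes a row of zeros
```

A single column with three categories turns into three columns, which is why the categorical features account for most of the 301 columns after the transformation.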

---
## 4. Model Exploration

### 4.1. y_train & y_val statistics

In [31]:
y_train.describe()

count      1640.000000
mean     181718.527439
std       79793.423579
min       13100.000000
25%      129900.000000
50%      163000.000000
75%      214600.000000
max      611657.000000
Name: saleprice, dtype: float64

In [29]:
y_val.describe()

count       411.000000
mean     180476.819951
std       77175.170846
min       12789.000000
25%      129225.000000
50%      160000.000000
75%      211000.000000
max      451950.000000
Name: saleprice, dtype: float64

Since RMSE will be used to evaluate the models, I've checked the standard deviation of the y_train (79,793) and y_val (77,175) Series as a baseline. The standard deviation is roughly the RMSE a model would get by always predicting the mean sale price, while a model's RMSE describes how far, on average, its predictions are from the actual values. A useful model should therefore have an RMSE well below the standard deviation.

The R2 score will also be used; it measures the proportion of the variance in the target that a model explains. The best possible score is 1, a score of 0 is no better than always predicting the mean, and the score goes negative for a model that does worse than that. When comparing the train and validation R2 scores, a validation R2 score that is close to (or higher than) the train R2 score suggests that the model is likely to do well when applied to unseen data. On the other hand, if the train R2 score is much higher than the validation R2 score, then the model is probably overfit and will likely not do well when applied to unseen data.
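The relationship between the two metrics and the "always predict the mean" baseline can be sketched in plain Python. The helpers `rmse` and `r2` and the sale prices below are made up for illustration; the notebook itself relies on scikit-learn's `mean_squared_error` and each model's `.score` method.

```python
import math
import statistics


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = statistics.fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up sale prices, for illustration only
y_true = [100_000, 150_000, 200_000, 250_000]
mean_pred = [statistics.fmean(y_true)] * len(y_true)  # baseline: always predict the mean
bad_pred = [400_000, 0, 500_000, 0]                   # deliberately poor predictions

print(rmse(y_true, mean_pred), statistics.pstdev(y_true))  # the two values coincide
print(r2(y_true, mean_pred))                               # 0.0
print(r2(y_true, bad_pred))                                # negative
```

Note that `describe()` reports the sample standard deviation (dividing by n - 1), so it is very slightly larger than the baseline RMSE, which divides by n.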

### 4.2. Linear Regression #1

In [13]:
# Instantiating and fitting a linear regression to the training data

lr = LinearRegression()

lr.fit(Xt_train, y_train)

LinearRegression()

In [17]:
# Train R2 score

lr.score(Xt_train, y_train)

0.9439858723675133

In [18]:
# Validation R2 score

lr.score(Xt_val, y_val)

-2.0394188615477022e+21

In [21]:
# Train RMSE

mean_squared_error(y_train, lr.predict(Xt_train), squared=False)

18879.193606386874

In [23]:
# Validation RMSE

mean_squared_error(y_val, lr.predict(Xt_val), squared=False)

3480982442410043.5

The linear regression model did not perform well. The train R2 score was 0.94, while the validation R2 score was hugely negative (about -2.0e21). The model is severely overfit: it was fit too closely to the training data and therefore does not generalize to unseen data. A plausible cause is that, without regularization, the model assigns extreme coefficients to the many collinear one-hot-encoded columns. The train RMSE (18,879) was less than the train standard deviation (79,793), but the validation RMSE (3,480,982,442,410,043) was far beyond the validation standard deviation (77,175).

### 4.3. RidgeCV #1

In [85]:
# Instantiating and fitting RidgeCV to the training data

ridgecv = RidgeCV(alphas = np.logspace(0, 5, 100))

ridgecv.fit(Xt_train, y_train)

RidgeCV(alphas=array([1.00000000e+00, 1.12332403e+00, 1.26185688e+00, 1.41747416e+00,
       1.59228279e+00, 1.78864953e+00, 2.00923300e+00, 2.25701972e+00,
       2.53536449e+00, 2.84803587e+00, 3.19926714e+00, 3.59381366e+00,
       4.03701726e+00, 4.53487851e+00, 5.09413801e+00, 5.72236766e+00,
       6.42807312e+00, 7.22080902e+00, 8.11130831e+00, 9.11162756e+00,
       1.02353102e+01, 1.14975700e+0...
       6.89261210e+03, 7.74263683e+03, 8.69749003e+03, 9.77009957e+03,
       1.09749877e+04, 1.23284674e+04, 1.38488637e+04, 1.55567614e+04,
       1.74752840e+04, 1.96304065e+04, 2.20513074e+04, 2.47707636e+04,
       2.78255940e+04, 3.12571585e+04, 3.51119173e+04, 3.94420606e+04,
       4.43062146e+04, 4.97702356e+04, 5.59081018e+04, 6.28029144e+04,
       7.05480231e+04, 7.92482898e+04, 8.90215085e+04, 1.00000000e+05]))

In [86]:
# Retrieving the best-performing alpha

ridgecv.alpha_

10.235310218990262

In [87]:
# Train R2 score

ridgecv.score(Xt_train, y_train)

0.912682421854184

In [88]:
# Validation R2 score

ridgecv.score(Xt_val, y_val)

0.9148016013374631

In [89]:
# Train RMSE

mean_squared_error(y_train, ridgecv.predict(Xt_train), squared=False)

23571.40623404403

In [90]:
# Validation RMSE

mean_squared_error(y_val, ridgecv.predict(Xt_val), squared=False)

22499.057884853388

The RidgeCV model applies ridge (L2) regularization to a linear regression model. This means that the coefficients, especially large ones, are shrunk toward zero (though not exactly to zero), thereby reducing model complexity to help prevent overfitting. The alpha parameter dictates the strength of the regularization, making it very important to get the alpha just right. Fortunately, given a range, the RidgeCV model itself can cross-validate the candidate alphas and fit the data using the best-performing one.

The best-performing alpha here is 10.24, and it allowed for a much better performing model, compared to the simple linear regression.

- The train R2 score was 0.9127, and the validation R2 score was 0.9148. Having both scores above 0.90 indicates that the model is performing well, and a validation R2 score that matches (here, slightly exceeds) the train R2 score indicates that the model generalizes well to unseen data.

- The train RMSE (23,571) was less than the train standard deviation (79,793), and the validation RMSE (22,499) was less than the validation standard deviation (77,175).
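The shrinking effect of the ridge penalty described above can be seen in a one-feature toy problem. This is a sketch under simplifying assumptions (a single feature, no intercept, made-up data), and `ridge_coef_1d` is a hypothetical helper rather than anything RidgeCV exposes.

```python
def ridge_coef_1d(x, y, alpha):
    """Closed-form ridge coefficient for one feature and no intercept.

    Minimizes sum((y - w*x)**2) + alpha * w**2, giving
    w = sum(x*y) / (sum(x*x) + alpha).
    """
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + alpha)


# Made-up, already-centered data where y = 4 * x exactly
x = [-1.5, -0.5, 0.5, 1.5]
y = [4 * xi for xi in x]

# alpha = 0 recovers ordinary least squares; larger alphas shrink the coefficient
for alpha in [0, 1, 10, 100]:
    print(alpha, round(ridge_coef_1d(x, y, alpha), 4))
```

The coefficient keeps shrinking as alpha grows but never reaches exactly zero, which is the key difference from the lasso penalty.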

### 4.4. RidgeCV #2

In [81]:
# Instantiating and fitting RidgeCV to the training data

ridgecv2 = RidgeCV(alphas = np.logspace(0, 2, 100))

ridgecv2.fit(Xt_train, y_train)

RidgeCV(alphas=array([  1.        ,   1.04761575,   1.09749877,   1.149757  ,
         1.20450354,   1.26185688,   1.32194115,   1.38488637,
         1.45082878,   1.51991108,   1.59228279,   1.66810054,
         1.7475284 ,   1.83073828,   1.91791026,   2.009233  ,
         2.10490414,   2.20513074,   2.3101297 ,   2.42012826,
         2.53536449,   2.65608778,   2.7825594 ,   2.91505306,
         3.05385551,   3.19926714,   3.35160265,   3.51119173,
         3.67837977,   3.85352859,   4.03701726,   4....
        23.64489413,  24.77076356,  25.95024211,  27.18588243,
        28.48035868,  29.8364724 ,  31.2571585 ,  32.74549163,
        34.30469286,  35.93813664,  37.64935807,  39.44206059,
        41.320124  ,  43.28761281,  45.34878508,  47.50810162,
        49.77023564,  52.14008288,  54.62277218,  57.22367659,
        59.94842503,  62.80291442,  65.79332247,  68.92612104,
        72.20809018,  75.64633276,  79.24828984,  83.02175681,
        86.97490026,  91.11627561,  95.4548456

In [82]:
# Retrieving the best-performing alpha

ridgecv2.alpha_

9.770099572992255

In [83]:
# Train R2 score

ridgecv2.score(Xt_train, y_train)

0.9130516203150741

In [84]:
# Validation R2 score

ridgecv2.score(Xt_val, y_val)

0.9147778005079219

In [73]:
# Train RMSE

mean_squared_error(y_train, ridgecv2.predict(Xt_train), squared=False)

23416.82336432782

In [74]:
# Validation RMSE

mean_squared_error(y_val, ridgecv2.predict(Xt_val), squared=False)

22509.156453719017

Since the RidgeCV model did so well, I wanted to see if I could find a better performing alpha by focusing the range a bit more.

The best-performing alpha here is 9.77, and the model performed very similarly to the first RidgeCV model.
- The train R2 score increased slightly (0.9127 -> 0.9131), while the validation R2 score decreased very slightly (0.91480 -> 0.91478).
- The train RMSE decreased (23,571 -> 23,417), while the validation RMSE increased (22,499 -> 22,509).

Though the train metrics improved, the opposite happened for the validation metrics. For this reason, the RidgeCV model with an alpha of 10.24 remains the best performer.
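To see what "focusing the range" changes, the two alpha grids can be rebuilt in plain Python. The `logspace` function below is a hypothetical stand-in for `np.logspace` with base 10, written only so the sketch runs without NumPy.

```python
def logspace(start, stop, num):
    """Pure-Python stand-in for np.logspace(start, stop, num) with base 10."""
    step = (stop - start) / (num - 1)
    return [10 ** (start + i * step) for i in range(num)]


wide = logspace(0, 5, 100)    # grid used by the first RidgeCV
narrow = logspace(0, 2, 100)  # grid used by the second RidgeCV

# Ratio between neighboring alphas: a smaller ratio means a finer grid
print(wide[1] / wide[0])      # about 1.1233
print(narrow[1] / narrow[0])  # about 1.0476

# The selected alphas are simply grid points of their respective ranges
print(wide[20])    # about 10.2353
print(narrow[49])  # about 9.7701
```

Because the two selected alphas sit so close together, nearly identical metrics are to be expected; the finer grid mostly confirms that the optimum lies near 10.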

### 4.5. LassoCV #1

In [140]:
# Instantiating and fitting LassoCV to the training data

lassocv = LassoCV(alphas = np.logspace(0, 5, 100))

lassocv.fit(Xt_train, y_train)

LassoCV(alphas=array([1.00000000e+00, 1.12332403e+00, 1.26185688e+00, 1.41747416e+00,
       1.59228279e+00, 1.78864953e+00, 2.00923300e+00, 2.25701972e+00,
       2.53536449e+00, 2.84803587e+00, 3.19926714e+00, 3.59381366e+00,
       4.03701726e+00, 4.53487851e+00, 5.09413801e+00, 5.72236766e+00,
       6.42807312e+00, 7.22080902e+00, 8.11130831e+00, 9.11162756e+00,
       1.02353102e+01, 1.14975700e+0...
       6.89261210e+03, 7.74263683e+03, 8.69749003e+03, 9.77009957e+03,
       1.09749877e+04, 1.23284674e+04, 1.38488637e+04, 1.55567614e+04,
       1.74752840e+04, 1.96304065e+04, 2.20513074e+04, 2.47707636e+04,
       2.78255940e+04, 3.12571585e+04, 3.51119173e+04, 3.94420606e+04,
       4.43062146e+04, 4.97702356e+04, 5.59081018e+04, 6.28029144e+04,
       7.05480231e+04, 7.92482898e+04, 8.90215085e+04, 1.00000000e+05]))

In [141]:
# Retrieving the best-performing alpha

lassocv.alpha_

83.02175681319744

In [142]:
# Number of features zeroed out

int((lassocv.coef_ == 0).sum())

173

In [143]:
# Number of features kept

int((lassocv.coef_ != 0).sum())

128

In [144]:
# Train R2 score

lassocv.score(Xt_train, y_train)

0.9267987454892211

In [145]:
# Validation R2 score

lassocv.score(Xt_val, y_val)

0.9258018008010389

In [146]:
# Train RMSE

mean_squared_error(y_train, lassocv.predict(Xt_train), squared=False)

21582.110603190427

In [147]:
# Validation RMSE

mean_squared_error(y_val, lassocv.predict(Xt_val), squared=False)

20996.422318041197

The LassoCV model applies lasso (L1) regularization to a linear regression model. This means that the coefficients of unimportant features are set exactly to zero, which effectively removes those features and lets the model focus on the important ones. This reduces model complexity and helps prevent overfitting. Again, the alpha parameter dictates the strength of the regularization, making it very important to get the alpha just right. Fortunately, given a range, the LassoCV model itself can cross-validate the candidate alphas and fit the data using the best-performing one.

The best-performing alpha here is 83.02. Of the 301 features, 173 were zeroed out, and 128 were kept. This allowed for a better performing model, compared to all previous models.
- The train R2 score was 0.9268, and the validation R2 score was 0.9258.
- The train RMSE (21,582) was less than the train standard deviation (79,793), and the validation RMSE (20,996) was less than the validation standard deviation (77,175).

Though the validation R2 score here was not higher than the train R2 score, it wasn't off by very much, and both were in the 0.92s, which is higher than any of the previous validation R2 scores. Compared to the best RidgeCV model, the train RMSE dropped by about 2,000 and the validation RMSE by about 1,500, the lowest validation RMSE so far. This makes the LassoCV model with an alpha of 83.02 the best performer yet.
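The way the lasso penalty zeroes out weak features can be illustrated with a one-feature toy problem. This is a sketch under simplifying assumptions (a single feature, no intercept, made-up data); `soft_threshold` and `lasso_coef_1d` are hypothetical helpers, and the objective follows the scaling documented for scikit-learn's `Lasso`.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold, clipping at exactly zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coef_1d(x, y, alpha):
    """Closed-form lasso coefficient for one feature and no intercept.

    Minimizes (1 / (2 * n)) * sum((y - w*x)**2) + alpha * abs(w).
    """
    n = len(x)
    sxy = sum(xi * yi for xi, yi in zip(x, y)) / n
    sxx = sum(xi * xi for xi in x) / n
    return soft_threshold(sxy, alpha) / sxx


# Made-up data: one strongly related target and one weakly related target
x = [-1.5, -0.5, 0.5, 1.5]
y_strong = [4 * xi for xi in x]
y_weak = [0.5 * xi for xi in x]

print(lasso_coef_1d(x, y_strong, alpha=1.0))  # shrunk from 4 but still nonzero
print(lasso_coef_1d(x, y_weak, alpha=1.0))    # exactly 0.0
```

A feature whose relationship with the target is weaker than the threshold set by alpha is dropped entirely, which mirrors how 173 of the 301 coefficients ended up at zero.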

### 4.6. LassoCV #2

In [148]:
# Instantiating and fitting LassoCV to the training data

lassocv2 = LassoCV(alphas = np.logspace(0, 2, 100))

lassocv2.fit(Xt_train, y_train)

LassoCV(alphas=array([  1.        ,   1.04761575,   1.09749877,   1.149757  ,
         1.20450354,   1.26185688,   1.32194115,   1.38488637,
         1.45082878,   1.51991108,   1.59228279,   1.66810054,
         1.7475284 ,   1.83073828,   1.91791026,   2.009233  ,
         2.10490414,   2.20513074,   2.3101297 ,   2.42012826,
         2.53536449,   2.65608778,   2.7825594 ,   2.91505306,
         3.05385551,   3.19926714,   3.35160265,   3.51119173,
         3.67837977,   3.85352859,   4.03701726,   4....
        23.64489413,  24.77076356,  25.95024211,  27.18588243,
        28.48035868,  29.8364724 ,  31.2571585 ,  32.74549163,
        34.30469286,  35.93813664,  37.64935807,  39.44206059,
        41.320124  ,  43.28761281,  45.34878508,  47.50810162,
        49.77023564,  52.14008288,  54.62277218,  57.22367659,
        59.94842503,  62.80291442,  65.79332247,  68.92612104,
        72.20809018,  75.64633276,  79.24828984,  83.02175681,
        86.97490026,  91.11627561,  95.4548456

In [149]:
# Retrieving the best-performing alpha

lassocv2.alpha_

79.24828983539177

In [150]:
# Number of features zeroed out

int((lassocv2.coef_ == 0).sum())

173

In [151]:
# Number of features kept

int((lassocv2.coef_ != 0).sum())

128

In [152]:
# Train R2 score

lassocv2.score(Xt_train, y_train)

0.9278583545708836

In [153]:
# Validation R2 score

lassocv2.score(Xt_val, y_val)

0.9261632153716907

In [154]:
# Train RMSE

mean_squared_error(y_train, lassocv2.predict(Xt_train), squared=False)

21425.337591754054

In [155]:
# Validation RMSE

mean_squared_error(y_val, lassocv2.predict(Xt_val), squared=False)

20945.22379651986

Since the LassoCV model did so well, I wanted to see if I could find a better performing alpha by focusing the range a bit more.

The best-performing alpha here is 79.25. As in the previous LassoCV model, of the 301 features, 173 were zeroed out, and 128 were kept.

- The train R2 score increased slightly (0.9268 -> 0.9279), as did the validation R2 score (0.9258 -> 0.9262).
- The train RMSE decreased (21,582 -> 21,425), as did the validation RMSE (20,996 -> 20,945).

Both the train and the validation metrics improved, and the LassoCV model with an alpha of 79.25 is now the best-performing model.

### 4.7. ElasticNetCV #1

In [177]:
enetcv = ElasticNetCV(alphas = np.logspace(0, 2, 100), l1_ratio=[.1, .5, .7, .9, .95, .99, 1])

In [178]:
enetcv.fit(Xt_train, y_train)

ElasticNetCV(alphas=array([  1.        ,   1.04761575,   1.09749877,   1.149757  ,
         1.20450354,   1.26185688,   1.32194115,   1.38488637,
         1.45082878,   1.51991108,   1.59228279,   1.66810054,
         1.7475284 ,   1.83073828,   1.91791026,   2.009233  ,
         2.10490414,   2.20513074,   2.3101297 ,   2.42012826,
         2.53536449,   2.65608778,   2.7825594 ,   2.91505306,
         3.05385551,   3.19926714,   3.35160265,   3.51119173,
         3.67837977,   3.85352859,   4.037017...
        28.48035868,  29.8364724 ,  31.2571585 ,  32.74549163,
        34.30469286,  35.93813664,  37.64935807,  39.44206059,
        41.320124  ,  43.28761281,  45.34878508,  47.50810162,
        49.77023564,  52.14008288,  54.62277218,  57.22367659,
        59.94842503,  62.80291442,  65.79332247,  68.92612104,
        72.20809018,  75.64633276,  79.24828984,  83.02175681,
        86.97490026,  91.11627561,  95.45484567, 100.        ]),
             l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95

In [179]:
enetcv.alpha_

79.24828983539177

In [184]:
# Number of features zeroed out

int((enetcv.coef_ == 0).sum())

173

In [185]:
# Number of features kept

int((enetcv.coef_ != 0).sum())

128

In [180]:
enetcv.score(Xt_train, y_train)

0.9278583545708836

In [181]:
enetcv.score(Xt_val, y_val)

0.9261632153716907

In [182]:
mean_squared_error(y_train, enetcv.predict(Xt_train))**0.5

21425.337591754054

In [183]:
mean_squared_error(y_val, enetcv.predict(Xt_val))**0.5

20945.22379651986

The ElasticNetCV model can be used to apply a combination of both ridge (L2) and lasso (L1) regularization to a linear regression model. Given ranges, the model can find the best-performing alpha as well as the best-performing ratio between the regularizations.

It turns out that this model appears to perform best when using full lasso regularization (an l1_ratio of 1). It found the same best-performing alpha (79.25) as the second LassoCV model, zeroed out the same number of features, and returned exactly the same metrics. The selected ratio was not printed above; inspecting `enetcv.l1_ratio_` would confirm it.
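Why an l1_ratio of 1 makes the elastic net identical to the lasso can be checked directly on the penalty term. The sketch below uses the penalty form from scikit-learn's `ElasticNet` documentation; the function names and coefficient values are made up for illustration.

```python
def elastic_net_penalty(coefs, alpha, l1_ratio):
    """Penalty term in the form used by scikit-learn's ElasticNet objective."""
    l1 = sum(abs(w) for w in coefs)
    l2_squared = sum(w * w for w in coefs)
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1 - l1_ratio) * l2_squared


def lasso_penalty(coefs, alpha):
    """Penalty term of the lasso objective: alpha * ||w||_1."""
    return alpha * sum(abs(w) for w in coefs)


coefs = [3.0, -1.5, 0.0, 0.25]  # made-up coefficients
alpha = 79.25

print(elastic_net_penalty(coefs, alpha, l1_ratio=1))    # identical to the lasso penalty
print(lasso_penalty(coefs, alpha))
print(elastic_net_penalty(coefs, alpha, l1_ratio=0.5))  # a blend of L1 and L2 terms
```

With l1_ratio set to 1 the L2 term vanishes, so the two objectives coincide and the fitted models should match, as they did here.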

---
## 5. Summary

Of all the models, the LassoCV model with an alpha of 79.25 is the best performer, with a validation R2 score of 0.9262 and a validation RMSE of 20,945. Though the ElasticNetCV model returned the same results, it appears to have done so only because it settled on full lasso regularization. As a result, the LassoCV model with an alpha of 79.25 will be used to draw insights.