# Ridge and Lasso Regression - Lab

## Introduction

In this lab, you'll practice your knowledge of Ridge and Lasso regression!

## Objectives

In this lab you will: 

- Use Lasso and Ridge regression with scikit-learn 
- Compare and contrast Lasso, Ridge and non-regularized regression 

## Housing Prices Data

Let's look at yet another house pricing dataset: 

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('Housing_Prices/train.csv')

Look at `.info()` of the data: 

In [2]:
# Your code here
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

- First, split the data into `X` (predictor) and `y` (target) variables 
- Split the data into 75-25 training-test sets. Set the `random_state` to 10 
- Remove all columns of `object` type from `X_train` and `X_test` and assign them to `X_train_cont` and `X_test_cont`, respectively 

In [3]:
# Create X and y
y = df[['SalePrice']]
X = df.drop(['SalePrice'], axis=1)

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=10)

# Remove "object"-type features from X
cont_features = X.select_dtypes(exclude=['object'])

# Remove "object"-type features from X_train and X_test
X_train_cont = X_train.select_dtypes(exclude=['object'])
X_test_cont = X_test.select_dtypes(exclude=['object'])

In [4]:
cont_features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 37 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   LotFrontage    1201 non-null   float64
 3   LotArea        1460 non-null   int64  
 4   OverallQual    1460 non-null   int64  
 5   OverallCond    1460 non-null   int64  
 6   YearBuilt      1460 non-null   int64  
 7   YearRemodAdd   1460 non-null   int64  
 8   MasVnrArea     1452 non-null   float64
 9   BsmtFinSF1     1460 non-null   int64  
 10  BsmtFinSF2     1460 non-null   int64  
 11  BsmtUnfSF      1460 non-null   int64  
 12  TotalBsmtSF    1460 non-null   int64  
 13  1stFlrSF       1460 non-null   int64  
 14  2ndFlrSF       1460 non-null   int64  
 15  LowQualFinSF   1460 non-null   int64  
 16  GrLivArea      1460 non-null   int64  
 17  BsmtFullBath   1460 non-null   int64  
 18  BsmtHalf

## Let's use this data to build a first naive linear regression model

- Fill the missing values in data using median of the columns (use [`SimpleImputer`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html)) 
- Fit a linear regression model to this data 
- Compute the R-squared and the MSE for both the training and test sets 


In [5]:
from sklearn.metrics import mean_squared_error, mean_squared_log_error
from sklearn.linear_model import LinearRegression
from sklearn.impute import SimpleImputer

# Impute missing values with median using SimpleImputer
impute = SimpleImputer(strategy='median')

X_train_imputed = impute.fit_transform(X_train_cont)
X_test_imputed = impute.transform(X_test_cont)

# Fit the model and print R2 and MSE for training and test sets
linreg = LinearRegression()
linreg.fit(X_train_imputed, y_train)

# Print R2 and MSE for training and test sets
print(f'Training R-squared:{linreg.score(X_train_imputed, y_train)}')
print(f'Test R-squared:{linreg.score(X_test_imputed, y_test)}')
print(f'Training MSE:{mean_squared_error(y_train, linreg.predict(X_train_imputed))}')
print(f'Test MSE:{mean_squared_error(y_test, linreg.predict(X_test_imputed))}')

Training R-squared:0.8069714678400263
Test R-squared:0.8203264293699014
Training MSE:1212415985.7084067
Test MSE:1146350639.880517


## Normalize your data

- Normalize your data using a `StandardScaler`  
- Fit a linear regression model to this data 
- Compute the R-squared and the MSE for both the training and test sets 


In [9]:
from sklearn.preprocessing import StandardScaler

# Scale the train and test data
ss = StandardScaler()
X_train_imputed_scaled = ss.fit_transform(X_train_imputed)
X_test_imputed_scaled = ss.transform(X_test_imputed)

# Fit the model
linreg_norm = LinearRegression()
linreg_norm = linreg_norm.fit(X_train_imputed_scaled, y_train)


# Print R2 and MSE for training and test sets
print(f'Training R-squared:{linreg_norm.score(X_train_imputed_scaled, y_train)}')
print(f'Test R-squared:{linreg_norm.score(X_test_imputed_scaled, y_test)}')
print(f'Training MSE:{mean_squared_error(y_train, linreg_norm.predict(X_train_imputed_scaled))}')
print(f'Test MSE:{mean_squared_error(y_test, linreg_norm.predict(X_test_imputed_scaled))}')

Training R-squared:0.8069836441948091
Test R-squared:0.8205528003598558
Training MSE:1212339505.7858741
Test MSE:1144906351.061218


## Include categorical variables

The models above did not include any categorical variables. Let's include them! 


- Include all columns of `object` type from `X_train` and `X_test` and assign them to `X_train_cat` and `X_test_cat`, respectively 
- Fill missing values in all these columns with the string `'missing'` 

In [11]:
# Keep only the categorical ("object"-type) variables
features_cat = X.select_dtypes(include=['object'])
X_train_cat = X_train.select_dtypes(include=['object'])
X_test_cat = X_test.select_dtypes(include=['object'])

# Fill missing values with the string 'missing'
features_cat = features_cat.fillna(value='missing')
X_train_cat = X_train_cat.fillna(value='missing')
X_test_cat = X_test_cat.fillna(value='missing')

- One-hot encode all these categorical columns using `OneHotEncoder` 
- Transform the training and test DataFrames (`X_train_cat`) and (`X_test_cat`) 
- Run the given code to convert these transformed features into DataFrames 

In [15]:
from sklearn.preprocessing import OneHotEncoder

# OneHotEncode categorical variables
ohe = OneHotEncoder(handle_unknown='ignore')

# Transform training and test sets
X_train_ohe = ohe.fit_transform(X_train_cat)
X_test_ohe = ohe.transform(X_test_cat)

# Convert these columns into a DataFrame
# (get_feature_names_out replaces get_feature_names, which was removed in scikit-learn 1.2)
columns = ohe.get_feature_names_out(input_features=X_train_cat.columns)
cat_train_df = pd.DataFrame(X_train_ohe.toarray(), columns=columns)
cat_test_df = pd.DataFrame(X_test_ohe.toarray(), columns=columns)

- Combine `X_train_imputed_scaled` and `cat_train_df` into a single DataFrame  
- Similarly, combine `X_test_imputed_scaled` and `cat_test_df` into a single DataFrame 

In [17]:
# Your code here
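# Keep the original column names so that every column label is a string;
# recent scikit-learn versions reject DataFrames with mixed int/str column names
X_train_all = pd.concat([pd.DataFrame(X_train_imputed_scaled, columns=X_train_cont.columns), cat_train_df], axis=1)
X_test_all = pd.concat([pd.DataFrame(X_test_imputed_scaled, columns=X_test_cont.columns), cat_test_df], axis=1)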

Now build a linear regression model using all the features (`X_train_all`). Also, print the R-squared and the MSE for both the training and test sets. 

In [18]:
# Your code here
linreg = LinearRegression()
linreg = linreg.fit(X_train_all, y_train)

# print R2 and MSE
print(f'Training R-squared:{linreg.score(X_train_all, y_train)}')
print(f'Test R-squared:{linreg.score(X_test_all, y_test)}')
print(f'Training MSE:{mean_squared_error(y_train, linreg.predict(X_train_all))}')
print(f'Test MSE:{mean_squared_error(y_test, linreg.predict(X_test_all))}')

Training R-squared:0.9359999571870928
Test R-squared:-1.8122214882299044e+18
Training MSE:401985520.60730594
Test MSE:1.1562308554074924e+28


Notice the severe overfitting above; our training R-squared is very high, but the test R-squared is negative! Similarly, the scale of the test MSE is orders of magnitude higher than that of the training MSE.

## Ridge and Lasso regression

Use all the data (normalized features and dummy categorical variables, `X_train_all`) to build two models - one each for Lasso and Ridge regression. Each time, look at R-squared and MSE. 

## Lasso

#### With default parameter (alpha = 1)

In [20]:
# Your code here
from sklearn.linear_model import Lasso, Ridge

lasso = Lasso(alpha=1)
lasso.fit(X_train_all, y_train)

print(f'Training R-squared:{lasso.score(X_train_all, y_train)}')
print(f'Test R-squared:{lasso.score(X_test_all, y_test)}')
print(f'Training MSE:{mean_squared_error(y_train, lasso.predict(X_train_all))}')
print(f'Test MSE:{mean_squared_error(y_test, lasso.predict(X_test_all))}')

Training R-squared:0.9359681086176651
Test R-squared:0.888684112594205
Training MSE:402185562.0947692
Test MSE:710215967.2621554


#### With a higher regularization parameter (alpha = 10)

In [21]:
# Your code here
lasso_10 = Lasso(alpha=10)
lasso_10.fit(X_train_all, y_train)

print(f'Training R-squared:{lasso_10.score(X_train_all, y_train)}')
print(f'Test R-squared:{lasso_10.score(X_test_all, y_test)}')
print(f'Training MSE:{mean_squared_error(y_train, lasso_10.predict(X_train_all))}')
print(f'Test MSE:{mean_squared_error(y_test, lasso_10.predict(X_test_all))}')

Training R-squared:0.9343826511712738
Test R-squared:0.8966777526569275
Training MSE:412143851.3235965
Test MSE:659215063.9643531


## Ridge

#### With default parameter (alpha = 1)

In [22]:
# Your code here
ridge = Ridge(alpha=1)
ridge.fit(X_train_all, y_train)

print(f'Training R-squared:{ridge.score(X_train_all, y_train)}')
print(f'Test R-squared:{ridge.score(X_test_all, y_test)}')
print(f'Training MSE:{mean_squared_error(y_train, ridge.predict(X_train_all))}')
print(f'Test MSE:{mean_squared_error(y_test, ridge.predict(X_test_all))}')

Training R-squared:0.9231940244796031
Test R-squared:0.8842330485444212
Training MSE:482419834.3987995
Test MSE:738614579.833414


#### With a higher regularization parameter (alpha = 10)

In [23]:
# Your code here
ridge_10 = Ridge(alpha=10)
ridge_10.fit(X_train_all, y_train)

print(f'Training R-squared:{ridge_10.score(X_train_all, y_train)}')
print(f'Test R-squared:{ridge_10.score(X_test_all, y_test)}')
print(f'Training MSE:{mean_squared_error(y_train, ridge_10.predict(X_train_all))}')
print(f'Test MSE:{mean_squared_error(y_test, ridge_10.predict(X_test_all))}')

Training R-squared:0.8990002650425939
Test R-squared:0.8834542222982167
Training MSE:634381310.5991352
Test MSE:743583635.4522301


## Compare the metrics    

Write your conclusions here: 
_________________________________

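Regularization has resolved the overfitting problem: every Lasso and Ridge model reaches a test R-squared of roughly 0.88 to 0.90, compared with a hugely negative test R-squared for the unregularized model on the same features. Lasso with `alpha=10` gives the best test results (the highest test R-squared, about 0.897, and the lowest test MSE) while its training R-squared stays at about 0.934. 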
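
A side-by-side table can make this kind of comparison easier to read than separate print statements. The following is a minimal sketch, not part of the original lab: the helper `summarize` is a hypothetical name, and the data comes from `make_regression` as a stand-in so the block runs on its own. With the lab's data, `X_train_all`, `X_test_all`, `y_train` and `y_test` would be passed instead.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def summarize(models, X_train, X_test, y_train, y_test):
    """Fit each model and return one row of train/test R-squared and MSE per model."""
    rows = []
    for name, model in models.items():
        model.fit(X_train, y_train)
        rows.append({
            'model': name,
            'train_r2': model.score(X_train, y_train),
            'test_r2': model.score(X_test, y_test),
            'train_mse': mean_squared_error(y_train, model.predict(X_train)),
            'test_mse': mean_squared_error(y_test, model.predict(X_test)),
        })
    return pd.DataFrame(rows).set_index('model')


# Synthetic stand-in data (an assumption, not the housing data)
X_demo, y_demo = make_regression(n_samples=300, n_features=40, n_informative=10,
                                 noise=20.0, random_state=10)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=10)

models = {
    'linear': LinearRegression(),
    'lasso_1': Lasso(alpha=1),
    'lasso_10': Lasso(alpha=10),
    'ridge_1': Ridge(alpha=1),
    'ridge_10': Ridge(alpha=10),
}
summary = summarize(models, X_tr, X_te, y_tr, y_te)

# Sort so the model with the lowest test MSE comes first
print(summary.sort_values('test_mse'))
```

Sorting by `test_mse` puts the model that generalizes best on this split at the top; the ranking on the synthetic data says nothing about the ranking on the housing data.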

## Compare number of parameter estimates that are (very close to) 0 for Ridge and Lasso

Use 10**(-10) as an estimate that is very close to 0. 

In [29]:
# Number of Ridge params almost zero
# ridge.coef_ is 2-D here because y_train is a DataFrame, so np.sum is used to count over all entries
print(np.sum(abs(ridge.coef_) < 10**(-10)))

0


In [31]:
# Number of Lasso params almost zero
print(sum(abs(lasso.coef_) < 10**(-10)))

35


In [30]:
print(len(lasso.coef_))
print(sum(abs(lasso.coef_) < 10**(-10))/ len(lasso.coef_))

296
0.11824324324324324


Lasso was effective at performing variable selection: it set 35 of the 296 coefficients to (essentially) zero, removing about 12% of the variables from your model!
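
The contrast between the two penalties can also be seen on a small synthetic problem. The sketch below is an illustration under stated assumptions, not a result from the housing data: it uses `make_regression` with 50 features of which only 5 are informative, and the names `lasso_demo` and `ridge_demo` are made up for this example. `np.ravel` is used so the count is a single number regardless of whether `coef_` is 1-D or 2-D.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (an assumption): 50 features, only 5 of which drive the target
X_demo, y_demo = make_regression(n_samples=200, n_features=50, n_informative=5,
                                 noise=10.0, random_state=0)

lasso_demo = Lasso(alpha=1).fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=1).fit(X_demo, y_demo)

tol = 1e-10
# Flatten coef_ before counting so the result is one integer per model
lasso_zeros = int(np.sum(np.abs(np.ravel(lasso_demo.coef_)) < tol))
ridge_zeros = int(np.sum(np.abs(np.ravel(ridge_demo.coef_)) < tol))

print(f'Lasso coefficients near zero: {lasso_zeros} of {lasso_demo.coef_.size}')
print(f'Ridge coefficients near zero: {ridge_zeros} of {ridge_demo.coef_.size}')
```

The L1 penalty can push coefficients exactly to zero, whereas the L2 penalty only shrinks them, which is why Ridge typically reports few or no near-zero coefficients.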

## Put it all together

To bring all of our work together, let's take a moment to put all of our preprocessing steps for categorical and continuous variables into one function. This function should take in our features as a DataFrame `X` and our target as a Series `y`, and return training and test DataFrames with all of our preprocessed features, along with the training and test targets. 

In [None]:
def preprocess(X, y):
    '''Takes in features and target and implements all preprocessing steps for categorical and continuous features returning 
    train and test DataFrames with targets'''
    
    # Train-test split (75-25), set seed to 10

    
    # Remove "object"-type features and SalePrice from X


    # Impute missing values with median using SimpleImputer


    # Scale the train and test data


    # Create X_cat which contains only the categorical variables


    # Fill NaNs with a value indicating that the entry is missing


    # OneHotEncode Categorical variables

    
    # Combine categorical and continuous features into the final dataframe
    
    return X_train_all, X_test_all, y_train, y_test
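
The skeleton above is left as an exercise. One possible way to fill it in is sketched below; it follows the same steps used earlier in the lab but is not the only valid solution. Two choices here are assumptions of this sketch: the one-hot column names are built by hand from `ohe.categories_` (which avoids depending on a particular scikit-learn version), and the small `toy_X` / `toy_y` data at the end is made up purely to show the function running.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def preprocess(X, y):
    """Split, impute, scale and one-hot encode; return train/test features and targets."""
    # Train-test split (75-25), seed 10
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=10)

    # Continuous features: everything that is not of "object" type
    X_train_cont = X_train.select_dtypes(exclude=['object'])
    X_test_cont = X_test.select_dtypes(exclude=['object'])

    # Impute with the training median, then scale with training statistics
    imputer = SimpleImputer(strategy='median')
    scaler = StandardScaler()
    train_cont = scaler.fit_transform(imputer.fit_transform(X_train_cont))
    test_cont = scaler.transform(imputer.transform(X_test_cont))

    # Categorical features, with NaN replaced by the string 'missing'
    X_train_cat = X_train.select_dtypes(include=['object']).fillna('missing')
    X_test_cat = X_test.select_dtypes(include=['object']).fillna('missing')

    # One-hot encode; categories unseen in training are encoded as all zeros
    ohe = OneHotEncoder(handle_unknown='ignore')
    train_cat = ohe.fit_transform(X_train_cat).toarray()
    test_cat = ohe.transform(X_test_cat).toarray()
    cat_names = [f'{col}_{cat}'
                 for col, cats in zip(X_train_cat.columns, ohe.categories_)
                 for cat in cats]

    # Combine continuous and categorical blocks under string column names
    columns = list(X_train_cont.columns) + cat_names
    X_train_all = pd.DataFrame(np.hstack([train_cont, train_cat]), columns=columns)
    X_test_all = pd.DataFrame(np.hstack([test_cont, test_cat]), columns=columns)
    return X_train_all, X_test_all, y_train, y_test


# Made-up toy data, only to show the function running end to end
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    'area': rng.normal(1500, 300, size=40),
    'rooms': rng.integers(2, 7, size=40).astype(float),
    'zone': pd.Series(rng.choice(['A', 'B', 'C'], size=40), dtype=object),
})
toy_X.loc[toy_X.index[::7], 'rooms'] = np.nan  # some missing numeric values
toy_X.loc[toy_X.index[::9], 'zone'] = np.nan   # some missing categorical values
toy_y = pd.Series(rng.normal(200000, 50000, size=40), name='SalePrice')

toy_train, toy_test, toy_y_train, toy_y_test = preprocess(toy_X, toy_y)
print(toy_train.shape, toy_test.shape)
```

The imputer, scaler and encoder are all fit on the training split only and then applied to the test split, so no information from the test set leaks into the preprocessing.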

### Graph the training and test error to find optimal alpha values

Earlier we tested two values of alpha to see how it affected our MSE and the values of our coefficients. We could continue to guess values of alpha for our Ridge or Lasso regression one at a time to see which one minimizes our loss, or we can test a range of values and pick the alpha that minimizes our test MSE. Here is an example of how we would do this:  

In [None]:
X_train_all, X_test_all, y_train, y_test = preprocess(X, y)

train_mse = []
test_mse = []
alphas = []

for alpha in np.linspace(0, 200, num=50):
    lasso = Lasso(alpha=alpha)
    lasso.fit(X_train_all, y_train)
    
    train_preds = lasso.predict(X_train_all)
    train_mse.append(mean_squared_error(y_train, train_preds))
    
    test_preds = lasso.predict(X_test_all)
    test_mse.append(mean_squared_error(y_test, test_preds))
    
    alphas.append(alpha)

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

fig, ax = plt.subplots()
ax.plot(alphas, train_mse, label='Train')
ax.plot(alphas, test_mse, label='Test')
ax.set_xlabel('Alpha')
ax.set_ylabel('MSE')

# np.argmin() returns the index of the minimum value in a list
optimal_alpha = alphas[np.argmin(test_mse)]

# Add a vertical line where the test MSE is minimized
ax.axvline(optimal_alpha, color='black', linestyle='--')
ax.legend();

# Format rather than truncate: the alphas from np.linspace are generally not whole numbers
print(f'Optimal Alpha Value: {optimal_alpha:.2f}')

Take a look at this graph of our training and test MSE against alpha. Try to explain to yourself why the shapes of the training and test curves are this way. Make sure to think about what alpha represents and how it relates to overfitting vs underfitting.
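
As a hint for that reasoning, the effect of alpha can be reproduced on a small synthetic problem. The sketch below is an illustration under assumptions (data from `make_regression`, a hand-picked list `demo_alphas`, and a raised `max_iter` to help the solver converge at small alpha), not a statement about the housing data. A larger alpha restricts the coefficients more strongly, so the training MSE should not improve as alpha grows, while the test MSE often improves at first and then worsens once the model starts to underfit.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption): few samples relative to features, so overfitting is possible
X_demo, y_demo = make_regression(n_samples=120, n_features=80, n_informative=10,
                                 noise=25.0, random_state=10)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=10)

demo_alphas = [0.01, 0.1, 1, 10, 100]
demo_train_mse, demo_test_mse = [], []
for alpha in demo_alphas:
    model = Lasso(alpha=alpha, max_iter=100_000).fit(X_tr, y_tr)
    demo_train_mse.append(mean_squared_error(y_tr, model.predict(X_tr)))
    demo_test_mse.append(mean_squared_error(y_te, model.predict(X_te)))

# Print one row per alpha so the two error columns can be compared
for alpha, tr, te in zip(demo_alphas, demo_train_mse, demo_test_mse):
    print(f'alpha={alpha:>6}: train MSE={tr:12.1f}  test MSE={te:12.1f}')
```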

## Summary

Well done! You now know how to build Lasso and Ridge regression models, use them for feature selection and find an optimal value for $\text{alpha}$. 