# Lasso and Ridge Regression

Linear models can overfit. To counter this, we add an $L_1$ (Lasso) or $L_2$ (Ridge) penalty to the loss function of linear regression, which shrinks the coefficient estimates towards zero. This regularization discourages learning an overly complex or flexible model and so reduces the risk of overfitting.
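To make the shrinkage effect concrete, here is a minimal NumPy sketch on synthetic data (not the salary data used in this notebook). It assumes no intercept, uses the closed-form ridge solution, and shows the soft-thresholding rule that gives the lasso solution for an orthonormal design when the loss is scaled as $\frac{1}{2}\|y - Xw\|_2^2 + \alpha\|w\|_1$. The helper names `ridge_fit` and `soft_threshold` are illustrative and are not part of `LinearModel_Engine`.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge solution w = (X^T X + alpha * I)^{-1} X^T y (no intercept)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

def soft_threshold(z, alpha):
    """Lasso solution for an orthonormal design: shrink by alpha, clip at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - alpha, 0.0)

# Synthetic regression problem (illustrative values only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = np.array([3.0, -2.0, 0.0, 1.5, 0.0])
y = X @ true_w + rng.normal(scale=0.5, size=200)

# A larger alpha gives a smaller coefficient norm (L2 shrinkage)
coef_norms = {alpha: float(np.linalg.norm(ridge_fit(X, y, alpha)))
              for alpha in (0.0, 10.0, 1000.0)}
for alpha, norm in coef_norms.items():
    print(f"alpha={alpha:>7}: ||w||_2 = {norm:.3f}")

# The L1 penalty sets small coefficients exactly to zero
shrunk = soft_threshold(np.array([3.0, -0.2, 0.05]), 0.5)
print(shrunk)
```

The ridge penalty shrinks all coefficients smoothly, while the lasso penalty can remove features entirely, which is why lasso is often also used for feature selection.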

This notebook uses a linear regression engine, `LinearModel_Engine`, which I built to do hyperparameter tuning with Bayesian optimization.

In [1]:
import utils
import gc
import numpy as np
import pandas as pd
import pickle
from Linear_Models import LinearModel_Engine
from hyperopt import hp

## Read saved data sets

Read the data sets saved in pickle format by `1_Data_Exploration.ipynb`:

* training data set (`X_train`, `y_train`), 60% of full `train_features.csv`: used with validation data set for hyperparameter tuning 
* validation data set (`X_val`, `y_val`), 20% of full `train_features.csv`: used with training data set for hyperparameter tuning 
* testing data set (`X_test`, `y_test`), 20% of full `train_features.csv`: used for comparing to other models
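The split itself was done in `1_Data_Exploration.ipynb` and is not shown here. As a rough sketch of what a 60/20/20 split looks like, assuming a simple random shuffle (the original notebook may have used a different method or seed), one could write the following. The function name `split_60_20_20` and the demo frame are made up for illustration.

```python
import numpy as np
import pandas as pd

def split_60_20_20(df, seed=0):
    """Shuffle row positions, then cut at 60% and 80% of the rows."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(df))
    n_train = len(df) * 6 // 10   # end of the training slice
    n_val = len(df) * 8 // 10     # end of the validation slice
    return (df.iloc[order[:n_train]],
            df.iloc[order[n_train:n_val]],
            df.iloc[order[n_val:]])

# Tiny made-up frame, only to show the resulting sizes
demo = pd.DataFrame({"yearsExperience": range(10), "salary": range(100, 110)})
train, val, test = split_60_20_20(demo)
print(len(train), len(val), len(test))
```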

In [2]:
X_train = pd.read_pickle("X_train.pkl")
X_val = pd.read_pickle("X_val.pkl")
X_test = pd.read_pickle("X_test.pkl")
y_train = pd.read_pickle("y_train.pkl")
y_val = pd.read_pickle("y_val.pkl")
y_test = pd.read_pickle("y_test.pkl")

Drop `major_new` and then check the data sets before doing hyperparameter tuning.

In [3]:
vars_drop = ['major_new']
X_train_1, X_val_1, X_test_1 = utils.drop_vars(vars_drop, X_train, X_val, X_test) 
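The source of `utils.drop_vars` is not included in this notebook. A plausible minimal version, assuming it simply drops the listed columns from every frame it receives and returns the results in the same order, might look like this. The demo frames and their values are invented.

```python
import pandas as pd

def drop_vars(vars_drop, *frames):
    """Return a copy of each frame without the listed columns.

    Hypothetical stand-in for utils.drop_vars; columns that are
    missing from a frame are silently ignored.
    """
    return tuple(frame.drop(columns=vars_drop, errors="ignore") for frame in frames)

# Invented one-row frames, only to show the call pattern
a = pd.DataFrame({"major": ["MATH"], "major_new": ["A"], "yearsExperience": [5]})
b = pd.DataFrame({"major": ["NONE"], "major_new": ["B"], "yearsExperience": [24]})
a_1, b_1 = drop_vars(["major_new"], a, b)
print(list(a_1.columns))
```

Returning new frames instead of dropping in place keeps `X_train`, `X_val` and `X_test` available for other experiments.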

In [4]:
X_train_1.head()

| | jobType | degree | major | industry | yearsExperience | milesFromMetropolis |
|---|---|---|---|---|---|---|
| 68609 | VICE_PRESIDENT | MASTERS | MATH | EDUCATION | 5 | 82 |
| 924598 | CTO | DOCTORAL | MATH | AUTO | 22 | 66 |
| 918523 | CEO | HIGH_SCHOOL | NONE | EDUCATION | 24 | 67 |
| 213733 | CFO | HIGH_SCHOOL | NONE | FINANCE | 22 | 90 |
| 246703 | VICE_PRESIDENT | MASTERS | BUSINESS | SERVICE | 18 | 68 |


In [5]:
var_cate = utils.get_categorical_variables(X_train_1)
X_train_hot_encode, X_valid_hot_encode = utils.encoding('one hot', var_cate, X_train_1, X_val_1)
_, X_test_hot_encode = utils.encoding('one hot', var_cate, X_train_1, X_test_1)

In [6]:
X_test_hot_encode.head()

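| | yearsExperience | milesFromMetropolis | jobType_CEO | jobType_CFO | jobType_CTO | jobType_JANITOR | jobType_JUNIOR | jobType_MANAGER | jobType_SENIOR | jobType_VICE_PRESIDENT | ... | major_MATH | major_NONE | major_PHYSICS | industry_AUTO | industry_EDUCATION | industry_FINANCE | industry_HEALTH | industry_OIL | industry_SERVICE | industry_WEB |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 67354 | 22 | 30 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 346428 | 10 | 69 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 983385 | 14 | 75 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 773169 | 21 | 51 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 709215 | 0 | 65 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |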
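The encoding step above relies on `utils.encoding`, whose source is not shown. A common way to one-hot encode a second data set so that it has exactly the training columns is sketched below with `pandas.get_dummies` and `reindex`. This is an assumption about the approach, not the actual `utils` implementation, and `one_hot_like_train` is a hypothetical name.

```python
import pandas as pd

def one_hot_like_train(train, other, categorical):
    """One-hot encode both frames and align `other` to the training columns."""
    train_enc = pd.get_dummies(train, columns=categorical, dtype=int)
    other_enc = pd.get_dummies(other, columns=categorical, dtype=int)
    # Categories unseen in training are dropped; categories missing in `other` become 0
    other_enc = other_enc.reindex(columns=train_enc.columns, fill_value=0)
    return train_enc, other_enc

# Small made-up frames: JANITOR never appears in the training data
train_df = pd.DataFrame({"jobType": ["CEO", "CFO", "CTO"], "yearsExperience": [24, 22, 22]})
other_df = pd.DataFrame({"jobType": ["CEO", "JANITOR"], "yearsExperience": [10, 3]})
train_enc, other_enc = one_hot_like_train(train_df, other_df, ["jobType"])
print(list(other_enc.columns))
```

Aligning to the training columns matters because a fitted linear model expects the same features, in the same order, at prediction time.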


Create the two functions below for training and hyperparameter tuning:

* `run_LM`: manual tuning to find a reasonable hyperparameter space

In [7]:
def run_LM(encoding, LM, hyperparameters, test_data):
    """Train one model with fixed hyperparameters and report train/valid/test RMSE."""
    # NOTE: `encoding` is currently unused; the last engine argument is hard-coded to True.
    engine = LinearModel_Engine(LM, X_train_1, y_train, X_val_1, y_val, True)
    var_cate, features_DEV, features_OOT, labels_DEV, labels_OOT = engine.get_datasets()

    # Encode the categorical variables of the training (DEV) and validation (OOT) sets
    features_DEV, features_OOT, feature_names = engine.encoding(var_cate, features_DEV, features_OOT)
    model, hyperparameters, DEV_metric, OOT_metric, run_time = \
                engine.train(features_DEV, y_train, features_OOT, y_val, hyperparameters)

    # Score the held-out test set with the fitted model
    pred = model.predict(test_data)

    print("Train RMSE is {}".format(DEV_metric))
    print("Valid RMSE is {}".format(OOT_metric))

    model_name = "Test data is"
    utils.check_RMSE(model_name, y_test, pred)
    print("________________")
    print(hyperparameters)


In [8]:
run_LM(True, 'lasso', {'alpha': 0.0009}, X_test_hot_encode)

Train RMSE is 19.839228473877707
Valid RMSE is 19.880274376320973
RMSE of model Test data is is:  19.86789704095302
________________
{'alpha': 0.0009}


In [9]:
run_LM(True, 'ridge', {'alpha': 0.0009}, X_test_hot_encode)

Train RMSE is 19.56730125965712
Valid RMSE is 19.608589653677345
RMSE of model Test data is is:  19.598284879387354
________________
{'alpha': 0.0009}
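All three numbers reported above are root mean squared errors. `utils.check_RMSE` is not shown in this notebook, but the metric itself is standard. A minimal sketch, with a hypothetical helper name `rmse` and invented numbers, is:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Errors are -10, 5 and -5, so the mean squared error is 50
example = rmse([100, 120, 90], [110, 115, 95])
print(example)
```

Because RMSE is in the same unit as the target, the train, validation and test values can be compared directly to judge overfitting.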


* `tune_LM`: apply Bayesian optimization to find the best hyperparameters in the hyperparameter space

In [10]:
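def tune_LM(encoding, LM, space, test_data, records, model_dir):
    """Tune hyperparameters over `space`, then score the saved best model on the test data."""
    # NOTE: `encoding` is currently unused; the last engine argument is hard-coded to True.
    engine = LinearModel_Engine(LM, X_train_1, y_train, X_val_1, y_val, True)
    table, hyperparameters, best_results = engine.evaluation(space, records, model_dir)

    # Load the best model saved during tuning; the context manager closes the file
    with open(model_dir, 'rb') as f:
        best_model = pickle.load(f)
    pred = best_model.predict(test_data)

    print("Train RMSE is {}".format(table[1]))
    print("Valid RMSE is {}".format(table[2]))
    model_name = "Test data is"
    utils.check_RMSE(model_name, y_test, pred)
    print("________________")
    print(hyperparameters)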
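`tune_LM` reads the best model back from the file at `model_dir`. How `LinearModel_Engine.evaluation` writes that file is not shown, so the following is only a sketch of a pickle round trip, assuming the model was saved with `pickle.dump`. A plain dictionary stands in for the fitted model, and the file name is made up.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model; any picklable object works the same way
fitted = {"model_type": "ridge", "alpha": 0.0009, "coef": [1.5, -0.3]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "best_model.sav")   # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(fitted, f)                       # save
    with open(path, "rb") as f:
        restored = pickle.load(f)                    # load it back

print(restored)
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.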

In [11]:
space = {
    'alpha': hp.uniform('alpha', 0.0, 0.0001),
}
   
tune_LM(True, 'ridge', space, X_test_hot_encode, 'LM_records.csv', 'Best_ridge.sav')    

100%|██████████| 50/50 [01:44<00:00,  2.09s/it, best loss: 19.608447221101493]
Train RMSE is 19.567285096155036
Valid RMSE is 19.608447221101493
RMSE of model Test data is is:  19.59826517753992
________________
{'alpha': 1.538754779735333e-08}


In [12]:
space = {
    'alpha': hp.uniform('alpha', 0.0, 0.0001),
}
    
tune_LM(True, 'lasso', space, X_test_hot_encode, 'LM_records.csv', 'Best_lasso.sav') 

100%|██████████| 50/50 [31:51<00:00, 38.24s/it, best loss: 19.60844033234874]
Train RMSE is 19.56728980344768
Valid RMSE is 19.60844033234874
RMSE of model Test data is is:  19.598279940427293
________________
{'alpha': 3.459924132577914e-06}


## Summary

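Lasso and Ridge regression give very similar results: the test RMSE of both is close to 19.598. The tuned `alpha` values (about 1.5e-08 for Ridge and 3.5e-06 for Lasso) are close to zero, so the penalty has almost no effect, and this RMSE is very close to that of the baseline linear regression model. Thus, either Lasso or Ridge regression can be selected.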