# IS4303 IT-MEDIATED FINANCIAL SOLUTIONS AND PLATFORMS

> ## Week 5 Tutorial - LASSO and Ridge Regression

<div class="alert alert-danger">
<b>Python Version:</b> 2.7+<br>Create a virtual environment in Anaconda if needed.
</div>

## Sections:
* [0. Goal](#0)
* [1. Dataset](#1)
* [2. Data Preprocessing](#2)
* [3. Regularization: Penalized Regression](#3)

## Summary of Regularization

<div class="alert alert-success">
<b>Resources:</b> 
<a href="https://youtu.be/u73PU6Qwl1I" target="_blank" style="text-decoration: none"><span class="label label-primary">Overfit</span></a>
<a href="https://youtu.be/KvtGD37Rm5I" target="_blank" style="text-decoration: none"><span class="label label-info">Cost Function</span></a>
<a href="https://youtu.be/qbvRdrd0yJ8" target="_blank" style="text-decoration: none"><span class="label label-warning">Regularization</span></a>
<a href="https://en.wikipedia.org/wiki/Cross-validation_(statistics)" target="_blank" style="text-decoration: none"><span class="label label-danger">Cross Validation</span></a>    
</div>

In [1]:
# You can also watch this YouTube video inside the notebook
from IPython.display import IFrame
IFrame(src="https://www.youtube.com/embed/qbvRdrd0yJ8", width="853", height="480")

***

<a id="0"></a>

## 0 Goal

#### The goal of this tutorial is to understand: 
* Regularization: Penalized Regression
* Lasso and Ridge
* Confusion Matrix: True Positive (TP), False Positive (FP), True Negative (TN), False Negative (FN)
* Cross-Validation

<a id="1"></a>

## 1 Dataset

<br><div class="btn-group"> 
    <a href="https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients" target="_blank" class="btn btn-primary" role="button" style="text-decoration: none">Introduction</a>
    <a href="#overview" class="btn btn-success" role="button" style="text-decoration: none">Overview</a>
    <a href="#task" class="btn btn-warning" role="button" style="text-decoration: none">Tasks</a>
</div>

<a id="overview"></a>
#### Overview
The file <b><code>"default of credit card clients.xls"</code></b> contains information about customers' default payments. <br>

The dataset can be downloaded [here](https://archive.ics.uci.edu/ml/machine-learning-databases/00350/). Information on the columns and features can be found [here](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients). <br>

The **output variable** is binary: default payment (Yes = 1, No = 0). The dataset provides the following 23 **explanatory variables**: 
* X1: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit. 
* X2: Gender (1 = male; 2 = female). 
* X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others). 
* X4: Marital status (1 = married; 2 = single; 3 = others). 
* X5: Age (year). 
* X6 - X11: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows: X6 = the repayment status in September, 2005; X7 = the repayment status in August, 2005; . . .;X11 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above. 
* X12-X17: Amount of bill statement (NT dollar). X12 = amount of bill statement in September, 2005; X13 = amount of bill statement in August, 2005; . . .; X17 = amount of bill statement in April, 2005. 
* X18-X23: Amount of previous payment (NT dollar). X18 = amount paid in September, 2005; X19 = amount paid in August, 2005; . . .;X23 = amount paid in April, 2005. 

<a id="task"></a>
#### Tasks: Lasso Regression, Ridge Regression and Cross Validation
* Understand the differences between Lasso and Ridge
* Estimate and predict credit default behavior using penalized regression methods
* Compare model performance with cross-validation

<a id="2"></a> 

## 2 Data Preprocessing

In [2]:
#!/usr/bin/env python
#-*- coding:utf-8 -*-
from __future__ import division, print_function
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from math import sqrt, log
from functools import reduce
from collections import defaultdict
from IPython.display import HTML
%matplotlib inline

<div class="alert alert-warning">
<b>Step 2.1: Read the data into a pandas DataFrame named "default".</b><br><br>
    
<div class="btn-group">    
    <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html" target="_blank" class="btn btn-primary" role="button" style="text-decoration: none">Read Excel</a>
    <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html" target="_blank" class="btn btn-info" role="button" style="text-decoration: none">Read CSV</a>
    <a href="https://pandas.pydata.org/pandas-docs/stable/reference/io.html" target="_blank" class="btn btn-success" role="button" style="text-decoration: none">Read Others</a>
</div><br> 
</div>

In [3]:
%pwd
default = pd.read_excel("./default of credit card clients.xls", header=1)
default.drop(['ID', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6'], inplace=True, axis=1)

In [4]:
default.head(n=10)

Unnamed: 0,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
0,20000,2,2,1,24,3913,3102,689,0,0,0,0,689,0,0,0,0,1
1,120000,2,2,2,26,2682,1725,2682,3272,3455,3261,0,1000,1000,1000,0,2000,1
2,90000,2,2,2,34,29239,14027,13559,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
3,50000,2,2,1,37,46990,48233,49291,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
4,50000,1,2,1,57,8617,5670,35835,20940,19146,19131,2000,36681,10000,9000,689,679,0
5,50000,1,1,2,37,64400,57069,57608,19394,19619,20024,2500,1815,657,1000,1000,800,0
6,500000,1,1,2,29,367965,412023,445007,542653,483003,473944,55000,40000,38000,20239,13750,13770,0
7,100000,2,2,2,23,11876,380,601,221,-159,567,380,601,0,581,1687,1542,0
8,140000,2,3,1,28,11285,14096,12108,12211,11793,3719,3329,0,432,1000,1000,1000,0
9,20000,1,3,2,35,0,0,0,0,13007,13912,0,0,0,13007,1122,0,0


<div class="alert alert-warning">
<b>Step 2.2: Rename dependent/response variable.</b>
<br><br>    
<div class="btn-group">    
    <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html" target="_blank" class="btn btn-info" role="button" style="text-decoration: none">Rename</a>
</div>
</div>

In [5]:
default = default.rename(columns={'default payment next month': 'default'})
default.columns

Index(['LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'BILL_AMT1',
       'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6',
       'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6',
       'default'],
      dtype='object')

<div class="alert alert-warning">
<b>Step 2.3: Detect missing values. If only a few values are missing, you can simply delete the affected rows. Otherwise, you can (1) delete the rows with missing values, (2) impute the missing values, or (3) drop the affected features.</b>
<br><br>
<div class="btn-group">    
    <a href="https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html" target="_blank" class="btn btn-info" role="button" style="text-decoration: none">Missing Data</a>
</div>
</div>
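
The three options listed in Step 2.3 can be sketched on a small, made-up DataFrame. The data, the `NOTES` column and the 50% threshold below are assumptions for illustration only; they are not part of the credit card dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the credit card dataset) with a few gaps
toy = pd.DataFrame({
    "LIMIT_BAL": [20000.0, np.nan, 90000.0, 50000.0],
    "AGE": [24.0, 26.0, np.nan, 37.0],
    "NOTES": [np.nan, np.nan, np.nan, "ok"],  # mostly missing
})

# Share of missing values per column
missing_share = toy.isnull().mean()

# (1) Delete every row that contains at least one missing value
rows_dropped = toy.dropna()

# (3) Drop features that are missing in more than half of the rows
#     (the 0.5 threshold is an arbitrary choice for this sketch)
kept = toy.loc[:, missing_share <= 0.5]

# (2) Impute the remaining numeric gaps with the column mean
imputed = kept.fillna(kept.mean())

print(missing_share)
print(rows_dropped.shape, kept.shape, imputed.isnull().sum().sum())
```

Which option is appropriate depends on how much data is lost by deletion and on whether the column mean is a sensible substitute for the feature.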

In [6]:
# The proportion of missing values for each variable
default.isnull().sum()/len(default)

LIMIT_BAL    0.0
SEX          0.0
EDUCATION    0.0
MARRIAGE     0.0
AGE          0.0
BILL_AMT1    0.0
BILL_AMT2    0.0
BILL_AMT3    0.0
BILL_AMT4    0.0
BILL_AMT5    0.0
BILL_AMT6    0.0
PAY_AMT1     0.0
PAY_AMT2     0.0
PAY_AMT3     0.0
PAY_AMT4     0.0
PAY_AMT5     0.0
PAY_AMT6     0.0
default      0.0
dtype: float64

In [7]:
# Luckily, this example has no missing values. If it had some, we could delete the affected rows:
default.dropna(inplace=True)
default.shape

(30000, 18)

In [8]:
# Alternatively, we can impute missing values with a substitute (e.g., the column mean).
# Suppose the variable "LIMIT_BAL" had some missing values:
default['LIMIT_BAL'] = default['LIMIT_BAL'].fillna(default['LIMIT_BAL'].mean())
default.head(10)

Unnamed: 0,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default
0,20000,2,2,1,24,3913,3102,689,0,0,0,0,689,0,0,0,0,1
1,120000,2,2,2,26,2682,1725,2682,3272,3455,3261,0,1000,1000,1000,0,2000,1
2,90000,2,2,2,34,29239,14027,13559,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
3,50000,2,2,1,37,46990,48233,49291,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
4,50000,1,2,1,57,8617,5670,35835,20940,19146,19131,2000,36681,10000,9000,689,679,0
5,50000,1,1,2,37,64400,57069,57608,19394,19619,20024,2500,1815,657,1000,1000,800,0
6,500000,1,1,2,29,367965,412023,445007,542653,483003,473944,55000,40000,38000,20239,13750,13770,0
7,100000,2,2,2,23,11876,380,601,221,-159,567,380,601,0,581,1687,1542,0
8,140000,2,3,1,28,11285,14096,12108,12211,11793,3719,3329,0,432,1000,1000,1000,0
9,20000,1,3,2,35,0,0,0,0,13007,13912,0,0,0,13007,1122,0,0


<div class="alert alert-warning">
<b>Step 2.4: Binarize categorical variables.</b>
<br><br>
<div class="btn-group">    
    <a href="https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/" target="_blank" class="btn btn-info" role="button" style="text-decoration: none">One-hot Encoding</a>
    <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html" target="_blank" class="btn btn-success" role="button" style="text-decoration: none">get_dummies</a>
</div>
</div>
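
A minimal sketch of what `get_dummies` does with and without `drop_first=True`, using a made-up single-column DataFrame (the values are hypothetical). With `drop_first=True` the first level becomes the reference category, which is represented by a row of all zeros.

```python
import pandas as pd

# Hypothetical toy column, stored as strings so pandas treats it as categorical
toy = pd.DataFrame({"MARRIAGE": ["1", "2", "3", "2"]})

# One indicator column per level
full = pd.get_dummies(toy, columns=["MARRIAGE"])

# Same encoding, but the first level ("1") is dropped and acts as the reference
reduced = pd.get_dummies(toy, columns=["MARRIAGE"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

Dropping one level avoids perfectly collinear indicator columns when the model also has an intercept.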

In [9]:
# 'SEX', 'EDUCATION' and 'MARRIAGE' are categorical variables, so we binarize them with one-hot encoding.
# Casting them to strings first makes get_dummies treat them as categories rather than numbers.
categorical_feature_list = ['SEX', 'EDUCATION', 'MARRIAGE']
default[categorical_feature_list] = default[categorical_feature_list].astype(str)
default.dtypes

LIMIT_BAL     int64
SEX          object
EDUCATION    object
MARRIAGE     object
AGE           int64
BILL_AMT1     int64
BILL_AMT2     int64
BILL_AMT3     int64
BILL_AMT4     int64
BILL_AMT5     int64
BILL_AMT6     int64
PAY_AMT1      int64
PAY_AMT2      int64
PAY_AMT3      int64
PAY_AMT4      int64
PAY_AMT5      int64
PAY_AMT6      int64
default       int64
dtype: object

In [10]:
dummies = pd.get_dummies(default[categorical_feature_list], drop_first=True) # dummy variables
dummies.head(10)

Unnamed: 0,SEX_2,EDUCATION_1,EDUCATION_2,EDUCATION_3,EDUCATION_4,EDUCATION_5,EDUCATION_6,MARRIAGE_1,MARRIAGE_2,MARRIAGE_3
0,1,0,1,0,0,0,0,1,0,0
1,1,0,1,0,0,0,0,0,1,0
2,1,0,1,0,0,0,0,0,1,0
3,1,0,1,0,0,0,0,1,0,0
4,0,0,1,0,0,0,0,1,0,0
5,0,1,0,0,0,0,0,0,1,0
6,0,1,0,0,0,0,0,0,1,0
7,1,0,1,0,0,0,0,0,1,0
8,1,0,0,1,0,0,0,1,0,0
9,0,0,0,1,0,0,0,0,1,0


In [11]:
# Merge dummies with original dataset, and drop original categorical variables
default_data = default.join(dummies)
default_data.drop(categorical_feature_list, axis=1, inplace=True)
default_data.head(10)

Unnamed: 0,LIMIT_BAL,AGE,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,...,SEX_2,EDUCATION_1,EDUCATION_2,EDUCATION_3,EDUCATION_4,EDUCATION_5,EDUCATION_6,MARRIAGE_1,MARRIAGE_2,MARRIAGE_3
0,20000,24,3913,3102,689,0,0,0,0,689,...,1,0,1,0,0,0,0,1,0,0
1,120000,26,2682,1725,2682,3272,3455,3261,0,1000,...,1,0,1,0,0,0,0,0,1,0
2,90000,34,29239,14027,13559,14331,14948,15549,1518,1500,...,1,0,1,0,0,0,0,0,1,0
3,50000,37,46990,48233,49291,28314,28959,29547,2000,2019,...,1,0,1,0,0,0,0,1,0,0
4,50000,57,8617,5670,35835,20940,19146,19131,2000,36681,...,0,0,1,0,0,0,0,1,0,0
5,50000,37,64400,57069,57608,19394,19619,20024,2500,1815,...,0,1,0,0,0,0,0,0,1,0
6,500000,29,367965,412023,445007,542653,483003,473944,55000,40000,...,0,1,0,0,0,0,0,0,1,0
7,100000,23,11876,380,601,221,-159,567,380,601,...,1,0,1,0,0,0,0,0,1,0
8,140000,28,11285,14096,12108,12211,11793,3719,3329,0,...,1,0,0,1,0,0,0,1,0,0
9,20000,35,0,0,0,0,13007,13912,0,0,...,0,0,0,1,0,0,0,0,1,0


In [12]:
default_data.describe()

Unnamed: 0,LIMIT_BAL,AGE,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,...,SEX_2,EDUCATION_1,EDUCATION_2,EDUCATION_3,EDUCATION_4,EDUCATION_5,EDUCATION_6,MARRIAGE_1,MARRIAGE_2,MARRIAGE_3
count,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,...,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0
mean,167484.322667,35.4855,51223.3309,49179.075167,47013.15,43262.948967,40311.400967,38871.7604,5663.5805,5921.163,...,0.603733,0.352833,0.467667,0.1639,0.0041,0.009333,0.0017,0.4553,0.532133,0.010767
std,129747.661567,9.217904,73635.860576,71173.768783,69349.39,64332.856134,60797.15577,59554.107537,16563.280354,23040.87,...,0.489129,0.477859,0.498962,0.370191,0.063901,0.096159,0.041197,0.498006,0.498975,0.103204
min,10000.0,21.0,-165580.0,-69777.0,-157264.0,-170000.0,-81334.0,-339603.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,50000.0,28.0,3558.75,2984.75,2666.25,2326.75,1763.0,1256.0,1000.0,833.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,140000.0,34.0,22381.5,21200.0,20088.5,19052.0,18104.5,17071.0,2100.0,2009.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
75%,240000.0,41.0,67091.0,64006.25,60164.75,54506.0,50190.5,49198.25,5006.0,5000.0,...,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
max,1000000.0,79.0,964511.0,983931.0,1664089.0,891586.0,927171.0,961664.0,873552.0,1684259.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


<div class="alert alert-warning">
<b>Step 2.5: Save a copy.</b>
</div>

In [13]:
output = 'default'
X = default_data.drop(output, axis=1) # Here no need to set inplace=True
y = default_data[output]

In [14]:
# Save a copy
default_data.to_csv('./dataset.csv', index=False)

<a id="3"></a>

## 3 Regularization: Penalized Regression

<div class="alert alert-warning">
<b>Note: </b>You might get convergence warnings. For this assignment, you can either ignore them or increase the number of iterations.
</div>
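
One way to increase the number of iterations is the `max_iter` argument of `LogisticRegression` (its default is 100). The sketch below uses synthetic stand-in data from `make_classification`, which is an assumption for illustration and not the tutorial dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption; not the credit card dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=12345)

# A larger max_iter gives the solver more iterations to converge
clf = LogisticRegression(penalty='l1', solver='liblinear', C=1.0,
                         max_iter=1000, random_state=12345)
clf.fit(X_demo, y_demo)

# n_iter_ reports how many iterations the solver actually used
print(clf.n_iter_)
```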

### 3.1 Lasso Regression

In [15]:
# Import libraries
import sklearn
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix

# Read data
data = pd.read_csv("./dataset.csv")
data.head(n=10)

Unnamed: 0,LIMIT_BAL,AGE,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,...,SEX_2,EDUCATION_1,EDUCATION_2,EDUCATION_3,EDUCATION_4,EDUCATION_5,EDUCATION_6,MARRIAGE_1,MARRIAGE_2,MARRIAGE_3
0,20000,24,3913,3102,689,0,0,0,0,689,...,1,0,1,0,0,0,0,1,0,0
1,120000,26,2682,1725,2682,3272,3455,3261,0,1000,...,1,0,1,0,0,0,0,0,1,0
2,90000,34,29239,14027,13559,14331,14948,15549,1518,1500,...,1,0,1,0,0,0,0,0,1,0
3,50000,37,46990,48233,49291,28314,28959,29547,2000,2019,...,1,0,1,0,0,0,0,1,0,0
4,50000,57,8617,5670,35835,20940,19146,19131,2000,36681,...,0,0,1,0,0,0,0,1,0,0
5,50000,37,64400,57069,57608,19394,19619,20024,2500,1815,...,0,1,0,0,0,0,0,0,1,0
6,500000,29,367965,412023,445007,542653,483003,473944,55000,40000,...,0,1,0,0,0,0,0,0,1,0
7,100000,23,11876,380,601,221,-159,567,380,601,...,1,0,1,0,0,0,0,0,1,0
8,140000,28,11285,14096,12108,12211,11793,3719,3329,0,...,1,0,0,1,0,0,0,1,0,0
9,20000,35,0,0,0,0,13007,13912,0,0,...,0,0,0,1,0,0,0,0,1,0


In [16]:
# Train-Test Split: 90/10
output = 'default'
X = data.drop(output, axis=1) # Here no need to set inplace=True
y = data[output]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=12345)
print(y_train.sum()/y_train.count(),y_test.sum()/y_test.count())

0.22144444444444444 0.219
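
The two default rates printed above are close because the split is random and the sample is large. If the class proportions should match by construction, `train_test_split` also accepts a `stratify` argument. The sketch below uses hypothetical labels with a 22% positive rate; the tutorial itself does not stratify.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 22% positives, similar to the default rate above
y_demo = np.array([1] * 220 + [0] * 780)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

# stratify=y_demo keeps the share of positives (nearly) equal in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=12345, stratify=y_demo)

print(y_tr.mean(), y_te.mean())
```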


<div class="alert alert-warning">
<b>Step 3.1: Lasso Regression </b> 
<p>Please fit training data with <b>Logistic Regression with L1-Penalty</b>.</p>
<p>Please report/print parameter values on the <code><b>train</b></code> dataset.</p>
<p><b>Remember: </b>Set <code><b>fit_intercept=True</b></code>, <code><b>penalty='l1'</b></code>, <code><b>C=10**(-7)</b></code>, <code><b>solver='liblinear'</b></code> and <code><b>random_state=12345</b></code> so that results can be replicated. The default value is <code><b>C=1</b></code>.</p>
<p>You can refer to: </p>
<div class="btn-group">    
    <a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html" target="_blank" class="btn btn-primary" role="button" style="text-decoration: none">Scikit-learn</a>
    <a href="https://en.wikipedia.org/wiki/Lasso_(statistics)" target="_blank" class="btn btn-warning" role="button" style="text-decoration: none">Lasso Regression</a>
</div>
</div>
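
As background for the `penalty` and `C` arguments, here is a sketch of the objective in the form given in the scikit-learn user guide. The notation is an assumption and may differ from the lecture slides: labels are recoded as $y_i \in \{-1, 1\}$, $w$ is the coefficient vector and $c$ the intercept. The L1-penalized (Lasso) logistic regression solves

$$
\min_{w,\,c} \; \|w\|_1 + C \sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i \left(x_i^\top w + c\right)\right)\right), \qquad \|w\|_1 = \sum_j |w_j| ,
$$

while the L2-penalized (Ridge) version in Step 3.2 replaces $\|w\|_1$ with $\tfrac{1}{2} w^\top w$. Dividing the objective by $C$ leaves the minimizer unchanged and gives the equivalent form

$$
\min_{w,\,c} \; \sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i \left(x_i^\top w + c\right)\right)\right) + \lambda \, \|w\|_1 , \qquad \lambda = \frac{1}{C} ,
$$

so a very small $C$, such as the one used in this step, corresponds to a very strong penalty.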

In [17]:
lasso = LogisticRegression(fit_intercept=True, penalty='l1', C=10**(-7), solver='liblinear', random_state=12345)
lasso_model = lasso.fit(X=X_train, y=y_train)
print('Intercept: \n', lasso_model.intercept_, '\nFeatures: \n', lasso_model.coef_)

Intercept: 
 [0.] 
Features: 
 [[-6.67542969e-06  0.00000000e+00 -3.90430098e-07  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  -2.10763027e-07  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00]]


In [18]:
# Get predicted labels for test data
y_pred_lasso = lasso_model.predict(X_test)

# Performance of model on test data
print("Test Accuracy of Lasso Model: ", accuracy_score(y_test, y_pred_lasso))
print("Test Error of Lasso Model: ", 1 - accuracy_score(y_test, y_pred_lasso))

Test Accuracy of Lasso Model:  0.781
Test Error of Lasso Model:  0.21899999999999997
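
The test accuracy of 0.781 equals 1 − 0.219, the share of non-defaulters in the test set, which is exactly what a model that predicts "no default" for every client would score. A confusion matrix shows which kinds of errors are behind an accuracy figure. The sketch below uses made-up labels and an "always predict 0" model, purely for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels: 2 defaulters among 10 clients
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.zeros_like(y_true)  # a model that always predicts "no default"

# For binary labels [0, 1], ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("Accuracy:", accuracy_score(y_true, y_pred))
```

On the tutorial data, the same call with `y_test` and `y_pred_lasso` would show how many actual defaulters the model identifies.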


### 3.2 Ridge Regression

<div class="alert alert-warning">
<b>Step 3.2: Ridge Regression </b> 
<p>Please fit training data with <b>Logistic Regression with L2-Penalty</b>.</p>
<p>Please report/print parameter values on the <code><b>train</b></code> dataset.</p>
<p><b>Remember: </b>Set <code><b>fit_intercept=True</b></code>, <code><b>penalty='l2'</b></code>, <code><b>C=10**(-7)</b></code>, <code><b>solver='liblinear'</b></code> and <code><b>random_state=12345</b></code>.</p>
<p>You can refer to: </p>
<div class="btn-group">    
    <a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html" target="_blank" class="btn btn-primary" role="button" style="text-decoration: none">Scikit-learn</a>
    <a href="http://statweb.stanford.edu/~tibs/sta305files/Rudyregularization.pdf" target="_blank" class="btn btn-warning" role="button" style="text-decoration: none">Ridge Regression</a>
</div>
</div>

In [19]:
ridge = LogisticRegression(fit_intercept=True, penalty='l2', C=10**(-7), solver='liblinear', random_state=12345)
ridge_model = ridge.fit(X=X_train, y=y_train)
print('Intercept: \n', ridge_model.intercept_, '\nFeatures: \n', ridge_model.coef_)

Intercept: 
 [-6.73297194e-05] 
Features: 
 [[-5.01839387e-06 -2.03391101e-03 -1.02598497e-05  4.92715769e-06
   3.00062055e-06  3.06859781e-07  4.15826180e-06  1.82579409e-06
  -3.46659933e-05 -2.92433160e-05 -9.34400581e-06 -9.70342092e-06
  -7.18729748e-06 -7.95774932e-07 -5.00967956e-05 -1.47628228e-05
  -3.42116269e-05 -1.42878537e-05 -8.17904391e-07 -2.85645837e-06
  -2.67140003e-07 -1.05032579e-05 -5.52120645e-05 -1.07848905e-06]]
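
Comparing this printout with the Lasso coefficients from Step 3.1 shows the main practical difference between the two penalties: L1 sets most coefficients to exactly zero, while L2 shrinks them toward zero but keeps them non-zero. A small sketch on synthetic stand-in data (the data and the value `C=0.01` are assumptions chosen for illustration) counts the non-zero coefficients under each penalty.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 20 features, only 3 of them informative
X_demo, y_demo = make_classification(n_samples=500, n_features=20, n_informative=3,
                                     n_redundant=0, random_state=12345)

# Fit the same model with each penalty and count non-zero coefficients
n_nonzero = {}
for penalty in ('l1', 'l2'):
    model = LogisticRegression(penalty=penalty, C=0.01, solver='liblinear',
                               random_state=12345).fit(X_demo, y_demo)
    n_nonzero[penalty] = int(np.count_nonzero(model.coef_))

print(n_nonzero)
```

On the tutorial models, `np.count_nonzero(lasso_model.coef_)` and `np.count_nonzero(ridge_model.coef_)` give the same comparison.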


In [20]:
# Get predicted labels for test data
y_pred_ridge = ridge_model.predict(X_test)

# Performance of model on test data
print("Test Accuracy of Ridge Model: ", accuracy_score(y_test, y_pred_ridge))
print("Test Error of Ridge Model: ", 1 - accuracy_score(y_test, y_pred_ridge))

Test Accuracy of Ridge Model:  0.781
Test Error of Ridge Model:  0.21899999999999997


### 3.3 Cross-Validation

<div class="alert alert-warning">
<b>Step 3.3: Normal Logistic Regression </b> 
<p>Please fit training data with <b>Logistic Regression</b> model.</p>
<p>Please report/print parameter values on the <code><b>train</b></code> dataset, and report/print test accuracy on the <code><b>test</b></code> dataset.</p>
<p><b>Remember: </b>Set <code><b>fit_intercept=True</b></code> and <code><b>solver='liblinear'</b></code> and <code><b>C=10**10</b></code> and <code><b>random_state=12345</b></code> so that results can be replicated.</p> 
<p>You can refer to: </p>
<div class="btn-group">    
    <a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html" target="_blank" class="btn btn-success" role="button" style="text-decoration: none">Logistic Regression</a>
</div>
<div class="alert alert-danger">
<b>Note:</b><p>Set <code><b>C=10**10</b></code> so that regularization strength $\lambda$ ($=\frac{1}{C}$) will be almost 0 (i.e., regularization effect is minimal).</p>
</div>
</div>

In [21]:
# Fit logistic regression model
lr = LogisticRegression(fit_intercept=True, solver='liblinear', C=10**10, random_state=12345)
lr_model = lr.fit(X=X_train, y=y_train)
print('Intercept: \n', lr_model.intercept_, '\nFeatures: \n', lr_model.coef_)

Intercept: 
 [-0.00064832] 
Features: 
 [[-3.21660409e-06 -1.58911772e-02 -8.12534469e-06  4.48114606e-06
   2.78780314e-06  2.62020109e-07  3.11756441e-06  1.85508360e-06
  -2.70161903e-05 -2.28555791e-05 -7.38365815e-06 -7.85282279e-06
  -6.30607082e-06 -1.30737154e-06 -5.53902211e-04 -1.71803661e-04
  -3.19437410e-04 -1.00669404e-04 -1.20633403e-05 -3.97383529e-05
  -2.81990878e-06  3.56439735e-05 -6.70699546e-04 -6.36596799e-06]]


In [22]:
# Get predicted labels for test data
y_pred_lr = lr_model.predict(X_test)

# Performance of model on test data
print("Test Accuracy of Normal Logistic Model: ", accuracy_score(y_test, y_pred_lr))
print("Test Error of Normal Logistic Model: ", 1 - accuracy_score(y_test, y_pred_lr))

Test Accuracy of Normal Logistic Model:  0.781
Test Error of Normal Logistic Model:  0.21899999999999997


<div class="alert alert-warning">
<b>Step 3.4: Cross-Validation </b> 
<div>Please report/print <b>5-fold</b> cross-validation <b>accuracy</b> scores on the whole dataset for: 
<ol>    
<li><code><b>Logistic Regression model</b></code>;</li>  
<li><code><b>Lasso model</b></code>;</li>
<li><code><b>Ridge model</b></code>.</li>
</ol>
</div>
<p>Compare their accuracy scores. Do you think penalized regression techniques improve model accuracy?</p>
<p>You can refer to: </p>
<div class="btn-group">
    <a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold" target="_blank" class="btn btn-success" role="button" style="text-decoration: none">Scikit-learn KFold</a>
    <a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score" target="_blank" class="btn btn-primary" role="button" style="text-decoration: none">Scikit-learn CV</a>
    <a href="https://en.wikipedia.org/wiki/Cross-validation_(statistics)" target="_blank" class="btn btn-warning" role="button" style="text-decoration: none">Cross-validation</a>
</div>
<div class="alert alert-danger">
<b>Note: </b>Remember to set <code>n_splits=5</code>, <code>shuffle=True</code> and <code>random_state=12345</code> for <code>KFold</code>.
</div>
</div>
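
To make clear what `cross_val_score` computes, the sketch below runs the K-fold loop by hand and compares the result with `cross_val_score`. It uses synthetic stand-in data, and the number of folds is chosen only to keep the demo small; both are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption; not the credit card dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=12345)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=12345)
model = LogisticRegression(solver='liblinear', random_state=12345)

manual_scores = []
for train_idx, val_idx in kf_demo.split(X_demo):
    model.fit(X_demo[train_idx], y_demo[train_idx])   # fit on the other k-1 folds
    y_hat = model.predict(X_demo[val_idx])            # predict the held-out fold
    manual_scores.append(accuracy_score(y_demo[val_idx], y_hat))

# cross_val_score performs the same loop internally
auto_scores = cross_val_score(model, X_demo, y_demo, cv=kf_demo, scoring='accuracy')

print(np.mean(manual_scores), auto_scores.mean())
```

Each observation is used for validation exactly once, so the mean score is less sensitive to one particular train/test split.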

In [23]:
# K-Fold Cross Validation
k = 5
kf = KFold(n_splits=k, shuffle=True, random_state=12345)

In [24]:
# Logistic regression
lr = LogisticRegression(fit_intercept=True, solver='liblinear', C=10**10, random_state=12345)
cv_lr = cross_val_score(lr, # Cross-validation on logistic regression
                        X, # Feature matrix
                        y, # Output vector
                        cv=kf, # Cross-validation technique
                        scoring='accuracy' # Model performance metrics: accuracy
                        )

# Lasso Regression
lasso = LogisticRegression(fit_intercept=True, penalty='l1', C=10**(-7), solver='liblinear', random_state=12345)
cv_lasso = cross_val_score(lasso, # Cross-validation on Lasso
                           X, # Feature matrix
                           y, # Output vector
                           cv=kf, # Cross-validation technique
                           scoring='accuracy' # Model performance metrics: accuracy
                           )

# Ridge Regression
ridge = LogisticRegression(fit_intercept=True, penalty='l2', C=10**(-7), solver='liblinear', random_state=12345)
cv_ridge = cross_val_score(ridge, # Cross-validation on Ridge
                           X, # Feature matrix
                           y, # Output vector
                           cv=kf, # Cross-validation technique
                           scoring='accuracy' # Model performance metrics: accuracy
                           )

In [25]:
# Model performance: Cross-validation

# Report average cross-validation accuracy of Logistic regression
print('Mean Accuracy of Logistic regression model: ', cv_lr.mean())

# Report average cross-validation accuracy of Lasso regression
print('Mean Accuracy of Lasso model: ', cv_lasso.mean())

# Report average cross-validation accuracy of Ridge regression
print('Mean Accuracy of Ridge model: ', cv_ridge.mean())

Mean Accuracy of Logistic regression model:  0.7787333333333334
Mean Accuracy of Lasso model:  0.7787999999999999
Mean Accuracy of Ridge model:  0.7787999999999999


<div class="alert alert-danger">
<b>Findings:</b>
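<p>Penalized regression techniques can improve model performance, although the improvement is sometimes small: in the output above, the penalized models reach a mean cross-validation accuracy of about 0.7788, compared with about 0.7787 for the unpenalized logistic regression. On a large dataset, even a slight improvement in model performance is welcomed by practitioners and companies. Note also that all three models reach the same test accuracy of 0.781, which equals the share of non-defaulters in the test set (1 − 0.219).</p>
<br>
<p>Here we did not search for the best regularization parameter $\lambda$. In Assignment 2, you will be asked to find and select the best regularization parameter for Lasso/Ridge, and you will see that the test accuracy improves considerably.</p>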
</div>
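
One possible way to search for the regularization parameter is a cross-validated grid search over `C` (recall $\lambda = 1/C$). The sketch below is not the Assignment 2 solution: it uses synthetic stand-in data, and the grid of candidate values is an arbitrary assumption.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in data (an assumption; not the credit card dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=12345)

# Candidate values of C (= 1 / lambda); the grid itself is an arbitrary choice
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100]}

search = GridSearchCV(
    LogisticRegression(penalty='l1', solver='liblinear', random_state=12345),
    param_grid,
    scoring='accuracy',
    cv=KFold(n_splits=5, shuffle=True, random_state=12345),
)
search.fit(X_demo, y_demo)

# Best C on the grid and its mean cross-validation accuracy
print(search.best_params_, search.best_score_)
```

To keep the final test accuracy an honest estimate, the search would be run on the training data only and the selected model evaluated once on the test set.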

<a id="4"></a>