# Adding Regularization to the Model

In this activity, we will use the same logistic regression model from the scikit-learn package. This time, however, we will add regularization to the model and search for the optimal regularization parameter, a process often called hyperparameter tuning. After training the models, we will generate predictions on the test data and compare the evaluation metrics to those produced by the baseline model and the model without regularization.

### 1. Load the feature and target datasets of the online shoppers purchasing intention dataset from '../data/OSI_feats_e3.csv' and '../data/OSI_target_e2.csv'.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegressionCV

In [2]:
df_x = pd.read_csv("../data/OSI_feats_e3.csv")
df_y = pd.read_csv("../data/OSI_target_e2.csv")
print(df_x.shape)
print(df_y.shape)

(12330, 68)
(12330, 1)


### 2. Create training and test datasets for each of the feature and target datasets. The models will be trained on the training datasets and evaluated on the test datasets.

In [3]:
x_train, x_test, y_train, y_test = train_test_split(df_x, df_y, test_size=0.2, random_state=13)
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

(9864, 68)
(2466, 68)
(9864, 1)
(2466, 1)


### 3. Instantiate two instances of the LogisticRegressionCV class from scikit-learn's linear_model module: one with an L1 penalty and one with an L2 penalty.

In [4]:
np.logspace(-2, 6, 9)

array([1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02, 1.e+03, 1.e+04, 1.e+05,
       1.e+06])
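
In scikit-learn, `C` is the inverse of the regularization strength, so the smallest values in this grid penalize the coefficients most heavily. The following is a minimal sketch of that effect, assuming synthetic data from `make_classification` rather than the shoppers dataset; the sample sizes and the three `C` values are illustrative choices. It counts how many coefficients an L1 penalty leaves nonzero at each value of `C`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for illustration only).
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)
X = StandardScaler().fit_transform(X)

# Fit one L1-penalized model per C and count the surviving coefficients.
nonzero_counts = {}
for C in [0.01, 1.0, 100.0]:
    clf = LogisticRegression(penalty="l1", C=C, solver="liblinear", random_state=0)
    clf.fit(X, y)
    nonzero_counts[C] = int(np.count_nonzero(clf.coef_))

for C, count in nonzero_counts.items():
    print(f"C={C:g}: {count} nonzero coefficients out of {X.shape[1]}")
```

A small `C` should typically drive more coefficients to exactly zero, which is why the grid spans several orders of magnitude.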

In [9]:
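# Grid of candidate C values to search with cross-validation
Cs = np.logspace(-2, 6, 9)

# L1 penalty; the liblinear solver supports L1 regularization
model_l1 = LogisticRegressionCV(
    Cs=Cs,
    penalty='l1',
    cv=10,
    solver='liblinear',
    max_iter=5000,
    random_state=42,
)
# L2 penalty; the lbfgs solver does not support L1, so it is used only here
model_l2 = LogisticRegressionCV(
    Cs=Cs,
    penalty='l2',
    cv=10,
    solver='lbfgs',
    max_iter=5000,
    random_state=42,
)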

### 4. Fit the models to the training data.

In [10]:
model_l1.fit(x_train, y_train['Revenue'])
model_l2.fit(x_train, y_train['Revenue'])

LogisticRegressionCV(Cs=array([1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02, 1.e+03, 1.e+04, 1.e+05,
       1.e+06]),
                     cv=10, max_iter=5000, random_state=42)

In [7]:
print(f'Best hyperparameter for l1 regularization model: {model_l1.C_[0]}')
print(f'Best hyperparameter for l2 regularization model: {model_l2.C_[0]}')

Best hyperparameter for l1 regularization model: 10.0
Best hyperparameter for l2 regularization model: 0.1
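
The `C_` attribute holds the value that cross-validation selected from the grid. Because larger `C` means weaker regularization, the L1 model here settled on a lighter penalty than the L2 model. The selection can be approximated by hand, as in the sketch below, which assumes synthetic data and a shorter grid than the one used above. It scores a plain `LogisticRegression` at each `C` with `cross_val_score` and keeps the value with the highest mean accuracy. `LogisticRegressionCV` may differ in its internal details, so this is not guaranteed to reproduce its choice exactly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption for illustration only).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# A shorter grid than in the activity, to keep the sketch fast.
Cs = np.logspace(-2, 2, 5)

mean_scores = []
for C in Cs:
    clf = LogisticRegression(C=C, max_iter=1000)
    # For classifiers, cross_val_score uses accuracy by default.
    scores = cross_val_score(clf, X, y, cv=5)
    mean_scores.append(scores.mean())

# Keep the C with the highest mean cross-validated accuracy.
best_C = Cs[int(np.argmax(mean_scores))]
for C, score in zip(Cs, mean_scores):
    print(f"C={C:g}: mean accuracy {score:.4f}")
print(f"Best C by mean cross-validated accuracy: {best_C:g}")
```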


### 6. Evaluate the models by comparing their predictions against the true values using the evaluation metrics.

In [8]:
y_l1_pred = model_l1.predict(x_test)
y_l2_pred = model_l2.predict(x_test)

In [11]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

In [12]:
print(accuracy_score(y_pred=y_l1_pred, y_true=y_test))
print(accuracy_score(y_pred=y_l2_pred, y_true=y_test))

0.8917274939172749
0.8921330089213301
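
Accuracy is easiest to interpret next to a trivial baseline, particularly when one class dominates the target. The sketch below uses a handful of made-up labels (not taken from the dataset) to show how the accuracy of always predicting the most frequent class can be computed.

```python
from collections import Counter

# Toy labels, made up for illustration: 1 = purchase, 0 = no purchase.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# Find the most frequent class and predict it for every sample.
majority_class, majority_count = Counter(y_true).most_common(1)[0]
y_baseline = [majority_class] * len(y_true)

baseline_accuracy = sum(p == t for p, t in zip(y_baseline, y_true)) / len(y_true)
print(majority_class, baseline_accuracy)  # prints: 0 0.8
```

On the real test set, the same figure could be obtained with `y_test['Revenue'].value_counts(normalize=True).max()` and compared against the two accuracy scores above.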


In [15]:
precision, recall, fscore, support = precision_recall_fscore_support(
    y_pred=y_l1_pred, 
    y_true=y_test,
    average='binary'
)
print(precision)
print(recall)
print(fscore)

0.7286432160804021
0.40502793296089384
0.5206463195691203


In [16]:
precision, recall, fscore, support = precision_recall_fscore_support(
    y_pred=y_l2_pred, 
    y_true=y_test,
    average='binary'
)
print(precision)
print(recall)
print(fscore)

0.73
0.40782122905027934
0.5232974910394265
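
Both models show fairly high precision (about 0.73) but low recall (about 0.41), and the F-score summarizes that trade-off as the harmonic mean of the two. As a quick check, the sketch below recomputes the F1 score from the precision and recall values printed above; the helper name `f1_from_precision_recall` is introduced here for illustration and is not part of scikit-learn.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Return the F1 score, the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as printed above for the two models.
reported = {
    "l1": (0.7286432160804021, 0.40502793296089384),
    "l2": (0.73, 0.40782122905027934),
}

f1_scores = {
    name: f1_from_precision_recall(p, r) for name, (p, r) in reported.items()
}
for name, f1 in f1_scores.items():
    print(f"{name}: F1 = {f1:.4f}")
```

The recomputed values agree with the `fscore` outputs above, which confirms that `average='binary'` reports the F1 score for the positive class.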
