Notebook prepared by Henrique Lopes Cardoso (hlc@fe.up.pt).

# REGULARIZATION AND SGD

Regularization is a technique that helps avoid overfitting by penalizing large feature weights. Several classifiers, such as [Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) and [SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html), let you choose which regularization term to use.
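
As a rough illustration of what "penalizing weights" means, the sketch below computes the L1 and L2 penalty terms for a made-up weight vector. Both `w` and the strength `lam` are illustrative assumptions; scikit-learn controls the strength through other parameters (such as `C` or `alpha`), so this is not how the library exposes it.

```python
import numpy as np

w = np.array([0.5, -2.0, 0.0, 1.5])   # hypothetical weight vector
lam = 0.1                              # hypothetical regularization strength

l1_penalty = lam * np.sum(np.abs(w))   # L1: sum of absolute weights
l2_penalty = lam * np.sum(w ** 2)      # L2: sum of squared weights

print(l1_penalty, l2_penalty)
```

The penalty is added to the training loss, so large weights are only kept when they reduce the loss by more than they cost.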

In this notebook we'll explore the usage of different regularization terms. For that, we'll use a restaurant reviews classification task.

In [1]:
# Loading the data

import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter = '\t', quoting = 3)

print(dataset['Liked'].value_counts())
dataset.head()

1    500
0    500
Name: Liked, dtype: int64


| | Review | Liked |
|---|---|---|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |


In [2]:
# Cleaning the text

import re
import nltk
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords

corpus = []
ps = PorterStemmer()
# build the stop word set once, instead of once per token
stop_words = set(stopwords.words('english'))
for i in range(len(dataset)):
    # get review and remove non alpha chars
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])
    # to lower-case and tokenize
    review = review.lower().split()
    # stemming and stop word removal
    review = ' '.join(ps.stem(w) for w in review if w not in stop_words)
    corpus.append(review)

In [3]:
# Creating a bag-of-words model

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features = 1500)
X = vectorizer.fit_transform(corpus).toarray()
y = dataset['Liked']

print(X.shape, y.shape)

(1000, 1500) (1000,)


In [4]:
# Splitting the dataset into training and test sets

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0, stratify=y)

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

print(y_train.value_counts())
print(y_test.value_counts())

(800, 1500) (800,)
(200, 1500) (200,)
1    400
0    400
Name: Liked, dtype: int64
0    100
1    100
Name: Liked, dtype: int64


## Logistic Regression

Scikit-learn's [Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) supports L1, L2, and mixed (elastic-net) regularization. L2 is the default.

In [51]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score, accuracy_score

clf = LogisticRegression(penalty='l2') # l2 regularization is the default
clf.fit(X_train, y_train)

In [52]:
y_pred = clf.predict(X_test)

print("LogisticRegression with L2 regularization")

# Assess the accuracy, precision, recall, and F1 score of the model
print(f'Accuracy: {accuracy_score(y_test, y_pred)}')
print(f'Precision: {precision_score(y_test, y_pred)}')
print(f'Recall: {recall_score(y_test, y_pred)}')
print(f'F1: {f1_score(y_test, y_pred)}')

LogisticRegression with L2 regularization
Accuracy: 0.805
Precision: 0.8210526315789474
Recall: 0.78
F1: 0.8


Print the feature weights that we've obtained.

In [53]:
# your code here
fw = clf.coef_[0]
fw

array([ 0.41686797,  0.17697687,  0.        , ..., -0.20321363,
        0.64383437, -0.61415454])

How many features are actually being used? (I.e., how many non-zero weights are there?)

In [54]:
# your code here
len([w for w in fw if w != 0])

1311

L1 regularization typically obtains sparser weight vectors. Try using L1 regularization (check the documentation for additional changes you might need). How many non-zero weights do you have now?

In [55]:
# your code here
clf2 = LogisticRegression(penalty='l1', solver='liblinear')
clf2.fit(X_train, y_train)

fw2 = clf2.coef_[0]
len([w for w in fw2 if w != 0])

149
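
How sparse the L1 solution gets depends on the regularization strength. The following is a minimal sketch of that effect, assuming synthetic data from `make_classification` rather than the restaurant reviews; the names `X_demo`, `y_demo`, and `nonzero_counts` are made up for illustration. In `LogisticRegression`, a smaller `C` means stronger regularization.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption, not the notebook's dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=50, n_informative=5, random_state=0
)

nonzero_counts = {}
for C in (0.01, 0.1, 1.0):
    model = LogisticRegression(penalty='l1', solver='liblinear', C=C)
    model.fit(X_demo, y_demo)
    # count the weights that were not driven to exactly zero
    nonzero_counts[C] = int(np.count_nonzero(model.coef_[0]))

print(nonzero_counts)
```

`np.count_nonzero` is also a shorter way to count the non-zero weights than the list comprehension used in the cells of this notebook.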

In [56]:
y_pred2 = clf2.predict(X_test)

print("LogisticRegression with L1 regularization")

# Assess the accuracy, precision, recall, and F1 score of the model
print(f'Accuracy: {accuracy_score(y_test, y_pred2)}')
print(f'Precision: {precision_score(y_test, y_pred2)}')
print(f'Recall: {recall_score(y_test, y_pred2)}')
print(f'F1: {f1_score(y_test, y_pred2)}')

LogisticRegression with L1 regularization
Accuracy: 0.795
Precision: 0.8641975308641975
Recall: 0.7
F1: 0.7734806629834253


You can also try using a mix of L1 and L2 (check the documentation for how to do it).

In [57]:
# your code here
clf3 = LogisticRegression(penalty='elasticnet', solver='saga', l1_ratio=0.5, max_iter=10000)
clf3.fit(X_train, y_train)
y_pred = clf3.predict(X_test)

fw3 = clf3.coef_[0]
len([w for w in fw3 if w != 0])

380

In [58]:
y_pred3 = clf3.predict(X_test)

print("LogisticRegression with L1 and L2 regularization with an l1_ratio of 0.5")

# Assess the accuracy, precision, recall, and F1 score of the model
print(f'Accuracy: {accuracy_score(y_test, y_pred3)}')
print(f'Precision: {precision_score(y_test, y_pred3)}')
print(f'Recall: {recall_score(y_test, y_pred3)}')
print(f'F1: {f1_score(y_test, y_pred3)}')

LogisticRegression with L1 and L2 regularization with an l1_ratio of 0.5
Accuracy: 0.78
Precision: 0.8181818181818182
Recall: 0.72
F1: 0.7659574468085107


## SVM

Scikit-learn's [SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html) also includes both L1 and L2 regularizations. L2 is the default.

In [68]:
from sklearn.svm import LinearSVC
from sklearn.metrics import confusion_matrix

clf = LinearSVC(penalty='l2') # l2 regularization is the default

clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(confusion_matrix(y_test, y_pred))

[[82 18]
 [20 80]]


In [69]:
y_pred = clf.predict(X_test)

print("SVM with L2 regularization")

# Assess the accuracy, precision, recall, and F1 score of the model
print(f'Accuracy: {accuracy_score(y_test, y_pred)}')
print(f'Precision: {precision_score(y_test, y_pred)}')
print(f'Recall: {recall_score(y_test, y_pred)}')
print(f'F1: {f1_score(y_test, y_pred)}')

SVM with L2 regularization
Accuracy: 0.81
Precision: 0.8163265306122449
Recall: 0.8
F1: 0.8080808080808082
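
The scores above can be cross-checked by hand against the confusion matrix printed for this model. The sketch below assumes scikit-learn's layout, where rows are true labels and columns are predicted labels, so that flattening the matrix gives TN, FP, FN, TP in that order.

```python
import numpy as np

# Confusion matrix copied from the L2-regularized LinearSVC output
cm = np.array([[82, 18],
               [20, 80]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)          # correct positives among predicted positives
recall = tp / (tp + fn)             # correct positives among actual positives
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```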


How many features are actually being used? (I.e., how many non-zero weights are there?)

In [60]:
# your code here
len([w for w in clf.coef_[0] if w != 0])

1083

Try using L1 regularization (check the documentation for additional changes you might need). How many non-zero weights do you have now?

In [70]:
# your code here
from sklearn.svm import LinearSVC

clf = LinearSVC(penalty='l1', loss='squared_hinge', dual=False) # the l1 penalty requires squared hinge loss and dual=False

clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(confusion_matrix(y_test, y_pred))
len([w for w in clf.coef_[0] if w != 0])

[[86 14]
 [29 71]]


418

In [71]:
y_pred = clf.predict(X_test)

print("SVM with L1 regularization")

# Assess the accuracy, precision, recall, and F1 score of the model
print(f'Accuracy: {accuracy_score(y_test, y_pred)}')
print(f'Precision: {precision_score(y_test, y_pred)}')
print(f'Recall: {recall_score(y_test, y_pred)}')
print(f'F1: {f1_score(y_test, y_pred)}')

SVM with L1 regularization
Accuracy: 0.785
Precision: 0.8352941176470589
Recall: 0.71
F1: 0.7675675675675675


## SGD Classifier

Scikit-learn's [SGD Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html) implements regularized linear models (such as SVM and Logistic Regression) with stochastic gradient descent (SGD) learning: the gradient of the loss is estimated one sample at a time, and the model is updated along the way with a decreasing learning rate.

Several loss functions can be used, namely *hinge loss* (which corresponds to SVM) and *log loss* (which corresponds to Logistic Regression). And as before, you can use L1 and/or L2 regularization.
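
To make the difference between the two losses concrete, here is a small sketch that evaluates both on a few hand-picked margins. The margin values are made up for illustration, and the margin is taken to be `y * f(x)` with labels in {-1, +1}.

```python
import numpy as np

# Hypothetical margins y * f(x); negative means misclassified
margins = np.array([-2.0, 0.0, 0.5, 1.0, 3.0])

hinge = np.maximum(0.0, 1.0 - margins)     # zero once the margin reaches 1
log_loss = np.log1p(np.exp(-margins))      # always positive, decays smoothly

for m, h, l in zip(margins, hinge, log_loss):
    print(f'margin={m:+.1f}  hinge={h:.3f}  log={l:.3f}')
```

Hinge loss ignores examples that are classified correctly with a margin of at least 1, while log loss keeps a small penalty on every example.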

The *max_iter* parameter allows you to set the maximum number of epochs, where an epoch corresponds to going through the whole dataset for training. Also, *learning_rate* allows you to set a learning rate schedule.

Several parameters define the stopping criteria: *tol* sets the tolerance by which the loss must improve, while *n_iter_no_change* sets how many consecutive epochs without such an improvement are allowed before training stops; *early_stopping* lets us check the stopping criterion on a validation set (a fraction *validation_fraction* of the training data) instead of on the training loss.

The *verbose* parameter allows you to set a verbosity (output) level.
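
The parameters described above can be combined as in the following sketch. It assumes synthetic data from `make_classification`, and the specific values chosen (20% validation split, patience of 5 epochs) are arbitrary picks for illustration, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic stand-in data (an assumption, not the notebook's dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=0)

sgd = SGDClassifier(
    loss='hinge',             # linear SVM objective
    penalty='l2',
    max_iter=1000,            # upper bound on the number of epochs
    tol=1e-3,                 # minimum improvement that counts as progress
    early_stopping=True,      # monitor a held-out validation score
    validation_fraction=0.2,  # share of the training data held out
    n_iter_no_change=5,       # epochs without progress before stopping
    random_state=0,
)
sgd.fit(X_demo, y_demo)

# n_iter_ reports how many epochs were actually run
print(sgd.n_iter_)
```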

Try using SGD, and explore different parameters!

In [78]:
# your code here
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss='perceptron', verbose=1)

clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(confusion_matrix(y_test, y_pred))
len([w for w in clf.coef_[0] if w != 0])

-- Epoch 1
Norm: 168.60, NNZs: 690, Bias: -5.551294, T: 800, Avg. loss: 3.824504
Total training time: 0.00 seconds.
-- Epoch 2
Norm: 147.52, NNZs: 886, Bias: -0.005331, T: 1600, Avg. loss: 1.362331
Total training time: 0.00 seconds.
-- Epoch 3
Norm: 123.74, NNZs: 951, Bias: -0.024917, T: 2400, Avg. loss: 0.507045
Total training time: 0.01 seconds.
-- Epoch 4
Norm: 106.00, NNZs: 978, Bias: 2.325033, T: 3200, Avg. loss: 0.262562
Total training time: 0.01 seconds.
-- Epoch 5
Norm: 92.60, NNZs: 991, Bias: -0.171952, T: 4000, Avg. loss: 0.156892
Total training time: 0.01 seconds.
-- Epoch 6
Norm: 82.38, NNZs: 1008, Bias: -0.256581, T: 4800, Avg. loss: 0.105431
Total training time: 0.01 seconds.
-- Epoch 7
Norm: 74.95, NNZs: 1022, Bias: -1.821244, T: 5600, Avg. loss: 0.060162
Total training time: 0.01 seconds.
-- Epoch 8
Norm: 68.41, NNZs: 1027, Bias: -0.339899, T: 6400, Avg. loss: 0.049913
Total training time: 0.02 seconds.
-- Epoch 9
Norm: 62.88, NNZs: 1031, Bias: -0.319476, T: 7200, Avg. 

1037

In [79]:
y_pred = clf.predict(X_test)

print("SGD with L2 regularization and tolerance of 1e-3")

# Assess the accuracy, precision, recall, and F1 score of the model
print(f'Accuracy: {accuracy_score(y_test, y_pred)}')
print(f'Precision: {precision_score(y_test, y_pred)}')
print(f'Recall: {recall_score(y_test, y_pred)}')
print(f'F1: {f1_score(y_test, y_pred)}')

SGD with L2 regularization and tolerance of 1e-3
Accuracy: 0.8
Precision: 0.7884615384615384
Recall: 0.82
F1: 0.803921568627451


Stochastic gradient descent updates the model weights based on one example at a time. Instead, we can compute the gradient over batches of training instances before updating the weights.

SGDClassifier allows us to do so via [*partial_fit*](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier.partial_fit), which corresponds to training the model with a specific set of examples for a single epoch. To properly use this method, we need to split our data into mini-batches and then iterate through them for as many epochs as we want.
Matters such as objective convergence, early stopping, and learning rate adjustments must be handled manually.

Try it out!

In [82]:
n_iter = 20

In [94]:
# your code here
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

def batch(iterable_X, iterable_y, n=1):
    l = len(iterable_X)
    for ndx in range(0, l, n):
        yield iterable_X[ndx:min(ndx + n, l)], iterable_y[ndx:min(ndx + n, l)]

clf = SGDClassifier(alpha=.0001, loss='log_loss', penalty='l2', n_jobs=-1, shuffle=True, max_iter=100, verbose=0, tol=0.001)
 
# run n_iter epochs (n_iter is defined in the previous cell)
for _ in range(n_iter):
    batcherator = batch(X_train, y_train, 10)
    for index, (chunk_X, chunk_y) in enumerate(batcherator):
        clf.partial_fit(chunk_X, chunk_y, classes=[0, 1])
 
        y_predicted = clf.predict(X_test)
        print(accuracy_score(y_test, y_predicted))

0.53
0.54
0.635
0.57
0.63
0.64
0.565
0.645
0.635
0.605
0.645
0.675
0.655
0.585
0.7
0.625
0.61
0.605
0.63
0.675
0.615
0.705
0.685
0.705
0.705
0.675
0.68
0.715
0.71
0.71
0.69
0.695
0.735
0.7
0.69
0.68
0.715
0.68
0.65
0.715
0.665
0.67
0.625
0.675
0.715
0.64
0.69
0.665
0.655
0.685
0.685
0.695
0.715
0.7
0.73
0.725
0.725
0.705
0.72
0.735
0.755
0.725
0.705
0.7
0.71
0.745
0.76
0.75
0.745
0.71
0.725
0.735
0.72
0.745
0.695
0.71
0.755
0.735
0.735
0.725
0.67
0.72
0.705
0.72
0.69
0.705
0.705
0.665
0.72
0.69
0.72
0.73
0.72
0.73
0.725
0.755
0.755
0.735
0.745
0.75
0.68
0.68
0.725
0.705
0.71
0.75
0.75
0.74
0.745
0.725
0.75
0.735
0.73
0.75
0.745
0.74
0.755
0.73
0.695
0.72
0.73
0.755
0.745
0.775
0.765
0.72
0.765
0.745
0.765
0.745
0.69
0.705
0.745
0.78
0.775
0.78
0.78
0.69
0.755
0.74
0.75
0.77
0.76
0.73
0.73
0.73
0.74
0.745
0.71
0.755
0.71
0.715
0.745
0.76
0.775
0.695
0.765
0.765
0.755
0.745
0.74
0.755
0.74
0.74
0.725
0.72
0.72
0.74
0.735
0.71
0.77
0.77
0.775
0.775
0.77
0.765
0.765
0.775
0.765
0.75
0.705
