# Logistic Regression using SGDClassifier
* We'll implement logistic regression using the SGDClassifier class of
scikit-learn: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html
* We'll set the loss parameter to 'log_loss' (named 'log' in scikit-learn
versions before 1.1), which makes log loss the cost function.
* The penalty parameter selects the regularization term used to reduce
overfitting.
* The learning_rate parameter can be set to 'optimal', where the learning rate
decreases slightly as more updates are made. The code below uses 'constant'
with eta0 = 0.01 instead, so the learning rate stays fixed.
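
To make the roles of log loss and the learning rate concrete, here is a minimal NumPy sketch of logistic regression trained with plain stochastic gradient descent at a constant learning rate. The toy data and the helper names (`sigmoid`, `log_loss`, `sgd_epoch`) are assumptions for illustration only; SGDClassifier additionally shuffles samples, applies the chosen penalty, and uses its own stopping criteria.

```python
import numpy as np

def sigmoid(z):
    """Map raw scores to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, p, eps=1e-12):
    """Average log loss of predicted probabilities p against labels y."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def sgd_epoch(X, y, w, b, eta=0.01):
    """One pass over the data, updating the weights after every sample."""
    for x_i, y_i in zip(X, y):
        error = sigmoid(x_i @ w + b) - y_i  # gradient of log loss w.r.t. the score
        w = w - eta * error * x_i
        b = b - eta * error
    return w, b

# Toy data (hypothetical): the label depends on the first two features only
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] + 0.5 * X_toy[:, 1] > 0).astype(float)

w, b = np.zeros(3), 0.0
loss_before = log_loss(y_toy, sigmoid(X_toy @ w + b))
for _ in range(5):
    w, b = sgd_epoch(X_toy, y_toy, w, b)
loss_after = log_loss(y_toy, sigmoid(X_toy @ w + b))

print(f"log loss before: {loss_before:.3f}, after: {loss_after:.3f}")
```

With all weights at zero every prediction is 0.5, so the starting loss is ln(2); each epoch of updates should then lower it.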

## Step 1: Preprocessing the data

In [19]:
import timeit

# Processing the data
import pandas as pd
n_rows = 1_000_000
df = pd.read_csv("./dataset/train.csv", nrows = n_rows)

# Splitting the column features from the target values
X = df.drop(['click', 'id', 'hour', 'device_id', 'device_ip'], axis=1).values
y = df['click'].values

# We will only train the model using 100,000 samples
n_train = 100_000
X_train = X[:n_train]
y_train = y[:n_train]
X_test = X[n_train:]
y_test = y[n_train:]
# Performing one-hot encoding
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(handle_unknown = "ignore")
X_train_enc = enc.fit_transform(X_train)
X_test_enc = enc.transform(X_test)


## Step 2: Preparing the model

In [20]:
from sklearn.linear_model import SGDClassifier
# Preparing the classifier:
sgd_lr = SGDClassifier(loss = 'log_loss', penalty = None, fit_intercept = True,
                       max_iter = 30, learning_rate = 'constant', eta0 = 0.01,
                       verbose = 1)


## Step 3: Training and evaluating

In [21]:
# Training the model:
sgd_lr.fit(X_train_enc.toarray(), y_train)

sgd_lr.get_params()


-- Epoch 1
Norm: 3.72, NNZs: 5725, Bias: -0.299019, T: 100000, Avg. loss: 0.423011
Total training time: 1.27 seconds.
-- Epoch 2
Norm: 5.19, NNZs: 5725, Bias: -0.328566, T: 200000, Avg. loss: 0.416346
Total training time: 2.60 seconds.
-- Epoch 3
Norm: 6.29, NNZs: 5725, Bias: -0.340531, T: 300000, Avg. loss: 0.414005
Total training time: 3.82 seconds.
-- Epoch 4
Norm: 7.28, NNZs: 5725, Bias: -0.292990, T: 400000, Avg. loss: 0.412099
Total training time: 5.04 seconds.
-- Epoch 5
Norm: 8.14, NNZs: 5725, Bias: -0.323408, T: 500000, Avg. loss: 0.411123
Total training time: 6.26 seconds.
-- Epoch 6
Norm: 8.93, NNZs: 5725, Bias: -0.310441, T: 600000, Avg. loss: 0.410106
Total training time: 7.47 seconds.
-- Epoch 7
Norm: 9.63, NNZs: 5725, Bias: -0.325266, T: 700000, Avg. loss: 0.409001
Total training time: 8.69 seconds.
-- Epoch 8
Norm: 10.33, NNZs: 5725, Bias: -0.321460, T: 800000, Avg. loss: 0.408316
Total training time: 9.91 seconds.
-- Epoch 9
Norm: 10.99, NNZs: 5725, Bias: -0.328541, T:

{'alpha': 0.0001,
 'average': False,
 'class_weight': None,
 'early_stopping': False,
 'epsilon': 0.1,
 'eta0': 0.01,
 'fit_intercept': True,
 'l1_ratio': 0.15,
 'learning_rate': 'constant',
 'loss': 'log_loss',
 'max_iter': 30,
 'n_iter_no_change': 5,
 'n_jobs': None,
 'penalty': None,
 'power_t': 0.5,
 'random_state': None,
 'shuffle': True,
 'tol': 0.001,
 'validation_fraction': 0.1,
 'verbose': 1,
 'warm_start': False}

In [22]:
from sklearn.metrics import roc_auc_score

pred = sgd_lr.predict_proba(X_test_enc.toarray())[:, 1]
print(f'Training samples: {n_train}, AUC on testing set: {roc_auc_score(y_test, pred):.3f}')


Training samples: 100000, AUC on testing set: 0.725


## Feature selection using L1 Regularization
* The regularization type is specified with the penalty parameter in scikit-learn.
* L1 (Lasso) regularization enables feature selection: it drives the weights of
uninformative features to exactly zero while leaving the others comparatively
large, whereas L2 shrinks all weights toward zero without eliminating any. This
makes it easy to identify the features that contribute little to minimizing
the cost function.
* The parameter $\alpha$ controls the trade-off between log loss and
generalization. If $\alpha$ is too small, the penalty cannot shrink large
weights and the model may suffer from high variance (overfitting). If $\alpha$
is too large, the model may be over-regularized and underfit the dataset.

In [23]:
# We need to retrain the model to find the features and enable the penalty 'l1'
sgd_lr_l1 = SGDClassifier(loss = 'log_loss', penalty = 'l1', alpha = 0.0001,
                          fit_intercept = True, max_iter = 10,
                          learning_rate = 'constant', eta0 = 0.01)
sgd_lr_l1.fit(X_train_enc.toarray(), y_train)


In [24]:
import numpy as np
# Checking the absolute values of the coefficients
coef_abs = np.abs(sgd_lr_l1.coef_)

# Getting the bottom 10 coefficients
bottom_10 = np.argsort(coef_abs)[0][:10]

# Printing the values
print(np.sort(coef_abs)[0][:10])

# Printing the feature names
feature_names = enc.get_feature_names_out()
print(f"The 10 least important features are:\n {feature_names[bottom_10]}")

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
The 10 least important features are:
 ['x0_1001' 'x8_84a9d4ba' 'x8_84915a27' 'x8_8441e1f3' 'x8_840161a0'
 'x8_83fbdb80' 'x8_83fb63cd' 'x8_83ed0b87' 'x8_83cd1c10' 'x8_83ca6fdb']


They are 1001 from column 0 (the C1 column) in X_train, "84a9d4ba" from column 8 (the device_model column), and so on.

In [25]:
# Now getting the top 10 coefficients
print(np.sort(coef_abs)[0][-10:])

# Getting the values in a variable
top_10 = np.argsort(coef_abs)[0][-10:]

# Printing the feature names
print(f"The 10 most important features are:\n {feature_names[top_10]}")


[0.71606761 0.72316631 0.78642518 0.83673987 0.92622106 1.05117715
 1.0697904  1.09063128 1.10990209 1.32322116]
The 10 most important features are:
 ['x3_7687a86e' 'x4_28905ebd' 'x18_15' 'x18_61' 'x5_5e3f096f' 'x5_9c13b419'
 'x2_763a42b5' 'x3_27e3c518' 'x2_d9750ee7' 'x5_1779deee']


They are "7687a86e" from column 3 (site_domain) in X_train, "28905ebd" from column 4 (site_category), and so on.



## Online Learning
* We'll use the partial_fit() method to train the model with 100_000 samples
at a time, which reduces the cost of feeding the complete dataset at once
(and means we don't have to retrain the model from scratch when we want to
add new data).
* This allows models to be trained on real-time data.
* This time, we'll feed the model 1_000_000 training samples, so we need to
redefine our training and testing datasets.


In [39]:
# Redefining our variables
n_rows = 100_000 * 11
df = pd.read_csv("./dataset/train.csv", nrows = n_rows)
# Splitting the features from the target
X = df.drop(['click', 'id', 'hour', 'device_id', 'device_ip'], axis=1).values
Y = df['click'].values

# Splitting in training and testing
n_train = 100000 * 10
X_train = X[:n_train]
Y_train = Y[:n_train]
X_test = X[n_train:]
Y_test = Y[n_train:]

In [40]:
# One hot encoding
enc = OneHotEncoder(handle_unknown = 'ignore')
enc.fit(X_train)

In [41]:
# Initializing the SGD model. max_iter is set to 1 for online learning
sgd_lr_online = SGDClassifier(loss = 'log_loss', penalty = 'l1',
                              fit_intercept = True, max_iter = 1,
                              learning_rate = 'constant', eta0 = 0.01)


In [42]:
import timeit
# Building a loop (10 times). We need to specify the classes in online learning
start_time = timeit.default_timer()
for i in range(10):
    x_train = X_train[i * 100_000: (i+1) * 100_000]
    y_train = Y_train[i * 100_000: (i+1) * 100_000]
    x_train_enc = enc.transform(x_train)
    sgd_lr_online.partial_fit(x_train_enc.toarray(), y_train, classes = [0, 1])

print(f"--- {(timeit.default_timer() - start_time):.3f} seconds ---")

--- 105.164 seconds ---


In [44]:
# Applying the trained model on the testing set, the final 100_000 samples
x_test_enc = enc.transform(X_test)

pred = sgd_lr_online.predict_proba(x_test_enc.toarray())[:, 1]
print(f'Training samples: {n_train}, AUC on testing set: {roc_auc_score(Y_test, pred):.3f}')

Training samples: 1000000, AUC on testing set: 0.754
