## Regularisation for Logistic Regression

We will demonstrate how **regularisation** can help address overfitting in binary logistic regression using a dataset of network information, where the goal is to predict whether a network attack is detected

In addition, we will carry out **hyperparameter tuning** to identify the optimal regularisation strength

Let's start by importing the necessary libraries

In [None]:
from sklearn.linear_model import LogisticRegression  # Logistic regression model
from sklearn.preprocessing import MinMaxScaler  # Scaling
from sklearn.model_selection import train_test_split, GridSearchCV  # Train-test split and grid search for hyperparameter tuning
from sklearn.metrics import accuracy_score  # Classification performance
import numpy as np; import pandas as pd; import matplotlib.pyplot as plt; import seaborn as sns  # Data processing and visualisation
import warnings; warnings.filterwarnings('ignore')

In [None]:
df = pd.read_csv('network_data.csv', index_col = 'session_id').iloc[:99, :]; df.head()

In [None]:
df.info()

In [None]:
df['encryption_used'].unique()

We assume that null values in `'encryption_used'` mean no encryption was used; we can treat this as a category of its own, so let's set these values to the string `'None'`

In [None]:
df['encryption_used'] = df['encryption_used'].fillna('None')  # Assume null values mean there is no encryption

We can see that our data contains a mix of numerical and categorical predictors; let's perform one-hot encoding on the categorical predictors

In [None]:
df_with_dummies = pd.get_dummies(df, drop_first = True)  # One-hot encoding
df_with_dummies.info()
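To make the effect of `drop_first = True` concrete, here is a minimal sketch on a toy DataFrame. The column names `protocol` and `packet_size` are made up for illustration and are not taken from `network_data.csv`.

```python
import pandas as pd

# Hypothetical toy data: one categorical and one numerical column
toy = pd.DataFrame({'protocol': ['TCP', 'UDP', 'ICMP', 'TCP'],
                    'packet_size': [100, 250, 80, 120]})

# One dummy column per category, minus the first category (alphabetically 'ICMP'),
# which becomes the baseline that the other categories are compared against
toy_dummies = pd.get_dummies(toy, drop_first = True)
print(toy_dummies.columns.tolist())
```

Dropping one level per categorical predictor avoids perfectly collinear dummy columns; a row with zeros in every `protocol_*` column belongs to the dropped baseline category.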

Let's proceed by performing a train-test split on our data

In [None]:
X = df_with_dummies.drop('attack_detected', axis = 1); y = df_with_dummies['attack_detected']  # Predictors and target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)  # Train-test split
print('Dimensions of X_train:', X_train.shape); print('Dimensions of y_train:', y_train.shape)
print('Dimensions of X_test:', X_test.shape); print('Dimensions of y_test:', y_test.shape)

Let's now check to see if our numerical predictors are on the same scale

In [None]:
X.describe().T

We can see here that our predictors are not on the same scale

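In general, and **especially when regularising a model**, it is important that our predictors be on the same scale: the penalty acts on the size of the coefficients, and coefficient size depends on the units of each predictor. We will use min-max scaling to achieve this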

Remember, you should fit the scaler only on the training data and then apply that fitted transformation to the testing data. This ensures the testing set remains unseen and unbiased, while still being scaled consistently with the training data.

In [None]:
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
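As a small illustration of fitting on the training data only, the sketch below uses made-up one-column arrays (`train_demo` and `test_demo` are hypothetical, not part of our dataset). The minimum and maximum are learned from the training values, so a test value outside that range is mapped outside $[0, 1]$.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical values: training range is 0 to 10
train_demo = np.array([[0.0], [5.0], [10.0]])
test_demo = np.array([[2.5], [20.0]])

scaler_demo = MinMaxScaler().fit(train_demo)   # Learns min = 0 and max = 10 from the training data only
test_scaled = scaler_demo.transform(test_demo)  # Applies (x - 0) / (10 - 0) to the test data
print(test_scaled.ravel())  # [0.25 2.  ]
```

The value 20 becomes 2.0 rather than 1.0, which is expected: the testing data played no part in choosing the scale.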

Let's now fit a logistic regression model to our scaled training data and evaluate the accuracy of its predictions

In [None]:
logreg_model = LogisticRegression(); logreg_model.fit(X_train_scaled, y_train)  # Fitting logistic regression model on scaled training data
y_pred_train = logreg_model.predict(X_train_scaled); y_pred_test = logreg_model.predict(X_test_scaled)  # Training and testing predictions
acc_train = accuracy_score(y_train, y_pred_train); acc_test = accuracy_score(y_test, y_pred_test)  # Training and testing accuracies
print('Training accuracy =', np.round(acc_train, 2)); print('Testing accuracy =', np.round(acc_test, 2))

### Lasso (L1) Regularisation

Let's now apply lasso regularisation using the `'liblinear'` solver; this solver supports L1 regularisation

The parameter `C` is the inverse of the regularisation strength $\alpha$, i.e., a larger `C` means a weaker penalty, so we are trusting our data more. We will start with `C = 100`

In [None]:
logreg_model = LogisticRegression(penalty = 'l1', solver = 'liblinear', C = 100); logreg_model.fit(X_train_scaled, y_train)  # Model fit on scaled data
y_pred_train = logreg_model.predict(X_train_scaled); y_pred_test = logreg_model.predict(X_test_scaled)  # Predictions
acc_train = accuracy_score(y_train, y_pred_train); acc_test = accuracy_score(y_test, y_pred_test)  # Accuracies
print('Training accuracy =', np.round(acc_train, 2)); print('Testing accuracy =', np.round(acc_test, 2))
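To see how `C` controls the strength of the L1 penalty, here is a minimal sketch on synthetic data generated with `make_classification` (not our network dataset; the names `X_demo`, `y_demo` and `nonzero_counts` are introduced only for this illustration). It counts how many coefficients remain non-zero for a few values of `C`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

# Synthetic data: 20 predictors, of which only 4 are informative
X_demo, y_demo = make_classification(n_samples = 200, n_features = 20, n_informative = 4, random_state = 0)
X_demo = MinMaxScaler().fit_transform(X_demo)  # Same scaling approach as above

nonzero_counts = {}
for C in [0.01, 1, 100]:
    model = LogisticRegression(penalty = 'l1', solver = 'liblinear', C = C).fit(X_demo, y_demo)
    nonzero_counts[C] = int(np.sum(model.coef_ != 0))  # Coefficients that survived the penalty
    print('C =', C, '-> non-zero coefficients:', nonzero_counts[C])
```

A small `C` (strong penalty) should leave few or no non-zero coefficients, while a large `C` (weak penalty) should keep most of them.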

`C` is a **hyperparameter** as it is set before training begins rather than learned from the data. Let's try to find an ideal `C` value by performing hyperparameter tuning using `GridSearchCV` from `sklearn.model_selection`. It evaluates every value in a 'grid' of parameters using cross-validation; here the grid is simply an array of different `C` values.

In [None]:
# Define parameter grid
param_grid = {'C': np.logspace(-3, 3, num = 7)}  # from 0.001 to 1000 (logarithmic)
param_grid

In [None]:
logreg = LogisticRegression(penalty = 'l1', solver = 'liblinear', max_iter = 1000)  # Cap on the number of solver iterations per fit

# GridSearch with CV
grid_search = GridSearchCV(logreg, param_grid, cv = 5, scoring = 'accuracy', return_train_score = True)
grid_search.fit(X_train_scaled, y_train)  # Search on the scaled training data

# Extract results
results = grid_search.cv_results_
C_values = results['param_C'].data.astype(float)
mean_test = results['mean_test_score']
mean_train = results['mean_train_score']

# Plot with a logarithmic X-axis
plt.figure(figsize = (6, 4))
plt.semilogx(C_values, mean_train, marker = 'o', label = 'Training Accuracy', color = 'orange')
plt.semilogx(C_values, mean_test, marker = 'o', label = 'Validation Accuracy', color = 'blue')

plt.suptitle('Lasso (L1) Regularisation')
plt.title('Accuracy vs Inverse of Regularisation Strength (C)'); plt.xlabel('C (log scale)'); plt.ylabel('Accuracy'); plt.legend();

Choosing the right `C` value is ultimately a judgement call for the data scientist. It involves balancing the trade-off between fitting the training data well and achieving good validation performance: too large a `C` (weak regularisation) can overfit, while too small a `C` (strong regularisation) can underfit.
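One common starting point for that judgement is the value with the highest mean validation accuracy, which a fitted `GridSearchCV` object exposes through `best_params_`. The sketch below shows this on synthetic data from `make_classification`; all names ending in `_demo` are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for our data
X_demo, y_demo = make_classification(n_samples = 300, n_features = 15, n_informative = 5, random_state = 0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_demo, y_demo, test_size = 0.3, random_state = 0)

# Fit the scaler on the training data only, then transform both sets
scaler_demo = MinMaxScaler().fit(Xd_train)
Xd_train_s = scaler_demo.transform(Xd_train); Xd_test_s = scaler_demo.transform(Xd_test)

param_grid_demo = {'C': np.logspace(-3, 3, num = 7)}
grid_demo = GridSearchCV(LogisticRegression(penalty = 'l1', solver = 'liblinear'), param_grid_demo, cv = 5, scoring = 'accuracy')
grid_demo.fit(Xd_train_s, yd_train)

best_C = grid_demo.best_params_['C']  # C with the highest mean validation accuracy
test_acc = grid_demo.best_estimator_.score(Xd_test_s, yd_test)  # Model refitted on all training data with best_C
print('Best C =', best_C); print('Testing accuracy =', np.round(test_acc, 2))
```

If several `C` values give similar validation accuracy, it is reasonable to prefer the smaller one, since it corresponds to a simpler, more strongly regularised model.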

### Ridge (L2) Regularisation  

Let’s now apply ridge regularisation using the `'liblinear'` solver; this solver also supports L2 regularisation

In [None]:
logreg_model = LogisticRegression(penalty = 'l2', solver = 'liblinear', C = 100); logreg_model.fit(X_train_scaled, y_train)  # Model fit on scaled data
y_pred_train = logreg_model.predict(X_train_scaled); y_pred_test = logreg_model.predict(X_test_scaled)  # Predictions
acc_train = accuracy_score(y_train, y_pred_train); acc_test = accuracy_score(y_test, y_pred_test)  # Accuracies
print('Training accuracy =', np.round(acc_train, 2)); print('Testing accuracy =', np.round(acc_test, 2))

We will use the same parameter grid for `C` as before

In [None]:
param_grid

In [None]:
logreg = LogisticRegression(penalty = 'l2', solver = 'liblinear', max_iter = 1000)  # Cap on the number of solver iterations per fit

# GridSearch with CV
grid_search = GridSearchCV(logreg, param_grid, cv = 5, scoring = 'accuracy', return_train_score = True)
grid_search.fit(X_train_scaled, y_train)  # Search on the scaled training data

# Extract results
results = grid_search.cv_results_
C_values = results['param_C'].data.astype(float)
mean_test = results['mean_test_score']
mean_train = results['mean_train_score']

# Plot with a logarithmic X-axis
plt.figure(figsize = (6, 4))
plt.semilogx(C_values, mean_train, marker = 'o', label = 'Training Accuracy', color = 'orange')
plt.semilogx(C_values, mean_test, marker = 'o', label = 'Validation Accuracy', color = 'blue')

plt.suptitle('Ridge (L2) Regularisation')
plt.title('Accuracy vs Inverse of Regularisation Strength (C)'); plt.xlabel('C (log scale)'); plt.ylabel('Accuracy'); plt.legend();

Once again, the final choice of `C` depends on you as the data scientist

### Comparison

Let's compare how our model coefficients change depending on the kind of regularisation; let's go for `C = 1` in both cases here

In [None]:
# Fit L1 and L2 models
logreg_l1 = LogisticRegression(penalty = 'l1', solver = 'liblinear', C = 1, max_iter = 1000).fit(X_train_scaled, y_train)
logreg_l2 = LogisticRegression(penalty = 'l2', solver = 'liblinear', C = 1, max_iter = 1000).fit(X_train_scaled, y_train)

# Coefficients
coef_l1 = logreg_l1.coef_.flatten()
coef_l2 = logreg_l2.coef_.flatten()

# Plot coefficient values in ascending order (each model is sorted separately)
plt.figure(figsize = (6, 4))
plt.plot(np.sort(coef_l1), marker = 'o', label = 'L1 (Lasso)')
plt.plot(np.sort(coef_l2), marker = 'x', label = 'L2 (Ridge)')

plt.title('Coefficient Values: L1 vs L2 Regularisation'); plt.xlabel('Feature Index (sorted)'); plt.ylabel('Coefficient Value'); plt.legend()
plt.grid(True, linestyle = '--', alpha = 0.7);

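You can observe how lasso tends to set some coefficients to exactly zero, which shows up as a flat stretch at zero with sharp kinks at either end, while ridge shrinks coefficients gradually and tends to give a smoother curve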

Here are the differences between these two regularisation techniques

| Aspect                  | L1 Regularisation (Lasso)              | L2 Regularisation (Ridge)           |
|--------------------------|-----------------------------------------|--------------------------------------|
| Penalty Term            | Sum of absolute values of coefficients | Sum of squared values of coefficients |
| Effect on Coefficients  | Forces some coefficients to exactly 0 (feature selection) | Shrinks coefficients but rarely makes them 0 |
| Model Interpretation    | Produces sparse models, easier to interpret | Retains all features, less sparse |
| When Useful             | High-dimensional data, need feature selection | Multicollinearity, need stability |
| Solver Support (`sklearn`)| `'liblinear'`, `'saga'`                 | Most solvers (`'liblinear'`, `'lbfgs'`, `'saga'`, etc.) |
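The penalty terms in the table can be written out explicitly. The following is a sketch of the two objectives in the form given in the scikit-learn user guide, assuming labels coded as $y_i \in \{-1, +1\}$; the symbols $w$ (coefficients), $b$ (intercept), $x_i$ (predictors of observation $i$) and $n$ (number of observations) are notation introduced here.

$$
\text{L1 (lasso):} \quad \min_{w,\, b} \; \lVert w \rVert_1 + C \sum_{i=1}^{n} \log\left(1 + e^{-y_i (x_i^\top w + b)}\right)
$$

$$
\text{L2 (ridge):} \quad \min_{w,\, b} \; \tfrac{1}{2} \lVert w \rVert_2^2 + C \sum_{i=1}^{n} \log\left(1 + e^{-y_i (x_i^\top w + b)}\right)
$$

Dividing either objective by `C` gives the data-fit term plus $\alpha = 1/C$ times the penalty, which is why a larger `C` corresponds to weaker regularisation.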