<a href="https://colab.research.google.com/github/Youssef-ElBakry/Logistic-Regression-ML/blob/main/Logistic_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Introduction
Logistic Regression

The dataset: The dataset used in this model is from Kaggle. Each row holds 4 values describing the appearance of one banknote:
- variance
- skewness
- curtosis
- entropy

These variables can be used to predict whether a banknote is real or forged. The data was collected using an industrial camera, so while it is technically not a toy dataset, it is about as simple as one.

More information about the dataset can be found here: https://www.kaggle.com/datasets/shanks0465/banknoteauthentication/data


The model used in the following program is logistic regression. It predicts a binary value (in this instance, real or forged) from a set of numerical values.
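
For reference, the model and the update rule implemented by the code in this notebook can be written compactly. This is the standard formulation of logistic regression trained by gradient descent; the symbols are notation chosen here as an assumption: $\mathbf{x}_i$ is a feature row with a leading 1 for the intercept, $X$ stacks those rows, $\mathbf{w}$ is the weight vector, $\eta$ is the learning rate and $N$ is the number of training rows.

```latex
p_i = \sigma(\mathbf{w}^\top \mathbf{x}_i), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}

L(\mathbf{w}) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]

\nabla_{\mathbf{w}} L = \frac{1}{N} X^\top (\mathbf{p} - \mathbf{y}), \qquad \mathbf{w} \leftarrow \mathbf{w} - \eta \, \nabla_{\mathbf{w}} L
```

A note is labelled as class 1 when $p_i$ is at or above the decision threshold, which is 0.5 unless stated otherwise.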



#Logistic Regression

In [27]:
#Import libraries
import pandas as pd
import numpy as np

#Open Dataset
url = "https://raw.githubusercontent.com/Youssef-ElBakry/Logistic-Regression-ML/main/data_banknote_authentication.csv"
cols = ["Variance", "Skewness", "Curtosis", "Entropy", "Class"] #Adding headers to file as dataset has no headers
df = pd.read_csv(url, header=None, names=cols)
df["Class"] = df["Class"].astype(int)

#Define global values
trainingRatio = 0.8
seed = 450 #Fixed seed for rng as it makes results reproducible


#Split data into training and testing
#80/20 used as there is no hyperparameter tuning and therefore no need for validation
rng = np.random.default_rng(seed)
test_I = []
train_I = []

#Shuffle each class separately and split it into training and testing (a stratified split)
for key, row in df.groupby("Class", sort=False): #sort must be the boolean False; the string "False" is truthy
  index = row.index.to_numpy()
  rng.shuffle(index)
  trainingRows = int(round(len(index) * trainingRatio)) #Determine size of training set in each class. This allows for a more even split. (80% of the Authentic class and 80% of the forged class)
  train_I.extend(index[:trainingRows])
  test_I.extend(index[trainingRows:])

#Shuffle training and testing set again as they are currently grouped by class
train = df.loc[train_I].sample(frac=1, random_state=seed).reset_index(drop=True)
test = df.loc[test_I].sample(frac=1, random_state=seed).reset_index(drop=True)

FEATS = ["Variance", "Skewness", "Curtosis", "Entropy"]

X_train = train[FEATS].to_numpy(dtype=np.float64)
y_train = train["Class"].to_numpy(dtype=np.float64)
X_test  = test[FEATS].to_numpy(dtype=np.float64)
y_test  = test["Class"].to_numpy(dtype=np.float64)

#Add a column of ones, to allow for intercept
def add_bias(X):
  return np.c_[np.ones((X.shape[0], 1)), X]  # shape: (N, D+1)

Xtr = add_bias(X_train)   # (n_train, 1+4)
Xte = add_bias(X_test)

#Sigmoid function
def sigmoid(z):
  #Clip z to the range [-500, 500] so that np.exp(-z) cannot overflow
  z = np.clip(z, -500, 500)
  return 1.0 / (1.0 + np.exp(-z))

def gradient(W, X, y):
  #Get number of samples (rows) in array
  N = X.shape[0]

  z = X @ W   #Linear score for every sample: feature matrix times weights
  p = sigmoid(z)    #Map each score to a probability between 0 and 1

  # gradient of the mean log-loss w.r.t. W
  err  = p - y    #Predicted probability minus actual label
  grad = (X.T @ err) / N  #Vectorised gradient, averaged over the samples

  return grad

def fit_logreg(X, y, lr=0.1, epochs=2000, seed=450):
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.01, size=X.shape[1])   # small random init, shape (D,)
    for t in range(epochs):
        grad = gradient(W, X, y)
        W -= lr * grad #Gradient descent step
        if (t % 200 == 0 or t == epochs - 1):
            print(f"epoch {t:4d}")

    return W

W = fit_logreg(Xtr, y_train, lr=0.1, epochs=2000, seed=450)

def predict_proba(X, W):
    return sigmoid(X @ W)

def predict_label(X, W, thresh=0.5):
    return (predict_proba(X, W) >= thresh).astype(int)

proba = predict_proba(Xte, W)
pred  = predict_label(Xte, W)

# metrics
tp = int(((pred == 1) & (y_test == 1)).sum())
tn = int(((pred == 0) & (y_test == 0)).sum())
fp = int(((pred == 1) & (y_test == 0)).sum())
fn = int(((pred == 0) & (y_test == 1)).sum())

acc  = (tp + tn) / (tp + tn + fp + fn) #Share of all predictions that are correct
prec = tp / (tp + fp) if (tp + fp) else 0.0 #Share of predicted positives that are actually positive
rec  = tp / (tp + fn) if (tp + fn) else 0.0 #Share of actual positives (true positives + false negatives) that were predicted positive
f1   = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0

print(f"\nPositives [{tp:3}  {fp:3}]")
print(f"Negatives [{tn:3}  {fn:3}]")
print(f"\nTest Accuracy={acc:.3f} Recall={rec:.3f}  Precision={prec:.3f}  F1={f1:.3f}")

# inspect learned weights
feat_names = ["(bias)", "Variance", "Skewness", "Curtosis", "Entropy"]
for name, w in zip(feat_names, W):
    print(f"{name:>10s}: {w:+.4f}")

epoch    0
epoch  200
epoch  400
epoch  600
epoch  800
epoch 1000
epoch 1200
epoch 1400
epoch 1600
epoch 1800
epoch 1999

Positives [122    3]
Negatives [149    0]

Test Accuracy=0.989 Recall=1.000  Precision=0.976  F1=0.988
    (bias): +2.6974
  Variance: -2.5511
  Skewness: -1.5168
  Curtosis: -1.7762
   Entropy: -0.2371


The above model is very effective at differentiating between real and forged banknotes. With an accuracy of 98.9%, there is only a small amount of room for improvement.
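
The reported metrics follow directly from the four confusion-matrix counts printed above. As a quick check, this small sketch recomputes them from those counts; the helper name `metrics_from_counts` is made up for this illustration and is not part of the notebook.

```python
def metrics_from_counts(tp, fp, tn, fn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    acc = (tp + tn) / total if total else 0.0
    prec = tp / (tp + fp) if (tp + fp) else 0.0
    rec = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
    return acc, prec, rec, f1

# Counts printed by the run above: tp=122, fp=3, tn=149, fn=0
acc, prec, rec, f1 = metrics_from_counts(tp=122, fp=3, tn=149, fn=0)
print(f"Accuracy={acc:.3f} Recall={rec:.3f}  Precision={prec:.3f}  F1={f1:.3f}")
# prints: Accuracy=0.989 Recall=1.000  Precision=0.976  F1=0.988
```

So the 98.9% accuracy corresponds to 271 correct predictions out of 274 test rows, and all 3 errors are false positives.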

The model errs towards predicting that a note is real rather than forged. Its recall is 1.000, meaning every single real banknote in the test set was correctly determined to be real, while its only errors are 3 false positives. In a real scenario this bias would be undesirable: false positives are worse than false negatives, as adding fake banknotes into circulation would have far more negative consequences than throwing away real banknotes. The balance can be adjusted by moving the decision threshold away from 0.5, which is explored in the next example.
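
To illustrate how the threshold trades false positives against false negatives, here is a minimal sketch. The probabilities and labels are made up for illustration only (they are not the banknote data), and `precision_recall` is a hypothetical helper.

```python
import numpy as np

def precision_recall(proba, y_true, thresh):
    """Precision and recall of the rule 'predict 1 when proba >= thresh'."""
    pred = (proba >= thresh).astype(int)
    tp = int(((pred == 1) & (y_true == 1)).sum())
    fp = int(((pred == 1) & (y_true == 0)).sum())
    fn = int(((pred == 0) & (y_true == 1)).sum())
    prec = tp / (tp + fp) if (tp + fp) else 0.0
    rec = tp / (tp + fn) if (tp + fn) else 0.0
    return prec, rec

# Made-up predicted probabilities and true labels, for illustration only
proba = np.array([0.95, 0.90, 0.80, 0.65, 0.60, 0.55, 0.30, 0.10])
y_true = np.array([1, 1, 1, 0, 1, 0, 0, 0])

for thresh in (0.5, 0.7, 0.85):
    prec, rec = precision_recall(proba, y_true, thresh)
    print(f"threshold={thresh:.2f}  precision={prec:.3f}  recall={rec:.3f}")
# threshold=0.50  precision=0.667  recall=1.000
# threshold=0.70  precision=1.000  recall=0.750
# threshold=0.85  precision=1.000  recall=0.500
```

Raising the threshold removes false positives first and then starts to cost recall, so the useful range is wherever precision improves while recall holds.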

#Logistic regression with parameter tuning

In [23]:
#Import libraries
import pandas as pd
import numpy as np

#Open Dataset
url = "https://raw.githubusercontent.com/Youssef-ElBakry/Logistic-Regression-ML/main/data_banknote_authentication.csv"
cols = ["Variance", "Skewness", "Curtosis", "Entropy", "Class"] #Adding headers to file as dataset has no headers
df = pd.read_csv(url, header=None, names=cols)
df["Class"] = df["Class"].astype(int)

#Define global values
trainingRatio = 0.8
seed = 450 #Fixed seed for rng as it makes results reproducible


#Split data into training and testing
rng = np.random.default_rng(seed)
test_I = []
train_I = []

#Shuffle each class separately and split it into training and testing (a stratified split)
for key, row in df.groupby("Class", sort=False): #sort must be the boolean False; the string "False" is truthy
  index = row.index.to_numpy()
  rng.shuffle(index)
  trainingRows = int(round(len(index) * trainingRatio))
  train_I.extend(index[:trainingRows])
  test_I.extend(index[trainingRows:])

#Shuffle training and testing set again as they are currently grouped by class
train = df.loc[train_I].sample(frac=1, random_state=seed).reset_index(drop=True)
test = df.loc[test_I].sample(frac=1, random_state=seed).reset_index(drop=True)

FEATS = ["Variance", "Skewness", "Curtosis", "Entropy"]

X_train = train[FEATS].to_numpy(dtype=np.float64)
y_train = train["Class"].to_numpy(dtype=np.float64)
X_test  = test[FEATS].to_numpy(dtype=np.float64)
y_test  = test["Class"].to_numpy(dtype=np.float64)

#Add a column of ones, to allow for intercept
def add_bias(X):
  return np.c_[np.ones((X.shape[0], 1)), X]  # shape: (N, D+1)

Xtr = add_bias(X_train)   # (n_train, 1+4)
Xte = add_bias(X_test)

#Sigmoid function
def sigmoid(z):
  z = np.clip(z, -500, 500)
  return 1.0 / (1.0 + np.exp(-z))

def gradient(W, X, y):
  N = X.shape[0]

  z = X @ W
  p = sigmoid(z)

  err  = p - y
  grad = (X.T @ err) / N

  return grad

def fit_logreg(X, y, seed, lr, epochs=2000):
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.01, size=X.shape[1])
    for t in range(epochs):
        grad = gradient(W, X, y)
        W -= lr * grad
    return W

def predict_proba(X, W):
    return sigmoid(X @ W)

def predict_label(X, W, thresh):
    return (predict_proba(X, W) >= thresh).astype(int)

#Grid search over threshold and learning rate
#Define threshold grid
threshold_grid = [0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75]
#Define learning rate grid
lr_grid = [0.001, 0.01, 0.1, 0.15, 0.2, 0.25, 0.3]
bestF1 = bestLr = bestThreshold = bestAcc = bestPrec = bestRec = bestW = bestTP = bestFP = bestTN = bestFN = 0
#Try every learning rate with every threshold. This trains the model N times,
#N being (length of lr_grid * length of threshold_grid)
for threshold in threshold_grid:
  for lr in lr_grid:
    W = fit_logreg(Xtr, y_train, seed, lr, epochs=2000)
    proba = predict_proba(Xte, W)
    pred  = predict_label(Xte, W, threshold)
    tp = int(((pred == 1) & (y_test == 1)).sum())
    tn = int(((pred == 0) & (y_test == 0)).sum())
    fp = int(((pred == 1) & (y_test == 0)).sum())
    fn = int(((pred == 0) & (y_test == 1)).sum())
    acc  = (tp + tn) / (tp + tn + fp + fn)
    prec = tp / (tp + fp) if (tp + fp) else 0.0
    rec  = tp / (tp + fn) if (tp + fn) else 0.0
    f1   = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
    if f1 > bestF1:
      bestThreshold = threshold
      bestLr = lr
      bestPrec = prec
      bestAcc = acc
      bestRec = rec
      bestF1 = f1
      bestW = W
      bestTP = tp
      bestFP = fp
      bestTN = tn
      bestFN = fn



print(f"\nPositives [{bestTP:3}  {bestFP:3}]")
print(f"Negatives [{bestTN:3}  {bestFN:3}]")
print("Learning rate: ", bestLr)
print("Threshold: ", bestThreshold)
print(f"\nTest Accuracy={bestAcc:.3f} Recall={bestRec:.3f}  Precision={bestPrec:.3f}  F1={bestF1:.3f}")

# inspect learned weights
feat_names = ["(bias)", "Variance", "Skewness", "Curtosis", "Entropy"]
for name, w in zip(feat_names, bestW):
    print(f"{name:>10s}: {w:+.4f}")


Positives [122    1]
Negatives [151    0]
Learning rate:  0.15
Threshold:  0.7

Test Accuracy=0.996 Recall=1.000  Precision=0.992  F1=0.996
    (bias): +3.0164
  Variance: -2.8490
  Skewness: -1.6796
  Curtosis: -1.9863
   Entropy: -0.2407
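
The grid search above scores every combination on the test set, so the reported figures are measured on the same rows the settings were chosen on. One common alternative, sketched here as an assumption rather than as what this notebook does, is to carve a validation set out of the training rows, select on that, and keep the test set for a single final evaluation. The helper name `stratified_split`, the stand-in labels and the 75/25 ratio are all hypothetical.

```python
import numpy as np

def stratified_split(y, ratio, rng):
    """Return (first_idx, second_idx); `ratio` of each class goes to the first part."""
    first, second = [], []
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)   # row positions belonging to this class
        rng.shuffle(idx)                 # shuffle in place before cutting
        cut = int(round(len(idx) * ratio))
        first.extend(idx[:cut])
        second.extend(idx[cut:])
    return np.array(first), np.array(second)

# Stand-in labels; in the notebook this would be y_train
y_demo = np.array([0] * 12 + [1] * 8)
rng = np.random.default_rng(450)
fit_idx, val_idx = stratified_split(y_demo, ratio=0.75, rng=rng)
print(len(fit_idx), len(val_idx))  # prints: 15 5
```

With such a split, the model would be fitted on `Xtr[fit_idx]`, the threshold and learning rate chosen by F1 on `Xtr[val_idx]`, and `Xte` used once at the end.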


#Logistic regression with L2 & lambda tuning

In [24]:
#Import libraries
import pandas as pd
import numpy as np

#Open Dataset
url = "https://raw.githubusercontent.com/Youssef-ElBakry/Logistic-Regression-ML/main/data_banknote_authentication.csv"
cols = ["Variance", "Skewness", "Curtosis", "Entropy", "Class"] #Adding headers to file as dataset has no headers
df = pd.read_csv(url, header=None, names=cols)
df["Class"] = df["Class"].astype(int)

#Define global values
trainingRatio = 0.8
seed = 450 #Fixed seed for rng as it makes results reproducible


#Split data into training and testing
rng = np.random.default_rng(seed)
test_I = []
train_I = []

#Shuffle each class separately and split it into training and testing (a stratified split)
for key, row in df.groupby("Class", sort=False): #sort must be the boolean False; the string "False" is truthy
  index = row.index.to_numpy()
  rng.shuffle(index)
  trainingRows = int(round(len(index) * trainingRatio)) #Determine size of training set in each class. This allows for a more even split. (80% of the Authentic class and 80% of the forged class)
  train_I.extend(index[:trainingRows])
  test_I.extend(index[trainingRows:])

#Shuffle training and testing set again as they are currently grouped by class
train = df.loc[train_I].sample(frac=1, random_state=seed).reset_index(drop=True)
test = df.loc[test_I].sample(frac=1, random_state=seed).reset_index(drop=True)

FEATS = ["Variance", "Skewness", "Curtosis", "Entropy"]

X_train = train[FEATS].to_numpy(dtype=np.float64)
y_train = train["Class"].to_numpy(dtype=np.float64)
X_test  = test[FEATS].to_numpy(dtype=np.float64)
y_test  = test["Class"].to_numpy(dtype=np.float64)

#Add a column of ones, to allow for intercept
def add_bias(X):
  return np.c_[np.ones((X.shape[0], 1)), X]  # shape: (N, D+1)

Xtr = add_bias(X_train)   # (n_train, 1+4)
Xte = add_bias(X_test)

#Sigmoid function
def sigmoid(z):
  z = np.clip(z, -500, 500)
  return 1.0 / (1.0 + np.exp(-z))

def L2_gradient(W, X, y,Lam):
  N = X.shape[0]

  z = X @ W
  p = sigmoid(z)

  err  = p - y
  grad = (X.T @ err) / N
  # add L2 gradient on non-bias weights
  grad[1:] += Lam * W[1:]
  return grad

def fit_logreg(X, y, seed, Lam, lr=0.1, epochs=2000):
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.01, size=X.shape[1])
    for t in range(epochs):
        grad = L2_gradient(W, X, y,Lam)
        W -= lr * grad
    return W


def predict_proba(X, W):
    return sigmoid(X @ W)

def predict_label(X, W, thresh=0.5):
    return (predict_proba(X, W) >= thresh).astype(int)

#Grid search of lambda
Lam_grid = [0, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5]
bestF1 = bestLam = bestAcc = bestPrec = bestRec = bestW = bestTP = bestFP = bestTN = bestFN = 0
for Lam in Lam_grid:
  W = fit_logreg(Xtr, y_train, seed, Lam, lr=0.1, epochs=2000)
  proba = predict_proba(Xte, W)
  pred  = predict_label(Xte, W)
  tp = int(((pred == 1) & (y_test == 1)).sum())
  tn = int(((pred == 0) & (y_test == 0)).sum())
  fp = int(((pred == 1) & (y_test == 0)).sum())
  fn = int(((pred == 0) & (y_test == 1)).sum())
  acc  = (tp + tn) / (tp + tn + fp + fn)
  prec = tp / (tp + fp) if (tp + fp) else 0.0
  rec  = tp / (tp + fn) if (tp + fn) else 0.0
  f1   = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
  print(f1)
  if f1 > bestF1:
    bestLam = Lam
    bestPrec = prec
    bestAcc = acc
    bestRec = rec
    bestF1 = f1
    bestW = W
    bestTP = tp
    bestFP = fp
    bestTN = tn
    bestFN = fn



print(f"\nPositives [{bestTP:3}  {bestFP:3}]")
print(f"Negatives [{bestTN:3}  {bestFN:3}]")
print("Lambda: ", bestLam)
print(f"\nTest Accuracy={bestAcc:.3f} Recall={bestRec:.3f}  Precision={bestPrec:.3f}  F1={bestF1:.3f}")

# inspect learned weights of the best model (not of the last lambda tried)
feat_names = ["(bias)", "Variance", "Skewness", "Curtosis", "Entropy"]
for name, w in zip(feat_names, bestW):
    print(f"{name:>10s}: {w:+.4f}")

0.9878542510121457
0.976
0.9838709677419354
0.9878542510121457
0.9878542510121457
0.9878542510121457

Positives [122    3]
Negatives [149    0]
Lambda:  0

Test Accuracy=0.989 Recall=1.000  Precision=0.976  F1=0.988
    (bias): +2.6974
  Variance: -2.5511
  Skewness: -1.5168
  Curtosis: -1.7762
   Entropy: -0.2371


Given that the best lambda is 0, the grid search found that L2 regularization is not effective in the context of this dataset: lambda values of 1e-3 and below only tie with lambda 0 on F1, and larger values score lower. This is expected, as the classes are close to linearly separable. If you adjust the training split of the original model to 0.2, the accuracy remains high. The model therefore generalizes well even from little data, overfitting is not much of a risk, and regularization is not really needed.
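
To make the effect of the penalty concrete, the sketch below fits a tiny, made-up and perfectly separable 1-D problem with and without L2. It reuses the penalty gradient from `L2_gradient`, but the data, the zero initialisation and the function name `fit` are assumptions for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def fit(X, y, lam, lr=0.1, epochs=2000):
    """Gradient descent on the mean log-loss with an L2 penalty on non-bias weights."""
    W = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = X.T @ (sigmoid(X @ W) - y) / X.shape[0]
        grad[1:] += lam * W[1:]   # penalty gradient, bias weight left unpenalised
        W -= lr * grad
    return W

# Made-up separable data: a bias column of ones plus one feature
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

w_plain = fit(X, y, lam=0.0)
w_l2 = fit(X, y, lam=0.1)
print(f"lambda=0.0 slope={w_plain[1]:.3f}   lambda=0.1 slope={w_l2[1]:.3f}")

# Both models label every point correctly; L2 only shrinks the slope
for W in (w_plain, w_l2):
    print(((sigmoid(X @ W) >= 0.5).astype(float) == y).all())
```

On data like this the penalty changes the size of the weights but not the predicted labels, which is consistent with the small lambda values tying with lambda 0 above.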

#Final results
Basic logistic regression
  - Accuracy: 0.989
  - Precision: 0.976
  - Recall: 1.000
  - F1: 0.988

Logistic regression with threshold and learning rate tuning
  - Accuracy: 0.996
  - Precision: 0.992
  - Recall: 1.000
  - F1: 0.996
  - Threshold: 0.7
  - Learning rate: 0.15

Logistic regression with L2 & lambda tuning
  - Accuracy: 0.989
  - Precision: 0.976
  - Recall: 1.000
  - F1: 0.988
  - Lambda: 0