Logistic Regression Benchmarking

Handle Nulls: Replace null values in each column of the dataframe with that column's median. This keeps NaN values from raising errors in the model, and because the median is a typical value for the column, the filled-in entries are not outliers that could hurt the model's performance.
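
As a small illustration of why the median is a reasonable fill value, here is a sketch on a toy dataframe. The column names and values are made up for this example and are not taken from the weather dataset. A single extreme value pulls the mean far from the bulk of the data, while the median stays close to it.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): each column has one missing value;
# "rainfall" also contains one extreme outlier.
toy = pd.DataFrame({
    "rainfall": [0.0, 1.0, 2.0, np.nan, 100.0],
    "humidity": [40.0, np.nan, 60.0, 50.0, 55.0],
})

# Fill each column's NaNs with that column's median.
filled = toy.fillna(toy.median())

print(filled["rainfall"].tolist())  # the NaN becomes 1.5, the median of [0, 1, 2, 100]
print(toy["rainfall"].mean())       # 25.75: the mean is dragged up by the outlier
```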

In [6]:
import numpy as np
import pandas as pd
from statsbox.logistic_regression import LogisticRegression
from sklearn.linear_model import LogisticRegression as sklearn_lr

from sklearn.metrics import accuracy_score

# Logistic Regression Code Reference: https://towardsdatascience.com/logistic-regression-from-scratch-in-python-ec66603592e2
# Data source: https://www.kaggle.com/code/dyasin/week24ml-weather-dataset-rattle-package-weatheraus/data 

def handle_nulls(df):
    # Fill NaNs in each column with that column's median (modifies df in place).
    # Assumes every remaining column is numeric.
    for col in df:
        df[col] = df[col].fillna(df[col].median())

Run the sklearn logistic regression model as a benchmark

In [7]:

def run_sklearn_model(X_train, y_train, X_test, y_test):
    sklearn_model = sklearn_lr(solver='liblinear', random_state=0)
    # sklearn expects a 1-D label array, so flatten the (n, 1) column vector
    sklearn_model.fit(X_train, np.ravel(y_train))
    y_pred_test = sklearn_model.predict(X_test)
    print('Sklearn model accuracy score: {0:0.4f}'.format(accuracy_score(y_test, y_pred_test)))

In [8]:


def test_logistic_regression(X, y):
    # Split into train and test sets
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=44)

    logregmodel = LogisticRegression()

    logregmodel.fit(X_train, y_train, lr=0.01, epochs=50, batchsize=1000)
    y_pred = logregmodel.predict(X_test)

    print("Logistic Regression Accuracy: ",logregmodel.accuracy(y_test, y_pred))

    logregmodel_ada = LogisticRegression()

    logregmodel_ada.fit(X_train, y_train, useAdagrad=True, lr=0.01, epochs=50, batchsize=1000)
    y_pred = logregmodel_ada.predict(X_test)

    print("Logistic Regression Accuracy (with Adagrad): ",logregmodel_ada.accuracy(y_test, y_pred))

    run_sklearn_model(X_train, y_train, X_test, y_test)

Test a sample Kaggle dataset with ~145k samples
Output: Accuracies of the toolbox logistic regression model, the toolbox logistic regression model with Adagrad, and the sklearn logistic regression model
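
For readers unfamiliar with Adagrad, the following is a rough NumPy sketch of the update rule applied to logistic regression. It is not the statsbox implementation behind `useAdagrad=True`, which is not shown in this notebook; the function name `fit_logreg_adagrad`, its parameters, and the full-batch training loop are all assumptions made for illustration. The idea is that each weight's step size is divided by the square root of the running sum of that weight's squared gradients.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg_adagrad(X, y, lr=0.1, epochs=200, eps=1e-8):
    """Full-batch logistic regression with Adagrad (illustrative sketch only)."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    g2_w = np.zeros(n_features)  # running sum of squared gradients for w
    g2_b = 0.0                   # running sum of squared gradients for b
    for _ in range(epochs):
        p = sigmoid(X @ w + b)
        grad_w = X.T @ (p - y) / n_samples  # gradient of the mean log loss w.r.t. w
        grad_b = np.mean(p - y)             # gradient of the mean log loss w.r.t. b
        g2_w += grad_w ** 2
        g2_b += grad_b ** 2
        # Per-parameter step size shrinks as squared gradients accumulate.
        w -= lr * grad_w / (np.sqrt(g2_w) + eps)
        b -= lr * grad_b / (np.sqrt(g2_b) + eps)
    return w, b

# Toy, linearly separable data (hypothetical): the label is 1 when x0 + x1 > 0.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(float)

w, b = fit_logreg_adagrad(X_demo, y_demo)
pred = (sigmoid(X_demo @ w + b) >= 0.5).astype(float)
print("training accuracy:", np.mean(pred == y_demo))
```

Plain gradient descent would use the same fixed `lr` for every weight at every step; the only difference in this sketch is the division by `np.sqrt(g2_w)`.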

In [9]:
def process_weatherAUS(df):
    y = pd.get_dummies(df.RainTomorrow, drop_first=True)
    y = y.values.reshape(-1,1)

    # Drop the target, the categorical columns, and the Evaporation and Sunshine columns
    df.drop(['Date', 'Location', 'WindGustDir', 'WindDir9am', 'Evaporation', 'Sunshine', 'WindDir3pm', 'RainToday',  "RainTomorrow"],  axis=1, inplace=True)
    handle_nulls(df)

    # Normalize Data
    df = (df-df.mean())/df.std()
    X = df.values
    return X,y

def test_weatherAUS():
    df = pd.read_csv("weatherAUS.csv")
    X, y = process_weatherAUS(df)
    test_logistic_regression(X,y)


test_weatherAUS()

Logistic Regression Accuracy:  0.8419840505981026
Logistic Regression Accuracy (with Adagrad):  0.8425340299738759


Sklearn model accuracy score: 0.8426


Test a small dataset generated by sklearn
Output: Accuracies of the toolbox logistic regression model, the toolbox logistic regression model with Adagrad, and the sklearn logistic regression model

Normalization: Standardize each column of the X matrix to zero mean and unit standard deviation, so that features with very large or very small values don't influence the outcome of the model too much.
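
A quick sketch of what this standardization does, using a made-up matrix whose second feature is on a much larger scale than the first. After subtracting each column's mean and dividing by its standard deviation, both columns end up on the same scale.

```python
import numpy as np

# Toy matrix (hypothetical values): the second feature is 1000x larger than the first.
X_toy = np.array([[1.0, 1000.0],
                  [2.0, 2000.0],
                  [3.0, 3000.0]])

# Standardize each column: subtract its mean, divide by its standard deviation.
X_std = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(X_std.mean(axis=0))  # approximately 0 for every column
print(X_std.std(axis=0))   # approximately 1 for every column
```

Note that NumPy's `std` defaults to the population standard deviation (`ddof=0`), while pandas' `DataFrame.std` defaults to the sample standard deviation (`ddof=1`), so the pandas-based normalization used for the weather data differs from this one by a small constant factor per column.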

In [10]:
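def normalize(X):
    # Standardize each column to zero mean and unit standard deviation.
    # One vectorized pass covers all columns, so no per-column loop is needed.
    return (X - X.mean(axis=0))/X.std(axis=0)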

In [11]:
from sklearn.datasets import make_classification

def test_sklearn_ds1():
    X, y = make_classification(n_features=2, n_redundant=0, 
                           n_informative=2, random_state=1, n_samples=100,
                           n_clusters_per_class=1)
    X = normalize(X)
    y = y.reshape(X.shape[0],1)
    test_logistic_regression(X,y)  

test_sklearn_ds1()

Logistic Regression Accuracy:  1.0
Logistic Regression Accuracy (with Adagrad):  0.95
Sklearn model accuracy score: 1.0000


Test a large dataset generated by sklearn
Output: Accuracies of the toolbox logistic regression model, the toolbox logistic regression model with Adagrad, and the sklearn logistic regression model

In [12]:
def test_sklearn_ds2():
    X, y = make_classification(n_features=14, n_redundant=0, 
                           n_informative=5, random_state=1, n_samples=100000,
                           n_clusters_per_class=1)
    X = normalize(X)
    y = y.reshape(X.shape[0],1)
    test_logistic_regression(X,y)  

test_sklearn_ds2()

Logistic Regression Accuracy:  0.94565
Logistic Regression Accuracy (with Adagrad):  0.9465


Sklearn model accuracy score: 0.9574
