Linear Regression Benchmarking

Handle Nulls: Replace null values in each column of the dataframe with that column's median. This keeps NaN values from raising errors when the model is fitted. Because the median is robust to outliers, the filled-in values are also less likely to skew the model than a mean-based fill would be.

In [1]:
import numpy as np
import pandas as pd
from statsbox.linear_regression import LinearRegression
from sklearn.linear_model import LinearRegression as sklearn_lin
from sklearn.metrics import mean_squared_error

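def handle_nulls(df):
    # Fill missing values in each column with that column's median.
    # Assigning the result back avoids a chained inplace fillna,
    # which newer pandas versions warn about and may silently ignore.
    for col in df:
        col_median = df[col].median()
        df[col] = df[col].fillna(col_median)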
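
To illustrate why the median is a safer fill value than the mean, the following sketch fills a missing entry in a small column that contains one outlier. The column name and the values are made up for this example and are not taken from the housing data.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value and one extreme outlier.
demo = pd.DataFrame({"price": [100.0, 110.0, np.nan, 120.0, 10000.0]})

median_fill = demo["price"].median()  # middle of the observed values
mean_fill = demo["price"].mean()      # pulled upward by the outlier

# fillna returns a new Series; the original frame is left unchanged here.
filled = demo["price"].fillna(median_fill)

print("median fill value:", median_fill)
print("mean fill value:", mean_fill)
print(filled.tolist())
```

The median stays close to the typical values, while the mean is dominated by the single outlier.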

Benchmark: Fit the sklearn LinearRegression model on the same train/test split and print its mean squared error on the test set.

In [2]:
def run_sklearn_model(X_train, y_train, X_test, y_test):
    sklearn_model = sklearn_lin()
    sklearn_model.fit(X_train, y_train)
    y_pred_test = sklearn_model.predict(X_test)
    print('Sklearn mean squared error:', mean_squared_error(y_test, y_pred_test))


In [3]:
def test_linear_regression(X, y):
    # Linear Regression Code Reference: https://towardsdatascience.com/coding-linear-regression-from-scratch-c42ec079902 
    # Data source: https://github.com/kumudlakara/Medium-codes/blob/main/linear_regression/house_price_data.txt
    
    # split dataset into test and train
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=44)

    # initialize the statsbox model
    linregmodel = LinearRegression()
    # fit with learning rate 0.1 for 400 epochs
    linregmodel.fit(X_train, y_train, lr=0.1, epochs=400)
    # predict on the held-out test set
    y_pred = linregmodel.predict(X_test)
    print("Model mean squared error: ", linregmodel.mse(y_test, y_pred))

    # fit and score the sklearn benchmark on the same split
    run_sklearn_model(X_train, y_train, X_test, y_test)
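
The `lr` and `epochs` arguments suggest that the statsbox model is trained by gradient descent, although its implementation is not shown in this notebook. As a rough sketch under that assumption, the function below (`fit_gd`, a hypothetical name that is not part of statsbox) runs batch gradient descent on the mean squared error and compares the result with NumPy's closed-form least-squares solution. The data is synthetic and the true coefficients are arbitrary.

```python
import numpy as np

def fit_gd(X, y, lr=0.1, epochs=400):
    """Batch gradient descent for linear regression (illustrative sketch only)."""
    m = X.shape[0]
    Xb = np.hstack([np.ones((m, 1)), X])    # prepend a column of ones for the intercept
    theta = np.zeros((Xb.shape[1], 1))
    for _ in range(epochs):
        residual = Xb @ theta - y            # shape (m, 1)
        grad = (2.0 / m) * Xb.T @ residual   # gradient of the mean squared error
        theta -= lr * grad
    return theta

# Synthetic data: intercept 3, slopes 2 and -1, plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 + X @ np.array([[2.0], [-1.0]]) + 0.1 * rng.normal(size=(200, 1))

theta_gd = fit_gd(X, y)
Xb = np.hstack([np.ones((200, 1)), X])
theta_ls, *_ = np.linalg.lstsq(Xb, y, rcond=None)

print("gradient descent:", theta_gd.ravel())
print("least squares:   ", theta_ls.ravel())
```

On standardized features, a learning rate of 0.1 over 400 epochs is typically enough for the two solutions to agree closely, which is consistent with the nearly identical errors the two models report in this notebook.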

Test on a sample Kaggle dataset to predict housing prices.
Output: Mean squared errors of the statsbox linear regression model and the sklearn linear regression model.

In [4]:
def test_housePriceData():
    df = pd.read_csv("house_price_data.txt", index_col=False)
    df.columns = ["housesize", "rooms", "price"]
    handle_nulls(df)

    # Normalize the data
    df = (df-df.mean())/df.std()
    # store non-label values into matrix X
    X = df.iloc[:, :-1].values

    # store labels into y
    y = df.iloc[:, -1].values.reshape(-1,1)

    test_linear_regression(X,y)

test_housePriceData()

Model mean squared error:  0.26668027516712967
Sklearn mean squared error: 0.2666802712838493


Test on a small dataset generated by sklearn.
Output: Mean squared errors of the statsbox linear regression model and the sklearn linear regression model.

Normalization: Standardize each column of the X matrix to zero mean and unit standard deviation, so that features with very large or very small values don't dominate the fit.

In [5]:

def normalize(X):
    # Standardize every column to zero mean and unit standard deviation.
    # The expression is vectorized over columns, so no loop is needed.
    return (X - X.mean(axis=0)) / X.std(axis=0)
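
As a quick check of what this standardization does, the sketch below applies the same formula to a small matrix whose columns are on very different scales. The helper name `standardize` and the data are illustrative only.

```python
import numpy as np

def standardize(X):
    # Subtract each column's mean and divide by each column's standard deviation.
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Hypothetical features: house size in square feet and number of rooms.
X_demo = np.array([[2100.0, 3.0],
                   [1600.0, 2.0],
                   [2400.0, 4.0],
                   [1400.0, 3.0]])

Z = standardize(X_demo)
print(Z.mean(axis=0))  # approximately 0 for each column
print(Z.std(axis=0))   # approximately 1 for each column

# Applying the formula again changes nothing beyond rounding error,
# which is why a single pass over the matrix is enough.
print(np.allclose(standardize(Z), Z))
```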
    

In [6]:
from sklearn.datasets import make_regression


def test_sklearn_ds1():
    X, y = make_regression(n_features=2,
                           n_informative=2, random_state=1, n_samples=100
                          )
    X = normalize(X)
    # y = normalize(y)
    y = y.reshape(X.shape[0],1)
    test_linear_regression(X,y)  

test_sklearn_ds1()

Model mean squared error:  1.0088111028147344e-26
Sklearn mean squared error: 2.6513615024478207e-27
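
The errors above are close to machine precision because `make_regression` adds no noise by default (`noise=0.0`), so the targets are an exact linear function of the features. The sketch below reproduces this effect with plain NumPy on made-up data. The coefficients, the noise level, and the helper name `lstsq_mse` are arbitrary choices for illustration, and the error is measured on the training data for brevity.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
w_true = np.array([[40.0], [15.0]])   # arbitrary illustrative coefficients

def lstsq_mse(X, y):
    # Fit ordinary least squares with an intercept and return the training MSE.
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    theta, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return float(np.mean((Xb @ theta - y) ** 2))

y_exact = X @ w_true                                  # no noise: exactly linear
y_noisy = y_exact + 5.0 * rng.normal(size=(100, 1))   # noise with standard deviation 5

print("MSE without noise:", lstsq_mse(X, y_exact))
print("MSE with noise:   ", lstsq_mse(X, y_noisy))
```

Without noise the error is limited only by floating-point rounding. With noise it is on the order of the noise variance, which makes for a more informative comparison between two models.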


Test on a large dataset generated by sklearn.
Output: Mean squared errors of the statsbox linear regression model and the sklearn linear regression model.

In [7]:
def test_sklearn_ds2():
    X, y = make_regression(n_features=12,
                           n_informative=12, random_state=1, n_samples=10000
                          )
    X = normalize(X)
    # y = normalize(y)
    y = y.reshape(X.shape[0],1)
    test_linear_regression(X,y)  

test_sklearn_ds2()

Model mean squared error:  1.7137122797156253e-26
Sklearn mean squared error: 9.529641673600798e-26
