As explained in Trevor Hastie's book, logistic regression and LDA are very similar in form: in both models the log-odds are linear in the predictor vector X. The subtlety is that if the predictors X really are normally distributed within each class with a common covariance matrix, as LDA assumes, then LDA <i>might</i> perform better than logistic regression, especially with the smaller training set (```n_train=50```). Since we generate the predictors from exactly such a distribution, we expect LDA to at least match logistic regression, if not beat it.
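
For reference, the shared linear form can be written out explicitly. The notation here (class priors $\pi_k$, class means $\mu_k$, common covariance $\Sigma$, coefficients $\beta_0, \beta$) is the standard textbook one and is not used elsewhere in this notebook. Under the LDA assumptions, the log-odds between class 1 and class 2 are

$$
\log\frac{P(G=1\mid X=x)}{P(G=2\mid X=x)}
= \log\frac{\pi_1}{\pi_2}
- \tfrac{1}{2}(\mu_1+\mu_2)^\top\Sigma^{-1}(\mu_1-\mu_2)
+ x^\top\Sigma^{-1}(\mu_1-\mu_2),
$$

while logistic regression posits the same left-hand side to be

$$
\beta_0 + \beta^\top x .
$$

Both are linear in $x$; the difference lies in how the coefficients are estimated. LDA estimates $\pi_k$, $\mu_k$ and $\Sigma$ from the Gaussian model and plugs them in, whereas logistic regression fits $\beta_0$ and $\beta$ directly by maximising the conditional likelihood of the labels.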

In [2]:
import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from statistics import fmean

In [11]:
p = 15

def genData(n):

    n1 = n2 = n//2
    cov_1 = np.diag(np.repeat(1, p)) + 0.2
    x_class1 = np.random.multivariate_normal(
        mean=np.repeat(3.5, p),  # class 1 mean: 3 in the first run below, 3.5 in the second
        cov=cov_1,
        size=n1
    )
    x_class2 = np.random.multivariate_normal(
        mean=np.repeat(2, p),
        cov=cov_1,
        size=n2
    )
    y = np.repeat((1, 2), (n1, n2))
    data_set = pd.concat([pd.DataFrame(x_class1), pd.DataFrame(x_class2)])
    data_set.columns = [f'x_{i+1}' for i in range(p)]
    data_set['y'] = y

    return data_set

In [5]:
def predictLDA(train_set, test_set):

    LDA = LinearDiscriminantAnalysis(store_covariance=True)
    LDA.fit(
        X=train_set[[f'x_{i+1}' for i in range(p)]], 
        y=train_set['y']
    )
    y_pred = LDA.predict(X=test_set[[f'x_{i+1}' for i in range(p)]])

    return y_pred

def predictLogistic(train_set, test_set):

    logit = LogisticRegression()
    logit.fit(
        X=train_set[[f'x_{i+1}' for i in range(p)]],
        y=train_set['y']
    )
    y_pred = logit.predict(X=test_set[[f'x_{i+1}' for i in range(p)]])

    return y_pred
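
One detail of ```predictLogistic``` worth keeping in mind when comparing the two classifiers: scikit-learn's ```LogisticRegression``` applies an L2 penalty by default (```C=1.0```), so the model fitted above is a regularised logistic regression rather than a plain maximum-likelihood fit. The sketch below is not part of the original experiment. It shows one way to approximate an unpenalised fit by making ```C``` very large; the data, the seed and the value ```C=1e6``` are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative data in the spirit of genData(50): two Gaussian classes, shared covariance.
rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility of this sketch
p = 15
cov = np.eye(p) + 0.2
X = np.vstack([
    rng.multivariate_normal(np.full(p, 3.5), cov, size=25),
    rng.multivariate_normal(np.full(p, 2.0), cov, size=25),
])
y = np.repeat([1, 2], 25)

# Default settings: L2 penalty with C=1.0 (C is the inverse regularisation strength).
logit_default = LogisticRegression().fit(X, y)

# A very large C makes the penalty negligible (assumed value, not tuned).
logit_weak_penalty = LogisticRegression(C=1e6, max_iter=5000).fit(X, y)

# The penalty shrinks the coefficient vector, so its norm is smaller under the default.
norm_default = np.linalg.norm(logit_default.coef_)
norm_weak_penalty = np.linalg.norm(logit_weak_penalty.coef_)
print(norm_default, norm_weak_penalty)
```

Swapping such a weakly penalised model into ```predictLogistic``` would be one way to check how much of any small-sample difference between logit and LDA comes from the penalty rather than from the estimation method itself.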

In [6]:
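def classifyPreds(row):
    # Class 2 is treated as the positive class, class 1 as the negative class.
    if row.y == 2 and row.y_hat == 2:
        return 'TP'
    elif row.y == 2 and row.y_hat == 1:
        return 'FN'
    elif row.y == 1 and row.y_hat == 2:
        return 'FP'
    elif row.y == 1 and row.y_hat == 1:
        return 'TN'


def getAccuracy(df_labels):
    # Fraction of correct predictions: (TP + TN) / total.
    pos_negs = df_labels.apply(classifyPreds, axis=1)
    # reindex so that an outcome that never occurs counts as 0 instead of raising a KeyError
    cf_matrix = pos_negs.value_counts().reindex(['TP', 'FN', 'FP', 'TN'], fill_value=0)
    return (cf_matrix['TP'] + cf_matrix['TN']) / cf_matrix.sum()


def getBalancedAccuracy(df_labels):
    # Mean of the true positive rate and the true negative rate.
    # With equally sized classes, as generated by genData, this equals plain accuracy.
    pos_negs = df_labels.apply(classifyPreds, axis=1)
    cf_matrix = pos_negs.value_counts().reindex(['TP', 'FN', 'FP', 'TN'], fill_value=0)
    tpr = cf_matrix['TP'] / (cf_matrix['TP'] + cf_matrix['FN'])
    tnr = cf_matrix['TN'] / (cf_matrix['TN'] + cf_matrix['FP'])
    return (tpr + tnr) / 2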
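
The same two metrics can also be computed directly from the label arrays, without classifying each row first. This is a minimal sketch with made-up labels; the arrays and the helper names ```accuracy``` and ```balanced_accuracy``` are hypothetical and not part of the notebook. scikit-learn's ```sklearn.metrics``` module also provides ```accuracy_score``` and ```balanced_accuracy_score```, which should give the same numbers.

```python
import numpy as np

def accuracy(y, y_hat):
    """Fraction of predictions that equal the true label."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return float(np.mean(y == y_hat))

def balanced_accuracy(y, y_hat):
    """Mean of the per-class recalls (for two classes: mean of TPR and TNR)."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    recalls = [np.mean(y_hat[y == c] == c) for c in np.unique(y)]
    return float(np.mean(recalls))

# Made-up example: four observations of class 1, two of class 2.
y = np.array([1, 1, 1, 1, 2, 2])
y_hat = np.array([1, 1, 1, 2, 2, 1])

print(accuracy(y, y_hat))           # 4 of 6 correct
print(balanced_accuracy(y, y_hat))  # mean of 3/4 (class 1) and 1/2 (class 2)
```

The vectorised comparison avoids a Python-level loop over 10,000 test rows in each of the 100 repetitions, which is where most of the time in ```getAccuracy``` is likely spent.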

In [7]:
def classificationAnalysis(n_train):
    
    classifier_acc = {'LDA': [], 'logit': []}

    for _ in range(100):
        train = genData(n_train)
        test = genData(10_000)

        y_hat_LDA = predictLDA(train, test)
        y_hat_logit = predictLogistic(train, test)

        df_LDA = pd.DataFrame({'y': test['y'], 'y_hat': y_hat_LDA})
        df_logit = pd.DataFrame({'y': test['y'], 'y_hat': y_hat_logit})

        classifier_acc['LDA'].append(getAccuracy(df_LDA))
        classifier_acc['logit'].append(getAccuracy(df_logit))

    return pd.DataFrame({'LDA': [round(fmean(classifier_acc['LDA']), 4)], 'logit': [round(fmean(classifier_acc['logit']), 4)]})

In [8]:
df_train_50 = classificationAnalysis(50)
df_train_10k = classificationAnalysis(10_000)

df_results = pd.concat([df_train_50, df_train_10k])
df_results.index = pd.Index(['n_train=50', 'n_train=10^4'])
df_results

                 LDA   logit
n_train=50    0.7564  0.7812
n_train=10^4  0.8333  0.8333
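
As a sanity check on the ```n_train=10^4``` row, the best achievable (Bayes) accuracy for this setup has a closed form, because both classes are Gaussian with the same covariance matrix and are equally likely. The sketch below assumes the parameters used for this first run (class means 3 and 2 in every coordinate, covariance equal to the identity plus 0.2 in every entry); the function name ```bayes_accuracy``` and its arguments are hypothetical.

```python
import math
import numpy as np

def bayes_accuracy(mean_1, mean_2=2.0, p=15, offset=0.2):
    """Best achievable accuracy for two equally likely Gaussian classes with
    constant mean vectors and shared covariance np.eye(p) + offset."""
    cov = np.eye(p) + offset
    delta = np.full(p, mean_1 - mean_2)
    # Squared Mahalanobis distance between the two class means.
    d2 = float(delta @ np.linalg.solve(cov, delta))
    # The optimal linear rule is correct with probability Phi(sqrt(d2) / 2),
    # where Phi is the standard normal CDF, written here via the error function.
    z = math.sqrt(d2) / 2.0
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

print(round(bayes_accuracy(3.0), 4))
```

This comes out at roughly 0.83, which suggests that with the large training set both classifiers are essentially at the optimum, so any difference between them can only show up at small ```n_train```.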


With the ```n_train=50``` training set, logistic regression clearly outperforms LDA (0.7812 vs 0.7564 mean accuracy), while with ```n_train=10^4``` the two are indistinguishable. Since the two class distributions we generated have very close means (mu=3 for class 1 vs mu=2 for class 2) and share the same covariance matrix, we can hypothesise that a logit model does a better job of classifying strongly overlapping classes when the training set is relatively small, even though the Gaussian assumption holds exactly. Let's test that hypothesis.

I have changed the mean of class 1 from 3 to 3.5 in ```genData```; let's run the experiment again. We expect the accuracies to go up since the class distributions are now further apart, but what we are really interested in is whether a similar accuracy gap between logit and LDA remains with the small training set:

In [12]:
df_train_50 = classificationAnalysis(50)
df_train_10k = classificationAnalysis(10_000)

df_results = pd.concat([df_train_50, df_train_10k])
df_results.index = pd.Index(['n_train=50', 'n_train=10^4'])
df_results

                 LDA   logit
n_train=50    0.8756  0.9025
n_train=10^4  0.9267  0.9266


As we can see, the gap at ```n_train=50``` is almost identical (about 0.027, compared with about 0.025 before), which does not support my hypothesis that it is due to distribution overlap. The gap could instead be explained by the way the two methods estimate the decision boundary. Logistic regression maximises the conditional likelihood of the labels given X, while LDA plugs estimates of the class means and of the shared covariance matrix into the discriminant function.

LDA is generally said to be more sensitive to outliers than logistic regression, and that might be the real reason why logit performs better on small data sets: with p=15 predictors and only 50 observations, the mean and covariance estimates are noisy, which leads to a higher variance that is not compensated by any reduction in bias, and ultimately to lower accuracy. To verify this, we could calculate the 15-dimensional distance of every point to the decision boundary (for both LDA and logit) and compare the variance of that distance between the two methods; a higher variance for LDA would be consistent with this explanation.
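
A minimal sketch of that check for a single training set, under stated assumptions rather than a definitive implementation. For a linear boundary w·x + b = 0, the signed distance of a point x to the boundary is (w·x + b) / ||w||; both scikit-learn estimators expose w·x + b through ```decision_function``` and w through ```coef_```. The helper name ```signed_distances```, the seed and the simulated data are illustrative assumptions.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

def signed_distances(model, X):
    """Signed Euclidean distance of each row of X to the fitted linear boundary."""
    w = model.coef_.ravel()
    return model.decision_function(X) / np.linalg.norm(w)

# Illustrative training set in the spirit of genData(50).
rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility of this sketch
p = 15
cov = np.eye(p) + 0.2
X = np.vstack([
    rng.multivariate_normal(np.full(p, 3.5), cov, size=25),
    rng.multivariate_normal(np.full(p, 2.0), cov, size=25),
])
y = np.repeat([1, 2], 25)

models = {
    'LDA': LinearDiscriminantAnalysis(),
    'logit': LogisticRegression(),
}
dists = {}
for name, model in models.items():
    model.fit(X, y)
    dists[name] = signed_distances(model, X)
    # Variance of the signed distances over the points of this one data set.
    print(name, dists[name].var())
```

A single data set says little on its own. A fuller study would repeat this over many simulated training sets, as ```classificationAnalysis``` does for accuracy, and compare how much the fitted boundaries themselves vary between repetitions.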