The training dataset is loaded into a pandas DataFrame.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from imblearn.under_sampling import RandomUnderSampler
train_data = pd.read_csv('Train_Set.csv')

The dependent variable is split into two categories: healthy and any form of cancer. Healthy samples are relabelled as 0 and all cancer stages as 1. However, the healthy samples in the training data are heavily outnumbered by the cancer samples. This imbalance may inflate the model's apparent accuracy and precision.
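
To make the concern about inflated metrics concrete, the following minimal sketch uses hypothetical labels (10 healthy, 90 cancer; these counts are assumptions and are not taken from Train_Set.csv). A trivial "model" that always predicts the majority class scores a high accuracy while never identifying a healthy sample.

```python
# Hypothetical, illustrative labels: 10 healthy (0) and 90 cancer (1) samples.
y_true = [0] * 10 + [1] * 90

# A trivial "model" that always predicts the majority class (cancer).
y_pred = [1] * len(y_true)

# Accuracy: fraction of all samples labelled correctly.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall for the healthy class: fraction of healthy samples correctly identified.
healthy_total = sum(t == 0 for t in y_true)
healthy_found = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
healthy_recall = healthy_found / healthy_total

print(f"accuracy = {accuracy:.2f}, healthy recall = {healthy_recall:.2f}")
```

Accuracy alone hides the fact that the minority class is never detected, which is why per-class metrics and a balanced sample are used in the cells that follow.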

In [2]:
train_data['class_label'] = train_data['class_label'].replace(['healthy', 'early stage cancer', 'screening stage cancer', 'mid stage cancer', 'late stage cancer'], [0, 1, 1, 1, 1])


The RandomUnderSampler from the 'imblearn' package randomly undersamples the majority class (the samples with cancer) so that the ratio of healthy to cancer samples is 1:1. A logistic regression model is then initialised and fitted to the undersampled data.
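
As a rough illustration of what random undersampling does, here is a minimal pure-Python sketch. The helper `random_undersample` and the toy data are hypothetical; this is not how imblearn implements RandomUnderSampler, only the same idea under simple assumptions (sampling without replacement down to the size of the smallest class).

```python
import random
from collections import Counter

def random_undersample(rows, labels, seed=0):
    """Randomly drop samples from larger classes until every class
    has as many samples as the smallest class (a 1:1 ratio)."""
    rng = random.Random(seed)
    # Group sample indices by class label.
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    # Every class is reduced to the size of the smallest class.
    target = min(len(indices) for indices in by_class.values())
    kept = []
    for indices in by_class.values():
        kept.extend(rng.sample(indices, target))  # sample without replacement
    kept.sort()  # preserve the original row order
    return [rows[i] for i in kept], [labels[i] for i in kept]

# Hypothetical toy data: 3 healthy (0) and 7 cancer (1) samples.
toy_rows = [[float(i)] for i in range(10)]
toy_labels = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

balanced_rows, balanced_labels = random_undersample(toy_rows, toy_labels)
balanced_counts = Counter(balanced_labels)
print(balanced_counts)
```

Because the selection is random, the retained majority-class samples change between runs unless a seed is fixed. RandomUnderSampler accepts a `random_state` argument for the same purpose.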

In [3]:
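# sampling_strategy=1 requests a 1:1 minority-to-majority ratio after resampling
undersample = RandomUnderSampler(sampling_strategy = 1)
logreg_model = LogisticRegression(solver='liblinear', random_state = 0)
# Features are every column except the last; this assumes class_label is the last column
x, y = train_data.iloc[:,:-1], train_data['class_label']
# Note: despite the "_over" suffix, these hold the undersampled data
X_over, y_over = undersample.fit_resample(x, y)
logreg_model.fit(X_over, y_over)
# Display the fitted coefficients and intercept
logreg_model.coef_, logreg_model.intercept_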

(array([[-2.31026869e-02, -2.41671700e-02, -2.47977511e-02,
         -2.60006399e-02, -2.70068904e-02, -2.83784334e-02,
         -2.84825001e-02, -2.73462121e-02, -2.33666489e-02,
         -2.17674777e-02, -2.31958202e-02, -2.50389623e-02,
         -2.39950322e-02, -2.49219185e-02, -2.55092351e-02,
         -2.52030030e-02, -2.56102167e-02, -2.42518375e-02,
         -2.16476340e-02, -1.69525441e-02, -1.61735659e-02,
         -1.74607116e-02, -1.91489557e-02, -2.04136067e-02,
         -1.93788303e-02, -2.24166093e-02, -2.45887219e-02,
         -2.40448826e-02, -2.45167191e-02, -1.95379633e-02,
         -1.68830639e-02, -2.14862064e-02, -2.04685981e-02,
         -1.88425195e-02, -1.57786495e-02, -1.36261785e-02,
         -1.28534158e-02, -1.31150395e-02, -1.16794587e-02,
         -3.53569120e-03,  2.19576494e-03,  1.13123912e-03,
         -2.29324214e-03,  1.01707616e-04, -1.60017177e-03,
         -3.53878257e-03, -7.55993556e-03, -1.14258292e-02,
         -1.52869245e-02, -1.00919889e-0

In [4]:
test_data = pd.read_csv('Test_Set.csv')
# Apply the same binary relabelling as for the training set
test_data['class_label'] = test_data['class_label'].replace(['healthy', 'early stage cancer', 'screening stage cancer', 'mid stage cancer', 'late stage cancer'], [0, 1, 1, 1, 1])
testX, testY = test_data.iloc[:,:-1], test_data['class_label']
# The test set is also undersampled to a 1:1 class ratio (the "_over" suffix is a misnomer)
testX_over, testY_over = undersample.fit_resample(testX, testY)


The performance of the logistic regression model on the undersampled test set is summarised below.

In [5]:
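# Predict once and reuse the result; only the classification report is printed
test_pred = logreg_model.predict(testX_over)
test_proba = logreg_model.predict_proba(testX_over)  # per-class probabilities
test_accuracy = logreg_model.score(testX_over, testY_over)  # mean accuracy
test_cm = confusion_matrix(testY_over, test_pred)  # rows: true class, columns: predicted class
print(classification_report(testY_over, test_pred))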

              precision    recall  f1-score   support

           0       0.63      0.63      0.63        41
           1       0.63      0.63      0.63        41

    accuracy                           0.63        82
   macro avg       0.63      0.63      0.63        82
weighted avg       0.63      0.63      0.63        82
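
For readers who want to see where the figures in a report like this come from, the sketch below derives precision, recall and F1 from the four cells of a binary confusion matrix. The counts are hypothetical and for illustration only; they are not the confusion matrix of this notebook, which is not displayed above. The helper `per_class_metrics` is also an assumption, not a scikit-learn function.

```python
def per_class_metrics(tp, fp, fn):
    """Return (precision, recall, f1) for one class from its
    true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts for illustration only (not the notebook's results).
# With labels 0 and 1, scikit-learn lays the matrix out as [[tn, fp], [fn, tp]].
tn, fp, fn, tp = 30, 10, 5, 35

# Metrics for class 1 (cancer).
cancer_metrics = per_class_metrics(tp=tp, fp=fp, fn=fn)

# Metrics for class 0 (healthy): the roles swap, so true negatives are the
# correct hits, false negatives act as false positives, and vice versa.
healthy_metrics = per_class_metrics(tp=tn, fp=fn, fn=fp)

# Accuracy: correct predictions over all predictions.
accuracy = (tp + tn) / (tp + tn + fp + fn)

print("cancer  (precision, recall, f1):", cancer_metrics)
print("healthy (precision, recall, f1):", healthy_metrics)
print("accuracy:", accuracy)
```

When both classes have equal support, as in the undersampled test set, the macro and weighted averages in the report coincide.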

