# Logistic Regression Model
## Business Problem
Leukemia is a cancer of the blood that often affects young people. Traditionally, pathologists have diagnosed patients by eye, examining blood smear images under a microscope, which is time-consuming and tedious. Image recognition technology has advanced considerably, so automated, computer-based solutions could be of great benefit to the medical community as an aid in cancer diagnosis.

The goal of this project is to address the following question: How can the doctors at the Munich University Hospital automate the diagnosis of patients with leukemia using images from blood smears?

## Approach
In this notebook, I will try a simple logistic regression model on the dataset of flattened images. I will not address class imbalance at this time, but I will examine the model's results at the end.

In [7]:
import os
import pickle
import sys
sys.path.append('..')
from time import time

import cv2 as cv
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

from src.data_setup import make_dataset as md

# %load_ext autoreload
# %autoreload 2
%reload_ext autoreload

## Load Data
Load the pickled training and test data.

In [8]:
X_train, X_test, y_train, y_test = md.load_train_test()

In [9]:
X_train.shape

(14692, 10000)
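
The 10000 feature columns are consistent with 100x100 single-channel images flattened into row vectors. The image dimensions are an assumption here, since `md.load_train_test()` returns arrays that are already flattened. A minimal sketch of such a flattening step on synthetic data (the array names are hypothetical):

```python
import numpy as np

# Hypothetical stand-in for a stack of grayscale images: 5 images of 100x100 pixels.
# The real image size is an assumption inferred from the 10000 feature columns.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(5, 100, 100), dtype=np.uint8)

# Flatten each image into one row so the array has shape (n_samples, n_features)
flat = images.reshape(len(images), -1)

print(flat.shape)  # (5, 10000)
```

Each row of `flat` holds one image in row-major order, which is the layout `LogisticRegression.fit` expects for its feature matrix.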

## Model
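Train a very simple logistic regression model. I will not consider rescaling at this time, either. To keep the training time manageable, the model is fit on only the first `num_rows` rows of the training data.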

In [10]:
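num_rows = 100  # fit on a small subset to keep the training time manageable

# Time the fit of a multinomial logistic regression using all available cores
start = time()
logreg = LogisticRegression(solver='saga', multi_class='multinomial', n_jobs=-1, max_iter=10000)
logreg.fit(X_train[:num_rows, :], y_train[:num_rows])
end = time()

# Report the elapsed time in seconds, or in minutes if it exceeds one minute
elapsed = end - start
time_unit = 'seconds'
if elapsed > 60:
    elapsed = elapsed / 60
    time_unit = 'minutes'
print(f'It took the model {elapsed:0.3f} {time_unit} to run.')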

It took the model 36.678 seconds to run.


Notes (full-dataset estimates assume training time scales linearly with the number of rows):
* RGB:
    * It took 130.384 minutes to train on 1000 rows.
    * At that rate, it would take about 32 hours to train on the full training dataset.
* Grayscale:
    * It took 27.499 minutes to train on 100 rows.
    * It would take about 67 hours to train on the full training dataset.
* Grayscale, rescaled by 50%:
    * It took 2.529 minutes to train on 100 rows.
    * It would take about 6.2 hours to train on the full training dataset.
* Grayscale, rescaled by 25%:
    * It took 36.678 seconds to train on 100 rows.
    * It would take about 1.5 hours to train on the full training dataset.

In [12]:
(((14692/100)*36.678)/60)/60

1.496869933333333

In [18]:
((14692/100)*27.499)/60

67.33588466666666

In [20]:
((14692/10)*1.552)/60

38.00330666666667
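
The extrapolations in the cells above all follow the same pattern: multiply the measured time by the ratio of full-dataset rows to rows used. A small helper makes the units explicit. The function name is hypothetical, and linear scaling is itself an assumption, since the time `saga` needs to converge may not grow linearly with the number of samples.

```python
def extrapolate_hours(elapsed_seconds, rows_used, total_rows=14692):
    """Estimate full-dataset training time in hours, assuming linear scaling in rows."""
    return elapsed_seconds * (total_rows / rows_used) / 3600


# Measured timings from the notes above, converted to seconds
gray_25_hours = extrapolate_hours(36.678, 100)         # grayscale, rescaled by 25%
gray_full_hours = extrapolate_hours(27.499 * 60, 100)  # grayscale, full size

print(f'Grayscale, 25%: {gray_25_hours:0.2f} hours')
print(f'Grayscale, full size: {gray_full_hours:0.2f} hours')
```

Passing every timing in seconds avoids mixing cells that divide by 60 once (minutes to hours) with cells that divide twice (seconds to hours).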



In [5]:
y_pred_train_logreg = logreg.predict(X_train)
y_pred_logreg = logreg.predict(X_test)

In [6]:
print(classification_report(y_train, y_pred_train_logreg))

              precision    recall  f1-score   support

         BAS       0.00      0.00      0.00        63
         EBO       0.00      0.00      0.00        62
         EOS       0.00      0.00      0.00       339
         KSC       0.00      0.00      0.00        12
         LYA       0.00      0.00      0.00         9
         LYT       0.00      0.00      0.00      3150
         MMZ       0.00      0.00      0.00        12
         MOB       0.00      0.00      0.00        21
         MON       0.00      0.00      0.00      1431
         MYB       0.00      0.00      0.00        34
         MYO       0.79      0.03      0.06      2615
         NGB       0.00      0.00      0.00        87
         NGS       0.47      1.00      0.64      6787
         PMB       0.00      0.00      0.00        14
         PMO       0.00      0.00      0.00        56

    accuracy                           0.47     14692
   macro avg       0.08      0.07      0.05     14692
weighted avg       0.35   

  _warn_prf(average, modifier, msg_start, len(result))


In [None]:
print(classification_report(y_test, y_pred_logreg))
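
The training report above shows the model predicting NGS for almost every image: NGS recall is 1.00, and the overall accuracy of 0.47 is close to the NGS share of the training set (about 0.46). One common way to counter this kind of imbalance is to weight classes inversely to their frequency, which is what `class_weight='balanced'` does in scikit-learn using `n_samples / (n_classes * count)`. The sketch below applies that formula to the support counts from the report. It illustrates the weighting only and is not something this notebook has run.

```python
# Support counts copied from the training classification report above
support = {
    'BAS': 63, 'EBO': 62, 'EOS': 339, 'KSC': 12, 'LYA': 9,
    'LYT': 3150, 'MMZ': 12, 'MOB': 21, 'MON': 1431, 'MYB': 34,
    'MYO': 2615, 'NGB': 87, 'NGS': 6787, 'PMB': 14, 'PMO': 56,
}

n_samples = sum(support.values())
n_classes = len(support)

# Accuracy of a baseline that always predicts the largest class
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / n_samples

# Balanced weights: n_samples / (n_classes * count), so rare classes weigh more
balanced_weights = {
    label: n_samples / (n_classes * count) for label, count in support.items()
}

print(f'{majority_class} baseline accuracy: {baseline_accuracy:0.2f}')
for label, weight in sorted(balanced_weights.items(), key=lambda kv: kv[1]):
    print(f'{label}: {weight:0.2f}')
```

With these weights, an error on the rarest class (LYA, 9 images) counts far more than one on NGS, which is why a weighted model is less likely to collapse onto the majority class.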