# Logistic Regression Model
## Business Problem
Leukemia is a cancer of the blood that often affects young people. Traditionally, pathologists diagnose patients by eye after examining blood smear images under a microscope, which is time-consuming and tedious. Image recognition technology has advanced considerably since its inception, so automated, computer-based solutions could be of great benefit to the medical community as an aid in cancer diagnosis.

The goal of this project is to address the following question: How can the doctors at the Munich University Hospital automate the diagnosis of patients with leukemia using images from blood smears?

## Approach
In this notebook, I will fit a simple logistic regression model to the dataset of flattened images. I will not address class imbalance at this time, but will examine the results of the model at the end.

In [1]:
import os
import pickle
import sys
sys.path.append('..')
from time import time

import cv2 as cv
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

from src import constants as con
from src.data_setup import make_dataset as md
from src.models import train_model as tm

%load_ext autoreload
%autoreload 2
# %reload_ext autoreload

## Load Data
Load the pickled training and test data.

In [2]:
X_train, X_test, y_train, y_test = md.load_train_test()

In [3]:
X_train.shape

(123195, 2304)
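Each row of `X_train` is one flattened image. The sketch below shows how a batch of images can be flattened into rows like these. The 48 x 48 image size is an assumption inferred from 2304 = 48 * 48, and the `images` array is a hypothetical stand-in, not the project's data.

```python
import numpy as np

# Hypothetical stand-in for a batch of grayscale images: 5 images of 48 x 48 pixels.
# The 48 x 48 size is an assumption inferred from 2304 = 48 * 48.
images = np.zeros((5, 48, 48), dtype=np.uint8)

# Flatten each image into a single row of pixel values.
X = images.reshape(len(images), -1)
print(X.shape)
```

Each pixel becomes one feature, so the model sees no spatial structure, only 2304 independent intensities per image.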

## Model
Train a very simple logistic regression model. I will not consider rescaling at this time, either.

In [4]:
logreg = LogisticRegression(solver='saga', multi_class='multinomial', n_jobs=-1, max_iter=10000)
num_rows = None  # None keeps every row; -1 would silently drop the last row
tm.run_model(logreg, X_train[:num_rows, :], y_train[:num_rows], X_test, y_test)

model type: LogisticRegression
It took the model 11.443 hours to run.
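The scikit-learn documentation notes that the `saga` solver's fast convergence is only guaranteed when features are on approximately the same scale, which may be one contributor to the long run time. The following is a minimal sketch, on synthetic data rather than the blood smear images, of placing a scaler in front of the same solver and timing the fit. The names `X_demo`, `y_demo`, and `scaled_logreg` are hypothetical, and this is not what `tm.run_model` does internally.

```python
from time import perf_counter

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the blood smear dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_classes=3, random_state=0
)

# Standardize the features, then fit the same solver used in this notebook.
scaled_logreg = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver='saga', max_iter=1000),
)

start = perf_counter()
scaled_logreg.fit(X_demo, y_demo)
elapsed = perf_counter() - start

train_accuracy = scaled_logreg.score(X_demo, y_demo)
print(f'Fit took {elapsed:.3f} seconds; training accuracy {train_accuracy:.2f}')
```

Wrapping the scaler and model in one pipeline means the scaling statistics are learned from the training rows only and reapplied unchanged at prediction time.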


Notes on training time for different image preprocessing settings (full-dataset estimates assume training time scales linearly with the number of rows):
* RGB:
    * It took 130.384 minutes to train on 1000 rows.
    * It would take about 31 hours to train on the full training dataset.
* Gray scale:
    * It took 27.499 minutes to train on 100 rows.
    * It would take 67 hours to train on the full training dataset.
* Gray scale, rescale by 50%:
    * It took 2.529 minutes to train on 100 rows.
    * It would take 6.2 hours to train on the full training dataset.
* Gray scale, rescale by 25%:
    * It took 36.678 seconds to train on 100 rows.
    * It would take 1.5 hours to train on the full training dataset.
    * It actually took 12.83 hours.
* Gray scale, rescale by 12%:
    * It took 8.763 seconds to train on 100 rows.
    * It would take 21 minutes to train on the full training dataset.
    * It actually took 2.20 hours.
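The estimates in these notes can be reproduced with a simple linear extrapolation. The sketch below assumes 14,692 training rows, which matches the support total in the training classification report and reproduces the figures above; `extrapolate_seconds` is a hypothetical helper, not part of the project code.

```python
def extrapolate_seconds(subset_seconds, subset_rows, total_rows):
    """Linear extrapolation: assume training time is proportional to row count."""
    return subset_seconds * total_rows / subset_rows


# Assumption: 14,692 rows, the support total in the training report.
TOTAL_ROWS = 14692

est_25 = extrapolate_seconds(36.678, 100, TOTAL_ROWS)  # gray scale, rescale by 25%
est_12 = extrapolate_seconds(8.763, 100, TOTAL_ROWS)   # gray scale, rescale by 12%

print(f'25%: estimated {est_25 / 3600:.1f} h, actual 12.83 h')
print(f'12%: estimated {est_12 / 60:.0f} min, actual {2.20 * 60:.0f} min')

# How far off the linear assumption was in each case.
factor_25 = 12.83 * 3600 / est_25
factor_12 = 2.20 * 3600 / est_12
print(f'Actual / estimated: {factor_25:.1f}x and {factor_12:.1f}x')
```

In both measured cases the actual run took several times longer than the linear estimate, so timing on 100 rows is at best a lower bound for this solver.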

In [5]:
y_pred_train_logreg = logreg.predict(X_train)
y_pred_logreg = logreg.predict(X_test)

In [6]:
print(classification_report(y_train, y_pred_train_logreg))

              precision    recall  f1-score   support

         BAS       1.00      1.00      1.00        63
         EBO       1.00      0.95      0.98        62
         EOS       1.00      0.93      0.96       339
         KSC       1.00      1.00      1.00        12
         LYA       1.00      1.00      1.00         9
         LYT       0.82      0.78      0.80      3150
         MMZ       1.00      1.00      1.00        12
         MOB       1.00      1.00      1.00        21
         MON       0.85      0.73      0.79      1431
         MYB       1.00      1.00      1.00        34
         MYO       0.84      0.85      0.85      2615
         NGB       1.00      0.99      0.99        87
         NGS       0.88      0.92      0.90      6787
         PMB       1.00      1.00      1.00        14
         PMO       1.00      1.00      1.00        56

    accuracy                           0.86     14692
   macro avg       0.96      0.94      0.95     14692
weighted avg       0.86   
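The gap between the accuracy (0.86) and the macro average (0.95) in the training report above comes from how the two are computed. The macro average is the unweighted mean over the 15 classes, so the many small classes with near-perfect scores dominate it, while accuracy follows the few large classes. A minimal sketch that recomputes the macro f1-score from the rounded per-class values in the report:

```python
# Per-class (f1-score, support) pairs, copied from the training report.
train_report = {
    'BAS': (1.00, 63), 'EBO': (0.98, 62), 'EOS': (0.96, 339),
    'KSC': (1.00, 12), 'LYA': (1.00, 9), 'LYT': (0.80, 3150),
    'MMZ': (1.00, 12), 'MOB': (1.00, 21), 'MON': (0.79, 1431),
    'MYB': (1.00, 34), 'MYO': (0.85, 2615), 'NGB': (0.99, 87),
    'NGS': (0.90, 6787), 'PMB': (1.00, 14), 'PMO': (1.00, 56),
}

# Macro average: every class counts equally, regardless of its support.
macro_f1 = sum(f1 for f1, _ in train_report.values()) / len(train_report)

# Total number of training samples covered by the report.
total_support = sum(n for _, n in train_report.values())

# Share of samples that belong to the four largest classes.
largest = sorted((n for _, n in train_report.values()), reverse=True)[:4]
largest_share = sum(largest) / total_support

print(f'macro f1 = {macro_f1:.2f}, total support = {total_support}')
print(f'four largest classes hold {largest_share:.0%} of the samples')
```

Because the four largest classes hold the vast majority of the samples, their lower scores drive the accuracy, while the macro average is pulled up by the small classes.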

In [7]:
print(classification_report(y_test, y_pred_logreg))

              precision    recall  f1-score   support

         BAS       0.00      0.00      0.00        16
         EBO       0.00      0.00      0.00        16
         EOS       0.03      0.02      0.03        85
         KSC       0.00      0.00      0.00         3
         LYA       0.00      0.00      0.00         2
         LYT       0.56      0.58      0.57       787
         MMZ       0.00      0.00      0.00         3
         MOB       0.00      0.00      0.00         5
         MON       0.27      0.27      0.27       358
         MYB       0.00      0.00      0.00         8
         MYO       0.51      0.55      0.53       653
         NGB       0.00      0.00      0.00        22
         NGS       0.74      0.73      0.73      1697
         PMB       0.00      0.00      0.00         4
         PMO       0.25      0.14      0.18        14

    accuracy                           0.59      3673
   macro avg       0.16      0.15      0.15      3673
weighted avg       0.58   

  _warn_prf(average, modifier, msg_start, len(result))
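The `_warn_prf` lines come from scikit-learn warning that precision is undefined for classes the model never predicted. A minimal sketch, on a tiny made-up label set rather than the real predictions, of how the `zero_division` parameter of `classification_report` makes that choice explicit:

```python
import numpy as np
from sklearn.metrics import classification_report

# Made-up labels for illustration only; 'BAS' is never predicted.
y_true_demo = np.array(['NGS', 'NGS', 'NGS', 'LYT', 'LYT', 'BAS'])
y_pred_demo = np.array(['NGS', 'NGS', 'LYT', 'LYT', 'NGS', 'NGS'])

# zero_division=0 reports undefined precision as 0.0 without emitting a warning.
demo_report = classification_report(
    y_true_demo, y_pred_demo, zero_division=0, output_dict=True
)
print(demo_report['BAS'])
```

The zeros in the test report have the same cause: the rare classes are almost never predicted on unseen data. Passing `class_weight='balanced'` to `LogisticRegression` is one option to try when class imbalance is addressed, although the gap between training and test scores suggests overfitting as well.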
