# Random Forest Model
## Business Problem
Leukemia is a cancer of the blood that often affects young people. Traditionally, pathologists diagnose patients by examining blood smear images under a microscope by eye, which is time-consuming and tedious. Image recognition technology has advanced considerably since its inception, so automated solutions could be of great benefit to the medical community as an aid in cancer diagnosis.

The goal of this project is to address the following question: How can the doctors at the Munich University Hospital automate the diagnosis of patients with leukemia using images from blood smears?

## Approach
In this notebook, I will try a random forest model on the dataset of flattened images.

In [1]:
import os
import pickle
import sys
sys.path.append('..')
from time import time

import cv2 as cv
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

from src import constants as con
from src.data_setup import make_dataset as md
from src.models import train_model as tm

%load_ext autoreload
%autoreload 2
# %reload_ext autoreload

## Load Data
Load the pickled training and test data.

In [2]:
X_train, X_test, y_train, y_test = md.load_train_test()
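`md.load_train_test` is a project helper whose source is not shown in this notebook. As a rough sketch of what such a loader might look like, assuming the four arrays were pickled together as a single tuple (the file name, the tuple layout, and both function names below are hypothetical):

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np


def save_train_test(path, X_train, X_test, y_train, y_test):
    """Hypothetical counterpart: pickle the four arrays as one tuple."""
    with open(path, "wb") as f:
        pickle.dump((X_train, X_test, y_train, y_test), f)


def load_train_test_sketch(path):
    """Hypothetical loader: return (X_train, X_test, y_train, y_test)."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Round-trip some tiny stand-in arrays through a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "train_test.pkl"
    save_train_test(
        path,
        np.zeros((4, 3)),
        np.ones((2, 3)),
        np.array(list("aabb")),
        np.array(list("ab")),
    )
    X_tr, X_te, y_tr, y_te = load_train_test_sketch(path)

print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)
```

The real helper may well store the arrays differently, for example in separate files or under paths defined in `src.constants`.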

In [3]:
X_train.shape

(123195, 2304)
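The 2304 columns would be consistent with 48 x 48 single-channel images flattened row by row (48 * 48 = 2304). That image size is an assumption here, since the preprocessing lives in `src.data_setup` and is not shown. A minimal sketch of the flattening step on stand-in data:

```python
import numpy as np


def flatten_images(images: np.ndarray) -> np.ndarray:
    """Reshape (n_samples, height, width) into (n_samples, height * width)."""
    n_samples = images.shape[0]
    return images.reshape(n_samples, -1)


# Hypothetical stand-in data: 5 random 48x48 grayscale images.
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(5, 48, 48), dtype=np.uint8)

flat = flatten_images(fake_images)
print(flat.shape)  # (5, 2304)
```

Flattening discards the spatial arrangement of the pixels, so the model sees each pixel as an independent feature.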

## Model
Train a baseline random forest with fixed hyperparameters and no tuning. I will not rescale the features at this stage; tree-based models split on thresholds, so rescaling should matter little.
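Skipping rescaling is usually reasonable for tree ensembles, because each split depends on the ordering of feature values rather than their magnitude. The following check on synthetic data (not the blood smear dataset; all names here are illustrative) fits two identically seeded forests, one on raw features and one on features multiplied by a positive constant, and compares their predictions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data, split into train and test portions.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)
X_train_demo, X_test_demo = X[:200], X[200:]
y_train_demo = y[:200]

scale = 8.0  # any positive constant preserves the ordering of feature values

rf_raw = RandomForestClassifier(n_estimators=25, random_state=42)
rf_raw.fit(X_train_demo, y_train_demo)

rf_scaled = RandomForestClassifier(n_estimators=25, random_state=42)
rf_scaled.fit(X_train_demo * scale, y_train_demo)

# Fraction of test samples on which the two forests agree.
agreement = np.mean(rf_raw.predict(X_test_demo) == rf_scaled.predict(X_test_demo * scale))
print(f"fraction of identical predictions: {agreement:.3f}")
```

The two forests should agree on essentially every sample, which suggests that rescaling alone is unlikely to change the results reported here.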

In [9]:
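# max_features='sqrt' matches the former 'auto' setting for classifiers ('auto' was removed in scikit-learn 1.3).
rf = RandomForestClassifier(n_estimators=1600, max_depth=90, min_samples_split=3, min_samples_leaf=1,
                            max_features='sqrt', n_jobs=-1, random_state=42)
# None selects every row; a positive integer trains on a subset (note that -1 would drop the last row).
num_rows = None
tm.run_model(rf, X_train[:num_rows, :], y_train[:num_rows], X_test, y_test)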

model type: RandomForestClassifier
It took the model 10.163 minutes to run.
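`tm.run_model` is a project helper that is not shown in this notebook. Judging only by the printed output, it reports the model type and the fit time; the sketch below is a guess at that behaviour, and the function name, return value, and synthetic data are all assumptions:

```python
from time import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def run_model_sketch(model, X_train, y_train, X_test, y_test):
    """Hypothetical helper: fit the model, report timing, return test accuracy."""
    print(f"model type: {type(model).__name__}")
    start = time()
    model.fit(X_train, y_train)
    minutes = (time() - start) / 60
    print(f"It took the model {minutes:.3f} minutes to run.")
    return model.score(X_test, y_test)


# Small synthetic problem so the sketch runs in well under a second.
X, y = make_classification(n_samples=200, n_features=20, random_state=42)
small_rf = RandomForestClassifier(n_estimators=10, random_state=42)
test_accuracy = run_model_sketch(small_rf, X[:150], y[:150], X[150:], y[150:])
print(f"test accuracy: {test_accuracy:.2f}")
```

The real helper may also compute further metrics on the test set.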




In [10]:
y_pred_train_rf = rf.predict(X_train)
y_pred_rf = rf.predict(X_test)

In [11]:
print(classification_report(y_train, y_pred_train_rf))

              precision    recall  f1-score   support

         BAS       1.00      1.00      1.00      8213
         EBO       1.00      1.00      1.00      8213
         EOS       1.00      1.00      1.00      8213
         KSC       1.00      1.00      1.00      8213
         LYA       1.00      1.00      1.00      8213
         LYT       1.00      1.00      1.00      8213
         MMZ       1.00      1.00      1.00      8213
         MOB       1.00      1.00      1.00      8213
         MON       1.00      1.00      1.00      8213
         MYB       1.00      1.00      1.00      8213
         MYO       1.00      1.00      1.00      8213
         NGB       1.00      1.00      1.00      8213
         NGS       1.00      1.00      1.00      8213
         PMB       1.00      1.00      1.00      8213
         PMO       1.00      1.00      1.00      8213

    accuracy                           1.00    123195
   macro avg       1.00      1.00      1.00    123195
weighted avg       1.00   

In [12]:
print(classification_report(y_test, y_pred_rf))

              precision    recall  f1-score   support

         BAS       0.00      0.00      0.00        16
         EBO       1.00      0.06      0.12        16
         EOS       0.00      0.00      0.00        85
         KSC       0.00      0.00      0.00         3
         LYA       0.00      0.00      0.00         2
         LYT       0.90      0.75      0.82       787
         MMZ       0.00      0.00      0.00         3
         MOB       0.00      0.00      0.00         5
         MON       0.78      0.27      0.40       358
         MYB       0.00      0.00      0.00         8
         MYO       0.64      0.89      0.74       653
         NGB       0.00      0.00      0.00        22
         NGS       0.84      0.98      0.90      1697
         PMB       0.00      0.00      0.00         4
         PMO       0.00      0.00      0.00        14

    accuracy                           0.80      3673
   macro avg       0.28      0.20      0.20      3673
weighted avg       0.77   
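Several rows in the test report show 0.00 precision and recall. In scikit-learn, a precision of 0.00 is reported when a class is never predicted, and the `zero_division` argument of `classification_report` controls that case explicitly. The toy example below (hypothetical labels, not the notebook's data) shows this, and also computes balanced accuracy, which equals the macro-average recall and is less dominated by the large classes than plain accuracy:

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, classification_report

# Toy labels (hypothetical): class "C" occurs in y_true but is never predicted.
y_true = np.array(["A", "A", "A", "A", "B", "B", "C"])
y_pred = np.array(["A", "A", "A", "A", "B", "A", "A"])

# zero_division=0 keeps the 0.00 entries but states the choice explicitly.
report = classification_report(y_true, y_pred, zero_division=0, output_dict=True)
print(report["C"])

# Count how often each class was predicted; a zero explains a 0.00 precision row.
labels = np.unique(y_true)
predicted_counts = {str(label): int(np.sum(y_pred == label)) for label in labels}
print(predicted_counts)

# Balanced accuracy is the mean of per-class recall, so rare classes count equally.
bal_acc = balanced_accuracy_score(y_true, y_pred)
print(round(bal_acc, 3))
```

Applied to `y_test` and `y_pred_rf`, the same counts would show which of the rare cell types the forest never predicts.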

