# XGBoost Model
## Business Problem
Leukemia is a cancer of the blood that often affects young people. Traditionally, pathologists diagnosed patients by examining blood smear images by eye under a microscope, which is time-consuming and tedious. Image recognition technology has advanced considerably since its inception, so automated, computer-based solutions could be of great benefit to the medical community as an aid in cancer diagnosis.

The goal of this project is to address the following question: How can the doctors at the Munich University Hospital automate the diagnosis of patients with leukemia using images from blood smears?

## Approach
In this notebook, I try an XGBoost model on the dataset of flattened images.

In [1]:
import os
import pickle
import sys
sys.path.append('..')
from time import time

import cv2 as cv
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report
from xgboost import XGBClassifier

from src import constants as con
from src.data_setup import make_dataset as md
from src.models import train_model as tm

%load_ext autoreload
%autoreload 2
# %reload_ext autoreload

## Load Data
Load the pickled training and test data.

In [2]:
X_train, X_test, y_train, y_test = md.load_train_test()

In [3]:
X_train.shape

(123195, 2304)
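
The second dimension, 2,304, is the number of pixel features in each flattened image. As a minimal sketch of what flattening looks like, the block below assumes 48 × 48 single-channel images (48 × 48 = 2,304). That image size is an assumption chosen to match the feature count, not something stated in this notebook, and the random pixel data is purely illustrative.

```python
import numpy as np

# Hypothetical stand-in data: 5 random grayscale images.
# ASSUMPTION: 48 x 48 single-channel images, chosen only because 48 * 48 = 2304.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(5, 48, 48), dtype=np.uint8)

# Flatten each image into a single row of pixel features.
X_flat = images.reshape(len(images), -1)
print(X_flat.shape)

# Flattening is lossless: any row can be reshaped back into its image.
restored = X_flat[0].reshape(48, 48)
```

Flattening discards the spatial layout from the model's point of view: each pixel becomes an independent column, which is the form a tree-based model such as XGBoost expects.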

## Encoding Labels
XGBoost requires the labels to be encoded as numerical values, so each class name is mapped to an integer.

In [4]:
label_encodings = {value: i for i, value in enumerate(np.unique(y_train))}

In [5]:
label_encodings

{'BAS': 0,
 'EBO': 1,
 'EOS': 2,
 'KSC': 3,
 'LYA': 4,
 'LYT': 5,
 'MMZ': 6,
 'MOB': 7,
 'MON': 8,
 'MYB': 9,
 'MYO': 10,
 'NGB': 11,
 'NGS': 12,
 'PMB': 13,
 'PMO': 14}

In [6]:
y_train = pd.Series(y_train).replace(label_encodings).values
y_test = pd.Series(y_test).replace(label_encodings).values
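
Because the model now predicts integers, it helps to keep an inverse mapping for turning predictions back into class names. The sketch below rebuilds the same kind of encoding on a tiny made-up label array (the `labels` and `predicted_codes` values are illustrative, not taken from the dataset) and decodes with the inverse dictionary.

```python
import numpy as np

# Illustrative labels only; the real notebook uses y_train.
labels = np.array(["NGS", "LYT", "MYO", "NGS", "BAS"])

# Same construction as above: np.unique returns the sorted distinct labels.
encodings = {str(value): i for i, value in enumerate(np.unique(labels))}

# Inverse mapping from integer code back to class name.
decodings = {i: value for value, i in encodings.items()}

encoded = np.array([encodings[str(v)] for v in labels])

# Hypothetical model output, decoded back to class names.
predicted_codes = np.array([3, 0, 1])
predicted_names = [decodings[int(c)] for c in predicted_codes]
print(encoded, predicted_names)
```

scikit-learn's `LabelEncoder` offers the same round trip through `fit_transform` and `inverse_transform`, if a dictionary feels too manual.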

## Model
Train a simple XGBoost model: 1,000 estimators and otherwise default hyperparameters.

In [9]:
xgb = XGBClassifier(n_estimators=1000)

In [10]:
num_rows = None  # None keeps every row; a stop index of -1 would drop the last training row
tm.run_model(xgb, X_train[:num_rows, :], y_train[:num_rows], X_test, y_test)

model type: XGBClassifier

It took the model 3.195 hours to run.
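
One detail of the row-limit slice in the training call is worth checking: in NumPy, a stop index of `-1` excludes the last row, whereas a stop index of `None` keeps every row. The toy array below is made up purely to illustrate the difference.

```python
import numpy as np

X_demo = np.arange(12).reshape(4, 3)  # 4 rows, 3 columns of toy data

all_but_last = X_demo[:-1, :]  # stop index -1 excludes the final row
all_rows = X_demo[:None, :]    # stop index None means "no limit"
first_two = X_demo[:2, :]      # a positive stop keeps a subset for quick experiments

print(all_but_last.shape, all_rows.shape, first_two.shape)
```

Given the multi-hour training time, a small positive row limit is a convenient way to smoke-test the pipeline before committing to a full run.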



In [11]:
y_pred_train_xgb = xgb.predict(X_train)
y_pred_xgb = xgb.predict(X_test)

In [12]:
print(classification_report(y_train, y_pred_train_xgb))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      8213
           1       1.00      1.00      1.00      8213
           2       1.00      1.00      1.00      8213
           3       1.00      1.00      1.00      8213
           4       1.00      1.00      1.00      8213
           5       1.00      1.00      1.00      8213
           6       1.00      1.00      1.00      8213
           7       1.00      1.00      1.00      8213
           8       1.00      1.00      1.00      8213
           9       1.00      1.00      1.00      8213
          10       1.00      1.00      1.00      8213
          11       1.00      1.00      1.00      8213
          12       1.00      1.00      1.00      8213
          13       1.00      1.00      1.00      8213
          14       1.00      1.00      1.00      8213

    accuracy                           1.00    123195
   macro avg       1.00      1.00      1.00    123195
weighted avg       1.00   

In [13]:
print(classification_report(y_test, y_pred_xgb))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00        16
           1       1.00      0.06      0.12        16
           2       0.83      0.35      0.50        85
           3       0.00      0.00      0.00         3
           4       0.00      0.00      0.00         2
           5       0.89      0.89      0.89       787
           6       0.00      0.00      0.00         3
           7       0.00      0.00      0.00         5
           8       0.67      0.61      0.64       358
           9       0.00      0.00      0.00         8
          10       0.74      0.87      0.80       653
          11       0.00      0.00      0.00        22
          12       0.94      0.97      0.95      1697
          13       0.00      0.00      0.00         4
          14       0.00      0.00      0.00        14

    accuracy                           0.86      3673
   macro avg       0.34      0.25      0.26      3673
weighted avg       0.84   
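
The gap between the test accuracy (0.86) and the macro averages reflects how the two summaries weight classes: the macro average counts every class equally, while the weighted average scales each class by its support. As a small check, the sketch below recomputes both recall averages from the rounded per-class values in the test report above. Because the inputs are rounded to two decimals, the results are approximate.

```python
# Per-class recall and support, copied from the test classification report.
recall = [0.00, 0.06, 0.35, 0.00, 0.00, 0.89, 0.00, 0.00,
          0.61, 0.00, 0.87, 0.00, 0.97, 0.00, 0.00]
support = [16, 16, 85, 3, 2, 787, 3, 5, 358, 8, 653, 22, 1697, 4, 14]

# Macro average: every class counts equally, regardless of size.
macro_recall = sum(recall) / len(recall)

# Weighted average: each class is weighted by its number of test samples.
weighted_recall = sum(r * s for r, s in zip(recall, support)) / sum(support)

print(f"macro recall:    {macro_recall:.2f}")
print(f"weighted recall: {weighted_recall:.2f}")
```

The four largest classes (5, 8, 10 and 12) make up most of the 3,673 test samples, so they dominate the weighted figure, while the many small classes with zero recall pull the macro figure down.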

