# Deep Learning Model: CNN - Reduced Classes
## Business Problem
Leukemia is a cancer of the blood that often affects young people. Traditionally, pathologists have diagnosed patients by eye after examining blood smear images under a microscope, which is time-consuming and tedious. Image recognition technology has advanced considerably, so automated, computer-based solutions could be of great benefit to the medical community as an aid in cancer diagnosis.

The goal of this project is to address the following question: How can the doctors at the Munich University Hospital automate the diagnosis of patients with leukemia using images from blood smears?

## Approach
This notebook reuses the previously built model, but trains it on only a subset of the training data that includes just two classes (a binary problem). With this subset, I can assess whether the model has difficulty with the large class imbalance across all 15 classes.

In [None]:
import sys
sys.path.append('..')
from time import time

from keras import layers
from keras import metrics
from keras import models
from keras.utils import to_categorical
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

from src.data_setup import make_dataset as md
from src.modeling import evaluate_model as em

%matplotlib inline

## Load Data
Load the pickled training and test data.

In [None]:
X_train, X_test, y_train, y_test = md.load_train_test('gray_rescale12')

In [None]:
X_train.shape

In [None]:
pd.Series(y_train).value_counts()

In [None]:
pd.Series(y_test).value_counts()

## Data Preparation
### Unflatten
Unflatten the feature arrays, converting them back into arrays of 2-dimensional images.

In [None]:
def unflatten(X):
    dimension = int(np.sqrt(X.shape[1]))
    return X.reshape((len(X), dimension, dimension, 1))

In [None]:
X_train_unflatten = unflatten(X_train)
X_train_unflatten.shape

In [None]:
X_test_unflatten = unflatten(X_test)
X_test_unflatten.shape
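
The `unflatten` helper assumes that every flattened row holds a square image. A minimal sketch of the same reshape with an explicit check, run on a made-up toy array (the name `unflatten_checked` and the 4x4 image size are hypothetical and not part of the project code):

```python
import numpy as np


def unflatten_checked(X):
    """Reshape (n_samples, side*side) rows into (n_samples, side, side, 1)."""
    n_pixels = X.shape[1]
    side = int(round(np.sqrt(n_pixels)))
    # Fail early with a clear message if the row length is not a perfect square.
    if side * side != n_pixels:
        raise ValueError(f'{n_pixels} pixels per row is not a perfect square.')
    return X.reshape((len(X), side, side, 1))


# Toy data: 3 fake "images" of 4x4 = 16 pixels each.
toy = np.arange(3 * 16).reshape(3, 16)
toy_images = unflatten_checked(toy)
print(toy_images.shape)
```

The trailing dimension of 1 is the single grayscale channel that `Conv2D` expects.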

### Normalize
Normalize the features to values between 0 and 1.

In [None]:
print(f'The maximum value for the training set is {X_train_unflatten.max()}.')
print(f'The maximum value for the test set is {X_test_unflatten.max()}.')

In [None]:
# Use the training-set maximum for both sets so they share the same scale.
train_max = X_train_unflatten.max()
X_train_normalized = X_train_unflatten / train_max
X_test_normalized = X_test_unflatten / train_max

In [None]:
print(f'The maximum value for the normalized training set is {X_train_normalized.max()}.')
print(f'The maximum value for the normalized test set is {X_test_normalized.max()}.')
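
One point worth checking here: the two sets share a pixel scale only if they are divided by the same constant. A minimal sketch with made-up 8-bit values (the arrays below are hypothetical stand-ins for the image data), where the scale is taken from the training data and reused for the test data:

```python
import numpy as np

# Hypothetical 8-bit pixel values standing in for the real image arrays.
train_pixels = np.array([[0, 128, 255], [64, 32, 16]], dtype=np.float32)
test_pixels = np.array([[0, 100, 200]], dtype=np.float32)

# Derive the scale from the training data only, then reuse it for the test data.
scale = train_pixels.max()
train_scaled = train_pixels / scale
test_scaled = test_pixels / scale

print(train_scaled.max(), test_scaled.max())
```

In this toy case the scaled test maximum stays below 1, because the brightest test pixel is darker than the brightest training pixel.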

### Categories
First, encode the labels to integer values.

In [None]:
label_encodings = {value: i for i, value in enumerate(np.unique(y_train))}

In [None]:
label_encodings

In [None]:
pd.Series(y_train).value_counts()

In [None]:
y_train_encoded = pd.Series(y_train).replace(label_encodings).values
y_test_encoded = pd.Series(y_test).replace(label_encodings).values

In [None]:
np.unique(y_train_encoded)

In [None]:
np.unique(y_test_encoded)

Second, encode the integer labels as one-hot vectors.

In [None]:
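# Fix the vector width so both sets get one column per known class,
# even if a class happens to be missing from one of them.
num_classes = len(label_encodings)
y_train_one_hot = to_categorical(y_train_encoded, num_classes=num_classes)
y_test_one_hot = to_categorical(y_test_encoded, num_classes=num_classes)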

In [None]:
y_train_one_hot.shape

In [None]:
y_train_one_hot[0:5, :]

In [None]:
y_test_one_hot.shape

In [None]:
y_test_one_hot[0:5, :]
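
To make the two encoding steps concrete, here is a minimal NumPy-only sketch on made-up labels. The class names are hypothetical, and the identity-matrix lookup is an illustration of what one-hot encoding produces, not a description of how `to_categorical` is implemented:

```python
import numpy as np

# Hypothetical class names; the real ones come from np.unique(y_train).
labels = np.array(['class_a', 'class_c', 'class_a', 'class_b'])
encodings = {str(value): i for i, value in enumerate(np.unique(labels))}

# Step 1: map each label to its integer code.
encoded = np.array([encodings[str(label)] for label in labels])

# Step 2: pick the matching row of an identity matrix for each integer code.
one_hot = np.eye(len(encodings), dtype=np.float32)[encoded]

print(encodings)
print(one_hot)
```

Each row contains a single 1 in the column of its class, so the row sums are all 1.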

## Validation Set
Now that the training data is preprocessed, I will hold out 10% of it as a validation set. This set is used to monitor how the deep learning model is training.

In [None]:
X_train_normalized, X_val, y_train_one_hot, y_val = train_test_split(X_train_normalized, y_train_one_hot, test_size=0.1, random_state=42)
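
With strongly imbalanced classes, a purely random 10% split can leave the rarest classes with few or no validation samples. scikit-learn's `train_test_split` accepts a `stratify` argument for this purpose; the NumPy-only sketch below shows the underlying idea on toy labels. The helper name `stratified_holdout_indices` and the toy data are hypothetical:

```python
import numpy as np


def stratified_holdout_indices(y, val_fraction=0.1, seed=42):
    """Return (train_idx, val_idx) with about val_fraction of every class held out."""
    rng = np.random.default_rng(seed)
    train_idx, val_idx = [], []
    for label in np.unique(y):
        members = rng.permutation(np.flatnonzero(y == label))
        # Hold out at least one sample per class, but always leave at least
        # one behind for training (single-member classes stay in training).
        n_val = min(max(1, int(round(val_fraction * len(members)))), len(members) - 1)
        val_idx.extend(members[:n_val])
        train_idx.extend(members[n_val:])
    return np.array(train_idx), np.array(val_idx)


# Toy integer labels with a 90/10 imbalance.
y_toy = np.array([0] * 90 + [1] * 10)
train_idx, val_idx = stratified_holdout_indices(y_toy)
print(np.bincount(y_toy[val_idx]))
```

The validation set keeps the same 9:1 class ratio as the full toy data.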

## Define Model

In [None]:
input_shape = X_train_unflatten.shape[1:]
print(f'The input shape is {input_shape}.')

In [None]:
model_1 = models.Sequential([
    layers.Conv2D(8, kernel_size=(3, 3), activation='relu', padding='same', input_shape=input_shape),
    layers.MaxPooling2D(pool_size=(2, 2), strides=2),
    layers.Conv2D(16, kernel_size=(7, 7), activation='relu'),
    layers.MaxPooling2D(pool_size=(2, 2), strides=2),
    layers.Flatten(),
    layers.Dense(600, activation='relu'),
    layers.Dense(150, activation='relu'),
    layers.Dense(38, activation='relu'),
    layers.Dense(15, activation='softmax')
])

In [None]:
model_1.summary()
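
The shapes and parameter counts reported by `model_1.summary()` can be cross-checked by hand. The sketch below walks through the same layer stack in plain Python, assuming a hypothetical 48x48x1 input (the actual input shape is the one printed above, and the numbers change with it):

```python
def conv_out(size, kernel, padding):
    """Output side length of a stride-1 convolution."""
    return size if padding == 'same' else size - kernel + 1


def pool_out(size, pool=2, stride=2):
    """Output side length of a max-pooling layer without padding."""
    return (size - pool) // stride + 1


side = 48  # ASSUMPTION: hypothetical 48x48x1 input, used for illustration only.
params = {}

side = conv_out(side, 3, 'same')           # Conv2D(8, 3x3, padding='same')
params['conv_1'] = (3 * 3 * 1 + 1) * 8     # (kernel weights + bias) per filter
side = pool_out(side)                      # MaxPooling2D(2x2, strides=2)
side = conv_out(side, 7, 'valid')          # Conv2D(16, 7x7), default 'valid' padding
params['conv_2'] = (7 * 7 * 8 + 1) * 16
side = pool_out(side)

flat = side * side * 16                    # Flatten
units = [flat, 600, 150, 38, 15]           # Dense layer widths
for i, (n_in, n_out) in enumerate(zip(units[:-1], units[1:]), start=1):
    params[f'dense_{i}'] = (n_in + 1) * n_out

print(flat, params, sum(params.values()))
```

Under this assumption, nearly all of the parameters sit in the first dense layer, which is worth keeping in mind when judging whether the architecture suits the amount of training data.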

### Train Model

In [None]:
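# Categorical accuracy compares the argmax of each prediction with the label.
# metrics.Accuracy() would test exact equality instead, which softmax outputs
# almost never satisfy.
model_1.compile(optimizer='rmsprop',
                loss='categorical_crossentropy',
                metrics=[
                    metrics.categorical_accuracy,
                    metrics.Precision(),
                    metrics.Recall()
                ])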
results_1 = model_1.fit(X_train_normalized, y_train_one_hot, validation_data=(X_val, y_val), epochs=10, batch_size=64)
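
When reading accuracy values in the training log, it helps to keep in mind that Keras's `Accuracy` metric counts exact equality between labels and predictions, while categorical accuracy compares the argmax of each row. A minimal NumPy sketch on made-up values (an illustration of the two definitions, not a call into Keras):

```python
import numpy as np

# Made-up one-hot targets and softmax-like outputs for three samples.
y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=np.float32)
y_prob = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.5, 0.3, 0.2]], dtype=np.float32)

# Exact-match accuracy: fraction of individual entries where y_true == y_prob.
exact_match = np.mean(y_true == y_prob)

# Categorical accuracy: fraction of samples whose largest probability is on the true class.
categorical = np.mean(np.argmax(y_true, axis=1) == np.argmax(y_prob, axis=1))

print(exact_match, categorical)
```

On probability outputs the exact-match value sits at or near zero regardless of how good the predictions are, so categorical accuracy is the figure to read.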

#### Predictions
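Compute the model's predicted class probabilities for the training and test sets. The class label is taken later as the argmax of each row.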

In [None]:
y_pred_train_cnn = model_1.predict(X_train_normalized)
y_pred_cnn = model_1.predict(X_test_normalized)
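
`predict` returns one probability per class for every sample. A minimal sketch of turning such rows back into label names, using a hypothetical encoding and made-up probabilities:

```python
import numpy as np

# Hypothetical encoding and softmax outputs for three samples.
encodings = {'class_a': 0, 'class_b': 1, 'class_c': 2}
probabilities = np.array([[0.1, 0.7, 0.2], [0.6, 0.3, 0.1], [0.2, 0.2, 0.6]])

# Invert the label -> integer mapping so integers can be turned back into names.
decodings = {i: label for label, i in encodings.items()}

# The predicted class is the column with the highest probability in each row.
predicted_codes = np.argmax(probabilities, axis=1)
predicted_labels = [decodings[int(code)] for code in predicted_codes]

print(predicted_labels)
```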

### Evaluate the Model

In [None]:
em.plot_train_val_losses(results_1)

In [None]:
print(classification_report(np.argmax(y_train_one_hot, axis=1), np.argmax(y_pred_train_cnn, axis=1)))

In [None]:
print(classification_report(np.argmax(y_test_one_hot, axis=1), np.argmax(y_pred_cnn, axis=1)))

In [None]:
em.plot_confusion_matrix(y_test_one_hot, y_pred_cnn, label_encodings)
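
For imbalanced data, per-class recall is often more informative than overall accuracy. The sketch below builds a confusion matrix by hand on made-up integer labels and derives recall from it. It is independent of the project's `em.plot_confusion_matrix` helper, whose internals are not shown here:

```python
import numpy as np

# Made-up integer labels for six samples and three classes.
true_codes = np.array([0, 0, 0, 1, 1, 2])
pred_codes = np.array([0, 0, 1, 1, 0, 0])

n_classes = 3
matrix = np.zeros((n_classes, n_classes), dtype=int)
# Rows index the true class, columns the predicted class.
np.add.at(matrix, (true_codes, pred_codes), 1)

# Recall per class: correct predictions divided by the number of true samples.
recall = np.diag(matrix) / matrix.sum(axis=1)

print(matrix)
print(recall)
```

In this toy example the rarest class (index 2) has a recall of 0 even though half of all predictions are correct, which is the pattern to look for in the real matrix.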

### Train Model - Use Weighted Classes
To counter the class imbalance, I will weight the classes inversely to their frequency, so that under-represented classes contribute more to the loss.

In [None]:
class_weights = compute_class_weight('balanced', classes=np.unique(y_train_encoded), y=y_train_encoded)
class_weights_dict = dict(enumerate(class_weights))

In [None]:
class_weights_dict
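
The `'balanced'` option computes each weight as `n_samples / (n_classes * class_count)`. A minimal NumPy sketch of that formula on made-up labels (the toy counts are hypothetical):

```python
import numpy as np

# Toy integer labels with a strong imbalance: 8 / 3 / 1 samples.
y_toy = np.array([0] * 8 + [1] * 3 + [2] * 1)

counts = np.bincount(y_toy)
# 'balanced' heuristic: n_samples / (n_classes * samples_in_class).
weights = len(y_toy) / (len(counts) * counts)
weights_dict = dict(enumerate(weights))

print(weights_dict)
```

With these weights every class contributes the same total weight (count times weight), so the rare class is no longer drowned out by the common one.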

In [None]:
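# Note: recompiling does not reset the weights, so this run continues
# training the model that was already fitted above.
# metrics.Accuracy() is not used because it tests exact equality, which
# softmax outputs almost never satisfy.
model_1.compile(optimizer='rmsprop',
                loss='categorical_crossentropy',
                metrics=[
                    metrics.categorical_accuracy,
                    metrics.Precision(),
                    metrics.Recall()
                ])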
results_2 = model_1.fit(X_train_normalized, y_train_one_hot, validation_data=(X_val, y_val), epochs=50, batch_size=64, class_weight=class_weights_dict)

In [None]:
results_2.history.keys()

In [None]:
em.plot_train_val_losses(results_2)

**Observations:** The erratic training and validation loss curves suggest that the model is unable to learn anything useful from the training data.

## Summary
I created a deep learning model using a convolutional neural network (CNN) to predict the 15 different classes of leukocyte. The model used weighted classes to counter class imbalance. The images in the dataset were rescaled by 12% and converted to grayscale.

After comparing the training and validation loss across epochs, I determined that the model is having difficulty learning anything useful from the data. Several factors could contribute to this poor performance, including:

1. Class imbalance issues.
2. Insufficient features due to rescaled images.
3. Wrong model architecture.

## Future Direction

I will begin with factor 1. My approach will be to select a subset of the data that includes leukocyte morphologies with roughly equal class counts. Then, I will test my model on this subset and evaluate the performance.
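
One possible way to pick such a subset is sketched below: keep only the classes whose counts are within a chosen ratio of the largest class. The helper name `select_balanced_classes`, the `max_ratio` threshold, and the toy labels are all hypothetical choices for illustration:

```python
import numpy as np


def select_balanced_classes(y, max_ratio=2.0):
    """Return the classes whose counts are within max_ratio of the largest class."""
    labels, counts = np.unique(y, return_counts=True)
    order = np.argsort(counts)[::-1]  # most frequent first
    largest = counts[order[0]]
    return [str(labels[i]) for i in order if largest / counts[i] <= max_ratio]


# Toy labels: two large classes and one rare class.
y_toy = np.array(['class_a'] * 50 + ['class_b'] * 40 + ['class_c'] * 3)
kept = select_balanced_classes(y_toy)

# Boolean mask that selects only the samples from the kept classes.
mask = np.isin(y_toy, kept)

print(kept, mask.sum())
```

The same mask can then be applied to the feature array so that images and labels stay aligned.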