# Model Evaluation
## Business Problem
Leukemia is a type of cancer of the blood that often affects young people. In the past, pathologists would diagnose patients by eye after examining blood smear images under the microscope. However, this is time-consuming and tedious. Image recognition technology has come a long way since its inception, so automated, computer-based solutions could be of great benefit to the medical community as an aid in cancer diagnosis.

The goal of this project is to address the following question: How can the doctors at the Munich University Hospital automate the diagnosis of patients with leukemia using images from blood smears?

## Initial Approach
After preprocessing the image data, I chose to evaluate four different models to see how well each could predict leukocyte class from the blood smear images. These were:

* Logistic Regression.
* Random Forest.
* XGBoost.
* Support Vector Classifier (LinearSVC).

During training, the data was modified in the following manner:

* Converted to grayscale.
* Rescaled by 12%.
* Flattened into 1-dimensional arrays.
* Oversampled using bootstrapping to account for class imbalance.

The first two modifications were chosen to improve model performance and reduce training time.
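
The preprocessing steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code from the notebooks: the function names `preprocess` and `bootstrap_oversample` are hypothetical, "rescaled by 12%" is assumed to mean shrinking each side to 12% of its original length, and nearest-neighbor sampling stands in for whatever resizing method was actually used.

```python
import numpy as np

def preprocess(image, scale=0.12):
    """Turn an RGB image of shape (H, W, 3) into a flat, downscaled grayscale vector."""
    # Luminance-weighted grayscale conversion (ITU-R BT.601 weights).
    gray = image[..., :3] @ np.array([0.299, 0.587, 0.114])
    # Assumption: "rescaled by 12%" means each side shrinks to 12% of its length.
    h, w = gray.shape
    new_h, new_w = max(1, round(h * scale)), max(1, round(w * scale))
    # Nearest-neighbor downscaling: keep evenly spaced rows and columns.
    rows = np.linspace(0, h - 1, new_h).round().astype(int)
    cols = np.linspace(0, w - 1, new_w).round().astype(int)
    small = gray[np.ix_(rows, cols)]
    # Flatten into a 1-dimensional feature vector.
    return small.ravel()

def bootstrap_oversample(X, y, seed=0):
    """Resample every class with replacement up to the size of the largest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=target, replace=True)
        for c in classes
    ])
    return X[idx], y[idx]
```

In a setup like this, oversampling would normally be applied to the training split only, so that duplicated images do not leak into the test set.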

### Rescaling
When rescaling the images to improve training time, information is lost during the transformation. Since I am not a domain expert, I am uncertain if the lost information could be important to identifying leukocyte type.

### Analysis
I discovered that all four models strongly overfit the training data and did poorly on the test set. No model was able to give adequate F1 scores for the individual classes. In many cases, the per-class F1 score was zero.
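
To illustrate how a per-class F1 score can drop to zero, here is a small self-contained sketch. The helper `per_class_f1` and the labels `"A"` and `"B"` are hypothetical and are not taken from the notebooks. A class gets an F1 of zero whenever the model never predicts it correctly, which can easily happen for rare classes.

```python
from collections import Counter

def per_class_f1(y_true, y_pred):
    """Compute the F1 score of each class from true and predicted labels."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1   # correct prediction for class t
        else:
            fp[p] += 1   # p was predicted but is wrong
            fn[t] += 1   # t was the true class but was missed
    scores = {}
    for c in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores[c] = 2 * tp[c] / denom if denom else 0.0
    return scores

# Hypothetical imbalanced example: the model always predicts the majority class.
y_true = ["A"] * 8 + ["B"] * 2
y_pred = ["A"] * 10
scores = per_class_f1(y_true, y_pred)
```

In this toy example the overall accuracy is 80%, yet the F1 score for class `"B"` is zero, which is why per-class scores are more informative than accuracy on imbalanced data.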

### Next Steps
Deep Learning: Convolutional neural networks have a proven track record in image classification. I will build a deep learning architecture from scratch and investigate whether this strategy improves model performance.

## Deep Learning

I created a deep learning model using a convolutional neural network (CNN) to predict the 15 different classes of leukocyte. The model used weighted classes to counter class imbalance. The dataset of images was rescaled by 12% and converted to grayscale.
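
One common way to derive class weights is the "balanced" heuristic, where each class is weighted inversely to its frequency. The sketch below shows that heuristic only as an illustration; the helper name `balanced_class_weights` is hypothetical, and the weighting scheme actually used in the notebook may differ.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}

# Hypothetical two-class example with a 90/10 imbalance.
weights = balanced_class_weights([0] * 90 + [1] * 10)
```

Here the rare class receives a weight of 5.0 and the common class a weight of about 0.56, so mistakes on the rare class contribute more to the loss. If the model is built with Keras, a dictionary like this can be passed to `Model.fit` through its `class_weight` argument.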

After examining the training performance by comparing the validation and training loss across epochs, I determined that the model was having difficulty learning anything useful from the data. Several factors could contribute to this poor model performance, including:

1. Class imbalance issues.
2. Insufficient features due to rescaled images.
3. Wrong model architecture.

In the next step, I decided to address factor 1. My approach was to select a subset of the data that included leukocyte morphologies with roughly equal class counts. Then, I tested my model on this subset and evaluated the performance.
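
A subset like this could be built along the following lines. This is only a sketch under assumptions: the helper `subset_classes` is hypothetical, and it trims the chosen classes to exactly equal size, whereas the notebook may simply have picked classes whose counts were already close.

```python
import numpy as np

def subset_classes(X, y, keep, seed=0):
    """Keep only the classes in `keep`, trimmed to the size of the smallest one."""
    rng = np.random.default_rng(seed)
    # Size of the smallest selected class.
    n = int(min(np.sum(y == c) for c in keep))
    # Draw n samples without replacement from each selected class.
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n, replace=False)
        for c in keep
    ])
    # Shuffle so the classes are interleaved.
    rng.shuffle(idx)
    return X[idx], y[idx]
```

For example, `subset_classes(X, y, keep=[0, 1])` would return only samples of classes 0 and 1, with the same number of samples from each.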

After fitting the CNN model to the updated dataset, I discovered that the model was still underfitting. This could indicate that a more complex model is needed to predict leukocyte type.

## Future Work
As future work, I will extend my project to include the following improvements:
* Develop a more complex architecture.
* Explore data augmentation to counter class imbalance.
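
As a rough idea of what data augmentation could look like here, the sketch below applies random flips and 90-degree rotations to a grayscale image. The function `augment` is hypothetical, and it assumes that cell orientation carries no diagnostic meaning, which would need to be confirmed with a domain expert.

```python
import numpy as np

def augment(image, rng):
    """Return a randomly flipped and rotated copy of a 2-D grayscale image."""
    if rng.random() < 0.5:
        image = np.fliplr(image)   # horizontal flip
    if rng.random() < 0.5:
        image = np.flipud(image)   # vertical flip
    k = int(rng.integers(0, 4))    # number of 90-degree rotations
    return np.rot90(image, k)

rng = np.random.default_rng(0)
image = np.arange(16).reshape(4, 4)
augmented = augment(image, rng)
```

Generating several augmented copies of images from the rare classes would increase their share of the training set without repeating identical samples.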

## Notebooks
Since each model took several hours to run, I created separate notebooks for each model. These are:

* [Logistic Regression](https://github.com/dmclark53/Springboard/blob/main/Capstone-Project-Three/notebooks/1.0-dc-logistic-regression.ipynb).
* [Random Forest](https://github.com/dmclark53/Springboard/blob/main/Capstone-Project-Three/notebooks/1.0-dc-random-forest.ipynb).
* [XGBoost](https://github.com/dmclark53/Springboard/blob/main/Capstone-Project-Three/notebooks/1.0-dc-xgboost.ipynb).
* [LinearSVC](https://github.com/dmclark53/Springboard/blob/main/Capstone-Project-Three/notebooks/1.0-dc-linearsvc.ipynb).
* [CNN](https://github.com/dmclark53/Springboard/blob/main/Capstone-Project-Three/notebooks/1.0-dc-deep-learning.ipynb).
* [CNN with Two Classes](https://github.com/dmclark53/Springboard/blob/main/Capstone-Project-Three/notebooks/1.0-dc-cnn-reduced-classes.ipynb).