## Project Design Writeup and Approval 

Alice Nix

### Project Problem & Hypothesis

The Problem:  Clinical diagnosis from a chest x-ray can be challenging and time-consuming (in non-emergency settings), sometimes more difficult than diagnosis via chest CT imaging.  I will try to address this by training a classifier to predict the type of disease from an x-ray image, using deep learning with convolutional neural networks (CNNs).  Ultimately, this kind of model could help clinicians make better diagnostic decisions for patients.  It could also help identify slow changes across a series of chest x-rays that might otherwise be overlooked.

Goal: The hope is that, with this free dataset, academic and research institutions across the country will be able to teach a computer to read and process extremely large numbers of scans, to confirm the results radiologists have found, and potentially to identify findings that may have been overlooked.

The model will classify each chest x-ray image by whether or not it indicates the presence of a nodule or mass, and its accuracy will be measured.  This makes it a binary classification problem.

### Datasets

In [73]:
import pandas as pd

# read in the sample labels as a csv file
labels = pd.read_csv('/Users/alicevnix/Desktop/sample_labels.csv')
labels.head(10)


Unnamed: 0,Image Index,Finding Labels,Follow-up #,Patient ID,Patient Age,Patient Gender,View Position,OriginalImageWidth,OriginalImageHeight,OriginalImagePixelSpacing_x,OriginalImagePixelSpacing_y
0,00000013_005.png,Emphysema|Infiltration|Pleural_Thickening|Pneu...,5,13,060Y,M,AP,3056,2544,0.139,0.139
1,00000013_026.png,Cardiomegaly|Emphysema,26,13,057Y,M,AP,2500,2048,0.168,0.168
2,00000017_001.png,No Finding,1,17,077Y,M,AP,2500,2048,0.168,0.168
3,00000030_001.png,Atelectasis,1,30,079Y,M,PA,2992,2991,0.143,0.143
4,00000032_001.png,Cardiomegaly|Edema|Effusion,1,32,055Y,F,AP,2500,2048,0.168,0.168
5,00000040_003.png,Consolidation|Mass,3,40,068Y,M,PA,2500,2048,0.168,0.168
6,00000042_002.png,No Finding,2,42,071Y,M,AP,3056,2544,0.139,0.139
7,00000057_001.png,No Finding,1,57,071Y,M,AP,3056,2544,0.139,0.139
8,00000061_002.png,Effusion,2,61,077Y,M,PA,2992,2991,0.143,0.143
9,00000061_019.png,No Finding,19,61,077Y,M,AP,3056,2544,0.139,0.139


In [74]:
# Total count or number of records

len(labels)


5606

In [41]:
# Count for each disease category for the feature 'Finding Labels'

labels['Finding Labels'].value_counts()


No Finding                                                          3044
Infiltration                                                         503
Effusion                                                             203
Atelectasis                                                          192
Nodule                                                               144
Pneumothorax                                                         114
Mass                                                                  99
Consolidation                                                         72
Effusion|Infiltration                                                 69
Pleural_Thickening                                                    65
Atelectasis|Infiltration                                              57
Atelectasis|Effusion                                                  55
Cardiomegaly                                                          50
Infiltration|Nodule                                

The data consist of a random sample of 5,606 images and labels from the National Institutes of Health (NIH) Chest X-ray Dataset, obtained from Kaggle and about 2.1 GB in size.  Sample.zip contains the 5,606 images, and the sample labels come in a CSV file.

The original dataset, "ChestX-ray8", comprises 108,948 frontal-view x-ray images of 32,717 unique patients, collected from 1992 to 2015, with eight common disease labels mined from the text of the radiology reports via Natural Language Processing (NLP).  The labels in the sample used here span 15 classes (14 diseases, and one for "No Finding") before cleaning the data and turning this into a binary classification problem.
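
Because `Finding Labels` packs several diseases into one pipe-separated string, `value_counts()` counts label combinations rather than individual diseases. A minimal sketch of per-disease counting follows, assuming the column format shown in the `labels.head(10)` output; the small DataFrame is stand-in data copied from that output, and `per_disease_counts` is a hypothetical helper name.

```python
import pandas as pd

# Stand-in for the real labels DataFrame: a few rows copied from the sample output.
labels = pd.DataFrame({
    "Image Index": ["00000013_026.png", "00000017_001.png", "00000030_001.png",
                    "00000032_001.png", "00000040_003.png", "00000061_002.png"],
    "Finding Labels": ["Cardiomegaly|Emphysema", "No Finding", "Atelectasis",
                       "Cardiomegaly|Edema|Effusion", "Consolidation|Mass", "Effusion"],
})

def per_disease_counts(df: pd.DataFrame) -> pd.Series:
    """Count how many images mention each individual finding."""
    # Split "A|B|C" into ["A", "B", "C"], then give every finding its own row.
    return df["Finding Labels"].str.split("|").explode().value_counts()

counts = per_disease_counts(labels)
print(counts)
```

Run on the full CSV, this would show how many images are available per disease, which is the number that matters when deciding which classes have enough examples.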

Article discussing the dataset can be found here:  https://www.nih.gov/news-events/news-releases/nih-clinical-center-provides-one-largest-publicly-available-chest-x-ray-datasets-scientific-community

### Domain Knowledge

After some research, I found information on the differences between a lung nodule and a mass:  https://my.clevelandclinic.org/health/diseases/14799-pulmonary-nodules

Definition of disease classes:  A pulmonary nodule is a small round or oval-shaped growth in the lung.  Nodules are smaller than three centimeters (around 1.2 inches) in diameter.  A larger growth is called a pulmonary mass and is more likely than a nodule to represent cancer.

Over 90% of pulmonary nodules that are smaller than two centimeters in diameter are benign.  Being able to differentiate between a mass and a nodule using deep-learning-based computer vision could greatly reduce cost and increase accuracy compared to shallow methods of detection alone.

Example project and code:  https://blog.athelas.com/classifying-white-blood-cells-with-convolutional-neural-networks-2ca6da239331


### Project Concerns

It might be best to keep this as a multi-class classification problem but with fewer classes than the original data in order to maintain a sufficient sample size.

Deep learning requires a massive amount of training data, as the accuracy of a deep learning classifier depends largely on the quality and size of the dataset.

As an alternative, and as far as identifying an appropriate classifier goes, I could simply separate the data based on which x-rays show lung nodules and which x-rays do not.  
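
One way to build such a binary target from the `Finding Labels` column is sketched below. It assumes an image counts as positive when either "Nodule" or "Mass" appears among its findings (shrinking the set to `{"Nodule"}` gives the nodule-only variant); the DataFrame is stand-in data, and `TARGET_FINDINGS` and `has_target` are hypothetical names.

```python
import pandas as pd

# Stand-in rows; the real DataFrame would come from sample_labels.csv.
labels = pd.DataFrame({
    "Finding Labels": ["Consolidation|Mass", "No Finding", "Infiltration|Nodule",
                       "Atelectasis", "Cardiomegaly|Emphysema"],
})

# Assumption: positive means "Nodule" or "Mass" is among the findings.
TARGET_FINDINGS = {"Nodule", "Mass"}

def has_target(finding_labels: str) -> int:
    """Return 1 if any target finding is in the pipe-separated label string, else 0."""
    return int(bool(TARGET_FINDINGS & set(finding_labels.split("|"))))

labels["target"] = labels["Finding Labels"].apply(has_target)
print(labels["target"].value_counts())
```

Splitting on `|` before matching avoids accidental substring matches and handles images that carry several findings at once.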

How do I decompress a zip file and import its contents into Python?  Is it different from importing an xlsx or csv file?
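
It is different: a zip file is a container, so it has to be unpacked (or opened member by member) before the individual files can be read, whereas a CSV can be passed directly to `pd.read_csv`. A minimal sketch using Python's standard `zipfile` module follows; `extract_archive` is a hypothetical helper and the paths in the usage comment are placeholders.

```python
import zipfile
from pathlib import Path

def extract_archive(zip_path: str, dest_dir: str) -> list[str]:
    """Unpack zip_path into dest_dir and return the names of the PNG members."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        # Keep only image files; archives often contain other entries too.
        return sorted(n for n in archive.namelist() if n.lower().endswith(".png"))

# Hypothetical usage (placeholder paths):
# png_names = extract_archive("sample.zip", "data/sample")
```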

When building a network architecture I am not sure about two things:  1) how many layers to use, and 2) how many hidden units to choose for each layer, and how to determine what is best given the data.
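
There is no formula for the best depth or width; in practice these are usually chosen by comparing candidate architectures on the validation set. One constraint can be checked up front, though: each convolution and pooling step shrinks the image, which caps how many blocks can be stacked. The sketch below assumes blocks of one unpadded 3x3 convolution (stride 1) followed by 2x2 max pooling, and a 150x150 input; both are illustrative assumptions, and `conv_stack_sizes` is a hypothetical helper.

```python
def conv_stack_sizes(input_size: int, n_blocks: int,
                     kernel: int = 3, pool: int = 2) -> list[int]:
    """Spatial size after each block of one unpadded conv (stride 1) plus one max-pool."""
    sizes = []
    size = input_size
    for _ in range(n_blocks):
        size = size - kernel + 1   # unpadded ("valid") convolution, stride 1
        size = size // pool        # non-overlapping max pooling
        if size < 1:
            break                  # the feature map has vanished; stop stacking
        sizes.append(size)
    return sizes

print(conv_stack_sizes(150, 4))  # [74, 36, 17, 7]
```

With these assumptions, four blocks take a 150x150 image down to 7x7, so there is little room for a fifth block without a larger input or padding.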

During the preprocessing phase, will I need to turn the images into NumPy arrays?
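
Keras models take NumPy arrays (or tensors) as input, so the images do need converting at some point. A minimal sketch follows, assuming the Pillow library is used for reading and a 128x128 grayscale target size; neither choice comes from the dataset, and `load_image` and `load_batch` are hypothetical helpers.

```python
import numpy as np
from PIL import Image

def load_image(path: str, size: tuple[int, int] = (128, 128)) -> np.ndarray:
    """Load one image as a float32 array of shape (size[1], size[0], 1) scaled to [0, 1]."""
    with Image.open(path) as img:
        # Pillow's resize takes (width, height); the array comes out as (height, width).
        img = img.convert("L").resize(size)
        pixels = np.asarray(img, dtype=np.float32) / 255.0
    return pixels[..., np.newaxis]  # add a channel axis for the CNN

def load_batch(paths: list[str], size: tuple[int, int] = (128, 128)) -> np.ndarray:
    """Stack several images into one array of shape (n, height, width, 1)."""
    return np.stack([load_image(p, size) for p in paths])
```

Resizing matters here because the original images are large (for example 2500x2048), and loading thousands of them at full resolution would likely exhaust memory.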

I have a fairly good idea that during the compilation step my code will look as follows given my problem statement: 

network.compile(optimizer='rmsprop',
                loss='binary_crossentropy',
                metrics=['accuracy'])
                
* If I keep this as a multi-class classification problem instead, I will choose categorical_crossentropy as the loss function.
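
For reference, the two losses named above can be written out as follows, where $N$ is the number of examples, $y$ the true label, $\hat{p}$ the predicted probability, and $C$ the number of classes (the symbols are my notation, not taken from the Keras documentation):

```latex
\mathcal{L}_{\text{binary}}
  = -\frac{1}{N}\sum_{i=1}^{N}\Bigl[\, y_i \log \hat{p}_i + (1 - y_i)\log\bigl(1 - \hat{p}_i\bigr) \Bigr]
\qquad
\mathcal{L}_{\text{categorical}}
  = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\,\log \hat{p}_{i,c}
```

The binary form pairs with a single sigmoid output unit, while the categorical form pairs with a softmax output over $C$ classes and one-hot encoded labels.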

During the fit step, when the neural network iterates over the training data in mini-batches for a number of epochs (each epoch being one iteration over all of the training data), I am not sure what values to use in order to prevent overfitting, e.g., epochs=5, batch_size=150.  I need to avoid over-optimizing on the training data.

Project limitation:  The image labels are NLP-extracted, so there could be some erroneous labels, but the labeling accuracy is estimated to be >90%.
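
A common way to pick the number of epochs is to watch the validation loss and stop once it no longer improves (early stopping); Keras offers an `EarlyStopping` callback for this. The plain-Python sketch below only illustrates the logic, using made-up loss values; `best_epoch` and the `patience` default are hypothetical.

```python
def best_epoch(val_losses: list[float], patience: int = 2) -> int:
    """Return the 1-based epoch with the lowest validation loss, stopping the
    scan once the loss has failed to improve for `patience` epochs in a row."""
    best_loss = float("inf")
    best = 0
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # validation loss has stalled; later epochs are not examined
    return best

# Made-up validation losses, for illustration only.
print(best_epoch([0.69, 0.61, 0.58, 0.60, 0.63, 0.55]))  # 3
```

Here training would stop after epoch 5 and keep the weights from epoch 3, even though a later epoch happens to score lower; a larger `patience` trades extra training time for a chance to catch such late improvements.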

### Outcomes

I hope to achieve at least 80% accuracy.  In other words, the trained classifier should predict the correct class for at least 80% of the images it is evaluated on.
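
An accuracy target is only meaningful relative to a baseline. In this sample more than half of the images are "No Finding" (3,044 of 5,606), and images with a nodule or mass appear to be a small minority, so a model that always predicts the negative class could already score well. A small sketch of that baseline, using made-up labels (`majority_baseline_accuracy` is a hypothetical helper):

```python
from collections import Counter

def majority_baseline_accuracy(targets: list[int]) -> float:
    """Accuracy obtained by always predicting the most common class."""
    if not targets:
        raise ValueError("targets must not be empty")
    most_common_count = Counter(targets).most_common(1)[0][1]
    return most_common_count / len(targets)

# Made-up labels: 1 positive among 10 images.
print(majority_baseline_accuracy([0, 0, 0, 1, 0, 0, 0, 0, 0, 0]))  # 0.9
```

If the real baseline turns out to be above 80%, it may be worth also tracking a metric that reflects the positive class, such as recall.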

In order to become more confident with training neural networks I will go through several practice problems in the book, 'Deep Learning with Python' by Francois Chollet. 

Splitting the data into a training set, validation set, and a test set should help to achieve success during this project.
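
One detail worth handling in the split: the same patient can appear in several rows (Patient ID 13 and 61 both occur twice in the first ten rows), so a purely random split could put images of one patient in both the training and test sets. The sketch below splits by patient instead; the 60/20/20 proportions and the name `split_by_patient` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def split_by_patient(df: pd.DataFrame, val_frac: float = 0.2, test_frac: float = 0.2,
                     seed: int = 0) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """Split rows into train/validation/test so that no patient spans two sets."""
    rng = np.random.default_rng(seed)
    # Shuffle the unique patient IDs, then carve off the test and validation groups.
    patients = rng.permutation(df["Patient ID"].unique())
    n_test = int(round(len(patients) * test_frac))
    n_val = int(round(len(patients) * val_frac))
    in_test = df["Patient ID"].isin(patients[:n_test])
    in_val = df["Patient ID"].isin(patients[n_test:n_test + n_val])
    return df[~in_test & ~in_val], df[in_val], df[in_test]

# Patient IDs copied from the first ten rows of the sample labels.
demo = pd.DataFrame({"Patient ID": [13, 13, 17, 30, 32, 40, 42, 57, 61, 61]})
train, val, test = split_by_patient(demo)
print(len(train), len(val), len(test))
```

The fractions apply to patients rather than images, so the resulting image counts will only approximately match 60/20/20.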