# Chest X-Ray Medical Diagnosis with Deep Learning

## Table of Contents

- [1. Import Packages](#1)
- [2. Load the Datasets](#2)
    - [2.1 Loading the Data](#2-1)
    - [2.2 Preventing Data Leakage](#2-2)
        - [Check for leakage](#check)

<a name='1'></a>
## 1. Import Packages

In [8]:
#import packages
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from keras.preprocessing.image import ImageDataGenerator
from keras.applications.densenet import DenseNet121
from keras.layers import Dense, GlobalAveragePooling2D
from keras.models import Model
from keras import backend as K

from keras.models import load_model

import tensorflow as tf
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)

<a name='2'></a>
## 2. Load the Datasets

We will use the [ChestX-ray8 dataset](https://arxiv.org/abs/1705.02315). This dataset contains 108,948 frontal-view X-ray images of 32,717 unique patients.
- Each image in the dataset contains multiple text-mined labels identifying 14 different pathological conditions.
- These can be used by physicians to diagnose 8 different diseases.
- We will use this data to develop a single model that will provide binary classification predictions for each of the 14 labeled pathologies.
- It will predict 'positive' or 'negative' for each of the pathologies.
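
To make the "one model, one binary prediction per pathology" idea concrete, here is a minimal sketch of turning per-pathology probabilities into positive/negative calls. The probabilities, the helper name `to_binary_predictions`, and the 0.5 threshold are all illustrative assumptions; the text above does not specify how predictions are thresholded.

```python
import numpy as np

# Hypothetical model outputs: one probability per pathology per image.
# Only 3 of the 14 pathologies are shown to keep the example short.
example_labels = ['Cardiomegaly', 'Effusion', 'Mass']
probs = np.array([[0.91, 0.12, 0.40],
                  [0.05, 0.77, 0.50]])

def to_binary_predictions(probabilities, threshold=0.5):
    """Return 1 (positive) where probability >= threshold, else 0 (negative).

    The 0.5 default is an assumption made for illustration only.
    """
    return (np.asarray(probabilities) >= threshold).astype(int)

preds = to_binary_predictions(probs)

# Each image gets an independent decision for every pathology,
# so a single image can be positive for several conditions at once.
for row in preds:
    print(dict(zip(example_labels, row.tolist())))
```

Because the labels are not mutually exclusive, each column is thresholded independently rather than picking a single most likely class.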

The entire dataset can be downloaded [here](https://nihcc.app.box.com/v/ChestXray-NIHCC).
- For now, we will use a subset of ~1000 images.

The dataset includes a CSV file that provides the labels for each X-ray. Here, we use three CSV files:

1. `nih/train-small.csv`: 875 images from our dataset to be used for training.
2. `nih/valid-small.csv`: 109 images from our dataset to be used for validation.
3. `nih/test.csv`: 420 images from our dataset to be used for testing.

<a name='2-1'></a>
### 2.1 Loading the Data

In [11]:
train_df = pd.read_csv("data/nih/train-small.csv")
valid_df = pd.read_csv("data/nih/valid-small.csv")

test_df = pd.read_csv("data/nih/test.csv")

train_df.head()

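| Unnamed: 0 | Image | Atelectasis | Cardiomegaly | Consolidation | Edema | Effusion | Emphysema | Fibrosis | Hernia | Infiltration | Mass | Nodule | PatientId | Pleural_Thickening | Pneumonia | Pneumothorax |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 00008270_015.png | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8270 | 0 | 0 | 0 |
| 1 | 00029855_001.png | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 29855 | 0 | 0 | 0 |
| 2 | 00001297_000.png | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1297 | 1 | 0 | 0 |
| 3 | 00012359_002.png | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 12359 | 0 | 0 | 0 |
| 4 | 00017951_001.png | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 17951 | 0 | 0 | 0 |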


In [12]:
# Names of the 14 pathology columns; spellings must match the CSV headers exactly
labels = ['Cardiomegaly', 'Emphysema', 'Effusion', 'Hernia', 'Infiltration',
          'Mass', 'Nodule', 'Atelectasis', 'Pneumothorax', 'Pleural_Thickening',
          'Pneumonia', 'Fibrosis', 'Edema', 'Consolidation']
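
Assuming the names in `labels` are meant to match the CSV column names, a quick sanity check can catch a misspelled label before it causes a `KeyError` later on. The sketch below uses a small made-up DataFrame (`toy_df`) and a hypothetical helper (`missing_labels`) rather than the real data, and also counts positive cases per label.

```python
import pandas as pd

# Toy stand-in for train_df; values are invented and only two label columns are shown.
toy_df = pd.DataFrame({
    'Image': ['a.png', 'b.png', 'c.png'],
    'PatientId': [1, 2, 2],
    'Effusion': [0, 1, 1],
    'Infiltration': [0, 1, 0],
})
toy_labels = ['Effusion', 'Infiltration']

def missing_labels(df, label_names):
    """Return the label names that are not columns of df (hypothetical helper)."""
    return [name for name in label_names if name not in df.columns]

# An empty list means every label can be used to select a column.
missing = missing_labels(toy_df, toy_labels)
print("Missing label columns:", missing)

# A misspelled name is reported instead of failing later on column access.
print("With a typo:", missing_labels(toy_df, ['Efusion']))

# Number of positive cases per label in the toy data.
positives = toy_df[toy_labels].sum()
print(positives)
```

On the real data, the same call would be `missing_labels(train_df, labels)`.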

<a name='2-2'></a>
### 2.2 Preventing Data Leakage
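Our dataset contains multiple images for each patient, since a patient may have had several X-rays taken at different times during their hospital visits. The data is split at the patient level, so that no patient appears in more than one of the train, validation, and test datasets. Otherwise, images of the same patient could "leak" between the sets, and evaluation results would look better than the model's true performance on unseen patients.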
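
The text does not say how the provided split was produced. As an illustration only, one common way to split at the patient level is to shuffle the unique patient IDs and assign whole patients to a set. The function name `split_by_patient`, the `valid_frac` parameter, and the toy data below are assumptions for this sketch.

```python
import numpy as np
import pandas as pd

def split_by_patient(df, patient_col, valid_frac=0.2, seed=0):
    """Split df so that all rows of a given patient land in the same subset.

    Illustrative sketch: whole patients (not individual images) are assigned
    to the validation set.
    """
    rng = np.random.default_rng(seed)
    # Shuffle the unique patient IDs, not the image rows.
    patient_ids = rng.permutation(df[patient_col].unique())
    n_valid = int(round(len(patient_ids) * valid_frac))
    valid_ids = patient_ids[:n_valid]
    is_valid = df[patient_col].isin(valid_ids)
    return df[~is_valid], df[is_valid]

# Toy data: 5 patients, some with more than one image.
toy_df = pd.DataFrame({
    'Image': [f'img_{i}.png' for i in range(7)],
    'PatientId': [1, 1, 2, 3, 3, 4, 5],
})

toy_train, toy_valid = split_by_patient(toy_df, 'PatientId')
shared = set(toy_train['PatientId']) & set(toy_valid['PatientId'])
print("Patients in both sets:", shared)
```

Splitting rows at random instead would risk placing two images of the same patient on opposite sides of the split.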

<a name='check'></a>
### Check for leakage
We will write a function to check whether there is leakage between two datasets. We will use it to make sure that no patient in the test set is also present in either the train or validation set.
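
A minimal sketch of such a check, assuming leakage means that at least one patient ID appears in both DataFrames. The function name `check_for_leakage` and its signature are illustrative choices, and the toy DataFrames stand in for the real splits.

```python
import pandas as pd

def check_for_leakage(df1, df2, patient_col):
    """Return True if any patient ID appears in both df1 and df2."""
    df1_patients = set(df1[patient_col].values)
    df2_patients = set(df2[patient_col].values)
    # Leakage exists when the two sets of patient IDs overlap.
    patients_in_both = df1_patients & df2_patients
    return len(patients_in_both) > 0

# Toy examples (made-up patient IDs).
toy_a = pd.DataFrame({'PatientId': [0, 1, 2]})
toy_b = pd.DataFrame({'PatientId': [2, 3, 4]})   # shares patient 2 with toy_a
toy_c = pd.DataFrame({'PatientId': [5, 6]})      # shares nothing with toy_a

leak_ab = check_for_leakage(toy_a, toy_b, 'PatientId')
leak_ac = check_for_leakage(toy_a, toy_c, 'PatientId')
print("toy_a vs toy_b:", leak_ab)
print("toy_a vs toy_c:", leak_ac)
```

With the real data, the check would be run on the pairs `(train_df, test_df)` and `(valid_df, test_df)` using the `PatientId` column.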