# Identifying Disease with Deep Learning

Neural networks have shown incredible capabilities in image classification, the task of labeling an image based on its content. While most of the initial work in this space focused on identifying common objects (cars, plants, refrigerators, dogs, and cats), that proof of concept is only a stepping stone toward classifying images in ways that can have a dramatic impact on society.

One industry that will be transformed through the use of neural networks is medicine. The ability of computers to classify disease based on xray/CT/MRI images will make medicine more efficient, more economical, and more accurate.

## Building a Neural Network to Identify Disease in Chest Xrays

We'll use a common neural network topology (ResNet-50) to identify pneumonia, emphysema, and other thoracic conditions in chest xray images. For this exercise, we will use Keras, a high-level API on top of TensorFlow that provides a simpler way to describe neural networks.

## Preamble

The first thing we'll do is import the Python packages we will need to build and train our neural network: 

In [1]:
import os
import pickle
import numpy as np
import tensorflow
import tensorflow.keras as keras
import time
import PIL.Image as pil
import PIL.ImageOps

tensorflow.logging.set_verbosity(tensorflow.logging.ERROR)

## Setting Hyperparameters and Global Variables

Setting all of the hyperparameters and global variables at the beginning will make it easier for us to change the experiment later. Typically a data scientist will spend a fair amount of time tuning variables and retraining the network in order to achieve the highest possible accuracy from the trained model.

In [2]:
EXPERIMENT_OUTPUT_PATH = os.environ.get('HOME') + "/notebooks/output/experiment"
IMAGES_PATH = os.environ.get('HOME') + "/notebooks/images_all"
TRAINING_LABELS = os.environ.get('HOME') + "/notebooks/training_labels_new.pkl"
VALIDATION_LABELS = os.environ.get('HOME') + "/notebooks/validation_labels_new.pkl"

IMAGE_SIZE = 256
BATCH_SIZE = 16
NUM_EPOCHS = 1
LEARNING_RATE = 0.001


## Helper Functions

In order to train the neural network on chest xray images, we need some helper functions that:
* attach labels to training and validation images
* resize images to fit in the neural network
* load batches of training images for training the network
* load batches of validation images for determining the network's accuracy

In [3]:
with open(TRAINING_LABELS, 'rb') as f:
  training_labels = pickle.load(f)
training_files = np.asarray(list(training_labels.keys()))

with open(VALIDATION_LABELS, 'rb') as f:
  validation_labels = pickle.load(f)
validation_files = np.asarray(list(validation_labels.keys()))
labels = dict(list(training_labels.items()) + list(validation_labels.items()))

def load_batch(batch_of_files,is_training=False):
  batch_images = []
  batch_labels = []
  for filename in batch_of_files:
    img = pil.open(os.path.join(IMAGES_PATH, filename))
    img = img.convert('RGB')
    img = img.resize((IMAGE_SIZE, IMAGE_SIZE),pil.NEAREST)
    if is_training and np.random.randint(2):
      img = PIL.ImageOps.mirror(img)
    batch_images.append(np.asarray(img))
    batch_labels.append(labels[filename])
  return keras.applications.resnet50.preprocess_input(np.float32(np.asarray(batch_images))), np.asarray(batch_labels)

def train_generator(num_of_steps):
  while True:
    np.random.shuffle(training_files)
    for i in range(num_of_steps):
      batch_of_files = training_files[i*BATCH_SIZE: i*BATCH_SIZE + BATCH_SIZE]
      batch_images, batch_labels = load_batch(batch_of_files, True)
      yield batch_images, batch_labels

def val_generator(num_of_steps):
  while True:
    np.random.shuffle(validation_files)
    for i in range(num_of_steps):
      batch_of_files = validation_files[i*BATCH_SIZE: i*BATCH_SIZE + BATCH_SIZE]
      # Validation batches are not augmented, so is_training stays False.
      batch_images, batch_labels = load_batch(batch_of_files, False)
      yield batch_images, batch_labels
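
The slicing used by both generators can be checked in isolation. The sketch below is a minimal illustration, not part of the notebook's pipeline: `iter_full_batches` and the file names are hypothetical, and plain lists stand in for the NumPy arrays of file names. It shows that integer division yields only full batches, so any leftover files are skipped for that pass.

```python
def iter_full_batches(files, batch_size):
    """Yield full batches with the same slicing the generators use."""
    num_of_steps = len(files) // batch_size  # integer division drops the remainder
    for i in range(num_of_steps):
        yield files[i * batch_size: i * batch_size + batch_size]


# Hypothetical file names, used only for illustration.
demo_files = ["img_{:03d}.png".format(i) for i in range(10)]
demo_batches = list(iter_full_batches(demo_files, batch_size=4))

batch_sizes_seen = [len(batch) for batch in demo_batches]
files_skipped = len(demo_files) - sum(batch_sizes_seen)

print(batch_sizes_seen, files_skipped)  # [4, 4] 2
```

Because `train_generator` reshuffles the file list at the start of every pass, a different set of leftover files is skipped each time.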

## Building the Neural Network

Now we can build the neural network. We will start by using a ResNet-50 topology pretrained on the ImageNet dataset. We use a pretrained model in order to reduce the time to train. This concept - called _transfer learning_ - is broadly applicable to many use cases, especially in image classification.

Keras allows us to generate a standard topology (in this case ResNet-50), populate it with pretrained weights, and remove the classification layer so that we can attach a new classifier for our own needs.

In this case we don't need a classifier for 1000 different common objects. We need a classifier for 14 different thoracic pathologies.

In [4]:
base_model = keras.applications.ResNet50(include_top=False, weights='imagenet', input_shape=(IMAGE_SIZE,IMAGE_SIZE,3))
feature_extractors = keras.layers.GlobalAvgPool2D(data_format='channels_last')(base_model.output)
predictions = keras.layers.Dense(14, activation='sigmoid', bias_initializer='ones')(feature_extractors)

model = keras.Model(inputs=base_model.input, outputs=predictions)
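
The final `Dense` layer uses a sigmoid activation rather than a softmax, presumably because a single chest xray can show more than one pathology. With sigmoid, each of the 14 outputs is an independent probability, so the values need not sum to 1. The sketch below illustrates this with made-up scores; the label names, the raw scores, and the 0.5 decision threshold are all assumptions for illustration.

```python
import math


def sigmoid(z):
    """Map a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical raw scores for three of the 14 output units.
raw_scores = {"pathology_a": 2.0, "pathology_b": -1.0, "pathology_c": 0.5}
probabilities = {name: sigmoid(z) for name, z in raw_scores.items()}

# Each label is decided on its own; 0.5 is an assumed threshold.
predicted = [name for name, p in probabilities.items() if p >= 0.5]
total_probability = sum(probabilities.values())

print(predicted)  # ['pathology_a', 'pathology_c']
```

Here two labels are predicted for the same image, which a softmax classifier (one winner per image) could not express.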



## Compiling the Neural Network

Next, we compile the model, attaching a loss function, an optimizer, and the metrics to report during training. In this case, we use:
* Binary cross-entropy loss, which scores each of the 14 outputs as an independent yes/no prediction
* Adam optimizer - this optimizer works well for quickly optimizing networks (although it may not scale as well to parallelized training)
* Accuracy as the reported metric (the optimizer minimizes the loss; accuracy is only monitored)

In [5]:
model.compile(
    loss='binary_crossentropy',
    optimizer=keras.optimizers.Adam(lr=LEARNING_RATE),
    metrics=['accuracy'])
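
As a rough illustration of what the `binary_crossentropy` loss measures, the pure-Python sketch below computes it for one example with three labels. This is not the Keras implementation: the function, the sample values, and the clipping constant `eps` are assumptions chosen for illustration.

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over the labels of one example."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)


# Hypothetical ground truth and two sets of predictions.
y_true = [1, 0, 0]
loss_when_right = binary_crossentropy(y_true, [0.9, 0.1, 0.1])
loss_when_wrong = binary_crossentropy(y_true, [0.1, 0.9, 0.9])

print(round(loss_when_right, 3), round(loss_when_wrong, 3))  # 0.105 2.303
```

Confident wrong predictions are penalized far more heavily than confident correct ones are rewarded, which is what drives the weights toward better predictions during training.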

## Train the Disease Identification Model

Now we can 'fit' the compiled neural network to the training data. This will produce a _model_ (or a trained neural network) which can accurately identify diseases such as pneumonia and emphysema.

In [None]:
# Filename template for saved weights (unused while the callbacks list is empty).
weights_file = EXPERIMENT_OUTPUT_PATH + '/lr_{:.3f}_bz_{:d}'.format(LEARNING_RATE, BATCH_SIZE) + '_loss_{val_loss:.3f}_epoch_{epoch:02d}.h5'

# Full batches per epoch; 77871 and 8653 are the hard-coded training and validation set sizes.
steps_per_epoch = 77871 // BATCH_SIZE
val_steps = 8653 // BATCH_SIZE


model.fit_generator(
    train_generator(steps_per_epoch),
    steps_per_epoch=steps_per_epoch,
    epochs=NUM_EPOCHS,
    validation_data=val_generator(val_steps),
    validation_steps=val_steps,
    callbacks=[ ],
    verbose=1)
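
The `weights_file` string is built in two stages: the learning rate and batch size are filled in immediately, while `{val_loss}` and `{epoch}` are left as placeholders. The cell above passes an empty `callbacks` list, so the template is never used. If it were handed to a checkpoint callback such as `keras.callbacks.ModelCheckpoint`, the callback would fill in the remaining fields at the end of each epoch. The sketch below mimics that second stage with plain string formatting; the epoch number and loss value are made up.

```python
LEARNING_RATE = 0.001
BATCH_SIZE = 16

# Stage 1: hyperparameters are substituted right away.
prefix = 'lr_{:.3f}_bz_{:d}'.format(LEARNING_RATE, BATCH_SIZE)
# The named fields are appended afterwards, so they remain placeholders.
template = prefix + '_loss_{val_loss:.3f}_epoch_{epoch:02d}.h5'

# Stage 2: hypothetical values a checkpoint callback might supply.
example_name = template.format(epoch=3, val_loss=0.17894)

print(example_name)  # lr_0.001_bz_16_loss_0.179_epoch_03.h5
```

Encoding the hyperparameters and validation loss in the filename makes it easier to compare checkpoints from different experiments.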

## Training a model can take a *LONG* time

Deep neural networks take a long time to train. In many cases, especially when the problem space is new and not well understood, they might need days or weeks to produce accurate models. Hyperparameter tuning means repeating this process many times over.

This means that producing a production model could take months!
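
A back-of-the-envelope sketch shows how quickly tuning multiplies training time. Every number below is a hypothetical placeholder, not a measurement: the candidate values and the hours per training run are assumptions chosen only to show the arithmetic.

```python
import itertools

# Hypothetical candidate values for three hyperparameters.
learning_rates = [0.01, 0.001, 0.0001]
batch_sizes = [16, 32]
image_sizes = [224, 256]

# Assumed wall-clock time for one full training run.
hours_per_run = 36.0

# A grid search trains one model per combination.
configs = list(itertools.product(learning_rates, batch_sizes, image_sizes))
total_days = len(configs) * hours_per_run / 24

print(len(configs), total_days)  # 12 18.0
```

Even this small grid requires 12 sequential training runs, and each additional hyperparameter multiplies that count again.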