# Assignment

In this assignment we will create a high-sensitivity detector for pulmonary infection (pneumonia) on chest radiographs. The goal is to optimize for model sensitivity while preserving a minimum Dice score for overall spatial overlap.

This assignment is part of the class **Introduction to Deep Learning for Medical Imaging** at University of California Irvine (CS190); more information can be found at https://github.com/peterchang77/dl_tutor/tree/master/cs190.

### Submission

Once complete, the following items must be submitted:

* final `*.ipynb` notebook
* final trained `*.hdf5` model file
* final compiled `*.csv` file with performance statistics

# Google Colab

The following lines of code will configure your Google Colab environment for this assignment.

### Enable GPU runtime

Use the following instructions to switch the default Colab instance into a GPU-enabled runtime:

```
Runtime > Change runtime type > Hardware accelerator > GPU
```

### Mount Google Drive

The Google Colab environment is transient and will reset after any prolonged break in activity. To retain important and/or large files between sessions, use the following lines of code to mount your personal Google Drive to this Colab instance:

In [76]:
try:
    # --- Mount gdrive to /content/drive/My Drive/
    from google.colab import drive
    drive.mount('/content/drive')
    
except: pass

Mounted at /content/drive


Throughout this assignment we will use the following global `MOUNT_ROOT` variable to reference a location to store long-term data. If you are using a local Jupyter server and/or wish to store your data elsewhere, please update this variable now.

In [0]:
# --- Set data directory
MOUNT_ROOT = '/content/drive/My Drive'
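
For a notebook that may run either in Colab or on a local Jupyter server, one option is to choose the storage location based on whether the Drive mount exists. The following is a minimal sketch only; the helper name `pick_mount_root` and the local fallback folder `./data` are hypothetical and not part of the assignment.

```python
import os

def pick_mount_root(colab_root='/content/drive/My Drive', local_root='./data'):
    """
    Hypothetical helper: return the Drive folder if it is mounted,
    otherwise fall back to an (assumed) local folder.

    """
    if os.path.isdir(colab_root):
        return colab_root

    return os.path.abspath(local_root)

# --- Set data directory
MOUNT_ROOT = pick_mount_root()
```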

### Select Tensorflow library version

This assignment uses the TensorFlow 2 library (version 2.1). Use the following lines of code to select this version:

In [1]:
# --- Select Tensorflow 2.0 (only in Google Colab)
%tensorflow_version 2.x
%pip install tensorflow-gpu==2.1


# Environment

### Jarvis library

In this notebook we will use Jarvis, a custom Python package to facilitate data science and deep learning for healthcare. Among other things, this library will be used for low-level data management, stratification and visualization of high-dimensional medical data.

In [2]:
# --- Install jarvis (only in Google Colab or local runtime)
%pip install jarvis-md



### Imports

Use the following lines to import any additional needed libraries:

In [0]:
import numpy as np, pandas as pd
from tensorflow import losses, optimizers
from tensorflow.keras import Input, Model, models, layers, metrics
from jarvis.train import datasets, custom
from jarvis.train.client import Client
from jarvis.utils.general import overload, tools as jtools
from jarvis.utils.display import imshow

# Data

The data used in this tutorial will consist of (frontal projection) chest radiographs from the RSNA / Kaggle pneumonia challenge (https://www.kaggle.com/c/rsna-pneumonia-detection-challenge). The chest radiograph is the standard screening exam of choice to identify and trend changes in lung disease including infection (pneumonia). 

The custom `datasets.download(...)` method can be used to download a local copy of the dataset. By default the dataset will be archived at `/data/raw/xr_pna`; as needed an alternate location may be specified using `datasets.download(name=..., path=...)`. 

In [4]:
# --- Download dataset
datasets.download(name='xr/pna')



{'code': '/data/raw/xr_pna', 'data': '/data/raw/xr_pna'}

# Training

In order to create a high-sensitivity classifier for pneumonia, the following strategies should be implemented in this assignment:

* stratified sampling
* pixel-level class weights
* pixel-level masked loss

As described in the tutorial, class weights must be chosen with care: larger weights on pneumonia pixels increase sensitivity, but they also encourage over-prediction, which lowers the Dice score. The goal is to maximize sensitivity while keeping the Dice score above the required minimum.
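
To make the interaction between class weights and masking concrete, here is a minimal NumPy sketch of a pixel-wise weighted softmax cross-entropy. It is not the Jarvis implementation (`custom.sce` may differ, for example in how the loss is normalized); the function name `weighted_masked_ce` and the normalization by the sum of weights are assumptions made for illustration. Pixels with weight 0 are masked out entirely, and pixels with larger weights contribute proportionally more to the loss.

```python
import numpy as np

def weighted_masked_ce(logits, labels, weights):
    """
    Hypothetical helper: pixel-wise softmax cross-entropy scaled by weights

    :params

      (np.ndarray) logits  : raw scores, shape (..., n_classes)
      (np.ndarray) labels  : integer class labels, shape (...)
      (np.ndarray) weights : per-pixel weights, shape (...); 0 masks a pixel

    """
    # --- Numerically stable log-softmax
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

    # --- Negative log-likelihood of the true class at each pixel
    nll = -np.take_along_axis(log_prob, labels[..., None], axis=-1)[..., 0]

    # --- Weighted mean (assumed normalization: sum of weights)
    return (nll * weights).sum() / max(weights.sum(), 1e-8)

# --- Toy example: 4 pixels, 2 classes
logits = np.array([[2.0, 0.0], [0.0, 2.0], [2.0, 0.0], [0.0, 0.0]])
labels = np.array([0, 1, 1, 0])
weights = np.array([1.0, 5.0, 5.0, 0.0])  # background=1, pneumonia=5, masked=0

loss = weighted_masked_ce(logits, labels, weights)
```

In this toy example the third pixel is a missed pneumonia pixel with weight 5, so it dominates the loss, while the fourth pixel has weight 0 and has no effect regardless of its logits.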

### Overload the `Client` object

*Hint*: Be sure to customize the `arrays['xs']['msk']` object to reflect both class weights and masked values.

In [0]:
@overload(Client)
def preprocess(self, arrays, **kwargs):
    """
    Method to create a custom msk array for class weights and/or masks
    
    """
    # --- Create msk
    msk = np.zeros(arrays['xs']['dat'].shape)

    lng = arrays['xs']['msk'] > 0
    pna = arrays['ys']['pna'] > 0

    msk[lng] = 1
    msk[pna] = 5

    arrays['xs']['msk'] = msk
    
    return arrays

### Create `Client` object

After manually overloading the `Client` object, manually create a new client object.

*Hint*: Be sure to use stratified sampling when initializing the client object.

In [0]:
# --- Find client yml file
yml = '{}/data/ymls/client-seg.yml'.format(jtools.get_paths('xr/pna')['code'])

# --- Configs dict
configs = {
    'batch': {'size': 8},
    'sampling': {
        'cohort-neg': 0.5,
        'cohort-pna': 0.5}}

# --- Manually create Client
client = Client(yml, configs=configs)
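
The effect of the `sampling` rates above can be illustrated without Jarvis. The following is a rough sketch of stratified sampling in plain Python, not the `Client` implementation: the function `stratified_batch` and the toy cohort sizes (900 negative, 100 pneumonia) are hypothetical. Each draw first picks a cohort according to its rate and then picks a case from that cohort, so batches are balanced on average even when the underlying dataset is not.

```python
import random
from collections import Counter

def stratified_batch(cohorts, rates, batch_size, rng):
    """
    Hypothetical sampler: choose a cohort by its rate, then a case from it

    """
    names = list(rates)
    probs = [rates[n] for n in names]

    batch = []
    for _ in range(batch_size):
        name = rng.choices(names, weights=probs, k=1)[0]
        batch.append((name, rng.choice(cohorts[name])))

    return batch

# --- Imbalanced toy dataset (made-up sizes)
cohorts = {
    'cohort-neg': ['neg-{:03d}'.format(i) for i in range(900)],
    'cohort-pna': ['pna-{:03d}'.format(i) for i in range(100)]}

rates = {'cohort-neg': 0.5, 'cohort-pna': 0.5}

# --- Draw 1000 batches of size 8 and count cohort membership
rng = random.Random(0)
counts = Counter()
for _ in range(1000):
    batch = stratified_batch(cohorts, rates, batch_size=8, rng=rng)
    counts.update(name for name, case in batch)
```

After the loop, `counts` holds roughly equal totals for the two cohorts even though pneumonia cases make up only 10% of the toy dataset.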


### Create inputs and generators

In [0]:
# --- Manually create generators
gen_train, gen_valid = client.create_generators()

# --- Create inputs
inputs = client.get_inputs(Input)

### Define the model

In [0]:
# --- Define kwargs dictionary
kwargs = {
    'kernel_size': (1, 3, 3),
    'padding': 'same'}

# --- Define lambda functions
conv = lambda x, filters, strides : layers.Conv3D(filters=filters, strides=strides, **kwargs)(x)
norm = lambda x : layers.BatchNormalization()(x)
relu = lambda x : layers.LeakyReLU()(x)
tran = lambda x, filters, strides : layers.Conv3DTranspose(filters=filters, strides=strides, **kwargs)(x)

# --- Define stride-1, stride-2 blocks
conv1 = lambda filters, x : relu(norm(conv(x, filters, strides=1)))
conv2 = lambda filters, x : relu(norm(conv(x, filters, strides=(1, 2, 2))))
tran2 = lambda filters, x : relu(norm(tran(x, filters, strides=(1, 2, 2))))

# --- Define contracting layers
l1 = conv1(8, inputs['dat'])
l2 = conv1(16, conv2(16, l1))
l3 = conv1(32, conv2(32, l2))
l4 = conv1(48, conv2(48, l3))
l5 = conv1(64, conv2(64, l4))
l6 = conv1(80, conv2(80, l5))

# --- Define expanding layers
l7  = tran2(64, l6)
l8  = tran2(48, conv1(64, l7  + l5))
l9  = tran2(32, conv1(48, l8  + l4))
l10 = tran2(16, conv1(32, l9  + l3))
l11 = tran2(8,  conv1(16, l10 + l2))
l12 = conv1(8,  conv1(8,  l11 + l1))

# --- Create logits
logits = {}
logits['pna'] = layers.Conv3D(filters=2, name='pna', **kwargs)(l12)

# --- Create model
model = Model(inputs=inputs, outputs=logits) 
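
The skip connections above (`l7 + l5`, `l8 + l4`, and so on) only work if the upsampled feature maps have exactly the same spatial size as their contracting-path counterparts. The sketch below traces one spatial dimension through five stride-2 convolutions and five stride-2 transposed convolutions, using the standard `'same'`-padding rules (output `ceil(n / 2)` and `n * 2` respectively). The helper names and the example sizes 512 and 500 are illustrative assumptions, not values taken from the dataset.

```python
import math

def unet_spatial_shapes(size, n_down=5):
    """
    Track one spatial dimension through the contracting and expanding paths

    """
    down = [size]
    for _ in range(n_down):
        # --- Stride-2 convolution with 'same' padding: ceil(n / 2)
        down.append(math.ceil(down[-1] / 2))

    up = [down[-1]]
    for _ in range(n_down):
        # --- Stride-2 transposed convolution with 'same' padding: n * 2
        up.append(up[-1] * 2)

    return down, up

def skips_match(size, n_down=5):
    """
    Check that every upsampled map matches the layer it is added to

    """
    down, up = unet_spatial_shapes(size, n_down)

    return all(up[i] == down[n_down - i] for i in range(1, n_down + 1))

down, up = unet_spatial_shapes(512)
ok_512 = skips_match(512)   # True: 512 is divisible by 2 ** 5
ok_500 = skips_match(500)   # False: 500 -> 250 -> 125 -> 63 breaks the symmetry
```

In general, an input whose height and width are divisible by 32 (2 to the power of the number of downsampling steps) keeps every addition well-defined for this architecture.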

### Compile the model

*Hint*: Ensure that custom loss functions are used as described in the tutorial to properly adjust the loss function for weights and masks. In addition, it may be useful to track metrics such as Dice score and sensitivity to gauge real-time performance.

In [0]:
# --- Create custom weighted loss
loss = {'pna': custom.sce(inputs['msk'])}

# --- Create metrics
metrics = custom.dsc(weights=inputs['msk'])
metrics += [custom.softmax_ce_sens(weights=inputs['msk'])]
metrics = {'pna': metrics}

# --- Compile the model
model.compile(
    optimizer=optimizers.Adam(learning_rate=2e-4),
    loss=loss,
    metrics=metrics,
    experimental_run_tf_function=False)

### Train the model

Use the following cell block to train your model.

In [12]:
# --- Train model
model.fit(
    x=gen_train, 
    steps_per_epoch=50, 
    epochs=120,
    validation_data=gen_valid,
    validation_steps=50,
    validation_freq=4,
    use_multiprocessing=True)


<tensorflow.python.keras.callbacks.History at 0x7fcafc53f208>

# Evaluation

Based on the tutorial discussion, use the following cells to calculate model performance. The following metrics should be calculated:

* pixel-wise sensitivity
* Dice score coefficient

### Performance

The following minimum performance metrics must be met for full credit:

* pixel-wise sensitivity: >0.75
* Dice score coefficient: >0.50
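
As a quick illustration of how the two metrics respond differently to over-prediction, here is a small worked example on made-up binary masks (no smoothing term is used here). Sensitivity is TP / (TP + FN) and only penalizes missed pneumonia pixels, whereas the Dice score is 2·TP / (2·TP + FP + FN) and also penalizes false positives.

```python
import numpy as np

# --- Made-up ground-truth and prediction masks
true = np.array([[0, 1, 1, 0],
                 [0, 1, 1, 0],
                 [0, 0, 0, 0]])

pred = np.array([[0, 1, 1, 1],
                 [0, 1, 0, 1],
                 [0, 0, 0, 1]])

# --- Count true positives, false positives, false negatives
tp = np.count_nonzero((pred == 1) & (true == 1))  # 3
fp = np.count_nonzero((pred == 1) & (true == 0))  # 3
fn = np.count_nonzero((pred == 0) & (true == 1))  # 1

sens = tp / (tp + fn)               # 3 / 4  = 0.75
dice = 2 * tp / (2 * tp + fp + fn)  # 6 / 10 = 0.6
```

The three false-positive pixels leave sensitivity untouched but pull the Dice score down, which is the trade-off this assignment asks you to manage.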

In [0]:
# --- Create test generators (train and valid cohorts)
test_train, test_valid = client.create_generators(test=True)

In [0]:
def calculate_sens(pred, true, epsilon=1):
    """
    Method to calculate sensitivity from pred and true masks
    
    """
    assert pred.shape == true.shape
    tp = (pred == 1) & (true == 1)
    ap = (true == 1)

    return np.count_nonzero(tp) / (np.count_nonzero(ap) + epsilon)


def calculate_dice(y_true, y_pred, c=1, epsilon=1):
    """
    Method to calculate the Dice score coefficient for given class
    
    :params
    
      (np.ndarray) y_true : ground-truth label
      (np.ndarray) y_pred : predicted logits scores
      (int)             c : class to calculate DSC on
    
    """
    assert y_true.ndim == y_pred.ndim
    
    true = y_true[..., 0] == c
    pred = np.argmax(y_pred, axis=-1) == c 

    A = np.count_nonzero(true & pred) * 2
    B = np.count_nonzero(true) + np.count_nonzero(pred) + epsilon
    
    return A / B

In [72]:
# --- Create test generators (train and valid cohorts)
test_train, test_valid = client.create_generators(test=True)

dice_array = []
sens_array = []
for x, y in test_valid:
    # --- Create prediction
    logits = model.predict(x)
    pred = np.argmax(logits[0], axis=-1)
    
    # --- Clean up pred using mask
    pred[x['msk'][0, ..., 0] == 0] = 0

    # --- Calculate Dice (computed from the raw logits, so the mask clean-up above is not applied)
    dice = calculate_dice(y['pna'][0], logits[0])
    dice_array.append(dice)
    
    
    # --- Calculate sens
    sens = calculate_sens(pred=pred, true=y['pna'][0, ..., 0])
    sens_array.append(sens)



### Results

When ready, create a `*.csv` file with your compiled **validation** cohort sensitivity and Dice score statistics. There is no need to submit training performance accuracy.

In [0]:
dice_array = np.array(dice_array)
sens_array = np.array(sens_array)

# --- Define columns
df = pd.DataFrame(index=np.arange(sens_array.size))
df['dice'] = dice_array
df['sens'] = sens_array

In [0]:
import os
fname = '{}/models/lesion_segmentation/gej4_results.csv'.format(MOUNT_ROOT)
os.makedirs(os.path.dirname(fname), exist_ok=True)
df.to_csv(fname)
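
Before submitting, it can be useful to check the compiled statistics against the minimum requirements. The sketch below parses CSV text in the layout written by `df.to_csv(...)` above (an unnamed index column followed by `dice` and `sens`) and compares the mean of each column with its threshold. Averaging across cases is an assumption about how the thresholds are applied, the helper name `summarize` is hypothetical, and the three rows of numbers are made up for illustration.

```python
import csv
import io
from statistics import mean

# --- Minimum requirements from the Performance section
THRESHOLDS = {'sens': 0.75, 'dice': 0.50}

def summarize(csv_text, thresholds=THRESHOLDS):
    """
    Hypothetical helper: mean of each metric and whether it exceeds its threshold

    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    means = {k: mean(float(r[k]) for r in rows) for k in thresholds}
    passed = {k: means[k] > thresholds[k] for k in thresholds}

    return means, passed

# --- Made-up example in the same layout as the results file
example = ",dice,sens\n0,0.62,0.81\n1,0.48,0.77\n2,0.55,0.90\n"

means, passed = summarize(example)
```

To apply this to your own results, read the contents of the saved `*.csv` file into a string and pass it to `summarize`.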

# Submission

Use the following line to save your model for submission (in Google Colab this should save your model file into your personal Google Drive):

In [0]:
# --- Serialize a model
fname = '{}/models/lesion_segmentation/gej4_final.hdf5'.format(MOUNT_ROOT)
os.makedirs(os.path.dirname(fname), exist_ok=True)
model.save(fname)

### Canvas

Once you have completed this assignment, download the necessary files from Google Colab and your Google Drive. You will then need to submit the following items:

* final (completed) notebook: `[UCInetID]_assignment.ipynb`
* final (results) spreadsheet: `[UCInetID]_results.csv`
* final (trained) model: `[UCInetID]_model.hdf5`

**Important**: please submit all your files prefixed with your UCInetID as listed above. Your UCInetID is the part of your UCI email address that comes before `@uci.edu`. For example, Peter Anteater has an email address of panteater@uci.edu, so his notebook file would be submitted under the name `panteater_assignment.ipynb`, his spreadsheet would be submitted under the name `panteater_results.csv` and his model file would be submitted under the name `panteater_model.hdf5`.