## AI for Medicine Course 1 Week 1 lecture exercises

# Outline

Click on the links to jump to the desired section of this notebook!

- [Image pre-processing in Keras](#image-processing)
- [Counting Labels](#counting-labels)
- [Weighted Loss Function](#weighted-loss)
- [Densenet](#densenet)
- [Patient overlap and data leakage](#patient-overlap)

<a name="image-processing"></a>
# Image Pre-processing in Keras

Let's use the Keras function [ImageDataGenerator](https://keras.io/preprocessing/image/) to perform data preprocessing and data augmentation.

In [None]:
from keras.preprocessing.image import ImageDataGenerator

In [None]:
import pandas as pd

Load the csv file containing patient labels and file names for the chest x-rays.

In [None]:
train_df = pd.read_csv("nih/train-small.csv")
train_df.head(1)

#### Standardization

We want to center the mean of the data around zero, and also make the standard deviation of the data equal to 1.  

So we subtract the mean and divide by the standard deviation.

$$\frac{x_i - \mu}{\sigma}$$

$\mu$: the mean (average)  

$\sigma$: the standard deviation
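
To make the formula concrete, here is a minimal NumPy sketch of sample-wise standardization applied to a single image. The tiny 2x3 `image` array is made up for illustration, and the sketch skips any guard against a zero standard deviation that a real implementation would normally include.

```python
import numpy as np

# Hypothetical 2x3 grayscale "image" with pixel values in [0, 255]
image = np.array([[0.0, 50.0, 100.0],
                  [150.0, 200.0, 250.0]])

mu = image.mean()    # mean over all pixels of this one image
sigma = image.std()  # standard deviation over all pixels of this one image

# Subtract the mean and divide by the standard deviation
standardized = (image - mu) / sigma

print(standardized.mean())  # approximately 0
print(standardized.std())   # approximately 1
```

Because the mean and standard deviation are computed per image, every image ends up on the same scale regardless of its original brightness or contrast.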

In [None]:
# standardize each image individually (sample-wise)
image_generator = ImageDataGenerator(
    samplewise_center=True,  # set each sample's mean to 0
    samplewise_std_normalization=True  # divide each sample by its standard deviation
)

In [None]:
train_df[['Image','Mass']].head(1)

In [None]:
# flow from the dataframe with the specified batch size
# and target image size
generator = image_generator.flow_from_dataframe(
        dataframe=train_df,
        directory="nih/images-small/",
        x_col="Image",  # column holding the image file names
        y_col=['Mass'],  # label column(s); 'Mass' must be in train_df
        class_mode="raw",  # return the y_col values as-is
        batch_size=1,  # images per batch
        shuffle=False,  # keep the rows in dataframe order
        target_size=(320, 320)  # (height, width) of the output image
)

View the raw data

In [None]:
import imageio
# get the first image that was listed in the train_df dataframe
raw_image = imageio.imread('nih/images-small/00008270_015.png')

In [None]:
import matplotlib.pyplot as plt

In [None]:
plt.imshow(raw_image);

Processed data
Get the first image that's in the generator

In [None]:
# get the first batch (one image and its label) from the generator
x, y = generator[0]
plt.imshow(x[0]);

### This is the end of this practice section.

Please continue on with the lecture videos!

---

<a name="counting-labels"></a>
# Counting labels

To avoid having class imbalance impact the loss function, we can weight the losses differently.  To choose the weights, we calculate the class frequencies.

For this exercise, let's just get the count of each label.  You'll use the concepts practiced here to calculate frequencies in the assignment!

In [None]:
import numpy as np

The labels will be stored where each row is a sample (a patient), and each column is a label for a particular condition.  

In [None]:
# two patients (rows), and 3 labels (columns)
labels_matrix = np.array(
    [[1, 0, 0],
     [0, 1, 1]])

In [None]:
# this sums all patients and labels
np.sum(labels_matrix)

In [None]:
# sum for each label (column)
np.sum(labels_matrix,axis=0)

In [None]:
# sum for each patient (row)
np.sum(labels_matrix,axis=1)

Decide which axis you should use in the assignment!

Find out when the label is zero

In [None]:
# find out when the label is zero
labels_matrix == 0

In [None]:
# convert booleans to integers
labels_matrix_count_zeros = (labels_matrix == 0).astype(int)
labels_matrix_count_zeros
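
As a small follow-up sketch, the comparison and the sum can be combined, because NumPy treats `True` as 1 when summing a boolean array. The variable names `zeros_per_label` and `zeros_per_patient` are just illustrative; which axis you need in the assignment is still for you to decide.

```python
import numpy as np

# two patients (rows), and 3 labels (columns)
labels_matrix = np.array(
    [[1, 0, 0],
     [0, 1, 1]])

# count the zeros in each column (one count per label)
zeros_per_label = np.sum(labels_matrix == 0, axis=0)

# count the zeros in each row (one count per patient)
zeros_per_patient = np.sum(labels_matrix == 0, axis=1)

print(zeros_per_label)    # [1 1 1]
print(zeros_per_patient)  # [2 1]
```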

### This is the end of this practice section.

Please continue on with the lecture videos!

---

<a name="weighted-loss"></a>
# Weighted Loss function


In [1]:
import numpy as np

In [2]:
y_true = np.array(
        [[1],
         [1],
         [1],
         [0]])

In [3]:
w_p = np.array([0.25])

In [4]:
w_n = np.array([0.75])

In [5]:
y_pred_1 = 0.9*np.ones(y_true.shape)
y_pred_1

array([[0.9],
       [0.9],
       [0.9],
       [0.9]])

In [6]:
e = 1e-7

#### Weighted Loss Equation
Calculate the loss for the zero-th label (column at index 0)

- The loss is made up of two terms:
    - $loss_{pos}$: we'll use this to refer to the loss where the actual label is positive (the positive examples).
    - $loss_{neg}$: we'll use this to refer to the loss where the actual label is negative (the negative examples).  
- Note that inside the $log()$ function, we add a tiny positive value $\epsilon$ so that we never take the log of zero.

$$ loss^{(i)} = loss_{pos}^{(i)} + loss_{neg}^{(i)} $$

$$loss_{pos}^{(i)} = -1 \times weight_{pos}^{(i)} \times y^{(i)} \times log(\hat{y}^{(i)} + \epsilon)$$

$$loss_{neg}^{(i)} = -1 \times weight_{neg}^{(i)} \times (1- y^{(i)}) \times log(1 - \hat{y}^{(i)} + \epsilon)$$

$$\epsilon = \text{a tiny positive number}$$

For this exercise, we will add up the losses from the examples.  In this week's programming assignment, you'll take the average loss over the multiple examples.
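
To see why the tiny positive number matters, here is a short sketch with made-up predictions that include the extreme values 0 and 1. In NumPy, `np.log(0)` does not raise an exception; it returns `-inf` with a runtime warning, which would then turn the whole loss into `inf` or `nan`. Adding epsilon keeps every term finite.

```python
import numpy as np

epsilon = 1e-7

# hypothetical predictions, including the extreme values 0 and 1
y_hat = np.array([0.0, 0.5, 1.0])

# without epsilon: log(0) is -inf (the warning is silenced for this demo)
with np.errstate(divide="ignore"):
    unsafe = np.log(y_hat)

# with epsilon: every value is finite
safe = np.log(y_hat + epsilon)

print(np.isinf(unsafe[0]))      # True
print(np.isfinite(safe).all())  # True
```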

In [7]:
# loss from the positive examples
loss_1_pos = -1 * np.sum(w_p[0] * 
                y_true[:, 0] * 
                np.log(y_pred_1[:, 0] + e)
              )
loss_1_pos

0.07902030341004104

In [8]:
# loss from the negative examples
loss_1_neg = -1 * np.sum( 
                w_n[0] * 
                (1 - y_true[:, 0]) * 
                np.log(1 - y_pred_1[:, 0] + e)
              )
loss_1_neg

1.7269380697459094

In [9]:
loss_1 = loss_1_pos + loss_1_neg
loss_1

1.8059583731559503

Do the same for when all predictions are 0.1

In [10]:
y_pred_2 = 0.1 * np.ones(y_true.shape)

In [11]:
# loss from the positive examples
loss_2_pos = -1 * np.sum(w_p[0] * 
                y_true[:, 0] * 
                np.log(y_pred_2[:, 0] + e)
              )
loss_2_pos

1.7269380697459094

In [12]:
# loss from the negative examples
loss_2_neg = -1 * np.sum( 
                w_n[0] *
                (1 - y_true[:, 0]) * 
                np.log(1 - y_pred_2[:, 0] + e)
              )
loss_2_neg

0.07902030341004104

In [13]:
loss_2 = loss_2_pos + loss_2_neg
loss_2

1.8059583731559503

In [14]:
print("class is imbalanced (there are 3 positive labels and 1 negative label)")
print(f"weighted loss when predictions are all 0.9 is {loss_1:.4f}")
print(f"weighted loss when predictions are all 0.1 is {loss_2:.4f}")

class is imbalanced (there are 3 positive labels and 1 negative label)
weighted loss when predictions are all 0.9 is 1.8060
weighted loss when predictions are all 0.1 is 1.8060


The weights give the single negative example (the minority class) the same total contribution to the loss as the three positive examples combined. That is why predicting 0.9 everywhere and predicting 0.1 everywhere produce the same weighted loss.
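
The sketch below checks that balance numerically. It assumes one common way of choosing the weights, namely weighting each class by the frequency of the other class; this exercise only states the values 0.25 and 0.75, so treat the derivation as an assumption rather than the required method.

```python
import numpy as np

y_true = np.array([1, 1, 1, 0])

# class frequencies for this single label
freq_pos = np.mean(y_true == 1)  # fraction of positive examples
freq_neg = np.mean(y_true == 0)  # fraction of negative examples

# assumption: weight each class by the frequency of the other class
w_p = freq_neg
w_n = freq_pos

# total weight contributed by each class
total_pos = w_p * np.sum(y_true == 1)  # 0.25 * 3
total_neg = w_n * np.sum(y_true == 0)  # 0.75 * 1

print(w_p, w_n)              # 0.25 0.75
print(total_pos, total_neg)  # 0.75 0.75
```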

### This is the end of this practice section.

Please continue on with the lecture videos!

---

<a name="densenet"></a>
# Densenet

Densenet is a convolutional network in which, within each dense block, every layer is connected to all of the layers that come after it:
- The first layer is connected to the 2nd, 3rd, 4th etc.
- The second layer is connected to the 3rd, 4th, 5th etc.

For a detailed explanation of Densenet, check out this post by Pablo Ruiz: ["Understanding and visualizing DenseNets"](https://towardsdatascience.com/understanding-and-visualizing-densenets-7f688092391a).
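
To illustrate the connectivity pattern, here is a toy NumPy sketch of one dense block. It is not the Keras implementation: `toy_layer` is a hypothetical stand-in for a real convolutional layer, and the sizes (64 starting channels, 6 layers, 32 new channels per layer) are assumed values chosen for the example. The point is only that each layer receives the concatenation of all earlier feature maps.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_layer(inputs, growth_rate):
    """Hypothetical stand-in for a conv layer: maps any number of
    input channels to `growth_rate` output channels."""
    in_channels = inputs.shape[-1]
    weights = rng.normal(size=(in_channels, growth_rate))
    # per-pixel linear map over channels, followed by ReLU
    return np.maximum(inputs @ weights, 0.0)

growth_rate = 32                  # new channels added by each layer (assumed)
x = rng.normal(size=(8, 8, 64))   # made-up feature map: height, width, channels

features = [x]
for _ in range(6):
    # each layer sees the concatenation of ALL earlier feature maps
    layer_input = np.concatenate(features, axis=-1)
    features.append(toy_layer(layer_input, growth_rate))

block_output = np.concatenate(features, axis=-1)
print(block_output.shape)  # (8, 8, 256)
```

The channel count grows linearly: 64 input channels plus 6 layers times 32 new channels gives 256.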

In [None]:
from keras.applications.densenet import DenseNet121
from keras.layers import Dense, GlobalAveragePooling2D
from keras.models import Model
from keras import backend as K

In [None]:
# create the base pre-trained model
base_model = DenseNet121(weights='./nih/densenet.hdf5', include_top=False);

View a summary of the model

In [None]:
base_model.summary()

In [None]:
# there are multiple convolutional layers
layers_l = base_model.layers

print("First 5 layers")
layers_l[0:5]

In [None]:
print("Last 5 layers")
layers_l[-5:]

In [None]:
# get the convolutional layers
conv2D_layers = [layer for layer in base_model.layers 
                if str(type(layer)).find('Conv2D') > -1]
print("The first 5 Conv2D layers")
conv2D_layers[0:5]

In [None]:
print(f"There are {len(conv2D_layers)} convolutional layers")

In [None]:
print("The input has 3 channels")
base_model.input

In [None]:
print("The output has 1024 channels")
x = base_model.output
x

In [None]:
# add a global spatial average pooling layer
x_pool = GlobalAveragePooling2D()(x)
x_pool
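
Global average pooling is simple enough to reproduce by hand. The sketch below assumes the channels-last layout (batch, height, width, channels) and uses a made-up random feature map; it just averages over the two spatial axes so that each channel is summarized by one number.

```python
import numpy as np

# hypothetical feature map: batch of 2, 10x10 spatial grid, 1024 channels
feature_map = np.random.default_rng(0).normal(size=(2, 10, 10, 1024))

# average over the height and width axes, keeping batch and channels
pooled = feature_map.mean(axis=(1, 2))

print(pooled.shape)  # (2, 1024)
```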

In [None]:
labels = ['Emphysema', 
          'Hernia', 
          'Mass', 
          'Pneumonia',  
          'Edema']
n_classes = len(labels)
print(f"In this example, we want our model to identify {n_classes} classes")

In [None]:
# and a logistic layer
predictions = Dense(n_classes, activation="sigmoid")(x_pool)
print(f"Predictions have {n_classes} units, one for each class")
predictions

In [None]:
model = Model(inputs=base_model.input, outputs=predictions)

In [None]:
model.compile(optimizer='adam',
              loss='binary_crossentropy')  # sigmoid outputs: one independent yes/no per label
# we'll customize the loss function in the assignment!

### This is the end of this practice section.

Please continue on with the lecture videos!

---

<a name="patient-overlap"></a>
# Patient Overlap and Data Leakage

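Patient overlap in medical data is one instance of a more general problem called **data leakage**: if the same patient appears in both the train set and the validation set, the validation score can look better than the model really is. To identify patient overlap, check whether any patient ID appears in both sets.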

In [None]:
import pandas as pd

In [None]:
ids_train = [0, 1, 2, 0, 1, 2]
ids_valid = [2, 3, 4, 2, 3, 4]

In [None]:
ids_train_set = set(ids_train)
ids_train_set

In [None]:
ids_valid_set = set(ids_valid)
ids_valid_set

In [None]:
patient_overlap = ids_train_set.intersection(ids_valid_set)
patient_overlap

In [None]:
len(patient_overlap)

The patient ID '2' appears in both the train set and the validation set!
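
The same check carries over to dataframes. The sketch below is one possible way to do it, not the assignment's solution: the column name `PatientId` and the two small dataframes are hypothetical. After finding the overlap, it drops the overlapping patients from the validation set.

```python
import pandas as pd

# hypothetical dataframes; "PatientId" is an assumed column name
train_df = pd.DataFrame({"PatientId": [0, 1, 2, 0, 1, 2]})
valid_df = pd.DataFrame({"PatientId": [2, 3, 4, 2, 3, 4]})

# patient IDs that appear in both sets
overlap = set(train_df["PatientId"].tolist()) & set(valid_df["PatientId"].tolist())

# keep only validation rows whose patient is NOT in the train set
valid_clean = valid_df[~valid_df["PatientId"].isin(overlap)]

print(sorted(overlap))                     # [2]
print(valid_clean["PatientId"].tolist())   # [3, 4, 3, 4]
```

Removing the rows from the validation set rather than the train set is a design choice made here for illustration; either side works as long as no patient ends up in both.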

### This is the end of this practice section.

Please continue on with the lecture videos!

---