### Autoencoder with Convolutional Neural Network
This notebook contains the implementation of an autoencoder used to augment the histopathology dataset. The augmented data is used to train a CNN, to see whether its accuracy and AUC are higher than those of a baseline CNN trained on the original dataset. First, the required libraries are imported and the size of the images in the PCAM dataset is defined.

In [None]:
# Load the functions and classes from main_util.py
from main_util import get_pcam_generators
from main_util import Model_architecture
from main_util import Model_transform

# Standard libraries
import matplotlib.pyplot as plt
import os

# Modelcheckpoint and tensorboard callbacks
from tensorflow.keras.callbacks import ModelCheckpoint, TensorBoard

# ROC curve analysis
from sklearn.metrics import roc_curve, auc, RocCurveDisplay

# the size of the images in the PCAM dataset
IMAGE_SIZE = 96

### Instantiating data generators

The PatchCAMELYON dataset is too big to fit in the working memory of most personal computers. We therefore need functions that read the image data batch by batch, so that only a single batch of images is held in memory at any time. The `ImageDataGenerator` class from the Keras API does this. Note that the generators are created inside the function `get_pcam_generators`, which returns them. This function is located in `main_util.py` and is called later from the main code body.
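
The batch-by-batch idea can be illustrated in plain Python. This is only a minimal sketch and not the implementation in `main_util.py`: the function `iter_batches`, the `load` callback, and the file names are hypothetical stand-ins for what a Keras generator does internally.

```python
from typing import Callable, Iterator, List, Sequence


def iter_batches(paths: Sequence[str], batch_size: int,
                 load: Callable[[str], object]) -> Iterator[List[object]]:
    """Yield lists of loaded items, one batch at a time."""
    for start in range(0, len(paths), batch_size):
        # Only the files belonging to the current batch are loaded here.
        yield [load(p) for p in paths[start:start + batch_size]]


# Hypothetical file names; a real loader would decode the image at each path.
fake_paths = [f"img_{i}.jpg" for i in range(10)]
fake_load = lambda p: p.upper()  # placeholder for "read and decode the image"

# Consume the batches one by one; each batch is discarded after use.
batch_sizes = [len(batch) for batch in iter_batches(fake_paths, 4, fake_load)]
print(batch_sizes)  # [4, 4, 2]
```

Because the function is a generator, the next batch is read only when the training loop asks for it.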


Before executing the code block below, do not forget to set `path` to the location of the PatchCAMELYON dataset (that is, the folder containing `train+val` that you previously downloaded and unpacked).

If everything is correct, the following output will be printed on screen after executing the code block:

`Found 144000 images belonging to 2 classes.`

`Found 16000 images belonging to 2 classes.`

In [None]:
path = "../data"
train_gen, val_gen = get_pcam_generators(path)

### Building model architectures

The model architectures are defined within the class `Model_architecture`. Organizing the code into classes instead of piling everything up in a single script makes the code easier to read and understand, and makes it easier to reuse functionality that is already implemented. The class is also located in `main_util.py`. In the code block below, an instance of the class is created and the baseline CNN model is built and compiled.

In [None]:
# Class instance is made
model_cnn = Model_architecture()
model_cnn.create_cnn(kernel_size=(3,3), pool_size=(4,4), first_filters=32, second_filters=64)
model_cnn.compile_cnn()
model_cnn._name = 'cnn'

# Prints a summary of the model structure
model_cnn.summary();

### Training the baseline CNN model on regular data and evaluating the model

After loading the CNN model structure, the training phase will be initiated in the code block below. This is followed by a ROC curve analysis of the trained CNN model. The training is done with the regular dataset.
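
As a reminder of what the AUC measures: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counting half. The sketch below computes it directly from that definition on made-up labels and scores. `pairwise_auc` is a hypothetical helper for illustration only; the notebook itself relies on `roc_curve` and `auc` from scikit-learn.

```python
from typing import Sequence


def pairwise_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Made-up toy data, not results from the PCAM models.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)  # 0.75
```

The double loop is quadratic in the number of samples, so it is suitable only for small illustrations like this one.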

In [None]:
# Save the model and weights
model_name = 'cnn'
model_filepath = model_name + '.json'
weights_filepath = model_name + '_weights.hdf5'

model_json = model_cnn.to_json() # serialize model to JSON
with open(model_filepath, 'w') as json_file:
    json_file.write(model_json)

# Define the model checkpoint and Tensorboard callbacks
checkpoint = ModelCheckpoint(weights_filepath, monitor='val_loss', verbose=1, save_best_only=True, mode='min')
tensorboard = TensorBoard(os.path.join('logs', model_name))
callbacks_list = [checkpoint, tensorboard]

# Train the model
train_steps = train_gen.n//train_gen.batch_size
val_steps = val_gen.n//val_gen.batch_size

history = model_cnn.fit(train_gen, steps_per_epoch=train_steps,
                        validation_data=val_gen,
                        validation_steps=val_steps,
                        epochs=1,
                        callbacks=callbacks_list)

In [None]:
# Getting labels and predictions on validation set
# (val_gen.classes is in file order, so this assumes the validation generator does not shuffle)
val_true = val_gen.classes
val_probs = model_cnn.predict(val_gen, steps=val_steps)

# Calculating false positive rate (fpr), true positive rate (tpr) and AUC
fpr, tpr, thresholds = roc_curve(val_true, val_probs)
roc_auc = auc(fpr, tpr)

# Generate ROC curve
roc = RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=roc_auc)
roc.plot();

### Training and evaluating the autoencoder model

First, we need to construct new data generators. Training the autoencoder is unsupervised, so the class mode of the data generators is set to `input`, which makes each batch use the images themselves as targets. A new instance of the `Model_architecture` class is then created for the autoencoder.
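
What the `input` class mode amounts to can be mimicked in plain Python: every batch is a pair whose target is the input itself, so the network learns to reproduce what it is given. The helper `as_autoencoder_batches` and the toy batches below are hypothetical and only illustrate the pairing; they are not part of `main_util.py`.

```python
from typing import Iterable, Iterator, List, Tuple

Batch = List[List[float]]


def as_autoencoder_batches(
    batches: Iterable[Tuple[Batch, List[int]]]
) -> Iterator[Tuple[Batch, Batch]]:
    """Turn (images, labels) batches into (images, images) batches."""
    for images, _labels in batches:
        # The class labels are dropped; the images serve as their own target.
        yield images, images


# Two tiny made-up "image" batches with binary labels.
labelled = [([[0.1, 0.2]], [0]), ([[0.3, 0.4]], [1])]
ae_batches = list(as_autoencoder_batches(labelled))
print(ae_batches[0])  # ([[0.1, 0.2]], [[0.1, 0.2]])
```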

In [None]:
# Constructing the data generators for unsupervised learning for autoencoder training
train_gen_ae, val_gen_ae = get_pcam_generators(path, train_batch_size=16, val_batch_size=16, class_mode='input') 

In [None]:
model_ae = Model_architecture()
model_ae.create_autoencoder(kernel_size=(3,3), pool_size=(2,2), first_filters=32, second_filters=16)
model_ae.compile_autoencoder()
model_ae._name = 'Autoencoder'

model_ae.summary();

Next, the training phase of the autoencoder can be initiated.
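
The number of steps per epoch is the integer division of the dataset size by the batch size. As a worked example, assuming the image counts reported by the generators earlier (144000 and 16000) and the batch size of 16 passed to `get_pcam_generators` for the autoencoder (the `*_example` names are illustrative only):

```python
# Image counts as printed by the data generators earlier in the notebook.
n_train, n_val = 144000, 16000
batch_size = 16  # value passed for the autoencoder generators

# Floor division: a final partial batch, if any, would be skipped.
train_steps_example = n_train // batch_size
val_steps_example = n_val // batch_size
leftover_train = n_train % batch_size

print(train_steps_example, val_steps_example, leftover_train)  # 9000 1000 0
```

Since both counts are multiples of 16, no images are left out of an epoch in this configuration.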

In [None]:
# Save the model and weights
model_name = 'autoencoder'
model_filepath = model_name + '.json'
weights_filepath = model_name + '_weights.hdf5'

model_json = model_ae.to_json() # serialize model to JSON
with open(model_filepath, 'w') as json_file:
    json_file.write(model_json) 

# Define the model checkpoint and Tensorboard callbacks
checkpoint = ModelCheckpoint(weights_filepath, monitor='val_loss', verbose=1, save_best_only=True, mode='min')
tensorboard = TensorBoard(os.path.join('logs', model_name))
callbacks_list = [checkpoint, tensorboard]

# Train the model
train_steps_ae = train_gen_ae.n//train_gen_ae.batch_size
val_steps_ae = val_gen_ae.n//val_gen_ae.batch_size

history = model_ae.fit(train_gen_ae, steps_per_epoch=train_steps_ae, 
                       validation_data=val_gen_ae,
                       validation_steps=val_steps_ae,
                       epochs=3,
                       callbacks=callbacks_list)

Let's visualize the output of the trained autoencoder model. The output of the autoencoder is used as the augmented dataset in the upcoming steps.

In [None]:
# Reconstruct one batch of images (note: taken from the training generator)
img_batch = train_gen_ae[0][1] # [batch index][0 = inputs, 1 = targets]; both hold the same images for class_mode='input'
predict_test = model_ae.predict(img_batch) 
image_nr = 3

fig,ax = plt.subplots(1,2)
ax[0].imshow(img_batch[image_nr])
ax[0].set_title('Original image')
ax[1].imshow(predict_test[image_nr])
ax[1].set_title('Reconstructed image');

### Training CNN with augmented data using autoencoder

The generators for the augmented data are initialized with the autoencoder model as the preprocessing function. This is done with the class `Model_transform`, which is located in `main_util.py` and is responsible for transforming the images that the data generators produce. A new instance of the `Model_architecture` class is created for a second CNN, which is trained on the augmented dataset.
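
Since `main_util.py` is not shown here, the following is only a guess at how such a wrapper could look. Keras calls a preprocessing function with a single image (a rank-3 NumPy array) and expects an array of the same shape back, whereas `predict` works on batches, so the wrapper has to add and remove a batch axis. The class `ModelTransformSketch` and the stand-in `InvertModel` are hypothetical names, not the actual contents of `main_util.py`.

```python
import numpy as np


class ModelTransformSketch:
    """Hypothetical wrapper that applies a batch-wise model to one image."""

    def __init__(self, model):
        self.model = model

    def model_transform(self, image: np.ndarray) -> np.ndarray:
        batch = np.expand_dims(image, axis=0)      # (H, W, C) -> (1, H, W, C)
        reconstructed = self.model.predict(batch)  # model operates on batches
        return reconstructed[0]                    # back to (H, W, C)


class InvertModel:
    """Stand-in for a trained autoencoder; only mimics the predict() interface."""

    def predict(self, batch: np.ndarray) -> np.ndarray:
        return 1.0 - batch


image = np.random.rand(96, 96, 3).astype("float32")
out = ModelTransformSketch(InvertModel()).model_transform(image)
print(out.shape)  # (96, 96, 3)
```

Calling `predict` once per image is simple but can be slow for a real network, which is worth keeping in mind when training on the augmented generators.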

In [None]:
# Constructing the data generators for the augmented dataset  
train_gen_aug, val_gen_aug = get_pcam_generators(path,
                                                 class_mode='binary', 
                                                 prep_function=Model_transform(model_ae).model_transform)

The following code checks that the generator works properly by plotting the first eight images of one batch.

In [None]:
fig,ax = plt.subplots(1, 8)
for images, labels in train_gen_aug:
    for i in range(8):
        ax[i].imshow(images[i])
        ax[i].axis('off')
    break

In [None]:
model_cnn_aug = Model_architecture()
model_cnn_aug.create_cnn(kernel_size=(3,3), pool_size=(4,4), first_filters=32, second_filters=64)
model_cnn_aug.compile_cnn()
model_cnn_aug._name = 'cnn_aug'

model_cnn_aug.summary()

The training phase on the augmented data can now be initiated.

In [None]:
# Save the model and weights
model_name = 'cnn_aug'
model_filepath = model_name + '.json'
weights_filepath = model_name + '_weights.hdf5'

model_json = model_cnn_aug.to_json() # serialize model to JSON
with open(model_filepath, 'w') as json_file:
    json_file.write(model_json)

# Define the model checkpoint and Tensorboard callbacks
checkpoint = ModelCheckpoint(weights_filepath, monitor='val_loss', verbose=1, save_best_only=True, mode='min')
tensorboard = TensorBoard(os.path.join('logs', model_name))
callbacks_list = [checkpoint, tensorboard]

# Train the model
train_steps_cnn_aug = train_gen_aug.n//train_gen_aug.batch_size
val_steps_cnn_aug = val_gen_aug.n//val_gen_aug.batch_size

history = model_cnn_aug.fit(train_gen_aug, steps_per_epoch=train_steps_cnn_aug,
                            validation_data=val_gen_aug,
                            validation_steps=val_steps_cnn_aug,
                            epochs=1,
                            callbacks=callbacks_list)

In [None]:
# Getting labels and predictions on validation set
val_true = val_gen_aug.classes
val_probs = model_cnn_aug.predict(val_gen_aug, steps=val_steps_cnn_aug)

# Calculating false positive rate (fpr), true positive rate (tpr) and AUC
fpr, tpr, thresholds = roc_curve(val_true, val_probs)
roc_auc = auc(fpr, tpr)

# Generate ROC curve
roc = RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=roc_auc)
roc.plot()