### Autoencoder with Convolutional Neural Network

In this notebook, I experiment with an autoencoder for augmenting the histopathology dataset. The augmented data is used to train a CNN to see whether it performs better than a baseline CNN trained on the original dataset. First, the required libraries are imported and the size of the images in the PatchCAMELYON (PCAM) dataset is defined.

In [3]:
import matplotlib.pyplot as plt
import numpy as np
import os

from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, GlobalAveragePooling2D, Input, MaxPool2D, UpSampling2D
from tensorflow.keras import utils
from tensorflow.keras.optimizers import Adam, SGD
from tensorflow.keras.callbacks import ModelCheckpoint, TensorBoard

from sklearn.metrics import roc_curve, auc, RocCurveDisplay

# the size of the images in the PCAM dataset
IMAGE_SIZE = 96

### Instantiating data generators

The PatchCAMELYON dataset is too big to fit in the working memory of most personal computers. We therefore need functions that read the image data batch by batch, so that only a single batch of images is held in memory at any time. The `ImageDataGenerator` class from the Keras API does exactly this. Note that the generators are defined within the function `get_pcam_generators`, which returns them as output arguments. This function will later be called from the main code body. The function `model_transform`, defined later in this notebook, can be passed as `prep_function` to set the preprocessing function for the data generators.


In [4]:
def get_pcam_generators(base_dir, train_batch_size=32, val_batch_size=32, class_mode='binary', prep_function=None):
     # dataset parameters
     TRAIN_PATH = os.path.join(base_dir, 'train+val', 'train')
     VALID_PATH = os.path.join(base_dir, 'train+val', 'valid')

     RESCALING_FACTOR = 1./255

     # instantiate data generators
     if prep_function is None:
          datagen = ImageDataGenerator(rescale=RESCALING_FACTOR)
     else:
          datagen = ImageDataGenerator(rescale=RESCALING_FACTOR, preprocessing_function=prep_function)

     train_gen = datagen.flow_from_directory(TRAIN_PATH,
                                             target_size=(IMAGE_SIZE, IMAGE_SIZE),
                                             batch_size=train_batch_size,
                                             class_mode=class_mode)


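     # shuffle=False keeps the validation batches in directory order, so that
     # predictions line up with `val_gen.classes` in the ROC analysis
     val_gen = datagen.flow_from_directory(VALID_PATH,
                                           target_size=(IMAGE_SIZE, IMAGE_SIZE),
                                           batch_size=val_batch_size,
                                           class_mode=class_mode,
                                           shuffle=False)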
     
     return train_gen, val_gen
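
The batch-by-batch idea itself does not depend on Keras. As a minimal sketch (the helper `iter_batches` and the file names are hypothetical and only serve as illustration), a Python generator can walk over a list of items and hand out one fixed-size batch at a time, so that only that batch has to be held in memory:

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Hypothetical file names standing in for image paths on disk
file_names = [f"patch_{i:03d}.jpg" for i in range(10)]
batch_sizes = [len(batch) for batch in iter_batches(file_names, batch_size=4)]
print(batch_sizes)  # [4, 4, 2]
```

The Keras generators follow the same pattern, but additionally load, resize and rescale the images of each batch.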

Now, the function `get_pcam_generators` that defines the data generators can be called from the main code body. Before executing the code block below, do not forget to change the path where the PatchCAMELYON dataset is located (that is, the location of the folder that contains `train+val` that you previously downloaded and unpacked).

If everything is correct, the following output will be printed on screen after executing the code block:

`Found 144000 images belonging to 2 classes.`

`Found 16000 images belonging to 2 classes.`

In [5]:
train_gen, val_gen = get_pcam_generators('../data')

Found 144000 images belonging to 2 classes.
Found 16000 images belonging to 2 classes.


### Building model architectures

The model architectures are defined within a class. Organizing the code into classes and functions instead of piling everything up in a single script makes the code easier to read and understand, and helps reuse functionality that is already implemented. For example, we can use the `get_pcam_generators` function to create data generators with different batch sizes just by calling the function with a different set of parameters. Similarly, we can use the `model_architecture` class to generate networks with different numbers of feature maps (see below).

In [6]:
# Class structure from Constantijn & Nino implemented for AE model from Mart
class model_architecture(Sequential):
    def __init__(self, kernel_size, pool_size, first_filters, second_filters):
        super().__init__()
        self.add(Input(shape=(IMAGE_SIZE,IMAGE_SIZE,3)))
        self.kernel_size = kernel_size
        self.pool_size = pool_size
        self.first_filters = first_filters
        self.second_filters = second_filters

    def create_cnn(self):
        # the input shape is already set by the Input layer added in __init__
        self.add(Conv2D(self.first_filters, self.kernel_size, activation='relu', padding='same'))
        self.add(MaxPool2D(pool_size=self.pool_size))
        self.add(Conv2D(self.second_filters, self.kernel_size, activation='relu', padding='same'))
        self.add(MaxPool2D(pool_size=self.pool_size))

        # layers replacing the dense layers
        self.add(Conv2D(self.second_filters, (6,6), activation='relu', padding='valid'))
        self.add(Conv2D(1, (1,1), activation='sigmoid', padding='same'))
        self.add(GlobalAveragePooling2D())

    def compile_cnn(self):
        self.compile(SGD(learning_rate=0.01, momentum=0.95), loss = 'binary_crossentropy', metrics=['accuracy'])

    def create_autoencoder(self):
        # Encoder
        self.add(Conv2D(self.first_filters, self.kernel_size, activation='relu', padding='same'))
        self.add(MaxPool2D(self.pool_size, padding='same'))
        self.add(Conv2D(self.second_filters, self.kernel_size, activation='relu', padding='same'))
        self.add(MaxPool2D(self.pool_size, padding='same'))

        # Decoder
        self.add(Conv2D(self.second_filters, self.kernel_size, activation='relu', padding='same'))
        self.add(UpSampling2D(self.pool_size))
        self.add(Conv2D(self.first_filters, self.kernel_size, activation='relu', padding='same'))
        self.add(UpSampling2D(self.pool_size))
        self.add(Conv2D(3, self.kernel_size, activation='sigmoid', padding='same'))

    def compile_autoencoder(self):
        self.compile(Adam(learning_rate=0.001), loss='mean_squared_error')
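
The spatial sizes in the two architectures can be checked by hand. The sketch below is plain Python, not Keras; the helper names `pooled_size` and `valid_conv_size` are made up for this illustration, and the pooling sizes are the ones used later in this notebook, (4,4) for the CNN and (2,2) for the autoencoder. It shows why the (6,6) convolution with `padding='valid'` can replace a dense layer: after two (4,4) poolings the 96x96 input has shrunk to 6x6, so the kernel covers the whole feature map.

```python
def pooled_size(size, pool, padding="valid"):
    """Spatial size after a pooling layer whose stride equals its pool size."""
    if padding == "same":
        return -(-size // pool)  # ceiling division
    return size // pool  # 'valid' (the Keras default) drops any remainder


def valid_conv_size(size, kernel):
    """Spatial size after a stride-1 convolution with 'valid' padding."""
    return size - kernel + 1


IMAGE_SIZE = 96

# CNN with pool_size=(4,4): 96 -> 24 -> 6, then the (6,6) 'valid' convolution -> 1
after_pool1 = pooled_size(IMAGE_SIZE, 4)
after_pool2 = pooled_size(after_pool1, 4)
after_conv6 = valid_conv_size(after_pool2, 6)
print(after_pool1, after_pool2, after_conv6)  # 24 6 1

# Autoencoder with pool_size=(2,2): 96 -> 48 -> 24, upsampled back 24 -> 48 -> 96
bottleneck = pooled_size(pooled_size(IMAGE_SIZE, 2, "same"), 2, "same")
restored = bottleneck * 2 * 2
print(bottleneck, restored)  # 24 96
```

With a different pooling size, the (6,6) kernel in `create_cnn` would no longer match the size of the final feature map and would have to be adjusted.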
        

### Training the CNN model on regular data and evaluating the model

In the next part, an instance of the `model_architecture` class is created for the CNN model. A kernel size of (3,3) and a pooling size of (4,4) are used. The model is then trained on the regular dataset, followed by a ROC curve analysis of the trained CNN model.

In [None]:
model_cnn = model_architecture(kernel_size=(3,3), pool_size=(4,4), first_filters=32, second_filters=64)
model_cnn.create_cnn()
model_cnn.compile_cnn()
model_cnn._name = 'cnn'

model_cnn.summary();

In [None]:
# Save the model and weights
model_name = 'cnn'
model_filepath = model_name + '.json'
weights_filepath = model_name + '_weights.hdf5'

model_json = model_cnn.to_json() # serialize model to JSON
with open(model_filepath, 'w') as json_file:
    json_file.write(model_json)

# Define the model checkpoint and Tensorboard callbacks
checkpoint = ModelCheckpoint(weights_filepath, monitor='val_loss', verbose=1, save_best_only=True, mode='min')
tensorboard = TensorBoard(os.path.join('logs', model_name))
callbacks_list = [checkpoint, tensorboard]

# Train the model
train_steps = train_gen.n//train_gen.batch_size
val_steps = val_gen.n//val_gen.batch_size

history = model_cnn.fit(train_gen, steps_per_epoch=train_steps,
                        validation_data=val_gen,
                        validation_steps=val_steps,
                        epochs=3,
                        callbacks=callbacks_list)
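
The step counts follow directly from the generator sizes reported earlier (144000 training and 16000 validation images). A small worked example, with `steps_per_epoch` as a hypothetical helper that is not part of the notebook:

```python
def steps_per_epoch(n_samples, batch_size, divisor=1):
    """Number of full batches per epoch, optionally keeping only 1/divisor of them."""
    return n_samples // batch_size // divisor


cnn_train_steps = steps_per_epoch(144000, 32)
cnn_val_steps = steps_per_epoch(16000, 32)
print(cnn_train_steps, cnn_val_steps)  # 4500 500

# The autoencoder cell further below uses batch size 16 and an extra `//4`
ae_train_steps = steps_per_epoch(144000, 16, divisor=4)
ae_val_steps = steps_per_epoch(16000, 16, divisor=4)
print(ae_train_steps, ae_val_steps)  # 2250 250

# 500 validation steps of 32 images cover the validation set exactly
covers_all_validation_images = cnn_val_steps * 32 == 16000
```

Because the validation set divides evenly into batches of 32, predicting for `val_steps` steps yields exactly one prediction per validation image.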

In [None]:
# Getting labels and predictions on validation set
val_true = val_gen.classes
val_probs = model_cnn.predict(val_gen, steps=val_steps)

# Calculating false positive rate (fpr), true positive rate (tpr) and AUC
fpr, tpr, thresholds = roc_curve(val_true, val_probs)
roc_auc = auc(fpr, tpr)

# Generate ROC curve
roc = RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=roc_auc)
roc.plot();
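
To make clear what `roc_curve` and `auc` compute, the sketch below rebuilds both with NumPy on a four-sample toy example. It is a simplified illustration and not the scikit-learn implementation: `roc_points` and `trapezoid_auc` are hypothetical helpers, and the code assumes that all scores are distinct.

```python
import numpy as np


def roc_points(y_true, y_score):
    """Return (fpr, tpr) by sweeping the threshold over the scores, highest first.

    Assumes all scores are distinct; tied scores would have to be grouped.
    """
    y_true = np.asarray(y_true)
    order = np.argsort(-np.asarray(y_score, dtype=float))
    sorted_labels = y_true[order]
    tps = np.cumsum(sorted_labels == 1)  # true positives at each threshold
    fps = np.cumsum(sorted_labels == 0)  # false positives at each threshold
    tpr = np.concatenate([[0.0], tps / tps[-1]])
    fpr = np.concatenate([[0.0], fps / fps[-1]])
    return fpr, tpr


def trapezoid_auc(fpr, tpr):
    """Area under the curve with the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))


toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_fpr, toy_tpr = roc_points(toy_labels, toy_scores)
toy_auc = trapezoid_auc(toy_fpr, toy_tpr)
print(toy_auc)  # 0.75
```

The result is only meaningful when labels and scores refer to the same samples in the same order. For a data generator, this means the labels in `classes` must be in the order in which the images were fed to `predict`.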

### Training and evaluating the autoencoder model

First, we need to construct new data generators. Training the autoencoder is unsupervised, so the class mode of the data generators is set to `input`: each batch then uses the images themselves as targets instead of the class labels. A new instance of the `model_architecture` class is created for the autoencoder.

In [7]:
# Constructing the data generators for unsupervised learning for autoencoder training
train_gen_ae, val_gen_ae = get_pcam_generators('../data', 
                                               train_batch_size=16, 
                                               val_batch_size=16, 
                                               class_mode='input') 

Found 144000 images belonging to 2 classes.
Found 16000 images belonging to 2 classes.


In [8]:
model_ae = model_architecture(kernel_size=(3,3), pool_size=(2,2), first_filters=32, second_filters=16)
model_ae.create_autoencoder()
model_ae.compile_autoencoder()
model_ae._name = 'Autoencoder'

model_ae.summary();

Model: "Autoencoder"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d (Conv2D)             (None, 96, 96, 32)        896       
                                                                 
 max_pooling2d (MaxPooling2  (None, 48, 48, 32)        0         
 D)                                                              
                                                                 
 conv2d_1 (Conv2D)           (None, 48, 48, 16)        4624      
                                                                 
 max_pooling2d_1 (MaxPoolin  (None, 24, 24, 16)        0         
 g2D)                                                            
                                                                 
 conv2d_2 (Conv2D)           (None, 24, 24, 16)        2320      
                                                                 
 up_sampling2d (UpSampling2  (None, 48, 48, 16)        
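
The parameter counts visible in the summary can be verified by hand: a `Conv2D` layer with bias has (kernel height x kernel width x input channels + 1) x output channels trainable parameters. A short check, with `conv2d_params` as a hypothetical helper:

```python
def conv2d_params(kernel_size, in_channels, out_channels):
    """Trainable parameters of a Conv2D layer with bias."""
    kernel_height, kernel_width = kernel_size
    weights_per_filter = kernel_height * kernel_width * in_channels
    return (weights_per_filter + 1) * out_channels  # +1 for the bias of each filter


print(conv2d_params((3, 3), 3, 32))   # conv2d:   896
print(conv2d_params((3, 3), 32, 16))  # conv2d_1: 4624
print(conv2d_params((3, 3), 16, 16))  # conv2d_2: 2320
```

The pooling and upsampling layers have no trainable parameters, which is why they appear with a count of 0.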

In [9]:
# Save the model and weights
model_name = 'autoencoder'
model_filepath = model_name + '.json'
weights_filepath = model_name + '_weights.hdf5'

model_json = model_ae.to_json() # serialize model to JSON
with open(model_filepath, 'w') as json_file:
    json_file.write(model_json) 

# Define the model checkpoint and Tensorboard callbacks
checkpoint = ModelCheckpoint(weights_filepath, monitor='val_loss', verbose=1, save_best_only=True, mode='min')
tensorboard = TensorBoard(os.path.join('logs', model_name))
callbacks_list = [checkpoint, tensorboard]

# Train the model
# The extra `//4` limits each epoch to a quarter of the available batches
train_steps_ae = train_gen_ae.n//train_gen_ae.batch_size//4
val_steps_ae = val_gen_ae.n//val_gen_ae.batch_size//4

history = model_ae.fit(train_gen_ae, steps_per_epoch=train_steps_ae, 
                       validation_data=val_gen_ae,
                       validation_steps=val_steps_ae,
                       epochs=3,
                       callbacks=callbacks_list)

Epoch 1/3
Epoch 1: val_loss improved from inf to 0.01169, saving model to autoencoder_weights.hdf5
Epoch 2/3


Let's visualize the output of the trained convolutional autoencoder (CAE). The output of the CAE model is used as the augmented dataset in the upcoming steps.

In [None]:
# Reconstruct one batch of training images with the trained autoencoder
img_batch = train_gen_ae[0][1] # [batch_nr][0: inputs, 1: targets]; both hold the images when class_mode='input'
predict_test = model_ae.predict(img_batch) 
image_nr = 1

fig,ax = plt.subplots(1,2)
ax[0].imshow(img_batch[image_nr])
ax[0].set_title('Original image')
ax[1].imshow(predict_test[image_nr])
ax[1].set_title('Reconstructed image');
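
Visual inspection can be complemented with a number per image. The sketch below computes the mean squared error and the peak signal-to-noise ratio (PSNR) for images scaled to [0, 1]. It uses synthetic arrays in place of `img_batch` and `predict_test`, and `reconstruction_error` is a hypothetical helper, not part of the notebook.

```python
import numpy as np


def reconstruction_error(original, reconstructed):
    """Per-image mean squared error and PSNR (dB) for image batches scaled to [0, 1]."""
    original = np.asarray(original, dtype=np.float64)
    reconstructed = np.asarray(reconstructed, dtype=np.float64)
    mse = np.mean((original - reconstructed) ** 2, axis=(1, 2, 3))
    with np.errstate(divide="ignore"):  # a perfect reconstruction gives an infinite PSNR
        psnr = 10.0 * np.log10(1.0 / mse)  # the peak value is 1.0 for [0, 1] images
    return mse, psnr


# Synthetic stand-ins: a batch of two 96x96 RGB images
demo_batch = np.full((2, 96, 96, 3), 0.5)
demo_recon = demo_batch.copy()
demo_recon[1] += 0.1  # the second image is off by 0.1 in every pixel
demo_mse, demo_psnr = reconstruction_error(demo_batch, demo_recon)
print(demo_mse)   # approximately [0, 0.01]
print(demo_psnr)  # approximately [inf, 20]
```

In the notebook, the call would be `reconstruction_error(img_batch, predict_test)`. For reference, a mean squared error of 0.01169, the validation loss reported after the first epoch, corresponds to a PSNR of about 19.3 dB.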

### Training CNN with CAE preprocessing function

The generators for the augmented data are initialized with the CAE model as the preprocessing function, so every image is passed through the autoencoder before it reaches the CNN. The `model_architecture` class is used again to create a new CNN instance, which is trained on the augmented dataset.

In [None]:
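# Written by Constantijn
def model_transform(image):
     # ImageDataGenerator calls this function before applying `rescale`,
     # so the image arrives with pixel values in [0, 255]
     image = utils.img_to_array(image) / 255. # scale to [0, 1], the range the autoencoder was trained on
     image = np.array([image]) # Convert into a single batch
     image_prediction = model_ae.predict(image, verbose=0)

     # Scale back to [0, 255]; the generator's `rescale` then maps the result to [0, 1]
     return image_prediction[0] * 255.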
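
One detail is worth checking when `rescale` and `preprocessing_function` are combined: as far as I can tell from the Keras source, `ImageDataGenerator` applies the preprocessing function first and the rescaling afterwards, so the function receives pixel values in [0, 255]. The sketch below mimics that order with plain NumPy. `standardize`, `fake_autoencoder`, `naive_transform` and `scaled_transform` are hypothetical stand-ins, not Keras API and not the trained model.

```python
import numpy as np

RESCALING_FACTOR = 1.0 / 255


def standardize(image, prep_function=None, rescale=RESCALING_FACTOR):
    """Stand-in for the assumed generator order: preprocessing first, rescaling second."""
    if prep_function is not None:
        image = prep_function(image)
    return image * rescale


def fake_autoencoder(batch):
    """Stand-in for a model trained on [0, 1] inputs with a sigmoid-like [0, 1] output."""
    return np.clip(batch, 0.0, 1.0)


def naive_transform(image):
    # Feeds raw [0, 255] pixels to the model and returns its [0, 1] output unchanged
    return fake_autoencoder(image[np.newaxis])[0]


def scaled_transform(image):
    # Scales to [0, 1] for the model, then back to [0, 255] for the generator's rescale
    return fake_autoencoder(image[np.newaxis] / 255.0)[0] * 255.0


raw = np.array([[[0.0, 127.5, 255.0]]])  # one pixel, three channels, values in [0, 255]
naive_out = standardize(raw, naive_transform)
scaled_out = standardize(raw, scaled_transform)
print(naive_out.max())   # about 0.0039, i.e. 1/255
print(scaled_out.max())  # approximately 1.0
```

Under this assumption, a preprocessing function that returns values in [0, 1] ends up producing nearly black images after rescaling, while scaling inside the function keeps the CNN input in the usual [0, 1] range.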

In [None]:
# Constructing the data generators for the augmented dataset  
train_gen_aug, val_gen_aug = get_pcam_generators('../data', 
                                                 class_mode='binary', 
                                                 prep_function=model_transform)

In [None]:
model_cnn_aug = model_architecture(kernel_size=(3,3), pool_size=(4,4), first_filters=32, second_filters=64)
model_cnn_aug.create_cnn()
model_cnn_aug.compile_cnn()
model_cnn_aug._name = 'cnn_aug'

model_cnn_aug.summary()

In [None]:
# Save the model and weights
model_name = 'cnn_aug'
model_filepath = model_name + '.json'
weights_filepath = model_name + '_weights.hdf5'

model_json = model_cnn_aug.to_json() # serialize model to JSON
with open(model_filepath, 'w') as json_file:
    json_file.write(model_json)

# Define the model checkpoint and Tensorboard callbacks
checkpoint = ModelCheckpoint(weights_filepath, monitor='val_loss', verbose=1, save_best_only=True, mode='min')
tensorboard = TensorBoard(os.path.join('logs', model_name))
callbacks_list = [checkpoint, tensorboard]

# Train the model
train_steps_cnn_aug = train_gen_aug.n//train_gen_aug.batch_size
val_steps_cnn_aug = val_gen_aug.n//val_gen_aug.batch_size

history = model_cnn_aug.fit(train_gen_aug, steps_per_epoch=train_steps_cnn_aug,
                            validation_data=val_gen_aug,
                            validation_steps=val_steps_cnn_aug,
                            epochs=1,
                            callbacks=callbacks_list)

In [None]:
# Getting labels and predictions on validation set
val_true = val_gen_aug.classes
val_probs = model_cnn_aug.predict(val_gen_aug, steps=val_steps_cnn_aug)

# Calculating false positive rate (fpr), true positive rate (tpr) and AUC
fpr, tpr, thresholds = roc_curve(val_true, val_probs)
roc_auc = auc(fpr, tpr)

# Generate ROC curve
roc = RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=roc_auc)
roc.plot()