

# Histopathological Image Classification Using Bayesian Convolutional Neural Networks

_This work was completed at Aalto University in the spring of 2019 as part of the course CS-E4890 - Deep Learning._

![stockphoto](swirls_2_crop.jpeg "photo by Unsplash user @lurm")

## Section 0. Background

### Motivation
The aim of the project was to apply probabilistic machine learning, in the form of Bayesian neural networks, to the classification of histopathological images in order to achieve robust and reliable prediction of the presence of cancer. The probabilistic approach was chosen because medical applications of machine learning, such as diagnostic tools, require us to be able to quantify the uncertainty present in our models. Neural networks, particularly deep neural networks, have garnered much attention from both academia and the public thanks to groundbreaking results in many tasks, but they have serious deficiencies when it comes to quantifying the quality of predictions. The unpredictable behaviour highlighted by [adversarial examples](https://openai.com/blog/adversarial-example-research/) presents a major problem in applications such as medical technology, where the system must behave robustly and predictably. Ideally we would like the neural network not just to make predictions, but also to provide reliable information about the confidence of those predictions.

### Bayesian neural networks
The motivation behind the development of Bayesian neural networks (BNNs) is precisely to create models that are inherently probabilistic and that give us an idea of the model's confidence in its predictions. As the name suggests, BNNs are probabilistic models that apply [Bayesian inference](https://en.wikipedia.org/wiki/Bayesian_inference) to the training procedure. While traditional neural networks learn by adjusting the weight parameters of neurons to minimize a chosen loss function, BNNs assign a probability distribution to each of the weights. The following illustration by [Shridhar et al.](https://github.com/kumar-shridhar/PyTorch-BayesianCNN/tree/master/Image%20Recognition) shows this difference between traditional and Bayesian neural networks quite nicely.
![stockphoto](BayesCNNwithdist.png "photo by Unsplash user @lurm") 
BNNs learn by forming the posterior distribution of the weights based on the training data. What this means is that, given a likelihood function $p(\mathcal D | \mathbf w)$, we calculate the posterior distribution of the weights using Bayes' theorem

$$ p(\mathbf w | \mathcal D) = \frac{p(\mathcal D | \mathbf w)p(\mathbf w)}{p(\mathcal D)},$$

where $\mathcal D$ is the training data, $\mathbf w$ is a particular set of weights, and $p(\mathbf w)$ is a prior on the weights. In practice, we do not calculate the exact posterior distribution, since the term $p(\mathcal D)$ is generally intractable. We instead use approximations that are explained below. As the weights are described by probability distributions, the output for each sample is also a probability distribution over the classes. This posterior predictive distribution is evaluated using the formula

$$p(y^* | \mathbf x^*) = \int p(y^* | \mathbf x^*, \mathbf w)p(\mathbf w | \mathcal D)\text d \mathbf w,$$

where $p(y^* | \mathbf x^*, \mathbf w)$ is the probability of the model with weights $\mathbf w$ assigning the label $y^*$ to unseen data example $\mathbf x^*$. Again, calculating these distributions exactly would be far too computationally intensive, or outright impossible, so we instead use clever mathematical tricks to approximate the distribution of layer activations, and draw samples to give us an estimate of the posterior predictive distribution. By sampling the activations and using them to compute multiple outputs, Bayesian neural networks are in effect a form of [ensemble learning](https://en.wikipedia.org/wiki/Ensemble_learning). This reduces overfitting and allows us to quantify the uncertainty in the predictions.
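To make the sampling idea concrete, here is a minimal sketch of a Monte Carlo estimate of the posterior predictive distribution. It is not the method used later in this notebook: it assumes a toy logistic-regression model and a factorized Gaussian approximation to the posterior, and the names `mc_predictive`, `w_mean` and `w_std` are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mc_predictive(x, w_mean, w_std, n_samples=1000, seed=0):
    """Monte Carlo estimate of p(y*=1 | x*) for a toy logistic model.

    Assumption: the posterior over weights is approximated by an
    independent Gaussian N(w_mean, w_std^2) for each weight.
    """
    rng = np.random.default_rng(seed)
    # draw weight samples w_s from the (approximate) posterior
    w_samples = rng.normal(w_mean, w_std, size=(n_samples, w_mean.shape[0]))
    # evaluate p(y*=1 | x*, w_s) for every sampled weight vector
    probs = sigmoid(w_samples @ x)
    # the mean approximates the integral, the spread reflects uncertainty
    return probs.mean(), probs.std()

x_star = np.array([1.0, -0.5, 2.0])
w_mean = np.array([0.8, 0.1, 0.3])

# a narrow posterior gives a tight predictive distribution ...
confident = mc_predictive(x_star, w_mean, w_std=np.full(3, 0.01))
# ... while a wide posterior gives a much larger spread
uncertain = mc_predictive(x_star, w_mean, w_std=np.full(3, 1.0))
print('narrow posterior: mean=%.3f std=%.3f' % confident)
print('wide posterior:   mean=%.3f std=%.3f' % uncertain)
```

Averaging over the sampled models is what gives the ensemble-like behaviour, and the spread of the sampled probabilities is one simple way to express how confident the model is.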

### Bayesian convolutional neural networks
Even though Bayesian neural networks were introduced all the way back in [1995](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.446.9306&rep=rep1&type=pdf), they failed to gain much popularity due to the difficulty of training them efficiently. It was only quite recently, with the introduction of [Bayes by backprop](https://arxiv.org/abs/1505.05424) (2015), the [local reparametrization trick](https://arxiv.org/abs/1506.02557) (2015), and finally [Bayesian CNNs](https://github.com/kumar-shridhar/PyTorch-BayesianCNN/blob/master/Paper/BayesianCNN-with-VariationalInference.pdf) (2018), that deep Bayesian neural networks became practical. Instead of evaluating Bayes' theorem directly, or even using regular [variational inference](http://edwardlib.org/tutorials/bayesian-neural-network), we use a clever method where we perform two convolutional operations: one which seeks to optimize the means of the parameter distributions, and another to optimize the variances.
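The two-convolution idea can be sketched roughly as follows. This is my own simplified illustration of the local reparametrization trick for a convolutional layer, not the implementation in the PyTorch-BayesianCNN repo: the class name `TwoConvBayesianLayer`, the parameters `weight_mu` and `weight_rho`, and the softplus parametrization of the standard deviation (borrowed from Bayes by backprop) are all assumptions, and the KL term needed for training is left out.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoConvBayesianLayer(nn.Module):
    """Hypothetical Bayesian conv layer with Gaussian weights."""

    def __init__(self, in_channels, out_channels, kernel_size, padding=0):
        super().__init__()
        shape = (out_channels, in_channels, kernel_size, kernel_size)
        # mean of each weight distribution
        self.weight_mu = nn.Parameter(torch.randn(shape) * 0.1)
        # rho is mapped through softplus so the std stays positive (assumption)
        self.weight_rho = nn.Parameter(torch.full(shape, -5.0))
        self.padding = padding

    def forward(self, x):
        weight_var = F.softplus(self.weight_rho) ** 2
        # first convolution: mean of the output activations
        act_mean = F.conv2d(x, self.weight_mu, padding=self.padding)
        # second convolution: variance of the output activations
        act_var = F.conv2d(x ** 2, weight_var, padding=self.padding)
        # sample the activations instead of the weights
        eps = torch.randn_like(act_mean)
        return act_mean + torch.sqrt(act_var + 1e-8) * eps

torch.manual_seed(0)
layer = TwoConvBayesianLayer(3, 8, kernel_size=3, padding=1)
x = torch.randn(4, 3, 96, 96)
out1, out2 = layer(x), layer(x)
print(out1.shape)
```

Because the noise is drawn for the activations, two forward passes on the same input give different outputs, which is what lets us sample predictions at test time.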

### Data
The data used is from a [Kaggle competition](https://www.kaggle.com/c/histopathologic-cancer-detection/overview), which is a slightly modified version of the [PCam](https://github.com/basveeling/pcam) dataset. From the PCam GitHub page:
> [The data] consists of 327 680 color images (96 x 96px) extracted from histopathologic scans of lymph node sections. Each image is annoted [sic] with a binary label indicating presence of metastatic tissue...all splits have a 50/50 balance between positive and negative examples...A positive label indicates that the center 32x32px region of a patch contains at least one pixel of tumor tissue. Tumor tissue in the outer region of the patch does not influence the label.

The Kaggle version removes duplicates and splits off 25% of the samples into a test set, which brings the number of training samples down to 220 000. The data can be downloaded [here](https://www.kaggle.com/c/histopathologic-cancer-detection/data). I used the Kaggle data, since it is provided in a convenient format and allows results to be directly compared to the competition submissions.
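The labelling rule quoted above can be illustrated with a small sketch. It assumes a hypothetical pixel-level tumor mask for each 96 x 96 patch; the code in this notebook only works with the binary labels, and the function name `patch_label` is made up for this example.

```python
import numpy as np

def patch_label(tumor_mask, center=32):
    """Return 1 if the central center x center region contains tumor pixels."""
    h, w = tumor_mask.shape
    top, left = (h - center) // 2, (w - center) // 2
    center_region = tumor_mask[top:top + center, left:left + center]
    return int(center_region.any())

# tumor pixel only in the outer region: does not influence the label
mask_outer = np.zeros((96, 96), dtype=bool)
mask_outer[5, 5] = True

# a single tumor pixel inside the central 32 x 32 region: positive label
mask_center = np.zeros((96, 96), dtype=bool)
mask_center[40, 50] = True

print(patch_label(mask_outer), patch_label(mask_center))  # prints "0 1"
```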

## Section 1. Data

In [12]:
import torch
import csv
import os
import sys
import time
import numpy as np
import math
import matplotlib.pyplot as plt
import torch.backends.cudnn as cudnn
import torch.optim as optim
import torch.nn.functional as F

from PIL import Image
from torch.utils import data
from torchvision import transforms
from torchvision.transforms import ToTensor

In [13]:
# csv which contains the IDs and labels for the training data
label_path = 'kaggle/train_labels.csv'

# we make a dictionary called labels, which pairs each ID with its label
with open(label_path, mode='r') as file:
    reader = csv.reader(file)
    next(reader) # skip the header
    labels = {row[0]: int(row[1]) for row in reader}

# we also store all the IDs in a separate list
list_IDs = list(labels.keys())

In [14]:
# we use the Torch Dataset class to load the samples with parallelization
class Dataset(data.Dataset):
    def __init__(self, list_IDs, labels, transform):
        self.labels = labels
        self.list_IDs = list_IDs
        self.transform = transform
        
    def __len__(self):
        return len(self.list_IDs)
    
    def __getitem__(self, index):
        ID = self.list_IDs[index]
        filename = '{}.tif'.format(ID)
        
        filepath = os.path.join('kaggle', 'train', filename)
        image = Image.open(filepath)
        X = self.transform(image)
        y = self.labels[ID]
        
        return X, y

In [25]:
# for experiments use 10% of the samples for testing, 10% for validation and 80% for training
test_ratio = 0.1
validation_ratio = 0.1

# transformations applied to samples
transform = transforms.Compose([
    transforms.RandomHorizontalFlip(), # to prevent overfitting
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ])

training_set = Dataset(list_IDs, labels, transform=transform)

n_train = len(training_set)
indices = list(range(n_train))
# randomize the order of the samples
np.random.shuffle(indices)
test_split = int(np.floor(test_ratio * n_train))
validation_split = int(np.floor((test_ratio + validation_ratio) * n_train))

test_IDs, validation_IDs, training_IDs = indices[:test_split], \
    indices[test_split:validation_split], indices[validation_split:]

# change the number of workers based on your CPU
loader_params = {'batch_size': 32, 'num_workers': 6}
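One way to turn the three index lists into data loaders is with `SubsetRandomSampler`. The sketch below is an assumption about how this could be done rather than a record of the original code; it uses a small synthetic `TensorDataset` (`toy_set`) as a stand-in for the image data and sets `num_workers` to 0 so that it runs anywhere.

```python
import numpy as np
import torch
from torch.utils import data

# synthetic stand-in for the image dataset (assumption: 100 tiny samples)
toy_set = data.TensorDataset(torch.randn(100, 3, 8, 8),
                             torch.randint(0, 2, (100,)))

test_ratio, validation_ratio = 0.1, 0.1
n = len(toy_set)
indices = list(range(n))
np.random.seed(0)
np.random.shuffle(indices)

# same split arithmetic as in the cell above
test_split = int(np.floor(test_ratio * n))
validation_split = int(np.floor((test_ratio + validation_ratio) * n))
test_idx = indices[:test_split]
val_idx = indices[test_split:validation_split]
train_idx = indices[validation_split:]

loader_params = {'batch_size': 32, 'num_workers': 0}

# each loader only draws samples from its own subset of indices
train_loader = data.DataLoader(
    toy_set, sampler=data.SubsetRandomSampler(train_idx), **loader_params)
val_loader = data.DataLoader(
    toy_set, sampler=data.SubsetRandomSampler(val_idx), **loader_params)
test_loader = data.DataLoader(
    toy_set, sampler=data.SubsetRandomSampler(test_idx), **loader_params)

n_seen = sum(len(y) for _, y in train_loader)
print(len(train_idx), len(val_idx), len(test_idx), n_seen)
```

Since all three loaders wrap the same dataset object, they also share the same transform; a separate dataset instance would be needed if the test samples should skip the random flip.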

## Section 2. Method

<div class="alert alert-block alert-info">
<b>Note:</b> The latest version (as of 18.05.2019) of at least one of the modules in the PyTorch-BayesianCNN repo has a bug that results in nan when calculating losses for Bayesian layers. This is a <a href="https://github.com/kumar-shridhar/PyTorch-BayesianCNN/issues/8">known issue</a> and the author has promised a fix in the future. I used version 83f5333, which I would recommend until a proper fix is issued.
</div>

In [10]:
# if you are running the code, make sure this path is correct
sys.path.append('../PyTorch-BayesianCNN/Image Recognition/')

# the module containing the main implementation
import utils.BBBlayers

## Section 3. Experiments and results

## Section 4. Conclusions