# Clinical Heart Failure Detection Using Whole-Slide Images of H&E tissue

## Version

- **0.06**: 
  - Before any modification: a full rerun took ~15 min (most of it spent loading the train/val/test images and labels, very little on model training/testing)
  - Before any modification: Train/Val/Test Acc = 53/51.1/51.3
  - Changed model = Pre-trained ResNet50 + Modified FCN Layer
    - SGD (LR=0.0001, M=0.9), Categorical Crossentropy
    - 1st run: Train/Val/Test Acc = 57.5/72.7/67.1
    - Converted ones/zeroes count to function
    - Converted images/label prep to function
    - Epoch=1: Train/Val/Test Acc = 61/67.9/67
    - Epoch=5: Train/Val/Test Acc = 85.6/80.8/75.9
    - Epoch=10: Train/Val/Test Acc = 93.0/85.6/77.7
    - Epoch=20: Train/Val/Test Acc = 94.1/89.0/80.4
    - Epoch=20: Changing the loss function to 'binary crossentropy' reduced accuracy: Train/Val/Test Acc = 90.2/84.0/77.7. Hence, reverted to categorical crossentropy as the loss function
    - Epoch=20: Changing the optimizer to Adam(lr=0.01) increased accuracy: Train/Val/Test Acc = 99.3/86.9/83.2
    - In the run above, training accuracy is high while validation and test accuracy are lower, which indicates overfitting, so let us implement Dropout
    - Epoch=20: Adam(lr=0.01): Dropout(0.2): Train/Val/Test Acc = 93.0/88.8/82.1
    - Epoch=20: Adam(lr=0.01): Dropout(0.4): Train/Val/Test Acc = 90.8/89.8/83.4
    - Epoch=20: Adam(lr=0.01): Dropout(0.5): Train/Val/Test Acc = 92.2/90.6/82.3 ... will revert to Dropout(0.4)
    - Let us try Batch Normalization. ResNet50 -> Dense (1024) -> Batch Norm -> relu -> Dropout -> Dense(2,softmax)
    - Epoch=20: Adam(lr=0.01): Dropout(0.4): BN: Train/Val/Test Acc = 90.6/81.6/79.0 ... reduced accuracy
    - Epoch=20: Adam(lr=0.01): Dropout(0.2): BN: Train/Val/Test Acc = 95.2/86.9/81.9 ... reduced accuracy; will revert to no BN, Dropout(0.4)
    - Epoch=20: Adam(lr=0.01): Dropout(0.4): Train/Val/Test Acc = 88.2/89.3/81.7
- **0.05**: Migrate to Google Colab/Drive, changed loss function to 'binary_crossentropy', epochs to 1 : Train/Val/Test Acc = 55.7/55.9/55.2
- **0.04**: Class Label Info in List was not accurate - fixed it. Now, Train/Validate/Test Accuracy is more realistic = ~55%
- **0.03**: Tweaked the CNN model built for MNIST to work with this dataset, giving a working end-to-end CNN model. Train/Validate/Test Accuracy = ~100%
- **0.02**: Prepare Train/Validate/Test Labels and Images 
- **0.01**: Prepare Train/Validate/Test Images

## Improvement Opportunity

- **DONE**: Convert code sections in data preparation for train/validation/test to functions
- **TRIED**: As this is a 2 class classification - loss function can be changed to binary_crossentropy instead of categorical_crossentropy
- **TRIED**: Reduce parameters, epochs.
- Use Data Augmentation
- Try to train last few layers of ResNet50 with the data
- Accuracy is fluctuating... optimize hyperparameters to get a smooth increase
- Try k-fold cross-validation
- Try changing batch size


## Download Dataset

### Download Train, Validate and Test Images
- Source Link to the Dataset / Annotation File: https://idr.openmicroscopy.org/webclient/?show=project-402
- Follow the instructions at following link, install IBM Aspera Desktop Client to download the dataset.
- Copy downloaded folders to '**data/images**' folder in your working directory where you have this Jupyter Notebook:
  - 'held-out_validation'
  - 'training'

### Download Label Information for Train, Validate and Test Images 
- The following link points to the GitHub repository below, which has the annotation file: https://idr.openmicroscopy.org/webclient/?show=project-402
- Source Link for the Annotation File: https://github.com/IDR/idr0042-nirschl-wsideeplearning/tree/master/experimentA
- Download and copy file '**idr0042-experimentA-annotation.csv**' to '**data/labels/**' folder in your working directory where you have this Jupyter Notebook

## References

#### Data Preparation
- Access Google Drive files from Google Colab
  - https://www.youtube.com/watch?reload=9&v=lHRC5gFvQnA
- Reading an image
  - matplotlib: https://stackoverflow.com/questions/9298665/cannot-import-scipy-misc-imread
  - pathlib: https://medium.com/@ageitgey/python-3-quick-tip-the-easy-way-to-deal-with-file-paths-on-windows-mac-and-linux-11a072b58d5f#:~:text=To%20use%20it%2C%20you%20just,for%20the%20current%20operating%20system.
  - OpenCV: https://www.geeksforgeeks.org/python-opencv-cv2-imread-method/
- Load multiple images into a numpy array
  - glob / os.listdir: https://stackoverflow.com/questions/39195113/how-to-load-multiple-images-in-a-numpy-array
  - glob / cv2: https://medium.com/@muskulpesent/create-numpy-array-of-images-fecb4e514c4b
- Load a CSV file
  - Datacamp: https://www.datacamp.com/community/tutorials/pandas-read-csv?utm_source=adwords_ppc&utm_campaignid=1455363063&utm_adgroupid=65083631748&utm_device=c&utm_keyword=&utm_matchtype=b&utm_network=g&utm_adpostion=&utm_creative=278443377095&utm_targetid=dsa-429603003980&utm_loc_interest_ms=&utm_loc_physical_ms=9061994&gclid=EAIaIQobChMIz5TKz-v17QIV1AorCh0bfw96EAAYASAAEgKiGPD_BwE
- Split a String
  - Python Central: https://www.pythoncentral.io/cutting-and-slicing-strings-in-python/

#### Model
- ResNet50
  - https://cv-tricks.com/keras/understand-implement-resnets/
- Keras Optimizer / Adam
  - https://keras.io/api/optimizers/


## Understand Dataset

### Understand Images Folder Structure and Number of Images Available

Training/Validation
- \..\training\fold_1: has images for training = 770#
- \..\training\test_fold_1: has images for validation = 374#
- Total = 770 + 374 = 1144 images

Test
- \..\held-out_validation: has images for testing = 1155#

### Understand Annotation File and Label Information Available

Relevant columns of interest:
- Column A: Dataset Name: Classifies each row/instance as 'training' or 'test'
- Column B: Image Name: Specifies filename of the image for the row/instance
- Column Z: Experimental Condition [Diagnosis]: has 3 classes:
  - 'chronic heart failure'
  - 'heart tissue pathology' - We will treat this as 'not chronic heart failure'
  - 'not chronic heart failure'
- Column AA: Channels: mentions RGB => images are color images and will have 3 channels Red/Green/Blue (for CNN). 
  
Breakup of training/test instances:
- training
  - 'chronic heart failure' = 517
  - 'not chronic heart failure' = 627
- test
  - 'chronic heart failure' = 517
  - 'not chronic heart failure' = 638

Total 'training' = 517 + 627 = 1144  (Note: 'validate' is a portion of this 'training' set.)

Total 'test' = 517 + 638 = 1155
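
As a quick sanity check on the counts above, the sketch below recomputes the totals and the accuracy that a classifier would reach by always predicting the larger class. The dictionary and function names are illustrative assumptions and are not part of the notebook.

```python
# class counts copied from the breakup above
counts = {
    'training': {'chronic heart failure': 517, 'not chronic heart failure': 627},
    'test': {'chronic heart failure': 517, 'not chronic heart failure': 638},
}

def majority_baseline(class_counts):
    """Return (total, accuracy in % when always predicting the largest class)."""
    total = sum(class_counts.values())
    return total, 100.0 * max(class_counts.values()) / total

for split, class_counts in counts.items():
    total, baseline = majority_baseline(class_counts)
    print(f"{split}: total = {total}, majority-class baseline = {baseline:.1f}%")
```

An accuracy close to these baselines (roughly 55%) may indicate that a model has learned little beyond the class balance.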

## Load Libraries

We need to read the train, validation and test images into arrays so that we can feed them to our CNN model. We also need to import the annotation file into a dataframe so that we can access the label information.

In [1]:
# install OpenCV package - this is required only once
# pip install opencv-python

In [2]:
# aids in reading image files
import cv2
import glob

In [3]:
# aids in working with arrays
import numpy as np

In [4]:
# aids in working with dataframes
import pandas as pd

## Mount Google Drive

We need to mount Google Drive so that we can access the files stored on it.

In [None]:
# run this cell, open the link it shows, copy the authentication code, paste it into the prompt and hit Enter
from google.colab import drive
drive.mount('/content/gdrive')

## Get labels info into a dataframe

In [None]:
# Google Drive / Colab
filepath_annotation_file = r'/content/gdrive/MyDrive/Colab Notebooks/Clinical Heart Failure using H&E Images /data/labels/idr0042-experimentA-annotation.csv'
labels = pd.read_csv(filepath_annotation_file)

# Local Drive / Jupyter
# labels = pd.read_csv('data/labels/idr0042-experimentA-annotation.csv')

### Explore and Understand

In [None]:
# check that the contents of labels are as expected
labels

In [None]:
print(labels['Dataset Name'])

In [None]:
print(labels['Dataset Name'][0])

In [None]:
type(labels['Dataset Name'][0])

In [None]:
print(labels['Image Name'])

In [None]:
print(labels['Image Name'][0])

In [None]:
type(labels['Image Name'][0])

In [None]:
print(labels['Experimental Condition [Diagnosis]'])

In [None]:
print(labels['Experimental Condition [Diagnosis]'][0])

In [None]:
type(labels['Experimental Condition [Diagnosis]'][0])

In [None]:
# confirm 'no info' cells have been encoded as 'nan'... check one entry
print(labels['Characteristics [Disease Subtype]'][463])

## Prepare Train Images and Train Labels

### Explore and Understand

In [None]:
# Google Drive / Colab
filepathlist_train = glob.glob('/content/gdrive/MyDrive/Colab Notebooks/Clinical Heart Failure using H&E Images /data/images/training/fold_1/*.png')

In [None]:
# confirm you have got the total number of desired items in the list
len(filepathlist_train)

In [None]:
# check what an element in the file list contains
# it has both the directory information and the filename; we need to extract the filename
# the filename can then be used to look up the label info in the labels dataframe
filepathlist_train[0]

### Extract Filename and Label Info

All the images are of type '*.png'. We will read filepath for all "filenames with extension as 'png'" into a list. Here, filepath means 'relative directory + filename'. We will extract filename of the image from the file path. This filename can then be used to get the label information from the annotation file.
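
As an alternative to splitting the path string by hand, the standard library can extract the filename. The sketch below is a minimal illustration; both example paths and the helper name `extract_filename` are hypothetical and not taken from the dataset.

```python
from pathlib import PureWindowsPath

# hypothetical paths, for illustration only
colab_path = '/content/gdrive/MyDrive/data/images/training/fold_1/example_image.png'
local_path = 'data\\images\\training\\fold_1\\example_image.png'

def extract_filename(filepath):
    # PureWindowsPath accepts both '/' and '\\' as separators,
    # so the same call covers the Colab and the local Windows case
    return PureWindowsPath(filepath).name

print(extract_filename(colab_path))
print(extract_filename(local_path))
```

On a single platform, `os.path.basename(filepath)` does the same job using that platform's separator.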

#### Local Drive / Jupyter 

In [None]:
# this scenario has '\\' between the directory and filename
# split the string
# directory, filename = filepathlist_train[0].split('\\')
# gives the directory info
# directory
# gives the filename we need
# filename

#### Google Drive / Colab

##### Explore and Understand

In [None]:
len(filepathlist_train[0])

In [None]:
filepathlist_train[0][-19:]

In [None]:
filepathlist_train[0][len(filepathlist_train[0]) - 1]

In [None]:
filepathlist_train[0][-1]

In [None]:
# POC
idx = -1
while (filepathlist_train[0][idx] != '/'):
  idx = idx - 1 
# index currently points to '/' location, we need to start reading from next location to get file name
print(idx)
filename = filepathlist_train[0][idx + 1:]
print(filename)

In [None]:
# POC
index_filepathlist = 0
for filepath in filepathlist_train:
    #print(index_filepathlist, filepath)
    index_filepathlist += 1

In [None]:
# read a file using the list containing the file path
img = cv2.imread(filepathlist_train[0])

##### Define and Call Function to prepare images and labels

In [None]:
# function to prepare images & labels for the model
import os

def prepare_images_labels(path_to_img_files, labels_dataframe):
    # map each diagnosis to its class: 1 = chronic heart failure, 0 = everything else
    class_of_diagnosis = {
        'chronic heart failure': 1,
        'not chronic heart failure': 0,
        'heart tissue pathology': 0,
    }
    # build a filename -> diagnosis lookup once instead of scanning the dataframe for every image
    diagnosis_of_file = dict(zip(labels_dataframe['Image Name'],
                                 labels_dataframe['Experimental Condition [Diagnosis]']))
    # define the empty lists that need to be populated
    images = []
    labels = []
    # iterate over all image files that match the pattern
    for filepath in glob.glob(path_to_img_files):
        # extract the filename from the file path (uses the separator of the current operating system)
        filename = os.path.basename(filepath)
        diagnosis = diagnosis_of_file.get(filename)
        img = cv2.imread(filepath)
        # skip unreadable images and images without a known label so that images and labels stay aligned
        if img is None or diagnosis not in class_of_diagnosis:
            continue
        images.append(img)
        labels.append(class_of_diagnosis[diagnosis])
    return images, labels

In [None]:
# read filepath for all "filenames with extension as 'png'" into a list
# here filepath means 'relative directory + filename'

# Google Drive / Colab
path_to_img_files = '/content/gdrive/MyDrive/Colab Notebooks/Clinical Heart Failure using H&E Images /data/images/training/fold_1/*.png'
train_images, train_labels = prepare_images_labels (path_to_img_files, labels)

# Local Drive / Jupyter
# filepathlist_train = glob.glob('data/images/training/fold_1/*.png')

In [None]:
# train_labels

Convert images to numpy arrays and confirm shape is as required for CNN. 

In [None]:
# confirm the list holds the expected total number of images
len(train_images)

In [None]:
# train_images is a list
type(train_images)

In [None]:
# convert list to a numpy array and the values to float
train_images = np.array(train_images, dtype = 'float32')

In [None]:
# check the shape to confirm it is ready for CNN
# number of instances, height, width, number of channels
# number of instances = number of images
# number of channels = 3 ... as these are color images
train_images.shape

Convert labels to numpy arrays and confirm shape is as required for CNN. 

In [None]:
len(train_labels)

In [None]:
type(train_labels)

In [None]:
train_labels[0]

In [None]:
train_labels[432]

In [None]:
# convert list to a numpy array and the values to int64
train_labels = np.array(train_labels, dtype = 'int64')

In [None]:
# check the shape to confirm it is ready for CNN
train_labels.shape

In [None]:
len(train_labels)

Let us check how many Class 0 and Class 1 labels we have.

In [None]:
# function to count ones and zeroes in the label array
def print_ones_zeroes (labels):
  count_ones = 0
  count_zeroes = 0
  for i in range(len(labels)):
    if (labels[i] == 1):
      count_ones += 1
    elif (labels[i] == 0):
      count_zeroes += 1
  print('Total Labels:',(count_ones + count_zeroes))
  print('# of Class 1:',count_ones)   
  print('# of Class 0:',count_zeroes)

In [None]:
print_ones_zeroes(train_labels)
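
The same count can be sketched more compactly with `collections.Counter`. This is an illustrative alternative, assuming the labels are a sequence of 0/1 integers; the function name `count_classes` and the example labels are made up.

```python
from collections import Counter

def count_classes(labels):
    """Count how often class 1 and class 0 occur in a sequence of labels."""
    counts = Counter(int(label) for label in labels)
    return {'class_1': counts[1], 'class_0': counts[0], 'total': counts[0] + counts[1]}

example_labels = [1, 0, 0, 1, 0]  # made-up labels, for illustration only
print(count_classes(example_labels))
```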

We will one-hot encode the labels as 2-element vectors: [1, 0] for Class 0 and [0, 1] for Class 1. This matches the 2-neuron softmax output layer of the model, so that we can train and test with categorical crossentropy.

In [None]:
from keras.utils import to_categorical

In [None]:
# convert labels to categorical
train_labels = to_categorical(train_labels)
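
To make the effect of the conversion concrete, here is a plain-Python sketch of one-hot encoding for integer class labels. It only illustrates the idea and is not how Keras implements `to_categorical`; the helper name `one_hot` and the example labels are assumptions.

```python
def one_hot(labels, num_classes=2):
    """Turn integer class labels into one-hot rows, e.g. 1 -> [0.0, 1.0]."""
    encoded = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0  # set the position of the class to 1
        encoded.append(row)
    return encoded

example_labels = [0, 1, 1]  # made-up labels, for illustration only
print(one_hot(example_labels))
```

Class 0 maps to the first position and Class 1 to the second, which lines up with the two output neurons of the softmax layer.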

## Prepare Validation Images and Validation Labels

In [None]:
# read filepath for all "filenames with extension as 'png'" into a list
# here filepath means 'relative directory + filename'

# Google Drive / Colab
path_to_img_files = '/content/gdrive/MyDrive/Colab Notebooks/Clinical Heart Failure using H&E Images /data/images/training/test_fold_1/*.png'
validation_images, validation_labels = prepare_images_labels (path_to_img_files, labels)

# Local Drive / Jupyter
# filepathlist_validation = glob.glob('data/images/training/test_fold_1/*.png')

In [None]:
# convert list to a numpy array and the values to float
validation_images = np.array(validation_images, dtype = 'float32')

In [None]:
# check the shape to confirm it is ready for CNN
validation_images.shape

In [None]:
# convert list to a numpy array and the values to int64
validation_labels = np.array(validation_labels, dtype = 'int64')

In [None]:
# check the shape to confirm it is ready for CNN
validation_labels.shape

Let us check how many Class 0 and Class 1 labels we have.

In [None]:
print_ones_zeroes(validation_labels)

In [None]:
# convert labels to categorical
validation_labels = to_categorical(validation_labels)

## Prepare Test Images and Test Labels

In [None]:
# read filepath for all "filenames with extension as 'png'" into a list
# here filepath means 'relative directory + filename'

# Google Drive / Colab
path_to_img_files = '/content/gdrive/MyDrive/Colab Notebooks/Clinical Heart Failure using H&E Images /data/images/held-out_validation/*.png'
test_images, test_labels = prepare_images_labels (path_to_img_files, labels)

# Local Drive / Jupyter
# filepathlist_test = glob.glob('data/images/held-out_validation/*.png')

In [None]:
# convert list to a numpy array and the values to float
test_images = np.array(test_images, dtype = 'float32')

In [None]:
# check the shape to confirm it is ready for CNN
test_images.shape

In [None]:
# convert list to a numpy array and the values to int64
test_labels = np.array(test_labels, dtype = 'int64')

In [None]:
# check the shape to confirm it is ready for CNN
test_labels.shape

Let us check how many Class 0 and Class 1 labels we have.

In [None]:
print_ones_zeroes(test_labels)

In [None]:
# convert labels to categorical
test_labels = to_categorical(test_labels)

## Define Model

### MNIST CNN modified for H&E

In [None]:
# import libraries (general)
# from keras import models
# from keras import layers
# model_cnn = models.Sequential()

# Layer Details:
# - 2 dimensional Convolution Layer
# - Number of filters/kernels = 32
# - Filter/Kernel Size = 3x3
# - Activation Function = relu (for non-linearity detection)
# - Input Shape = 250x250 matrix with 3 channels (as we have a color image)
# model_cnn.add(layers.Conv2D(32, (3,3), activation='relu', input_shape=(250,250,3)))

# Layer Details:
# - Downsample the output from previous layer
# - We will take the max value for every 2x2 window ... moved over the input
# model_cnn.add(layers.MaxPooling2D(2,2))

# Layer Details:
# - 2 dimensional Convolution Layer
# - Number of filters/kernels = 64
# - Filter/Kernel Size = 3x3
# - Activation Function = relu (for non-linearity detection)
# model_cnn.add(layers.Conv2D(64, (3,3), activation = 'relu'))

# Layer Details:
# - Downsample the output from previous layer
# - We will take the max value for every 2x2 window ... moved over the input
# model_cnn.add(layers.MaxPooling2D(2,2))

# Layer Details:
# - 2 dimensional Convolution Layer
# - Number of filters/kernels = 64
# - Filter/Kernel Size = 3x3
# - Activation Function = relu (for non-linearity detection)
# model_cnn.add(layers.Conv2D(64, (3,3), activation='relu'))

# Data at this stage is in matrix form. We will convert it to vector form to feed to a fully connected network (FCN).
# model_cnn.add(layers.Flatten())

# We will design for 64 outputs with activation function as relu (to learn non-linearity).
# model_cnn.add(layers.Dense(64, activation = 'relu'))

# This is the final layer. Hence, the outputs will be 2 corresponding to the 2 classes:
# - clinical heart failure = yes: 1
# - clinical heart failure = no: 0
# Activation Function chosen here is softmax to have a probabilistic output. 
# model_cnn.add(layers.Dense(2, activation = 'softmax'))

# model_cnn.summary()

# Define Optimizer, Loss Function and Metrics to be used for the Model
# - Going ahead with the well known functions at this point in time
# - Selected accuracy as the metrics to understand validation / test accuracy of the model
# model_cnn.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])

# Train and Validate the Model
# We will now train the model using train images and train labels. 
# - We will use a batch size = 10.
# - 1 epoch = 770 / 10 = 77 batches
# - 1 epoch = 1 complete run of all train samples for training the model
# - We will go for a total of 1 epoch = 1 complete run of all the train samples
# We will validate the model using validation images and validation labels.
# model_cnn.fit(train_images, train_labels, epochs = 1, batch_size = 10, validation_data = (validation_images, validation_labels))

# Test the Model
# We will now test model's performance with the test data.
# - We predict the class for each of the 1155 test images using the model.
# - We will check the test accuracy.
# test_loss, test_acc = model_cnn.evaluate(test_images, test_labels)
# print('test accuracy:', (test_acc*100))
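
The batch arithmetic in the comments above can be checked with a short sketch. It assumes the last, smaller batch is still processed (hence the rounding up); the function name `batches_per_epoch` is illustrative.

```python
import math

def batches_per_epoch(num_samples, batch_size):
    """Number of batches needed to pass over all samples once."""
    return math.ceil(num_samples / batch_size)

train_batches = batches_per_epoch(770, 10)  # 770 training images
val_batches = batches_per_epoch(374, 10)    # 374 validation images
print(train_batches, val_batches)
```

With a batch size of 10, the 770 training images divide evenly into 77 batches, while the 374 validation images need 38 batches, the last of which holds only 4 images.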


### ResNet50

In [None]:
# import libraries (resnet50)
from keras.applications.resnet50 import ResNet50
# all layers used below are importable from keras.layers; the older keras.layers.core and
# keras.layers.normalization paths may not be available in newer Keras releases
from keras.layers import Activation, BatchNormalization, Dense, Dropout, GlobalAveragePooling2D
from keras.models import Model
from keras.optimizers import SGD, Adam
# only needed for data augmentation (not used yet)
from keras.preprocessing.image import ImageDataGenerator

In [None]:
# load ResNet50 with parameters pre-trained on the 'imagenet' challenge
# exclude the top (classification) layer so that we can add our own FCN layers for our desired classes
resnet50 = ResNet50(include_top=False, weights='imagenet')

# define the last layers
# get the output of the last layer of the ResNet50
prediction = resnet50.output
# add a Global Average Pooling layer (GAP) - this helps reduce number of parameters as compared to a flatten/dense layer
prediction = GlobalAveragePooling2D()(prediction)
# add a FCN with 1024 output neurons
#prediction = Dense(1024, activation = 'relu')(prediction)
# we split Dense and activation so that Batch Normalization can be applied before relu
prediction = Dense(1024)(prediction)
#prediction = BatchNormalization()(prediction)
prediction = Activation('relu')(prediction)
# add a dropout of 40% to avoid overfitting
prediction = Dropout(0.4)(prediction)
# add a FCN with 2 output neurons corresponding to the 2 classes we want to predict
prediction = Dense(2, activation = 'softmax')(prediction)

# connect the last layer with ResNet50 layer to define the model
model = Model(inputs = resnet50.input, outputs = prediction)

# we wish to use the pretrained resnet50 model as is, i.e. we do not want its parameters to get updated during training
for layer in resnet50.layers:
  layer.trainable = False

# define the optimizer, loss function and metrics we will use
# note: older Keras releases name the learning rate argument 'lr' instead of 'learning_rate'
#model.compile(optimizer = SGD(learning_rate=0.0001, momentum = 0.9), loss = 'categorical_crossentropy', metrics = ['accuracy'])
model.compile(optimizer = Adam(learning_rate=0.01), loss = 'categorical_crossentropy', metrics = ['accuracy'])

# model.summary()
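
The comment about Global Average Pooling reducing the number of parameters can be illustrated with a little arithmetic. The sketch below counts the trainable parameters of the new head, using the 2048 output channels of ResNet50. The 8x8 spatial size used for the flatten comparison is an assumption for 250x250 inputs, and the function name `dense_params` is illustrative.

```python
def dense_params(num_inputs, num_outputs):
    """Weights plus biases of a fully connected layer."""
    return num_inputs * num_outputs + num_outputs

resnet_channels = 2048           # channels in the last ResNet50 feature map
assumed_feature_map = (8, 8)     # assumed spatial size for 250x250 inputs

# head used in this notebook: GAP -> Dense(1024) -> Dense(2)
gap_head = dense_params(resnet_channels, 1024) + dense_params(1024, 2)

# hypothetical alternative: Flatten -> Dense(1024) -> Dense(2)
flat_features = assumed_feature_map[0] * assumed_feature_map[1] * resnet_channels
flatten_head = dense_params(flat_features, 1024) + dense_params(1024, 2)

print('GAP head parameters:    ', gap_head)
print('Flatten head parameters:', flatten_head)
```

Under these assumptions the pooled head has about 2.1 million trainable parameters, against roughly 134 million for the flattened variant. Dropout adds no parameters, so it does not change these counts.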

In [None]:
# train and validate model
model.fit(train_images, train_labels, epochs = 20, batch_size = 10, validation_data = (validation_images, validation_labels))

In [None]:
# test model
test_loss, test_acc = model.evaluate(test_images, test_labels)
print('test accuracy:', (test_acc*100))
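
For intuition about what the reported accuracy measures with one-hot labels, here is a plain-Python sketch: a prediction counts as correct when the most probable class matches the labelled class. This is an illustration only, not Keras's implementation; the function name `accuracy` and the example values are made up.

```python
def accuracy(probabilities, one_hot_labels):
    """Fraction of rows whose most probable class equals the labelled class."""
    correct = 0
    for probs, target in zip(probabilities, one_hot_labels):
        predicted_class = probs.index(max(probs))
        true_class = target.index(max(target))
        if predicted_class == true_class:
            correct += 1
    return correct / len(one_hot_labels)

# made-up softmax outputs and one-hot labels, for illustration only
example_probs = [[0.9, 0.1], [0.3, 0.7], [0.6, 0.4], [0.2, 0.8]]
example_labels = [[1, 0], [0, 1], [0, 1], [0, 1]]
print(accuracy(example_probs, example_labels) * 100)
```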