
# Introduction

#### - This notebook explores a novel convolutional network architecture, described in the research paper below, to build a classification system that assists in diagnosing Acute Lymphoblastic Leukemia from blood cell images.
**[Research Paper](http://www.ijcte.org/vol10/1198-H0012.pdf)**


#### - The dataset was taken from: [Link](https://homes.di.unimi.it/scotti/all/)

* Here, ALL_IDB2 version of the dataset has been used

* This dataset is completely balanced, with an equal number of samples in both classes.
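
The balance claim can be checked directly from the file names. The following is a minimal sketch, assuming the naming convention that the loading code later in this notebook relies on: a file stem ending in `1` marks a lymphoblast and one ending in `0` marks a healthy cell. The helpers `label_from_name` and `class_counts`, and the sample file names, are hypothetical and only illustrate the idea.

```python
from collections import Counter
from pathlib import Path


def label_from_name(filename):
    """Return 1 or 0 from the last character of the file stem, else None."""
    stem = Path(filename).stem
    return int(stem[-1]) if stem and stem[-1] in "01" else None


def class_counts(filenames):
    """Count how many files fall into each class, skipping unlabelled names."""
    labels = (label_from_name(name) for name in filenames)
    return Counter(label for label in labels if label is not None)


# Hypothetical file names, used only to illustrate the check.
sample_names = ["Im001_1.tif", "Im002_1.tif", "Im131_0.tif", "Im132_0.tif"]
counts = class_counts(sample_names)
is_balanced = counts[0] == counts[1]
```

Running `class_counts` on the real list of image paths should report the same count for both classes if the dataset is balanced as stated.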


#### - Data augmentation enlarges the training set so that the model can extract features efficiently without overfitting. We analyse two data augmentation setups in this notebook:
* A particular type of GAN called [SinGAN](https://arxiv.org/pdf/1905.01164.pdf), used along with the following techniques mentioned in the research paper:

   1. Grayscaling of image
   2. Horizontal reflection
   3. Vertical reflection
   4. Gaussian Blurring
   5. Histogram Equalization
   6. Rotation
   7. Translation
   8. Shearing

* SinGAN without the above techniques

**The dataset was split into 80% and 20% for training and testing respectively.**
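
One way to produce such a split while keeping both classes equally represented is a stratified split. The sketch below is an illustration under that assumption, not the procedure used in this notebook, which slices a shuffled dataframe; `stratified_split` and the toy data are hypothetical.

```python
import random


def stratified_split(samples, train_fraction=0.8, seed=3):
    """Split (item, label) pairs so each class keeps the same train/test ratio."""
    rng = random.Random(seed)
    by_label = {}
    for item, label in samples:
        by_label.setdefault(label, []).append((item, label))

    train, test = [], []
    for group in by_label.values():
        rng.shuffle(group)
        cut = int(round(len(group) * train_fraction))
        train.extend(group[:cut])
        test.extend(group[cut:])

    # Shuffle again so the classes are interleaved in each subset.
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


# Hypothetical balanced toy set: 10 samples per class.
toy = [(f"img_{i}", i % 2) for i in range(20)]
train_set, test_set = stratified_split(toy)
```

With 10 samples per class and `train_fraction=0.8`, each class contributes 8 samples to training and 2 to testing.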

#### - The methodology and results of our present analysis are detailed [here](https://docs.google.com/document/d/11XXjFRofXlyNGcE_plRDMO4xjxELFnkVqGtBEGomL2k/edit?usp=sharing)
#### - The document also covers the biases present in our methodology, ethical concerns, and a qualitative interpretation of our results.



#### Below is the detailed code implementation.



### Loading required packages




In [None]:
!pip install keras_metrics

In [None]:
import os
from pathlib import Path
from google.colab import drive
import glob
import random
import cv2
from numpy.random import seed
from tensorflow import set_random_seed
import pandas as pd
import numpy as np
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from scipy import ndimage
from skimage import exposure
import skimage
from skimage import io
from skimage import transform as tm
import seaborn as sns
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow.keras
from keras.utils import np_utils
import keras_metrics
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix,precision_score,recall_score
from sklearn.metrics import roc_auc_score
%matplotlib inline

In [None]:
# for consistent results across multiple executions
seed(3)
set_random_seed(3)

In [None]:
print(tensorflow.keras.__version__)
print(tf.__version__)

## **Mount your Google Drive**

 

##### Upload the **ALL-Keras-2019** directory from your cloned repo to the root of your Google Drive. Use the following commands and follow the provided steps to mount your Google Drive.


In [None]:
root_dir = "/content/drive/My Drive/ALL-Keras-2019/"
# The mount point must match the prefix used in root_dir
drive.mount('/content/drive',force_remount=True)

#### **The Model directory contains a data folder, Model/Data, which holds the Train and Test folders.**

#### **You can place all the images inside the *Train* folder. We will split them into training and test sets below.**




In [None]:
data_dir = 'Model/Data/Train'
dataset = Path(root_dir + data_dir)
images= dataset.glob("*.tif")
data = []

for img in images:
  # The last character of the file stem encodes the class: 1 or 0
  name = img.stem
  if name[-1]=='1':
    data.append((img,1))
  elif name[-1]=='0':
    data.append((img,0))
    
data_frame = pd.DataFrame(data,columns=['image','label'],index = None)
data_frame = data_frame.sample(frac=1.).reset_index(drop=True)
data_frame.head()

In [None]:
#  Splitting training and test data; we will not be augmenting test data
#  The first 130 shuffled rows are kept for training, the rest for testing
orig_train = data_frame[:130]
test_data = data_frame[130:]

## **Augmentation Techniques**

**Note: Test data should never be augmented, so we will only augment the training set.**


### **1. Using [SinGAN](https://arxiv.org/pdf/1905.01164.pdf)**

In [None]:
# Clone the repository
!git clone https://github.com/tamarott/SinGAN.git

In [None]:
# SinGAN works on a single image. We randomly choose 13 distinct original images
# from the training set and generate 50 artificial images from each.

random_img = random.sample(range(len(orig_train)), 13)
print(random_img)

aug_list = []

def augment(ind):
  img_path = Path(orig_train['image'][ind])
  label = orig_train['label'][ind]
  %cd /content/SinGAN
  # SinGAN looks for --input_name inside Input/Images by default
  !cp "{img_path}" Input/Images/
  !python main_train.py --input_name {img_path.name}
  !python random_samples.py --input_name {img_path.name} --mode random_samples --gen_start_scale 9
  # Generated samples are expected as 0.png ... 49.png in this folder
  out_dir = Path('/content/SinGAN/Output/RandomSamples') / img_path.stem / 'gen_start_scale=9'
  for i in range(50):
    aug_list.append((out_dir / (str(i)+'.png'), label))

# This will start generating 50 images for each randomly chosen training image.
for ind in random_img:
  augment(ind)

aug_train = pd.DataFrame(aug_list,columns=['image','label'],index = None)
aug_train = aug_train.sample(frac=1.).reset_index(drop=True)

# We will now combine original and augmented images into a single dataframe
train_data = pd.concat([orig_train,aug_train], ignore_index=True)
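
The resulting training-set sizes follow from simple arithmetic. The sketch below assumes the numbers used in this notebook: 130 original training images, 13 seed images with 50 SinGAN samples each, and six variants per image (the original plus five transforms) when the classical techniques in the next section are also applied. The constant names are illustrative.

```python
N_ORIGINAL_TRAIN = 130   # rows kept for training in the split above
N_SEED_IMAGES = 13       # images handed to SinGAN
SAMPLES_PER_SEED = 50    # random samples generated per seed image
VARIANTS_PER_IMAGE = 6   # original + 5 classical transforms

# Synthetic images produced by SinGAN.
n_singan = N_SEED_IMAGES * SAMPLES_PER_SEED

# Training set when only SinGAN is used.
n_singan_only = N_ORIGINAL_TRAIN + n_singan

# Training set when SinGAN and the classical transforms are combined.
n_both = n_singan_only * VARIANTS_PER_IMAGE

print(n_singan, n_singan_only, n_both)  # 650 780 4680
```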

### **2. Augmentation as presented in the [paper](http://www.ijcte.org/vol10/1198-H0012.pdf)**

### 8 augmentation techniques have been used here
1. Grayscaling of image
2. Horizontal reflection 
3. Vertical reflection
4. Gaussian Blurring 
5. Histogram Equalization
6. Rotation
7. Translation
8. Shearing

In [None]:
# histogram equalization function
def hist(img):
  img_to_yuv = cv2.cvtColor(img,cv2.COLOR_BGR2YUV)
  img_to_yuv[:,:,0] = cv2.equalizeHist(img_to_yuv[:,:,0])
  hist_equalization_result = cv2.cvtColor(img_to_yuv, cv2.COLOR_YUV2BGR)
  return hist_equalization_result
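
To make the effect of histogram equalization concrete, here is a minimal pure-Python sketch of the textbook formula on a flat list of intensities. It is an illustration only: `equalize` is a hypothetical helper, and OpenCV's `equalizeHist` may differ in rounding details.

```python
def equalize(values, levels=256):
    """Histogram-equalise a flat list of integer intensities in [0, levels)."""
    hist = [0] * levels
    for v in values:
        hist[v] += 1

    # Cumulative distribution of intensities.
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)

    cdf_min = next(c for c in cdf if c > 0)
    total = len(values)
    if total == cdf_min:  # constant image: nothing to spread out
        return list(values)

    # Map each intensity so that the cumulative distribution becomes linear.
    scale = (levels - 1) / (total - cdf_min)
    return [round((cdf[v] - cdf_min) * scale) for v in values]


# A low-contrast toy "image" squeezed into the range 100..103.
flat = [100, 100, 101, 101, 102, 102, 103, 103]
stretched = equalize(flat)
print(stretched)  # [0, 0, 85, 85, 170, 170, 255, 255]
```

The four nearly identical grey levels are spread across the full 0 to 255 range, which is why the technique increases contrast.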

In [None]:
# function to perform rotation on an image
def rotation(img):
  rows,cols = img.shape[0],img.shape[1]
  randDeg = random.randint(-180, 180)
  matrix = cv2.getRotationMatrix2D((cols/2, rows/2), randDeg, 0.70)
  # warpAffine expects the output size as (width, height)
  rotated = cv2.warpAffine(img, matrix, (cols, rows), borderMode=cv2.BORDER_CONSTANT,
                                     borderValue=(144, 159, 162))
  return rotated     
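
The matrix returned by `cv2.getRotationMatrix2D` follows the formula given in the OpenCV documentation. The sketch below rebuilds it in plain Python to show that the centre stays fixed while other points are rotated and scaled; `rotation_matrix_2d` and `apply_affine` are hypothetical helpers, and the 100x100 image size is taken from the `dim=100` used later in this notebook.

```python
import math


def rotation_matrix_2d(center, angle_deg, scale):
    """2x3 affine matrix for a rotation about `center` with isotropic scaling."""
    cx, cy = center
    alpha = scale * math.cos(math.radians(angle_deg))
    beta = scale * math.sin(math.radians(angle_deg))
    return [
        [alpha, beta, (1 - alpha) * cx - beta * cy],
        [-beta, alpha, beta * cx + (1 - alpha) * cy],
    ]


def apply_affine(matrix, point):
    """Apply a 2x3 affine matrix to an (x, y) point."""
    x, y = point
    return (
        matrix[0][0] * x + matrix[0][1] * y + matrix[0][2],
        matrix[1][0] * x + matrix[1][1] * y + matrix[1][2],
    )


m = rotation_matrix_2d((50, 50), 90, 0.70)
centre = apply_affine(m, (50, 50))   # stays at (50, 50)
corner = apply_affine(m, (100, 50))  # moves to roughly (50, 15)
```

A point 50 pixels to the right of the centre ends up 35 pixels above it: rotated by 90 degrees and shrunk by the 0.70 scale factor, which keeps more of the rotated cell inside the frame.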

In [None]:
# function to perform shearing of an image
def shear(img):
  # Create affine transform
  affine_tf = tm.AffineTransform(shear=0.5)
  # Apply transform to image data
  modified = tm.warp(img, inverse_map=affine_tf)
  return modified

In [None]:
def aug_method(dataframe,dim,aug=True):
  if aug:
    n = len(dataframe)
    data = np.zeros((n*6,dim,dim,3),dtype = np.float32)
    labels = np.zeros((n*6,2),dtype = np.float32)
    count = 0

    for j in range(0,n):
      img_name = dataframe.iloc[j]['image']
      label = dataframe.iloc[j]['label']
      encoded_label = np_utils.to_categorical(label, num_classes=2)
      img = cv2.imread(str(img_name))
      img = cv2.resize(img, (dim,dim))

      if img.shape[2]==1:
        img = np.dstack([img, img, img])
      orig_img = img.astype(np.float32)/255.
      data[count] = orig_img
      labels[count] = encoded_label
      # When SinGAN is also used for augmentation, only 5 of the 8 techniques above are applied.
      aug_img1 = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
      aug_img2 = cv2.flip(img, 0) 
      #aug_img3 = cv2.flip(img,1)
      #aug_img4 = ndimage.gaussian_filter(img, sigma= 5.11)
      aug_img5 = hist(img)
      aug_img6 = rotation(img)
      # Translate by 84 px in x and 56 px in y; output size is given as (width, height)
      aug_img7 = cv2.warpAffine(img, np.float32([[1, 0, 84], [0, 1, 56]]), (img.shape[1], img.shape[0]),
                                  borderMode=cv2.BORDER_CONSTANT, borderValue=(144, 159, 162))
      #aug_img8 = shear(img)
      aug_img1 = np.dstack([aug_img1, aug_img1, aug_img1])

      aug_img1 = aug_img1.astype(np.float32)/255.                 
      aug_img2 = aug_img2.astype(np.float32)/255.
      #aug_img3 = aug_img3.astype(np.float32)/255. 
      #aug_img4 = aug_img4.astype(np.float32)/255.
      aug_img5 = aug_img5.astype(np.float32)/255.
      aug_img6 = aug_img6.astype(np.float32)/255.
      aug_img7 = aug_img7.astype(np.float32)/255.
      #aug_img8 = aug_img8.astype(np.float32)/255.

      data[count+1] = aug_img1
      labels[count+1] = encoded_label
      data[count+2] = aug_img2
      labels[count+2] = encoded_label
      data[count+3] = aug_img5
      labels[count+3] = encoded_label
      data[count+4] = aug_img6
      labels[count+4] = encoded_label
      data[count+5] = aug_img7
      labels[count+5] = encoded_label
      #data[count+6] = aug_img5
      #labels[count+6] = encoded_label
      #data[count+7] = aug_img5
      #labels[count+7] = encoded_label
      #data[count+8] = aug_img5
      #labels[count+8] = encoded_label
      count +=6      
  else:
    n = len(dataframe) 
    data = np.zeros((n,dim,dim,3),dtype = np.float32)
    labels = np.zeros((n,2),dtype = np.float32) 
    count = 0
    for j in range(0,n):   
      img_name = dataframe.iloc[j]['image']
      label = dataframe.iloc[j]['label']      
      encoded_label = np_utils.to_categorical(label, num_classes=2)            
      img = cv2.imread(str(img_name))
      img = cv2.resize(img, (dim,dim))      
      if img.shape[2]==1:    
        img = np.dstack([img, img, img])                                    
      orig_img = img.astype(np.float32)/255.                       
      data[count] = orig_img
      labels[count] = encoded_label    
      count +=1                      
  return data,labels                  
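
The label handling in `aug_method` can be summarised as: one-hot encode the class, then reuse that encoding for every variant of the same image. The following pure-Python sketch illustrates this; `one_hot` and `expand_labels` are hypothetical stand-ins, not the Keras `to_categorical` implementation.

```python
def one_hot(label, num_classes=2):
    """Return a one-hot list such as [0.0, 1.0] for label 1."""
    if not 0 <= label < num_classes:
        raise ValueError(f"label {label} is outside 0..{num_classes - 1}")
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def expand_labels(labels, variants_per_image):
    """Repeat each encoded label once per variant (original + transforms)."""
    expanded = []
    for label in labels:
        expanded.extend(one_hot(label) for _ in range(variants_per_image))
    return expanded


# Two images (classes 1 and 0), six variants each.
encoded = expand_labels([1, 0], variants_per_image=6)
```

Because the transforms only change the appearance of a cell and not its class, every augmented copy inherits the label of its source image.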

In [None]:
def aug_mode(mode):
  # 'both': SinGAN samples plus the classical transforms; 'SinGAN': SinGAN samples only
  if mode not in ('both','SinGAN'):
    raise ValueError("mode must be 'both' or 'SinGAN'")
  X_train,y_train = aug_method(train_data,dim=100,aug=(mode=='both'))
  # test data is never augmented
  X_test,y_test = aug_method(test_data,dim=100,aug=False)
  print('Shape of training data:',X_train.shape)
  print('Shape of test data:',X_test.shape)
  return X_train,y_train,X_test,y_test

# augmentation with both SinGAN and other techniques (used in Case 1)
X_train,y_train,X_test,y_test = aug_mode('both')

# augmentation only with SinGAN (run this line instead before Case 2)
# X_train,y_train,X_test,y_test = aug_mode('SinGAN')



### **The following model was used in the paper**
Additionally, three dropout layers with different dropout rates are used to reduce overfitting.

In [None]:
model = tf.keras.Sequential()
model.add(tf.keras.layers.Conv2D(16,(5,5),padding='valid',input_shape = X_train.shape[1:]))
model.add(tf.keras.layers.Activation('relu'))
model.add(tf.keras.layers.MaxPooling2D(pool_size=(2,2),strides=2,padding = 'valid'))
model.add(tf.keras.layers.Dropout(0.4))

model.add(tf.keras.layers.Conv2D(32,(5,5),padding='valid'))
model.add(tf.keras.layers.Activation('relu'))
model.add(tf.keras.layers.MaxPooling2D(pool_size=(2,2),strides=2,padding = 'valid'))
model.add(tf.keras.layers.Dropout(0.6))

model.add(tf.keras.layers.Conv2D(64,(5,5),padding='valid'))
model.add(tf.keras.layers.Activation('relu'))
model.add(tf.keras.layers.Dropout(0.8))

model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(2,activation = 'softmax'))
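
The layer shapes and parameter counts of this network can be worked out by hand. The sketch below does the arithmetic for a 100x100x3 input (the `dim=100` used above) with 'valid' padding throughout; `conv_out` and `conv_params` are illustrative helpers, and the totals can be compared against the output of `model.summary()`.

```python
def conv_out(size, kernel, stride=1):
    """Spatial size after a 'valid' convolution or pooling layer."""
    return (size - kernel) // stride + 1


def conv_params(kernel, in_channels, out_channels):
    """Weights plus biases of a square-kernel Conv2D layer."""
    return kernel * kernel * in_channels * out_channels + out_channels


size, channels = 100, 3
total_params = 0
shapes = []

# (filters, followed_by_2x2_max_pool) for the three convolutional blocks.
for filters, pooled in [(16, True), (32, True), (64, False)]:
    total_params += conv_params(5, channels, filters)
    size, channels = conv_out(size, 5), filters
    if pooled:
        size = conv_out(size, 2, stride=2)
    shapes.append((size, size, channels))

# Flatten, then a Dense layer with 2 outputs (weights + biases).
flat_units = size * size * channels
total_params += flat_units * 2 + 2

print(shapes)        # [(48, 48, 16), (22, 22, 32), (18, 18, 64)]
print(flat_units)    # 20736
print(total_params)  # 106786
```

Dropout, activation, and pooling layers add no trainable parameters, so they do not appear in the total.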

In [None]:
model.summary()

In [None]:
# Model visualization
# The model is a tf.keras model, so use the tf.keras plotting utility
from tensorflow.keras.utils import plot_model
plot_model(model, to_file='model_plot.png', show_shapes=True, show_layer_names=True)

#### **Case 1: Using both data augmentation methods**



In [None]:
batch_size = 100
epochs = 50
optimizer = tf.keras.optimizers.RMSprop(learning_rate = 0.0001, decay = 1e-6)
model.compile(loss = 'binary_crossentropy',optimizer = optimizer, metrics = ['accuracy',keras_metrics.precision(), keras_metrics.recall()])

In [None]:
history = model.fit(X_train,y_train,batch_size=batch_size,epochs=epochs)
history

In [None]:
score = model.evaluate(X_test,y_test,verbose=0)
score
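
The precision and recall values reported alongside accuracy can be reproduced from raw predictions. The sketch below is a plain-Python illustration of the definitions, not the `keras_metrics` implementation; `precision_recall` and the toy labels are hypothetical.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy example: 3 leukemic (1) and 3 healthy (0) cells.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
precision, recall = precision_recall(y_true, y_pred)
```

In a diagnostic setting recall on the leukemic class matters most, since a false negative means a missed case.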

In [None]:
# Accuracy and loss curves
acc = history.history['acc']
loss = history.history['loss']

plt.figure(figsize=(8, 8))
plt.subplot(2, 1, 1)
plt.plot(acc, label='Training Accuracy')
plt.legend(loc='lower right')
plt.ylabel('Accuracy')
plt.ylim([min(plt.ylim()),1])
plt.title('Training Accuracy')

plt.subplot(2, 1, 2)
plt.plot(loss, label='Training Loss')
plt.legend(loc='upper right')
plt.ylabel('Cross Entropy')
plt.ylim([0,max(plt.ylim())])
plt.title('Training Loss')
plt.show()

#### **Case 2: Using only SinGAN**

In [None]:
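# Re-run the model definition cell and aug_mode('SinGAN') first, so that Case 2
# trains a fresh model on the SinGAN-only data.
batch_size = 32
epochs = 50
optimizer = tf.keras.optimizers.RMSprop(learning_rate = 0.0001, decay = 1e-6)
model.compile(loss = 'binary_crossentropy',optimizer = optimizer, metrics = ['accuracy',keras_metrics.precision(), keras_metrics.recall()])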

In [None]:
history = model.fit(X_train,y_train,batch_size=batch_size,epochs=epochs)
history

In [None]:
score = model.evaluate(X_test,y_test,verbose=0)
score

In [None]:
# Accuracy and loss plots
acc = history.history['acc']
loss = history.history['loss']

plt.figure(figsize=(8, 8))
plt.subplot(2, 1, 1)
plt.plot(acc, label='Training Accuracy')
plt.legend(loc='lower right')
plt.ylabel('Accuracy')
plt.ylim([min(plt.ylim()),1])
plt.title('Training Accuracy')

plt.subplot(2, 1, 2)
plt.plot(loss, label='Training Loss')
plt.legend(loc='upper right')
plt.ylabel('Cross Entropy')
plt.ylim([0,max(plt.ylim())])
plt.title('Training Loss')
plt.show()