<a href="https://colab.research.google.com/github/adiojha629/TEWH_Malaria_Adi_Files/blob/master/Dropout%2C_Dense%2C_and_Activation_Experiments.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Model Evaluation Framework
Here we set up a framework for running model experiments.

## Import Raw Dataset
On the right-hand side of this page, you will see a folders symbol. If you click on it and then click on "refresh", you should be able to see the files on this Google Colab document. 


Here we download the NIH malaria dataset with the shell command ```!wget```. Then we extract the images into the Colab working directory ```/content```, where they end up in the folder ```/content/cell_images```. It is not too important to understand this code chunk, since it will remain virtually unchanged for the entire project.
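
As a rough illustration of the extract-only-once step described above, the sketch below wraps it in a small function. The name ```extract_if_missing``` and its parameters are hypothetical and do not appear in the notebook; only the Python standard library is used.

```python
import os
from zipfile import ZipFile


def extract_if_missing(zip_path, target_dir, expected_subdir):
    """Extract zip_path into target_dir unless expected_subdir already exists there.

    Returns True if the archive was extracted, False if extraction was skipped.
    """
    if os.path.isdir(os.path.join(target_dir, expected_subdir)):
        return False
    with ZipFile(zip_path, "r") as zip_obj:
        zip_obj.extractall(target_dir)
    return True
```

In this notebook the call would look like ```extract_if_missing("/content/cell_images.zip", "/content", "cell_images")```, assuming the zip file was downloaded into ```/content```.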

In [2]:
# Import relevant packages for image file importation 
import numpy as np
import os
from shutil import copyfile
from zipfile import ZipFile

# Here we download the NIH dataset as a zip file
!wget -nc ftp://lhcftp.nlm.nih.gov/Open-Access-Datasets/Malaria/cell_images.zip

# Extract images if not already extracted
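ROOT_DIR = os.path.join("/", "content")
# Check and extract relative to ROOT_DIR so the paths do not depend on the current directory
if not os.path.isdir(os.path.join(ROOT_DIR, "cell_images")):
    print("Extracting images...")
    with ZipFile(os.path.join(ROOT_DIR, "cell_images.zip"), "r") as zipObj:
        zipObj.extractall(ROOT_DIR)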
    print("Done!")

--2020-08-04 22:04:34--  ftp://lhcftp.nlm.nih.gov/Open-Access-Datasets/Malaria/cell_images.zip
           => ‘cell_images.zip’
Resolving lhcftp.nlm.nih.gov (lhcftp.nlm.nih.gov)... 130.14.55.35, 2607:f220:41e:7055::35
Connecting to lhcftp.nlm.nih.gov (lhcftp.nlm.nih.gov)|130.14.55.35|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /Open-Access-Datasets/Malaria ... done.
==> SIZE cell_images.zip ... 353452851
==> PASV ... done.    ==> RETR cell_images.zip ... done.
Length: 353452851 (337M) (unauthoritative)


2020-08-04 22:09:03 (1.26 MB/s) - ‘cell_images.zip’ saved [353452851]

Extracting images...
Done!


## Load and Resize Images into NumPy Arrays
Here we create a new folder called ```RescaledSet``` for the resized images of our parasitized and uninfected cells.


We rescale all of the images to 128x128 pixels with RGB channels. 


The parasitized images will be stored in the NumPy array called ```Parasitized```, while the uninfected images will be stored in the NumPy array called ```Uninfected```.

For testing purposes, we only rescale 500 images for each class.
When the program is run on Google Cloud, we will rescale all the images.
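
To get a feel for what moving from 500 images per class to the full set means for memory, here is a small sketch that computes the size of the image arrays. It assumes the arrays are allocated with ```np.empty``` and its default ```float64``` dtype, and it takes the full per-class count of 13779 from the notebook's commented-out Google Cloud code. The helper ```array_bytes``` is hypothetical. The ```uint8``` line is only a possible alternative, not something the notebook does.

```python
import numpy as np


def array_bytes(n_images, side=128, channels=3, dtype=np.float64):
    """Bytes needed for an (n_images, side, side, channels) array of the given dtype."""
    return n_images * side * side * channels * np.dtype(dtype).itemsize


small_f64 = array_bytes(500)                   # test subset, one class
full_f64 = array_bytes(13779)                  # full set, one class, float64
full_u8 = array_bytes(13779, dtype=np.uint8)   # same images stored as 8-bit integers

print(f"500 images, float64:   {small_f64 / 2**20:.1f} MiB")
print(f"13779 images, float64: {full_f64 / 2**30:.2f} GiB")
print(f"13779 images, uint8:   {full_u8 / 2**30:.2f} GiB")
```

Each class needs this much memory, so the full dataset takes twice the per-class figure before any training starts.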

In [3]:
# Create new folders to save rescaled images
# (exist_ok=True makes this safe to re-run; parent folders are created as needed)
os.makedirs("/content/RescaledSet/Parasitized", exist_ok=True)
os.makedirs("/content/RescaledSet/Uninfected", exist_ok=True)

# Generate lists of parasitized and uninfected file names
import os
ParasitizedFiles = os.listdir("/content/cell_images/Parasitized/")
UninfectedFiles = os.listdir("/content/cell_images/Uninfected/")

import numpy as np
import cv2

# Short-version for testing
Parasitized = np.empty([500,128,128,3])
Uninfected = np.empty([500,128,128,3])

for i in range(0,500):
  # Read image as np.array (cv2 loads the channels in BGR order)
  TempImage = cv2.imread('/content/cell_images/Parasitized/'+ParasitizedFiles[i])
  # Resize image to 128x128 pixels
  ResizedImage = cv2.resize(TempImage, dsize=(128,128))
  # Store resized image in the Parasitized array
  Parasitized[i,:,:,:] = ResizedImage

for i in range(0,500):
  # Read image as np.array (cv2 loads the channels in BGR order)
  TempImage = cv2.imread('/content/cell_images/Uninfected/'+UninfectedFiles[i])
  # Resize image to 128x128 pixels
  ResizedImage = cv2.resize(TempImage, dsize=(128,128))
  # Store resized image in the Uninfected array
  Uninfected[i,:,:,:] = ResizedImage

"""The code below will be run on Google Cloud. The code rescales all the images"""
#Parasitized = np.empty([13779,128,128,3])
#Uninfected = np.empty([13779,128,128,3])

# May have to skip the Thumbs.db files (i = 10394 for parasitized and i = 10392 for uninfected)
#for i in range(0,10394):
#  # Import image as np.array
#  TempImage = cv2.imread('/content/cell_images/Parasitized/'+ParasitizedFiles[i])
#  # Resize image to 128x128 pixels
#  ResizedImage = cv2.resize(TempImage, dsize=(128,128))
#  # Save image in folder 
#  Parasitized[i,:,:,:] = ResizedImage

#for i in range(10395,13780):
#  # Import image as np.array
#  TempImage = cv2.imread('/content/cell_images/Parasitized/'+ParasitizedFiles[i])
#  # Resize image to 128x128 pixels
#  ResizedImage = cv2.resize(TempImage, dsize=(128,128))
#  # Save image in folder 
#  Parasitized[i-1,:,:,:] = ResizedImage

#for i in range(0,10392):
#  # Import image as np.array
#  TempImage = cv2.imread('/content/cell_images/Uninfected/'+UninfectedFiles[i])
#  # Resize image to 128x128 pixels
#  ResizedImage = cv2.resize(TempImage, dsize=(128,128))
#  # Save image in folder 
#  Uninfected[i,:,:,:] = ResizedImage

#for i in range(10393,13780):
#  # Import image as np.array
#  TempImage = cv2.imread('/content/cell_images/Uninfected/'+UninfectedFiles[i])
#  # Resize image to 128x128 pixels
#  ResizedImage = cv2.resize(TempImage, dsize=(128,128))
#  # Save image in folder 
#  Uninfected[i-1,:,:,:] = ResizedImage


## Split Data into Five Groups for Cross-Validation
We will be using ```k=5``` cross-validation groups.


In [4]:
# Number of cross-validation groups 
k = 5

# Generate dataset labels
ParasitizedLabels = np.repeat([[0,1]], 500, axis=0)
UninfectedLabels = np.repeat([[1,0]], 500, axis=0)
Labels = np.concatenate((ParasitizedLabels,UninfectedLabels), axis=0)

# Generate image dataset
Dataset = np.concatenate((Parasitized, Uninfected), axis=0)

# Generate cross-validation groups
# CVIndices is a random permutation of the integers 0 to n-1, where n is the number of images in our dataset
CVIndices = np.random.permutation(Dataset.shape[0])
# IndexN holds the indices of the images in Dataset that go into the N-th cross-validation group
Index1, Index2, Index3, Index4, Index5 = CVIndices[:200], CVIndices[200:400], CVIndices[400:600], CVIndices[600:800], CVIndices[800:]
# Using the index arrays above, we select the images and labels for each cross-validation group
Images1, Images2, Images3, Images4, Images5 = Dataset[Index1,:], Dataset[Index2,:], Dataset[Index3,:], Dataset[Index4,:], Dataset[Index5,:]
Labels1, Labels2, Labels3, Labels4, Labels5 = Labels[Index1,:], Labels[Index2,:], Labels[Index3,:], Labels[Index4,:], Labels[Index5,:]
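
The same split can be written for any number of groups. The sketch below is one possible generalization, not the notebook's own code: the helper names ```make_folds``` and ```train_test_indices``` are hypothetical, and it uses ```np.random.default_rng``` with a seed (instead of ```np.random.permutation```) so that the split is reproducible.

```python
import numpy as np


def make_folds(n_samples, k, seed=None):
    """Return a list of k index arrays that together partition range(n_samples)."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), k)


def train_test_indices(folds, test_fold):
    """Training indices (all folds except test_fold) and test indices (the held-out fold)."""
    train = np.concatenate([f for j, f in enumerate(folds) if j != test_fold])
    return train, folds[test_fold]


folds = make_folds(1000, 5, seed=0)
train_idx, test_idx = train_test_indices(folds, test_fold=4)
print(len(train_idx), len(test_idx))  # 800 200
```

With these index arrays, ```Dataset[train_idx]``` and ```Labels[train_idx]``` would give the training set for one cross-validation round, which avoids writing out one branch per held-out group.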

## Train Model with Different Hyperparameters and Cross-Validation Groups
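
The experiment in this section tries every combination of dropout rate, dense layer size, and activation function, and trains each combination on two cross-validation rounds. As a minimal sketch of how large that grid is and in what order the runs happen, the snippet below enumerates it with ```itertools.product```. The variable names here are illustrative; the value lists are the ones used in the training cell.

```python
from itertools import product

dropout_rates = [0.25, 0.50, 0.75]
dense_sizes = [128, 256, 512, 1024]
activations = ["relu", "tanh"]
n_folds_run = 2  # only two of the five cross-validation groups are used as test sets

# product() varies the last argument fastest, matching four nested for-loops
runs = list(product(dropout_rates, dense_sizes, activations, range(n_folds_run)))

print(len(runs))  # 48
print(runs[0])    # (0.25, 128, 'relu', 0)
print(runs[-1])   # (0.75, 1024, 'tanh', 1)
```

Each of these 48 runs rebuilds the VGG19-based model from scratch, so the grid size directly multiplies the total training time.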


In [5]:
# Libraries needed
import sys
if 'tensorflow' not in sys.modules:
  %tensorflow_version 2.x
  import tensorflow as tf
import keras
from keras import applications
from keras.preprocessing.image import ImageDataGenerator
from keras import optimizers
from keras.models import Sequential, Model
from keras.layers import Dropout, Flatten, Dense, GlobalAveragePooling2D, BatchNormalization
from keras import backend as K # capital K so the backend does not overwrite k, the number of cross-validation groups
from keras.callbacks import ModelCheckpoint, LearningRateScheduler, TensorBoard, EarlyStopping

import warnings
warnings.filterwarnings('ignore')

List_Dropout_Rates = [0.25,0.50,0.75] #all the drop out rates used
List_Dense = [128,256,512,1024] #number of neurons in Dense layer
List_Activation = ["relu","tanh"] #the different activation functions used

results = []# This list will contain the Loss, Accuracy, Validation Loss etc. for each dropout rate, size of dense layers and activation function

for rate in List_Dropout_Rates:
  for dense_num in List_Dense:
    for acti in List_Activation:
      for i in range(2):#only run for 2 cross validation sets
        # Create the appropriate training and testing sets
        if i == 0:
          TrainImages = np.concatenate((Images1,Images2,Images3,Images4), axis=0)
          TrainLabels = np.concatenate((Labels1,Labels2,Labels3,Labels4), axis=0)
          TestImages = Images5
          TestLabels = Labels5
        elif i == 1:
          TrainImages = np.concatenate((Images1,Images2,Images3,Images5), axis=0)
          TrainLabels = np.concatenate((Labels1,Labels2,Labels3,Labels5), axis=0)
          TestImages = Images4
          TestLabels = Labels4
        elif i == 2:
          TrainImages = np.concatenate((Images1,Images2,Images4,Images5), axis=0)
          TrainLabels = np.concatenate((Labels1,Labels2,Labels4,Labels5), axis=0)
          TestImages = Images3
          TestLabels = Labels3
        elif i == 3:
          TrainImages = np.concatenate((Images1,Images3,Images4,Images5), axis=0)
          TrainLabels = np.concatenate((Labels1,Labels3,Labels4,Labels5), axis=0)
          TestImages = Images2
          TestLabels = Labels2
        else:
          TrainImages = np.concatenate((Images2,Images3,Images4,Images5), axis=0)
          TrainLabels = np.concatenate((Labels2,Labels3,Labels4,Labels5), axis=0)
          TestImages = Images1
          TestLabels = Labels1

        # Rebuild and recompile the model (needed on every iteration to reset the model weights)
        # The current dropout rate, dense layer size, and activation function are used
        base_model = applications.VGG19(weights = "imagenet", include_top=False, input_shape = (128,128,3))
        x = base_model.output
        x = Flatten()(x)
        x = Dense(dense_num, activation=acti)(x)
        x = Dropout(rate)(x)
        x = Dense(dense_num, activation=acti)(x)
        x = Dropout(rate)(x)
        predictions = Dense(2, activation="softmax")(x) #Note that the predictions activation function is not varied across trials
        model = Model(inputs = base_model.input, outputs = predictions)
        adam = optimizers.Adam(lr=0.00001, beta_1=0.9, beta_2=0.999, amsgrad=False)
        model.compile(loss = "categorical_crossentropy", optimizer = adam, metrics=["accuracy"])

        # Train model and evaluate performance
        print('We are now training cross-validation set #',i+1)
        Results = model.fit(TrainImages, TrainLabels, epochs=3, batch_size=64, validation_data=(TestImages,TestLabels), validation_freq=1)

        # Display and store performance results
        print("The Rate is " + str(rate) +", the dense number is " + str(dense_num) + ", the activation "+ str(acti))
        Results.history['loss'] = [round(j, 4) for j in Results.history['loss']]
        Results.history['accuracy'] = [round(j, 4) for j in Results.history['accuracy']]
        Results.history['val_loss'] = [round(j, 4) for j in Results.history['val_loss']]
        Results.history['val_accuracy'] = [round(j, 4) for j in Results.history['val_accuracy']]
        print('Training Loss:',Results.history['loss'])
        print('Training Accuracy:',Results.history['accuracy'])
        print('Validation Loss:',Results.history['val_loss'])
        print('Validation Accuracy:',Results.history['val_accuracy'])
        print('')
        results.append([Results.history['loss'],Results.history['accuracy'],Results.history['val_loss'],Results.history['val_accuracy']])#Update the results list



Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/vgg19/vgg19_weights_tf_dim_ordering_tf_kernels_notop.h5
We are now training cross-validation set # 1
Epoch 1/3
Epoch 2/3
Epoch 3/3
The Rate is 0.25, the dense number is 128, the activation relu
Training Loss: [2.2755, 0.6727, 0.4163]
Training Accuracy: [0.6425, 0.7425, 0.8363]
Validation Loss: [0.4831, 0.5257, 0.3295]
Validation Accuracy: [0.805, 0.8, 0.85]

We are now training cross-validation set # 2
Epoch 1/3
Epoch 2/3
Epoch 3/3
The Rate is 0.25, the dense number is 128, the activation relu
Training Loss: [2.3454, 0.857, 0.496]
Training Accuracy: [0.6413, 0.7175, 0.7775]
Validation Loss: [0.9228, 0.5766, 0.4942]
Validation Accuracy: [0.695, 0.755, 0.785]

We are now training cross-validation set # 1
Epoch 1/3
Epoch 2/3
Epoch 3/3
The Rate is 0.25, the dense number is 128, the activation tanh
Training Loss: [0.7526, 0.486, 0.3749]
Training Accuracy: [0.6025, 0.7613, 0.8225]
Validation Loss: [0.5431, 0.4

In [2]:
#This code lets you look up the training and validation accuracy and losses for a given dropout rate, dense layer size, and activation function

#Libraries needed
import matplotlib.pyplot as plt

List_Dropout_Rates = [0.25,0.50,0.75] #all the drop out rates used
List_Dense = [128,256,512,1024] #number of neurons in Dense layer
List_Activation = ["relu","tanh"] #the different activation functions used
results_dict = dict()
i = 0
for rate in List_Dropout_Rates:
  results_dict[rate] = dict()
  for dense_num in List_Dense:
    results_dict[rate][dense_num] = dict()
    for acti in List_Activation:
      results_dict[rate][dense_num][acti] = []
      for _ in range(2):
        results_dict[rate][dense_num][acti].append(results[i])
        i+=1



NameError: ignored

In [1]:
cont = True
while cont:
  #now ask the user:
  drop = float(input("What Dropout level would you like? Options are: "+ str(List_Dropout_Rates)))
  dense = int(input("What Dense size would you like? Options are: "+ str(List_Dense)))
  activ = input("What activation function would you like? Options are: "+ str(List_Activation))
  print("Fetching the data you requested")

  ##code for graphs
  fig, ((ax1,ax2),(ax3,ax4)) = plt.subplots(2,2)
  fig.set_figheight(10)
  fig.set_figwidth(10)
  fig.subplots_adjust(hspace=.4, wspace=.4)
  for validation in range(2):
    print("Results from Validation Test #",validation+1)
    loss = results_dict[drop][dense][activ][validation][0]
    acc = results_dict[drop][dense][activ][validation][1]
    val_loss = results_dict[drop][dense][activ][validation][2]
    val_acc = results_dict[drop][dense][activ][validation][3]
    print("Training Loss was ",loss)
    print("Training Accuracy was ",acc)
    print("Validation Loss was ",val_loss)
    print("Validation Accuracy was ",val_acc)
    print()
    ax1.plot(loss,label = "Training Loss from Validation # " + str(validation+1))
    ax2.plot(acc,label = "Training Accuracy from Validation # " + str(validation+1))
    ax3.plot(val_loss,label = "Validation Loss from Validation # " + str(validation+1))
    ax4.plot(val_acc,label = "Validation Accuracy from Validation # " + str(validation+1))
  ax1.legend(loc="upper center")
  ax2.legend(loc="upper center")
  ax3.legend(loc="upper center")
  ax4.legend(loc="upper center")
  plt.show() # draw the figure before asking the user again
  print("\n\n")
  user = input("Would you like to continue? (y/n)")
  cont = user.lower() == "y"

NameError: ignored

In [12]:
print(results_dict[.75][1024]['tanh'][1])

[[1.3331, 1.3487, 1.2702], [0.5013, 0.525, 0.5638], [0.6232, 0.5874, 0.5826], [0.665, 0.7, 0.655]]
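
Each stored entry holds four lists in the order training loss, training accuracy, validation loss, validation accuracy, with one value per epoch. As a sketch of how a configuration could be summarized in a single number, the snippet below averages the final-epoch validation accuracy over the cross-validation rounds. The helper ```mean_final_val_accuracy``` is hypothetical. The numbers are copied from the two rounds printed above for dropout 0.25, dense size 128, and ```relu```.

```python
import numpy as np


def mean_final_val_accuracy(fold_results):
    """Average the last-epoch validation accuracy over all cross-validation rounds.

    fold_results is a list with one entry per round, each entry being
    [loss, accuracy, val_loss, val_accuracy] with one value per epoch.
    """
    return float(np.mean([fold[3][-1] for fold in fold_results]))


# Values taken from the training output for rate=0.25, dense=128, activation="relu"
relu_128_025 = [
    [[2.2755, 0.6727, 0.4163], [0.6425, 0.7425, 0.8363],
     [0.4831, 0.5257, 0.3295], [0.805, 0.8, 0.85]],
    [[2.3454, 0.857, 0.496], [0.6413, 0.7175, 0.7775],
     [0.9228, 0.5766, 0.4942], [0.695, 0.755, 0.785]],
]

print(round(mean_final_val_accuracy(relu_128_025), 4))  # 0.8175
```

With only two rounds and three epochs per run, such averages are noisy, so small differences between configurations should be read with caution.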
