<h2 align = "center"><font color='black'> <b>Histopathological Image Classification using Convolutional Neural Network</b></font></h2> <h4 align = "right"><font color='purple'> DSML@2020</font></h4>


## <font color='blue'> <b>**Introduction** </b></font>

Cancer begins when healthy cells in the breast change and grow out of control, forming a mass or sheet of cells called a tumor. A tumor can be cancerous or benign. A cancerous tumor is malignant, meaning it can grow and spread to other parts of the body. A benign tumor means the tumor can grow but will not spread.<a href="https://www.cancer.net/cancer-types/breast-cancer/introduction#:~:text=Cancer%20begins%20when%20healthy%20cells,grow%20but%20will%20not%20spread."> (Source, [cancer.net]) </a>



<table><tr>
<td> <img src="benign.png" alt="Benign Histopath" style="width: 150px;"/> </td>
<td> <img src="malignant.png" alt="malignant Histopath" style="width: 150px;"/> </td>
</tr><caption><b>Histopathological Image</b></caption>
</table>


## <font color='blue'><b>**The problem statement**</b></font>


The task is to identify whether a `640 x 470px` digital histopathology image shows a benign or a malignant tumor. One key challenge is the variability in the size and shape of the nuclei: traditional ML algorithms need extensive fine-tuning of hand-crafted features. It is also known that, as the amount of data increases, traditional ML algorithms fail to capture all the detail, which keeps them from inching closer to human accuracy.

In this lesson you will classify histopathological images using the <a href="https://web.inf.ufpr.br/vri/databases/breast-cancer-histopathological-database-breakhis/">BreakHis</a> dataset.
As part of the task, you need to process this data by analyzing the images in the two categories and then create a model that classifies these images as `benign` or `malignant`.

You'll follow these steps:

<ol>
    <font color='blue'>
  <li> Explore the Example Data of benign and malignant Histopathological images</li>
  <li>Build a model</li>
    <ol>
        <li> Approach 1: Shallow Network</li>
        <li> Approach 2: Convolutional Neural Network</li>
     </ol>
  <li>Training the above two approaches</li>
  <li>Evaluating the model accuracy</li>
        </font>
</ol>

The contents of the histopathological image directory are extracted to the base directory `/path_to_local-dir`, which holds `train`, `validation`, and `test` subdirectories (see the [Machine Learning Crash Course](https://developers.google.com/machine-learning/crash-course/validation/check-your-intuition) for a refresher on training, validation, and test sets). Each of these in turn contains `benign` and `malignant` subdirectories. Note that the split performed in Step 1 of this notebook uses only two ratios, so it produces just `train` and `val`.
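To make the expected layout concrete, here is a minimal sketch that counts the files in each split and class folder. The helper name `count_images_per_class` and the throwaway demo directory are illustrative assumptions, not part of the original notebook; point the helper at your own base directory to inspect the real data.

```python
import os
import tempfile


def count_images_per_class(base_dir):
    """Return {split: {class_name: file_count}} for a split/class/file layout."""
    counts = {}
    for split in sorted(os.listdir(base_dir)):
        split_path = os.path.join(base_dir, split)
        if not os.path.isdir(split_path):
            continue
        counts[split] = {}
        for class_name in sorted(os.listdir(split_path)):
            class_path = os.path.join(split_path, class_name)
            if os.path.isdir(class_path):
                counts[split][class_name] = len(os.listdir(class_path))
    return counts


# Build a tiny throwaway layout so the sketch runs without the real dataset.
with tempfile.TemporaryDirectory() as demo_dir:
    for split, n_files in (("train", 3), ("val", 1)):
        for class_name in ("benign", "malignant"):
            class_path = os.path.join(demo_dir, split, class_name)
            os.makedirs(class_path)
            for i in range(n_files):
                # Empty placeholder files stand in for real images.
                open(os.path.join(class_path, f"img_{i}.png"), "w").close()
    demo_counts = count_images_per_class(demo_dir)

print(demo_counts)
```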

In short: the training set is the data used to tell the neural network model 'this is what a benign image looks like', 'this is what a malignant image looks like', and so on. The validation set contains benign and malignant histopathological images that the neural network does not see during training, so you can test how well or how badly it does at deciding whether an image is benign or malignant.

One thing to pay attention to in this demo code: we do not explicitly label the images as benign or malignant. In the handwriting example (see the [Kaggle Data challenge on Digit Classification](https://www.kaggle.com/c/digit-recognizer/data)), the labels are given in a `.csv` file that says for each digit 'this is a 1', 'this is a 7', etc. Later in this demo exercise you'll see a class called `ImageDataGenerator` being used. It reads images from subdirectories and automatically labels them from the name of each subdirectory. So, for example, you will have a `train` directory containing a `benign` directory and a `malignant` one. `ImageDataGenerator` will label the images appropriately for you, saving a coding step.
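The idea of deriving labels from folder names can be sketched in plain Python. This is only an illustration of the principle, not the Keras implementation. It assumes class names are mapped to integers in sorted order, which is how `flow_from_directory` behaves when no explicit `classes` argument is given. The helper names and the example path are hypothetical.

```python
def class_indices_from_dirs(class_dir_names):
    """Map each class directory name to an integer label, in sorted order."""
    return {name: index for index, name in enumerate(sorted(class_dir_names))}


def label_for_path(image_path, class_indices):
    """Read the label from the name of the directory that holds the image."""
    parent_dir = image_path.replace("\\", "/").split("/")[-2]
    return class_indices[parent_dir]


class_indices = class_indices_from_dirs(["malignant", "benign"])
# Hypothetical file name, used only to show the lookup.
example_label = label_for_path("train/malignant/example_image.png", class_indices)

print(class_indices)   # benign gets 0, malignant gets 1
print(example_label)
```

With `class_mode='binary'`, the generator then yields these integer labels alongside each batch of images.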



**Let us first import the libraries needed for this experiment**

In [None]:
import os
import tensorflow as tf
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.preprocessing.image import ImageDataGenerator

### <font color='blue'><b>**STEP 1: Explore the Example Data**</b></font>


In [None]:
import splitfolders  # or import split_folders

# Split with a ratio.
# To only split into training and validation set, set a tuple to `ratio`, i.e, `(.8, .2)`.
input_folder = '/home/ulle/BreCan/data/CNN_400X/'
split_folder = '/home/ulle/BreCan/data/DataSplit/'
splitfolders.ratio(input_folder, output=split_folder, seed=1337, ratio=(.8, .2), group_prefix=None)  # 80% train, 20% val

**Try to walk the dir structure and get the images (Benign, malignant) from their respective subdirectories**

In [None]:
base_dir = '/home/ulle/BreCan/data/DataSplit/'

train_dir = os.path.join(base_dir, 'train')
validation_dir = os.path.join(base_dir, 'val')

# Directory with our training benign images
train_benign_dir = os.path.join(train_dir, 'benign')

# Directory with our training malignant images
train_malignant_dir = os.path.join(train_dir, 'malignant')

# Directory with our validation benign images
validation_benign_dir = os.path.join(validation_dir, 'benign')

# Directory with our validation malignant images
validation_malignant_dir = os.path.join(validation_dir, 'malignant')

Now, let's see what the filenames look like in the `benign` and `malignant` `train` directories (file naming conventions are the same in the `validation` directory):

In [None]:
train_benign_fnames = os.listdir( train_benign_dir )
train_malignant_fnames = os.listdir( train_malignant_dir )

print('Training images from Benign folder:\n\n',train_benign_fnames[:10])
print('\nTraining images from Malignant folder:\n\n',train_malignant_fnames[:10])

Let's find out the total number of benign and malignant images in the `train` and `validation` directories:

In [None]:
print('total training benign images :', len(os.listdir(train_benign_dir)))
print('total training malignant images :', len(os.listdir(train_malignant_dir)))

print('total validation benign images :', len(os.listdir(validation_benign_dir)))
print('total validation malignant images :', len(os.listdir(validation_malignant_dir)))

Now let's take a look at a few images to get a better sense of what the benign and malignant datasets look like. First, configure the matplotlib parameters:

In [None]:
%matplotlib inline

import matplotlib.image as mpimg
import matplotlib.pyplot as plt

# Parameters for our graph; we'll output images in a 4x4 configuration
nrows = 4
ncols = 4

pic_index = 0 # Index for iterating over images

Now, display a batch of 8 benign and 8 malignant images. You can rerun the cell to see a fresh batch each time:

In [None]:
# Set up matplotlib fig, and size it to fit 4x4 pics
fig = plt.gcf()
fig.set_size_inches(ncols*4, nrows*4)

pic_index+=8

next_benign_pix = [os.path.join(train_benign_dir, fname) 
                for fname in train_benign_fnames[ pic_index-8:pic_index] 
               ]

next_malignant_pix = [os.path.join(train_malignant_dir, fname) 
                for fname in train_malignant_fnames[ pic_index-8:pic_index]
               ]


for i, img_path in enumerate(next_benign_pix+next_malignant_pix):
  # Set up subplot; subplot indices start at 1
  sp = plt.subplot(nrows, ncols, i + 1)
  sp.axis('Off') # Don't show axes (or gridlines)

  img = mpimg.imread(img_path)
  plt.imshow(img)

plt.show()

### <font color='blue'><b>**STEP 2: Building the Model**</b></font>


In the previous section you saw that the images were in a variety of shapes and sizes and color. In order to train a neural network to handle them you'll need them to be in a uniform size. We've chosen 64 x 64 for this, and you'll see the code that preprocesses the images to that shape shortly. 

We will try two approaches in turn:
    
1) Using standard MLP - shallow network

2) Using CNN - CONV-POOL-CONV-POOL

## <font color='blue'><b>**Approach 1: Shallow Network**</b></font>


In [None]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(8,activation='relu',input_shape=(64,64,3)),
    tf.keras.layers.Flatten(), 
    tf.keras.layers.Dense(50, activation='relu'),     
    tf.keras.layers.Dense(1, activation='sigmoid')  
])

In [None]:
model.summary()

The "output shape" column shows how the size of your feature map evolves in each successive layer. 
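As a rough cross-check on what `model.summary()` reports for the shallow network above, the parameter counts can be worked out by hand. This sketch assumes standard Keras behaviour: a `Dense` layer applied to a `(64, 64, 3)` input acts on the last axis only, so the first layer sees 3 inputs per pixel. The helper name `dense_params` is illustrative.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units


# Dense(8) on a (64, 64, 3) input: applied per pixel across the 3 channels.
first_dense = dense_params(3, 8)

# Flatten turns the (64, 64, 8) output into one long vector.
flattened_size = 64 * 64 * 8

hidden_dense = dense_params(flattened_size, 50)
output_dense = dense_params(50, 1)

total_params = first_dense + hidden_dense + output_dense
print(first_dense, flattened_size, hidden_dense, output_dense, total_params)
```

Almost all of the parameters sit in the layer that follows `Flatten`, which is typical for dense networks applied to images.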

Next, we'll configure the specifications for model training. We will train our model with the `binary_crossentropy` loss, because it's a binary classification problem and our final activation is a sigmoid. (For a refresher on loss metrics, see the [Machine Learning Crash Course](https://developers.google.com/machine-learning/crash-course/descending-into-ml/video-lecture).) We will use the `rmsprop` optimizer with a learning rate of `0.001`. During training, we will want to monitor classification accuracy.
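For intuition, binary cross-entropy can be sketched in a few lines of plain Python. This is the textbook formula averaged over a batch, not the Keras implementation. The clipping constant `eps` is an assumption added here to avoid `log(0)`.

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -[t*log(p) + (1-t)*log(1-p)] over the batch."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep p away from exactly 0 or 1
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


# Labels: 1 = malignant, 0 = benign (illustrative values only).
confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])

print(confident_right, confident_wrong)
```

Confident correct predictions give a small loss, while confident wrong ones are penalised heavily, which is why this loss pairs naturally with a sigmoid output.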

**NOTE**: In this case, using the [RMSprop optimization algorithm](https://wikipedia.org/wiki/Stochastic_gradient_descent#RMSProp) is preferable to [stochastic gradient descent](https://developers.google.com/machine-learning/glossary/#SGD) (SGD), because RMSprop automates learning-rate tuning for us. (Other optimizers, such as [Adam](https://wikipedia.org/wiki/Stochastic_gradient_descent#Adam) and [Adagrad](https://developers.google.com/machine-learning/glossary/#AdaGrad), also automatically adapt the learning rate during training, and would work equally well here.)
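To illustrate what "adapting the learning rate" means, here is a minimal sketch of a single RMSprop update in its textbook form. The defaults `rho=0.9` and `eps=1e-7` are assumptions chosen for illustration, and the actual Keras optimizer may differ in details such as momentum or where epsilon is applied.

```python
import math


def rmsprop_step(weight, grad, avg_sq_grad, learning_rate=0.001, rho=0.9, eps=1e-7):
    """One textbook RMSprop update for a single scalar weight."""
    # Running average of squared gradients.
    avg_sq_grad = rho * avg_sq_grad + (1.0 - rho) * grad ** 2
    # The step is scaled by the root of that average.
    weight = weight - learning_rate * grad / (math.sqrt(avg_sq_grad) + eps)
    return weight, avg_sq_grad


# Two weights with very different gradient magnitudes.
w_large, _ = rmsprop_step(weight=1.0, grad=100.0, avg_sq_grad=0.0)
w_small, _ = rmsprop_step(weight=1.0, grad=0.01, avg_sq_grad=0.0)

step_large = 1.0 - w_large
step_small = 1.0 - w_small
print(step_large, step_small)
```

Both steps come out at nearly the same size even though the gradients differ by four orders of magnitude. That per-parameter normalisation is the property the note above refers to.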

In [None]:
from tensorflow.keras.optimizers import RMSprop

model.compile(optimizer=RMSprop(learning_rate=0.001),
              loss='binary_crossentropy',
              metrics = ['accuracy'])

**Data ImageGenerator**

Let's set up data generators that will read images in our source folders, convert them to `float32` tensors, and feed them (with their labels) to our network. We'll have one generator for the training images and one for the validation images. Our generators will yield batches of 20 images of size 64 x 64 and their labels (binary).

As you may already know, data that goes into neural networks should usually be normalized in some way to make it more amenable to processing by the network. (It is uncommon to feed raw pixels into a convnet or a shallow dense network.) In our case, we will preprocess our images by normalizing the pixel values to be in the `[0, 1]` range (originally all values are in the `[0, 255]` range).
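The normalization itself is a single multiplication per pixel. A minimal sketch in plain Python, with made-up pixel values and a hypothetical helper name:

```python
RESCALE = 1.0 / 255.0


def rescale_pixels(pixels, factor=RESCALE):
    """Multiply every raw pixel value by the rescale factor."""
    return [value * factor for value in pixels]


raw_pixels = [0, 64, 128, 255]          # illustrative 8-bit intensities
scaled_pixels = rescale_pixels(raw_pixels)

print(scaled_pixels)
```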

In Keras this can be done via the `keras.preprocessing.image.ImageDataGenerator` class using the `rescale` parameter. This `ImageDataGenerator` class allows you to instantiate generators of augmented image batches (and their labels) via `.flow(data, labels)` or `.flow_from_directory(directory)`. These generators can then be passed directly to the Keras model methods `fit`, `evaluate`, and `predict` (the older `fit_generator`, `evaluate_generator`, and `predict_generator` methods are deprecated in TensorFlow 2).

In [None]:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# All images will be rescaled by 1./255.
train_datagen = ImageDataGenerator( rescale = 1.0/255. )
test_datagen  = ImageDataGenerator( rescale = 1.0/255. )

# --------------------
# Flow training images in batches of 20 using train_datagen generator
# --------------------
train_generator = train_datagen.flow_from_directory(train_dir,
                                                    batch_size=20,
                                                    class_mode='binary',
                                                    target_size=(64, 64))     
# --------------------
# Flow validation images in batches of 20 using test_datagen generator
# --------------------
validation_generator =  test_datagen.flow_from_directory(validation_dir,
                                                         batch_size=20,
                                                         class_mode  = 'binary',
                                                         target_size = (64, 64))

### <font color='blue'><b>**STEP 3: The training process**</b></font>


Let's train on all 940 images available, for 5 epochs, and validate on all 236 validation images. (This may take a few minutes to run.)
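As a quick sanity check, the number of batches needed for one full pass over each set follows from the image counts quoted above and the batch size of 20 used by the generators. The helper name `batches_per_pass` is illustrative.

```python
import math


def batches_per_pass(n_images, batch_size):
    """Number of batches needed to see every image once (last batch may be partial)."""
    return math.ceil(n_images / batch_size)


train_steps = batches_per_pass(940, 20)
val_steps = batches_per_pass(236, 20)

print(train_steps, val_steps)
```

Asking Keras for more steps per epoch than this can make a directory-based generator run out of data partway through an epoch.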

Do note the values per epoch.

You'll see 4 values per epoch -- Loss, Accuracy, Validation Loss and Validation Accuracy. 

The Loss and Accuracy are a good indication of training progress. The model makes a guess as to the classification of each training image, and that guess is measured against the known label to compute the loss. Accuracy is the proportion of correct guesses. The Validation Accuracy is the same measurement on data that has not been used in training.
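Accuracy for a sigmoid output can be sketched as follows. The sketch assumes that a probability above 0.5 counts as the positive class, which matches the usual default for binary accuracy. The example labels and probabilities are made up.

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    predictions = [1 if p > threshold else 0 for p in y_prob]
    correct = sum(1 for t, p in zip(y_true, predictions) if t == p)
    return correct / len(y_true)


# Four illustrative samples: the third one is misclassified.
demo_accuracy = binary_accuracy([0, 1, 1, 0], [0.2, 0.8, 0.4, 0.1])
print(demo_accuracy)
```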


In [None]:
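history = model.fit(train_generator,
                    validation_data=validation_generator,
                    steps_per_epoch=len(train_generator),        # one full pass over the training images
                    epochs=5,
                    validation_steps=len(validation_generator),  # one full pass over the validation images
                    verbose=1)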

### <font color='blue'><b>**STEP 4: Evaluating Accuracy and Loss for the Model**</b></font>


Let's plot the training/validation accuracy and loss as collected during training:

In [None]:
import matplotlib.pyplot as plt
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(len(acc))

plt.plot(epochs, acc, 'r-', label='Training accuracy')
plt.plot(epochs, val_acc, 'b', label='Validation accuracy')
plt.title('Training and validation accuracy')
plt.legend()
plt.figure()

plt.plot(epochs, loss, 'r-', label='Training Loss')
plt.plot(epochs, val_loss, 'b', label='Validation Loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()

## <font color='blue'><b>**Approach 2: The Convolutional Neural Network**</b></font>



### <font color='blue'><b>**STEP 2: Building the Model**</b></font>


In [None]:
model = tf.keras.models.Sequential([
        tf.keras.layers.Conv2D(16, (3,3), activation='relu', input_shape=(64, 64, 3)),
        tf.keras.layers.MaxPooling2D(2,2),  
        tf.keras.layers.Conv2D(32, (3,3), activation='relu'),  # input shape is inferred from the previous layer
        tf.keras.layers.MaxPooling2D(2,2),        
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(100, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid')                                 
    
])

In [None]:
model.summary()

The "output shape" column shows how the size of your feature map evolves in each successive layer. The convolution layers reduce the size of the feature maps slightly because no padding is applied (Keras defaults to `padding='valid'`), and each pooling layer halves the dimensions.
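The shape and parameter arithmetic for the CNN above can be sketched by hand. This assumes the Keras defaults for `Conv2D` (stride 1, no padding) and for `MaxPooling2D` (stride equal to the pool size, rounding down). The helper names are illustrative.

```python
def conv_output_size(size, kernel=3):
    """Spatial size after an unpadded, stride-1 convolution."""
    return size - kernel + 1


def pool_output_size(size, pool=2):
    """Spatial size after non-overlapping max pooling (rounds down)."""
    return size // pool


def conv_params(kernel, in_channels, filters):
    """Weights plus biases of a square-kernel convolution layer."""
    return kernel * kernel * in_channels * filters + filters


size = 64
size = conv_output_size(size)   # after Conv2D(16)
size = pool_output_size(size)   # after MaxPooling2D
size = conv_output_size(size)   # after Conv2D(32)
final_size = pool_output_size(size)

flattened_size = final_size * final_size * 32
total_params = (
    conv_params(3, 3, 16)
    + conv_params(3, 16, 32)
    + (flattened_size * 100 + 100)   # Dense(100)
    + (100 * 1 + 1)                  # Dense(1)
)

print(final_size, flattened_size, total_params)
```

The feature map shrinks from 64 to 62, 31, 29, and finally 14 pixels per side. The CNN ends up with well under half the parameters of the shallow dense network.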

Next, we'll configure the specifications for model training. We will train our model with the `binary_crossentropy` loss, because it's a binary classification problem and our final activation is a sigmoid. (For a refresher on loss metrics, see the [Machine Learning Crash Course](https://developers.google.com/machine-learning/crash-course/descending-into-ml/video-lecture).) We will use the `rmsprop` optimizer with a learning rate of `0.001`. During training, we will want to monitor classification accuracy.

**NOTE**: In this case, using the [RMSprop optimization algorithm](https://wikipedia.org/wiki/Stochastic_gradient_descent#RMSProp) is preferable to [stochastic gradient descent](https://developers.google.com/machine-learning/glossary/#SGD) (SGD), because RMSprop automates learning-rate tuning for us. (Other optimizers, such as [Adam](https://wikipedia.org/wiki/Stochastic_gradient_descent#Adam) and [Adagrad](https://developers.google.com/machine-learning/glossary/#AdaGrad), also automatically adapt the learning rate during training, and would work equally well here.)

In [None]:
from tensorflow.keras.optimizers import RMSprop

model.compile(optimizer=RMSprop(learning_rate=0.001),
              loss='binary_crossentropy',
              metrics = ['accuracy'])

In [None]:
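history = model.fit(train_generator,
                    validation_data=validation_generator,
                    steps_per_epoch=len(train_generator),        # one full pass over the training images
                    epochs=5,
                    validation_steps=len(validation_generator),  # one full pass over the validation images
                    verbose=1)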

### <font color='blue'><b>**STEP 4: Evaluating Accuracy and Loss for the Model**</b></font>


Let's plot the training/validation accuracy and loss as collected during training:

In [None]:
import matplotlib.pyplot as plt
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(len(acc))

plt.plot(epochs, acc, 'r-', label='Training accuracy')
plt.plot(epochs, val_acc, 'b', label='Validation accuracy')
plt.title('Training and validation accuracy')
plt.legend()
plt.figure()

plt.plot(epochs, loss, 'r-', label='Training Loss')
plt.plot(epochs, val_loss, 'b', label='Validation Loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()

**Congratulations!!!** on coming this far. You now have a clear understanding of how to classify medical images using `Convolutional Neural Networks`.