<a href="https://colab.research.google.com/github/aiexplorations/deeplearning/blob/master/Geological_Image_Similarity_ImageGenerator_and_other_tweaks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Classifying Geologically Similar Images with Neural Networks

This notebook explores how the "Geological similarity" dataset can be used for a multi-class classification problem. We download and prepare the data, then build a model to solve the 6-way classification task of differentiating between different kinds of rock.

Specifically, the objective of this notebook is to demonstrate the `ImageDataGenerator` functionality within `tensorflow.keras`, and its benefit when dealing with large, clearly labelled datasets stored as image files on disk. We also demonstrate other ways of tweaking deep neural networks, specifically the addition of convolutional and pooling layers.

## Data

Some notes about the data:
1. The images here are 28x28 RGB colour images.
2. There are six classes of images: andesite, gneiss, marble, quartzite, rhyolite, and schist.

## Experiments

Some notes about the model and the experiments:
1. Tried building a simple CNN with a single `Conv2D` layer, which didn't perform as well.
2. Subsequent iterations increased the number of `Conv2D` layers, each paired with a `MaxPooling2D` layer.
3. Simpler models needed a high number of epochs, of the order of 100, to reach ~99% training accuracy.
4. More complex models resolved this, reaching 99% training accuracy in 60-odd epochs.


## Techniques Demonstrated

1. `ImageDataGenerator` - a built-in Keras class that speeds up the creation of labelled data: point it at a folder on the file system, and the subfolder names become the label names. A very handy tool.
2. Early stopping using an accuracy callback, which halts training once a target accuracy is reached and makes retraining the model easier.

In [0]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf

We pull in the geological similarity data in the cell below and, in subsequent cells, store the data in a local folder in the environment/container.

In [0]:
dataset_url = "http://aws-proserve-data-science.s3.amazonaws.com/geological_similarity.zip"

In [3]:
!wget --no-check-certificate \
    http://aws-proserve-data-science.s3.amazonaws.com/geological_similarity.zip \
    -O /tmp/geological_similarity.zip

--2020-06-02 11:46:47--  http://aws-proserve-data-science.s3.amazonaws.com/geological_similarity.zip
Resolving aws-proserve-data-science.s3.amazonaws.com (aws-proserve-data-science.s3.amazonaws.com)... 52.218.248.58
Connecting to aws-proserve-data-science.s3.amazonaws.com (aws-proserve-data-science.s3.amazonaws.com)|52.218.248.58|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 35312590 (34M) [application/zip]
Saving to: ‘/tmp/geological_similarity.zip’


2020-06-02 11:46:48 (28.2 MB/s) - ‘/tmp/geological_similarity.zip’ saved [35312590/35312590]



In [0]:
import os
import zipfile

# Extract the downloaded archive into /tmp/; the context manager closes the file for us
local_zip = '/tmp/geological_similarity.zip'
with zipfile.ZipFile(local_zip, 'r') as zip_ref:
    zip_ref.extractall('/tmp/')

In [5]:
! ls /tmp/geological_similarity/

andesite  gneiss  marble  quartzite  rhyolite  schist


## Defining the Model

Here, we've defined a `Sequential` model in Keras of relatively high complexity, compared to the simple models typically used for the Fashion MNIST classification task. There are three `Conv2D` layers, each followed by a pooling layer. The third `Conv2D` layer uses a smaller kernel (2x2) and fewer filters than the earlier ones.

The flattened results are then passed to a fully connected (`Dense`) layer, which feeds a softmax output layer that performs the multi-class classification.

In [0]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(256, (3,3), activation='relu', input_shape=(28, 28, 3)),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Conv2D(128, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),
    tf.keras.layers.Conv2D(64, (2,2), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(6, activation='softmax')
])

In [7]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d (Conv2D)              (None, 26, 26, 256)       7168      
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 13, 13, 256)       0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 11, 11, 128)       295040    
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 5, 5, 128)         0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 4, 4, 64)          32832     
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 2, 2, 64)          0         
_________________________________________________________________
flatten (Flatten)            (None, 256)               0
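
The output shapes and parameter counts in the summary above can be checked by hand. The sketch below is a plain-Python calculation, not part of the original notebook; it assumes the Keras defaults of stride 1 and `valid` padding for `Conv2D`, and non-overlapping 2x2 pooling. The helper names are made up for illustration.

```python
def conv2d_output_size(size, kernel):
    """Spatial size after a 'valid' convolution with stride 1."""
    return size - kernel + 1


def pool_output_size(size, pool=2):
    """Spatial size after non-overlapping max pooling (floor division)."""
    return size // pool


def conv2d_params(kernel, in_channels, filters):
    """Weights (kernel * kernel * in_channels per filter) plus one bias per filter."""
    return (kernel * kernel * in_channels + 1) * filters


# (kernel size, number of filters) for the three Conv2D layers in the model
layers = [(3, 256), (3, 128), (2, 64)]

size, channels = 28, 3  # 28x28 RGB input
param_counts = []
for kernel, filters in layers:
    param_counts.append(conv2d_params(kernel, channels, filters))
    size = pool_output_size(conv2d_output_size(size, kernel))
    channels = filters

flattened = size * size * channels
print(param_counts)  # [7168, 295040, 32832]
print(flattened)     # 256
```

The three counts match the `Param #` column for the `Conv2D` layers, and the flattened size matches the `(None, 256)` output of the `Flatten` layer.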

In [0]:
from tensorflow.keras.optimizers import RMSprop

# Alternative tried earlier: RMSprop, passed as an optimizer object rather than a string
# model.compile(loss='sparse_categorical_crossentropy',
#               optimizer=RMSprop(learning_rate=0.001),
#               metrics=['accuracy'])

model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

## Using Image Data Generators

The `ImageDataGenerator` class from Keras' preprocessing module is used to rescale the images and build the `train_generator` instance. This instance reads the images in the `geological_similarity` folder and prepares the training data for *sparse categorical crossentropy* loss. This means that each label is a single integer class index (0 to 5) rather than a one-hot vector; it is the model's softmax output that has one column per class.
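
To make the label format concrete, here is a small plain-Python sketch contrasting integer (sparse) labels with their one-hot equivalents. The batch of labels is hypothetical and `to_one_hot` is an illustrative helper, not a Keras function; one-hot labels would be the format to pair with `categorical_crossentropy` instead.

```python
class_names = ["andesite", "gneiss", "marble", "quartzite", "rhyolite", "schist"]


def to_one_hot(index, num_classes):
    """Expand an integer class index into a one-hot row."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]


# Hypothetical batch of three labels: andesite, quartzite, schist
sparse_labels = [0, 3, 5]
one_hot_labels = [to_one_hot(i, len(class_names)) for i in sparse_labels]

print(sparse_labels)      # one integer per image
print(one_hot_labels[1])  # [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
```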

The image data generator makes short work of the potentially laborious task of labelling thousands of images, as long as the images for each class are stored in their own subfolder.
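
As a rough illustration of how folder names become labels, the sketch below mimics the mapping with the standard library only: subfolder names are sorted and numbered from zero. `infer_class_indices` is a hypothetical helper written for this example, and the temporary directory is an empty stand-in for `/tmp/geological_similarity`. On the real generator, the equivalent mapping should be available as `train_generator.class_indices`.

```python
import os
import tempfile


def infer_class_indices(directory):
    """Map each subfolder name to an integer label, in sorted order."""
    classes = sorted(
        name for name in os.listdir(directory)
        if os.path.isdir(os.path.join(directory, name))
    )
    return {name: index for index, name in enumerate(classes)}


# Build an empty stand-in for the dataset folder, in arbitrary order
with tempfile.TemporaryDirectory() as root:
    for name in ["schist", "marble", "andesite", "rhyolite", "gneiss", "quartzite"]:
        os.makedirs(os.path.join(root, name))
    class_indices = infer_class_indices(root)

print(class_indices)
# {'andesite': 0, 'gneiss': 1, 'marble': 2, 'quartzite': 3, 'rhyolite': 4, 'schist': 5}
```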

## Training and Test Sets
When specifying the data generator, we can set a `validation_split` ratio in the constructor, which lets the same generator be pointed at the same location to produce distinct training and validation sets. The held-out set is named `test_generator` below, but it is used as validation data during training.

In [9]:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# All images will be rescaled by 1./255; 25% are held out for validation
image_datagen = ImageDataGenerator(rescale=1/255,
                                   validation_split=0.25)

# Flow training images in batches of 128 using the image_datagen generator
train_generator = image_datagen.flow_from_directory(
        '/tmp/geological_similarity',  # This is the source directory for training images
        target_size=(28, 28),  # All images will be resized to 28x28
        batch_size=128,
        # Since we use sparse_categorical_crossentropy loss, we need sparse labels
        class_mode='sparse',
        subset="training")

# Flow the held-out images from the same directory, also in batches of 128
test_generator = image_datagen.flow_from_directory(
        '/tmp/geological_similarity',  # Same source directory; the subset selects the held-out split
        target_size=(28, 28),  # All images will be resized to 28x28
        batch_size=128,
        # Since we use sparse_categorical_crossentropy loss, we need sparse labels
        class_mode='sparse',
        subset="validation")


Found 22499 images belonging to 6 classes.
Found 7499 images belonging to 6 classes.
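
The two counts reported above can be sanity-checked with a little arithmetic. This sketch only uses the numbers printed by the generators and the batch size of 128; it assumes each epoch covers every image once, with a smaller final batch.

```python
import math

train_images, validation_images = 22499, 7499  # counts reported by the generators
batch_size = 128

total = train_images + validation_images
validation_fraction = validation_images / total

# Batches needed to cover each subset once (the last batch is partial)
train_steps = math.ceil(train_images / batch_size)
validation_steps = math.ceil(validation_images / batch_size)

print(total)                          # 29998
print(round(validation_fraction, 3))  # 0.25
print(train_steps, validation_steps)  # 176 59
```

The held-out share comes out at almost exactly the requested `validation_split` of 0.25.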


The `callbackClass` callback here enables us to stop training the neural network when we reach a certain level of accuracy. In this case, we're looking for 99% accuracy on the training data. The callback does not check the held-out data; the validation accuracy (`val_accuracy` in the logs) could be monitored as well, if required.

In [0]:
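class callbackClass(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        # Stop once training accuracy exceeds 99%; skip the check if the metric is missing
        accuracy = (logs or {}).get('accuracy')
        if accuracy is not None and accuracy > 0.99:
            print("\nReached 99% accuracy so cancelling training!")
            self.model.stop_training = True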

accuracy_filter = callbackClass()
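
The stopping rule itself is just a threshold test on the per-epoch `logs` dictionary. As a minimal sketch, the check can be pulled out into a plain function that also tolerates a missing metric; `reached_target` is a hypothetical helper, and the log values below are made up for illustration rather than taken from this notebook's run. Monitoring `val_accuracy` instead of `accuracy` is one possible variation, assuming validation data is passed to `fit()`.

```python
def reached_target(logs, monitor="val_accuracy", target=0.99):
    """Return True only when the monitored metric is present and above target."""
    value = (logs or {}).get(monitor)
    return value is not None and value > target


# Hypothetical end-of-epoch logs
print(reached_target({"accuracy": 0.995, "val_accuracy": 0.97}))   # False
print(reached_target({"accuracy": 0.995, "val_accuracy": 0.992}))  # True
print(reached_target({"accuracy": 0.995}))                         # False
print(reached_target(None))                                        # False
```

Inside a callback, `on_epoch_end` could call such a function and set `self.model.stop_training = True` when it returns `True`.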

In the cell below, we train the model for a maximum of 100 epochs, with the accuracy filter enabling us to stop early if the required accuracy has been reached. The batch size is already fixed by the generators (128), so a `batch_size` argument to `fit()` is not needed, and because we have held-out data, we can use it to get validation statistics.

## Using Validation Data in Model Training

In the `model.fit()` method, we can pass the `validation_data` argument and supply `test_generator` to it. Keras has made this process of supplying validation data really simple!

In [11]:
history = model.fit(
      train_generator,  # Batch size (128) is set by the generator, so it is not passed here
      epochs=100,
      verbose=1,
      validation_data=test_generator,
      callbacks=[accuracy_filter])

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Reached 99% accuracy so cancelling training!


We see that 99% training accuracy was reached at epoch 62, and the training process stopped as a result.
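
Since validation data was supplied, the `history.history` dictionary returned by `fit()` should contain both `accuracy` and `val_accuracy` per epoch, and comparing the two is a quick way to spot overfitting. The sketch below shows one way to do that; `summarize_history` is a hypothetical helper, and the numbers are placeholders for illustration only, not results from this notebook.

```python
def summarize_history(history_dict):
    """Return the final train/validation accuracy and the gap between them."""
    train_acc = history_dict["accuracy"][-1]
    val_acc = history_dict["val_accuracy"][-1]
    return {"train": train_acc, "validation": val_acc, "gap": train_acc - val_acc}


# Placeholder values; in the notebook, pass history.history instead
example_history = {
    "accuracy": [0.60, 0.85, 0.99],
    "val_accuracy": [0.58, 0.80, 0.90],
}
summary = summarize_history(example_history)
print(summary["train"], summary["validation"])
```

A large gap between the two accuracies would suggest the model has memorised the training images rather than learned features that generalise.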