<a href="https://colab.research.google.com/github/clemsage/NeuralDocumentClassification/blob/master/skeleton.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setting up the computing environment


## Install and import TensorFlow 2.0 with GPU

In the Edit menu, open Notebook settings and select "GPU" in the Hardware accelerator drop-down.

In [0]:
!pip install tensorflow-gpu==2.0
import tensorflow as tf
print(tf.__version__)

## Confirm TensorFlow can see the GPU

In [0]:
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

## Additional information about hardware

In [0]:
from tensorflow.python.client import device_lib
device_lib.list_local_devices()

For CPU information and RAM, run:

In [0]:
!cat /proc/cpuinfo
!cat /proc/meminfo

## Other useful package imports

In [0]:
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
import os
import PIL
import sys
import importlib

# Working on the dataset

The dataset is a subset of the [RVL-CDIP dataset](https://www.cs.cmu.edu/~aharley/rvl-cdip/). See the papers by [Harley et al.](http://scs.ryerson.ca/~aharley/icdar15/harley_convnet_icdar15.pdf) and [Asim et al.](https://www.dfki.de/fileadmin/user_upload/import/10637_Asim_Document_Image_Classification.pdf) for recent work on this dataset.

## Information about the dataset

This project only considers the following 5 classes among the 16 classes of the original dataset:

In [0]:
class_names = ['form', 'email', 'handwritten', 'advertisement', 'invoice']
NUM_CLASSES = len(class_names)

## Import the dataset

First, clone or pull the GitHub repository of the project:

In [0]:
if not os.path.exists('NeuralDocumentClassification'):
  !git clone https://github.com/clemsage/NeuralDocumentClassification.git
else:
  !git -C NeuralDocumentClassification pull
sys.path.append('NeuralDocumentClassification')

Download and extract labels, images and dataset assignments from this [Google Drive](https://drive.google.com/drive/folders/1Pkd6sUkDGBUymWKK93abZx1MQiWmzFgP):

In [0]:
import download_dataset
importlib.reload(download_dataset)
for elt in ['label', 'image', 'dataset_assignment']:
  download_dataset.download_and_extract(elt)
dataset_path = 'dataset'

Parse `dataset_assignment.txt` to retrieve the training and test sets:

In [0]:
dataset = {"training": [], "test": []}
with open(os.path.join(dataset_path, 'dataset_assignment.txt'), 'r') as f:
  for line in f.readlines():
    line = line.split('\n')[0]
    file_id, assignment = line.split(',')
    file_path = os.path.join(dataset_path, 'image_png', '%s.png' % file_id)
    dataset[assignment].append(file_path)

print("Number of training documents: %d" % len(dataset["training"]))
print("Number of test documents: %d" % len(dataset['test']))
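
The parsing step above can be tried on a few made-up lines before running it on the real file. This is a minimal sketch that assumes each line has the form `file_id,assignment`, as the cell above implies; the helper name `parse_assignments` and the sample ids are hypothetical.

```python
import os

def parse_assignments(lines, dataset_path="dataset"):
    """Group image paths by subset from 'file_id,assignment' lines."""
    subsets = {"training": [], "test": []}
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines, e.g. a trailing newline at end of file
            continue
        file_id, assignment = line.split(",")
        subsets[assignment].append(
            os.path.join(dataset_path, "image_png", "%s.png" % file_id))
    return subsets

# Hypothetical sample content, not taken from the real file
sample_lines = ["doc_a,training\n", "doc_b,test\n", "doc_c,training\n", "\n"]
subsets = parse_assignments(sample_lines)
print(len(subsets["training"]), len(subsets["test"]))
```

Skipping empty lines is a small robustness choice of this sketch; the notebook cell itself assumes every line is well formed.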

List the image files of the training and test datasets:

In [0]:
list_train_ds = tf.data.Dataset.from_tensor_slices(dataset['training'])
list_train_ds = list_train_ds.shuffle(100000)
list_test_ds = tf.data.Dataset.from_tensor_slices(dataset['test'])
list_test_ds = list_test_ds.shuffle(100000)

Print 5 image file names of the training set (see [TensorFlow tutorial for loading data](https://www.tensorflow.org/tutorials/load_data/images#load_using_tfdata)):

In [0]:
### Insert your code here ###
# See the expected solution by clicking on the cell below

In [0]:
#@title
for f in list_train_ds.take(5):
  print(f.numpy())

Get the labels for all files of the dataset:

In [0]:
raw_class_indices = ['1', '2', '3', '4', '11']

# Parse the label.txt file to get all labels
file_paths, labels = [], []
with open(os.path.join(dataset_path, 'label.txt'), 'r') as f:
  for line in f.readlines():
    line = line.split('\n')[0]
    file_id, label = line.split(',')
    file_path = os.path.join(dataset_path, 'image_png', '%s.png' % file_id)
    file_paths.append(file_path)
    labels.append(raw_class_indices.index(label))

labels_idx = tf.lookup.StaticHashTable(
    initializer=tf.lookup.KeyValueTensorInitializer(keys=file_paths, 
                                                    values=labels),
    default_value=tf.constant(-1))
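
The call `raw_class_indices.index(label)` remaps the raw class indices stored in the label file to contiguous labels 0 to 4, which is what the sparse loss and `class_names` expect. The short sketch below only restates that remapping in plain Python; it introduces no names beyond those used in the cell above.

```python
raw_class_indices = ['1', '2', '3', '4', '11']
class_names = ['form', 'email', 'handwritten', 'advertisement', 'invoice']

# The position in raw_class_indices becomes the contiguous training label
remapped = {raw: raw_class_indices.index(raw) for raw in raw_class_indices}
print(remapped)

# Raw index '11' ends up as label 4, which selects 'invoice' in class_names
invoice_label = raw_class_indices.index('11')
print(invoice_label, class_names[invoice_label])
```

A file path missing from the table is looked up as `-1` (the `default_value`), which makes unlabeled files easy to spot.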

Implement a function that resizes images to the [US Letter format](https://en.wikipedia.org/wiki/Letter_(paper_size)) (8.5 by 11 inches) at 35 pixels per inch ([PPI](https://en.wikipedia.org/wiki/Pixel_density)):

In [0]:
PPI = 35  # number of Pixels Per Inch 
IMG_WIDTH, IMG_HEIGHT = None, None  # define them as a function of PPI

def decode_img(img):
  # Adapt the function given in the tutorial : 
  # https://www.tensorflow.org/tutorials/load_data/images#load_using_tfdata 
  return img

In [0]:
#@title
PPI = 35  # Number of Pixels Per Inch 
IMG_WIDTH, IMG_HEIGHT = int(PPI * 8.5), int(PPI * 11)

def decode_img(img):
  # convert the compressed string to a uint8 tensor
  img = tf.io.decode_png(img, channels=1)
  # convert to floats in the [0,1] range
  img = tf.image.convert_image_dtype(img, tf.float32)
  # resize the image to the desired size
  img = tf.image.resize(img, [IMG_HEIGHT, IMG_WIDTH])
  return img
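
As a quick check of the arithmetic behind `IMG_WIDTH` and `IMG_HEIGHT`, the sketch below recomputes the target size in plain Python. The constant names `LETTER_WIDTH_IN` and `LETTER_HEIGHT_IN` are introduced here for illustration only.

```python
PPI = 35
LETTER_WIDTH_IN, LETTER_HEIGHT_IN = 8.5, 11  # US Letter size in inches

IMG_WIDTH = int(PPI * LETTER_WIDTH_IN)    # 297.5 is truncated to 297
IMG_HEIGHT = int(PPI * LETTER_HEIGHT_IN)  # 385
n_pixels = IMG_HEIGHT * IMG_WIDTH         # values in one grayscale image

print(IMG_HEIGHT, IMG_WIDTH, n_pixels)
```

Note that `tf.image.resize` takes the size as `[height, width]`, so the order is `[IMG_HEIGHT, IMG_WIDTH]` even though the paper format is usually quoted as width by height.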

Implement a function to convert a file path to an (image_data, label) pair:

In [0]:
def process_path(file_path):
  label = labels_idx.lookup(file_path)

  img = tf.io.read_file(file_path)
  img = decode_img(img)
  return img, label

Apply image and label retrieval to the training and test datasets:

In [0]:
labeled_train_ds = list_train_ds.map(process_path, 
                         num_parallel_calls=tf.data.experimental.AUTOTUNE)
labeled_test_ds = list_test_ds.map(process_path, 
                         num_parallel_calls=tf.data.experimental.AUTOTUNE)

## Explore the data

Get image shape and label for one element of the training dataset:

In [0]:
for image, label in labeled_train_ds.take(1):
  print("Image shape (height, width, depth):", image.numpy().shape)
  print("Label:", class_names[label.numpy()])

Plot 3 random training images of each class:

In [0]:
plt.figure(figsize=(30, 60))
n_images_per_class = 3

for image, label in labeled_train_ds:
  break # Sample images and labels for each class

for class_idx, class_name in enumerate(class_names):
  for i in range(n_images_per_class):
    plt.subplot(NUM_CLASSES, n_images_per_class, 
                class_idx*n_images_per_class + i + 1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    # plt.imshow(...)
    # plt.xlabel(...)

#plt.show()

In [0]:
#@title
plt.figure(figsize=(30, 60))
n_images_per_class = 3
images = {class_name: [] for class_name in class_names}

for image, label in labeled_train_ds:
  image = image.numpy()
  label = label.numpy()

  if len(images[class_names[label]]) < n_images_per_class:
    images[class_names[label]].append(image)
  
  if all([len(images[class_name]) == n_images_per_class 
          for class_name in class_names]):
    break

for class_idx, class_name in enumerate(class_names):
  for i in range(n_images_per_class):
    plt.subplot(NUM_CLASSES, n_images_per_class, 
                class_idx*n_images_per_class + i + 1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.imshow(np.squeeze(images[class_name][i]), cmap='gray')
    plt.xlabel(class_name)

plt.show()

Print the class distribution in the training set:

In [0]:
cnt_class = Counter()
for file_path in dataset['training']:
  label = labels_idx.lookup(tf.constant(file_path))
  # Update the counter with the label value

# Print the class counter

In [0]:
#@title
cnt_class = Counter()
for file_path in dataset['training']: 
  label = labels_idx.lookup(tf.constant(file_path))
  cnt_class.update([class_names[label.numpy()]])

for key, val in cnt_class.most_common():
  print('%s: %d' % (key, val))

## Prepare training

Use a temporary folder for caching elements of the dataset in order to speed up training and testing:

In [0]:
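temp_folder = '/tmp/%dx%dx1' % (IMG_HEIGHT, IMG_WIDTH)
# Give each subset its own cache file: sharing one path would mix up
# the cached training and test elements.
labeled_train_ds = labeled_train_ds.cache(temp_folder + '_train')
labeled_test_ds = labeled_test_ds.cache(temp_folder + '_test')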

Shuffle the documents within each subset:

In [0]:
labeled_train_ds = labeled_train_ds.shuffle(2048)
labeled_test_ds = labeled_test_ds.shuffle(2048)

Batch documents within each subset:


In [0]:
batch_size = 128
labeled_train_ds = labeled_train_ds.batch(batch_size)
labeled_test_ds = labeled_test_ds.batch(batch_size)

Prefetch the subsets in the background while the model is computing:

In [0]:
labeled_train_ds = labeled_train_ds.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
labeled_test_ds = labeled_test_ds.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
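
To build intuition for the shuffle and batch steps, here is a pure-Python model of a bounded shuffle buffer followed by batching. It is a conceptual sketch, not the actual `tf.data` implementation, and the function names `buffered_shuffle` and `make_batches` are hypothetical.

```python
import random

def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in a locally shuffled order using a bounded buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        # Buffer is full: emit a random element and reuse its slot
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Input exhausted: drain the remaining elements in random order
    rng.shuffle(buffer)
    yield from buffer

def make_batches(items, batch_size):
    """Group consecutive items into lists of at most batch_size elements."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller, batch

shuffled = list(buffered_shuffle(range(10), buffer_size=4, seed=0))
batches = list(make_batches(shuffled, batch_size=4))
print(batches)
```

In this model a buffer smaller than the dataset only mixes nearby elements, and a buffer of size 1 does not shuffle at all. That is one reason the file lists were already shuffled with a large buffer before the images were decoded.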

# Visual classifiers

## Fully connected neural network

### Set up the layers

Build a neural network composed of one fully connected (aka dense) hidden layer with 128 [ReLU](https://en.wikipedia.org/wiki/Rectifier_(neural_networks)) units and one output softmax layer.

Each image must be flattened into a one-dimensional vector before being fed to the hidden layer.

In [0]:
model = tf.keras.Sequential([
  # Insert your layers here, see the following documentation:
  # https://www.tensorflow.org/tutorials/quickstart/beginner
])

In [0]:
#@title
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(IMG_HEIGHT, IMG_WIDTH, 1)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(NUM_CLASSES, activation='softmax')
])
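
It can help to estimate the size of this model before printing its summary. The sketch below counts the weights and biases of the two dense layers by hand, assuming the 385 x 297 grayscale input derived from 35 PPI; `HIDDEN_UNITS` is a name introduced here for illustration.

```python
IMG_HEIGHT, IMG_WIDTH = 385, 297  # int(35 * 11), int(35 * 8.5)
NUM_CLASSES = 5
HIDDEN_UNITS = 128

n_inputs = IMG_HEIGHT * IMG_WIDTH * 1  # size of the Flatten output
hidden_params = n_inputs * HIDDEN_UNITS + HIDDEN_UNITS    # weights + biases
output_params = HIDDEN_UNITS * NUM_CLASSES + NUM_CLASSES  # weights + biases
total_params = hidden_params + output_params

print(hidden_params, output_params, total_params)
```

Almost all parameters sit in the hidden layer, because every pixel is connected to every hidden unit.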

### Compile the model

Compile the model by providing the optimizer, the loss function you want to minimize and the metrics to monitor during training:

In [0]:
model.compile(
    optimizer='adam', # https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adam
    loss='sparse_categorical_crossentropy', # Loss used for multi-class classification with integer labels
    # https://www.tensorflow.org/api_docs/python/tf/keras/losses/SparseCategoricalCrossentropy
    metrics=['accuracy'] # https://www.tensorflow.org/api_docs/python/tf/keras/metrics/Accuracy
    )
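
To make the loss and the metric concrete, the following NumPy sketch computes a sparse categorical cross-entropy and an accuracy on two made-up predictions. It is a simplified illustration: the Keras implementation also clips probabilities for numerical stability, which is ignored here, and the function name below merely mirrors the loss name.

```python
import numpy as np

def sparse_categorical_crossentropy(labels, probs):
    """Mean negative log-probability assigned to the true class."""
    labels = np.asarray(labels)
    probs = np.asarray(probs, dtype=np.float64)
    # Pick, for each row, the probability predicted for its integer label
    true_class_probs = probs[np.arange(len(labels)), labels]
    return float(np.mean(-np.log(true_class_probs)))

# Hypothetical softmax outputs for two documents and three classes
probs = [[0.5, 0.25, 0.25],
         [0.1, 0.8, 0.1]]
labels = [0, 1]

loss = sparse_categorical_crossentropy(labels, probs)
accuracy = float(np.mean(np.argmax(probs, axis=1) == np.asarray(labels)))
print(loss, accuracy)
```

"Sparse" refers to the labels being integers rather than one-hot vectors, which matches the labels produced by `process_path`.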

Print a summary of the model:

In [0]:
print(model.summary())

### Train the model

Fit the model on the training set for 20 epochs:

In [0]:
EPOCHS = 20
# model.fit(...)  # https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit

In [0]:
#@title
EPOCHS = 20
model.fit(labeled_train_ds, epochs=EPOCHS)

### Evaluation on the test set

Get the values of the loss and accuracy: 

In [0]:
# model.evaluate(..., verbose=2) # https://www.tensorflow.org/api_docs/python/tf/keras/Model#evaluate

In [0]:
#@title
model.evaluate(labeled_test_ds, verbose=2)

Are these values different from their training counterparts?

### Prediction on the test set

Implement a function that gathers the model predictions and the ground truth labels for a random batch of a given dataset:

In [0]:
def predict_random_batch(model, dataset):
  """
  Sample a random batch of the dataset and return the images of this batch 
  as well as its labels and the predicted classes of a given model

  Parameters
  ----------
  model : tf.keras.Model
  dataset: tf.data.Dataset

  Returns
  -------
  images: np.ndarray or EagerTensor
  labels: list of str
  predicted_classes: list of str
  """
  images, labels = next(iter(dataset))
  
  # get label names for the sampled batch

  # make predictions
  predicted_classes = None

  return images, labels, predicted_classes

In [0]:
#@title
def predict_random_batch(model, dataset):
  """
  Sample a random batch of the dataset and return the images of this batch 
  as well as its labels and the predicted classes of a given model

  Parameters
  ----------
  model : tf.keras.Model
  dataset: tf.data.Dataset

  Returns
  -------
  images: np.ndarray or EagerTensor
  labels: list of str
  predicted_classes: list of str
  """
  images, labels = next(iter(dataset))
  
  # get label names for the sampled batch
  labels = [class_names[i] for i in labels]

  # make predictions
  predictions = model.predict(images)
  predicted_classes_idx = np.argmax(predictions, axis=1)
  predicted_classes = [class_names[i] for i in predicted_classes_idx]

  return images, labels, predicted_classes
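
The conversion from softmax outputs to class names can be checked in isolation. The sketch below applies the same `argmax` step to two made-up probability rows; the numbers are hypothetical and only serve to show the indexing.

```python
import numpy as np

class_names = ['form', 'email', 'handwritten', 'advertisement', 'invoice']

# Hypothetical model outputs: one row of class probabilities per image
predictions = np.array([[0.10, 0.70, 0.10, 0.05, 0.05],
                        [0.20, 0.10, 0.10, 0.10, 0.50]])

predicted_classes_idx = np.argmax(predictions, axis=1)  # best class per row
predicted_classes = [class_names[i] for i in predicted_classes_idx]
print(predicted_classes)
```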

Plot the first 9 images of this batch, with their labels and predicted classes shown below each image:

In [0]:
def plot_images_predictions_and_labels(images, labels, predicted_classes):
  plt.figure(figsize=(30, 40))

  for im_idx in range(9):
    plt.subplot(3, 3, im_idx + 1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.imshow(np.squeeze(images[im_idx]), cmap='gray')
    plt.xlabel("label: %s\npred: %s" % (labels[im_idx], 
                                        predicted_classes[im_idx]))

  plt.show()

result = predict_random_batch(model, labeled_test_ds)
plot_images_predictions_and_labels(*result)

### Under the Hood

Implement a ReLU dense layer by creating its weights and biases and defining the transformation from inputs to outputs:

In [0]:
# https://www.tensorflow.org/guide/keras/custom_layers_and_models#the_layer_class
class MyDenseLayer(tf.keras.layers.Layer):

  def __init__(self, units, input_dim):
    super(MyDenseLayer, self).__init__()
    self.w = self.add_weight(
        shape=None,  ## Insert the shape of the weight matrix here
        initializer='glorot_uniform',  # Default initializer for weights of a tf.keras.layers.Dense layer
        trainable=True)

    self.b = self.add_weight(
        shape=None, ## Insert the shape of the bias vector here
        initializer='zeros',  # Default initializer for biases of a tf.keras.layers.Dense layer
        trainable=True)

  def call(self, inputs):
    outputs = None
    return outputs

In [0]:
#@title
# https://www.tensorflow.org/guide/keras/custom_layers_and_models#the_layer_class
class MyDenseLayer(tf.keras.layers.Layer):

  def __init__(self, units, input_dim):
    super(MyDenseLayer, self).__init__()
    self.w = self.add_weight(
        shape=(input_dim, units),
        initializer='glorot_uniform',  # Default initializer for weights of a tf.keras.layers.Dense layer
        trainable=True)

    self.b = self.add_weight(
        shape=(units,),
        initializer='zeros',  # Default initializer for biases of a tf.keras.layers.Dense layer
        trainable=True)

  def call(self, inputs):
    return tf.keras.activations.relu(tf.matmul(inputs, self.w) + self.b)
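
The computation performed in `call` can be reproduced with NumPy to check the shapes involved. This is a sketch on small, hypothetical sizes with random weights, not a replacement for the Keras layer.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, input_dim, units = 4, 6, 3  # small hypothetical sizes

x = rng.normal(size=(batch_size, input_dim))  # a batch of flattened inputs
w = rng.normal(size=(input_dim, units))       # same shape as self.w
b = np.zeros(units)                           # same shape as self.b

# ReLU(xW + b): the bias is broadcast over the batch dimension
outputs = np.maximum(x @ w + b, 0.0)
print(outputs.shape)
```

The weight matrix has shape `(input_dim, units)` so that a `(batch_size, input_dim)` input yields a `(batch_size, units)` output.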

Using your custom hidden layer, set up the layers of the previously defined model again:

In [0]:
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(IMG_HEIGHT, IMG_WIDTH, 1)),
    ## Insert your custom hidden layer here
    tf.keras.layers.Dense(NUM_CLASSES, activation='softmax')
])

In [0]:
#@title
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(IMG_HEIGHT, IMG_WIDTH, 1)),
    MyDenseLayer(128, IMG_HEIGHT*IMG_WIDTH),
    tf.keras.layers.Dense(NUM_CLASSES, activation='softmax')
])

Lower-level implementation of the model compile step:

In [0]:
loss_object = tf.keras.losses.SparseCategoricalCrossentropy()
optimizer = tf.keras.optimizers.Adam()
train_loss = tf.keras.metrics.Mean(name='train_loss')
train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='train_accuracy')

@tf.function
def train_step(images, labels):
  with tf.GradientTape() as tape:
    predictions = model(images)
    loss = loss_object(labels, predictions)
  gradients = tape.gradient(loss, model.trainable_variables)
  optimizer.apply_gradients(zip(gradients, model.trainable_variables))

  train_loss(loss)
  train_accuracy(labels, predictions)

Lower-level implementation of the model fit step:

In [0]:
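for epoch in range(EPOCHS):
  # Reset the running metrics so that each epoch is reported separately
  train_loss.reset_states()
  train_accuracy.reset_states()

  for images, labels in labeled_train_ds:
    train_step(images, labels)

  # Report once per epoch rather than after every batch
  template = 'Epoch {}, Loss: {}, Accuracy: {}'
  print(template.format(epoch+1,
                        train_loss.result(),
                        train_accuracy.result()*100))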

## Convolutional Neural Networks (CNN)

### Training from scratch

Create and compile a model alternating convolution and max pooling layers. You can add some fully connected layers between the last pooling layer and the output layer. Start with a shallow network (4 or 5 convolution layers) and progressively move to deeper architectures:

In [0]:
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
model = tf.keras.Sequential([
    # Alternating Conv2D and MaxPooling2D layers

    # Some dense hidden layer(s)

    Dense(NUM_CLASSES, activation='softmax')
])

print(model.summary())

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

In [0]:
#@title
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
shallow_model = tf.keras.Sequential([
    Conv2D(16, 3, padding='same', activation='relu', 
           input_shape=(IMG_HEIGHT, IMG_WIDTH, 1)),
    MaxPooling2D(4),
    Conv2D(32, 3, padding='same', activation='relu'),
    MaxPooling2D(4),
    Conv2D(64, 3, padding='same', activation='relu'),
    MaxPooling2D(4),
    Conv2D(128, 3, padding='same', activation='relu'),
    MaxPooling2D(4),
    Flatten(),
    Dense(128, activation='relu'),
    Dense(NUM_CLASSES, activation='softmax')
])

deep_model = tf.keras.Sequential([
    Conv2D(16, 3, padding='same', activation='relu', 
           input_shape=(IMG_HEIGHT, IMG_WIDTH, 1)),
    MaxPooling2D(),
    Conv2D(32, 3, padding='same', activation='relu'),
    MaxPooling2D(),
    Conv2D(64, 3, padding='same', activation='relu'),
    MaxPooling2D(),
    Conv2D(128, 3, padding='same', activation='relu'),
    MaxPooling2D(),
    Conv2D(256, 3, padding='same', activation='relu'),
    MaxPooling2D(),
    Conv2D(256, 3, padding='same', activation='relu'),
    MaxPooling2D(),
    Conv2D(256, 3, padding='same', activation='relu'),
    MaxPooling2D(),
    Conv2D(256, 3, padding='same', activation='relu'),
    MaxPooling2D(),
    Flatten(),
    Dense(256, activation='relu'),
    Dense(NUM_CLASSES, activation='softmax')
])

model_with_strides = tf.keras.Sequential([
    Conv2D(16, 3, padding='same', activation='relu', 
           input_shape=(IMG_HEIGHT, IMG_WIDTH, 1)),
    MaxPooling2D(),
    Conv2D(32, 3, padding='same', activation='relu'),
    MaxPooling2D(),
    Conv2D(64, 3, padding='same', activation='relu', strides=2),
    MaxPooling2D(),
    Conv2D(128, 3, padding='same', activation='relu', strides=2),
    MaxPooling2D(),
    Conv2D(128, 3, padding='same', activation='relu', strides=2),
    MaxPooling2D(),
    Flatten(),
    Dense(128, activation='relu'),
    Dense(NUM_CLASSES, activation='softmax')
])
model = deep_model
print(model.summary())

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
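
When stacking pooling layers it is easy to shrink the feature maps down to nothing, so it is worth tracing the spatial size through each architecture. The sketch below does this in plain Python, assuming a 385 x 297 input, `padding='same'` convolutions (output size `ceil(size / stride)`) and max pooling with its default `'valid'` padding and stride equal to the pool size (output size `size // pool`). The helper names are hypothetical.

```python
import math

def conv_same(size, stride=1):
    """Output size of a convolution with padding='same'."""
    return math.ceil(size / stride)

def max_pool(size, pool=2):
    """Output size of max pooling with 'valid' padding, stride == pool."""
    return size // pool

def trace(height, width, layers):
    """Apply (kind, arg) layers in order and return the final (height, width)."""
    for kind, arg in layers:
        op = conv_same if kind == 'conv' else max_pool
        height, width = op(height, arg), op(width, arg)
    return height, width

H, W = 385, 297
shallow = [('conv', 1), ('pool', 4)] * 4
deep = [('conv', 1), ('pool', 2)] * 8
strided = ([('conv', 1), ('pool', 2)] * 2) + ([('conv', 2), ('pool', 2)] * 3)

shallow_out = trace(H, W, shallow)
deep_out = trace(H, W, deep)
strided_out = trace(H, W, strided)
print(shallow_out, deep_out, strided_out)
```

Under these assumptions all three architectures reduce the image to a single spatial position before `Flatten`, so adding one more pooling layer to any of them would leave no pixels to pool.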

Fit the CNN on the training data:

In [0]:
EPOCHS = 20
model.fit(labeled_train_ds, epochs=EPOCHS)

Evaluate the trained model on the test set:

In [0]:
model.evaluate(labeled_test_ds, verbose=2)

You should reach a test accuracy greater than 0.99!




Plot images, predictions and labels for some test documents:

In [0]:
plot_images_predictions_and_labels(*predict_random_batch(model, labeled_test_ds))

### Transfer Learning with pre-trained models 

The objective is to leverage the knowledge learnt by a pre-trained image classifier. See [TensorFlow Hub](https://tfhub.dev/) to browse available state-of-the-art models such as [Inception V3](https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Szegedy_Rethinking_the_Inception_CVPR_2016_paper.pdf) or [MobileNet V2](https://arxiv.org/pdf/1801.04381.pdf).

Choose a pre-trained model for extracting high level feature vectors of document images:

In [0]:
extractor_model = 'inception_v3'
if extractor_model == 'inception_v3':
  feature_extraction_url = "https://tfhub.dev/google/tf2-preview/inception_v3/feature_vector/4"
  IMG_HEIGHT, IMG_WIDTH = None, None  ## Insert expected input image shape here
elif extractor_model == 'mobilenet_v2':
  feature_extraction_url = "https://tfhub.dev/google/imagenet/mobilenet_v2_100_224/feature_vector/4"
  IMG_HEIGHT, IMG_WIDTH = None, None  ## Insert expected input image shape here

In [0]:
#@title
extractor_model = 'inception_v3'
if extractor_model == 'inception_v3':
  feature_extraction_url = "https://tfhub.dev/google/tf2-preview/inception_v3/feature_vector/4"
  IMG_HEIGHT, IMG_WIDTH = 299, 299
elif extractor_model == 'mobilenet_v2':
  feature_extraction_url = "https://tfhub.dev/google/imagenet/mobilenet_v2_100_224/feature_vector/4"
  IMG_HEIGHT, IMG_WIDTH = 224, 224

Reshape images to the format expected by the chosen model, i.e. IMG_HEIGHT x IMG_WIDTH x 3 (RGB), and recreate the training dataset:

In [0]:
def decode_img(img):
  # convert the compressed string to a uint8 tensor
  img = tf.io.decode_png(img, channels=1)
  # convert to floats in the [0,1] range
  img = tf.image.convert_image_dtype(img, tf.float32)
  # resize the image to the desired size
  img = tf.image.resize(img, [IMG_HEIGHT, IMG_WIDTH])
  # convert to RGB color scale
  img = tf.concat([img for _ in range(3)], axis=-1)  # R = G = B
  return img

def process_path(file_path):
  label = labels_idx.lookup(file_path)

  img = tf.io.read_file(file_path)
  img = decode_img(img)
  return img, label

labeled_train_ds = list_train_ds.map(process_path, 
                         num_parallel_calls=tf.data.experimental.AUTOTUNE)
labeled_train_ds = labeled_train_ds.cache('/tmp/%dx%dx3' % (IMG_HEIGHT, 
                                                            IMG_WIDTH))
labeled_train_ds = labeled_train_ds.shuffle(2048).batch(batch_size)
labeled_train_ds = labeled_train_ds.prefetch(buffer_size=
                                             tf.data.experimental.AUTOTUNE)
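
The grayscale-to-RGB step in `decode_img` simply repeats the single channel three times. The NumPy sketch below shows the same operation on a tiny made-up image, which makes the resulting shape easy to verify.

```python
import numpy as np

# Hypothetical 2 x 3 grayscale image with a trailing channel axis
gray = np.arange(6, dtype=np.float32).reshape(2, 3, 1) / 5.0

# Concatenate three copies along the channel axis, so that R = G = B
rgb = np.concatenate([gray, gray, gray], axis=-1)
print(gray.shape, rgb.shape)
```

The image carries no extra information after this step; it only matches the three-channel input shape the pre-trained models were built for.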

Construct our own image classifier by retrieving and freezing the hidden layers of the pre-trained model: 

In [0]:
import tensorflow_hub as hub
# feature_extraction_layer = hub.KerasLayer(...)
# https://www.tensorflow.org/tutorials/images/transfer_learning_with_hub

model = tf.keras.Sequential([     
    feature_extraction_layer,

    # Some dense hidden layer(s)

    Dense(NUM_CLASSES, activation='softmax')
    ])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

print(model.summary())

In [0]:
#@title
import tensorflow_hub as hub
feature_extraction_layer = hub.KerasLayer(feature_extraction_url, 
                                          trainable=False,
                                          input_shape=(IMG_HEIGHT, IMG_WIDTH, 3))
model = tf.keras.Sequential([          
    feature_extraction_layer,
    Dense(128, activation='relu'),
    Dense(128, activation='relu'),
    Dense(NUM_CLASSES, activation='softmax')
    ])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

print(model.summary())
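
Because the feature extractor is frozen, only the dense layers added on top are trained. The sketch below estimates their parameter count by hand. It assumes feature vectors of size 2048 for Inception V3 and 1280 for MobileNet V2; these sizes are an assumption of this sketch and should be checked against the model summary. The helper names are hypothetical.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

def head_params(feature_dim, hidden=(128, 128), num_classes=5):
    """Trainable parameters of the dense layers stacked on the extractor."""
    sizes = [feature_dim, *hidden, num_classes]
    return sum(dense_params(a, b) for a, b in zip(sizes[:-1], sizes[1:]))

# Assumed feature vector sizes, to be confirmed with model.summary()
inception_head = head_params(2048)
mobilenet_head = head_params(1280)
print(inception_head, mobilenet_head)
```

Either way the trainable head is far smaller than the fully connected model trained on raw pixels earlier in the notebook.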

Train this new model:

In [0]:
EPOCHS = 10
model.fit(labeled_train_ds, epochs=EPOCHS)