<a href="https://colab.research.google.com/github/clemsage/NeuralDocumentClassification/blob/master/skeleton.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setting up the computing environment


## Install and import TensorFlow 2.0 with GPU

Select "GPU" in the Accelerator drop-down on Notebook Settings through the Edit menu.

In [0]:
!pip install tensorflow-gpu==2.0
import tensorflow as tf
print(tf.__version__)

## Confirm TensorFlow can see the GPU

In [0]:
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

## Additional information about hardware

In [0]:
from tensorflow.python.client import device_lib
device_lib.list_local_devices()

For CPU information and RAM, run:

In [0]:
!cat /proc/cpuinfo
!cat /proc/meminfo

## Other useful package imports

In [0]:
from tensorflow import keras
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
import os
import PIL
import sys
import importlib

# Working on the dataset

## Information about the dataset

In [0]:
class_names = ['form', 'email', 'handwritten', 'advertisement', 'invoice']
num_classes = len(class_names)

## Import the dataset

The dataset is a subset of the [RVL-CDIP dataset](https://www.cs.cmu.edu/~aharley/rvl-cdip/).

First, clone or pull the GitHub repository of the project:

In [0]:
if not os.path.exists('NeuralDocumentClassification'):
  !git clone https://github.com/clemsage/NeuralDocumentClassification.git
else:
  !git -C NeuralDocumentClassification pull
sys.path.append('NeuralDocumentClassification')

Download and extract labels, images and dataset assignments:

In [0]:
import download_dataset
importlib.reload(download_dataset)
for elt in ['label', 'image', 'dataset_assignment']:
  download_dataset.download_and_extract(elt)
dataset_path = 'dataset'

Parse `dataset_assignment.txt` to retrieve the training and test sets:

In [0]:
dataset = {"training": [], "test": []}
with open(os.path.join(dataset_path, 'dataset_assignment.txt'), 'r') as f:
  for line in f.readlines():
    line = line.split('\n')[0]
    file_id, assignment = line.split(',')
    file_path = os.path.join(dataset_path, 'image_png', '%s.png' % file_id)
    dataset[assignment].append(file_path)

print("Number of training documents: %d" % len(dataset["training"]))
print("Number of test documents: %d" % len(dataset['test']))
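
As a minimal sketch of the parsing step above, the same logic can be tried on an in-memory sample. The three `doc_00x` identifiers are hypothetical; only the `file_id,assignment` line format is taken from the parsing code.

```python
import io
import os

# Hypothetical sample in the `file_id,assignment` format implied by the parsing code
sample = io.StringIO("doc_001,training\ndoc_002,test\ndoc_003,training\n")

dataset_path = "dataset"
splits = {"training": [], "test": []}
for line in sample:
    line = line.strip()
    if not line:  # ignore blank lines
        continue
    file_id, assignment = line.split(",")
    # Build the image path the same way the notebook does
    splits[assignment].append(
        os.path.join(dataset_path, "image_png", file_id + ".png"))

print({name: len(paths) for name, paths in splits.items()})
```

Iterating over the file object directly and calling `strip()` avoids loading all lines at once and tolerates a trailing blank line.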

List the image files of the training and test datasets (see the [Load images tutorial](https://www.tensorflow.org/tutorials/load_data/images)):

In [0]:
list_train_ds = tf.data.Dataset.from_tensor_slices(dataset['training'])
list_train_ds = list_train_ds.shuffle(100000)
list_test_ds = tf.data.Dataset.from_tensor_slices(dataset['test'])
list_test_ds = list_test_ds.shuffle(100000)

Print 5 image file names of the training set:

In [0]:
for f in list_train_ds.take(5):
  print(f.numpy())

Implement functions to convert a file path to an (image_data, label) pair:

In [0]:
raw_class_indices = ['1', '2', '3', '4', '11']

# Parse the label.txt file to get all labels
file_paths, labels = [], []
with open(os.path.join(dataset_path, 'label.txt'), 'r') as f:
  for line in f.readlines():
    line = line.split('\n')[0]
    file_id, label = line.split(',')
    file_path = os.path.join(dataset_path, 'image_png', '%s.png' % file_id)
    file_paths.append(file_path)
    labels.append(raw_class_indices.index(label))

labels_idx = tf.lookup.StaticHashTable(
    initializer=tf.lookup.KeyValueTensorInitializer(keys=file_paths, 
                                                    values=labels),
    default_value=tf.constant(-1))

# Resize to the most frequent format in the dataset: Letter (8.5 by 11 inches)
DPI = 35  # Number of Dots Per Inch
IMG_WIDTH, IMG_HEIGHT = int(DPI * 8.5), int(DPI * 11)

def decode_img(img):
  # convert the compressed string to a uint8 tensor
  img = tf.io.decode_png(img, channels=1)
  # convert to floats in the [0,1] range
  img = tf.image.convert_image_dtype(img, tf.float32)
  # resize the image to the desired size
  img = tf.image.resize(img, [IMG_HEIGHT, IMG_WIDTH])
  return img

def process_path(file_path):
  label = labels_idx.lookup(file_path)

  img = tf.io.read_file(file_path)
  img = decode_img(img)
  return img, label
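
The cell above does three things: it remaps raw label strings to contiguous class indices, resizes images to a fixed size, and scales pixels to floats. The sketch below reproduces these steps with plain Python and NumPy. The 2x2 toy image is hypothetical, and dividing by 255 is assumed to mirror the "[0,1] range" conversion described in the comments of `decode_img`.

```python
import numpy as np

class_names = ['form', 'email', 'handwritten', 'advertisement', 'invoice']
raw_class_indices = ['1', '2', '3', '4', '11']

# Raw label string -> (contiguous index, class name), paired by position
remap = {raw: (idx, class_names[idx])
         for idx, raw in enumerate(raw_class_indices)}

# Target size: a Letter page (8.5 x 11 inches) at 35 dots per inch
DPI = 35
IMG_WIDTH, IMG_HEIGHT = int(DPI * 8.5), int(DPI * 11)

# Toy uint8 image (not from the dataset) scaled to floats in [0, 1]
pixels = np.array([[0, 51], [204, 255]], dtype=np.uint8)
scaled = pixels.astype(np.float32) / 255.0

print(remap['11'], (IMG_HEIGHT, IMG_WIDTH), scaled.min(), scaled.max())
```

Note that `int()` truncates 297.5 to 297, so each resized image has 385 x 297 pixels.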


Apply image and label retrieval to the training and test datasets:

In [0]:
labeled_train_ds = list_train_ds.map(process_path, 
                         num_parallel_calls=tf.data.experimental.AUTOTUNE)
labeled_test_ds = list_test_ds.map(process_path, 
                         num_parallel_calls=tf.data.experimental.AUTOTUNE)

## Explore the data

Get image shape and label for one element of the training dataset:

In [0]:
for image, label in labeled_train_ds.take(1):
  print("Image shape (height, width, depth):", image.numpy().shape)
  print("Label:", class_names[label.numpy()])

Plot 3 training images of each class (the first ones encountered in the shuffled dataset):

In [0]:
plt.figure(figsize=(30, 60))
n_images_per_class = 3
images = {class_name: [] for class_name in class_names}

for image, label in labeled_train_ds:
  image = image.numpy()
  label = label.numpy()

  if len(images[class_names[label]]) < n_images_per_class:
    images[class_names[label]].append(image)
  
  if all([len(images[class_name]) == n_images_per_class 
          for class_name in class_names]):
    break

for class_idx, class_name in enumerate(class_names):
  for i in range(n_images_per_class):
    plt.subplot(num_classes, n_images_per_class, 
                class_idx*n_images_per_class + i + 1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.imshow(np.squeeze(images[class_name][i]), cmap='gray')
    plt.xlabel(class_name)

plt.show()

Get the class distribution in the training set:

In [0]:
cnt_class = Counter()
for file_path in dataset['training']: 
  label = labels_idx.lookup(tf.constant(file_path))
  cnt_class.update([class_names[label.numpy()]])

for key, val in cnt_class.most_common():
  print('%s: %d' % (key, val))

## Prepare training

Use a temporary folder for caching elements of the dataset in order to speed up training and testing:

In [0]:
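# Use a distinct cache file prefix per subset so the two caches do not collide
labeled_train_ds = labeled_train_ds.cache('/tmp/train_cache')
labeled_test_ds = labeled_test_ds.cache('/tmp/test_cache')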

Shuffle the documents within each subset:

In [0]:
labeled_train_ds = labeled_train_ds.shuffle(2048)
labeled_test_ds = labeled_test_ds.shuffle(2048)
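
To build intuition for the buffer size passed to `shuffle`, here is a simplified pure-Python model of a bounded shuffle buffer. The function `buffered_shuffle` is a hypothetical helper written for illustration, not the TensorFlow implementation.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in an order randomized through a bounded buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # fill the buffer first
            continue
        # Emit a random buffered element and replace it with the new one
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Input exhausted: emit what is left in random order
    rng.shuffle(buffer)
    yield from buffer

shuffled = list(buffered_shuffle(range(10), buffer_size=4))
unshuffled = list(buffered_shuffle(range(10), buffer_size=1))
print(shuffled, unshuffled)
```

In this model, the first emitted element always comes from the first `buffer_size` inputs, so a buffer smaller than the subset only mixes elements locally, and a buffer of size 1 does not shuffle at all.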

Batch documents within each subset:


In [0]:
batch_size = 128
labeled_train_ds = labeled_train_ds.batch(batch_size)
labeled_test_ds = labeled_test_ds.batch(batch_size)
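
The effect of batching on the number of training steps can be sketched in plain Python. The subset size of 1000 documents is hypothetical, and `batch` is an illustrative helper that keeps the final, smaller batch (the default behaviour of `Dataset.batch`).

```python
import math

def batch(items, batch_size):
    """Group items into consecutive lists of at most batch_size elements."""
    items = list(items)
    return [items[i:i + batch_size]
            for i in range(0, len(items), batch_size)]

n_documents = 1000  # hypothetical subset size
batch_size = 128
batches = batch(range(n_documents), batch_size)

print(len(batches), len(batches[-1]))
```

One epoch then takes `ceil(n_documents / batch_size)` steps, with a last batch holding the remaining documents.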

Prefetch the subsets in the background while the model is training:

In [0]:
labeled_train_ds = labeled_train_ds.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
labeled_test_ds = labeled_test_ds.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)

# Visual classifiers

## Fully connected neural network

### Set up the layers

Build a neural network composed of one fully connected (aka dense) layer with 128 hidden units and one output layer.

Each image must be reshaped to a one-dimensional vector before being fed to the hidden layer.

In [0]:
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(IMG_HEIGHT, IMG_WIDTH, 1)),
    keras.layers.Dense(128, activation='relu'),
    #keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(num_classes, activation='softmax')
])
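
The computation performed by this model can be sketched with NumPy alone. The weights below are small random values chosen for illustration (not the Keras defaults), and the two input images are random toy data.

```python
import numpy as np

rng = np.random.default_rng(0)
IMG_HEIGHT, IMG_WIDTH, num_classes = 385, 297, 5

# Toy batch of 2 grayscale images with values in [0, 1)
images = rng.random((2, IMG_HEIGHT, IMG_WIDTH, 1), dtype=np.float32)

# Flatten: (2, 385, 297, 1) -> (2, 114345)
x = images.reshape(len(images), -1)

# Hidden dense layer with 128 units and ReLU activation
w1 = 0.01 * rng.standard_normal((x.shape[1], 128), dtype=np.float32)
b1 = np.zeros(128, dtype=np.float32)
hidden = np.maximum(x @ w1 + b1, 0.0)

# Output layer followed by a numerically stable softmax
w2 = 0.01 * rng.standard_normal((128, num_classes), dtype=np.float32)
b2 = np.zeros(num_classes, dtype=np.float32)
logits = hidden @ w2 + b2
exp = np.exp(logits - logits.max(axis=1, keepdims=True))
probs = exp / exp.sum(axis=1, keepdims=True)

print(x.shape, hidden.shape, probs.shape)
```

Each row of `probs` sums to 1 and gives one probability per class.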

### Compile the model

Compile the model by providing the optimizer, the loss function to minimize, and the metrics to monitor during training:

In [0]:
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
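
As a rough illustration of what the sparse categorical cross-entropy loss measures, the sketch below computes it with NumPy: the negative log of the probability assigned to the true class, averaged over the batch. The function name, the clipping constant `eps`, and the toy probabilities are assumptions made for this example.

```python
import numpy as np

def sparse_categorical_crossentropy(labels, probs, eps=1e-7):
    """Mean of -log(probability of the true class) over the batch."""
    probs = np.clip(probs, eps, 1.0)  # avoid log(0)
    picked = probs[np.arange(len(labels)), labels]
    return float(np.mean(-np.log(picked)))

# Toy predicted probabilities for 2 documents over the 5 classes
probs = np.array([[0.7, 0.1, 0.1, 0.05, 0.05],
                  [0.2, 0.2, 0.2, 0.2, 0.2]])
labels = np.array([0, 3])  # integer class indices, not one-hot vectors

loss = sparse_categorical_crossentropy(labels, probs)
print(loss)
```

The "sparse" variant takes integer labels directly, which matches the integer indices produced by `process_path`.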

Print the summary of the model:

In [0]:
model.summary()
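
The parameter counts reported in the summary can be checked by hand. This sketch assumes the image size derived earlier from `DPI = 35` (385 x 297 pixels, 1 channel) and the two dense layers defined above.

```python
IMG_HEIGHT, IMG_WIDTH = 385, 297
hidden_units, num_classes = 128, 5

# Flatten has no parameters; it only reshapes each image to a vector
flat_dim = IMG_HEIGHT * IMG_WIDTH * 1

# A dense layer has one weight per (input, unit) pair plus one bias per unit
hidden_params = flat_dim * hidden_units + hidden_units
output_params = hidden_units * num_classes + num_classes
total_params = hidden_params + output_params

print(flat_dim, hidden_params, output_params, total_params)
```

Almost all parameters sit in the hidden layer, because every pixel is connected to every hidden unit.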

### Train the model

Fit the model on the preprocessed training images for 15 epochs:

In [0]:
num_epochs = 15
model.fit(labeled_train_ds, epochs=num_epochs)

### Evaluate the model performances on the test set

Get the values of the loss and accuracy: 

In [0]:
model.evaluate(labeled_test_ds, verbose=2)

Are these values different from their training counterparts?

### Make predictions on the test set

Predict the classes for a random batch of the test set and retrieve their labels:

In [0]:
image_batch, label_batch = next(iter(labeled_test_ds))

# prediction
predictions = model.predict(image_batch)
predicted_classes_idx = np.argmax(predictions, axis=1)
predicted_classes = [class_names[i] for i in predicted_classes_idx]
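
The conversion from predicted probabilities to class names can be tried on toy values. The `predictions` and `label_batch` arrays below are hypothetical stand-ins for the model output and the true labels.

```python
import numpy as np

class_names = ['form', 'email', 'handwritten', 'advertisement', 'invoice']

# Hypothetical softmax outputs for 3 documents
predictions = np.array([[0.05, 0.80, 0.05, 0.05, 0.05],
                        [0.10, 0.10, 0.10, 0.10, 0.60],
                        [0.40, 0.30, 0.10, 0.10, 0.10]])
label_batch = np.array([1, 4, 1])  # hypothetical true class indices

# The predicted class is the index of the highest probability in each row
predicted_idx = np.argmax(predictions, axis=1)
predicted_classes = [class_names[i] for i in predicted_idx]

# Fraction of documents in the batch that are classified correctly
batch_accuracy = float(np.mean(predicted_idx == label_batch))

print(predicted_classes, batch_accuracy)
```

Comparing `predicted_idx` with the labels gives a per-batch accuracy, which can differ from the value reported on the whole test set.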

Plot 9 images of this batch, give their labels and predicted classes in the legend:

In [0]:
plt.figure(figsize=(30, 40))

for im_idx in range(9):
  plt.subplot(3, 3, im_idx + 1)
  plt.xticks([])
  plt.yticks([])
  plt.grid(False)
  plt.imshow(np.squeeze(image_batch[im_idx]), cmap='gray')
  plt.xlabel("label: %s\npred: %s" % (class_names[label_batch[im_idx]], 
                                      predicted_classes[im_idx]))

plt.show()

### Under the Hood

Implement the hidden (dense) layer by creating its weights and bias (see [documentation for custom layers](https://www.tensorflow.org/guide/keras/custom_layers_and_models#the_layer_class)):

In [0]:
class MyHiddenLayer(keras.layers.Layer):

  def __init__(self, units, input_dim):
    super(MyHiddenLayer, self).__init__()
    w_init = tf.keras.initializers.GlorotUniform()
    self.w = tf.Variable(initial_value=w_init(shape=(input_dim, units),
                                              dtype='float32'),
                         trainable=True)
    b_init = tf.zeros_initializer()
    self.b = tf.Variable(initial_value=b_init(shape=(units,),
                                              dtype='float32'),
                         trainable=True)

  def call(self, inputs):
    return keras.activations.relu(tf.matmul(inputs, self.w) + self.b)
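
For intuition about what this layer computes, the NumPy sketch below draws weights from a Glorot uniform distribution, that is, uniformly in `[-limit, limit]` with `limit = sqrt(6 / (fan_in + fan_out))`, and applies the same `relu(inputs @ w + b)` formula. The helper `glorot_uniform` and the reduced input dimension of 1000 are assumptions made to keep the example small.

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng):
    """Sample a (fan_in, fan_out) matrix uniformly in [-limit, limit]."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    w = rng.uniform(-limit, limit, size=(fan_in, fan_out))
    return w.astype(np.float32)

rng = np.random.default_rng(0)
input_dim, units = 1000, 128  # hypothetical, smaller than the real input

w = glorot_uniform(input_dim, units, rng)
b = np.zeros(units, dtype=np.float32)  # biases start at zero

# Forward pass of the layer on a toy batch of 4 flattened inputs
inputs = rng.random((4, input_dim)).astype(np.float32)
outputs = np.maximum(inputs @ w + b, 0.0)

print(w.shape, outputs.shape)
```

Scaling the range by the layer's fan-in and fan-out keeps the initial activations from growing or vanishing as the input dimension changes.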

Set up the layers of the model again, using your custom hidden layer:

In [0]:
# Unlike the default implementation, you need to give the input dimension of MyHiddenLayer
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(IMG_HEIGHT, IMG_WIDTH, 1)),
    MyHiddenLayer(128, IMG_HEIGHT*IMG_WIDTH),
    keras.layers.Dense(num_classes, activation='softmax')
])

Compile and retrain the resulting model:

In [0]:
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', 
              metrics=['accuracy'])
model.fit(labeled_train_ds, epochs=num_epochs)

## Convolutional Neural Networks (CNN)

### Training from scratch

### Leveraging pre-trained models (Transfer Learning)

Get the final feature vectors generated by the Inception V3 model trained on ImageNet (browse models available on [TensorFlow Hub](https://tfhub.dev/)): 

In [0]:
import tensorflow_hub as hub
inception_v3_url = "https://tfhub.dev/google/tf2-preview/inception_v3/" \
  "feature_vector/4"
feature_extraction = keras.Sequential([hub.KerasLayer(inception_v3_url)])

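# `train_images` is not defined earlier in this notebook; as an assumption,
# take one batch of training images from the pipeline built above
train_images, _ = next(iter(labeled_train_ds))
# Repeat the single grayscale channel 3 times to get a 3-channel (RGB-like) input
train_images_rgb = np.concatenate([train_images for _ in range(3)], axis=-1)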
feature_extraction(train_images_rgb).shape
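
The channel replication used above can be checked on a small toy array. The shapes below are hypothetical; the point is that concatenating a single-channel batch with itself along the last axis yields three identical channels.

```python
import numpy as np

# Toy grayscale batch: 2 images of 4 x 3 pixels with 1 channel
gray = np.arange(2 * 4 * 3, dtype=np.float32).reshape(2, 4, 3, 1)

# Stack the same channel 3 times along the last (channel) axis
rgb = np.concatenate([gray for _ in range(3)], axis=-1)

print(gray.shape, rgb.shape)
```

`np.repeat(gray, 3, axis=-1)` produces the same result for a single-channel input and may read more directly.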