# Historical document classification
We'll show, with a very simple example, how to separate three types of documents using a simple convolutional neural network.

**Don't forget to turn on the GPU for training:** Runtime > Change runtime type > Hardware accelerator > GPU

## Downloads

In [0]:
!wget https://github.com/dhlab-epfl/fdh-tutorials/releases/download/v0.1/data_classification.zip -O /content/data_classification.zip
!rm -rf /content/data_classification; unzip /content/data_classification.zip -d /content

## Data preparation
This part is adapted from the data preparation [example](https://www.tensorflow.org/tutorials/load_data/images#basic_methods_for_training) in the TensorFlow tutorials.

In [0]:
# Select TensorFlow 2.x (the magic below only exists in Colab)
try:
  %tensorflow_version 2.x
except Exception:
  pass

import tensorflow as tf

import os
import numpy as np
from glob import glob 
import matplotlib.pyplot as plt

In [0]:
test_data_dir = '/content/data_classification/test'
train_data_dir ='/content/data_classification/train'

In [0]:
# Count the number of samples in each split
n_train = len(glob(os.path.join(train_data_dir, '*', '*')))
n_test = len(glob(os.path.join(test_data_dir, '*', '*')))

In [0]:
n_train, n_test
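
The totals above do not show whether the classes are balanced. A minimal sketch of a per-class count, assuming the same `<split>/<class_name>/<file>` layout used by the `glob` calls above; `count_per_class` is a hypothetical helper, not part of the original notebook:

```python
import os
from collections import Counter
from glob import glob

def count_per_class(data_dir):
    """Count the files found in each class sub-directory of `data_dir`."""
    # Each image is expected at <data_dir>/<class_name>/<file>
    paths = glob(os.path.join(data_dir, '*', '*'))
    # The parent directory name of each file is its class
    return Counter(os.path.basename(os.path.dirname(p)) for p in paths)

# Possible usage in this notebook:
# count_per_class('/content/data_classification/train')
```

A strongly unbalanced count would suggest reading the accuracy reported during training with some caution.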

In [0]:
# Sort the class directories so that the label order is the same on every run
CLASS_NAMES = sorted(os.path.basename(d) for d in glob(os.path.join(train_data_dir, '*')))
BATCH_SIZE = 64
AUTOTUNE = tf.data.experimental.AUTOTUNE
IMG_WIDTH = 200
IMG_HEIGHT = 200

There are 3 classes:
* illustration
* sale information
* objects description

In [0]:
CLASS_NAMES

In [0]:
def get_label(file_path):
  # convert the path to a list of path components
  parts = tf.strings.split(file_path, '/')
  # The second to last is the class-directory
  return parts[-2] == CLASS_NAMES

def decode_img(img):
  # convert the compressed string to a 3D uint8 tensor
  img = tf.image.decode_jpeg(img, channels=3)
  # Use `convert_image_dtype` to convert to floats in the [0,1] range.
  img = tf.image.convert_image_dtype(img, tf.float32)
  # resize the image to the desired size; `tf.image.resize` expects [height, width].
  return tf.image.resize(img, [IMG_HEIGHT, IMG_WIDTH])

def process_path(file_path):
  label = get_label(file_path)
  # load the raw data from the file as a string
  img = tf.io.read_file(file_path)
  img = decode_img(img)
  return img, label
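
The comparison in `get_label` checks the class directory against every entry of `CLASS_NAMES`, which yields a boolean one-hot vector. The plain-Python sketch below mimics that logic without TensorFlow. The class names and the file name are hypothetical spellings chosen for illustration; in the notebook the names come from the directories on disk.

```python
# Hypothetical class list; the real one is read from the training directory.
class_names = ['illustration', 'objects_description', 'sale_information']

def one_hot_from_path(file_path, class_names):
    # The second-to-last path component is the class directory.
    class_dir = file_path.split('/')[-2]
    # Element-wise comparison, analogous to `parts[-2] == CLASS_NAMES` on tensors.
    return [class_dir == name for name in class_names]

# Hypothetical file path following the <split>/<class>/<file> layout.
label = one_hot_from_path(
    '/content/data_classification/train/objects_description/0001.jpg',
    class_names)
print(label)  # [False, True, False]
```

Because exactly one entry is true, the label can be fed directly to a `categorical_crossentropy` loss, and `argmax` recovers the class index.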

Prepare the data for training. This means:
* Load the images
* Resize the images to 200 x 200
* Associate the correct label to each image

We will use the TensorFlow `Dataset` object for that.

In [0]:
# Create a Dataset object with all the image filenames
list_ds_train = tf.data.Dataset.list_files(os.path.join(train_data_dir, '*', '*'))
list_ds_test = tf.data.Dataset.list_files(os.path.join(test_data_dir, '*', '*'))

# Load, resize and couple
labeled_ds_train = list_ds_train.map(process_path, num_parallel_calls=tf.data.experimental.AUTOTUNE)
labeled_ds_test = list_ds_test.map(process_path, num_parallel_calls=tf.data.experimental.AUTOTUNE)

Show one example per class

In [0]:
# Iterate over the training set until one image per class has been shown
id_label_list = [0, 1, 2]
i = 1
plt.figure(figsize=(20,50))
for image, label in labeled_ds_train:
  label_id = np.argmax(label.numpy())
  if label_id in id_label_list:
    plt.subplot(1, 3, i)
    plt.imshow(image.numpy())
    plt.title(CLASS_NAMES[label_id])
    plt.axis('off')
    id_label_list.remove(label_id)
    i += 1
  if not id_label_list:
    break

Batch data for training

In [0]:
train_ds = labeled_ds_train.shuffle(buffer_size=1000).repeat().batch(BATCH_SIZE, drop_remainder=True).prefetch(buffer_size=AUTOTUNE)
# Validation must use the held-out test split, not the training data
test_ds = labeled_ds_test.repeat().batch(BATCH_SIZE)

## Model definition
We will use a simple convolutional model. You can play with the number of filters, the number of layers, etc., and see the effect on the result.

In [0]:
model = tf.keras.models.Sequential([
  tf.keras.layers.Conv2D(filters=16, kernel_size=(3,3), activation='relu'),
  tf.keras.layers.Conv2D(filters=32, kernel_size=(3,3), activation='relu'),
  tf.keras.layers.MaxPooling2D(),
  tf.keras.layers.Conv2D(filters=64, kernel_size=(3,3), activation='relu'),
  tf.keras.layers.MaxPooling2D(),
  tf.keras.layers.Flatten(),
  tf.keras.layers.Dense(len(CLASS_NAMES), activation='softmax')
])

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
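
When changing the number of layers or filters, it helps to know how the feature map size evolves. The sketch below works through the arithmetic for the model above, assuming the Keras defaults (`'valid'` padding and stride 1 for `Conv2D`, a 2x2 window for `MaxPooling2D`), 200 x 200 inputs and 3 classes. `conv_out` and `pool_out` are hypothetical helper names.

```python
def conv_out(size, kernel=3):
    # 'valid' padding with stride 1 trims (kernel - 1) pixels
    return size - kernel + 1

def pool_out(size, pool=2):
    # 2x2 max pooling with stride 2 halves the size (rounded down)
    return size // pool

size = 200
size = conv_out(size)  # Conv2D, 16 filters -> 198
size = conv_out(size)  # Conv2D, 32 filters -> 196
size = pool_out(size)  # MaxPooling2D       -> 98
size = conv_out(size)  # Conv2D, 64 filters -> 96
size = pool_out(size)  # MaxPooling2D       -> 48

flat_features = size * size * 64     # input size of the Dense layer
dense_params = flat_features * 3 + 3 # weights plus biases for 3 classes
print(size, flat_features, dense_params)  # 48 147456 442371
```

Under these assumptions most of the parameters sit in the final `Dense` layer, so an extra pooling stage would shrink the model considerably.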

## Training

In [0]:
model.fit(train_ds,
          validation_data=test_ds,
          steps_per_epoch=n_train // BATCH_SIZE,
          validation_steps=n_test // BATCH_SIZE,
          epochs=5)
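
Both datasets call `repeat()`, so they never run out of data, and `steps_per_epoch` and `validation_steps` tell Keras where an epoch ends. A small sketch of the integer arithmetic, using hypothetical sample counts (the real ones are `n_train` and `n_test`); `steps_and_leftover` is a hypothetical helper:

```python
def steps_and_leftover(n_samples, batch_size):
    # Number of full batches that fit in one pass over the data
    steps = n_samples // batch_size
    # Samples not covered by those full batches
    leftover = n_samples - steps * batch_size
    return steps, leftover

print(steps_and_leftover(1000, 64))  # (15, 40)
print(steps_and_leftover(50, 64))    # (0, 50)
```

The second call shows a case worth checking: if a split holds fewer samples than `BATCH_SIZE`, the integer division gives 0 steps, and the step count or the batch size would need adjusting.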


## Visualize prediction

In [0]:
# Take one sample from the test dataset
image_to_predict = list(labeled_ds_test.take(1))[0][0].numpy()

In [0]:
predictions = model.predict(image_to_predict[None])
label = CLASS_NAMES[np.argmax(predictions)]
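
The softmax output holds one probability per class, so it can also be useful to report how confident the model is, not only the winning label. A minimal plain-Python sketch, with hypothetical class names and a hypothetical probability row standing in for `predictions[0]`:

```python
# Hypothetical values for illustration only
class_names = ['illustration', 'objects_description', 'sale_information']
probs_row = [0.1, 0.7, 0.2]

def top_class(probs_row, class_names):
    # Index of the highest probability, i.e. what argmax returns
    best = max(range(len(probs_row)), key=probs_row.__getitem__)
    return class_names[best], probs_row[best]

print(top_class(probs_row, class_names))  # ('objects_description', 0.7)
```

A low top probability can flag images that the model finds ambiguous.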

In [0]:
plt.figure(figsize=(10,10))
plt.imshow(image_to_predict)
plt.title(label)
plt.axis('off')