### Train a page classifier for finding tables in digitized books

This notebook describes how a page classifier was trained to find tables in digitized books from the collections of the Bavarian State Library. <br>
So far, the classifier is rather simple: a Convolutional Neural Network (CNN) for image classification that distinguishes between 4 classes (Table, TextAndTable, Text, Title). The goal of the page classifier is to quickly find full-page tables across a broad selection of digitized books published between 1750 and 1900.<br>
Although the results are satisfactory, there is considerable scope for improving the page classifier. <br>
Additional training data and revised training parameters would probably improve the model's performance.<br>
Eventually it could be developed into a general classifier that distinguishes between title pages, indices, tables of contents, text, pages containing full-page images, etc. This, however, is beyond the scope of the current project. <br>

#### Load packages

In [None]:
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D
from tensorflow.keras.layers import Activation, Dropout, Flatten, Dense
from tensorflow.keras import backend as K

from tensorflow.keras.layers import Input
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping

from PIL import Image
import tensorflow as tf

#### Load training data

In [None]:
# Select the directory that contains the training data
train_data_dir = SELECT_TRAINING_DATA_DIRECTORY

In [None]:
# Check the image data format and set the input shape accordingly.
# The channel axis (RGB) can come first or last, depending on the Keras backend configuration,
# so the input shape is built to match whichever format is active.
img_width, img_height = 299, 299

if K.image_data_format() == 'channels_first':
    input_shape = (3, img_height, img_width)
else:
    input_shape = (img_height, img_width, 3)

#### Set model parameters

In [None]:
epochs = 200
batch_size = 8  # lower this value if memory is an issue

model = Sequential([
    Input(shape=input_shape), 
    
    # First Convolutional Block
    Conv2D(32, (3, 3), padding='same', activation='relu'), 
    MaxPooling2D(pool_size=(2, 2)),
    
    # Second Convolutional Block
    Conv2D(32, (3, 3), padding='same', activation='relu'),
    MaxPooling2D(pool_size=(2, 2)),
    
    # Third Convolutional Block
    Conv2D(64, (3, 3), padding='same', activation='relu'),
    MaxPooling2D(pool_size=(2, 2)),
    
    # Flatten and Dense Layers
    Flatten(),
    Dense(128, activation='relu'),
    Dropout(0.5),
    Dense(4, activation='softmax') # number of classes
])
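
To get a feel for the size of this architecture, the feature-map dimensions and parameter counts can be worked out by hand. The sketch below is plain Python (no TensorFlow required) and assumes the layer settings shown above: 3×3 kernels with `padding='same'`, 2×2 max pooling with the Keras default `padding='valid'`, and a 299×299 RGB input. The helper names are illustrative only.

```python
# Work out feature-map sizes and parameter counts for the CNN defined above.
# Assumes 'same'-padded 3x3 convolutions (spatial size unchanged) and
# 2x2 max pooling with default 'valid' padding (size floor-divided by 2).

def conv_params(kernel, in_channels, filters):
    """Weights plus one bias per filter for a 2D convolution."""
    return kernel * kernel * in_channels * filters + filters

def dense_params(inputs, units):
    """Weights plus one bias per unit for a fully connected layer."""
    return inputs * units + units

def summarize(size=299, channels=3, conv_filters=(32, 32, 64), hidden=128, classes=4):
    rows = []
    for filters in conv_filters:
        rows.append((f"Conv2D({filters})", (size, size, filters), conv_params(3, channels, filters)))
        channels = filters
        size //= 2  # MaxPooling2D(pool_size=(2, 2))
        rows.append(("MaxPooling2D", (size, size, channels), 0))
    flat = size * size * channels
    rows.append(("Flatten", (flat,), 0))
    rows.append((f"Dense({hidden})", (hidden,), dense_params(flat, hidden)))
    rows.append((f"Dense({classes})", (classes,), dense_params(hidden, classes)))
    return rows

layers = summarize()
for name, shape, params in layers:
    print(f"{name:<14} {str(shape):<18} {params:>12,}")
total_params = sum(params for _, _, params in layers)
print(f"{'Total':<14} {'':<18} {total_params:>12,}")
```

Under these assumptions nearly all of the parameters sit in the first `Dense` layer, because flattening a 37×37×64 feature map yields 87,616 inputs. `model.summary()` can be used to cross-check the figures.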

In [None]:
# Define optimizer with customized parameters
optimizer = tf.keras.optimizers.Adam(
    learning_rate=0.0002, # for second model, learning rate is reduced to 0.0002
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-07
)

# Compile model with additional metrics
model.compile(
    optimizer=optimizer,
    loss='categorical_crossentropy',
    metrics=[
        'accuracy',
        tf.keras.metrics.Precision(name='precision'),
        tf.keras.metrics.Recall(name='recall'),
        tf.keras.metrics.AUC(name='auc'),
        # F1Score reports one value per class by default; pass average='macro' for a single score
        tf.keras.metrics.F1Score(name='f1_score', threshold=0.5)
    ]
)
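
The precision, recall and F1 metrics passed to `compile` are derived from counts of true positives, false positives and false negatives. As a rough illustration of what a thresholded, per-class F1 score measures, here is a plain-Python sketch. The function name and the toy predictions are made up, and the Keras implementation differs in details such as numerical smoothing.

```python
# Per-class F1 from softmax outputs, using the same idea as a 0.5 threshold:
# a class counts as "predicted" only if its probability exceeds the threshold.

def per_class_f1(y_true, y_prob, num_classes, threshold=0.5):
    scores = []
    for c in range(num_classes):
        tp = fp = fn = 0
        for label, probs in zip(y_true, y_prob):
            predicted = probs[c] > threshold
            actual = label == c
            if predicted and actual:
                tp += 1
            elif predicted and not actual:
                fp += 1
            elif actual and not predicted:
                fn += 1
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores

# Toy data (hypothetical): integer labels and softmax-like probabilities for 4 classes
y_true = [0, 0, 1, 2, 3, 3]
y_prob = [
    [0.90, 0.05, 0.03, 0.02],  # correct and confident
    [0.40, 0.30, 0.20, 0.10],  # no class exceeds 0.5, so this is a miss for class 0
    [0.10, 0.80, 0.05, 0.05],  # correct
    [0.60, 0.10, 0.20, 0.10],  # class 2 page predicted as class 0
    [0.05, 0.05, 0.10, 0.80],  # correct
    [0.02, 0.03, 0.05, 0.90],  # correct
]

f1 = per_class_f1(y_true, y_prob, num_classes=4)
macro_f1 = sum(f1) / len(f1)
print(f1, macro_f1)
```

The second toy sample shows why a fixed threshold matters for a softmax classifier: a page whose highest probability is below 0.5 is counted as a miss even though `argmax` would have picked the right class.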

In [None]:
# Normalize the data (similar to rescale=1./255 in ImageDataGenerator)
normalization_layer = tf.keras.layers.Rescaling(1./255)

#### Create training and validation datasets

In [None]:
# Create training dataset
train_ds = tf.keras.utils.image_dataset_from_directory(
    train_data_dir,
    validation_split=0.2,
    subset='training',
    seed=123,
    image_size=(img_height, img_width),
    batch_size=batch_size,
    label_mode='categorical'
)

# Create validation dataset
val_ds = tf.keras.utils.image_dataset_from_directory(
    train_data_dir,
    validation_split=0.2,
    subset='validation',
    seed=123,
    image_size=(img_height, img_width),
    batch_size=batch_size,
    label_mode='categorical'
)

# Apply normalization to the datasets
train_ds = train_ds.map(lambda x, y: (normalization_layer(x), y))
val_ds = val_ds.map(lambda x, y: (normalization_layer(x), y))

# Optimize the datasets for performance
AUTOTUNE = tf.data.AUTOTUNE
train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)
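
Both dataset calls read the same directory with `validation_split=0.2`, and the shared `seed` keeps the two subsets disjoint. The arithmetic below sketches how many images and batches each subset ends up with. The image count of 1,000 is a made-up figure, and the rounding rule (validation size truncated to an integer) is an assumption about the Keras utility's behaviour.

```python
import math

def split_sizes(num_images, validation_split=0.2, batch_size=8):
    """Return image and batch counts for the training and validation subsets."""
    num_val = int(validation_split * num_images)  # assumed: truncated, not rounded
    num_train = num_images - num_val
    return {
        "train_images": num_train,
        "val_images": num_val,
        "train_batches": math.ceil(num_train / batch_size),  # last batch may be partial
        "val_batches": math.ceil(num_val / batch_size),
    }

sizes = split_sizes(1000)  # hypothetical dataset of 1,000 page images
print(sizes)
```

With a batch size of 8, one epoch over such a dataset would take 100 training steps, which helps when estimating how long 200 epochs might run.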

#### Train the model

In [None]:
callbacks = [
    ModelCheckpoint('best_model_v2.keras', save_best_only=True, monitor='val_accuracy'), # saves best model during training
    EarlyStopping(monitor='val_accuracy', patience=20)
]

model.fit(
    train_ds,
    epochs=epochs,
    validation_data=val_ds,
    callbacks=callbacks
)
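
The interplay of the two callbacks can be hard to picture: `ModelCheckpoint` keeps the best epoch on disk, while `EarlyStopping` ends training once `val_accuracy` has not improved for `patience` epochs. The following plain-Python simulation is a simplified sketch of that logic (it ignores options such as `min_delta`), run on an invented accuracy curve with a short patience for readability.

```python
def simulate_early_stopping(val_accuracies, patience):
    """Return (best_epoch, best_value, stopped_epoch) using 0-based epoch indices.

    stopped_epoch is None if training runs through all epochs.
    """
    best_value = float("-inf")
    best_epoch = None
    wait = 0
    for epoch, value in enumerate(val_accuracies):
        if value > best_value:
            # Improvement: this is where a checkpoint would be written
            best_value, best_epoch, wait = value, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return best_epoch, best_value, epoch
    return best_epoch, best_value, None

# Invented validation accuracies for illustration only
history = [0.50, 0.60, 0.70, 0.65, 0.69, 0.70, 0.68]
result = simulate_early_stopping(history, patience=3)
print(result)
```

In this toy run the best epoch is index 2 but training only stops at index 5, so the weights held in memory at the end are three epochs past the best checkpoint. The same gap can occur with `patience=20`, which is why the checkpoint file and the model in memory may differ.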

#### Save the model

In [None]:
# Save the final model (weights from the last epoch; the best checkpoint is in 'best_model_v2.keras')
# To save only the weights, use model.save_weights('final_model.weights.h5')
model.save('final_model.keras')
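
When the saved model is later used for prediction, its softmax output has to be mapped back to class names. `image_dataset_from_directory` assigns label indices by sorting the class subdirectory names alphanumerically, so the index order need not match the order in which the classes are listed in the introduction. The sketch below assumes the subdirectories are named exactly `Table`, `TextAndTable`, `Text` and `Title` (an assumption; the `class_names` attribute of the freshly created dataset gives the actual order) and uses a made-up probability vector.

```python
# Assumed subdirectory names, in the order the classes are listed in the introduction
listed_classes = ["Table", "TextAndTable", "Text", "Title"]

# Label indices follow the sorted directory names, not the listed order
class_names = sorted(listed_classes)

def decode_prediction(probabilities, names):
    """Return the most probable class name and its probability."""
    best_index = max(range(len(probabilities)), key=probabilities.__getitem__)
    return names[best_index], probabilities[best_index]

# Hypothetical softmax output for one page
probs = [0.05, 0.10, 0.80, 0.05]
label, confidence = decode_prediction(probs, class_names)
print(class_names)
print(label, confidence)
```

Under this naming assumption, index 1 is `Text` and index 2 is `TextAndTable`, the reverse of the order in the introduction, so decoding with the listed order would silently swap those two classes.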