## LUISS - Project 1: The Librarian from Alexandria - Group 12

Gabriele De Ieso, Denise Di Franza and Alessia Tonicello

## Objective
As the newly appointed librarian of the Great Library of Alexandria, your task is to classify ancient digitized texts based on their fonts. This project requires developing a neural network 
model capable of accurately categorizing different writing styles to streamline the digital archiving process. 


Dataset Features:
* 1256 scanned historical texts, each labeled with the font used.
* Documents originate from various years and contain different writing styles and fonts. 
* Requires cleaning, augmentation, and structured labeling. 



Assignment 
1. Perform Exploratory Data Analysis (EDA) 

* Understand the dataset structure. 
* Implement preprocessing steps (grayscale conversion, binarization, noise reduction). 

2. Data Augmentation 

* Apply transformation techniques to enrich training data.


3. Model Development 

* Define and optimize a Neural Network topology for classification. 
* Test different architectures to determine the most effective model. 


4. Evaluation & Insights 

* Assess performance using appropriate classification metrics. 
* Provide a detailed report explaining model choices and key findings. 

# PHASE 0: IMPORT

## Library Imports

In this code block, we import all the necessary libraries and modules required for the project. Here's a breakdown of what each part does:

---

### Standard Libraries
- `os`, `random`, and `pathlib.Path` are used for file system navigation, random operations, and path handling.
- `time` is used to measure execution time and track performance.

---

### Data Manipulation and Scientific Computing
- `numpy` and `pandas` provide powerful tools for numerical operations and structured data manipulation.

---

### Data Visualization
- `matplotlib.pyplot` and `matplotlib.gridspec` are used to create plots and define complex grid-based layouts.
- `seaborn` adds a high-level interface for statistical graphics.
- `tqdm.auto.tqdm` is used to display progress bars for loops, especially helpful during model training or data processing.
- `IPython.display.display` and `HTML` are used to improve notebook visualization and ensure that output areas dynamically adjust height when displaying large content.

---

### Image Processing
- `PIL.Image` and `ImageFile` handle image loading and manipulation. Setting `Image.MAX_IMAGE_PIXELS = None` disables Pillow's pixel-count limit (its decompression-bomb check) so that very large scans load without errors; this should only be done with trusted files.

---

### PyTorch Core Libraries
- `torch`, `torch.nn`, and `torch.optim` form the core of the PyTorch framework, providing tensor operations, neural network components, and optimization algorithms.
- `torch.nn.functional` is used for functions like activation functions and loss computations that are not classes.
- `torch.utils.data.Dataset`, `DataLoader`, `Subset`, and `random_split` are essential for data handling, batching, and creating custom datasets.

- `torchvision.transforms` includes tools for data preprocessing and augmentation.
- `torchvision.models` provides access to pre-trained models, such as `mobilenet_v2`, which we use for transfer learning.
- `torch.cuda.amp.autocast` and `GradScaler` enable mixed precision training, speeding up computations on compatible hardware.
- `torch.optim.lr_scheduler.ReduceLROnPlateau` automatically lowers the learning rate when validation metrics stagnate, improving training stability.

---

### Model Evaluation and Validation
- `sklearn.metrics.classification_report` and `confusion_matrix` help evaluate model performance on classification tasks.
- `sklearn.model_selection.StratifiedKFold` performs stratified cross-validation to ensure that each fold preserves the class distribution.
- `sklearn.utils.class_weight.compute_class_weight` is used to compute class weights for imbalanced datasets, ensuring that the model doesn't become biased toward more frequent classes.



---

### Other Utilities
- `tabulate` is a utility that formats tabular data nicely, especially useful for printing training results or summary statistics in a readable format.
- `warnings` is used to suppress specific user warnings that may clutter notebook output.

---

### Summary
We’re setting up our environment by importing all the essential tools required for image preprocessing, model building, training, evaluation, and visualization. This foundational block ensures that our project runs efficiently, cleanly, and reproducibly across experiments.


In [None]:
# Standard Libraries
import os
import random
from pathlib import Path  
import time

# Data Analysis & Visualization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns

# Torch & TorchVision
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader, Subset, random_split
from torchvision import transforms, models
from torchvision.models import mobilenet_v2

# Mixed Precision & LR Scheduler
from torch.cuda.amp import autocast, GradScaler
from torch.optim.lr_scheduler import ReduceLROnPlateau

# Image Processing
from PIL import Image, ImageFile
Image.MAX_IMAGE_PIXELS = None  # Disable Pillow's pixel-count (decompression-bomb) check for very large scans

# Scikit-learn
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.utils.class_weight import compute_class_weight

# Utility
import warnings
from tqdm.auto import tqdm
from IPython.display import display, HTML
display(HTML("<style>.output_wrapper, .output { height:auto !important; max-height:100000px; }</style>"))
warnings.filterwarnings('ignore', category=UserWarning)

# Tabulate (for neatly formatted tables)
import tabulate as tabulate


What we are doing and why:

Setting the Random Seed:
We set a fixed SEED value (42) for the random, numpy, and torch libraries. This makes data shuffling, weight initialization, and other stochastic operations repeatable from run to run on the same hardware and software setup, which is essential for debugging and for comparing model performance across experiments.

GPU Determinism Settings:
- If a CUDA-enabled GPU is available, we set the manual seed for all CUDA devices.
- torch.backends.cudnn.deterministic = True makes cuDNN select deterministic algorithms (this may reduce performance).
- torch.backends.cudnn.benchmark = False disables the cuDNN auto-tuner, which helps reproducibility but may slow down training slightly.

Device Selection:
We detect whether a GPU is available and assign device accordingly. This allows us to move our models and data to either the CPU or the GPU later on.

Output:
The output is a single line indicating the selected device: cuda if a GPU is available, cpu otherwise.
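The effect of a fixed seed can be checked with the standard library alone. The sketch below is only an illustration: `draw_numbers` is a hypothetical helper that is not part of the notebook, and it uses a private `random.Random` generator rather than the global state seeded in the next cell.

```python
import random


def draw_numbers(seed: int, n: int = 5) -> list[float]:
    """Return n pseudo-random floats from a generator seeded with `seed`."""
    rng = random.Random(seed)  # independent generator; global state is untouched
    return [rng.random() for _ in range(n)]


first_run = draw_numbers(42)
second_run = draw_numbers(42)
other_seed = draw_numbers(43)

print(first_run == second_run)  # same seed: identical sequence
print(first_run == other_seed)  # different seed: different sequence
```

The same principle applies to numpy and torch: seeding each library once at the top of the notebook fixes the sequence of random draws that follows.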

In [None]:
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

In this step, we define the paths to our dataset components: the image directory and the metadata CSV file. These two elements are essential for structuring our data pipeline.
We start by setting a base_dir, which points to the main folder containing both the images and the CSV file. From this base directory, we then construct the full paths to:
- **img_dir**, the subfolder containing all the image files, each of which presumably represents a page with text in a specific font;
- **csv_path**, the CSV file that links each image to its corresponding font label.

Using Python’s Path object from the pathlib module makes path handling cleaner and more robust than relying on raw strings or concatenation. It also makes the code easier to maintain if we move the dataset or deploy the project elsewhere; note that base_dir is an absolute path on the author's machine and must be adapted to the local environment.
At this point, we are simply preparing the variables we will need when loading and processing the data. No computation happens here yet.

In [None]:
base_dir = Path(r"C:\Users\denis\Downloads\TheLibrarianFromAlexandria")
img_dir = base_dir / "img"
csv_path = base_dir / "pages.csv"

In this step, we load and verify the consistency of our dataset annotations. The pages.csv file, which associates each image file with a font label, is read into a DataFrame without headers, and we assign the column names "image_name" and "font" for clarity. To avoid potential ordering biases during training, we shuffle the dataset using random_state=42, ensuring that this operation is fully reproducible across runs.

Next, we perform a thorough consistency check to ensure that all image paths listed in the CSV actually correspond to existing image files in the img folder. Since each image path in the CSV includes a "img\\" prefix that doesn’t match the folder structure, we remove this prefix when checking for file existence. If any image file is missing, we store it in a list, print a warning message showing how many are absent, and drop them from the dataset to prevent runtime errors later.

To cross-validate this check, we apply an alternative method: we add a boolean exists column to the DataFrame by testing whether each image file is present using os.path.exists. Finally, we filter the rows where the image does not exist and print how many are missing according to this second method.

In our case, both checks agree: the output reports zero missing images, so every file listed in the CSV is present. We can therefore proceed with training and evaluation without running into broken links or file-not-found errors.
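One caveat about the prefix handling described above: `os.path.basename` only splits on backslashes when the code runs on Windows, so the second check would not strip the `img\` prefix on Linux or macOS. A minimal, OS-independent sketch is shown here; `csv_entry_to_filename` and the file name `page_0001.jpg` are hypothetical and used only for illustration.

```python
from pathlib import PureWindowsPath


def csv_entry_to_filename(entry: str) -> str:
    """Strip any directory prefix from a CSV path entry, whatever the separator."""
    # PureWindowsPath treats both '\\' and '/' as separators on every OS.
    return PureWindowsPath(entry).name


print(csv_entry_to_filename(r"img\page_0001.jpg"))  # Windows-style entry
print(csv_entry_to_filename("img/page_0001.jpg"))   # POSIX-style entry
```

Both calls yield the bare file name, which can then be joined with `img_dir` regardless of the platform the notebook runs on.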

In [None]:
try:
    df = pd.read_csv(csv_path, header=None, names=["image_name", "font"])
    df = df.sample(frac=1, random_state=SEED).reset_index(drop=True)
    missing_images = []
    for img_path in df['image_name']:
        if not (img_dir / img_path.replace('img\\', '')).exists():
            missing_images.append(img_path)
    if missing_images:
        print(f"Warning: {len(missing_images)} missing images!")
        df = df[~df['image_name'].isin(missing_images)]

except FileNotFoundError as e:
    raise SystemExit(f"pages.csv not found: {e}")

df['exists'] = df['image_name'].apply(lambda x: os.path.exists(os.path.join(img_dir, os.path.basename(x))))
missing_files = df[~df['exists']]
print(f"Missing images: {len(missing_files)}")

Before training our classification model, we need to convert the font names, which are currently strings, into numerical values that PyTorch can handle. This process, known as label encoding, transforms categorical variables into a format that neural networks can work with.
To achieve this, we first extract and sort all the unique font names found in the dataset. We then build a dictionary called font_to_label, where each font is mapped to a unique integer ID, starting from 0. Sorting the names makes the mapping deterministic, which matters for reproducibility and for interpreting the model’s output.
Using the .map() function, we apply this dictionary to the "font" column, creating a new column called "label" that contains the numeric class associated with each image.
To confirm that the mapping was performed correctly, we print the full list of font-to-label associations. The output shows the following mapping:
- Font: augustus -> Label: 0
- Font: aureus -> Label: 1
- Font: cicero -> Label: 2
- Font: colosseum -> Label: 3
- Font: consul -> Label: 4
- Font: forum -> Label: 5
- Font: laurel -> Label: 6
- Font: roman -> Label: 7
- Font: senatus -> Label: 8
- Font: trajan -> Label: 9
- Font: vesta -> Label: 10
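When model predictions need to be read back as font names, the mapping has to be inverted. The sketch below rebuilds the mapping from the eleven font names listed above and adds a `label_to_font` dictionary; that name and the `predicted_ids` values are assumptions made for illustration, not part of the notebook.

```python
# Font names as listed in the mapping above
font_names = ["augustus", "aureus", "cicero", "colosseum", "consul", "forum",
              "laurel", "roman", "senatus", "trajan", "vesta"]

# Forward mapping: sorted names -> consecutive integer IDs starting at 0
font_to_label = {font: idx for idx, font in enumerate(sorted(font_names))}

# Inverse mapping (hypothetical helper): integer ID -> font name
label_to_font = {idx: font for font, idx in font_to_label.items()}

predicted_ids = [7, 0, 10]  # made-up class IDs standing in for model output
decoded = [label_to_font[i] for i in predicted_ids]
print(decoded)  # ['roman', 'augustus', 'vesta']
```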

In [None]:
font_names = sorted(df['font'].unique())
font_to_label   = {font: idx for idx, font in enumerate(font_names)}
df['label'] = df['font'].map(font_to_label)
print("Mapping of font to label:")
for font, label in font_to_label.items():
    print(f"Font: {font} -> Label: {label}")

# PHASE 1: EXPLORATORY DATA ANALYSIS (EDA)

To get a clearer understanding of our dataset after label encoding, we display the first 20 rows of the DataFrame. Each row corresponds to a single image sample, and the columns include:
- **image_name**: the filename of the image,
- **font**: the original font name (as a string),
- **exists**: a boolean indicating whether the image file is present on disk (useful for debugging),
- **label**: the integer class ID assigned through label encoding.
This preview allows us to verify that the label encoding was correctly applied and that each image has an associated numerical label.

We also print two key pieces of information:
- **The number of unique fonts**, which corresponds to the number of classes the model will learn to predict (11)
- **The total number of rows in the dataset**, which tells us how many samples are available for training and evaluation (1256)
This sanity check ensures that our dataset is correctly structured before proceeding to create PyTorch datasets and dataloaders.

In [None]:
display(df.head(20))
print(" ")
print("Number of unique fonts:", len(font_to_label))
print("Number of rows in the dataset:", len(df))
print(df.columns)


Before feeding the data into our model, we perform a quick visual inspection of the font classes. We group the dataset by the font column and, for each group, display the first image associated with that font. This gives us a qualitative impression of the dataset's visual diversity and lets us check that the images are correctly labeled.
The sample images are arranged in a 3x4 grid, so that each subplot shows one representative image per font with the font name as its title. Next, we check the dimensions of the images in the dataset. By extracting the width and height of each image, we compute the average dimensions: the average width is about 4777 px and the average height about 4827 px. The images are therefore high-resolution and need to be resized or cropped to match the input requirements of the model, which expects a much smaller, fixed image size.

In [None]:
fig = plt.figure(figsize=(15, 12))
gs = gridspec.GridSpec(3, 4, figure=fig) 
for idx, (font, sub_df) in enumerate(df.groupby('font')):
    ax = fig.add_subplot(gs[idx])
    img_path = os.path.join(img_dir, sub_df.iloc[0]['image_name'].split('\\')[-1])
    img = Image.open(img_path)
    ax.imshow(img, cmap='gray')
    ax.set_title(font, fontsize=8)
    ax.axis('off')
plt.tight_layout()
plt.show()

sizes = [Image.open(os.path.join(img_dir, f.split('\\')[-1])).size for f in df['image_name']]
widths, heights = zip(*sizes)
print(f"Image dimensions: Avg W={np.mean(widths)}, Avg H={np.mean(heights)}")

Here, we assess the distribution of the fonts in our dataset. Knowing how many samples exist for each font is crucial, because an imbalanced dataset can lead to a biased model. We start by counting the occurrences of each font with the value_counts() method, which gives the absolute number of images per font. To put these counts in perspective, we also compute the percentage of the dataset that each font represents (the count of each font divided by the total number of samples, multiplied by 100).

We then collect both quantities in a table that shows the absolute count and the percentage for each font, which gives a clear numerical overview of how the fonts are distributed across the dataset.
Next, we visualize the distribution with a horizontal bar plot. Each bar represents a font, and its length indicates how many samples belong to that font. To make the plot easier to read, we annotate each bar with both the count and the percentage of samples.

The output of this analysis includes two main results:
- A table that shows the font counts and percentages;
- A bar chart that visually presents the distribution of fonts, with counts and percentages labeled on each bar.

This analysis helps us quickly identify whether some fonts are overrepresented or underrepresented. If imbalances are found, we can take appropriate steps, such as rebalancing the dataset or weighting the classes, so that the model has a fair chance to learn all fonts.

In [None]:
font_counts = df["font"].value_counts()
font_distribution = ((font_counts / font_counts.sum()) * 100).round(2)

distribution_df = pd.DataFrame({
    'Count': font_counts,
    'Percentage (%)': font_distribution
})

print("\nDistribution for each font:")
print(distribution_df.to_string(formatters={'Percentage (%)': '{:.2f}%'.format}))

plt.figure(figsize=(12, 8))
ax = sns.countplot(
    data=df, 
    y='font', 
    order=font_counts.index,
    palette='viridis',
    edgecolor='black'
)
for i, (value, name) in enumerate(zip(font_counts.values, font_counts.index)):
    ax.text(
        value + font_counts.max()*0.02,
        i,
        f'{value}\n({font_distribution[name]}%)',  
        va='center',
        ha='left',
        fontsize=9,
        color='black'
    )

plt.title('Font distribution in the dataset', fontsize=14, pad=20)
plt.xlabel('Number of samples', fontsize=12)
plt.ylabel('Font', fontsize=12)
plt.xlim(0, font_counts.max() * 1.2)
plt.grid(axis='x', linestyle='--', alpha=0.7)
sns.despine()
plt.tight_layout()
plt.show()

*Main Findings from EDA:*
- Font Distribution: The dataset contains 11 distinct fonts with varying frequencies. The three most frequent fonts are aureus, cicero, and roman: aureus has the highest count at 142 images (approximately 11.31% of the dataset), followed by cicero with 136 images (10.83%) and roman with 130 images (10.35%). The remaining fonts have fewer occurrences, with forum having the least, at 86 images (6.85%).
- Image Dimensions: The images in the dataset are generally large, with an average width of 4777 pixels and an average height of 4827 pixels. They therefore need to be resized or cropped to a uniform size before being fed into a model, since such large dimensions lead to high computational costs.
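The imbalance can be turned into per-class weights for the loss function. The sketch below uses the common "balanced" heuristic, weight = n_samples / (n_classes * class_count), which to my understanding is the formula behind scikit-learn's `class_weight="balanced"` option. The function name `balanced_class_weights` is hypothetical, and the example uses only the four counts quoted above as a toy subset, so the resulting weights are not those of the full 11-class dataset.

```python
def balanced_class_weights(counts: dict[str, int]) -> dict[str, float]:
    """Weight each class inversely to its frequency ("balanced" heuristic)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {name: n_samples / (n_classes * count) for name, count in counts.items()}


# Toy subset: only the four counts reported in the findings above
example_counts = {"aureus": 142, "cicero": 136, "roman": 130, "forum": 86}

for font, weight in balanced_class_weights(example_counts).items():
    print(f"{font}: {weight:.3f}")
```

Rare classes such as forum receive weights above 1 and frequent ones below 1, so that each class contributes roughly equally to a weighted loss.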

In [None]:
IMG_DIR = img_dir
CSV_PATH = csv_path
TARGET_SIZE = (256, 256)

# PHASES 1.1 PRE-PROCESSING + 1.2 DATA AUGMENTATION

In this part of the analysis, we preprocess the images so that regions containing text can be located and cropped reliably, even when a scan is noisy or has low-contrast areas:
1. **Grayscale Conversion:**
We start by converting the images to grayscale. This is a typical preprocessing step: color information is not needed to locate text, and a single channel reduces the complexity of the image data.

2. **Contrast Enhancement using CLAHE:**
Next, we can apply **Contrast Limited Adaptive Histogram Equalization (CLAHE)**, which improves the local contrast of the image. CLAHE is particularly useful for images with uneven lighting or with areas where text is hard to distinguish because of poor contrast. It divides the image into small tiles and adjusts the contrast in each tile individually, with a clip limit that prevents any area from being over-amplified. With CLAHE we expect improved visibility of text where the contrast between text and background is weak. In the code, this step is optional and disabled by default (use_clahe=False).

3. **Handling Double-Page Images:**
If an image is a double-page spread (i.e., it has a wide aspect ratio), we split it into two separate images: one for the left page and one for the right page. This lets us treat each page individually and avoids patches that straddle the gutter between the two pages. The decision is based on the width-to-height ratio: if the width exceeds 1.2 times the height, we split the image in half.

4. **Text Patch Extraction:**
For each page (whether split or not), we binarize the image in order to isolate text. **Adaptive thresholding** is used, meaning that the threshold is computed for each pixel from its local neighborhood, which is beneficial for images with varying lighting conditions.
We then extract small patches (255x255 pixels by default) with a sliding window that advances by half a patch. To identify patches that likely contain text, we compute the fraction of white pixels in each binarized patch; because the threshold is inverted, white pixels correspond to ink. A patch is kept as a text patch if this fraction lies within a given range (between 0.2 and 0.8 by default). If too few patches are found, we fall back to a patch taken from the center of the page, which is likely to contain text as well.
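To get a feeling for the cost of the sliding window, the number of candidate windows can be computed directly from the image size. This is a small arithmetic sketch under stated assumptions: `count_windows` is a hypothetical helper, the patch is square, and the stride is half the patch size as in the description above.

```python
def count_windows(height: int, width: int, patch: int = 255) -> int:
    """Number of sliding-window positions for a square patch and half-patch stride."""
    stride = patch // 2
    rows = len(range(0, height - patch + 1, stride))
    cols = len(range(0, width - patch + 1, stride))
    return rows * cols


# A page with the average dimensions found in the EDA (H=4827, W=4777)
print(count_windows(4827, 4777))  # 1332 candidate windows
# A page smaller than the patch yields no window, hence the center fallback
print(count_windows(200, 200))    # 0
```

Only the windows whose ink fraction falls in the accepted range are kept, so the number of returned patches is usually much smaller than the number of candidates.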

### **Expected Results:**
By the end of this process, we expect to have:
* **Grayscale images**, optionally with enhanced contrast, in which text is more visible and easier to locate;
* **Split images** for double-page spreads, so that each page is treated individually;
* **Text patches** covering the regions where text is present. These patches are the inputs used in the subsequent steps of the pipeline, which in this project is font classification.

In summary, this preprocessing workflow improves the quality of the images, especially those with low contrast or double-page spreads, and extracts small, relevant areas of text. The expected output is a set of cleaned image patches that contain text, ready for the next stage of the pipeline.


In [None]:
import cv2  # OpenCV: provides the CLAHE and thresholding functions used below

def convert_to_grayscale(img):
    return img.convert('L')

def apply_CLAHE(pil_img: Image.Image, clipLimit: float = 2.0, tileGridSize: tuple[int, int] = (8, 8)) -> Image.Image:
    gray = np.array(pil_img.convert('L'))
    clahe = cv2.createCLAHE(clipLimit=clipLimit, tileGridSize=tileGridSize)
    enhanced = clahe.apply(gray)
    return Image.fromarray(enhanced)


def split_double_page(pil_img, ratio_thr=1.2):
    w, h = pil_img.size
    if w / h > ratio_thr:
        mid = w // 2
        return [pil_img.crop((0, 0, mid, h)), pil_img.crop((mid, 0, w, h))]
    else:
        return [pil_img]

def extract_text_patches(image,
                         patch_size=(255, 255),
                         thresh_method='adaptive',
                         text_area_range=(0.2, 0.8),
                         denoise=True,  # NOTE: accepted but not used in the body below
                         use_clahe=False,
                         min_patches=1,
                         include_center=True):

    all_patches = []
    pages = split_double_page(image)

    for page in pages:
        if use_clahe:
            gray = np.array(apply_CLAHE(page))
        else:
            gray = np.array(page.convert('L'))
        
        if thresh_method == 'adaptive':
            bin_img = cv2.adaptiveThreshold(
                gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                cv2.THRESH_BINARY_INV, blockSize=21, C=5
            )
        else:
            _, bin_img = cv2.threshold(gray, 80, 255, cv2.THRESH_BINARY_INV)
        
        
        H, W = bin_img.shape
        ph, pw = patch_size
        stride_h, stride_w = ph // 2, pw // 2

        for y in range(0, H - ph + 1, stride_h):
            for x in range(0, W - pw + 1, stride_w):
                sub = bin_img[y:y+ph, x:x+pw]
                frac = sub.sum() / (255 * ph * pw)
                if text_area_range[0] < frac < text_area_range[1]:
                    all_patches.append(page.crop((x, y, x+pw, y+ph)))
        
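        # Fallback: if too few patches were found, take a patch centered on the page
        if include_center and len(all_patches) < min_patches:
            # Compute the top-left corner so that the crop is exactly pw x ph
            left = max((W - pw) // 2, 0)
            top = max((H - ph) // 2, 0)
            c_patch = page.crop((left, top, left + pw, top + ph))
            all_patches.append(c_patch)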
    
    return all_patches


In this part of the code, we define the transformations applied to the images during the training and validation phases. These transformations are crucial for preparing the dataset, augmenting it for the training process, and normalizing it for better performance in machine learning models.
1. **Resizing:**
Images in both the training and validation sets are resized to a target size of **128x128 pixels**. This standardizes the inputs so that they all have the same dimensions, which is essential for neural networks that require a fixed input size.

2. **Grayscale Conversion:**
The images are converted to grayscale with a single channel. This removes color information and keeps only pixel intensity, which is usually sufficient for text-related tasks where color carries little information.

3. **Data Augmentation (Training Set Only):**
   * **Random Horizontal Flip:** The images in the training set are flipped horizontally with a probability of 50%. This is intended to help the model generalize by providing more diverse variations of the same image.
   * **Random Affine Transformation:** Random affine transformations are applied, including rotation (up to 10 degrees), translation (up to 10% of the image size), and scaling (up to ±10%). These transformations simulate slight variations in the position and orientation of the text, which makes the model more robust to real-world variations.
   * **Random Perspective:** With a probability of 50%, a perspective distortion with a scale of 0.2 is applied, meaning the image is slightly distorted as if viewed from a different angle. This helps the model cope with pages that were not scanned perfectly flat.
   * **Random Erasing (Optional):** **Random Erasing** randomly blanks out parts of the image. This simulates occlusions and forces the model to rely on the remaining visible parts. It is not part of the current pipeline and can be added if further augmentation is needed.

4. **Tensor Conversion and Normalization:**
   * The **ToTensor** transformation converts the image into a tensor with values in [0, 1], the format that deep learning models in PyTorch work with.
   * **Normalization:** The images are then normalized with a mean and standard deviation of **0.5**, which rescales the pixel values to the range [-1, 1]. Zero-centered inputs of this kind generally help the network train more stably.

5. **Validation Transformations:**
   * The validation set undergoes fewer transformations than the training set: only resizing, grayscale conversion, tensor conversion, and normalization are applied. This keeps the validation images in the format expected by the model but without augmentation, since the model is not trained on the validation set.

**Expected Results:**

* **Training Set:** The images are augmented in various ways (flipping, rotation, scaling, etc.) to create diversity in the dataset. This helps prevent overfitting and allows the model to generalize better to new, unseen data.
* **Validation Set:** The validation images are simply resized, converted to grayscale, and normalized, ensuring that the model is evaluated on consistent, unaltered data.

By applying these transformations, the model should become more robust to variations in input images and able to generalize better to unseen data. The normalization step ensures that the model receives data in a form that it can effectively process, while the augmentations for training allow it to learn from a wider variety of image transformations.
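The normalization step is plain arithmetic: each value x in [0, 1] becomes (x - mean) / std. The following sketch reproduces it on scalars without PyTorch; `normalize` and `denormalize` are hypothetical helper names, the latter being useful when a normalized tensor has to be displayed as an image again.

```python
def normalize(x: float, mean: float = 0.5, std: float = 0.5) -> float:
    """Map a pixel value from [0, 1] to [-1, 1] when mean = std = 0.5."""
    return (x - mean) / std


def denormalize(y: float, mean: float = 0.5, std: float = 0.5) -> float:
    """Invert the normalization, e.g. before plotting a transformed patch."""
    return y * std + mean


for value in (0.0, 0.5, 1.0):
    print(value, "->", normalize(value))
# 0.0 -> -1.0
# 0.5 -> 0.0
# 1.0 -> 1.0
```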

In [None]:

TARGET_SIZE = (128, 128)

def get_transforms():
    train_transform = transforms.Compose([
        transforms.Resize(TARGET_SIZE),
        transforms.Grayscale(num_output_channels=1),  
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.RandomAffine(degrees=10, translate=(0.1, 0.1), scale=(0.9,1.1)),
        transforms.RandomPerspective(distortion_scale=0.2, p=0.5),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.5], std=[0.5]),
    ])
    val_transform = transforms.Compose([
        transforms.Resize(TARGET_SIZE),
        transforms.Grayscale(num_output_channels=1),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.5], std=[0.5]),
    ])
    return train_transform, val_transform

In this next cell, we manually test the extract_text_patches() function on a small subset of images to make sure it works correctly before integrating it into the full dataset pipeline. Specifically, we select the first 5 images from the dataset, load them, and apply the patch extraction function. We then save a few extracted patches per image to verify their visual quality and content.

We do this for several reasons:
- To verify that the image paths in the dataframe are correct and accessible;
- To inspect whether the patch extraction logic detects and crops meaningful areas of text;
- To generate visual debug samples that allow us to assess the quality of patch selection.

At the end of this cell, we expect the console to print:
- The full path of each loaded image and a confirmation that the file exists.
- The number of patches extracted per image.
- Confirmation messages for each patch saved.

Additionally, up to three image patches per sample image will be saved in the current working directory. These .png files can be inspected manually to confirm that they correspond to regions containing text — they will be named debug_idx{X}_patch{Y}.png.

In [None]:
from PIL import Image
import os

test_indices = list(range(5))

for idx in test_indices:
    filename = os.path.basename(df.loc[idx, 'image_name'])
    full_path = os.path.join(img_dir, filename)
    print(f"Loading   {full_path}   → exists? {os.path.exists(full_path)}")
    img = Image.open(full_path).convert('RGB')
    patches = extract_text_patches(
        img,
        include_center=True,
        min_patches=1
    )
    print(f"→ Index {idx}: found {len(patches)} patches")
    for j, p in enumerate(patches[:3]):
        out_name = f"debug_idx{idx}_patch{j}.png"
        p.save(out_name)
        print(f"   • {out_name} saved")

# PHASE 1.3 DATASET CLASSES

In this part of the code, we define a custom PyTorch dataset class called FontDataset, which is specifically designed for font classification tasks based on scanned document images. This dataset implementation allows the model to focus on localized regions of text, even in the presence of noise or double-page scans, and includes a robust mechanism for handling failed image loads or missing text regions.

Key Features:
- **Image Loading and Preprocessing:** The dataset receives a pandas DataFrame (df) containing image filenames and labels, and a root directory (img_dir) where the images are stored. Each image is loaded by filename and converted to RGB. If loading fails (e.g., corrupted or missing file), a fallback image (a black dummy patch of the specified size) is returned with a default label of -1.
- **Patch Extraction from Split Pages:** An image may be a double-page scan. To improve model focus, the image is first split into single pages using the split_double_page utility. From each page, relevant text patches are extracted with extract_text_patches, which looks for textual regions and returns sub-images of size patch_size; thanks to its center fallback, it yields at least one patch per image.
- **Fallback Mechanism (Center Crop):** If no valid patches are found (e.g., the image is blank, has no clear text region, or the text is cropped), the dataset performs a simple center crop of the original image. This ensures that the model still receives input data and that training can proceed without interruption.
- **Image Transformations (Optional):** If a transformation pipeline is provided (e.g., resizing, grayscale conversion, normalization), it is applied to the selected patch; otherwise, the raw patch is returned as a PIL image.
- **Label Assignment:** The label is retrieved from the DataFrame as an integer class ID and returned along with the image patch.
- **Data Visualization Method:** The class also includes a helper method visualize_samples() that shows the original image for a given index, the corresponding transformed patch that would be used for training or evaluation, and the font name and class label associated with each image. This is useful for debugging the dataset construction and verifying that patch extraction and transformations work as expected.

**Expected Outcomes**
Each call to `__getitem__` returns a cropped or extracted patch representing a text-bearing portion of the original document. Even for corrupt or low-quality images, the fallback system ensures consistent input dimensions and label structure. During training and validation, the model receives properly preprocessed image data, increasing robustness and reducing the risk of learning from irrelevant background noise or inconsistent formatting.
By using this dataset class, the training process becomes more stable, flexible, and focused on relevant features—especially important in font classification tasks where local textual texture and structure are key discriminators.
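
Because a failed load returns the sentinel label -1, and `nn.CrossEntropyLoss` only accepts class indices in the valid range or its `ignore_index` (which defaults to -100), such samples need to be filtered or ignored before the loss is computed. The following is a minimal sketch of one option, a custom `collate_fn` that drops them. The name `skip_failed_collate` is hypothetical and not part of the notebook, and the sketch assumes the dataset's transform returns tensors.

```python
import torch
from torch.utils.data import default_collate

FAILED_LABEL = -1  # sentinel returned by the dataset when an image cannot be loaded

def skip_failed_collate(batch):
    """Collate (image, label) pairs, dropping samples marked as failed."""
    kept = [(img, label) for img, label in batch if label != FAILED_LABEL]
    if not kept:
        return None  # the training loop must skip an empty batch
    return default_collate(kept)

# Toy batch: three fake 1x4x4 "patches", the second one marked as failed.
toy_batch = [
    (torch.zeros(1, 4, 4), 0),
    (torch.zeros(1, 4, 4), FAILED_LABEL),
    (torch.ones(1, 4, 4), 2),
]
imgs, labels = skip_failed_collate(toy_batch)
print(imgs.shape, labels)
```

It would be passed as `DataLoader(..., collate_fn=skip_failed_collate)`. An alternative is `nn.CrossEntropyLoss(ignore_index=-1)`, which keeps the batch size fixed but still runs the dummy image through the model.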

In [None]:
class FontDataset(Dataset):
    def __init__(self, df, img_dir, transform=None, patch_size=(512, 512)):
        self.df = df.reset_index(drop=True)
        self.img_dir = img_dir
        self.transform = transform
        self.image_paths = df['image_name'].apply(lambda x: os.path.basename(x)).tolist()
        self.patch_size = patch_size

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        img_path = os.path.join(self.img_dir, self.image_paths[idx])
        try:
            original_image = Image.open(img_path).convert('RGB')
        except Exception as e:
            print(f"Error loading {img_path}: {e}")
            # Fallback: black image
            dummy = Image.new('RGB', self.patch_size, color=(0,0,0))
            if self.transform:
                return self.transform(dummy), -1
            return dummy, -1

        patches = []
        for page in split_double_page(original_image):
            patches.extend(
                extract_text_patches(
                    page,
                    patch_size=self.patch_size,
                    include_center=True,
                    min_patches=1
                )
            )

        if not patches:
            img_patch = transforms.functional.center_crop(original_image, self.patch_size)
        else:
            img_patch = random.choice(patches)

        if self.transform:
            img_final = self.transform(img_patch)
        else:
            img_final = img_patch

        label = int(self.df.iloc[idx]['label'])
        return img_final, label

    def visualize_samples(self, indices, title):
        plt.figure(figsize=(15, 8))
        for i, idx in enumerate(indices):
            img_t, label = self.__getitem__(idx)
            font_name = self.df.iloc[idx]['font']

            plt.subplot(2, len(indices), i + 1)
            orig = Image.open(os.path.join(self.img_dir, self.image_paths[idx])).convert('RGB')
            plt.imshow(orig)
            plt.title(f"Original\n{font_name}", fontsize=8)
            plt.axis('off')

            plt.subplot(2, len(indices), i + 1 + len(indices))
            if isinstance(img_t, torch.Tensor):
                img_show = img_t.permute(1, 2, 0).numpy().squeeze()
            else:
                img_show = np.array(img_t)
            plt.imshow(img_show, cmap='gray' if img_show.ndim==2 else None)
            plt.title(f"Transformed\n{label}", fontsize=8)
            plt.axis('off')

        plt.suptitle(title, fontsize=14)
        plt.tight_layout()
        plt.show()


# PHASE 1.4 SPLITTING DATASETS

In this section, we orchestrate the final steps of dataset preparation: stratified splitting, transformation assignment, and visual inspection of samples at three stages—full dataset, training set, and test set.

First, we perform a stratified split using `StratifiedShuffleSplit` with one split, allocating 20% of the data to the test set and fixing the random seed for reproducibility. This ensures that each font class maintains its original proportion in both the training and test subsets. We then call `get_transforms()` to retrieve two transformation pipelines: one for training, which includes data augmentations, and a simpler one for validation/testing.
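
To illustrate what "stratified" means here, the sketch below applies the same splitter to made-up labels with a 60/30/10 class imbalance and counts the classes on each side. The labels and the seed are placeholders, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical imbalanced labels: 60 / 30 / 10 samples for three classes.
labels = np.array([0] * 60 + [1] * 30 + [2] * 10)

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(sss.split(np.zeros(len(labels)), labels))

# Each class keeps its 60/30/10 proportion on both sides of the split.
train_counts = np.bincount(labels[train_idx])
test_counts = np.bincount(labels[test_idx])
print("train:", train_counts, "test:", test_counts)
```

Which individual samples land in each subset depends on the seed, but the per-class counts follow the class proportions.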

Next, we instantiate three `FontDataset` objects. The first wraps the entire DataFrame so we can visualize five random examples from the complete dataset, giving an initial sense of the raw images and labels. We then create separate DataFrames for the training and test splits (resetting their indices for convenience) and instantiate two more datasets, each with its own transform: the training dataset receives the augmentation pipeline, while the test dataset receives only resizing and normalization.

Finally, we call `visualize_samples()` on each dataset: first on the full dataset, then on the training set, and lastly on the test set. In each case, the method displays a grid of five images where the top row shows the original scanned images and the bottom row shows the transformed patches that will actually be fed to the model. By comparing these visuals side by side, we can quickly check that augmentations are applied only to the training data and that normalization and resizing are correctly applied to the test data before model evaluation. Class balance cannot be judged from five images; it is checked with the split statistics later in this section.

The **output** of this section consists of three distinct visual inspection blocks:
- "Dataset examples complete":
A preview of five random samples from the full dataset before any transformations. This provides a baseline view of the raw data and confirms that the dataset is correctly loaded and indexed.
- "Training Set - Post":
A preview of the first five images from the training set after applying the training transformations. These images are expected to show a variety of augmentations (e.g., flipping, distortion, or perspective changes), verifying that data augmentation is active and correctly configured.
- "Test Set - Post":
A preview of the first five images from the test set after applying the validation transformations. These images should appear more uniform, since no augmentations are applied, only resizing, grayscaling, and normalization.

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=SEED)
train_idx, test_idx = next(sss.split(np.arange(len(df)), df['label']))
train_transform, val_transform = get_transforms()
full_dataset = FontDataset(df, IMG_DIR)
sample_indices = np.random.choice(len(full_dataset), 5, replace=False)
full_dataset.visualize_samples(sample_indices, "Dataset examples complete")
train_df = df.iloc[train_idx].reset_index(drop=True)
test_df = df.iloc[test_idx].reset_index(drop=True)

train_dataset = FontDataset(train_df, IMG_DIR, transform=train_transform)
test_dataset = FontDataset(test_df, IMG_DIR, transform=val_transform)

print("\n TRAINING SET")
train_dataset.visualize_samples(range(5), "Training Set - Post")
print("\n TEST SET")
test_dataset.visualize_samples(range(5), "Test Set - Post")

This section evaluates how our dataset is distributed across the splits. To do this, we use the function `get_split_stats`, which analyzes the composition of each subset by examining the frequency of the font classes.

We start by extracting the relevant portion of the dataset using the indices of the split (training or test set) and counting the total number of observations in it. To understand how each font is represented, we iterate through the unique font classes. For each font, we count how many observations exist in the entire dataset and how many fall within the current split. We then compute two percentages: the share of that font's samples that ended up in the split (`% x Font`) and the font's share of the split itself (`% x Split`).

These statistics are organized into a pandas DataFrame and printed as a formatted table using the `tabulate` library, under section headers that show the split name and its total number of observations.

The output provides a snapshot of how well the partitions maintain proportionality across font classes. Reviewing these statistics confirms that the splits are balanced and that no font dominates the training or test set disproportionately, which gives us confidence in the dataset's suitability for model training and validation.
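
As a cross-check on the two percentages, the same quantities can also be computed without an explicit loop. This is only a sketch on toy data: `split_stats_vectorized` and its column names are hypothetical and not part of the notebook.

```python
import pandas as pd

def split_stats_vectorized(full_df, split_indices, col="font"):
    """Per-class counts and percentages for one split, without an explicit loop."""
    total = full_df[col].value_counts()
    in_split = (
        full_df.iloc[split_indices][col]
        .value_counts()
        .reindex(total.index, fill_value=0)  # keep classes absent from the split
    )
    return pd.DataFrame({
        "n_split": in_split,
        "pct_of_font": (in_split / total * 100).round(1),
        "pct_of_split": (in_split / in_split.sum() * 100).round(1),
    })

# Toy data: 4 samples of font "a", 2 of font "b"; the split holds 3 "a" and 1 "b".
toy_df = pd.DataFrame({"font": ["a", "a", "a", "a", "b", "b"]})
stats = split_stats_vectorized(toy_df, [0, 1, 2, 4])
print(stats)
```

Here font "a" has 75% of its samples in the split and makes up 75% of the split, while font "b" has 50% of its samples in the split and makes up 25% of it.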

In [None]:
#pip install tabulate
from tabulate import tabulate

def get_split_stats(full_df, split_indices, split_name):
    split_df = full_df.iloc[split_indices]
    total_samples = len(split_df)
    
    stats = []
    for font in full_df['font'].unique():
        font_total = len(full_df[full_df['font'] == font])
        font_split = len(split_df[split_df['font'] == font])
        
        stats.append({
            'Font': font,
            'N° Observations': font_split,
            '% x Font': f"{(font_split / font_total * 100):.1f}%",
            '% x Split': f"{(font_split / total_samples * 100):.1f}%"
        })
    
    stats_df = pd.DataFrame(stats)
    
    print(f"\n{'='*50}")
    print(f"{split_name.upper()} SET STATISTICS")
    print(f"Total number of observations: {total_samples}")
    print(f"{'='*50}")
    
    print(tabulate(stats_df, headers='keys', tablefmt='pretty', showindex=False))
    print(f"{'='*50}\n")
    
    return stats_df

Once we have split the dataset and created the training subset, the next step is to make the model account for class imbalance. Some font classes may be underrepresented, leading the model to favor more frequent categories. To mitigate this, we compute class weights that assign higher importance to less frequent classes, balancing their impact during training.

We start by extracting the training subset from our main DataFrame using `train_idx`. Then we use `compute_class_weight` from `sklearn.utils.class_weight` to calculate weights inversely proportional to class frequencies. The `balanced` mode gives rarer fonts higher weights, making their contribution comparable to that of more common ones.

With these weights computed, we build a dictionary that maps each font to its weight, so the weight of any font can be looked up directly. Finally, we convert the weights into a PyTorch tensor for use with the model. The tensor is ordered according to `font_names`, which should follow the numerical label order set earlier in the notebook.

This approach helps keep the model from disproportionately favoring dominant font classes, improving its ability to generalize across all categories.
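
For reference, scikit-learn's `balanced` mode computes each weight as `n_samples / (n_classes * count_c)`. The sketch below checks this on made-up labels; the class names and counts are placeholders.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical training labels: 6 / 3 / 1 samples for three fonts.
y = np.array(["a"] * 6 + ["b"] * 3 + ["c"] * 1)
classes = np.unique(y)

weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# Same quantity by hand: n_samples / (n_classes * count_c)
counts = np.array([(y == c).sum() for c in classes])
manual = len(y) / (len(classes) * counts)

print(dict(zip(classes, weights.round(3))))
```

The rarest class "c" receives a weight six times larger than the most frequent class "a", mirroring the 6:1 ratio of their counts.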

In [None]:
train_df = df.iloc[train_idx].reset_index(drop=True)
classes = train_df['font'].unique()
weights = compute_class_weight(
    class_weight='balanced',
    classes=classes,
    y=train_df['font']
)
font2weight = {font: w for font, w in zip(classes, weights)}
weight_list = [font2weight[f] for f in font_names]  
class_weights = torch.tensor(weight_list, dtype=torch.float).to(device)

This command calls `get_split_stats` to generate and display statistics for the training set. It extracts the subset of data corresponding to `train_idx`, calculates the distribution percentages for each font class, and formats the output with `tabulate`.
After execution, the output will include:
- The total number of observations in the training set: 1004
- A table showing how each font is represented, both relative to the full dataset and within the training split (our objective was a roughly balanced distribution, with around 80% of each font's samples in the training set)
- A formatted summary to quickly assess balance across classes.

In [None]:
train_stats = get_split_stats(df, train_idx, "Training")

Just as we did for the training set, we now apply the same statistical analysis to the test set. By calling get_split_stats(df, test_idx, "Test"), we extract the relevant observations, compute the proportional distribution of each font, and present the results in a structured format. This ensures that we maintain balance across our dataset splits and allows us to verify that no particular class dominates disproportionately.

In [None]:
test_stats = get_split_stats(df, test_idx, "Test")

To ensure that our dataset split was executed correctly, we print a confirmation of the total number of samples. This allows us to verify that all observations have been accounted for and that there are no discrepancies between the sum of the training and test sets and the original dataset size.
The output will show a breakdown of the number of samples assigned to training and test sets, confirming that their sum matches the total dataset size. This simple check reassures us that the stratified split was properly performed before proceeding with further data analysis or model training.

In [None]:
print(f"Dataset training + dataset test: {len(train_idx)} + {len(test_idx)} = {len(train_idx)+len(test_idx)}")
print(f"Dataset: {len(df)} samples")

# PHASE 1.5 DATA LOADERS

At this stage, we set up our data loaders, which handle batch processing during training and validation. Choosing suitable parameters keeps data loading fast and limits memory overhead, especially when working with large datasets.

We set `batch_size` to 32, which determines how many samples are processed in each batch. To balance computational resources, we calculate `num_workers` dynamically from `os.cpu_count()`, dividing by two to allow parallel loading while avoiding excessive resource consumption, and we fall back to at least one worker. Additionally, we enable `pin_memory` when CUDA is available, which accelerates data transfer to the GPU.

With these settings in place, we initialize the `DataLoader` instances:
- For the training set, we enable `shuffle=True` so the model sees the samples in a different order each epoch, improving generalization;
- For validation, which here uses the test split, we keep `shuffle=False`, preserving a fixed order that keeps evaluation consistent across runs.

To verify our configuration, we print the total number of batches in both loaders. This lets us check that the data has been correctly partitioned and that the setup matches expectations.
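
The printed batch counts can be predicted by hand. The sketch below assumes the default `drop_last=False` and the split sizes implied by the statistics above (1004 training samples, and 1256 - 1004 = 252 test samples if all 1256 documents are kept). The helper names are hypothetical.

```python
import math
import os

def safe_num_workers():
    """Half the CPU cores, but at least 1, even if cpu_count() returns None."""
    cpus = os.cpu_count() or 1
    return max(1, cpus // 2)

def num_batches(n_samples, batch_size, drop_last=False):
    """Number of batches a DataLoader yields for a map-style dataset."""
    if drop_last:
        return n_samples // batch_size
    return math.ceil(n_samples / batch_size)

# Assumed split sizes: 1004 training samples and 252 test samples.
print(num_batches(1004, 32), num_batches(252, 32))  # 32 8
print(safe_num_workers())
```

With `drop_last=False`, the last batch of each loader is smaller than 32 (12 training samples and 28 test samples under these assumptions).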

In [None]:
batch_size = 32
num_workers = (os.cpu_count() or 1) // 2 or 1  # cpu_count() can return None
pin_memory = torch.cuda.is_available()

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True,
                      num_workers=num_workers, pin_memory=pin_memory)
val_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False,
                    num_workers=num_workers, pin_memory=pin_memory)
print(f"Train batches: {len(train_loader)}, Val batches: {len(val_loader)}")

# PHASE 2 MODEL DEVELOPMENT

**After importing, exploring, and preparing the dataset, we can finally start building our model.**

(We tried various models, but present only the most significant ones here.)

## Model 1: EnhancedFontCNN

We begin with our first deep learning model, EnhancedFontCNN, a convolutional neural network tailored for font classification. Its architecture is designed to extract visual features from text patches and classify font styles.

The feature extractor consists of four convolutional blocks, each a 3×3 convolution followed by batch normalization and a ReLU activation, with channel widths growing from 32 to 256. The first three blocks end with 2×2 max pooling, which halves the spatial resolution of the feature maps each time. After the fourth block, an adaptive average pooling layer brings the output to a fixed 8×8 grid regardless of input size.

In the classification stage, the extracted features are flattened into a single vector of 256 × 8 × 8 = 16,384 values and passed through two fully connected layers. Dropout, batch normalization, and a ReLU activation between them stabilize training and help prevent overfitting. The final layer outputs one score per font class in the dataset.
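
To get a feel for where the model's capacity sits, the learnable parameters of the architecture described above can be counted by hand. This is a back-of-the-envelope sketch; `num_classes=10` is a placeholder, since the notebook derives the real value from the data.

```python
def conv_params(c_in, c_out, k=3):
    """Weights plus biases of a k x k convolution."""
    return c_in * c_out * k * k + c_out

def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def bn_params(channels):
    """Learnable scale and shift of a batch normalization layer."""
    return 2 * channels

def count_params(num_classes, in_channels=1):
    """Return (feature extractor, classifier) parameter counts."""
    widths = [in_channels, 32, 64, 128, 256]
    features = sum(
        conv_params(c_in, c_out) + bn_params(c_out)
        for c_in, c_out in zip(widths[:-1], widths[1:])
    )
    flat = 256 * 8 * 8  # output of AdaptiveAvgPool2d((8, 8)), flattened
    classifier = (
        linear_params(flat, 512) + bn_params(512) + linear_params(512, num_classes)
    )
    return features, classifier

# num_classes=10 is a placeholder value for illustration.
features, classifier = count_params(num_classes=10)
print(f"feature extractor: {features:,}")   # 388,800
print(f"classifier:        {classifier:,}")  # 8,395,274
```

Under these assumptions the first fully connected layer holds the vast majority of the parameters, so a smaller adaptive pooling size would be the most direct way to shrink the model.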

In [None]:
class EnhancedFontCNN(nn.Module):
    def __init__(self, num_classes, in_channels=1, drop_p=0.5):
        super().__init__()
        # Feature extractor
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),

            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),

            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),

            nn.Conv2d(128, 256, kernel_size=3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d((8, 8))
        )
        # Classifier
        self.classifier = nn.Sequential(
            nn.Dropout(drop_p),
            nn.Linear(256 * 8 * 8, 512),
            nn.BatchNorm1d(512),
            nn.ReLU(inplace=True),
            nn.Dropout(drop_p),
            nn.Linear(512, num_classes)
        )

    def forward(self, x):
        x = self.features(x)               
        x = x.view(x.size(0), -1)         
        x = self.classifier(x)             
        return x

num_classes = df['font'].nunique()
model = EnhancedFontCNN(num_classes=num_classes, in_channels=1).to(device)

Now that we have defined our model, we focus on training it. The `train_cnn` function orchestrates the entire process. We begin by configuring the optimizer and learning rate scheduler: Adam is our choice for optimization, with a learning rate of 1e-3 and a weight decay of 1e-4 to help prevent overfitting. To refine learning dynamics, we use `CosineAnnealingLR`, which gradually decreases the learning rate toward 1e-6 over the configured number of epochs (30 by default), for smoother convergence.
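
The cosine schedule has a simple closed form, sketched below with the default values of `train_cnn` (base rate 1e-3, floor 1e-6, 30 epochs). The function `cosine_lr` is a hypothetical helper for illustration; in the notebook the scheduler object computes these values.

```python
import math

def cosine_lr(epoch, base_lr=1e-3, eta_min=1e-6, t_max=30):
    """Learning rate after `epoch` scheduler steps under cosine annealing:
    eta_min + 0.5 * (base_lr - eta_min) * (1 + cos(pi * epoch / t_max))
    """
    cos_term = 1 + math.cos(math.pi * epoch / t_max)
    return eta_min + 0.5 * (base_lr - eta_min) * cos_term

# Start, midpoint, and end of the schedule.
for epoch in (0, 15, 30):
    print(epoch, f"{cosine_lr(epoch):.3e}")
```

The rate starts at `base_lr`, passes through roughly the midpoint between the two bounds halfway through training, and reaches `eta_min` at `t_max`.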

Since fonts appear with varying frequencies in the dataset, we address class imbalance by computing weights inversely proportional to their occurrences. These weights are passed to `CrossEntropyLoss` so that rarer font classes carry more weight in the loss. The weighting is derived from the dataset distribution using `df['font'].value_counts()`, sorted by font name, which assumes that the numerical labels follow the same alphabetical order.

During training, the model iterates through batches of 32 images, using GPU acceleration if available. For each batch, predictions are generated and the loss is calculated with `CrossEntropyLoss`; gradients are then computed (`loss.backward()`) and applied (`optimizer.step()`) to update the model's parameters. Throughout training, we record the loss, validation accuracy, and learning rate of each epoch to monitor progress.

Validation assesses how well the model generalizes. After each training epoch, we switch to evaluation mode and disable gradient computation. We then compute validation loss and accuracy, and if validation accuracy improves, the model's weights are saved automatically (`torch.save(model.state_dict(), 'best_model.pth')`), preserving the best-performing version.

The learning rate is adjusted once per epoch by the cosine schedule, and we also implement early stopping: if validation accuracy does not improve for five consecutive epochs, training stops, preventing unnecessary resource consumption and reducing the risk of overfitting.
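
The early stopping rule can be isolated in a small helper, sketched below. The `EarlyStopping` class and the accuracy values are hypothetical; the notebook implements the same counter inline in the training loop.

```python
class EarlyStopping:
    """Signal a stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, metric):
        """Record one epoch's metric; return True when training should stop."""
        if metric > self.best:
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Hypothetical validation accuracies: improvement stalls after the second epoch.
stopper = EarlyStopping(patience=5)
accs = [0.40, 0.55, 0.54, 0.55, 0.53, 0.52, 0.55, 0.60]
stopped_at = None
for epoch, acc in enumerate(accs, start=1):
    if stopper.step(acc):
        stopped_at = epoch
        break
print(stopped_at)  # 7
```

Note that matching the previous best (0.55 in epochs 4 and 7) does not reset the counter, so the run stops at epoch 7 and never reaches the later improvement to 0.60. This is the trade-off a short patience makes.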

After training is complete, we generate a classification report summarizing performance across font categories, computed from the predictions of the last epoch that ran. The report shows which fonts are classified well and where future improvements should focus. Structuring the training process this way balances efficiency and accuracy while adapting to the characteristics of the dataset.

In [None]:
def train_cnn(model, train_loader, val_loader, df, device, epochs=30):
    class_counts = df['font'].value_counts().sort_index()
    class_weights = torch.tensor(1.0 / class_counts.values, dtype=torch.float).to(device)
    criterion = nn.CrossEntropyLoss(weight=class_weights)

    optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer,
        T_max=epochs,
        eta_min=1e-6
    )

    history = {'train_loss': [], 'val_loss': [], 'val_acc': [], 'lr': []}
    best_acc = 0.0
    early_stop_counter = 0

    print("\n🚀 Starting training...")
    for epoch in range(1, epochs + 1):
        t0 = time.time()

        model.train()
        running_loss = 0.0
        for imgs, labels in train_loader:
            imgs, labels = imgs.to(device), labels.to(device)
            optimizer.zero_grad()
            outputs = model(imgs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item() * imgs.size(0)

        train_loss = running_loss / len(train_loader.dataset)

        model.eval()
        val_loss = 0.0
        correct = 0
        all_preds, all_labels = [], []

        with torch.no_grad():
            for imgs, labels in val_loader:
                imgs, labels = imgs.to(device), labels.to(device)
                outputs = model(imgs)
                loss = criterion(outputs, labels)
                val_loss += loss.item() * imgs.size(0)
                preds = outputs.argmax(dim=1)
                correct += (preds == labels).sum().item()
                all_preds.extend(preds.cpu().numpy())
                all_labels.extend(labels.cpu().numpy())

        val_loss = val_loss / len(val_loader.dataset)
        val_acc = correct / len(val_loader.dataset)
        scheduler.step()

        if val_acc > best_acc:
            best_acc = val_acc
            torch.save(model.state_dict(), 'best_model.pth')
            early_stop_counter = 0
            print(f"📦 Best model saved at epoch {epoch} with Val Acc {val_acc:.4f}")
        else:
            early_stop_counter += 1
            if early_stop_counter >= 5:
                print("⏸️ Early stopping")
                break

        history['train_loss'].append(train_loss)
        history['val_loss'].append(val_loss)
        history['val_acc'].append(val_acc)
        history['lr'].append(optimizer.param_groups[0]['lr'])

        epoch_time = time.time() - t0
        print(f"\nEpoch {epoch}/{epochs} — {epoch_time:.1f}s")
        print(f" Train Loss: {train_loss:.4f} | Val Loss: {val_loss:.4f}")
        print(f" Val Acc: {val_acc:.4f} (Best: {best_acc:.4f}) | LR: {optimizer.param_groups[0]['lr']:.2e}")

    print("\n Classification Report on validation set:")
    print(classification_report(all_labels, all_preds, target_names=class_counts.index.tolist()))

    return model, history


Now we initialize our model and begin training with the pipeline defined above. First, we check whether a GPU is available, selecting "cuda" if possible and defaulting to "cpu" otherwise, so that computations run on the best available hardware. Once the device is selected, we instantiate EnhancedFontCNN with the number of font classes found in the dataset and move it to the chosen device. Printing the model lets us verify its structure before proceeding.

Training begins with the `train_cnn` function, to which we pass the model, the data loaders, the dataset, the device, and the number of epochs, set to 10. The training loader supplies the training split, while validation uses the test loader, so the model is evaluated throughout on data it was not trained on.

**Expected output:**
As training progresses over 10 epochs, we expect a gradual reduction in training and validation loss and a steady increase in validation accuracy, while the learning rate (LR) decays each epoch following the cosine schedule.
Typical output will look like this:
- Each epoch displays the training time, loss values, validation accuracy, and best validation accuracy recorded so far;
- Loss values should generally decrease as the model learns, though occasional fluctuations may occur;
- Validation accuracy may vary initially but should improve over time, indicating better generalization;
- When validation accuracy reaches a new peak, the best model weights are saved;
- Early stopping triggers if validation accuracy does not improve for five consecutive epochs.

At the end of training, the function prints a classification report for the validation set, summarizing performance per font category:
- Precision: the fraction of samples predicted as a given font that actually belong to it;
- Recall: the fraction of samples of a given font that the model identified correctly;
- F1-score: the harmonic mean of precision and recall, balancing both aspects;
- Accuracy: the fraction of all samples classified correctly.

The macro average weights all font classes equally, while the weighted average weights each class by its number of samples. Based on this report, we can refine the model by focusing on frequently misclassified fonts or by adjusting preprocessing.
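
These definitions can be made concrete with a small worked example. The confusion matrix below is made up for illustration and does not come from the notebook's results.

```python
import numpy as np

# Hypothetical 3-class confusion matrix: rows are true labels, columns predictions.
cm = np.array([
    [8, 2, 0],
    [1, 4, 0],
    [0, 1, 4],
])

tp = np.diag(cm)                  # correctly classified samples per class
precision = tp / cm.sum(axis=0)   # correct / everything predicted as the class
recall = tp / cm.sum(axis=1)      # correct / everything that truly is the class
f1 = 2 * precision * recall / (precision + recall)
support = cm.sum(axis=1)          # number of true samples per class

accuracy = tp.sum() / cm.sum()
macro_f1 = f1.mean()                            # every class counts equally
weighted_f1 = np.average(f1, weights=support)   # classes weighted by support

print("precision:", precision.round(3))
print("recall:   ", recall.round(3))
print("f1:       ", f1.round(3))
print(f"accuracy={accuracy:.3f} macro_f1={macro_f1:.3f} weighted_f1={weighted_f1:.3f}")
```

In this toy case the largest class also has a comparatively high F1-score, so the weighted average comes out above the macro average. With a poorly classified majority class the order would flip.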

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"\n🔌 Device in use: {device}")
model = EnhancedFontCNN(num_classes=len(df['font'].unique())).to(device)
print(model)

trained_model, history = train_cnn(
    model, 
    train_loader,    
    val_loader,      
    df,               
    device,           
    epochs=10         
)

Now that our model has completed training, we assess its performance using a confusion matrix, which shows how predictions are distributed across font categories. The matrix counts how often predictions match the actual labels and highlights misclassifications.
We first gather predictions and ground-truth labels from the validation set. The model is set to evaluation mode, so that dropout is disabled and batch normalization uses its running statistics, and `torch.no_grad()` disables gradient tracking while we process the validation batches and store the predictions for comparison.
Once all labels and predictions are collected, we compute the matrix with `confusion_matrix()`, which tabulates true labels against predicted labels, and visualize the result with Seaborn's heatmap to make classification trends easier to interpret.

**Expected Output**
The confusion matrix shows, in each cell, the number of predictions for the corresponding true and predicted labels, with classes ranging from "augustus" to "vesta". Darker blue shades indicate higher counts, highlighting where the model performs well or struggles.
- The highest count (21) occurs for "laurel", suggesting that the model recognizes that category reliably.
- Some misclassifications occur, notably between "ceres" and "juno", implying potential overlap or ambiguity in their features.
- Certain categories, such as "vesta", have lower counts, indicating either rarity in the dataset or difficulty in distinguishing them.

Overall, the confusion matrix is a useful diagnostic tool, highlighting where improvements in feature engineering or model training may be necessary.
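
Reading the largest off-diagonal cells from a heatmap by eye gets harder as the number of classes grows. The sketch below lists them programmatically. The helper `top_confusions` is hypothetical, and the counts in the toy matrix are made up for illustration rather than taken from the notebook's results.

```python
import numpy as np

def top_confusions(cm, class_names, k=3):
    """Return the k largest off-diagonal cells as (true, predicted, count)."""
    off_diag = cm.copy()
    np.fill_diagonal(off_diag, 0)  # ignore correct predictions
    order = np.argsort(off_diag, axis=None)[::-1][:k]
    rows, cols = np.unravel_index(order, off_diag.shape)
    return [
        (class_names[r], class_names[c], int(off_diag[r, c]))
        for r, c in zip(rows, cols)
        if off_diag[r, c] > 0
    ]

# Toy matrix for three fonts; rows are true labels, columns are predictions.
names = ["ceres", "juno", "laurel"]
toy_cm = np.array([
    [10, 5, 0],
    [3, 12, 1],
    [0, 0, 20],
])
print(top_confusions(toy_cm, names, k=2))
# [('ceres', 'juno', 5), ('juno', 'ceres', 3)]
```

With the notebook's variables, the call would be `top_confusions(cm, font_names)`.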

In [None]:
all_preds, all_labels = [], []
model.eval()
with torch.no_grad():
    for imgs, labels in val_loader:
        imgs, labels = imgs.to(device), labels.to(device)
        outputs = model(imgs)
        preds = outputs.argmax(dim=1)
        all_preds.extend(preds.cpu().numpy())
        all_labels.extend(labels.cpu().numpy())

print(f"Collected {len(all_labels)} labels and predictions.")

cm = confusion_matrix(all_labels, all_preds, labels=list(range(len(font_names))))
plt.figure(figsize=(10,8))
sns.heatmap(cm, annot=True, fmt='d',
            xticklabels=font_names, yticklabels=font_names,
            cmap="Blues")
plt.title("Confusion Matrix")
plt.ylabel("True Label")
plt.xlabel("Predicted Label")
plt.show()

## Model 2: EnhancedFontCNN 2.0

Next we look at improving our CNN training methodology. The previous implementation worked, but there is always room for optimization, so in this second version we revise key aspects of the training loop to improve stability, efficiency, and adaptability.

One of the main adjustments is the learning rate scheduling. Instead of `CosineAnnealingLR`, which decays the learning rate on a fixed schedule, this version uses `ReduceLROnPlateau`, which halves the learning rate (`factor=0.5`) only when validation accuracy has stopped improving for more than three epochs (`patience=3`). The learning rate therefore stays constant while the model is still improving and drops only on stagnation.
Additionally, training starts from a lower learning rate (1e-4) than the 1e-3 used in the previous version, trading some speed in early epochs for more conservative updates.

Early stopping logic has also been revised. The initial implementation stopped training after five epochs without improvement, while the newer version introduces an early stopping check after three epochs (this provides a quicker response but needs fine-tuning to ensure it does not end training prematurely).

Finally, the optimizer is created once, before the training loop, so that its state is handled consistently across epochs.
By integrating these changes, we aim for more stable and effective training while maintaining flexibility for further fine-tuning.
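
To see how `ReduceLROnPlateau` reacts to stagnation, the sketch below feeds it a made-up sequence of validation accuracies that never improves after the first epoch. A single dummy parameter stands in for the model, and the accuracy values are placeholders.

```python
import torch
from torch.optim.lr_scheduler import ReduceLROnPlateau

# A single dummy parameter stands in for the model's weights.
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.Adam([param], lr=1e-4)
scheduler = ReduceLROnPlateau(optimizer, mode="max", patience=3, factor=0.5)

# Hypothetical validation accuracies that stop improving after the first epoch.
val_accs = [0.50, 0.50, 0.50, 0.50, 0.50, 0.50]
lrs = []
for acc in val_accs:
    optimizer.step()        # normally preceded by a backward pass
    scheduler.step(acc)     # mode="max": the monitored metric should increase
    lrs.append(optimizer.param_groups[0]["lr"])
print(lrs)
```

With `patience=3`, the scheduler tolerates three epochs without improvement and halves the rate on the fourth, so the reduction shows up at the fifth epoch of this sequence.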

In [None]:
def train_cnn(model, train_loader, val_loader, df, device, epochs=30):
    optimizer = optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)
    scheduler = ReduceLROnPlateau(optimizer, mode='max', patience=3, factor=0.5, verbose=True)

    class_counts = df['font'].value_counts().sort_index()
    class_weights = torch.tensor(1.0 / class_counts.values, dtype=torch.float).to(device)
    criterion = nn.CrossEntropyLoss(weight=class_weights)

    history = {'train_loss': [], 'val_loss': [], 'val_acc': [], 'lr': []}
    best_acc = 0.0
    early_stop = 0

    print("\n🚀 Starting training...")
    for epoch in range(1, epochs + 1):
        t0 = time.time()
        model.train()
        running_loss = 0.0
        for imgs, labels in train_loader:
            imgs, labels = imgs.to(device), labels.to(device)
            optimizer.zero_grad()
            outputs = model(imgs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item() * imgs.size(0)
        train_loss = running_loss / len(train_loader.dataset)

        model.eval()
        val_loss = 0.0
        correct = 0
        all_preds, all_labels = [], []
        with torch.no_grad():
            for imgs, labels in val_loader:
                imgs, labels = imgs.to(device), labels.to(device)
                outputs = model(imgs)
                loss = criterion(outputs, labels)
                val_loss += loss.item() * imgs.size(0)
                preds = outputs.argmax(dim=1)
                correct += (preds == labels).sum().item()
                all_preds.extend(preds.cpu().numpy())
                all_labels.extend(labels.cpu().numpy())
        val_loss = val_loss / len(val_loader.dataset)
        val_acc = correct / len(val_loader.dataset)

        scheduler.step(val_acc)

        if val_acc > best_acc:
            best_acc = val_acc
            torch.save(model.state_dict(), 'best_model.pth')
            early_stop = 0
            print(f"📦 Best model saved at epoch {epoch} with Val Acc {val_acc:.4f}")
        else:
            early_stop += 1

        history['train_loss'].append(train_loss)
        history['val_loss'].append(val_loss)
        history['val_acc'].append(val_acc)
        history['lr'].append(optimizer.param_groups[0]['lr'])

        epoch_time = time.time() - t0
        print(f"\nEpoch {epoch}/{epochs} — {epoch_time:.1f}s")
        print(f" Train Loss: {train_loss:.4f} | Val Loss: {val_loss:.4f}")
        print(f" Val Acc: {val_acc:.4f} (Best: {best_acc:.4f}) | LR: {optimizer.param_groups[0]['lr']:.2e}")

        if early_stop >= 5:
            print("⏸️ Early stopping")
            break

    print("\n Classification Report on validation set:")
    print(classification_report(all_labels, all_preds, target_names=class_counts.index.tolist()))
    return model, history


In this cell, we initialize EnhancedFontCNN, the improved version of our CNN model for font classification. Before training, we check for GPU availability and move the model to the best available device (cuda or cpu). The model is then passed to the train_cnn() function, with train_loader used for training and val_loader for validation. Training runs for 12 epochs, incorporating optimizations such as CosineAnnealingLR for smooth learning rate adjustment and early stopping to limit overfitting.
This phase allows us to evaluate the model’s behavior before proceeding with further comparisons, such as fine-tuning a pretrained ResNet18 network. 
The expected output will be structured similarly to the previous CNN, displaying epoch progress, loss metrics, accuracy updates, and final validation results.

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"\n🔌 Device in use: {device}")
model = EnhancedFontCNN(num_classes=len(df['font'].unique())).to(device)
print(model)
trained_model, history = train_cnn(
    model, 
    train_loader,     
    val_loader,      
    df,               
    device,           
    epochs=12)

In this cell, we generate a confusion matrix to analyze the model's classification performance on the validation set. The matrix provides a visual representation of how well predictions align with actual labels, highlighting areas where misclassifications occur.
We start by collecting all predicted and true labels from the validation set. The model is put in evaluation mode with model.eval(), which disables dropout and stops batch normalization from updating its running statistics, and the loop runs inside torch.no_grad() so that no gradients are computed. We then process the validation batches, extract predictions, and store them for analysis.

Once we have gathered predictions, we compute the confusion matrix using confusion_matrix() and visualize it with Seaborn’s heatmap. Each row corresponds to true labels, while columns represent predicted labels. Ideally, the highest values should appear along the diagonal, indicating correct classifications, while off-diagonal values reveal misclassifications, helping us identify patterns in errors.
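To connect the matrix to the per-class precision and recall figures reported for this model, here is a minimal sketch that derives both from a confusion matrix laid out as above (rows are true labels, columns are predicted labels). The helper name `per_class_precision_recall` is hypothetical and the 3-class counts are invented.

```python
def per_class_precision_recall(cm):
    """cm[i][j] = number of samples with true class i predicted as class j."""
    n = len(cm)
    results = []
    for k in range(n):
        tp = cm[k][k]                                   # diagonal entry
        predicted_k = sum(cm[i][k] for i in range(n))   # column sum
        actual_k = sum(cm[k])                           # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        results.append((precision, recall))
    return results


# Toy 3-class matrix with invented counts.
toy_cm = [
    [8, 2, 0],
    [1, 3, 6],
    [0, 0, 10],
]
metrics = per_class_precision_recall(toy_cm)
for k, (p, r) in enumerate(metrics):
    print(f"class {k}: precision = {p:.2f}, recall = {r:.2f}")
```

In the toy matrix, class 2 has perfect recall but lower precision because it absorbs six samples of class 1. A class with very high precision and low recall shows the opposite pattern: few off-diagonal entries in its column, many in its row.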

The expected output is structured like the previous evaluations: a summary of label collection, the confusion matrix visualization, and any key misclassification trends.
The EnhancedFontCNN model was trained for 12 epochs using a fixed learning rate of 1e-4 on CPU. Key observations:
- Overall Performance:
    - Final accuracy: 43.65% (Epoch 12).
    - Best epoch: The last epoch (Epoch 12) achieved the highest validation accuracy (43.65%).
    - Trend: Accuracy improved steadily throughout training, indicating the model is learning progressively.
- Loss Analysis:
    - Training loss: Gradually decreases from 2.37 to 1.66, showing consistent learning.
    - Validation loss: Also decreases but with minor fluctuations, suggesting some data variability or potential for further regularization.
- Class-Specific Performance:

*High-performing classes:*
- Forum (Precision: 71%, Recall: 59%)
- Aureus (Precision: 67%, Recall: 33%)
- Laurel (Precision: 42%, Recall: 79%) → High recall, but prone to false positives.
- Trajan (Precision: 100%, Recall: 27%) → Very selective, but it finds only a small share of the true Trajan samples.

*Underperforming classes:*
- Vesta (Precision: 25%, Recall: 6%) → Difficult to classify.
- Consul (Precision: 10%, Recall: 4%) → May lack distinctive features.

In [None]:
all_preds, all_labels = [], []
model.eval()
with torch.no_grad():
    for imgs, labels in val_loader:
        imgs, labels = imgs.to(device), labels.to(device)
        outputs = model(imgs)
        preds = outputs.argmax(dim=1)
        all_preds.extend(preds.cpu().numpy())
        all_labels.extend(labels.cpu().numpy())

print(f"Collected {len(all_labels)} labels and predictions..")

cm = confusion_matrix(all_labels, all_preds, labels=list(range(len(font_names))))
plt.figure(figsize=(10,8))
sns.heatmap(cm, annot=True, fmt='d',
            xticklabels=font_names, yticklabels=font_names,
            cmap="Blues")
plt.title("Confusion Matrix")
plt.ylabel("True Label")
plt.xlabel("Predicted Label")
plt.show()

# 3° Model : ResNet

After training our custom CNN models, we explored an alternative approach using a pretrained network. Instead of training from scratch, we leveraged ResNet18, a widely used architecture known for its efficiency in feature extraction. Since ResNet has already learned general visual patterns from a large dataset (ImageNet), we repurpose its existing weights while adapting the final layers to our specific task.

To do this, we freeze all layers, preventing their weights from being updated during training. This retains the powerful feature extraction capabilities of ResNet while focusing learning efforts on the new classification layer. The original fully connected layer is replaced with a new one, designed to output predictions for our font dataset. Additionally, we apply dropout (0.5) to reduce overfitting, ensuring robust learning.

This approach allows us to harness the strengths of a pretrained network while efficiently adapting it to our classification task.

We start by adjusting the dataset transformations. The key changes are:
- **RGB format instead of grayscale**: The model can now leverage color-based features instead of relying solely on texture and shape.
- **Different augmentation techniques**: Instead of geometric distortions (`RandomAffine`, `RandomPerspective`), the new version applies **color-based transformations** (`ColorJitter`) along with standard **rotation and flipping**.
- **Updated normalization values**: The new version uses ImageNet-style normalization (`mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]`), while the old version used grayscale normalization (`mean=[0.5], std=[0.5]`).
- **Simplified structure**: The new transformation setup is **directly defined** (`train_transforms`, `val_transforms`), making it more readable but less reusable compared to the function-based approach (`get_transforms()`).

**Impact of These Changes**:
- The updated pipeline **preserves color details**, which may improve classification accuracy when color is relevant.  
- The model now **focuses on visual variations rather than geometric distortions**, making transformations **less aggressive**.  
- The simpler structure makes **modifications easier**, but sacrifices the flexibility of a function-based approach.  

In [None]:
train_transforms = transforms.Compose([
    transforms.Resize((128,128)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(10),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485,0.456,0.406],
                         std=[0.229,0.224,0.225])
])

val_transforms = transforms.Compose([
    transforms.Resize((128,128)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485,0.456,0.406],
                         std=[0.229,0.224,0.225])
])

In this setup, the pre-trained ResNet18 is used as the foundation, leveraging what it has learned from ImageNet to extract rich features from images. Instead of training the entire network from scratch, all of its existing layers are frozen, meaning their parameters remain unchanged. The deep convolutional layers therefore keep working as general feature detectors, while the final classification layer is adapted to the new task of identifying different fonts.

To achieve this, the original fully connected (fc) layer at the end of ResNet is replaced with a custom classification head. The head starts with a dropout layer, which helps prevent overfitting by randomly disabling some neurons during training. It is followed by a linear layer that maps the extracted features to the number of font classes.

Looking at the architecture, ResNet18 begins with a large 7×7 convolutional layer, which captures broad features in the image. It is followed by batch normalization, which stabilizes training, and a max-pooling layer, which reduces spatial dimensions. The network then consists of multiple residual blocks, which introduce skip connections that are crucial for maintaining strong gradients during backpropagation. These blocks systematically increase the number of channels, moving from 64 to 512, refining features as they progress.

Each stage contains stacked convolutional layers with ReLU activations and batch normalization. As the depth increases, some layers also apply downsampling, reducing the spatial resolution while increasing the representational power of the network. Toward the end, an adaptive average pooling layer collapses each feature map to a single value, so the classifier receives a fixed-size vector regardless of the input size.
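The effect of the final pooling layer can be shown without any framework. The sketch below is plain Python over nested lists, with a hypothetical `global_avg_pool` helper and invented values. In PyTorch the equivalent operation is `nn.AdaptiveAvgPool2d((1, 1))`.

```python
def global_avg_pool(feature_map):
    """feature_map: list of channels, each a list of rows.

    Returns one mean value per channel, whatever the spatial size.
    """
    pooled = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


# Two channels at 2x2 resolution.
small = [
    [[1.0, 3.0], [5.0, 7.0]],
    [[2.0, 2.0], [2.0, 2.0]],
]
# Two channels at 3x3 resolution.
large = [
    [[float(r * 3 + c) for c in range(3)] for r in range(3)],
    [[1.0] * 3 for _ in range(3)],
]

pooled_small = global_avg_pool(small)
pooled_large = global_avg_pool(large)
print(pooled_small, pooled_large)
```

Both inputs produce a vector with one entry per channel, which is why the linear layer after the pooling step does not depend on the image resolution.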

Finally, the modified classifier head maps the extracted features to the required font classes. With the original backbone frozen, training focuses entirely on this new section of the network, which keeps the number of trainable parameters small.

By making these modifications, the model retains the powerful feature extraction capabilities of ResNet18, while also being fine-tuned to the specific task of font recognition. This approach balances speed, efficiency, and accuracy, making it an excellent choice when working with a well-defined classification problem.

In [None]:
resnet = models.resnet18(pretrained=True)
for param in resnet.parameters():
    param.requires_grad = False
num_ftrs = resnet.fc.in_features
resnet.fc = nn.Sequential(
    nn.Dropout(0.5),
    nn.Linear(num_ftrs, len(font_names))
)
model = resnet.to(device)
print(model)

This code cell defines a standard training and evaluation loop for our classification model using PyTorch. The goal is to train the model on our dataset for a number of epochs and monitor its performance on both the training and validation sets. The training loop begins by setting the model in training mode, and for each batch it performs the usual sequence of steps: zeroing the gradients, running the forward pass, computing the loss using cross-entropy (which is well suited for multi-class classification problems), backpropagating the error, and updating the weights using the Adam optimizer. We only optimize the final fully connected layer (`model.fc`), since the pre-trained backbone is frozen and only the classifier head is trained. After each epoch, we compute the average training loss and accuracy over the entire training set.

Once the training phase of an epoch is done, the model is evaluated on the validation set. At this point the model is switched to evaluation mode to disable dropout and batch normalization updates, and the validation loop is wrapped in a `torch.no_grad()` context to avoid computing gradients. Again, we calculate loss and accuracy, this time on the validation data, to assess generalization.

In theory, this code outputs the training and validation loss and accuracy for each epoch, giving us a clear idea of how well the model is learning and whether it is overfitting or underfitting. In practice, we could not run this loop to the end because the ResNet18 model turned out to be too computationally heavy for our available hardware. So, although the logic is correct and well structured, we could not complete the training process and observe the expected performance metrics.

In [None]:
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.fc.parameters(), lr=1e-3)
num_epochs = 10 
for epoch in range(num_epochs):
    model.train() 
    running_loss = 0.0
    correct, total = 0, 0
    for imgs, labels in train_loader:
        imgs, labels = imgs.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(imgs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item() * imgs.size(0)
        _, preds = torch.max(outputs, 1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)
    train_loss = running_loss / len(train_loader.dataset)
    train_acc = correct / total
    print(f"Epoch {epoch+1}/{num_epochs} | Train Loss: {train_loss:.4f} | Train Acc: {train_acc:.4f}")
    # Validation Loop
    model.eval()  
    val_loss = 0.0
    val_correct, val_total = 0, 0
    with torch.no_grad():
        for imgs, labels in val_loader:
            imgs, labels = imgs.to(device), labels.to(device)
            outputs = model(imgs)
            loss = criterion(outputs, labels)
            val_loss += loss.item() * imgs.size(0)
            
            _, preds = torch.max(outputs, 1)
            val_correct += (preds == labels).sum().item()
            val_total += labels.size(0)
    val_loss = val_loss / len(val_loader.dataset)
    val_acc = val_correct / val_total
    print(f"Validation  Loss: {val_loss:.4f} | Validation  Acc: {val_acc:.4f}\n")

# 4° Model : MobileNetV2 (CPU friendly version)

The previous model, based on ResNet18, showed some early promise but ultimately did not perform as expected for the font classification task. Its deep residual architecture is powerful and widely used, but in the epochs we were able to run it did not generalize satisfactorily, possibly because font data has domain-specific nuances that call for more efficient spatial encoding and stronger regularization.

To address these limitations, we transitioned to a new setup using **MobileNetV2** as the backbone. This model, as shown in the code below, is designed with efficiency and speed in mind, making it a strong candidate for tasks where we seek a balance between performance and computational cost.

The core of the architecture is a pre-trained MobileNetV2 model, from which we use only the `features` part — a deep stack of lightweight convolutional layers composed primarily of **inverted residual blocks** and **depthwise separable convolutions**. These architectural choices reduce the number of parameters significantly while preserving the network’s ability to learn rich hierarchical features from input images. The pre-trained weights come from training on ImageNet, and we freeze these layers to retain their powerful general-purpose visual representations without retraining them from scratch.
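The parameter saving from depthwise separable convolutions can be checked with a few lines of arithmetic. This is a sketch under simplifying assumptions: biases and batch normalization parameters are ignored, the channel sizes are illustrative rather than taken from MobileNetV2, and both function names are hypothetical.

```python
def standard_conv_params(c_in, c_out, k=3):
    """Weights in a standard k x k convolution (biases ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k=3):
    """Weights in a depthwise k x k conv followed by a 1x1 pointwise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1x1 convolution that mixes channels
    return depthwise + pointwise


std = standard_conv_params(64, 128)
sep = depthwise_separable_params(64, 128)
ratio = std / sep
print(f"standard: {std}, separable: {sep}, ratio: {ratio:.1f}x")
```

For a 3×3 layer going from 64 to 128 channels, the separable version needs 8,768 weights instead of 73,728, roughly 8.4 times fewer.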

To adapt the model to our specific classification task, we introduce a new classification head. It consists of:
- An **adaptive average pooling** layer, which ensures that feature maps from the backbone are globally averaged, regardless of the input image size.
- A **dropout layer** (with 40% probability), added to prevent overfitting by randomly deactivating neurons during training, encouraging robustness.
- A final **fully connected linear layer** that takes the flattened features (dimension 1280, which is the final output size of MobileNetV2) and maps them directly to the number of font classes.

In parallel to modifying the architecture, we also redefined the image transformation pipeline to better suit this network and the variability in font data. First, all input images are resized and center-cropped to a uniform size of **224×224 pixels**, ensuring compatibility with the expected input of MobileNetV2. For training data specifically, we include augmentations such as **random horizontal flipping**, **small-angle rotations**, and **slight color jittering**. These augmentations serve a critical purpose: they artificially introduce variability into the training set, helping the model generalize better and avoid overfitting to fixed font shapes or orientations. Finally, all images are normalized using ImageNet's mean and standard deviation values, aligning them with the scale and distribution the pre-trained backbone expects.

The training setup also includes class weighting in the loss function to address class imbalance, the use of the `AdamW` optimizer (which combines adaptive learning rate with decoupled weight decay), and a `ReduceLROnPlateau` scheduler that dynamically lowers the learning rate when the validation loss plateaus.
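The class weighting mentioned above follows the inverse-frequency rule that `train_cnn` applies with `1.0 / class_counts.values`. A minimal sketch with invented label counts shows what the weights look like; `inverse_frequency_weights` is a hypothetical helper, and in the notebook the resulting tensor is passed to `nn.CrossEntropyLoss(weight=...)`.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return {class: 1 / count}, so rarer classes get larger weights."""
    counts = Counter(labels)
    return {c: 1.0 / counts[c] for c in sorted(counts)}


# Invented, deliberately imbalanced labels.
toy_labels = ["forum"] * 8 + ["vesta"] * 2
weights = inverse_frequency_weights(toy_labels)
print(weights)
```

Here a mistake on the rare class costs four times as much as a mistake on the frequent one, which pushes the model away from simply predicting the majority class.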

**Output:**
Now, looking at the training output, we observe steady and consistent progress over the 5 epochs. Initially, in **Epoch 1**, the model starts with a training accuracy of approximately **22.7%** and a validation accuracy of **42.1%**, indicating that while the classifier has just started learning, the pre-trained backbone already provides meaningful features.

As training proceeds, both training and validation metrics improve significantly:
- By **Epoch 3**, validation accuracy reaches **51.2%**, suggesting that the classifier is beginning to generalize well.
- By **Epoch 5**, training accuracy increases to **47.4%**, and validation accuracy reaches **58.3%**, demonstrating a solid upward trajectory in performance.

The validation loss consistently decreases from **1.77** to **1.24**, which further confirms that the model is not just memorizing training data but is genuinely improving its ability to discriminate between different font classes. Importantly, after each epoch, the scheduler evaluates the validation loss to determine if learning rate adjustments are needed. The best-performing model — in terms of validation loss — is saved to `"best_tl_mobilenetv2.pth"`, ensuring that we retain the most generalizable weights.

In summary, switching to MobileNetV2 proved to be a more effective approach for this task. By combining a frozen, efficient feature extractor with a compact classifier head and a carefully adapted transformation strategy, we achieved better generalization, faster training, and overall more reliable performance on the font classification problem.


In [None]:
def get_transforms_tl(image_size=(224,224), train=True):
    base = [
        transforms.Resize(image_size),
        transforms.CenterCrop(image_size),
    ]
    aug = []
    if train:
        aug = [
            transforms.RandomHorizontalFlip(p=0.5),
            transforms.RandomRotation(10),
            transforms.ColorJitter(0.1, 0.1, 0.1, 0.1),
        ]
    norm = [
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485,0.456,0.406],
                             std=[0.229,0.224,0.225])
    ]
    return transforms.Compose(base + aug + norm)

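# Datasets that use the transfer-learning transforms defined above.
# Note: the training loop below iterates over the existing train_loader / val_loader;
# wrap tl_train_ds / tl_val_ds in DataLoaders to train with these transforms instead.
tl_train_ds = FontDataset(df.iloc[train_idx].reset_index(drop=True),
                          img_dir,
                          transform=get_transforms_tl((224,224), train=True))
tl_val_ds   = FontDataset(df.iloc[test_idx].reset_index(drop=True),
                          img_dir,
                          transform=get_transforms_tl((224,224), train=False))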

class TLMobileNetV2(nn.Module):
    def __init__(self, num_classes, freeze_backbone=True):
        super().__init__()
        self.backbone = models.mobilenet_v2(pretrained=True).features
        if freeze_backbone:
            for p in self.backbone.parameters():
                p.requires_grad = False
        self.pool = nn.AdaptiveAvgPool2d((1,1))
        in_f = 1280 
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.4),
            nn.Linear(in_f, num_classes)
        )

    def forward(self, x):
        x = self.backbone(x)
        x = self.pool(x)
        return self.classifier(x)

tl_model     = TLMobileNetV2(num_classes=len(font_to_label),
                             freeze_backbone=True).to('cpu')
tl_criterion = nn.CrossEntropyLoss(weight=class_weights.to('cpu'))
tl_optimizer = optim.AdamW(filter(lambda p: p.requires_grad, tl_model.parameters()),
                            lr=5e-4, weight_decay=1e-4)
tl_scheduler = optim.lr_scheduler.ReduceLROnPlateau(tl_optimizer,
                                                   mode='min',
                                                   patience=2,
                                                   factor=0.5)
best_val = float('inf')
for epoch in range(1, 6):
    # Train
    tl_model.train()
    train_loss, train_acc = 0.0, 0
    for imgs, labels in train_loader:
        tl_optimizer.zero_grad()
        out = tl_model(imgs)
        loss = tl_criterion(out, labels)
        loss.backward()
        tl_optimizer.step()

        train_loss += loss.item() * imgs.size(0)
        train_acc  += (out.argmax(1) == labels).sum().item()

    train_loss /= len(train_loader.dataset)
    train_acc  /= len(train_loader.dataset)

    # Val
    tl_model.eval()
    val_loss, val_acc = 0.0, 0
    with torch.no_grad():
        for imgs, labels in val_loader:
            out = tl_model(imgs)
            loss = tl_criterion(out, labels)
            val_loss += loss.item() * imgs.size(0)
            val_acc  += (out.argmax(1) == labels).sum().item()

    val_loss /= len(val_loader.dataset)
    val_acc  /= len(val_loader.dataset)

    tl_scheduler.step(val_loss)

    print(f"[TL-MobileNetV2] Epoch {epoch} | "
          f"Train loss {train_loss:.4f}, acc {train_acc:.4f} | "
          f"Val   loss {val_loss:.4f}, acc {val_acc:.4f}")

    if val_loss < best_val:
        best_val = val_loss
        torch.save(tl_model.state_dict(), "best_tl_mobilenetv2.pth")


# 5° Model : MobileNetV2 no augmentation

In this experiment, we begin by defining the image transformation pipeline in the get_transforms() function. It returns two identical pipelines, one for training and one for validation, since no data augmentation is applied here. Each image is resized to 224×224 pixels to fit MobileNetV2's input requirements and converted from grayscale to three-channel format. This conversion is essential because MobileNetV2 was trained on RGB ImageNet images and expects three-channel inputs.

The images are then normalized using the standard ImageNet statistics (mean = [0.485, 0.456, 0.406], std = [0.229, 0.224, 0.225]). This step keeps our input data compatible with the pre-trained weights used in the network and allows us to leverage transfer learning effectively.
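As a small worked example of this normalization, the sketch below applies the ImageNet statistics to a single grayscale pixel that has been replicated across three channels, which is what `transforms.Grayscale(num_output_channels=3)` followed by `transforms.Normalize` amounts to for one pixel. The helper name `normalize_pixel` is hypothetical.

```python
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]


def normalize_pixel(gray_value):
    """gray_value in [0, 1]: copy it to three channels, then normalize each."""
    return [(gray_value - m) / s for m, s in zip(IMAGENET_MEAN, IMAGENET_STD)]


white = normalize_pixel(1.0)
black = normalize_pixel(0.0)
print("white:", [round(v, 3) for v in white])
print("black:", [round(v, 3) for v in black])
```

Even though the three channels start out identical, they end up with slightly different values after normalization, because each channel has its own mean and standard deviation.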

We wrap the image data in a custom FontDataset class and feed it to PyTorch DataLoaders. We use a batch size of 32 and enable shuffling for the training set to promote better generalization, while keeping the test set in fixed order for reproducibility.

Next, we load the pre-trained MobileNetV2 and freeze all convolutional layers in the features block. This allows us to reuse the generic visual features learned from ImageNet, such as edge and texture detectors, without retraining them. We then replace the final classification layer with a new linear layer that maps the feature vector to our number of font classes (num_classes).

For the training procedure, we use CrossEntropyLoss, which is suitable for multi-class classification problems. The optimizer of choice is Adam, which adapts the step size of each parameter for faster convergence. In addition, a ReduceLROnPlateau scheduler monitors the validation accuracy and automatically reduces the learning rate when performance stops improving, which helps stabilize learning in later epochs.

The training loop is fully custom. In each epoch, the model alternates between a training phase, where it learns from labeled data and updates weights, and a validation phase, where its performance is evaluated on unseen data without computing gradients. We compute both training and validation accuracy at every epoch, and whenever a new best validation accuracy is achieved, we save the model to disk as a checkpoint (best_mobilenetv2.pth).

**Output:**

The output shows a training process that steadily reduces the loss and improves accuracy on both the training and validation sets:

At the start, as is often the case, the loss is quite high (around 67, a value the loop reports as the sum of the batch losses rather than a per-sample average) and accuracy is roughly 29%. The new classification layer starts from random weights, so this is expected for a non-trivial problem. Already after the first epoch we see a significant jump: the loss drops to about 47.9 and training accuracy rises to 51.5%. Validation follows a similar trend with 40% accuracy, indicating a decent level of generalization right from the beginning.

Subsequent epochs show a steady progression, without sudden leaps but with consistent improvement. The loss gradually decreases to just below 29 by the final epoch, while training accuracy climbs to over 71%. Meanwhile, validation maintains solid performance, peaking at 71.43% accuracy in epoch 8, when the best model checkpoint is saved.

It is normal that validation accuracy does not perfectly mirror training accuracy every epoch. For example, at epoch 3 validation dips slightly compared to epoch 2. This reflects natural fluctuations due to data variability and the complexity of the task and does not indicate overfitting; overall, the model keeps improving.

Saving the model every time validation accuracy reaches a new high is a smart move because it preserves the most effective version of the model without losing progress.
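The loss values quoted above come from a loop that adds up `loss.item()` over batches, so their magnitude depends on the number of batches per epoch. The sketch below shows how a summed loss relates to a per-sample average. The helper name `average_loss` is hypothetical and the batch values are invented.

```python
def average_loss(batch_losses, batch_sizes):
    """Per-sample mean loss from per-batch mean losses and batch sizes."""
    total = sum(loss * n for loss, n in zip(batch_losses, batch_sizes))
    return total / sum(batch_sizes)


# Invented per-batch mean losses; the last batch is smaller.
batch_losses = [2.0, 2.5, 1.5]
batch_sizes = [32, 32, 16]

summed = sum(batch_losses)                      # what a running sum reports
mean = average_loss(batch_losses, batch_sizes)  # comparable across datasets
print(f"summed: {summed:.2f}, per-sample mean: {mean:.2f}")
```

A per-sample mean is easier to compare across models and dataset sizes, and it is the convention used by the earlier `train_cnn` loop, which multiplies each batch loss by `imgs.size(0)` and divides by the dataset length.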

In [None]:
def get_transforms():
    train_transform = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.Grayscale(num_output_channels=3), 
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225])
    ])

    val_transform = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.Grayscale(num_output_channels=3),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225])
    ])
    return train_transform, val_transform
train_transform, val_transform = get_transforms()

train_dataset = FontDataset(train_df, IMG_DIR, transform=train_transform)
test_dataset = FontDataset(test_df, IMG_DIR, transform=val_transform)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32)
import torchvision.models as models
import torch.nn as nn
import torch

device = torch.device("mps" if torch.backends.mps.is_available() else "cuda" if torch.cuda.is_available() else "cpu")
print(" Using device:", device)

mobilenet = models.mobilenet_v2(pretrained=True)

for param in mobilenet.features.parameters():
    param.requires_grad = False

mobilenet.classifier[1] = nn.Linear(mobilenet.classifier[1].in_features, num_classes)
mobilenet = mobilenet.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(mobilenet.classifier.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max', factor=0.5, patience=2, verbose=True)
def train_model(model, train_loader, test_loader, epochs=10):
    best_acc = 0.0

    for epoch in range(epochs):
        model.train()
        running_loss, correct, total = 0.0, 0, 0

        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            outputs = model(images)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            # Sum of per-batch mean losses (not a per-sample average)
            running_loss += loss.item()
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

        train_acc = 100 * correct / total
        print(f"📚 Epoch {epoch+1}: Loss = {running_loss:.4f} | Accuracy = {train_acc:.2f}%")

        # Validation
        model.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for images, labels in test_loader:
                images, labels = images.to(device), labels.to(device)
                outputs = model(images)
                _, predicted = torch.max(outputs.data, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()

        val_acc = 100 * correct / total
        print(f"🔎 Validation Accuracy: {val_acc:.2f}%")
        scheduler.step(val_acc)

        if val_acc > best_acc:
            best_acc = val_acc
            torch.save(model.state_dict(), "best_mobilenetv2.pth")
            print("✅ Saved best model")

    print(f"\n Best validation accuracy: {best_acc:.2f}%")


Here we print the model's metrics and confusion matrix.
The model achieves an overall accuracy of 72% across 252 test images, a solid result considering that no heavy augmentation or ensemble methods are involved.
Here is how the model performs across the 11 font classes:
- forum and consul perform best, with near-perfect scores: forum has an F1-score of 0.97 and consul 0.94. This suggests that these fonts have highly distinguishable features, which the model has learned with confidence.
- aureus and cicero also perform very well (F1 around 0.81–0.82), with balanced precision and recall, meaning the model identifies them correctly and consistently.
- colosseum is interesting: it has the lowest precision (0.47) but a surprisingly high recall (0.70). The model predicts "colosseum" too often, including when it should not, but it does catch most true colosseum examples. This is a classic case of over-prediction on a class that may share visual features with others.
- roman stands out as problematic: it has decent precision (0.82) but a very low recall (0.35), indicating that the model rarely predicts "roman" even when it is the correct label, possibly because its features are too similar to more dominant classes like "consul" or "trajan."
- laurel, vesta, and trajan have middling scores, with F1s between roughly 0.56 and 0.70. These are likely more ambiguous classes in terms of visual features.

Overall, the macro average (simple average across all classes) and weighted average (which considers class imbalance) are both around 0.72–0.75, confirming that performance is not being dominated by just a few well-performing classes.
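
To make the precision/recall reading above more concrete, here is a minimal sketch that derives per-class precision, recall, and F1 from a confusion matrix, together with the macro and weighted averages. The 3x3 matrix is made-up toy data (not the project's results) and `per_class_metrics` is a hypothetical helper; in the notebook itself these numbers come from `classification_report`.

```python
import numpy as np

def per_class_metrics(cm):
    """Rows are true classes, columns are predicted classes."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)              # correctly classified samples per class
    predicted = cm.sum(axis=0)    # column sums: how often each class was predicted
    support = cm.sum(axis=1)      # row sums: number of true samples per class
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, support, out=np.zeros_like(tp), where=support > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return precision, recall, f1, support

# Toy data (assumption): class 0 is over-predicted, class 2 is rarely predicted.
toy_cm = [
    [7, 2, 1],
    [2, 16, 2],
    [6, 0, 4],
]
precision, recall, f1, support = per_class_metrics(toy_cm)

macro_f1 = f1.mean()                           # every class counts equally
weighted_f1 = np.average(f1, weights=support)  # classes weighted by their support

print("precision:", precision.round(2))
print("recall:   ", recall.round(2))
print(f"macro F1 = {macro_f1:.3f}, weighted F1 = {weighted_f1:.3f}")
```

In the toy matrix, class 0 shows the same pattern discussed for colosseum (precision lower than recall because of over-prediction), while class 2 shows the opposite pattern discussed for roman.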

In [None]:

label_to_font = {v: k for k, v in font_to_label.items()}

y_true = []
y_pred = []

mobilenet.eval()
with torch.no_grad():
    for images, labels in test_loader:
        images = images.to(device)
        outputs = mobilenet(images)
        _, predicted = torch.max(outputs.data, 1)
        y_true.extend(labels.cpu().numpy())
        y_pred.extend(predicted.cpu().numpy())

print("\n Classification Report:")
print(classification_report(y_true, y_pred, target_names=[label_to_font[i] for i in sorted(set(y_true))]))

cm = confusion_matrix(y_true, y_pred)
plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=[label_to_font[i] for i in sorted(set(y_true))],
            yticklabels=[label_to_font[i] for i in sorted(set(y_true))])
plt.xlabel("Predicted")
plt.ylabel("True")
plt.title("Confusion Matrix - MobileNetV2 (No Augmentation)")
plt.show()


# 6° Model: MobileNetV2 with augmentation

This cell uses the same model as before, but adds data augmentation. The difference lies in the `get_transforms()` function, where the training transformations go beyond a simple resize and normalize: images are still resized to 224x224 pixels, but a series of augmentations is then applied to increase the diversity of the training samples. These include random horizontal flips (with a 50% chance), small random rotations of up to ±10 degrees, and random changes in brightness and contrast. These augmentations help the model generalize better by simulating realistic variations that fonts might exhibit in scanned or photographed documents.

As for the results:
The output shows the training and validation metrics over seven epochs of our MobileNetV2 fine-tuning process:

- Right from the start, we observe a clear downward trend in the training loss, from 68.04 in epoch 1 down to around 40 by epoch 7. Note that the printed value is `running_loss`, the sum of the batch losses over the epoch, not a per-sample average. This drop indicates that the model is effectively learning to minimize its prediction errors on the training data.

- Accuracy on the training set increases steadily, from a modest 27.9% in the first epoch to nearly 57% by the seventh epoch. This rise confirms that the model's predictions are becoming increasingly accurate on the data it has seen during training.

- What's particularly encouraging is the behavior of the validation accuracy, which rises even faster than the training accuracy in the early epochs, jumping from 38.1% in the first epoch to nearly 62% by epoch 7. This suggests that the model is not just memorizing training samples, but genuinely learning features that generalize to unseen data. A validation accuracy above the training accuracy is also expected here, because augmentation and dropout are applied only during training.

- Each time the validation accuracy improves beyond the previous best, the model's weights are saved, as indicated by the "✅ Saved best model" messages. This ensures that even if subsequent training causes some overfitting or accuracy dips, we retain the best version of the model found so far.

- The relative closeness between training and validation accuracy, especially in the later epochs, also hints that overfitting is being controlled well, likely thanks to the frozen convolutional layers, data augmentation, and learning rate scheduling.

In summary, this output tells a positive story of effective learning: the model quickly improves its ability to classify fonts, with consistent gains in both training and validation accuracy and a steadily decreasing loss, setting a solid foundation for further refinement or deployment.
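
One caveat when reading the loss values: the training loop prints `running_loss`, which accumulates `loss.item()` once per batch, so its scale depends on the number of batches. The short sketch below converts the reported sum into an approximate mean loss per batch and compares it with the cross-entropy of a uniform guess over 11 classes. The training-set size is an assumption (1256 images minus the 252 test images); the batch size of 32 is taken from the `DataLoader`.

```python
import math

n_train = 1256 - 252      # assumed training-set size (not confirmed by the notebook output)
batch_size = 32           # from DataLoader(train_dataset, batch_size=32, ...)
n_batches = math.ceil(n_train / batch_size)

def mean_batch_loss(summed_loss, n_batches):
    """Turn the summed per-batch loss printed by the loop into a mean per batch."""
    return summed_loss / n_batches

epoch1_mean = mean_batch_loss(68.04, n_batches)

# Cross-entropy of a uniform prediction over 11 font classes, as a reference point.
chance_level = math.log(11)

print(f"batches per epoch:        {n_batches}")
print(f"epoch 1 mean batch loss:  {epoch1_mean:.3f}")
print(f"uniform-guess loss:       {chance_level:.3f}")
```

Under these assumptions, the epoch-1 figure corresponds to a mean loss slightly below the uniform-guess level, which is consistent with a classifier head that starts from random weights and improves during the first epoch.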

In [None]:
def get_transforms():
    train_transform = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.RandomRotation(degrees=10),
        transforms.ColorJitter(brightness=0.2, contrast=0.2),
        transforms.Grayscale(num_output_channels=3),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225])
    ])

    val_transform = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.Grayscale(num_output_channels=3),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225])
    ])

    return train_transform, val_transform
train_transform, val_transform = get_transforms()


train_dataset = FontDataset(train_df, IMG_DIR, transform=train_transform)
test_dataset = FontDataset(test_df, IMG_DIR, transform=val_transform)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32)

device = torch.device("mps" if torch.backends.mps.is_available() else "cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

num_classes = df['label'].nunique()

mobilenet = models.mobilenet_v2(pretrained=True)

for param in mobilenet.features.parameters():
    param.requires_grad = False

mobilenet.classifier[1] = nn.Linear(mobilenet.classifier[1].in_features, num_classes)
mobilenet = mobilenet.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(mobilenet.classifier.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max', patience=2, verbose=True)


def train_model(model, train_loader, test_loader, epochs=10):
    best_acc = 0.0

    for epoch in range(epochs):
        model.train()
        running_loss, correct, total = 0.0, 0, 0

        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            outputs = model(images)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            running_loss += loss.item()
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

        train_acc = 100 * correct / total
        print(f"Epoch {epoch+1}: Loss = {running_loss:.4f} | Accuracy = {train_acc:.2f}%")

        # Validation
        model.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for images, labels in test_loader:
                images, labels = images.to(device), labels.to(device)
                outputs = model(images)
                _, predicted = torch.max(outputs.data, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()

        val_acc = 100 * correct / total
        print(f"Validation Accuracy: {val_acc:.2f}%")
        scheduler.step(val_acc)

        if val_acc > best_acc:
            best_acc = val_acc
            torch.save(model.state_dict(), "best_mobilenetv2.pth")
            print("✅ Saved best model.")

    print(f"\n Best validation accuracy: {best_acc:.2f}%")
train_model(mobilenet, train_loader, test_loader, epochs=10)

# 7° Model : MobileNetV2 Partial Fine Tuned - the final one

After evaluating the previous ResNet-based architecture and realizing its limitations in capturing font-specific features—possibly due to the overly deep structure or insufficient specialization for fine-grained textural differences—we decided to switch to a lighter and more efficient model: MobileNetV2. This change is particularly motivated by the need for faster training, fewer parameters to tune, and a better balance between accuracy and computational cost.

As shown in the code, we also redefined the image preprocessing pipeline to align better with this architecture and the task at hand. The transformation function `get_transforms_tl` starts by resizing all input images to a fixed shape of 224×224 pixels, ensuring compatibility with MobileNetV2's expected input size. A `CenterCrop` of the same size follows. Because the images have already been resized to exactly 224×224, this crop leaves them unchanged; it would only trim the borders (for example, background around the text) if the resize target were larger than the crop size.
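
Since `Resize` and `CenterCrop` receive the same `(224, 224)` size, it is worth checking what the crop actually does. The NumPy stand-in below uses a hypothetical `center_crop` helper (an illustration, not torchvision's implementation) to show that cropping to the current size returns the image unchanged, while a smaller crop trims the borders.

```python
import numpy as np

def center_crop(img, out_h, out_w):
    """Crop the central out_h x out_w window of an H x W (x C) array."""
    h, w = img.shape[:2]
    top = (h - out_h) // 2
    left = (w - out_w) // 2
    return img[top:top + out_h, left:left + out_w]

# Stand-in for an image that has already been resized to 224x224.
img = np.arange(224 * 224).reshape(224, 224)

same = center_crop(img, 224, 224)      # same size: nothing is removed
smaller = center_crop(img, 200, 200)   # smaller size: 12 pixels trimmed on each side

print(np.array_equal(same, img))
print(smaller.shape)
```

If trimming borders is the goal, one option (an assumption, not something the notebook does) would be to resize to a slightly larger size first and then crop to 224×224.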

When training is enabled, we inject controlled randomness into the dataset using data augmentation techniques. A `RandomHorizontalFlip` with a 50% chance introduces left-right symmetry, which might help the model generalize better to fonts with mirrored features. `RandomRotation(10)` slightly rotates images, simulating natural distortions that might occur in font usage. Finally, `ColorJitter` with small variations in brightness, contrast, saturation, and hue helps the model avoid overfitting to specific lighting or rendering conditions.

Regardless of whether we are in training or validation mode, the final stage includes a conversion to tensor and normalization. We adopt the standard ImageNet normalization statistics—mean = `[0.485, 0.456, 0.406]` and std = `[0.229, 0.224, 0.225]`—which matches the distribution of the data MobileNetV2 was pre-trained on. This ensures a smoother transition and compatibility between the learned filters and the new input domain.

Regarding the model itself, `TLMobileNetV2` loads a pre-trained version of MobileNetV2 and fine-tunes only the end of its convolutional backbone, namely the parameters whose names contain "18" or "19". In torchvision's MobileNetV2, `features` holds blocks indexed 0 to 18, so in practice this unfreezes the final block (index 18), and the "19" test matches nothing. This last block captures higher-level abstractions, and by allowing only it to update, we strike a balance between preserving powerful pre-trained features and tailoring the model to our font classification task.
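
The selection rule in the code is a substring test on parameter names. The sketch below applies the same test to a short, hand-written list of names in the style returned by `features.named_parameters()`. The list is illustrative (an assumption, not a dump of the real model), and `is_trainable_strict` is a hypothetical alternative that compares only the leading block index.

```python
def is_trainable(name):
    """Same rule as in TLMobileNetV2: substring match on the parameter name."""
    return "18" in name or "19" in name

def is_trainable_strict(name, blocks=("18",)):
    """Hypothetical stricter rule: match only the leading block index."""
    return name.split(".")[0] in blocks

# Illustrative parameter names (assumed, for demonstration only).
example_names = [
    "0.0.weight",
    "1.conv.0.0.weight",
    "8.conv.1.0.weight",
    "17.conv.2.weight",
    "18.0.weight",
    "18.1.weight",
    "18.1.bias",
]

trainable = [n for n in example_names if is_trainable(n)]
print(trainable)

# A made-up name showing why the substring test is fragile in general:
tricky = "3.conv.18.weight"
print(is_trainable(tricky), is_trainable_strict(tricky))
```

With names like these, both rules select the same parameters; the stricter variant simply avoids accidental matches if a sub-module index ever contained "18".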

The classifier head is then replaced: instead of the original dense layer, we introduce a `Dropout(0.4)` regularizer followed by a new `Linear` layer that maps the extracted feature vector to the correct number of font classes. This minimalistic yet effective head is designed to prevent overfitting and keep the number of trainable parameters low.

Training uses `CrossEntropyLoss` combined with class weights to handle potential class imbalance, and the optimizer of choice is `AdamW`, which includes weight decay to further promote generalization. A `ReduceLROnPlateau` scheduler dynamically adjusts the learning rate when the validation loss plateaus, making the training process more adaptive and robust.
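
To illustrate how a plateau-based schedule behaves with the settings used here (`mode='min'`, `patience=3`, `factor=0.5`, initial learning rate `5e-4`), here is a simplified pure-Python sketch. It ignores the improvement threshold and cooldown options of the real `ReduceLROnPlateau`, and the validation losses are made-up values.

```python
def reduce_on_plateau(val_losses, lr=5e-4, patience=3, factor=0.5):
    """Simplified 'min'-mode plateau schedule (no threshold, no cooldown)."""
    best = float("inf")
    bad_epochs = 0
    lr_history = []
    for loss in val_losses:
        if loss < best:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
        # The rate is cut once the number of non-improving epochs exceeds patience.
        if bad_epochs > patience:
            lr *= factor
            bad_epochs = 0
        lr_history.append(lr)
    return lr_history

# Made-up validation losses: improvement stalls after the second epoch.
toy_losses = [1.00, 0.90, 0.95, 0.96, 0.97, 0.98, 0.80]
lr_history = reduce_on_plateau(toy_losses)

for epoch, (loss, lr) in enumerate(zip(toy_losses, lr_history), start=1):
    print(f"epoch {epoch}: val_loss={loss:.2f} lr={lr:.2e}")
```

In this toy run the learning rate is halved at the sixth epoch, after four consecutive epochs without a new best loss.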

We also leverage mixed precision training via `torch.amp` and a `GradScaler`, which speeds up computation on GPU while maintaining stability in gradient updates. Early stopping with a patience of 5 epochs ensures we avoid overfitting by halting training when no further improvement is observed.
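
The early-stopping rule can be isolated from the training loop as a small function. The sketch below mirrors the counter logic used in the notebook (reset on a new best validation loss, stop after `patience` consecutive epochs without improvement); the function name and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return (epoch at which training stops, best validation loss seen)."""
    best = float("inf")
    counter = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss      # new best: this is where the checkpoint would be saved
            counter = 0
        else:
            counter += 1
            if counter >= patience:
                return epoch, best
    return len(val_losses), best

# Made-up validation losses: the best value is reached at epoch 3.
toy_losses = [1.20, 1.00, 0.90, 0.95, 0.93, 0.97, 0.91, 0.92, 0.85]
stop_epoch, best_loss = early_stopping_epoch(toy_losses, patience=5)
print(stop_epoch, best_loss)
```

Note that in this toy sequence training stops before the later, lower loss of 0.85 is ever reached, which is the usual trade-off of a fixed patience.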

In [None]:
def get_transforms_tl(image_size=(224,224), train=True):
    base = [
        transforms.Resize(image_size),
        transforms.CenterCrop(image_size),
    ]
    if train:
        aug = [
            transforms.RandomHorizontalFlip(0.5),
            transforms.RandomRotation(10),
            transforms.ColorJitter(0.1,0.1,0.1,0.1),
        ]
    else:
        aug = []
    norm = [
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485,0.456,0.406],
                             std=[0.229,0.224,0.225])
    ]
    return transforms.Compose(base + aug + norm)

tl_train_ds = FontDataset(df.iloc[train_idx].reset_index(drop=True),
                          img_dir,
                          transform=get_transforms_tl((224,224), train=True))
tl_val_ds   = FontDataset(df.iloc[test_idx].reset_index(drop=True),
                          img_dir,
                          transform=get_transforms_tl((224,224), train=False))

tl_train_loader = DataLoader(tl_train_ds,
                             batch_size=32,
                             shuffle=True,
                             num_workers=0,
                             pin_memory=True)
tl_val_loader = DataLoader(tl_val_ds,
                           batch_size=32,
                           shuffle=False,
                           num_workers=0,
                           pin_memory=True)

class TLMobileNetV2(nn.Module):
    def __init__(self, num_classes, fine_tune=True):
        super().__init__()
        self.backbone = mobilenet_v2(pretrained=True)
        if fine_tune:
            for name, param in self.backbone.features.named_parameters():
                if "18" in name or "19" in name:
                    param.requires_grad = True
                else:
                    param.requires_grad = False
        else:
            for param in self.backbone.parameters():
                param.requires_grad = False

        in_f = self.backbone.classifier[1].in_features
        self.backbone.classifier = nn.Sequential(
            nn.Dropout(0.4),
            nn.Linear(in_f, num_classes)
        )

    def forward(self, x):
        return self.backbone(x)
tl_model = TLMobileNetV2(num_classes=len(font_to_label), fine_tune=True).to(device)

tl_criterion = nn.CrossEntropyLoss(weight=class_weights)
tl_optimizer = optim.AdamW(filter(lambda p: p.requires_grad, tl_model.parameters()),
                           lr=5e-4, weight_decay=1e-4)
tl_scheduler = optim.lr_scheduler.ReduceLROnPlateau(tl_optimizer,
                                                    mode='min',
                                                    patience=3,
                                                    factor=0.5)
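# GradScaler expects a device string; loss scaling is only needed under CUDA autocast
tl_scaler = torch.amp.GradScaler(device.type, enabled=(device.type == "cuda"))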

best_val = float('inf')
patience = 5
patience_counter = 0
n_epochs = 30

for epoch in range(1, n_epochs + 1):
    # TRAINING
    tl_model.train()
    train_loss, train_acc = 0.0, 0

    train_bar = tqdm(tl_train_loader,
                     desc=f"Epoch {epoch}/{n_epochs} [Train]",
                     leave=False)

    for imgs, labels in train_bar:
        imgs, labels = imgs.to(device, non_blocking=True), labels.to(device, non_blocking=True)
        tl_optimizer.zero_grad()

        # Mixed precision is only active when CUDA is available
        with autocast(enabled=torch.cuda.is_available()):
            out = tl_model(imgs)
            loss = tl_criterion(out, labels)

        tl_scaler.scale(loss).backward()
        tl_scaler.step(tl_optimizer)
        tl_scaler.update()

        # Accumulate the summed loss and the number of correct predictions,
        # so that the epoch averages below are meaningful
        train_loss += loss.item() * imgs.size(0)
        train_acc += (out.argmax(1) == labels).sum().item()

    train_loss /= len(tl_train_loader.dataset)
    train_acc /= len(tl_train_loader.dataset)

    # VAL
    tl_model.eval()
    val_loss, val_acc = 0, 0
    val_bar = tqdm(tl_val_loader, 
                  desc=f"Epoch {epoch}/{n_epochs} [Val]", 
                  leave=False)
    
    with torch.no_grad():
        for imgs, labels in val_bar:
            imgs, labels = imgs.to(device), labels.to(device)
            out = tl_model(imgs)
            loss = tl_criterion(out, labels)
            
            val_loss += loss.item() * imgs.size(0)
            val_acc += (out.argmax(1) == labels).sum().item()
            
            val_bar.set_postfix({
                'val_loss': f"{loss.item():.4f}",
                'val_acc': f"{(out.argmax(1) == labels).float().mean().item():.4f}"
            })

    val_loss /= len(tl_val_loader.dataset)
    val_acc /= len(tl_val_loader.dataset)

    tl_scheduler.step(val_loss)

    print(f"\n[Epoch {epoch}/{n_epochs}] "
          f"Train Loss: {train_loss:.4f} | Acc: {train_acc:.4f} | "
          f"Val Loss: {val_loss:.4f} | Acc: {val_acc:.4f} | "
          f"LR: {tl_optimizer.param_groups[0]['lr']:.2e}")

    if val_loss < best_val:
        best_val = val_loss
        patience_counter = 0
        torch.save(tl_model.state_dict(), "best_tl_mobilenetv2.pth")
        print("💾 Best model saved!")
    else:
        patience_counter += 1
        if patience_counter >= patience:
            print(f"Early stopping triggered at epoch {epoch}")
            break

Once training is completed, we move to the final evaluation phase to assess how well the model generalizes to unseen data. To do this, we load the weights corresponding to the best validation performance and set the model in evaluation mode — ensuring deterministic behavior by disabling mechanisms like dropout.

We then iterate over the entire validation set, performing inference on each image and recording both predicted and true labels. This allows us to compute aggregate metrics like precision, recall, and F1-score for each class. To complement these metrics, we generate a confusion matrix — a powerful visual tool that shows exactly where the model is making mistakes and which classes are most frequently confused.

**Results Analysis**
The model achieves an overall accuracy of 73%, a significant improvement over the results obtained with ResNet18. But the real insight lies in the per-class breakdown:
- Strongly classified fonts include consul and forum, with impressively high F1-scores (above 0.90). This suggests the model has learned to distinguish these fonts with high confidence and consistency.
- Fonts like aureus, cicero, vesta, and trajan also show good balance between precision and recall, meaning the learned representations are both reliable and generalizable.
- Some fonts remain challenging. Notably, augustus and roman exhibit high precision but low recall. This means the model is conservative in predicting these classes: it only does so when very confident, but it often fails to detect them and misclassifies them as other fonts. For instance, roman has a recall of only 31%, indicating frequent false negatives. Colosseum and laurel land in a more mediocre range, with F1-scores around 0.50–0.60, which may be due to their visual similarity to other font styles.

Both macro and weighted averages hover around 0.70, reflecting a reasonably balanced overall performance with room for improvement on the more ambiguous classes. The confusion matrix reinforces this, showing specific class-level confusions — likely tied to shared typographic features — that might benefit from targeted data augmentation or more discriminative loss functions.

In [None]:
print("\n Final evaluation on the Validation Set")
tl_model.load_state_dict(torch.load("best_tl_mobilenetv2.pth"))
tl_model.eval()

all_preds = []
all_labels = []

# Progress bar for the evaluation
eval_bar = tqdm(tl_val_loader, desc="Evaluating", leave=True)

with torch.no_grad():
    for imgs, labels in eval_bar:
        imgs, labels = imgs.to(device), labels.to(device)
        out = tl_model(imgs)
        preds = out.argmax(1)
        all_preds.extend(preds.cpu().numpy())
        all_labels.extend(labels.cpu().numpy())

print(classification_report(all_labels, all_preds, target_names=font_names))

# Confusion Matrix
cm = confusion_matrix(all_labels, all_preds)
plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt="d", xticklabels=font_names, yticklabels=font_names, cmap="Blues")
plt.title("Confusion Matrix - Validation Set")
plt.xlabel("Predicted")
plt.ylabel("True")
plt.tight_layout()
plt.show()

In the cell below we simply reload the best model obtained so far:

In [None]:
tl_model.load_state_dict(torch.load("best_tl_mobilenetv2.pth"))
tl_model.eval()

Here we print the results again, as a classification report and a confusion matrix, which we have already examined above.

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
import seaborn as sns

all_preds = []
all_labels = []

with torch.no_grad():
    for imgs, labels in tl_val_loader:
        imgs = imgs.to(device)
        out = tl_model(imgs)
        preds = out.argmax(1).cpu().numpy()
        all_preds.extend(preds)
        all_labels.extend(labels.numpy())
print("\n CLASSIFICATION REPORT")
print(classification_report(all_labels, all_preds, target_names=font_names))

cm = confusion_matrix(all_labels, all_preds)
plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d', xticklabels=font_names, yticklabels=font_names, cmap='Blues')
plt.title("Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("True")
plt.tight_layout()
plt.show()


Finally, in the code below we define a function that plots a 3x3 grid of sample images and shows, for each one, the correct font and the font predicted by our model. We include it to give a more practical, visual sense of the results obtained. For each image it shows:
- T: the true label (i.e., the correct font),
- P: the predicted label (i.e., the model's prediction).

In this case we get 6 correct predictions out of 9: the model misclassifies forum as augustus, laurel as cicero, and roman as laurel.
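
Before an image can be displayed, the ImageNet normalization applied in the transform has to be reversed. The following NumPy sketch (with hypothetical `normalize` and `unnormalize` helpers and random stand-in pixel data) checks that multiplying by the standard deviation and adding the mean recovers the original values in the `[0, 1]` range.

```python
import numpy as np

# ImageNet statistics used by the transforms in this notebook
MEAN = np.array([0.485, 0.456, 0.406])
STD = np.array([0.229, 0.224, 0.225])

def normalize(img_hwc):
    """Channel-wise normalization of an H x W x 3 image in [0, 1]."""
    return (img_hwc - MEAN) / STD

def unnormalize(img_hwc):
    """Inverse of normalize, clipped to the displayable [0, 1] range."""
    return (img_hwc * STD + MEAN).clip(0, 1)

# Random stand-in for an image (assumption: values already scaled to [0, 1])
rng = np.random.default_rng(0)
original = rng.random((4, 4, 3))

restored = unnormalize(normalize(original))
print(np.allclose(restored, original))
```

Broadcasting applies the three channel statistics along the last axis, which is why the tensor is first permuted to height × width × channels in the plotting function.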

In [None]:
def plot_predictions(dataset, model, device, font_names, num_images=9):
    model.eval()
    fig, axs = plt.subplots(3, 3, figsize=(12, 10))
    axs = axs.ravel()

    count = 0
    for i in range(len(dataset)):
        img, true_label = dataset[i]
        with torch.no_grad():
            output = model(img.unsqueeze(0).to(device))
            pred_label = output.argmax(1).item()

        ax = axs[count]
        img_disp = img.permute(1, 2, 0).numpy()
        img_disp = img_disp * [0.229, 0.224, 0.225] + [0.485, 0.456, 0.406]  # un-normalize
        img_disp = img_disp.clip(0, 1)

        ax.imshow(img_disp)
        ax.set_title(f"T: {font_names[true_label]}\nP: {font_names[pred_label]}", fontsize=10)
        ax.axis("off")

        count += 1
        if count >= num_images:
            break

    plt.tight_layout()
    plt.show()

# Examples from the validation set
plot_predictions(tl_val_ds, tl_model, device, font_names, num_images=9)


# THE END