# Final Experiment Run – LayoutLMv3 on FUNSD

## Purpose
This notebook executes the **final training and evaluation run** of **LayoutLMv3** on the **full FUNSD dataset**.
The experimental setup is designed to be **comparable to the LayoutLM baseline**:
- same dataset split strategy (train/test)
- same evaluation metric (entity-level Precision/Recall/F1 using seqeval)
- same random seed and training procedure (fine-tuning, Trainer-based)

## Key difference from LayoutLM
LayoutLMv3 incorporates **visual information** (document images) in addition to text and layout.
Therefore, the model inputs include `pixel_values` in addition to `input_ids`, `attention_mask`, `bbox`, and `labels`.

## Imports and Reproducibility
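We fix the Python, NumPy, and PyTorch random seeds to improve reproducibility, and set `TOKENIZERS_PARALLELISM=false` so that the `tokenizers` library does not use its own thread pool, which can deadlock when processes are forked in a notebook.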

In [1]:
import os
import json
import random
from datetime import datetime

import numpy as np
import torch
from datasets import load_dataset, Dataset

from transformers import (
    LayoutLMv3Processor,
    LayoutLMv3ForTokenClassification,
    TrainingArguments,
    Trainer,
    default_data_collator,
)

from seqeval.metrics import precision_score, recall_score, f1_score

os.environ["TOKENIZERS_PARALLELISM"] = "false"

seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)

<torch._C.Generator at 0x10ba6eed0>
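
The seeding calls in the cell above can also be bundled into one helper so that the same function is reused in the LayoutLM baseline notebook. The following is a minimal sketch, not part of the original run; `set_all_seeds` is a hypothetical name, and the NumPy and PyTorch branches are skipped when those packages are not installed.

```python
import os
import random


def set_all_seeds(seed: int) -> None:
    """Seed every random number generator that is importable in this environment."""
    os.environ["PYTHONHASHSEED"] = str(seed)  # only affects Python processes started later
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


# Re-seeding with the same value reproduces the same random draw.
set_all_seeds(42)
first = random.random()
set_all_seeds(42)
second = random.random()
print(first == second)  # True
```

`transformers.set_seed(seed)` provides similar behaviour out of the box; the `seed` argument of `TrainingArguments` is additionally applied by the Trainer at the start of training.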

## Full-Run Configuration (Comparable to LayoutLM)
To ensure comparability, we use:
- full train/test splits
- batch size = 1 (CPU-only environment)
- fixed number of epochs (3)
- identical seed

In [2]:
# Full-run parameters (match your LayoutLM baseline as closely as possible)
epochs = 3
batch_size = 1

dataset_name = "nielsr/funsd-layoutlmv3"
model_name = "microsoft/layoutlmv3-base"

run_id = datetime.now().strftime("%Y%m%d_%H%M%S")
output_dir = os.path.join("results", "funsd", "layoutlmv3", "fullrun", run_id)
os.makedirs(output_dir, exist_ok=True)

run_config = {
    "dataset": dataset_name,
    "model": model_name,
    "train_split": "train",
    "eval_split": "test",
    "epochs": epochs,
    "batch_size": batch_size,
    "seed": seed,
    "device": "cpu",
    "run_id": run_id,
    "output_dir": output_dir,
}

## Dataset Loading and Label Space
We load the LayoutLMv3-ready FUNSD dataset variant and extract the BIO label set.

In [3]:
funsd = load_dataset(dataset_name)
print(funsd)

train_split = "train"
eval_split = "test"

train_raw = funsd[train_split]
eval_raw = funsd[eval_split]

run_config["train_split_size"] = len(train_raw)
run_config["eval_split_size"] = len(eval_raw)

sample = train_raw[0]
print("Sample keys:", list(sample.keys()))

token_field = "words" if "words" in sample else ("tokens" if "tokens" in sample else None)
if token_field is None:
    raise KeyError("Expected 'words' or 'tokens' field in the dataset sample.")
print("Using token field:", token_field)

# Label list (BIO tags)
if "ner_tags" in funsd[train_split].features:
    label_list = funsd[train_split].features["ner_tags"].feature.names
else:
    raise KeyError("Expected 'ner_tags' in dataset features.")
num_labels = len(label_list)

run_config["num_labels"] = num_labels
run_config["label_list_preview"] = label_list[:10]

print("Number of labels:", num_labels)
print("First labels:", label_list[:10])

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'bboxes', 'ner_tags', 'image'],
        num_rows: 149
    })
    test: Dataset({
        features: ['id', 'tokens', 'bboxes', 'ner_tags', 'image'],
        num_rows: 50
    })
})
Sample keys: ['id', 'tokens', 'bboxes', 'ner_tags', 'image']
Using token field: tokens
Number of labels: 7
First labels: ['O', 'B-HEADER', 'I-HEADER', 'B-QUESTION', 'I-QUESTION', 'B-ANSWER', 'I-ANSWER']


## Processor and Model Initialization
LayoutLMv3 requires a processor (tokenizer + image processor).  
We explicitly set `apply_ocr=False` because we provide **existing bounding boxes** from the dataset.

In [4]:
processor = LayoutLMv3Processor.from_pretrained(model_name, apply_ocr=False)

model = LayoutLMv3ForTokenClassification.from_pretrained(
    model_name,
    num_labels=num_labels,
)

# Label mappings for clean logs/analysis
id2label = {i: lab for i, lab in enumerate(label_list)}
label2id = {lab: i for i, lab in enumerate(label_list)}
model.config.id2label = id2label
model.config.label2id = label2id

Some weights of LayoutLMv3ForTokenClassification were not initialized from the model checkpoint at microsoft/layoutlmv3-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Encoding (Text + Layout + Image)
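To make batching stable, we **pad to a fixed max length** (512) so that `input_ids`, `attention_mask`, `bbox`, and `labels` have the same shape for every example; `pixel_values` already have a fixed shape because the image processor resizes every page to the same resolution. Documents longer than 512 tokens are truncated.  
Labels are aligned to tokens such that only the **first subtoken** of each word receives a label; subsequent subtokens, special tokens, and padding are set to `-100`. This ensures that each word contributes exactly one tag to the entity-level evaluation with seqeval.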

In [5]:
def encode_funsd_examples_layoutlmv3(raw_ds, processor, token_field, max_length=512):
    encoded = []

    for example in raw_ds:
        words = example[token_field]

        # bboxes key can differ across dataset variants
        word_boxes = example["bboxes"] if "bboxes" in example else example["bbox"]

        # labels key (usually ner_tags)
        word_labels = example["ner_tags"] if "ner_tags" in example else example["labels"]

        if "image" not in example:
            raise KeyError("Expected an 'image' field for LayoutLMv3 inputs.")
        image = example["image"]

        # Processor combines image + tokens + boxes
        encoding = processor(
            image,
            words,
            boxes=word_boxes,
            truncation=True,
            padding="max_length",
            max_length=max_length,
            return_tensors="np",
        )

        input_ids = encoding["input_ids"][0]
        attention_mask = encoding["attention_mask"][0]
        bbox = encoding["bbox"][0]
        pixel_values = encoding["pixel_values"][0]

        # Obtain word_ids (LayoutLMv3 tokenizer requires boxes if called separately)
        try:
            word_ids = encoding.word_ids(batch_index=0)
        except Exception:
            tok = processor.tokenizer(
                words,
                boxes=word_boxes,
                truncation=True,
                padding="max_length",
                max_length=max_length,
                is_split_into_words=True,
            )
            word_ids = tok.word_ids()

        labels = []
        previous_word_id = None
        for word_id in word_ids:
            if word_id is None:
                labels.append(-100)
            else:
                if word_id != previous_word_id:
                    labels.append(int(word_labels[word_id]))
                else:
                    labels.append(-100)
            previous_word_id = word_id

        item = {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
            "bbox": bbox,
            "pixel_values": pixel_values,
            "labels": np.array(labels, dtype=np.int64),
        }
        encoded.append(item)

    return Dataset.from_list(encoded)
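
The label-alignment loop inside the function above can be illustrated in isolation. The sketch below uses a hand-written `word_ids` list instead of a real tokenizer output, so the tokenization shown is hypothetical; `align_labels` and `IGNORE_INDEX` are illustrative names, not part of the notebook.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss and by compute_metrics


def align_labels(word_ids, word_labels):
    """Give each word's first subtoken the word label; mask everything else."""
    labels = []
    previous_word_id = None
    for word_id in word_ids:
        if word_id is None or word_id == previous_word_id:
            # special/padding token, or a continuation subtoken of the same word
            labels.append(IGNORE_INDEX)
        else:
            labels.append(word_labels[word_id])
        previous_word_id = word_id
    return labels


# Hypothetical tokenization of two words:
# <s>, "Date:" (1 subtoken), "12/03/1998" (3 subtokens), </s>, <pad>
word_ids = [None, 0, 1, 1, 1, None, None]
word_labels = [3, 5]  # B-QUESTION, B-ANSWER in this notebook's label list

aligned = align_labels(word_ids, word_labels)
print(aligned)  # [-100, 3, 5, -100, -100, -100, -100]
```

The number of entries that differ from `-100` equals the number of words that survived truncation, which is a quick sanity check for an encoded example.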

## Build Torch Datasets
We encode the full train and test splits and convert them to torch tensors for Trainer compatibility.

In [6]:
train_dataset = encode_funsd_examples_layoutlmv3(
    train_raw, processor=processor, token_field=token_field, max_length=512
)
eval_dataset = encode_funsd_examples_layoutlmv3(
    eval_raw, processor=processor, token_field=token_field, max_length=512
)

train_dataset.set_format("torch")
eval_dataset.set_format("torch")

print("Train size:", len(train_dataset))
print("Eval size:", len(eval_dataset))

# sanity check: equal lengths after max_length padding
print(len(train_dataset[0]["input_ids"]), len(train_dataset[1]["input_ids"]))
print("Example keys:", train_dataset[0].keys())

Train size: 149
Eval size: 50
512 512
Example keys: dict_keys(['input_ids', 'attention_mask', 'bbox', 'pixel_values', 'labels'])


## Entity-level Evaluation (seqeval)
We report entity-level Precision/Recall/F1 based on BIO tags.  
Tokens with label `-100` are ignored (special tokens and non-first subtokens).

In [7]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)

    true_labels = []
    true_preds = []

    for pred_seq, label_seq in zip(preds, labels):
        seq_true = []
        seq_pred = []
        for p, l in zip(pred_seq, label_seq):
            l = int(l)
            if l == -100:
                continue
            p = int(p)
            seq_true.append(label_list[l])
            seq_pred.append(label_list[p])

        if len(seq_true) > 0:
            true_labels.append(seq_true)
            true_preds.append(seq_pred)

    return {
        "precision": precision_score(true_labels, true_preds),
        "recall": recall_score(true_labels, true_preds),
        "f1": f1_score(true_labels, true_preds),
    }
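
To make clear why entity-level scores are stricter than token-level accuracy, the following sketch computes precision, recall, and F1 over exact entity spans in pure Python. It is a simplified illustration under the assumption of plain BIO tags, not seqeval's implementation; `extract_entities` and `entity_prf` are hypothetical helper names.

```python
def extract_entities(tags):
    """Return the set of (type, start, end) spans encoded by a BIO tag sequence."""
    entities = set()
    start, current = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel "O" closes a trailing span
        prefix, _, etype = tag.partition("-")
        # Close the open span unless this tag continues it (I- tag of the same type).
        if current is not None and (prefix != "I" or etype != current):
            entities.add((current, start, i - 1))
            start, current = None, None
        # Open a new span on B-, or on an I- tag that has no open span to continue.
        if prefix == "B" or (prefix == "I" and current is None):
            start, current = i, etype
    return entities


def entity_prf(true_tags, pred_tags):
    """Precision, recall, and F1 over exactly matching entity spans."""
    true_ents = extract_entities(true_tags)
    pred_ents = extract_entities(pred_tags)
    correct = len(true_ents & pred_ents)
    precision = correct / len(pred_ents) if pred_ents else 0.0
    recall = correct / len(true_ents) if true_ents else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


true_tags = ["B-QUESTION", "I-QUESTION", "O", "B-ANSWER", "I-ANSWER"]
pred_tags = ["B-QUESTION", "I-QUESTION", "O", "B-ANSWER", "O"]

prf = entity_prf(true_tags, pred_tags)
print(prf)  # (0.5, 0.5, 0.5)
```

Four of the five predicted tags are correct, yet only one of the two entities matches exactly, because the truncated ANSWER span counts as both a missed entity and a spurious one.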

## Training and Evaluation
We fine-tune LayoutLMv3 on CPU using the Hugging Face `Trainer` API.  
To keep the run stable in notebooks, we set `dataloader_num_workers=0` and disable checkpoint saving (`save_strategy="no"`).
The model is evaluated on the test split after each epoch and once more after training finishes.

In [None]:
training_args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=epochs,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,

    evaluation_strategy="epoch",  # newer transformers releases name this argument eval_strategy
    save_strategy="no",

    logging_strategy="steps",
    logging_steps=50,

    seed=seed,
    report_to="none",

    dataloader_num_workers=0,
    dataloader_pin_memory=False,

    use_cpu=True,

    # Explicit hyperparams (optional, but good for reproducibility)
    learning_rate=5e-5,
    warmup_ratio=0.1,
    weight_decay=0.01,
)

# For LayoutLMv3 with fixed-length inputs, default_data_collator is robust (stacks pixel_values cleanly)
data_collator = default_data_collator

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
    tokenizer=processor.tokenizer,
    compute_metrics=compute_metrics,
)

train_result = trainer.train()
eval_metrics = trainer.evaluate()

print("Eval metrics:", eval_metrics)
print("Train result:", train_result)
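
The training arguments above imply a specific step budget, which is useful to know before starting a slow CPU run. The arithmetic below is a sketch that assumes the Trainer defaults not set explicitly in this notebook, namely a linear learning-rate schedule and no gradient accumulation; `lr_at` is a hypothetical helper, not a library function.

```python
import math

num_examples = 149   # size of the FUNSD train split
batch_size = 1
epochs = 3
warmup_ratio = 0.1
base_lr = 5e-5
logging_steps = 50

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)
num_log_events = total_steps // logging_steps

print(total_steps, warmup_steps, num_log_events)  # 447 45 8


def lr_at(step):
    """Learning rate after `step` optimizer steps: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))


print(lr_at(0), lr_at(warmup_steps), lr_at(total_steps))  # 0.0 5e-05 0.0
```

With these settings the learning rate peaks after 45 steps, roughly one third of the way through the first epoch, and the training loss is logged eight times over the whole run.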

## Saving Results for Reproducibility
We store the run configuration and evaluation metrics as JSON files.
This supports transparent reporting and later comparison with LayoutLM.

In [None]:
with open(os.path.join(output_dir, "run_config.json"), "w", encoding="utf-8") as f:
    json.dump(run_config, f, indent=2)

with open(os.path.join(output_dir, "metrics.json"), "w", encoding="utf-8") as f:
    json.dump(eval_metrics, f, indent=2)

print(f"Saved run_config.json and metrics.json to: {output_dir}")
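
For the later comparison with LayoutLM, the saved `metrics.json` files of both runs can be loaded and subtracted. The sketch below assumes that the baseline notebook wrote its metrics in the same format; the file names and the numbers are placeholders for demonstration only and are not results of any run. The `eval_` prefix on the keys is the one the Trainer adds to metrics returned by `evaluate()`.

```python
import json
import os
import tempfile


def load_metrics(path):
    """Read a metrics.json file written by a run."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def metric_deltas(candidate, baseline, keys=("eval_precision", "eval_recall", "eval_f1")):
    """Return candidate minus baseline for every key present in both dicts."""
    return {k: candidate[k] - baseline[k] for k in keys if k in candidate and k in baseline}


# Demonstration with placeholder values (NOT real results) in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    v3_path = os.path.join(tmp, "layoutlmv3_metrics.json")  # hypothetical file name
    v1_path = os.path.join(tmp, "layoutlm_metrics.json")    # hypothetical file name
    with open(v3_path, "w", encoding="utf-8") as f:
        json.dump({"eval_precision": 0.5, "eval_recall": 0.75, "eval_f1": 0.5}, f)
    with open(v1_path, "w", encoding="utf-8") as f:
        json.dump({"eval_precision": 0.25, "eval_recall": 0.5, "eval_f1": 0.25}, f)
    deltas = metric_deltas(load_metrics(v3_path), load_metrics(v1_path))

print(deltas)  # {'eval_precision': 0.25, 'eval_recall': 0.25, 'eval_f1': 0.25}
```

In practice the two paths would point to the `metrics.json` files inside each run's `output_dir`.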