<a href="https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/Donut/CORD/Fine_tune_Donut_on_a_custom_dataset_(CORD)_with_PyTorch_Lightning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Set-up environment

First, let's install the relevant libraries:
* 🤗 Transformers, for the model
* 🤗 Datasets, for loading + processing the data
* PyTorch Lightning, for training the model 
* Weights and Biases, for logging metrics during training
* Sentencepiece, used for tokenization.

We'll use PyTorch Lightning for training here, but this is optional: you can also train in native PyTorch, or use 🤗 Accelerate or the 🤗 Trainer.

In [1]:
from shared import CONFIGS_DIR
from hydra import compose, initialize_config_dir
tests_config_dir = CONFIGS_DIR / "llm"
with initialize_config_dir(config_dir=str(tests_config_dir), job_name="test", version_base="1.1"):
    config =  compose(config_name="evaluation_config")

In [2]:
from dataset.utils import get_dataset

dataset = get_dataset(config)
dataset.shuffle()

/Users/user/Projects/AI_Osrodek/configs


Using the latest cached version of the module from /Users/user/.cache/huggingface/modules/datasets_modules/datasets/artpods56--EcclesialSchematisms/a724fb67b8d27b253099abcec772b48750d32437756298596b42ac20c9a96a1b (last modified on Sun Jun 22 18:05:43 2025) since it couldn't be found locally at artpods56/EcclesialSchematisms, or remotely on the Hugging Face Hub.


Generating train split:   0%|          | 0/482 [00:00<?, ? examples/s]

Loading label annotations…
Loaded 482 annotations.
Iterating /Users/user/.cache/huggingface/hub/datasets--artpods56--EcclesialSchematisms/snapshots/e6ec756a6518b714428e73fb399333d68dc05a6a/images/images_part5.tar
Iterating /Users/user/.cache/huggingface/hub/datasets--artpods56--EcclesialSchematisms/snapshots/e6ec756a6518b714428e73fb399333d68dc05a6a/images/images_part13.tar
Iterating /Users/user/.cache/huggingface/hub/datasets--artpods56--EcclesialSchematisms/snapshots/e6ec756a6518b714428e73fb399333d68dc05a6a/images/images_part1.tar
Iterating /Users/user/.cache/huggingface/hub/datasets--artpods56--EcclesialSchematisms/snapshots/e6ec756a6518b714428e73fb399333d68dc05a6a/images/images_part3.tar
Iterating /User

Dataset({
    features: ['image_pil', 'image', 'width', 'height', 'words', 'bboxes', 'labels', 'conf'],
    num_rows: 482
})

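The next cell builds the model configuration, splits the data into training and validation sets, loads the model and processor, and wraps both splits in `DonutDataset`: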

In [None]:
#  ⬇️ ONE-CELL DONUT FINE-TUNING — NOW WITH YOUR DonutDataset ⬇️
#  (put this in a fresh cell, adjust paths / hyper-params as needed)

from datasets import load_dataset
from transformers import (
    DonutProcessor,
    VisionEncoderDecoderConfig,   # needed for the config created below
    VisionEncoderDecoderModel,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    default_data_collator,
)
from torch.utils.data import Dataset
import torch, itertools, re, random
from typing import Any, List
from donut.utils import DonutDataset

# ─────────────────── 1.  DonutDataset (minor tweak: dict output) ──────────────────
added_tokens = []


image_size = [1280, 960]
max_length = 768

# --- config
config = VisionEncoderDecoderConfig.from_pretrained("naver-clova-ix/donut-base")
config.encoder.image_size = image_size # (height, width)
config.decoder.max_length = max_length


# ─────────────────── 2.  Load raw HF splits ───────────────────────────────────────
train_test_split = dataset.train_test_split(test_size=0.2, seed=42)
# ─────────────────── 3.  Model & Processor ────────────────────────────────────────
base_ckpt = "naver-clova-ix/donut-base"
processor = DonutProcessor.from_pretrained(base_ckpt)
model     = VisionEncoderDecoderModel.from_pretrained(base_ckpt, config=config)

# ─────────────────── 4.  Wrap with DonutDataset ───────────────────────────────────
MAX_LEN = 512
train_ds = DonutDataset(train_test_split['train'], processor, model, max_length=MAX_LEN, split="train")
val_ds   = DonutDataset(train_test_split['test'], processor, model, max_length=MAX_LEN, split="train")


Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`


In [6]:

# ─────────────────── 5.  Simple exact-match metric ────────────────────────────────
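import numpy as np

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    if isinstance(logits, tuple):          # some models return (logits, extra outputs)
        logits = logits[0]
    preds = np.argmax(logits, axis=-1)
    # -100 marks ignored label positions and is not a valid token ID; restore the pad token before decoding
    labels = np.where(labels != -100, labels, processor.tokenizer.pad_token_id)
    p_txt = processor.batch_decode(preds,  skip_special_tokens=True)
    l_txt = processor.batch_decode(labels, skip_special_tokens=True)
    return {"exact_match": sum(p.strip() == t.strip() for p, t in zip(p_txt, l_txt)) / len(p_txt)}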

# ─────────────────── 6.  Training setup ───────────────────────────────────────────
args = Seq2SeqTrainingArguments(
    output_dir="donut-kv",
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    num_train_epochs=5,
    learning_rate=2e-5,
    eval_strategy="epoch",
    save_strategy="epoch",
    predict_with_generate=False,
    logging_steps=50,
)

trainer = Seq2SeqTrainer(
    model           = model,
    args            = args,
    train_dataset   = train_ds,
    eval_dataset    = val_ds,
    data_collator   = default_data_collator,   # works because items are dicts
    compute_metrics = compute_metrics,
)


In [7]:
trainer.train()

[34m[1mwandb[0m: Currently logged in as: [33martpods56[0m ([33martpods56-ar-prod[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Keyword argument `random_padding` is not a valid argument for this processor and will be ignored.


ValueError: Make sure to set the decoder_start_token_id attribute of the model's configuration.

In [None]:
from transformers import VisionEncoderDecoderConfig

image_size = [1280, 960]
max_length = 768

config = VisionEncoderDecoderConfig.from_pretrained("naver-clova-ix/donut-base")
config.encoder.image_size = image_size # (height, width)
config.decoder.max_length = max_length

Next, we instantiate the model with our custom config, as well as the processor. Make sure that all pre-trained weights are correctly loaded (a warning would tell you if that's not the case).

In [None]:
import torch
from transformers import DonutProcessor, VisionEncoderDecoderModel

# use the same checkpoint the config was loaded from
processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base", config=config)

device = torch.device("cpu")      # or accelerator.device
model.to(device)

## Create PyTorch dataset

Here we create a regular PyTorch dataset.

The model doesn't directly take the (image, JSON) pairs as input and labels. Rather, we create `pixel_values` and `labels`. Both are PyTorch tensors. The `pixel_values` are the input images (resized, padded and normalized), and the `labels` are the `input_ids` of the target sequence (which is a flattened version of the JSON), with padding tokens replaced by -100 (to make sure these are ignored by the loss function). Both are created using `DonutProcessor` (which internally combines an image processor, for the image modality, and a tokenizer, for the text modality).
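The label masking described above can be sketched with plain Python lists standing in for tensors. This is only an illustration: the helper `mask_padding`, the `pad_token_id` of 1 and the example token IDs are made up, and the real values come from `processor.tokenizer`.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def mask_padding(input_ids, pad_token_id, ignore_index=IGNORE_INDEX):
    """Return a copy of input_ids with every padding position set to ignore_index."""
    return [ignore_index if tok == pad_token_id else tok for tok in input_ids]


# hypothetical values: pad_token_id=1 and a target sequence padded to length 8
pad_token_id = 1
input_ids = [57579, 4321, 87, 2, 1, 1, 1, 1]
labels = mask_padding(input_ids, pad_token_id)
print(labels)
```

Positions set to -100 contribute nothing to the loss, so the model is only trained on the real target tokens.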

Note that we're also adding tokens to the vocabulary of the decoder (and corresponding tokenizer) for all keys of the dictionaries in our dataset, like "\<s_menu>". This makes sure the model learns an embedding vector for them. Without doing this, some keys might get split up into multiple subword tokens, in which case the model just learns an embedding for the subword tokens, rather than a direct embedding for these keys.
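To illustrate how a JSON ground truth might be flattened into a target sequence while collecting one start and one end token per key, here is a simplified sketch. It is not the actual `DonutDataset` code: the function `json_to_tokens`, the `<sep/>` separator for list items and the example dictionary are assumptions made for illustration.

```python
def json_to_tokens(obj, new_tokens):
    """Flatten nested dicts/lists into a tagged string, recording the tags used."""
    if isinstance(obj, dict):
        out = ""
        for key, value in obj.items():
            start, end = f"<s_{key}>", f"</s_{key}>"
            new_tokens.update([start, end])  # remember the tags so they can be added to the vocabulary
            out += start + json_to_tokens(value, new_tokens) + end
        return out
    if isinstance(obj, list):
        # list items are joined with a separator token (assumed here to be <sep/>)
        return "<sep/>".join(json_to_tokens(item, new_tokens) for item in obj)
    return str(obj)


new_tokens = set()
ground_truth = {"menu": {"nm": "Latte", "price": "4.50"}, "total": "4.50"}  # made-up example
sequence = json_to_tokens(ground_truth, new_tokens)
print(sequence)
print(sorted(new_tokens))
```

In a real pipeline the collected tokens would then be added to the tokenizer, and the decoder's embedding matrix resized to match, so that each key gets its own embedding.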

In [None]:
%load_ext autoreload
%autoreload 2

import json
from donut.utils import DonutDataset
    
train_dataset = DonutDataset(dataset=train_test_split["train"],
                             processor=processor,
                             model=model,
                             max_length=max_length,
                             split="train", task_start_token="<s_cord-v2>", prompt_end_token="<s_cord-v2>",
                             sort_json_key=False, # cord dataset is preprocessed, so no need for this
                             )

val_dataset = DonutDataset(dataset=train_test_split["test"], 
                           processor=processor, 
                           model=model,
                           max_length=max_length,
                           split="train", task_start_token="<s_cord-v2>", prompt_end_token="<s_cord-v2>",
                           sort_json_key=False, # cord dataset is preprocessed, so no need for this
                             )

Next, we update the image processor settings to match our fine-tuning setup:

In [None]:
# we update some settings which differ from pretraining; namely the size of the images + no rotation required
# source: https://github.com/clovaai/donut/blob/master/config/train_cord.yaml
processor.image_processor.size = image_size[::-1] # should be (width, height)
processor.image_processor.do_align_long_axis = False



# val_dataset = DonutDataset("naver-clova-ix/cord-v2", max_length=max_length,
#                              split="validation", task_start_token="<s_cord-v2>", prompt_end_token="<s_cord-v2>",
#                              sort_json_key=False, # cord dataset is preprocessed, so no need for this
#                              )

In [None]:
train_dataset

Let's check which tokens are added:

In [None]:
# the vocab_size attribute stays constant (might be a bit unintuitive, but it doesn't include added special tokens)
print("Original number of tokens:", processor.tokenizer.vocab_size)
print("Number of tokens after adding special tokens:", len(processor.tokenizer))

You can verify that a token like `</s_unitprice>` was added to the vocabulary of the tokenizer (and the model):

In [None]:
processor.decode([57560])

As always, it's very important to verify whether our data is prepared correctly. Let's check the first training example:

In [None]:
pixel_values, labels, target_sequence = train_dataset[0]

This returns the `pixel_values` (the image, prepared for the model as a PyTorch tensor), the `labels` (the encoded `input_ids` of the target sequence, which we want Donut to learn to generate) and the original `target_sequence`. We also return the latter because it lets us compute metrics between the generated sequences and the ground truth target sequences.
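One possible way to compare a generated sequence with its `target_sequence` is a normalized edit distance. The sketch below is a plain-Python illustration under that assumption; the function names are hypothetical and this is not necessarily the metric used later in this notebook.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def normalized_edit_distance(prediction, target):
    """0.0 means the strings are identical; larger values mean more edits are needed."""
    longest = max(len(prediction), len(target))
    return edit_distance(prediction, target) / longest if longest else 0.0


# made-up prediction and target
print(normalized_edit_distance("<s_total>4.50</s_total>", "<s_total>4.80</s_total>"))
```

Unlike exact match, this score gives partial credit when a prediction is wrong in only a few characters.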

In [None]:
print(pixel_values.shape)

In [None]:
# let's print the labels (the first 30 token IDs)
for token_id in labels.tolist()[:30]:
  if token_id != -100:
    print(processor.decode([token_id]))
  else:
    print(token_id)

In [None]:
# let's check the corresponding target sequence, as a string
print(target_sequence)

In [None]:
model.config.pad_token_id = processor.tokenizer.pad_token_id
model.config.decoder_start_token_id = processor.tokenizer.convert_tokens_to_ids(['<s_cord-v2>'])[0]

In [None]:
# sanity check
print("Pad token:", processor.decode([model.config.pad_token_id]))
print("Decoder start token:", processor.decode([model.config.decoder_start_token_id]))

## Create PyTorch DataLoaders

Next, we create corresponding PyTorch DataLoaders, which allow us to loop over the dataset in batches:

In [None]:
from torch.utils.data import DataLoader

# feel free to increase the batch size if you have a lot of memory
# I'm fine-tuning on Colab and given the large image size, batch size > 1 is not feasible
train_dataloader = DataLoader(train_dataset, batch_size=1, shuffle=True, num_workers=4)
val_dataloader = DataLoader(val_dataset, batch_size=1, shuffle=False, num_workers=4)

Let's verify a batch:

In [None]:
batch = next(iter(train_dataloader))
pixel_values, labels, target_sequences = batch
print(pixel_values.shape)

In [None]:
for token_id in labels.squeeze().tolist()[:30]:
  if token_id != -100:
    print(processor.decode([token_id]))
  else:
    print(token_id)

In [None]:
print(len(train_dataset))
print(len(val_dataset))

In [None]:
# let's check the first validation batch
batch = next(iter(val_dataloader))
pixel_values, labels, target_sequences = batch
print(pixel_values.shape)

In [None]:
import torch
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class DonutCollator:
    """Stack pixel values (same H, W per batch) and pad labels to the longest sequence."""

    pad_token_id: int = processor.tokenizer.pad_token_id

    def __call__(
        self, features: List[Tuple[torch.Tensor, torch.Tensor, str]]
    ) -> Dict[str, torch.Tensor]:
        """Collate a list of dataset items into a batch.

        Each item is a tuple, not a dictionary, holding
        pixel_values, labels and target_sequence in that order.
        """
        
        
        pixel_values = torch.stack([f[0] for f in features])
        labels = torch.nn.utils.rnn.pad_sequence(
            [f[1] for f in features],
            batch_first=True,
            padding_value=self.pad_token_id,
        )
        labels[labels == self.pad_token_id] = -100
        return dict(pixel_values=pixel_values, labels=labels)

collator = DonutCollator()
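The padding step of the collator can be pictured with plain Python lists. This is a rough sketch only: the helper `pad_labels` is hypothetical and pads directly with -100, which gives the same result as padding with the pad token and masking it afterwards.

```python
def pad_labels(sequences, padding_value=-100):
    """Right-pad every sequence to the length of the longest one in the batch."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [padding_value] * (longest - len(seq)) for seq in sequences]


# two made-up label sequences of different lengths
batch = pad_labels([[57579, 12, 34, 2], [57579, 2]])
print(batch)
```

After padding, every row has the same length, so the batch can be stacked into a single tensor.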

In [None]:
train_dataset[0]

In [None]:
from transformers import Trainer, TrainingArguments


args = TrainingArguments(
    output_dir          = "/Users/user/Projects/AI_Osrodek/src/donut",
    overwrite_output_dir= True,              # re-use the folder across runs
    per_device_train_batch_size = 2,         # fits on a 16 GB GPU at 840×840
    per_device_eval_batch_size  = 2,
    gradient_accumulation_steps = 8,         # effective batch = 16 images
    num_train_epochs    = 5,
    learning_rate       = 5e-5,              # start high, reduce if unstable
    warmup_ratio        = 0.05,
    lr_scheduler_type   = "cosine",
    weight_decay        = 0.05,
    logging_steps       = 50,
    save_strategy       = "epoch",
    eval_strategy       = "epoch",           # evaluate at the end of every epoch
    push_to_hub         = False,             # flip to True if you want auto-upload
    remove_unused_columns = True,
)


# --- 7️⃣  Trainer -----------------------------------------------------------------

trainer = Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    
)
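To make the batch-size arithmetic in the arguments above concrete, here is a rough sketch that estimates the effective batch size, the number of optimizer steps and the warmup length. The helper `training_schedule` is hypothetical, the figure of 385 training examples assumes an 80/20 split of the 482 rows, and the `Trainer` may round slightly differently at epoch boundaries.

```python
import math


def training_schedule(num_examples, per_device_batch, grad_accum, epochs,
                      warmup_ratio, num_devices=1):
    """Estimate (effective batch, steps per epoch, total steps, warmup steps)."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return effective_batch, steps_per_epoch, total_steps, warmup_steps


# values taken from the TrainingArguments above; 385 examples is an assumption
schedule = training_schedule(385, per_device_batch=2, grad_accum=8,
                             epochs=5, warmup_ratio=0.05)
print(schedule)
```

With so few optimizer steps in total, `logging_steps = 50` produces only a handful of log entries over the whole run, so a smaller value may be more informative.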

In [None]:
trainer.train()

As a quick speed check, we time a single forward pass through the encoder on a dummy input:

In [None]:
import time
import torch

device = torch.device("cpu")      # or "mps" / accelerator.device
model.to(device)

# create the dummy batch on the same device as the model; mixing devices raises a runtime error
x = torch.randn(1, 3, 840, 840, device=device)
with torch.no_grad():
    t0 = time.time()
    _ = model.encoder(pixel_values=x)
    print("seconds:", time.time() - t0)
