<a href="https://colab.research.google.com/github/fsommers/ICMR24/blob/main/ICMR24_LayoutLMv3_3_FineTuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-Tuning LayoutLMv3 on a Custom Document Dataset

Having created our dataset, we can now use it to fine-tune the LayoutLMv3 BASE model.

We will use PyTorch Lightning. It's a good idea to install this library up front, as it requires that we restart the Colab session:

In [None]:
!pip -qqq install pytorch-lightning

Next, we specify the location of the files and labels, as before:

In [1]:
from pathlib import Path

from google.colab import drive
drive.mount('/content/drive')

BASE_DIR = Path('/content/drive/MyDrive')

EXAMPLES_DIR = Path(BASE_DIR / 'DOC_EXAMPLES' / 'TRAINING')
UNSEEN_DIR = Path(BASE_DIR / 'DOC_EXAMPLES' / 'UNSEEN')
DATASET_DIR = Path(BASE_DIR / 'DOC_DATA')
LOG_DIR = Path(BASE_DIR / 'DOC_LOGS')
MODELS_DIR = Path(BASE_DIR / 'UPDATED_DOC_MODELS')
DOCUMENT_LABELS = [
    'AZCONTRACT',
    'BOOKSHEET',
    'BUYERSGUIDE',
    'CACONTRACT',
    'CREDITAPP',
    'INSURANCE',
    'NVCONTRACT',
]

# The assumption is that training documents are pre-categorized and collected into subdirectories,
# with directory names corresponding to the document class names
training_dirs = { doc_label: EXAMPLES_DIR.joinpath(doc_label) for doc_label in DOCUMENT_LABELS }
print(training_dirs)

unseen_dirs = { doc_label: UNSEEN_DIR.joinpath(doc_label) for doc_label in DOCUMENT_LABELS }
print(unseen_dirs)

# We will need to keep a stable ordering of the class labels.
DOCUMENT_CLASSES = sorted(training_dirs.keys())
print(DOCUMENT_CLASSES)

import PIL.Image
PIL.Image.MAX_IMAGE_PIXELS = 503120000 # this is needed in case of larger images

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
{'AZCONTRACT': PosixPath('/content/drive/MyDrive/DOC_EXAMPLES/TRAINING/AZCONTRACT'), 'BOOKSHEET': PosixPath('/content/drive/MyDrive/DOC_EXAMPLES/TRAINING/BOOKSHEET'), 'BUYERSGUIDE': PosixPath('/content/drive/MyDrive/DOC_EXAMPLES/TRAINING/BUYERSGUIDE'), 'CACONTRACT': PosixPath('/content/drive/MyDrive/DOC_EXAMPLES/TRAINING/CACONTRACT'), 'CREDITAPP': PosixPath('/content/drive/MyDrive/DOC_EXAMPLES/TRAINING/CREDITAPP'), 'INSURANCE': PosixPath('/content/drive/MyDrive/DOC_EXAMPLES/TRAINING/INSURANCE'), 'NVCONTRACT': PosixPath('/content/drive/MyDrive/DOC_EXAMPLES/TRAINING/NVCONTRACT')}
{'AZCONTRACT': PosixPath('/content/drive/MyDrive/DOC_EXAMPLES/UNSEEN/AZCONTRACT'), 'BOOKSHEET': PosixPath('/content/drive/MyDrive/DOC_EXAMPLES/UNSEEN/BOOKSHEET'), 'BUYERSGUIDE': PosixPath('/content/drive/MyDrive/DOC_EXAMPLES/UNSEEN/BUYERSGUIDE'), 'CACONTRACT': PosixPath('/content/drive

Since the custom Dataset for the documents needs to prepare each document for the LayoutLMv3 model input, we need to declare the image preparation functions again. This is the same code that we used to create the dataset in the previous notebook:

In [2]:
from typing import List
import json
from PIL import Image

def scale_bounding_box(box: List[int], w_scale: float = 1.0, h_scale: float = 1.0):
  return [
      int(box[0] * w_scale),
      int(box[1] * h_scale),
      int(box[2] * w_scale),
      int(box[3] * h_scale)
  ]

def prepare_image(im_path):
  """
  Prepare an image to be passed to the processor.
  The image, the words, and the scaled bounding boxes for LayoutLM are returned.
  """
  img = Image.open(im_path).convert('RGB')
  width, height = img.size
  width_scale = 1000 / width
  height_scale = 1000 / height
  with im_path.with_suffix(".json").open("r") as f:
    ocr_data = json.load(f)
  words = []
  boxes = []
  for w, b in zip(ocr_data["words"], ocr_data["bbox"]):
    words.append(w)
    boxes.append(scale_bounding_box(b, width_scale, height_scale))

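  assert len(words) == len(boxes)
  # LayoutLMv3 expects box coordinates normalized to the 0-1000 range
  for bo in boxes:
    for z in bo:
      if z > 1000:
        raise ValueError(f"Bounding box coordinate {z} exceeds 1000 in {im_path}")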

  return img, words, boxes
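
To make the normalization concrete, here is a minimal, self-contained sketch of how a pixel-space box maps into the 0-1000 coordinate space used above. The page size and box values are made up for illustration, and the function is restated only so that the snippet runs on its own.

```python
from typing import List

def scale_bounding_box(box: List[int], w_scale: float = 1.0, h_scale: float = 1.0) -> List[int]:
    # Same logic as above: x coordinates use w_scale, y coordinates use h_scale
    return [
        int(box[0] * w_scale),
        int(box[1] * h_scale),
        int(box[2] * w_scale),
        int(box[3] * h_scale),
    ]

# Hypothetical 2000 x 4000 pixel page (illustrative values only)
width, height = 2000, 4000
pixel_box = [200, 400, 1000, 2000]  # [x0, y0, x1, y1] in pixels

normalized_box = scale_bounding_box(pixel_box, 1000 / width, 1000 / height)
print(normalized_box)  # [100, 100, 500, 500]
```

Because `int()` truncates, a box that touches the page edge stays at or below 1000, which is what the range check in `prepare_image` relies on.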

In [3]:
import torch
from torch.utils.data import Dataset

class DocumentClassficationDataset(Dataset):

  def __init__(self, image_paths, processor):
    self.image_paths = image_paths
    self.processor = processor

  def __len__(self):
    return len(self.image_paths)

  def __getitem__(self, item):
    image_path = self.image_paths[item]
    image, words, bbox = prepare_image(image_path)
    encoding = self.processor(
        image,
        words,
        boxes=bbox,
        max_length=512,
        padding="max_length",
        truncation=True,
        return_tensors="pt"
    )

    label = DOCUMENT_CLASSES.index(image_path.parent.name) # The parent directory name represents the label

    return dict(
        input_ids=encoding['input_ids'].flatten(),
        attention_mask=encoding['attention_mask'].flatten(),
        bbox=encoding['bbox'].flatten(end_dim=1),
        pixel_values=encoding['pixel_values'].flatten(end_dim=1),
        labels=torch.tensor(label, dtype=torch.long)
    )
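
The label lookup in `__getitem__` depends only on the name of the directory that contains the image. A small sketch of that mapping, using a hypothetical file name and `PurePosixPath` so that it runs without access to Google Drive:

```python
from pathlib import PurePosixPath

# Same class names as in this notebook, kept in sorted order
DOCUMENT_CLASSES = sorted([
    'AZCONTRACT', 'BOOKSHEET', 'BUYERSGUIDE', 'CACONTRACT',
    'CREDITAPP', 'INSURANCE', 'NVCONTRACT',
])

# Hypothetical file name; only the parent directory matters for the label
image_path = PurePosixPath(
    '/content/drive/MyDrive/DOC_EXAMPLES/TRAINING/NVCONTRACT/example_001.jpg'
)

class_name = image_path.parent.name
label = DOCUMENT_CLASSES.index(class_name)
print(class_name, label)  # NVCONTRACT 6
```

This is why the ordering of `DOCUMENT_CLASSES` has to stay stable between dataset creation, training, and prediction: the integer label is just a position in that list.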


We can now load the dataset:

In [4]:
import torch

train_dataset = torch.load(DATASET_DIR / 'train_dataset')
test_dataset = torch.load(DATASET_DIR / 'test_dataset')

In [5]:
for item in train_dataset:
  print(item['bbox'].shape)
  print(item['pixel_values'].shape)
  print(item['labels'].item())
  break

print(DOCUMENT_CLASSES)

torch.Size([512, 4])
torch.Size([3, 224, 224])
6
['AZCONTRACT', 'BOOKSHEET', 'BUYERSGUIDE', 'CACONTRACT', 'CREDITAPP', 'INSURANCE', 'NVCONTRACT']


Alternatively, it's sometimes useful to load a subset of the dataset to test the training loop. The following cell shows how to do that, although we'll use the full dataset here:



In [None]:
# To load a subset of the data. Not used below, may be useful to test the training loop:
from torch.utils.data import Subset

indices = list(range(10))
train_subset = Subset(train_dataset, indices)
test_subset = Subset(test_dataset, indices)

for item in train_subset:
  print(item['bbox'].shape)
  print(item['pixel_values'].shape)
  print(item['labels'].item())
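
If the saved dataset happens to be ordered by class, the first 10 items may all carry the same label. One possible alternative, sketched here under that assumption, is to pick the first few indices of each class. The helper `first_n_per_class` is hypothetical and not part of the original notebook, and the toy label list stands in for labels read from the real dataset.

```python
from collections import defaultdict
from typing import Dict, List, Sequence

def first_n_per_class(labels: Sequence[int], n: int) -> List[int]:
    """Return the indices of the first n items of each class, in dataset order."""
    taken: Dict[int, int] = defaultdict(int)
    indices: List[int] = []
    for idx, label in enumerate(labels):
        if taken[label] < n:
            taken[label] += 1
            indices.append(idx)
    return indices

# Toy labels standing in for [int(item['labels']) for item in train_dataset]
toy_labels = [0, 0, 0, 1, 1, 2, 0, 2, 2, 1]
subset_indices = first_n_per_class(toy_labels, n=2)
print(subset_indices)  # [0, 1, 3, 4, 5, 7]
```

The resulting list could then be passed to `Subset(train_dataset, subset_indices)` in place of `list(range(10))`.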

To fine-tune LayoutLMv3, we will use a custom PyTorch Lightning module. This is similar to a PyTorch module, but Lightning provides a convenient training API. Another option would be to use the Hugging Face Trainer.

In [6]:
import torch
import pytorch_lightning as L
from pytorch_lightning.callbacks import ModelCheckpoint
from torchmetrics import Accuracy
from transformers import LayoutLMv3ForSequenceClassification  # used by ModelModule below

class ModelModule(L.LightningModule):
    def __init__(self, n_classes:int):
        super().__init__()
        self.model = LayoutLMv3ForSequenceClassification.from_pretrained(
            "microsoft/layoutlmv3-base",
            num_labels=n_classes
        )
        self.model.config.id2label = {k: v for k, v in enumerate(DOCUMENT_CLASSES)}
        self.model.config.label2id = {v: k for k, v in enumerate(DOCUMENT_CLASSES)}
        self.train_accuracy = Accuracy(task="multiclass", num_classes=n_classes)
        self.val_accuracy = Accuracy(task="multiclass", num_classes=n_classes)

    def forward(self, input_ids, attention_mask, bbox, pixel_values, labels=None):
        return self.model(
            input_ids,
            attention_mask=attention_mask,
            bbox=bbox,
            pixel_values=pixel_values,
            labels=labels
        )

    def training_step(self, batch, batch_idx):
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        bbox = batch["bbox"]
        pixel_values = batch["pixel_values"]
        labels = batch["labels"]
        output = self(input_ids, attention_mask, bbox, pixel_values, labels)
        self.log("train_loss", output.loss)
        self.log("train_acc", self.train_accuracy(output.logits, labels), on_step=True, on_epoch=True)
        return output.loss

    def validation_step(self, batch, batch_idx):
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        bbox = batch["bbox"]
        pixel_values = batch["pixel_values"]
        labels = batch["labels"]
        output = self(input_ids, attention_mask, bbox, pixel_values, labels)
        self.log("val_loss", output.loss)
        self.log("val_acc", self.val_accuracy(output.logits, labels), on_step=False, on_epoch=True)
        return output.loss

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.model.parameters(), lr=0.00001) #1e-5
        return optimizer
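
As a rough sanity check for the logged metrics: for single-label classification the model's loss should be a cross-entropy loss, so a classifier that predicts all 7 classes with equal probability would sit at a loss of ln(7). The sketch below computes that reference value. The chance-level accuracy assumes balanced classes, which is an assumption and not something this notebook states.

```python
import math

n_classes = 7  # number of document classes in this notebook

# Cross-entropy of a uniform prediction: -log(1 / n_classes)
uniform_loss = -math.log(1.0 / n_classes)
# Accuracy of random guessing, assuming balanced classes
chance_accuracy = 1.0 / n_classes

print(f"loss at chance level: {uniform_loss:.4f}")         # 1.9459
print(f"accuracy at chance level: {chance_accuracy:.4f}")  # 0.1429
```

A `train_loss` that starts near this value and then falls is what one would expect from a freshly initialized classification head.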


We can now load the LayoutLMv3 feature extractor, tokenizer, and processor. We don't apply OCR in the feature extractor, since we've already pre-processed our documents:

In [7]:
from transformers import LayoutLMv3ImageProcessor, LayoutLMv3TokenizerFast, LayoutLMv3Processor, LayoutLMv3ForSequenceClassification

feature_extractor = LayoutLMv3ImageProcessor(apply_ocr=False)
tokenizer = LayoutLMv3TokenizerFast.from_pretrained("microsoft/layoutlmv3-base")  # num_labels belongs to the model, not the tokenizer
processor = LayoutLMv3Processor(feature_extractor, tokenizer)

tokenizer_config.json:   0%|          | 0.00/1.14k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

We can now define two data loaders, one for the training data and one for the test data:

In [8]:
from torch.utils.data import DataLoader

train_data_loader = DataLoader(
    train_dataset,
    batch_size=5,
    shuffle=False,
    num_workers=2
)

test_data_loader = DataLoader(
    test_dataset,
    batch_size=5,
    shuffle=False,
    num_workers=2
)

Before starting the training, we need to define a model checkpoint callback. Here the file name will include the validation loss, which will help us select the checkpoint with the lowest loss. We also keep the top 3 checkpoints:

In [9]:
model_checkpoint = ModelCheckpoint(
    filename="{epoch}-{step}-{val_loss:.4f}",
    save_last=True,
    save_top_k=3,
    monitor="val_loss",
    mode="min"
)

In [10]:
model_module = ModelModule(len(DOCUMENT_CLASSES))
print("Document classes: ", len(DOCUMENT_CLASSES))

config.json:   0%|          | 0.00/856 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/501M [00:00<?, ?B/s]

Some weights of LayoutLMv3ForSequenceClassification were not initialized from the model checkpoint at microsoft/layoutlmv3-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Document classes:  7


The final task is to define a Trainer. We specify that we're training for 5 epochs, using 16-bit (float16) mixed precision, on a single GPU:

In [11]:
trainer = L.Trainer(
    default_root_dir=MODELS_DIR,
    accelerator='gpu',
    precision=16,
    devices=1,
    max_epochs=5,
    callbacks=[
        model_checkpoint
    ],
)

/usr/local/lib/python3.10/dist-packages/lightning_fabric/connector.py:563: `precision=16` is supported for historical reasons but its usage is discouraged. Please set your precision to 16-mixed instead!
INFO:pytorch_lightning.utilities.rank_zero:Using 16bit Automatic Mixed Precision (AMP)
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
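
The warning above recommends `16-mixed` in place of `precision=16`. A minimal sketch of the same settings with that value is shown below as a plain dictionary, so it runs without Lightning installed. The commented-out line shows how it might be passed to the Trainer, assuming the names defined earlier in this notebook.

```python
# Keyword arguments intended for pytorch_lightning.Trainer
trainer_kwargs = dict(
    accelerator='gpu',
    precision='16-mixed',  # replaces the discouraged precision=16
    devices=1,
    max_epochs=5,
)

# In the notebook this could be used as follows
# (MODELS_DIR and model_checkpoint are defined in earlier cells):
# trainer = L.Trainer(default_root_dir=MODELS_DIR, callbacks=[model_checkpoint], **trainer_kwargs)
print(trainer_kwargs['precision'])  # 16-mixed
```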


Training for 5 epochs takes about 2 hours on an A100 with a dataset of about 3,500 documents.

Note that it may not be necessary to train for 5 epochs. Finding the point where the validation loss stops improving is largely a matter of experience and trial and error:

In [12]:
trainer.fit(model_module, train_data_loader, test_data_loader)

INFO:pytorch_lightning.utilities.rank_zero:You are using a CUDA device ('NVIDIA A100-SXM4-40GB') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name           | Type                                | Params
-----------------------------------------------------------------------
0 | model          | LayoutLMv3ForSequenceClassification | 125 M 
1 | train_accuracy | MulticlassAccuracy                  | 0     
2 | val_accuracy   | MulticlassAccuracy                  | 0     
-----------------------------------------------------------------------
125 M     Trainable params
0         Non-trainable param

Sanity Checking: |          | 0/? [00:00<?, ?it/s]

  self.pid = os.fork()


Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=5` reached.


Once the training has completed, we can view the best 3 models:

In [13]:
model_checkpoint.best_k_models

{'/content/drive/MyDrive/UPDATED_DOC_MODELS/lightning_logs/version_1/checkpoints/epoch=1-step=1120-val_loss=0.0894.ckpt': tensor(0.0894, device='cuda:0'),
 '/content/drive/MyDrive/UPDATED_DOC_MODELS/lightning_logs/version_1/checkpoints/epoch=2-step=1680-val_loss=0.0698.ckpt': tensor(0.0698, device='cuda:0'),
 '/content/drive/MyDrive/UPDATED_DOC_MODELS/lightning_logs/version_1/checkpoints/epoch=4-step=2800-val_loss=0.0836.ckpt': tensor(0.0836, device='cuda:0')}
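
Because the checkpoint file names encode the epoch, the global step, and the validation loss, the best checkpoint can also be picked by parsing the names. The sketch below uses the three file names from the output above; the regular expression is an assumption that matches the `filename` template used in this notebook.

```python
import re

# File names copied from the best_k_models output above
checkpoint_names = [
    "epoch=1-step=1120-val_loss=0.0894.ckpt",
    "epoch=2-step=1680-val_loss=0.0698.ckpt",
    "epoch=4-step=2800-val_loss=0.0836.ckpt",
]

pattern = re.compile(r"epoch=(\d+)-step=(\d+)-val_loss=([0-9.]+)\.ckpt")

parsed = []
for name in checkpoint_names:
    match = pattern.fullmatch(name)
    epoch, step, val_loss = int(match.group(1)), int(match.group(2)), float(match.group(3))
    parsed.append((val_loss, epoch, step, name))

# Tuples sort by their first element, so min() picks the lowest val_loss
best_loss, best_epoch, best_step, best_name = min(parsed)

# Epochs are zero-indexed, so the step count after epoch e covers e + 1 epochs
steps_per_epoch = best_step // (best_epoch + 1)

print(best_name)        # epoch=2-step=1680-val_loss=0.0698.ckpt
print(steps_per_epoch)  # 560
```

With a batch size of 5, 560 steps per epoch corresponds to roughly 2,800 training documents. Within the notebook, `model_checkpoint.best_model_path` should give the path of the best checkpoint directly.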

## Evaluation

To evaluate the model, first load the best checkpoint:

In [None]:
trained_model = ModelModule.load_from_checkpoint(
    MODELS_DIR / "lightning_logs/version_1/checkpoints/epoch=2-step=1680-val_loss=0.0698.ckpt",
    n_classes=len(DOCUMENT_CLASSES)
)

In [19]:
DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"

In [21]:
trained_model.eval().to(DEVICE)

ModelModule(
  (model): LayoutLMv3ForSequenceClassification(
    (layoutlmv3): LayoutLMv3Model(
      (embeddings): LayoutLMv3TextEmbeddings(
        (word_embeddings): Embedding(50265, 768, padding_idx=1)
        (token_type_embeddings): Embedding(1, 768)
        (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (position_embeddings): Embedding(514, 768, padding_idx=1)
        (x_position_embeddings): Embedding(1024, 128)
        (y_position_embeddings): Embedding(1024, 128)
        (h_position_embeddings): Embedding(1024, 128)
        (w_position_embeddings): Embedding(1024, 128)
      )
      (patch_embed): LayoutLMv3PatchEmbeddings(
        (proj): Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16))
      )
      (pos_drop): Dropout(p=0.0, inplace=False)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
      (norm): LayerNorm((768,), eps

As you can see above, the classification head's output projection now has 7 features.

In [24]:
def predict(im):
  image, words, bbox = prepare_image(im)
  encoding = processor(
      image,
      words,
      boxes=bbox,
      max_length=512,
      padding="max_length",
      truncation=True,
      return_tensors="pt"
  )
  with torch.inference_mode():
    output = trained_model(
        input_ids=encoding["input_ids"].to(DEVICE),
        attention_mask=encoding["attention_mask"].to(DEVICE),
        bbox=encoding["bbox"].to(DEVICE),
        pixel_values=encoding["pixel_values"].to(DEVICE)
    )
    predicted_class = output.logits.argmax()
    item = predicted_class.item()
    return DOCUMENT_CLASSES[item]
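
`predict` returns only the class name. If a confidence score is also wanted, the logits can be turned into probabilities with a softmax. The sketch below does this in plain Python with made-up logits. In the notebook, the values would come from something like `output.logits[0].tolist()`, and `top_prediction` is a hypothetical helper, not part of the original code.

```python
import math
from typing import List, Tuple

DOCUMENT_CLASSES = ['AZCONTRACT', 'BOOKSHEET', 'BUYERSGUIDE', 'CACONTRACT',
                    'CREDITAPP', 'INSURANCE', 'NVCONTRACT']

def softmax(logits: List[float]) -> List[float]:
    # Subtract the maximum before exponentiating for numerical stability
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_prediction(logits: List[float]) -> Tuple[str, float]:
    """Return the most likely class name and its softmax probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return DOCUMENT_CLASSES[best], probs[best]

# Made-up logits for illustration only
example_logits = [4.0, 0.5, -1.0, 1.0, 0.0, -0.5, 0.2]
name, confidence = top_prediction(example_logits)
print(name, round(confidence, 3))
```

A low top probability can be a useful signal for routing a document to manual review, although softmax scores are not calibrated probabilities.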

In [26]:
test_image = list(unseen_dirs['AZCONTRACT'].glob('*.jpg'))[2]
predicted = predict(test_image)
print(f"Predicted class: {predicted}")
# im = Image.open(test_image)
# im

Predicted class: AZCONTRACT


We can now use the unseen document collection to test our classifier. This will run for a while (the run below took about 10 minutes for 700 documents on an A100):

In [28]:
from tqdm import tqdm

labels = []
images = []
predictions = []
for lab, doc_dir in unseen_dirs.items():  # doc_dir avoids shadowing the built-in dir()
  for f in doc_dir.glob('*.jpg'):
    labels.append(lab)
    images.append(f)

for image_path in tqdm(images):
  pred = predict(image_path)
  predictions.append(pred)


100%|██████████| 700/700 [09:33<00:00,  1.22it/s]


In [None]:
%matplotlib inline

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

cm = confusion_matrix(labels, predictions, labels=DOCUMENT_CLASSES)
cm_display = ConfusionMatrixDisplay(
    confusion_matrix = cm,
    display_labels=DOCUMENT_CLASSES
)
cm_display.plot()
cm_display.ax_.set_xticklabels(DOCUMENT_CLASSES, rotation=45)
cm_display.figure_.set_size_inches(16, 8)

plt.show()
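
The confusion matrix can be complemented with a few summary numbers. The sketch below computes overall accuracy and per-class recall from two parallel lists shaped like `labels` and `predictions` above. The toy data is for illustration only and does not reflect the model's actual results.

```python
from collections import Counter
from typing import Dict, List

def accuracy(labels: List[str], predictions: List[str]) -> float:
    """Fraction of documents whose predicted class matches the true class."""
    return sum(y == p for y, p in zip(labels, predictions)) / len(labels)

def per_class_recall(labels: List[str], predictions: List[str]) -> Dict[str, float]:
    """Fraction of documents of each true class that were predicted correctly."""
    totals = Counter(labels)
    correct = Counter(y for y, p in zip(labels, predictions) if y == p)
    return {cls: correct[cls] / totals[cls] for cls in sorted(totals)}

# Toy data for illustration only; not results from the model above
toy_labels      = ['AZCONTRACT', 'AZCONTRACT', 'CACONTRACT', 'CACONTRACT', 'INSURANCE']
toy_predictions = ['AZCONTRACT', 'CACONTRACT', 'CACONTRACT', 'CACONTRACT', 'INSURANCE']

print(accuracy(toy_labels, toy_predictions))          # 0.8
print(per_class_recall(toy_labels, toy_predictions))  # {'AZCONTRACT': 0.5, 'CACONTRACT': 1.0, 'INSURANCE': 1.0}
```

Per-class recall corresponds to reading each row of the confusion matrix: the diagonal entry divided by the row total.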

## Upload the model to Hugging Face

In [31]:
!pip install -qqq huggingface_hub

In [32]:
from huggingface_hub import notebook_login

In [35]:
notebook_login()


In [None]:
trained_model.model.push_to_hub("layoutlmv3-autofinance-classification-us-v02")

CommitInfo(commit_url='https://huggingface.co/fsommers/layoutlmv3-autofinance-classification-us-v01/commit/567aebfce3ff01b0c0baa2ff1d9865e142d2f7fd', commit_message='Upload LayoutLMv3ForSequenceClassification', commit_description='', oid='567aebfce3ff01b0c0baa2ff1d9865e142d2f7fd', pr_url=None, pr_revision=None, pr_num=None)

In [None]:
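%load_ext tensorboard
# Lightning wrote its logs under default_root_dir (MODELS_DIR), as the checkpoint paths above show
%tensorboard --logdir /content/drive/MyDrive/UPDATED_DOC_MODELS/lightning_logs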