# Experiment with LayoutLM Model

Notebooks and Documentation:
* [LayoutLMv3 model doc](https://huggingface.co/docs/transformers/model_doc/layoutlmv3)
* [Inference for LayoutLMv2 with labels](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/LayoutLMv2/FUNSD/Inference_with_LayoutLMv2ForTokenClassification.ipynb) and [Inference for LayoutLMv2 with no labels](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/LayoutLMv2/FUNSD/True_inference_with_LayoutLMv2ForTokenClassification_%2B_Gradio_demo.ipynb); both notebooks also show how to display images with bounding boxes inline
* [Original LayoutLM paper](https://arxiv.org/abs/1912.13318)

Blog Posts:
* Fine-tuning Transformer Model For Invoice Recognition: [v1](https://towardsdatascience.com/fine-tuning-transformer-model-for-invoice-recognition-1e55869336d4), [v2](https://towardsdatascience.com/fine-tuning-layoutlm-v2-for-invoice-recognition-91bf2546b19e), [v3](https://towardsdatascience.com/fine-tuning-layoutlm-v3-for-invoice-processing-e64f8d2c87cf)
* [Extracting entities from structured documents: a practical guide](https://medium.com/@ravisatvik.192/unleashing-the-power-of-layoutlm-extracting-entities-from-structured-documents-made-easy-5d82c6290ec7)
* [How to train LayoutLM on a custom dataset](https://medium.com/@matt.noe/tutorial-how-to-train-layoutlm-on-a-custom-dataset-with-hugging-face-cda58c96571c)

Datasets:
* DocBank: DocBank is a large-scale dataset for document layout analysis, understanding, and retrieval. It contains labeled tables and other elements in documents. 
* FUNSD: The [FUNSD dataset](https://huggingface.co/datasets/nielsr/funsd-layoutlmv3) contains labeled annotations for various elements in forms and tables in documents. While it primarily focuses on forms, it may contain some relevant data for table extraction.
* SciIE: The SciIE dataset focuses on scientific documents and contains annotations for entities, relations, and events. 
* CORD-19: This dataset contains a large collection of scholarly articles related to COVID-19. While it may not have explicit annotations for tables, it contains a wide range of document types, including tables, figures, and text.

Annotation:
* [UBIAI](https://ubiai.tools/) - basic free level allows unlimited annotation, accepts HTML
* [UBIAI tutorial video](https://www.youtube.com/watch?v=r1aoFj974FU&ab_channel=KarndeepSingh)
* [Label Studio](https://labelstud.io/guide/get_started.html#Quick-start) - open source, accepts HTML

Notes:
* From the LayoutLMv2 [usage guide](https://huggingface.co/docs/transformers/model_doc/layoutlmv2#usage-layoutlmv2processor): `LayoutLMv2Processor` uses PyTesseract, a Python wrapper around Google's Tesseract OCR engine, under the hood. Note that you can still use your own OCR engine of choice, and provide the words and normalized boxes yourself. This requires initializing `LayoutLMv2ImageProcessor` with `apply_ocr` set to `False`.
* See "Use case 3: token classification (training), apply_ocr=False" in the same usage guide
* Open decision: whether to load the training dataset with plain PyTorch or to use the Hugging Face `Trainer`
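
When `apply_ocr` is `False`, the boxes supplied alongside the words are expected on the 0-1000 scale that the LayoutLM family uses. A minimal sketch of that normalization follows, assuming a hypothetical OCR result with pixel coordinates in `(x0, y0, x1, y1)` order. The helper name `normalize_box` and the sample words and boxes are illustrative and do not come from the library or from this project's data.

```python
def normalize_box(box, width, height):
    """Scale a pixel-space (x0, y0, x1, y1) box to the 0-1000 range."""
    x0, y0, x1, y1 = box
    scaled = [
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    ]
    # Clamp in case the OCR engine reports a box slightly outside the page.
    return [min(1000, max(0, v)) for v in scaled]


# Hypothetical OCR output for a 1000 x 500 pixel page (values are made up).
words = ["Invoice", "Total"]
pixel_boxes = [(100, 50, 300, 80), (100, 400, 250, 430)]
page_width, page_height = 1000, 500

boxes = [normalize_box(b, page_width, page_height) for b in pixel_boxes]
print(boxes)
```

The resulting `words` and `boxes` lists would then go to a processor created with `apply_ocr=False`, along the lines of `processor(image, words, boxes=boxes, return_tensors="pt")`.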

In [None]:
!pip install -q datasets

In [None]:
# pytesseract is only a Python wrapper: the Tesseract binary must be installed separately
# (here via `brew install tesseract`); not yet captured in the environment/package deps
!pip install -q pytesseract
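
Because the Tesseract binary lives outside the Python package dependencies, one option is to check for it up front and fail with a clear message. The sketch below assumes the executable is named `tesseract` and is expected on `PATH`; the helper name `tesseract_available` is made up for illustration.

```python
import shutil


def tesseract_available(binary="tesseract"):
    """Return True if an executable with the given name is found on PATH."""
    return shutil.which(binary) is not None


if not tesseract_available():
    # Installation is platform specific, e.g. `brew install tesseract` on macOS.
    print("Tesseract binary not found on PATH; OCR via pytesseract will fail.")
```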

# Try LayoutLMv3 Inference with FUNSD Image

In [None]:
from datasets import load_dataset

# this dataset uses the new Image feature :)
dataset = load_dataset("nielsr/funsd-layoutlmv3")

In [None]:
dataset["train"].features

In [None]:
labels = dataset["train"].features["ner_tags"].feature.names
labels

In [None]:
id2label = {idx: label for idx, label in enumerate(labels)}
label2id = {label: idx for idx, label in enumerate(labels)}
label2id

In [None]:
example = dataset["test"][1]
example.keys()

In [None]:
image = example["image"]
image

In [None]:
from transformers import AutoProcessor

# processor = LayoutLMv3Processor.from_pretrained("microsoft/layoutlmv3-base")
# we'll use the Auto API here - it will load LayoutLMv3Processor behind the scenes,
# based on the checkpoint we provide from the hub
processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=True)

encoding = processor(image, return_offsets_mapping=True, return_tensors="pt")
offset_mapping = encoding.pop("offset_mapping")
encoding.keys()

In [None]:
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

for k, v in encoding.items():
    encoding[k] = v.to(device)

In [None]:
from transformers import LayoutLMv3ForTokenClassification

# load the fine-tuned model from the hub
model = LayoutLMv3ForTokenClassification.from_pretrained("nielsr/layoutlmv3-finetuned-funsd")
model.to(device)

# forward pass (inference only, so gradient tracking is disabled)
with torch.no_grad():
    outputs = model(**encoding)
outputs.logits.shape

In [None]:
def unnormalize_box(bbox, width, height):
    # map a box from the model's 0-1000 scale back to pixel coordinates
    return [
        width * (bbox[0] / 1000),
        height * (bbox[1] / 1000),
        width * (bbox[2] / 1000),
        height * (bbox[3] / 1000),
    ]

predictions = outputs.logits.argmax(-1).squeeze().tolist()
token_boxes = encoding.bbox.squeeze().tolist()

width, height = image.size

In [None]:
import numpy as np

# a token whose character offset does not start at 0 continues the previous word
is_subword = np.array(offset_mapping.squeeze().tolist())[:, 0] != 0

true_predictions = [id2label[pred] for idx, pred in enumerate(predictions) if not is_subword[idx]]
true_boxes = [unnormalize_box(box, width, height) for idx, box in enumerate(token_boxes) if not is_subword[idx]]

In [None]:
print(true_predictions)
print(true_boxes)

In [None]:
from PIL import ImageDraw, ImageFont

draw = ImageDraw.Draw(image)

font = ImageFont.load_default()

def iob_to_label(label):
    label = label[2:]
    if not label:
        return "other"
    return label

label2color = {"question":"blue", "answer":"green", "header":"orange", "other":"violet"}

for prediction, box in zip(true_predictions, true_boxes):
    predicted_label = iob_to_label(prediction).lower()
    draw.rectangle(box, outline=label2color[predicted_label])
    draw.text((box[0]+10, box[1]-10), text=predicted_label, fill=label2color[predicted_label], font=font)

image
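
As a quick check alongside the drawn boxes, the predicted tags can be tallied per class. The sketch below reuses the same prefix-stripping idea as `iob_to_label`. The helper `label_counts` and the sample tag sequence are hypothetical; in the notebook the input would be `true_predictions`.

```python
from collections import Counter


def iob_to_label(tag):
    """Strip the two-character IOB prefix; the bare "O" tag maps to "other"."""
    label = tag[2:]
    return label if label else "other"


def label_counts(tags):
    """Count how many tokens were assigned to each class, ignoring B-/I- prefixes."""
    return Counter(iob_to_label(tag).lower() for tag in tags)


# Hypothetical tag sequence standing in for true_predictions.
sample_tags = ["B-HEADER", "I-HEADER", "O", "B-QUESTION", "B-ANSWER", "I-ANSWER"]
print(label_counts(sample_tags))
```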

# Inference with Ex. 21 tables

In [None]:
from PIL import Image

# convert to RGB in case the PNG is stored as RGBA or palette-based
image = Image.open("trans_lux_corp_table.png").convert("RGB")

In [None]:
encoding = processor(image, return_offsets_mapping=True, return_tensors="pt")
offset_mapping = encoding.pop("offset_mapping")
encoding.keys()

In [None]:
for k, v in encoding.items():
    encoding[k] = v.to(device)

In [None]:
# forward pass (inference only, so gradient tracking is disabled)
with torch.no_grad():
    outputs = model(**encoding)
outputs.logits.shape

In [None]:
predictions = outputs.logits.argmax(-1).squeeze().tolist()
token_boxes = encoding.bbox.squeeze().tolist()

width, height = image.size

In [None]:
is_subword = np.array(offset_mapping.squeeze().tolist())[:,0] != 0

true_predictions = [id2label[pred] for idx, pred in enumerate(predictions) if not is_subword[idx]]
true_boxes = [unnormalize_box(box, width, height) for idx, box in enumerate(token_boxes) if not is_subword[idx]]

In [None]:
draw = ImageDraw.Draw(image)

font = ImageFont.load_default()

def iob_to_label(label):
    label = label[2:]
    if not label:
        return "other"
    return label

label2color = {"question":"blue", "answer":"green", "header":"orange", "other":"violet"}

for prediction, box in zip(true_predictions, true_boxes):
    predicted_label = iob_to_label(prediction).lower()
    draw.rectangle(box, outline=label2color[predicted_label])
    draw.text((box[0]+10, box[1]-10), text=predicted_label, fill=label2color[predicted_label], font=font)

image
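
Both inference passes repeat the same post-processing: drop continuation subwords, map class ids to tags, and rescale boxes to pixels. A minimal sketch of how that step could be pulled into one helper is below. It works on plain Python lists so it runs without a model; the function name `word_level_predictions` and the toy inputs are assumptions for illustration only.

```python
def unnormalize_box(bbox, width, height):
    """Map a 0-1000 box back to pixel coordinates."""
    return [
        width * bbox[0] / 1000,
        height * bbox[1] / 1000,
        width * bbox[2] / 1000,
        height * bbox[3] / 1000,
    ]


def word_level_predictions(predictions, token_boxes, offsets, id2label, width, height):
    """Return (tag, pixel_box) pairs for tokens whose offset starts at 0."""
    results = []
    for pred, box, (start, _end) in zip(predictions, token_boxes, offsets):
        if start != 0:
            # Continuation subword: skip it, as the is_subword mask does.
            continue
        results.append((id2label[pred], unnormalize_box(box, width, height)))
    return results


# Toy inputs (made up): four tokens, the third continues the second word.
toy_id2label = {0: "O", 1: "B-HEADER", 2: "B-QUESTION"}
toy_predictions = [0, 1, 1, 2]
toy_boxes = [[0, 0, 0, 0], [100, 100, 200, 200], [100, 100, 200, 200], [500, 500, 600, 600]]
toy_offsets = [(0, 0), (0, 3), (3, 5), (0, 4)]

toy_result = word_level_predictions(
    toy_predictions, toy_boxes, toy_offsets, toy_id2label, 1000, 2000
)
for label, box in toy_result:
    print(label, box)
```

In the notebook this would be called with `predictions`, `token_boxes`, `offset_mapping.squeeze().tolist()`, `id2label`, and `*image.size`. Like the cells above, it keeps special tokens, whose offsets are `(0, 0)`.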