# Finetune LayoutLM

Exhibit 21 extraction is built on top of [layoutlmv3](https://huggingface.co/microsoft/layoutlmv3-base/tree/main). The model labels each word as one of three entities: `subsidiary`, `location`, or `owner percentage`. A separate inference portion of the extraction pipeline takes these predictions and assembles them into records in a table.
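
To make the labeling scheme concrete, here is a minimal sketch of how word-level predictions might be merged into one record. The sample words, the `100%` value, and the `group_entities` helper are hypothetical; the real inference step is implemented elsewhere in the pipeline and may work differently.

```python
from itertools import groupby

# Hypothetical word-level predictions for a single Exhibit 21 row.
words = ["Paiute", "Pipeline", "Company", "Nevada", "100%"]
labels = ["subsidiary", "subsidiary", "subsidiary", "location", "owner percentage"]


def group_entities(words, labels):
    """Merge consecutive words that share a label into (label, text) spans."""
    spans = []
    for label, group in groupby(zip(words, labels), key=lambda pair: pair[1]):
        spans.append((label, " ".join(word for word, _ in group)))
    return spans


# Building a dict assumes each label occurs at most once per row;
# a repeated label would overwrite the earlier span.
record = dict(group_entities(words, labels))
print(record)
```

Grouping by consecutive label is the simplest possible assembly rule. A real table usually also needs positional information to decide which words belong to the same row.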

## Load configuration

The following cell configures where to pull training data from. If running this notebook from dagster, you can set this value in the launchpad, and it will replace whatever is in this cell. Otherwise, set this value directly below.

- `ex21_training_data`: Dataset containing labeled data produced in label-studio to train `layoutlm`

In [1]:
import dagstermill

context = dagstermill.get_context(op_config={
    "ex21_training_data": "v0.2",
})

## Prepare training data

The config value set above will now be used to download training data from the specified folder in GCS and transform it into a format that can be used by layoutlm.

In [2]:
from pathlib import Path
from tempfile import TemporaryDirectory

from mozilla_sec_eia.models.sec10k.ex_21.data.training import format_as_ner_annotations

with TemporaryDirectory() as temp_dir:
    ex21_training_data = format_as_ner_annotations(
        labeled_json_path=Path(temp_dir) / "sec10k_filings" / "labeled_jsons",
        pdfs_path=Path(temp_dir) / "sec10k_filings" / "pdfs",
        gcs_folder_name=f"labeled{context.op_config['ex21_training_data']}",
    )

<table> is empty
'<c> The Southwest Companies Nevada PriMerit Bank Federally chartered stock savings bank Paiute Pipeline Company Nevada Carson Water Company Nevada Southwest Gas Transmission Company Partnership between Southwest Gas Corporation and Utility Financial Corp. Utility Financial Corp. Nevada Southwest Gas Corporation of Arizona Nevada PRIMERIT BANK SUBSIDIARIES AT DECEMBER 31, 1993'
<table> is empty
'<c> TCA Management Company.................................................... Texas Teleservice Corporation of America........................................ Texas Texas Community Antennas, Inc............................................. Texas Texas Telecable, Inc...................................................... Texas TCA Cable of Amarillo, Inc................................................ Texas Telecable Associates, Inc................................................. Texas Delta Cablevision, Inc.................................................... Arkansas Sun Valley
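
The training cell further down reads four fields from each annotation: `image`, `tokens`, `bboxes`, and `ner_tags`. As a rough sketch of what one record could look like, assuming bounding boxes follow the usual LayoutLM convention of integer coordinates scaled to a 0-1000 range, consider the following. The page size, the sample values, and the `normalize_bbox` helper are illustrative and are not taken from `format_as_ner_annotations`.

```python
def normalize_bbox(bbox, page_width, page_height):
    """Scale an (x0, y0, x1, y1) pixel box to integer coordinates in 0-1000."""
    x0, y0, x1, y1 = bbox
    return [
        1000 * x0 // page_width,
        1000 * y0 // page_height,
        1000 * x1 // page_width,
        1000 * y1 // page_height,
    ]


# Hypothetical page of 850 x 1100 pixels containing two labeled words.
page_width, page_height = 850, 1100
pixel_boxes = [(85, 110, 170, 132), (425, 110, 510, 132)]

annotation = {
    "tokens": ["Paiute", "Nevada"],
    "bboxes": [normalize_bbox(box, page_width, page_height) for box in pixel_boxes],
    "ner_tags": ["subsidiary", "location"],  # tag names are illustrative
    # "image" would hold the rendered page; it is omitted in this sketch
}
print(annotation["bboxes"])  # [[100, 100, 200, 120], [500, 100, 600, 120]]
```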

## Define training metrics
The function `compute_metrics` is used to score the model. It computes precision, recall, F1 score, and accuracy for the labels that `layoutlm` predicts for each word's bounding box, skipping positions whose label is `-100` (special tokens).

In [3]:

import numpy as np


def compute_metrics(p, metric, label_list, return_entity_level_metrics=False):
    """Compute metrics to train and evaluate the model on."""
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    # Remove ignored index (special tokens)
    true_predictions = [
        [label_list[pred] for (pred, lab) in zip(prediction, label) if lab != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[lab] for (pred, lab) in zip(prediction, label) if lab != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels)
    if return_entity_level_metrics:
        # Unpack nested dictionaries
        final_results = {}
        for key, value in results.items():
            if isinstance(value, dict):
                for n, v in value.items():
                    final_results[f"{key}_{n}"] = v
            else:
                final_results[key] = value
        return final_results
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
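
The filtering step in `compute_metrics` can be seen on a toy batch. This sketch uses plain Python lists in place of NumPy arrays and a hypothetical three-entry `label_list`. It only mirrors the argmax and the `-100` masking, not the `seqeval` scoring.

```python
IGNORE_INDEX = -100  # label id marking positions that should not be scored

# Hypothetical label list; the notebook uses LABELS from the project instead.
label_list = ["O", "B-Subsidiary", "B-Loc"]

# One sequence of four tokens, each with one score per label.
logits = [[[2.0, 0.1, 0.3], [0.2, 1.5, 0.1], [0.1, 2.0, 0.5], [0.9, 0.8, 0.7]]]
labels = [[IGNORE_INDEX, 1, 2, IGNORE_INDEX]]


def argmax(scores):
    """Return the index of the highest score (stand-in for np.argmax)."""
    return max(range(len(scores)), key=scores.__getitem__)


pred_ids = [[argmax(token_scores) for token_scores in seq] for seq in logits]

# Keep only positions whose reference label is not the ignore index.
true_predictions = [
    [label_list[pred] for pred, lab in zip(preds, labs) if lab != IGNORE_INDEX]
    for preds, labs in zip(pred_ids, labels)
]
true_labels = [
    [label_list[lab] for lab in labs if lab != IGNORE_INDEX]
    for labs in labels
]

print(true_predictions)  # [['B-Subsidiary', 'B-Subsidiary']]
print(true_labels)       # [['B-Subsidiary', 'B-Loc']]
```

The first and last tokens are dropped from both lists, so they contribute nothing to the score, and the second scored token counts as a misclassification.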

## Finetune Model
The next cell uses the functions defined in the previous sections to construct a Hugging Face dataset from the labeled data and finetune the `layoutlm` model. Finetuning only runs if configured to do so; otherwise, a pretrained version is loaded from the `mlflow` tracking server.

Model training contains several steps implemented below:
1. Use a temporary path to convert filings to PDFs and stash labels
2. Convert the PDFs and labels to NER annotations
3. Construct huggingface dataset from NER annotations and split into train and test sets
4. Load pretrained model from huggingface
5. Finetune model on training data and evaluate on test data
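
In step 3, word-level labels have to be expanded to token-level labels, because the tokenizer may split one word into several subword tokens. The sketch below illustrates the common convention, which the LayoutLMv3 processor is assumed to follow with its default settings: only the first subword of each word keeps the label, while the remaining subwords, special tokens, and padding get `-100`. The subword splits, the label ids, and the `align_labels` helper are made up for illustration.

```python
IGNORE_INDEX = -100


def align_labels(word_pieces, word_labels, max_length):
    """Expand word-level labels to token-level labels, padded to max_length.

    Assumes the sequence fits within max_length, so no truncation is needed.
    """
    token_labels = [IGNORE_INDEX]  # leading special token
    for pieces, label in zip(word_pieces, word_labels):
        token_labels.append(label)  # first subword keeps the word's label
        token_labels.extend([IGNORE_INDEX] * (len(pieces) - 1))  # rest are ignored
    token_labels.append(IGNORE_INDEX)  # trailing special token
    token_labels.extend([IGNORE_INDEX] * (max_length - len(token_labels)))  # padding
    return token_labels


# Hypothetical subword splits for three words and their label ids.
word_pieces = [["Pai", "ute"], ["Pipeline"], ["Nev", "ada"]]
word_labels = [1, 1, 2]

print(align_labels(word_pieces, word_labels, max_length=8))
# [-100, 1, -100, 1, 2, -100, -100, -100]
```

This is why `compute_metrics` filters on `-100`: only one position per labeled word should be scored.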

In [5]:
import dagstermill
import mlflow
from datasets import (
    Array2D,
    Array3D,
    Dataset,
    Features,
    Sequence,
    Value,
    load_metric,
)
from dotenv import load_dotenv
from transformers import (
    AutoProcessor,
    LayoutLMv3ForTokenClassification,
    Trainer,
    TrainingArguments,
)
from transformers.data.data_collator import default_data_collator

from mozilla_sec_eia.library.mlflow import configure_mlflow
from mozilla_sec_eia.models.sec10k.ex_21.data.common import (
    LABELS,
    get_id_label_conversions,
)

load_dotenv()


configure_mlflow()
mlflow.set_experiment("exhibit21_extraction_test")


def _prepare_dataset(annotations, processor, label2id):
    """Put the dataset in its final format for training LayoutLM."""

    def _convert_ner_tags_to_id(ner_tags, label2id):
        return [int(label2id[ner_tag]) for ner_tag in ner_tags]

    images = annotations["image"]
    words = annotations["tokens"]
    boxes = annotations["bboxes"]
    # Map over labels and convert to numeric id for each ner_tag
    ner_tags = [
        _convert_ner_tags_to_id(ner_tags, label2id)
        for ner_tags in annotations["ner_tags"]
    ]

    encoding = processor(
        images,
        words,
        boxes=boxes,
        word_labels=ner_tags,
        truncation=True,
        padding="max_length",
    )

    return encoding

id2label, label2id = get_id_label_conversions(LABELS)
# Change temp_dir to save training data locally for inspection
# Cache/prepare training data
dataset = Dataset.from_list(ex21_training_data)

# Load pretrained model
model = LayoutLMv3ForTokenClassification.from_pretrained(
    "microsoft/layoutlmv3-base", id2label=id2label, label2id=label2id
)
processor = AutoProcessor.from_pretrained(
    "microsoft/layoutlmv3-base", apply_ocr=False
)

# Prepare our train & eval dataset
column_names = dataset.column_names
features = Features(
    {
        "pixel_values": Array3D(dtype="float32", shape=(3, 224, 224)),
        "input_ids": Sequence(feature=Value(dtype="int64")),
        "attention_mask": Sequence(Value(dtype="int64")),
        "bbox": Array2D(dtype="int64", shape=(512, 4)),
        "labels": Sequence(feature=Value(dtype="int64")),
    }
)
dataset = dataset.map(
    lambda annotations: _prepare_dataset(annotations, processor, label2id),
    batched=True,
    remove_columns=column_names,
    features=features,
)
dataset.set_format("torch")
split_dataset = dataset.train_test_split(test_size=0.2)
train_dataset, eval_dataset = split_dataset["train"], split_dataset["test"]

# Initialize our Trainer
metric = load_metric("seqeval")
training_args = TrainingArguments(
    max_steps=1000,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    learning_rate=1e-5,
    evaluation_strategy="steps",
    eval_steps=100,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    output_dir="./layoutlm",
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=processor,
    data_collator=default_data_collator,
    compute_metrics=lambda p: compute_metrics(p, metric=metric, label_list=LABELS),
)

with mlflow.start_run() as training_run:
    # Train inside mlflow run. Mlflow will automatically handle logging training metrics
    trainer.train()

    # Log finetuned model with mlflow
    model = {"model": trainer.model, "tokenizer": trainer.tokenizer}
    mlflow.transformers.log_model(
        model, artifact_path="layoutlm_extractor", task="token-classification"
    )

    # Return output from notebook (URI of logged model)
    dagstermill.yield_result(mlflow.get_artifact_uri("layoutlm_extractor"), output_name="finetuned_layoutlm_uri")

Some weights of LayoutLMv3ForTokenClassification were not initialized from the model checkpoint at microsoft/layoutlmv3-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Map:   0%|          | 0/159 [00:00<?, ? examples/s]

  metric = load_metric("seqeval")
max_steps is given, it will override any value given in num_train_epochs
2024/10/24 11:53:58 INFO mlflow.system_metrics.system_metrics_monitor: Started monitoring system metrics.


| Step | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 100 | No log | 0.373904 | 0.727903 | 0.759494 | 0.743363 | 0.88327 |
| 200 | No log | 0.28605 | 0.757225 | 0.829114 | 0.791541 | 0.90566 |
| 300 | No log | 0.243627 | 0.828571 | 0.891501 | 0.858885 | 0.937358 |
| 400 | No log | 0.212126 | 0.825976 | 0.879747 | 0.852014 | 0.940881 |
| 500 | 0.292200 | 0.180077 | 0.856187 | 0.925859 | 0.889661 | 0.952453 |
| 600 | 0.292200 | 0.214453 | 0.867797 | 0.925859 | 0.895888 | 0.953711 |
| 700 | 0.292200 | 0.187753 | 0.911636 | 0.942134 | 0.926634 | 0.958491 |
| 800 | 0.292200 | 0.203482 | 0.889565 | 0.924955 | 0.906915 | 0.956981 |
| 900 | 0.292200 | 0.202835 | 0.880208 | 0.916817 | 0.89814 | 0.958491 |
| 1000 | 0.036400 | 0.189829 | 0.884083 | 0.924051 | 0.903625 | 0.960252 |
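
As a quick way to read the evaluation results above, the following sketch picks the step with the highest F1 and checks that the reported F1 is the harmonic mean of precision and recall. The numbers are copied from the results, and the `f1_score` helper is only for illustration. Which checkpoint the `Trainer` actually restores with `load_best_model_at_end=True` also depends on its checkpoint save schedule, which this notebook leaves at the defaults, so the best evaluated step is not necessarily the one that gets loaded.

```python
# (step, precision, recall, f1) copied from the evaluation results above.
eval_log = [
    (100, 0.727903, 0.759494, 0.743363),
    (200, 0.757225, 0.829114, 0.791541),
    (300, 0.828571, 0.891501, 0.858885),
    (400, 0.825976, 0.879747, 0.852014),
    (500, 0.856187, 0.925859, 0.889661),
    (600, 0.867797, 0.925859, 0.895888),
    (700, 0.911636, 0.942134, 0.926634),
    (800, 0.889565, 0.924955, 0.906915),
    (900, 0.880208, 0.916817, 0.898140),
    (1000, 0.884083, 0.924051, 0.903625),
]


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


best_step, best_precision, best_recall, best_f1 = max(eval_log, key=lambda row: row[3])
print(best_step, best_f1)  # 700 0.926634

# The recomputed value agrees with the logged one up to rounding.
print(abs(f1_score(best_precision, best_recall) - best_f1) < 1e-5)  # True
```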


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


Model URI: gs://mlflow-artifacts-mozilla/13/55f045a0fa264bf58b95e4199cddaf93/artifacts/layoutlm_extractor


2024/10/24 12:13:32 INFO mlflow.tracking._tracking_service.client: 🏃 View run casual-lynx-575 at: https://mlflow-ned2up6sra-uc.a.run.app/#/experiments/13/runs/55f045a0fa264bf58b95e4199cddaf93.
2024/10/24 12:13:32 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: https://mlflow-ned2up6sra-uc.a.run.app/#/experiments/13.
2024/10/24 12:13:32 INFO mlflow.system_metrics.system_metrics_monitor: Stopping system metrics monitoring...
2024/10/24 12:13:33 INFO mlflow.system_metrics.system_metrics_monitor: Successfully terminated system metrics monitoring!
