# Compiling and Benchmarking the LayoutLMv3 model

### In this notebook, we will run through the compilation and benchmarking code for the LayoutLMv3 model.

##### The first step is to compile the model for Inferentia. Here are the necessary imports.

In [None]:
from transformers import AutoProcessor, AutoModel
from datasets import load_dataset
import neuronperf.torch
import torch_neuron
import os
import torch
from torch import nn, tensor
from einops import repeat, rearrange

#### Then we load the model from Hugging Face. https://huggingface.co/docs/transformers/model_doc/layoutlmv3

In [None]:
processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
model = AutoModel.from_pretrained("microsoft/layoutlmv3-base")

These experiments were initially run on an inf1.2xlarge instance. The models were therefore compiled with pipeline sizes of 1 and 4 and batch sizes of 1 through 10. The compilation loop is simple: it saves the compiled models to the directory containing this notebook. Changing the file path will affect the benchmarking script below. This part of the script can be run in the notebook, but it is considerably less efficient there. It is recommended to run the layoutCompile.py file from the compile folder directly on your instance rather than in the notebook.

For LayoutLMv3, the model's input structure is not compatible with the Neuron trace call, so we implement a simple wrapper module that lets the trace proceed. Here, the wrapper only needs to unpack a dictionary input into the keyword arguments the model expects.

In [None]:
class NeuronCompatibilityWrapper(nn.Module):
    """Unpacks a dictionary of inputs into the keyword arguments the model expects."""

    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, encoding):
        return self.model(
            input_ids=encoding['input_ids'],
            attention_mask=encoding['attention_mask'],
            bbox=encoding['bbox'],
            pixel_values=encoding['pixel_values'],
        )


Wrap the model for tracing.

In [None]:
layout_for_trace = NeuronCompatibilityWrapper(model)

For the input, we use the example data from the Hugging Face docs. We run the input through the processor. 
https://huggingface.co/docs/transformers/model_doc/layoutlmv3#transformers.LayoutLMv3Model

In [None]:
dataset = load_dataset("nielsr/funsd-layoutlmv3", split="train")
example = dataset[0]
image = example["image"]
words = example["tokens"]
boxes = example["bboxes"]

encoding = processor(image, words, boxes=boxes, return_tensors="pt")

model = model.eval()

For the batch sizes, it is recommended to test 1 through 10. These are typical workloads that the model can handle on Inferentia with the inf1.2xlarge instance. The pipeline sizes refer to the number of NeuronCores used during the runs, ranging from 1 core to all 4 available on the inf1.2xlarge instance.

In [None]:
pipeline_sizes = [1, 4]
batch_sizes = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
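
The two lists above define a grid of configurations, one compiled model per pair. As a minimal sketch of that grid (the `configs` name is introduced here only for illustration), the pairs and their file names can be enumerated up front to see how many compilations to expect:

```python
from itertools import product

pipeline_sizes = [1, 4]
batch_sizes = list(range(1, 11))  # 1 through 10

# One compiled model file per (pipeline size, batch size) pair,
# using the same file naming scheme as the compilation loop.
configs = [
    (ncp, bs, f"layout_neuron_ncp{ncp}_bs{bs}.pt")
    for ncp, bs in product(pipeline_sizes, batch_sizes)
]

print(len(configs))  # number of models to compile
print(configs[0])
print(configs[-1])
```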

Here, the example data is enlarged to fit the different batch sizes. We use the repeat operation from the einops library. These are inserted into a dictionary for input, before the compilation commences. 
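
The batching step can be tried without Neuron hardware. The sketch below is a minimal illustration that uses `torch.Tensor.repeat` in place of einops; for tiling along the leading dimension the two should produce the same result. The tensor shapes (`seq_len = 8`, a 224x224 image) are assumptions chosen for illustration, not values read from the dataset.

```python
import torch

bs = 3       # batch size to build
seq_len = 8  # hypothetical token count, chosen only for illustration

# Stand-ins for a single processed example (batch dimension of 1).
single = {
    "input_ids": torch.zeros(1, seq_len, dtype=torch.long),
    "attention_mask": torch.ones(1, seq_len, dtype=torch.long),
    "bbox": torch.zeros(1, seq_len, 4, dtype=torch.long),
    "pixel_values": torch.zeros(1, 3, 224, 224),
}

# Tile each tensor along dim 0 and leave the other dims unchanged.
# A new dict is built, so `single` keeps its batch size of 1.
batched = {
    name: t.repeat(bs, *([1] * (t.dim() - 1)))
    for name, t in single.items()
}

for name, t in batched.items():
    print(name, tuple(t.shape))
```

Building a new dictionary on every iteration keeps the single-example tensors untouched, so each batch size is derived from the same starting point.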

In [None]:
for ncp in pipeline_sizes:
    for bs in batch_sizes:
        model_file = f"layout_neuron_ncp{ncp}_bs{bs}.pt"
        # Build a fresh batch from the original single example so that
        # the repeats do not compound from one iteration to the next.
        for_trace_dict = {
            'input_ids': repeat(encoding['input_ids'], 'b h -> (x b) h', x=bs),
            'attention_mask': repeat(encoding['attention_mask'], 'b w -> (x b) w', x=bs),
            'bbox': repeat(encoding['bbox'], 'b h w -> (x b) h w', x=bs),
            'pixel_values': repeat(encoding['pixel_values'], 'b c h w -> (x b) c h w', x=bs),
        }
        
        print(f"ncp: {ncp}  bs: {bs}")

        if not os.path.exists(model_file):
            print("Attempting model compilation")
            nmod = torch.neuron.trace(
                layout_for_trace,
                example_inputs=for_trace_dict,
                compiler_args=['--neuroncore-pipeline-cores', str(ncp)],
                strict=False,
            )
            nmod.save(model_file)
            del nmod  # release the model from memory so it doesn't affect benchmarking later on
        else:
            print("Found previously compiled model. Skipping compilation\n")
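
The loop above skips any configuration whose output file already exists, which makes an interrupted run safe to restart. A minimal sketch of that pattern in isolation follows; `build_if_missing` and `fake_compile` are hypothetical helpers introduced for illustration, and `fake_compile` only writes a placeholder file instead of tracing a model.

```python
import os
import tempfile

def build_if_missing(path, build_fn):
    """Call build_fn(path) only when path does not exist yet.

    Returns True if the artifact was built, False if it was reused.
    """
    if os.path.exists(path):
        return False
    build_fn(path)
    return True

def fake_compile(path):
    # Stand-in for the trace-and-save step; writes a placeholder file.
    with open(path, "wb") as f:
        f.write(b"compiled")

with tempfile.TemporaryDirectory() as tmp:
    model_file = os.path.join(tmp, "layout_neuron_ncp1_bs1.pt")
    first = build_if_missing(model_file, fake_compile)   # builds the file
    second = build_if_missing(model_file, fake_compile)  # reuses it

print(first, second)
```

Note that an existence check alone cannot tell a complete file from one left behind by a crashed run, so deleting partial outputs before restarting is a reasonable precaution.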

#### The models have been compiled into the folder containing this notebook. Now we can start benchmarking. We iterate through the pipeline sizes and the batch sizes, creating a separate CSV file for each combination. For more information on the neuronperf library, please see the documentation here: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/neuronperf/index.html

(Note: the data is resized to each batch size in the benchmarking runs, as in the compilation runs above.)

In [None]:
for ncp in pipeline_sizes:
    for bs in batch_sizes:
        model_file = f"layout_neuron_ncp{ncp}_bs{bs}.pt"
        report_file = f"layout_neuron_ncp{ncp}_bs{bs}_benchmark.csv"

        # Build a fresh batch from the original single example so that
        # the repeats do not compound from one iteration to the next.
        for_trace_dict = {
            'input_ids': repeat(encoding['input_ids'], 'b h -> (x b) h', x=bs),
            'attention_mask': repeat(encoding['attention_mask'], 'b w -> (x b) w', x=bs),
            'bbox': repeat(encoding['bbox'], 'b h w -> (x b) h w', x=bs),
            'pixel_values': repeat(encoding['pixel_values'], 'b c h w -> (x b) c h w', x=bs),
        }

        print(f"ncp: {ncp}  bs: {bs}")

        if not os.path.exists(report_file):
            reports = neuronperf.torch.benchmark(model_filename=model_file, inputs=for_trace_dict, batch_sizes=[bs], pipeline_sizes=[ncp])
            neuronperf.print_reports(reports)
            neuronperf.write_csv(reports, report_file)
        else:
            print(f"Report file {report_file} already exists. Skipping this benchmark run.\n")
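
Because each report file name encodes its pipeline size and batch size, the configuration can be recovered from the name alone when sorting through results later. The sketch below assumes the naming scheme used in the loop above; `parse_report_name` is a hypothetical helper, not part of neuronperf.

```python
import re

# Matches names such as layout_neuron_ncp4_bs10_benchmark.csv
REPORT_PATTERN = re.compile(r"layout_neuron_ncp(\d+)_bs(\d+)_benchmark\.csv$")

def parse_report_name(filename):
    """Return (pipeline_size, batch_size), or None if the name does not match."""
    match = REPORT_PATTERN.search(filename)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

print(parse_report_name("layout_neuron_ncp4_bs10_benchmark.csv"))
print(parse_report_name("layout_lowest_latency.csv"))
```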

### You can aggregate the LayoutLMv3 benchmark reports here and review the results in an organized fashion.

In [None]:
import pandas as pd
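from glob import glob

# Collect the per-configuration benchmark reports into a single dataframe.
# The pattern matches only the benchmark reports, not the summary files saved at the end.

dataframes = []

for csv_path in glob("layout_neuron_*_benchmark.csv"):
    dataframes.append(pd.read_csv(csv_path))
    print(csv_path)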

aggr_df = pd.concat(dataframes)
aggr_df

# The column names below must match the headers in the benchmark CSV files;
# check aggr_df.columns if a sort raises a KeyError.

# Lowest p90 latency
lowest_latency = aggr_df.sort_values(by='latency_p90', ascending=True)[0:5]

# Lowest cost
lowest_price = aggr_df.sort_values(by='cost per 1 m instances', ascending=True)[0:5]

# Highest average throughput
highest_throughput = aggr_df.sort_values(by='throughput average', ascending=False)[0:5]

Save the dataframes. You can specify a fuller file path if you wish to store these files in a directory other than the current one.

In [None]:
lowest_latency.to_csv('layout_lowest_latency.csv')
lowest_price.to_csv('layout_lowest_price.csv')
highest_throughput.to_csv('layout_highest_throughput.csv')
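
The sort-and-slice selection used above can also be written with pandas' `nsmallest` and `nlargest`. The sketch below runs on a small made-up table; the column names and numbers are illustrative assumptions, not neuronperf output.

```python
import pandas as pd

# Hypothetical results; column names and values are illustrative only.
df = pd.DataFrame({
    "batch_size": [1, 2, 4, 8],
    "latency_ms": [12.0, 15.0, 22.0, 40.0],
    "throughput": [80.0, 130.0, 180.0, 200.0],
})

fastest = df.nsmallest(2, "latency_ms")  # two lowest-latency rows
busiest = df.nlargest(2, "throughput")   # two highest-throughput rows

print(fastest["batch_size"].tolist())
print(busiest["batch_size"].tolist())
```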