# HF Pretrained Perceiver Language Inference on Trn1 / Inf2

## Introduction

This notebook demonstrates how to compile and run a Perceiver language model for accelerated inference on Neuron. It uses the [`PerceiverForMaskedLM`](https://huggingface.co/deepmind/language-perceiver) model.

This Jupyter notebook should be run on a Trn1 or Inf2 instance (`trn1.2xlarge`, `inf2.xlarge`, or larger).

## Install Dependencies
This tutorial requires the following pip packages:

- `torch-neuronx`
- `neuronx-cc`
- `transformers`

Most of these packages are installed when you configure your environment using the Trn1 setup guide. The cell below installs `transformers`.

In [None]:
!pip install transformers==4.32.0
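
Before moving on, it can help to confirm which of these packages are actually visible to the notebook kernel. The following is a minimal sketch that uses only the Python standard library; the helper name `installed_versions` is hypothetical and is not part of any Neuron tooling.

```python
from importlib.metadata import PackageNotFoundError, version

# Packages listed in this tutorial.
REQUIRED_PACKAGES = ["torch-neuronx", "neuronx-cc", "transformers"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(REQUIRED_PACKAGES).items():
    print(f"{name}: {ver if ver is not None else 'not installed'}")
```

Any package reported as not installed would need to be added before the compilation step.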

## Compile the model into an AWS Neuron optimized TorchScript

In the following section, we load the model, create a sample input, run inference on CPU, compile the model for Neuron using `torch_neuronx.trace()`, and save the optimized model as TorchScript.

In [None]:
from transformers import AutoTokenizer, PerceiverForMaskedLM
import torch
import torch_neuronx
from torch import nn

# This simple wrapper converts the model's keyword arguments into positional arguments.
# Positional arguments are required by `torch.jit.trace()`, which is used by `torch_neuronx.trace()`.
class MaskedLMWrapper(nn.Module):
    def __init__(self, perc):
        super().__init__()
        self.perc = perc

    def forward(self, attention_mask, input_ids):
        return self.perc(
            attention_mask=attention_mask,
            input_ids=input_ids
        )


# Create the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("deepmind/language-perceiver")
model = PerceiverForMaskedLM.from_pretrained("deepmind/language-perceiver")
model.eval()

# Wrap the model
model = MaskedLMWrapper(model)

# Create the masked input
text = "This is an incomplete sentence where some words are missing."
encoding = tokenizer(text, padding="max_length", return_tensors="pt")
# Mask the bytes corresponding to " missing.". Note that the model performs much better if the masked span starts with a space.
encoding["input_ids"][0, 52:61] = tokenizer.mask_token_id

inputs = (
    encoding["attention_mask"],
    encoding["input_ids"],
)

# Run inference on CPU
output_cpu = model(*inputs)['logits']

# Compile the model
model_neuron = torch_neuronx.trace(model, inputs)

# Save the TorchScript for inference deployment
filename = 'model.pt'
torch.jit.save(model_neuron, filename)
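
The slice `52:61` used for masking above is hard-coded. As a rough sketch of where those numbers come from, the snippet below recomputes them from the text. It assumes the tokenizer emits one token per UTF-8 byte and prepends a single special token, so positions are shifted by one. Both points are assumptions about the Perceiver tokenizer that should be confirmed against your `transformers` version, and `masked_span` is a hypothetical helper name.

```python
def masked_span(text, target, prefix_tokens=1):
    """Return (start, end) token positions covering `target` inside `text`.

    Assumes a byte-level tokenizer (one token per UTF-8 byte) that prepends
    `prefix_tokens` special tokens before the text.
    """
    text_bytes = text.encode("utf-8")
    target_bytes = target.encode("utf-8")
    offset = text_bytes.find(target_bytes)
    if offset < 0:
        raise ValueError(f"{target!r} not found in text")
    start = offset + prefix_tokens
    return start, start + len(target_bytes)


text = "This is an incomplete sentence where some words are missing."
start, end = masked_span(text, " missing.")
print(start, end)  # 52 61
```

With such a helper, the masking line could be written as `encoding["input_ids"][0, start:end] = tokenizer.mask_token_id`, and the same `start:end` slice could be reused when decoding predictions.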

## Run inference and compare results

In this section, we load the compiled model, run inference on Neuron, and compare the CPU and Neuron outputs.

In [None]:
# Load the TorchScript compiled model
model_neuron = torch.jit.load(filename)

# Run inference using the Neuron model
output_neuron = model_neuron(*inputs)['logits']

# Compare the results
print(f"CPU tensor:    {output_cpu[0][0][0:10]}")
print(f"Neuron tensor: {output_neuron[0][0][0:10]}")

# Compare the predicted output
cpu_masked_tokens_predictions = output_cpu[0, 52:61].argmax(dim=-1).tolist()
cpu_prediction = tokenizer.decode(cpu_masked_tokens_predictions)
print(f"CPU prediction:      '{cpu_prediction}'")

neuron_masked_tokens_predictions = output_neuron[0, 52:61].argmax(dim=-1).tolist()
neuron_prediction = tokenizer.decode(neuron_masked_tokens_predictions)
print(f"Neuron prediction:   '{neuron_prediction}'")
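
Printing the first few logits gives only a spot check. A slightly more systematic comparison, sketched below in plain Python, reports the largest absolute difference between two sets of logits and how often their per-position argmax agrees. The helper name `compare_logits` and the toy values are illustrative only, and no particular tolerance is implied for real Neuron output.

```python
def compare_logits(rows_a, rows_b):
    """Compare two equally shaped lists of logit rows.

    Returns (max_abs_diff, argmax_agreement), where argmax_agreement is the
    fraction of rows whose highest-scoring index is the same in both inputs.
    """
    if len(rows_a) != len(rows_b):
        raise ValueError("inputs must have the same number of rows")
    max_abs_diff = 0.0
    matches = 0
    for row_a, row_b in zip(rows_a, rows_b):
        if len(row_a) != len(row_b):
            raise ValueError("rows must have the same length")
        for a, b in zip(row_a, row_b):
            max_abs_diff = max(max_abs_diff, abs(a - b))
        best_a = max(range(len(row_a)), key=row_a.__getitem__)
        best_b = max(range(len(row_b)), key=row_b.__getitem__)
        matches += best_a == best_b
    agreement = matches / len(rows_a) if rows_a else 1.0
    return max_abs_diff, agreement


# Toy values standing in for two sets of logits.
reference = [[0.1, 2.0, -1.0], [3.0, 0.5, 0.2]]
candidate = [[0.1, 1.5, -1.0], [3.0, 0.5, 0.25]]
max_diff, agreement = compare_logits(reference, candidate)
print(f"max abs diff: {max_diff:.4f}, argmax agreement: {agreement:.0%}")
```

For the tensors in this notebook, the inputs could be obtained with `output_cpu[0].detach().tolist()` and `output_neuron[0].detach().tolist()`. Working directly on the tensors, for example with `(output_cpu - output_neuron).abs().max()`, would be faster for large outputs.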
