# HuggingFace Pretrained DistilBERT Inference on Trn1

## Introduction

This notebook demonstrates how to compile and run a HuggingFace 🤗 Transformers DistilBERT model for accelerated inference on Neuron. It uses the [`distilbert-base-uncased`](https://huggingface.co/distilbert-base-uncased) model, which is pretrained with a masked language modeling objective and is primarily intended to be fine-tuned on downstream tasks.

This Jupyter notebook should be run on a Trn1 instance (`trn1.2xlarge` or larger).

## Install Dependencies
This tutorial requires the following pip packages:

- `torch-neuronx`
- `neuronx-cc`
- `transformers`

The Trn1 setup guide installs most of these packages when you configure your environment. Install the remaining dependency here:

In [None]:
!pip install -U transformers
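
Before moving on, it can help to confirm that the three packages listed above are actually visible to the notebook kernel. The following is a minimal sketch using only the standard library. The helper name `installed_versions` is hypothetical, and the package names are assumed to match the pip distribution names given in the list above.

```python
from importlib import metadata

# Distribution names as listed above (assumed to match what pip installed).
REQUIRED_PACKAGES = ("torch-neuronx", "neuronx-cc", "transformers")


def installed_versions(packages=REQUIRED_PACKAGES):
    """Return {package name: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Print one line per package so a missing dependency is easy to spot.
for name, version in installed_versions().items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

If any package is reported as missing, revisit the setup guide or the `pip install` cell above before compiling.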

## Compile the model into an AWS Neuron-optimized TorchScript

In the following section, we load the model and tokenizer, get a sample input, run inference on CPU, compile the model for Neuron using `torch_neuronx.trace()`, and save the optimized model as `TorchScript`.

`torch_neuronx.trace()` expects a tensor or a tuple of tensors as the example input for tracing, so we unpack the tokenizer output. Additionally, the input shape used during compilation must match the input shape used during inference. To handle this, we pad the inputs to the maximum size that we will see during inference.
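
To make the fixed-shape requirement concrete, here is a minimal pure-Python sketch of what padding to a maximum length does conceptually. It is not the tokenizer's implementation: the helper `pad_or_truncate` is hypothetical, the token IDs are made up, the pad ID of `0` is an assumption, and the truncation here simply cuts the list rather than preserving special tokens the way a real tokenizer would.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), each exactly max_length long."""
    ids = list(token_ids)[:max_length]      # drop anything past max_length
    n_real = len(ids)
    n_pad = max_length - n_real
    input_ids = ids + [pad_id] * n_pad      # fill the remainder with padding
    attention_mask = [1] * n_real + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# Illustrative (made-up) token IDs for a short sentence.
short = [101, 5672, 2033, 102]
ids, mask = pad_or_truncate(short, max_length=8)
print(ids)   # [101, 5672, 2033, 102, 0, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```

Whatever the length of the original text, the result always has the same length, which is why every input can be fed to a model compiled for a single shape.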

In [None]:
import torch
import torch_neuronx
from transformers import DistilBertTokenizer, DistilBertModel

# Create the tokenizer and model
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertModel.from_pretrained("distilbert-base-uncased")
model.eval()

# Get an example input
text = "Replace me by any text you'd like."

encoded_input = tokenizer(
    text,
    max_length=128,
    padding='max_length',
    truncation=True,
    return_tensors='pt'
)

example = (
    encoded_input['input_ids'],
    encoded_input['attention_mask'],
)

# Run inference on CPU
output_cpu = model(*example)

# Compile the model
model_neuron = torch_neuronx.trace(model, example)

# Save the TorchScript for inference deployment
filename = 'model.pt'
torch.jit.save(model_neuron, filename)

## Run inference and compare results

In this section, we load the compiled model, run inference on Neuron, and compare the CPU and Neuron outputs.

In [None]:
# Load the TorchScript compiled model
model_neuron = torch.jit.load(filename)

# Run inference using the Neuron model
output_neuron = model_neuron(*example)

# Compare the results
print(f"CPU last_hidden_state:    {output_cpu['last_hidden_state'][0][0][:10]}")
print(f"Neuron last_hidden_state: {output_neuron['last_hidden_state'][0][0][:10]}")
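
Comparing printed values by eye works for a handful of numbers, but a single summary figure is often easier to read. The sketch below computes the largest element-wise absolute difference between two flat sequences of floats. The helper `max_abs_diff` is hypothetical and the sample values are made up purely for illustration; they are not results from this notebook.

```python
def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equal-length sequences."""
    a, b = list(a), list(b)
    if len(a) != len(b):
        raise ValueError(f"length mismatch: {len(a)} != {len(b)}")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


# Made-up values standing in for a slice of each output.
cpu_values = [0.125, -0.5, 0.75]
neuron_values = [0.125, -0.4375, 0.75]
print(f"max abs diff: {max_abs_diff(cpu_values, neuron_values)}")  # max abs diff: 0.0625
```

With the tensors from the cell above, one way to obtain flat lists is `output_cpu['last_hidden_state'].detach().flatten().tolist()`, and likewise for `output_neuron`. Small differences between hardware backends are to be expected; how large a difference is acceptable depends on your application.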