# Run HuggingFace Pretrained Wav2Vec2-Conformer with Rotary Position Embeddings Inference on Inf2

## Introduction
This notebook demonstrates how to compile and run a HuggingFace 🤗 Wav2Vec2-Conformer model with rotary position embeddings for accelerated inference on Neuron. It uses the `facebook/wav2vec2-conformer-rope-large-960h-ft` model.

Run this Jupyter notebook on an Inf2 or Trn1 instance of size Inf2.8xlarge, Trn1.2xlarge, or larger.

Note: for deployment, it is recommended to pre-compile the model on a compute instance using `torch_neuronx.trace()`, save the compiled model as a `.pt` file, and then distribute the `.pt` file to Inf2.8xlarge instances for inference.

Verify that this Jupyter notebook is running the Python kernel environment that was set up according to the PyTorch Installation Guide. You can select the kernel from the 'Kernel -> Change Kernel' option at the top of this Jupyter notebook page.

## Set up the Jupyter Notebook

The following steps set up Jupyter Notebook and launch this tutorial:
1. Clone the [AWS Neuron Samples](https://github.com/aws-neuron/aws-neuron-samples) repo to your instance using
```
git clone https://github.com/aws-neuron/aws-neuron-samples.git
```
2. Navigate to the inference samples folder
```
cd aws-neuron-samples/torch-neuronx/inference
```
3. Follow the instructions in [Jupyter Notebook QuickStart](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/notebook/setup-jupyter-notebook-steps-troubleshooting.html) to run Jupyter Notebook on your instance.
4. Locate this tutorial in your Jupyter Notebook session (`hf_pretrained_wav2vec2_conformer_rope_inference_on_inf2.ipynb`) and launch it. Follow the rest of the instructions in this tutorial. 

## Install Dependencies
This tutorial requires the following pip packages:

 - `torch-neuronx`
 - `neuronx-cc`
 - `transformers`
 - `datasets`
 - `librosa`


Most of these packages are installed when you configure your environment using the [torch-neuronx inference setup guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/torch-neuronx.html#setup-torch-neuronx). Install the remaining dependencies here:

In [None]:
!pip install -U transformers datasets librosa
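
After installation, a quick check like the one below can confirm that the active kernel sees every package listed above. This is a minimal sketch that uses only the Python standard library; `REQUIRED_PACKAGES` and `installed_versions` are hypothetical names introduced for illustration and are not part of the Neuron SDK.

```python
from importlib import metadata

# Packages this tutorial lists as requirements (pip distribution names).
REQUIRED_PACKAGES = ["torch-neuronx", "neuronx-cc", "transformers", "datasets", "librosa"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Print one line per package so a missing dependency is easy to spot.
for name, version in installed_versions(REQUIRED_PACKAGES).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

If any line reports `NOT INSTALLED`, the notebook is probably attached to a different kernel than the one configured in the setup guide.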

## Compile the model into an AWS Neuron optimized TorchScript
In the following section, we load the model and the input preprocessor, get a sample input, run inference on CPU, compile the model for Neuron using `torch_neuronx.trace()`, and save the optimized model as TorchScript.

`torch_neuronx.trace()` expects a tensor or a tuple of tensors as example inputs for tracing, so we unpack the input preprocessor's output. Additionally, the input shape used during compilation must match the input shape used during inference.

In [None]:
import torch
import torch_neuronx
from datasets import load_dataset
from transformers import Wav2Vec2Processor, Wav2Vec2ConformerForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-conformer-rope-large-960h-ft")
model = Wav2Vec2ConformerForCTC.from_pretrained("facebook/wav2vec2-conformer-rope-large-960h-ft")
model.eval()

# take the first entry in the dataset as our input
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation", trust_remote_code=True)
input_values = processor(ds[0]["audio"]["array"], return_tensors="pt", padding="longest", sampling_rate=16_000).input_values

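# retrieve the result from CPU and decode it to a human-readable transcript
# (no_grad avoids building an autograd graph for an inference-only forward pass)
with torch.no_grad():
    output_cpu = model(input_values)

def decode_to_transcript(logits):
    # greedy decoding: take the highest-scoring token ID at each time step
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)

transcription_cpu = decode_to_transcript(output_cpu.logits)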

# Compile the model
model_neuron = torch_neuronx.trace(model, input_values, compiler_args="--model-type=transformer")

# Save the TorchScript for inference deployment
filename = 'model.pt'
torch.jit.save(model_neuron, filename)
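
As noted above, the compiled model only accepts inputs with the shape used for tracing. One way to feed it audio clips of other lengths is to zero-pad or truncate them to the traced length first. The helper below is a minimal sketch under that assumption. `fit_to_traced_length` is a hypothetical name, not part of `torch_neuronx`. Truncation discards audio and padding may affect the transcript near the end of the clip, and this sketch does not address whether the padded region should also be masked.

```python
import torch
import torch.nn.functional as F


def fit_to_traced_length(input_values, traced_length):
    """Zero-pad or truncate a (batch, samples) tensor so its last dimension equals traced_length."""
    current_length = input_values.shape[-1]
    if current_length >= traced_length:
        # Too long (or already the right size): keep only the first traced_length samples.
        return input_values[..., :traced_length]
    # Too short: append zeros on the right of the last dimension.
    return F.pad(input_values, (0, traced_length - current_length))
```

For example, `fit_to_traced_length(new_input_values, input_values.shape[-1])` would reshape a new preprocessed clip (here a hypothetical `new_input_values` tensor) to match the sample input used for compilation.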

## Run inference and compare results
In this section we load the compiled model, run inference on Neuron, and compare the CPU and Neuron outputs.

In [None]:
# Load the TorchScript compiled model
model_neuron = torch.jit.load(filename)

# Run inference using the Neuron model
output_neuron = model_neuron(input_values)
transcription_neuron = decode_to_transcript(output_neuron["logits"])

# Compare the results
print(f"CPU transcription:    {transcription_cpu}")
print(f"Neuron transcription: {transcription_neuron}")
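
Matching transcripts show agreement after decoding, but not how close the underlying logits are. The sketch below compares two logit tensors numerically. `compare_logits` is a hypothetical helper introduced for illustration, and what counts as an acceptable difference is left to the reader, since the source does not state a tolerance.

```python
import torch


def compare_logits(logits_a, logits_b):
    """Return (max absolute difference, fraction of positions whose argmax token agrees)."""
    a = logits_a.detach().float()
    b = logits_b.detach().float()
    max_abs_diff = (a - b).abs().max().item()
    # Fraction of time steps where greedy decoding would pick the same token.
    token_agreement = (a.argmax(dim=-1) == b.argmax(dim=-1)).float().mean().item()
    return max_abs_diff, token_agreement
```

In this notebook it could be called as `compare_logits(output_cpu.logits, output_neuron["logits"])`.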