# Run Hugging Face `Llama 3.1 405B` autoregressive sampling on Trn1/Trn1n with 16k sequence length

In this example we compile and deploy the Hugging Face [meta-llama/Meta-Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct) model for tensor parallel inference on Neuron using the `transformers-neuronx` package. We use a sequence length of 16k.

The example has the following main sections:
1. Set up the Jupyter Notebook
2. Install dependencies
3. Access the model
4. Perform autoregressive sampling using tensor parallelism

This Jupyter Notebook can be run on 4 Trn1/Trn1n instances (`trn1.32xlarge`/`trn1n.32xlarge`) using multinode inference.

## Set up the Jupyter Notebook

The following steps set up Jupyter Notebook and launch this tutorial:
1. Clone the [AWS Neuron Samples](https://github.com/aws-neuron/aws-neuron-samples) repo to your instance using
```
git clone https://github.com/aws-neuron/aws-neuron-samples.git
```
2. Navigate to the `transformers-neuronx` inference samples folder
```
cd aws-neuron-samples/torch-neuronx/transformers-neuronx/inference
```
3. Follow the instructions in [Jupyter Notebook QuickStart](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/notebook/setup-jupyter-notebook-steps-troubleshooting.html) to run Jupyter Notebook on your instance.
4. Locate this tutorial in your Jupyter Notebook session (`llama-3.1-405b-multinode-16k-sampling.ipynb`) and launch it after setting the environment variables described below. Follow the rest of the instructions in this tutorial. Note that the notebook needs to be run on all 4 nodes.

## Install Dependencies
This tutorial requires the following pip packages:

 - `torch-neuronx`
 - `neuronx-cc`
 - `sentencepiece`
 - `transformers`
 - `transformers-neuronx`


Most of these packages will be installed when configuring your environment using the [torch-neuronx inference setup guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/torch-neuronx.html#setup-torch-neuronx). The additional dependencies must be installed here:

In [None]:
!pip install transformers-neuronx sentencepiece
# need a recent transformers version for RoPE scaling in Llama 3.1
# (quoted so the shell does not treat ">=" as an output redirect)
!pip install "transformers>=4.43.3"

## Access the model

Use of the Llama 3.1 model is governed by the Meta license, and the model must be downloaded prior to running this sample. Follow the steps described in [meta-llama/Meta-Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct) to get access to the Llama 3.1 model from Meta.

Note: For the purpose of this sample, we assume you have access to the model from Hugging Face and it is saved in the directory `Meta-Llama-3.1-405B-Instruct`.

## Perform autoregressive sampling using tensor parallelism

Now we have all of the necessary files for running `meta-llama/Meta-Llama-3.1-405B-Instruct` autoregressive sampling.

The memory required to host any model can be computed with:
```
total memory = bytes per parameter * number of parameters
```
When using `bfloat16` weights for a 405 billion parameter model, this works out to `2 * 405B` or ~810GB of weights. Each NeuronCore has 16GB of memory, which means that an 810GB model would not fit on a single NeuronCore. In reality, the total space required is often greater than the weights alone because attention layer projections are cached (KV caching). This caching mechanism grows memory allocations linearly with sequence length and batch size.
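
The two contributions described above can be estimated with a few lines of Python. This is a rough sketch, not a measurement: the layer count, number of KV heads, and head dimension are assumptions taken from the published Llama 3.1 405B configuration, and the KV cache figure ignores any replication or sharding applied by the runtime.

```python
# Rough memory estimate for weights and KV cache (a sketch, not a measurement).
# Architecture numbers are assumptions from the published Llama 3.1 405B
# configuration: 126 layers, 8 KV heads, head dimension 128.

def weight_bytes(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed for the weights alone (bfloat16 = 2 bytes per parameter)."""
    return num_params * bytes_per_param


def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_value: int = 2) -> int:
    """Memory for cached keys and values; linear in seq_len and batch_size."""
    # The leading factor of 2 accounts for storing both K and V at every layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


weights_gb = weight_bytes(405e9) / 1e9
kv_gb = kv_cache_bytes(126, 8, 128, seq_len=16384, batch_size=1) / 1e9
print(f"weights: ~{weights_gb:.0f} GB, KV cache at 16k tokens: ~{kv_gb:.2f} GB")
```

Doubling either `seq_len` or `batch_size` doubles the KV cache estimate, which is the linear growth mentioned above.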

To get very large language models to fit on Trn1, tensor parallelism is used to split weights, data, and compute across multiple NeuronCores. The number of NeuronCores that the weights are split across can be controlled by setting the `tp_degree` parameter. This parallelism degree must be chosen to ensure that the memory usage per NeuronCore will be less than the physical 16GB limit. When configuring tensor parallelism, the memory per NeuronCore can be computed with:

```
memory per core = (bytes per parameter * number of parameters) / tp_degree
```

This can be used to compute the minimum instance sizing by ensuring that the value selected for `tp_degree` results in less than 16GB allocated per NeuronCore.
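
As a worked example of the formula above, the following sketch compares the per-core weight footprint for a single node (`tp_degree=32`) and for four nodes (`tp_degree=128`). It counts weights only, so the real requirement is higher once the KV cache and intermediate tensors are included, and the helper names are illustrative rather than part of any Neuron API.

```python
import math

CORE_MEMORY_GB = 16  # per-NeuronCore memory stated in this tutorial


def memory_per_core_gb(num_params: float, tp_degree: int, bytes_per_param: int = 2) -> float:
    """Weight memory per NeuronCore in GB for a given tensor parallelism degree."""
    return num_params * bytes_per_param / tp_degree / 1e9


def min_tp_degree(num_params: float, bytes_per_param: int = 2,
                  core_memory_gb: float = CORE_MEMORY_GB) -> int:
    """Smallest tp_degree whose weight shard fits in one core (weights only)."""
    return math.ceil(num_params * bytes_per_param / (core_memory_gb * 1e9))


for tp in (32, 128):
    per_core = memory_per_core_gb(405e9, tp)
    verdict = "fits" if per_core < CORE_MEMORY_GB else "does not fit"
    print(f"tp_degree={tp}: {per_core:.2f} GB per core ({verdict})")

print("minimum tp_degree for weights alone:", min_tp_degree(405e9))
```

The weights-only minimum already exceeds the 32 NeuronCores of a single `trn1.32xlarge`, which is why this tutorial spreads the model over several nodes.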

Note that increasing the `tp_degree` beyond the minimum requirement almost always results in a faster model. Spreading the weights over more NeuronCores increases the aggregate memory bandwidth available to the model, which improves performance. To optimize performance, it is recommended to use the highest tensor parallelism degree that is supported by the instance.

## Multinode tensor parallelism

For the 405B model, even a single `trn1.32xlarge` does not have enough memory to fit the model. Therefore we use multinode inference with 4 `trn1.32xlarge` nodes (or `trn1n.32xlarge` for better cross-node network bandwidth). In this case we have a `tp_degree` of 128 (4 nodes times 32 NeuronCores each). You can find details about configuring multinode inference in the [trn1 multi-node setup guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/setup-trn1-multi-node-execution.html) and the [transformers-neuronx developer guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/transformers-neuronx/transformers-neuronx-developer-guide.html). In short, the following environment variables and configuration need to be set:

### EFA related
- `FI_EFA_USE_DEVICE_RDMA=1`
- `FI_PROVIDER=efa`
- `FI_EFA_FORK_SAFE=1` (only needed for older Linux kernel, see [here](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-runtime/nrt-troubleshoot.html#fi-efa-fork-safe))
- `CCOM_SOCKET_IFNAME=eth0` (only for containerized environments)

### Setting up communication and rank id
- `NEURON_RT_ROOT_COMM_ID=10.1.201.64:63423` (this is the node address and port for the master node, where the port can be any free port)
- `NEURON_RANK_ID=0` or `1` or `2` or `3` (the master node has rank id 0 while the other three will have 1, 2, 3. This is the only environment variable that needs to be set differently across the nodes).
- `NEURON_LOCAL_TP=32` (set to 32 because each node is a trn1.32xlarge)

### Deterministic sampling across the nodes
- To make sure the inputs/outputs at each step are consistent across the nodes, we call `torch.manual_seed` with the same seed on every node.

### How to execute notebook/script on multiple nodes?
The same script/notebook needs to be run on all the nodes after setting the above environment variables. This can be done manually, through a SLURM batch job, or using a containerized solution (e.g., kubernetes nodes in an EKS cluster). Note that all the nodes need access to the model path (either on local disk or on shared storage).
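
For a manual launch, the per-node settings listed above can be collected in a small helper so that only the rank changes between nodes. This is a minimal sketch: `multinode_env` is a hypothetical helper, not part of any Neuron package, and the address and port are the example values from this tutorial. It assumes the variables are applied before anything initializes the Neuron runtime; exporting them in the shell before launching Jupyter achieves the same result.

```python
import os


def multinode_env(rank: int, root_addr: str, root_port: int,
                  local_tp: int = 32, num_nodes: int = 4) -> dict:
    """Build the environment variables listed above for one node (hypothetical helper)."""
    if not 0 <= rank < num_nodes:
        raise ValueError(f"rank must be in [0, {num_nodes}), got {rank}")
    return {
        "FI_EFA_USE_DEVICE_RDMA": "1",
        "FI_PROVIDER": "efa",
        # Address and port of the master (rank 0) node; identical on every node.
        "NEURON_RT_ROOT_COMM_ID": f"{root_addr}:{root_port}",
        # The only value that differs between nodes.
        "NEURON_RANK_ID": str(rank),
        "NEURON_LOCAL_TP": str(local_tp),
    }


# Example values from this tutorial; replace them with your master node's address.
env = multinode_env(rank=0, root_addr="10.1.201.64", root_port=63423)
os.environ.update(env)

# Sanity check: local cores per node times node count should equal tp_degree.
assert int(env["NEURON_LOCAL_TP"]) * 4 == 128
```

`FI_EFA_FORK_SAFE` and `CCOM_SOCKET_IFNAME` are left out because they only apply to the specific environments noted above.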


### Configuration

We will use the Neuron `LlamaForSampling` class to implement tensor parallelism for the Llama based model. We supply the `n_positions` and `context_length_estimate` to precompile various possible prompt lengths. Tensor parallelism is enabled through the argument `tp_degree=128`. The model computational graph is compiled by `neuronx-cc` for optimized inference on Neuron.


We also set some additional configurations to improve the performance and/or support longer context:
- `attention_layout`: Layout to be used for attention computation. In this case, we use "BSH".
- `fuse_qkv`: Fuses the QKV projection into a single matrix multiplication. It helps in improving the loading efficiency of Q/K/V weights.
- `group_query_attention`: The KV cache sharding strategy. For more details, refer to [Grouped Query Attention in transformers neuronx](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/transformers-neuronx/transformers-neuronx-developer-guide.html#grouped-query-attention-gqa-support-beta).
- `sequence_parallel_norm`: Use sequence parallel sharding for RMSNorm. This helps reduce the time taken for the norm and also reduces the memory requirements for the intermediate tensors.
- `shard_over_sequence`: Shard the KV cache along the sequence dimension to avoid replicating the KV cache for GQA models. This helps reduce the memory requirements and the time for loading the KV cache at higher sequence lengths.
- `context_unroll`: Setting the context unroll factor to 1 compiles only a single layer of the context encoding model (which is then executed multiple times). This avoids OOM issues and improves compile time with only minimal impact on performance.

In [None]:
## Load the model

from transformers_neuronx import LlamaForSampling, NeuronConfig, GQA
import torch

model_path = "Meta-Llama-3.1-405B-Instruct"

# load Meta-Llama-3.1-405B-Instruct to the NeuronCores with 128-way tensor parallelism and run compilation
# we pass n_positions and context_length_estimate buckets, which allow us to get low context encoding/token generation
# latency across sequence lengths up to 16k
buckets = [2048, 4096, 8192, 16384]

# set the same manual seed on every node so that each node uses the same inputs per sampling iteration
torch.manual_seed(1234)

neuron_config = NeuronConfig(
                    attention_layout='BSH',
                    fuse_qkv=True,
                    group_query_attention=GQA.REPLICATED_HEADS,
                    sequence_parallel_norm=True,
                    shard_over_sequence=True,
              )

neuron_model = LlamaForSampling.from_pretrained(model_path, n_positions=buckets, neuron_config=neuron_config, \
                                                context_length_estimate=buckets, context_unroll=1, \
                                                batch_size=1, tp_degree=128, amp='bf16')

Notice that buckets are used via `n_positions` and `context_length_estimate` to improve the latency. For more details about how to use bucketing effectively, refer to the [developer guide for bucketing](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/transformers-neuronx/transformers-neuronx-developer-guide.html?highlight=bucketing#bucketing).
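
Conceptually, bucketing means an input is handled by the smallest precompiled size that can hold it, so short prompts do not pay the cost of the largest sequence length. The sketch below illustrates that selection rule only. `select_bucket` is a hypothetical helper written for this explanation, not the library's implementation.

```python
from bisect import bisect_left


def select_bucket(num_tokens: int, buckets: list) -> int:
    """Return the smallest bucket that can hold num_tokens (illustrative only)."""
    ordered = sorted(buckets)
    idx = bisect_left(ordered, num_tokens)
    if idx == len(ordered):
        raise ValueError(
            f"{num_tokens} tokens exceed the largest bucket ({ordered[-1]})"
        )
    return ordered[idx]


example_buckets = [2048, 4096, 8192, 16384]
for n in (1500, 5000, 13000):
    print(f"{n} tokens -> bucket {select_bucket(n, example_buckets)}")
```

Under this rule, a prompt of roughly 13k tokens, like the one used later in this tutorial, lands in the 16384 bucket.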

In [None]:
# Load model on neuron cores and compile

neuron_model.to_neuron()

In [None]:
## Perform autoregressive sampling

import time
import torch
from transformers import AutoTokenizer
import requests

# construct a tokenizer and encode prompt text
# For the prompt we take a Python library, and ask the model to write some tests.
# The input length is ~13k tokens.
tokenizer = AutoTokenizer.from_pretrained(model_path)
prompt = requests.get("https://raw.githubusercontent.com/huggingface/transformers/e55b33ceb4b0ba3c8c11f20b6e8d6ca4b48246d4/src/transformers/generation/configuration_utils.py").text
prompt += "\n\n## ========================THE END======================\n"
prompt += "Write 4-5 tests for the above codebase."
# put in prompt format https://llama.meta.com/docs/model-cards-and-prompt-formats/llama3_1/#prompt-format
prompt = f"<|begin_of_text|><|start_header_id|>user<|end_header_id|> {prompt} <|eot_id|><|start_header_id|>assistant<|end_header_id|>"

input_ids = tokenizer.encode(prompt, return_tensors="pt") 
num_input_tokens = len(input_ids[0]) # ~13k tokens
print(f"num_input_tokens: {num_input_tokens}")

# run inference with top-k sampling
with torch.inference_mode():
    start = time.time()
    generated_sequences = neuron_model.sample(input_ids, sequence_length=16384, top_k=10)
    elapsed = time.time() - start

# display the new generated tokens
generated_sequences = [tokenizer.decode(seq[num_input_tokens:]) for seq in generated_sequences]
print(f'generated sequence {generated_sequences[0]} in {elapsed} seconds')
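
The elapsed time printed above can be turned into an end-to-end generation rate. This is a small sketch with a hypothetical helper and made-up example numbers. Because the timer wraps the whole `sample` call, the rate includes the context encoding of the prompt and therefore understates the pure token generation speed.

```python
def tokens_per_second(num_total_tokens: int, num_input_tokens: int,
                      elapsed_seconds: float) -> float:
    """End-to-end rate of newly generated tokens (hypothetical helper)."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    new_tokens = num_total_tokens - num_input_tokens
    return new_tokens / elapsed_seconds


# Made-up numbers for illustration: 13000 prompt tokens, 16384 tokens in total.
rate = tokens_per_second(num_total_tokens=16384, num_input_tokens=13000,
                         elapsed_seconds=100.0)
print(f"{rate:.2f} new tokens per second")
```

To apply it to a real run, take the length of a returned sequence before it is decoded to text and pass it as `num_total_tokens`.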

## Save and load the compiled model

The `save` and `load` functions can be used to save and load compiled model artifacts, respectively. Loading compiled model artifacts from a provided directory avoids model recompilation.

In [None]:
neuron_model.save('./neuron_artifacts') # can be copied and used on a different neuron instance
del neuron_model

neuron_model = LlamaForSampling.from_pretrained(model_path, n_positions=buckets, neuron_config=neuron_config, \
                                                context_length_estimate=buckets, context_unroll=1, \
                                                batch_size=1, tp_degree=128, amp='bf16')

neuron_model.load('neuron_artifacts') # Load the compiled Neuron artifacts
neuron_model.to_neuron() # will skip compile

with torch.inference_mode():
    start = time.time()
    generated_sequences = neuron_model.sample(input_ids, sequence_length=16384, top_k=10)
    elapsed = time.time() - start

generated_sequences = [tokenizer.decode(seq[num_input_tokens:]) for seq in generated_sequences]
print(f'generated sequence {generated_sequences[0]} in {elapsed} seconds')
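
The save and load steps above can be combined so that a script compiles only when no artifacts exist yet. This is a minimal sketch: `load_or_compile` is a hypothetical helper that relies only on the `load`, `to_neuron`, and `save` methods used in this tutorial, and it assumes a non-empty directory means the artifacts are complete and were produced with the same configuration.

```python
from pathlib import Path


def load_or_compile(model, artifacts_dir: str) -> bool:
    """Reuse compiled artifacts if present, otherwise compile and save them.

    Returns True when existing artifacts were reused (hypothetical helper).
    """
    path = Path(artifacts_dir)
    reused = path.is_dir() and any(path.iterdir())
    if reused:
        model.load(artifacts_dir)   # point the model at the saved artifacts
        model.to_neuron()           # skips compilation, as in the cell above
    else:
        model.to_neuron()           # compiles the model
        model.save(artifacts_dir)   # persist the artifacts for the next run
    return reused
```

In this tutorial it would be called as `load_or_compile(neuron_model, './neuron_artifacts')` on a freshly constructed `neuron_model`, in place of the separate `load` and `to_neuron` calls.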