# Run Hugging Face `upstage/SOLAR-10.7B-Instruct-v1.0` autoregressive sampling on Inf2 & Trn1

In this example we compile and deploy the Hugging Face [upstage/SOLAR-10.7B-Instruct-v1.0](https://huggingface.co/upstage/SOLAR-10.7B-Instruct-v1.0) model for tensor parallel inference on Neuron using the `transformers-neuronx` package.

The example has the following main sections:
1. Set up the Jupyter Notebook
1. Install dependencies
1. Download the model
1. Construct the model
1. Split the model `state_dict` into multiple files
1. Perform autoregressive sampling using tensor parallelism
1. Save and load the compiled model

This Jupyter Notebook can be run on an Inf2 instance (`inf2.48xlarge`) or Trn1 instance (`trn1.32xlarge`).

## Set up the Jupyter Notebook

The following steps set up Jupyter Notebook and launch this tutorial:
1. Clone the [AWS Neuron Samples](https://github.com/aws-neuron/aws-neuron-samples) repo to your instance using
```
git clone https://github.com/aws-neuron/aws-neuron-samples.git
```
2. Navigate to the `transformers-neuronx` inference samples folder
```
cd aws-neuron-samples/torch-neuronx/transformers-neuronx/inference
```
3. Follow the instructions in [Jupyter Notebook QuickStart](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/notebook/setup-jupyter-notebook-steps-troubleshooting.html) to run Jupyter Notebook on your instance.
4. Locate this tutorial in your Jupyter Notebook session (`SOLAR-10.7B-Instruct-v1.0-sampling.ipynb`) and launch it. Follow the rest of the instructions in this tutorial. 

## Install Dependencies
This tutorial requires the following pip packages:

 - `torch-neuronx`
 - `neuronx-cc`
 - `sentencepiece`
 - `transformers`
 - `transformers-neuronx`


Most of these packages will be installed when configuring your environment using the [torch-neuronx inference setup guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/torch-neuronx.html#setup-torch-neuronx). The additional dependencies must be installed here:

In [1]:
!pip install transformers-neuronx sentencepiece

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
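To confirm that the environment actually contains the packages listed above, a quick check along the following lines may help. This is a minimal sketch that only uses the standard library; the helper name `installed_versions` is hypothetical and not part of any Neuron package.

```python
from importlib import metadata

# Distribution names as listed in this tutorial.
REQUIRED_PACKAGES = [
    "torch-neuronx",
    "neuronx-cc",
    "sentencepiece",
    "transformers",
    "transformers-neuronx",
]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(REQUIRED_PACKAGES).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any package reported as missing should be installed before continuing.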


## Download the model

Follow the steps described in [upstage/SOLAR-10.7B-Instruct-v1.0](https://huggingface.co/upstage/SOLAR-10.7B-Instruct-v1.0) to download the SOLAR model weights and tokenizer from Upstage.

Note: For the purposes of this sample we assume you have saved the SOLAR-10.7B-Instruct-v1.0 model in a directory called `SOLAR-10.7B-Instruct-v1.0` with the following format:
```
SOLAR-10.7B-Instruct-v1.0/
├── config.json
├── generation_config.json
├── pytorch_model-00001-of-00003.bin
├── pytorch_model-00002-of-00003.bin
├── pytorch_model-00003-of-00003.bin
├── pytorch_model.bin.index.json
├── special_tokens_map.json
├── tokenizer.json
├── tokenizer.model
└── tokenizer_config.json
```
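Before constructing the model, it can be worth checking that the download matches the layout above. The following is a minimal sketch using only the standard library; the helper names `missing_model_files` and `weight_shards` are hypothetical, and the directory name is the one assumed in the note above.

```python
from pathlib import Path

# Non-weight files from the directory listing above.
EXPECTED_FILES = [
    "config.json",
    "generation_config.json",
    "pytorch_model.bin.index.json",
    "special_tokens_map.json",
    "tokenizer.json",
    "tokenizer.model",
    "tokenizer_config.json",
]


def missing_model_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in model_dir."""
    root = Path(model_dir)
    return [name for name in expected if not (root / name).is_file()]


def weight_shards(model_dir):
    """Return the sorted names of the sharded weight files in model_dir."""
    return sorted(p.name for p in Path(model_dir).glob("pytorch_model-*.bin"))


model_dir = "SOLAR-10.7B-Instruct-v1.0"
print("missing files:", missing_model_files(model_dir))
print("weight shards:", weight_shards(model_dir))
```

An empty list of missing files and a non-empty list of weight shards suggest the download is complete. The number of shards depends on how the checkpoint was saved, so the sketch does not check for a specific count.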

## Construct the model

After downloading the model, we construct it with the Hugging Face `LlamaForCausalLM` class:

In [2]:
from transformers import LlamaForCausalLM

model = LlamaForCausalLM.from_pretrained('upstage/SOLAR-10.7B-Instruct-v1.0')

Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]

## Split the model state_dict into multiple files

To reduce host memory usage, we recommend saving the model `state_dict` as
multiple files rather than as the single monolithic file produced by `torch.save`. This "split-format"
`state_dict` can be created using the `save_pretrained_split` function. With this checkpoint format,
the Neuron model loader can load parameters directly into the Neuron device high-bandwidth memory (HBM)
while keeping at most one layer of model parameters in CPU main memory.

In [3]:
import torch
from transformers_neuronx.module import save_pretrained_split

save_pretrained_split(model, './SOLAR-10.7B-Instruct-v1.0-split')

## Perform autoregressive sampling using tensor parallelism

Now we have all of the necessary files for running `upstage/SOLAR-10.7B-Instruct-v1.0` autoregressive sampling. 

The memory required to host any model can be computed with:
```
total memory = bytes per parameter * number of parameters
```
With weights cast to `float16` (2 bytes per parameter), the 10.7 billion parameters of this model work out to `2 * 10.7B`, or ~21.4GB of weights. Each NeuronCore has 16GB of memory, which means that a 21.4GB model cannot fit on a single NeuronCore. In practice, the total space required is greater than the weights alone because the attention key and value projections are cached (KV caching). This cache grows linearly with sequence length and batch size.
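To get a feel for how the KV cache scales, the sketch below computes a rough size estimate from the model dimensions. It is a back-of-the-envelope calculation, not a description of how `transformers-neuronx` lays out its cache. The function name `kv_cache_bytes` is hypothetical, and the layer count, key/value head count, and head dimension are assumed values; check `num_hidden_layers`, `num_key_value_heads`, and `hidden_size / num_attention_heads` in the model's `config.json` before relying on them.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, sequence_length,
                   batch_size, bytes_per_value=2):
    """Estimate KV cache size: one key and one value tensor per layer."""
    return (2 * n_layers * n_kv_heads * head_dim
            * sequence_length * batch_size * bytes_per_value)


# Assumed model dimensions (verify against config.json).
N_LAYERS = 48
N_KV_HEADS = 8
HEAD_DIM = 128

for seq_len in (512, 1024, 2048):
    size = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, seq_len, batch_size=1)
    print(f"sequence_length={seq_len}: {size / 2**30:.3f} GiB")
```

Doubling either the sequence length or the batch size doubles the estimate, which is the linear growth described above.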

To get very large language models to fit on Inf2 & Trn1, tensor parallelism is used to split weights, data, and compute across multiple NeuronCores. The number of NeuronCores that the weights are split across can be controlled by setting the `tp_degree` parameter. This parallelism degree must be chosen to ensure that the memory usage per NeuronCore will be less than the physical 16GB limit. When configuring tensor parallelism, the memory per NeuronCore can be computed with:

```
memory per core = (bytes per parameter * number of parameters) / tp_degree
```

This can be used to compute the minimum instance sizing by ensuring that the value selected for `tp_degree` results in less than 16GB allocated per NeuronCore.
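The two formulas above can be sketched in a few lines of Python. The function names are hypothetical, the 10.7 billion parameter count is taken from the model name, and the estimate covers weights only, so it leaves out the KV cache and any runtime overhead.

```python
import math

NEURONCORE_MEMORY_GB = 16  # per-NeuronCore limit stated in this tutorial


def weights_memory_gb(n_params_billion, bytes_per_param=2):
    """total memory = bytes per parameter * number of parameters (in GB)."""
    return n_params_billion * bytes_per_param


def memory_per_core_gb(n_params_billion, tp_degree, bytes_per_param=2):
    """memory per core = total memory / tp_degree (in GB)."""
    return weights_memory_gb(n_params_billion, bytes_per_param) / tp_degree


def min_tp_degree(n_params_billion, bytes_per_param=2,
                  core_memory_gb=NEURONCORE_MEMORY_GB):
    """Smallest tp_degree that keeps the weights strictly under the core limit."""
    total = weights_memory_gb(n_params_billion, bytes_per_param)
    return math.floor(total / core_memory_gb) + 1


params_billion = 10.7  # from the model name
print(f"weights: {weights_memory_gb(params_billion):.1f} GB")
print(f"minimum tp_degree: {min_tp_degree(params_billion)}")
for tp in (2, 8, 24, 32):
    print(f"tp_degree={tp}: {memory_per_core_gb(params_billion, tp):.2f} GB per core")
```

The minimum computed here is only a lower bound on the weights; the remaining headroom per core has to cover the KV cache as well.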

Note that increasing the `tp_degree` beyond the minimum requirement almost always results in a faster model. A higher tensor parallelism degree spreads the weights across more NeuronCores, which increases the aggregate memory bandwidth available to the model and so improves performance. To optimize performance, we recommend using the highest tensor parallelism degree that is supported by the instance. In this sample we use tensor parallelism degree 24 to optimize performance on `inf2.48xlarge`, but this should be changed to 32 if you are using a `trn1.32xlarge`.
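One way to avoid hard-coding the degree is to look it up from the instance type. The sketch below is a hypothetical helper, not part of `transformers-neuronx`; it only covers the two instance types and degrees named in this tutorial.

```python
# Tensor parallelism degrees recommended in this tutorial, keyed by instance type.
TP_DEGREE_BY_INSTANCE = {
    "inf2.48xlarge": 24,
    "trn1.32xlarge": 32,
}


def max_tp_degree(instance_type):
    """Return the tensor parallelism degree this tutorial uses for an instance type."""
    try:
        return TP_DEGREE_BY_INSTANCE[instance_type]
    except KeyError:
        raise ValueError(f"No tp_degree recorded for {instance_type!r}") from None


tp_degree = max_tp_degree("inf2.48xlarge")
print(f"tp_degree={tp_degree}")
```

The returned value can then be passed as the `tp_degree` argument when loading the model.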

We will use the Neuron `LlamaForSampling` class to implement tensor parallelism for the SOLAR model. The default model config supports sampling up to sequence length 2048. Tensor parallelism is enabled through the argument `tp_degree=24`. We enable `float16` casting with the `amp='f16'` flag. The model computational graph is compiled by `neuronx-cc` for optimized inference on Neuron.

In [4]:
import time
import torch
from transformers import AutoTokenizer
from transformers_neuronx.llama.model import LlamaForSampling

import os
# Compiler flag -O1 is a workaround for “Too many instructions after unroll” in SDK 2.14
# os.environ['NEURON_CC_FLAGS'] = '-O1'

# load upstage/SOLAR-10.7B-Instruct-v1.0 to the NeuronCores with 24-way tensor parallelism and run compilation
neuron_model = LlamaForSampling.from_pretrained('./SOLAR-10.7B-Instruct-v1.0-split', batch_size=1, tp_degree=24, amp='f16')
neuron_model.to_neuron()

# construct a tokenizer and encode prompt text
tokenizer = AutoTokenizer.from_pretrained('upstage/SOLAR-10.7B-Instruct-v1.0')
prompt = "Hello, I'm a language model,"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# run inference with top-k sampling
with torch.inference_mode():
    start = time.time()
    generated_sequences = neuron_model.sample(input_ids, sequence_length=2048, top_k=50)
    elapsed = time.time() - start

generated_sequences = [tokenizer.decode(seq) for seq in generated_sequences]
print(f'generated sequences {generated_sequences} in {elapsed} seconds')



2024-03-29 07:48:09.000876:  12995  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-03-29 07:48:09.000942:  12995  INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.12.68.0+4480452af/MODULE_3ccbbf8fad9f8653719c+2c2d707e/model.neff. Exiting with a successfully compiled graph.
2024-03-29 07:48:09.000967:  12997  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-03-29 07:48:10.000012:  12997  INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.12.68.0+4480452af/MODULE_159bbea91adf9a015aed+2c2d707e/model.neff. Exiting with a successfully compiled graph.
2024-03-29 07:48:10.000208:  12998  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-03-29 07:48:10.000253:  13000  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-03-29 07:48:10.000268:  12998  INFO ||NEURON_CC_WRAPPER||: Using a cached neff 
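The elapsed time measured in the cell above is easier to compare across configurations when expressed as tokens per second. The following is a minimal sketch with a hypothetical helper, `generation_throughput`; the sequences and timing used here are made-up stand-ins, not measurements from this notebook.

```python
def generation_throughput(sequences, prompt_length, elapsed_seconds):
    """Return newly generated tokens per second, summed over the batch."""
    new_tokens = sum(max(len(seq) - prompt_length, 0) for seq in sequences)
    return new_tokens / elapsed_seconds


# Stand-in data: one sequence of 12 token ids, of which 4 were the prompt.
example_sequences = [list(range(12))]
rate = generation_throughput(example_sequences, prompt_length=4, elapsed_seconds=2.0)
print(f"{rate:.1f} tokens/second")
```

In the notebook, this would be called with the token tensor returned by `sample` (before it is decoded into strings), `input_ids.shape[1]` as the prompt length, and `elapsed`. Because the timer wraps the whole `sample` call, the result is an end-to-end figure that includes processing of the prompt.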

## Save and load the compiled model

The `save` and `load` functions can be used to save and load compiled model artifacts respectively. Loading compiled model artifacts from a provided directory will avoid model recompilation.

In [5]:
neuron_model.save('./solar_neuron_artifacts') # can be copied and used on a different neuron instance
del neuron_model
neuron_model = LlamaForSampling.from_pretrained('./SOLAR-10.7B-Instruct-v1.0-split', batch_size=1, tp_degree=24, amp='f16')
neuron_model.load('solar_neuron_artifacts') # Load the compiled Neuron artifacts
neuron_model.to_neuron() # will skip compile

with torch.inference_mode():
    start = time.time()
    generated_sequences = neuron_model.sample(input_ids, sequence_length=2048, top_k=50)
    elapsed = time.time() - start

print(f'generated sequences {generated_sequences} in {elapsed} seconds')

generated sequences tensor([[    1, 22557, 28725,   315, 28742, 28719,   264,  3842,  2229, 28725,
           304,   315,   541,  3084,  6516, 15988,   297,   456,  2758, 28723,
          1791,  2225,   264, 28705, 28740, 28757,   401,  5550,   356,   264,
           808,   302,  1178,  2078,   297,   264, 28705, 28750, 28757,  2293,
          1413, 21366, 28725,   368,   541,   938,   272,  2522,   508, 28724,
          7607, 28742, 28713,   285,   632, 28732,  2654, 28725, 11022, 28746,
          5364, 28731,   908, 28723,  1047,   574,  1178,   349,   297,   264,
         28705, 28750, 28757,  2293,   970,   624,  9711, 10651,   272, 11010,
         28725,   368,   541, 15759,   272,  2293,   304,   272, 11022,  5270,
           298, 15759,  2267,   690, 11022,   368,   947,   272,   401,  5550,
           298,   347, 16860, 28723,    13,    13, 15423,   349,   396,  2757,
          1413, 15755,  6218,   304,  2522,   508, 28724, 28747,    13,    13,
         13940, 28832, 17667,   