# Run Hugging Face `EleutherAI/gpt-j-6B` autoregressive sampling on Inf2 & Trn1 with Data Parallel

To make the most of this tutorial and use 24 cores across three processes, use an Inf2.48xlarge or Trn1.32xlarge instance.
If you are using an Inf2.24xlarge, modify the last section to run only two processes (16 cores).

## Set up the Jupyter Notebook

The following steps set up Jupyter Notebook and launch this tutorial:
1. Clone the [AWS Neuron Samples](https://github.com/aws-neuron/aws-neuron-samples) repo to your instance:
```
git clone https://github.com/aws-neuron/aws-neuron-samples.git
```
2. Navigate to the `transformers-neuronx` inference samples folder:
```
cd aws-neuron-samples/torch-neuronx/transformers-neuronx/inference
```
3. Follow the instructions in [Jupyter Notebook QuickStart](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/notebook/setup-jupyter-notebook-steps-troubleshooting.html) to run Jupyter Notebook on your instance.
4. Locate this tutorial in your Jupyter Notebook session (`gptj-6b-sampling.ipynb`) and launch it. Follow the rest of the instructions in this tutorial. 

## Install Dependencies

This tutorial requires the following pip packages:

 - `torch-neuronx`
 - `neuronx-cc`
 - `transformers`
 - `transformers-neuronx`

Most of these packages are installed when you configure your environment using the [torch-neuronx inference setup guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/setup/setup-inference.html). Install the remaining dependency, `transformers-neuronx`, here:

In [None]:
!pip install transformers-neuronx -U

## Download and construct the model

We download and construct the `EleutherAI/gpt-j-6B` model using the Hugging Face `from_pretrained` method.

In [None]:
from transformers.models.auto import AutoModelForCausalLM

hf_model = AutoModelForCausalLM.from_pretrained('EleutherAI/gpt-j-6B', low_cpu_mem_usage=True)

## Split the model state_dict into multiple files

To reduce host memory usage, we recommend saving the model `state_dict` as
multiple files rather than as the single monolithic file produced by `torch.save`. This "split-format"
`state_dict` can be created using the `save_pretrained_split` function. With this checkpoint format,
the Neuron model loader can load parameters directly into the Neuron device high-bandwidth memory (HBM)
while keeping at most one layer of model parameters in CPU main memory.

To reduce memory usage during compilation and deployment, we cast the attention and MLP modules, along with the language model head (`lm_head`), to `float16` precision before saving the model. We keep the layernorms in `float32`. To do this, we implement a callback function that casts each layer in the model.

In [None]:
import torch
from transformers_neuronx.module import save_pretrained_split

def amp_callback(model, dtype):
    # cast attention, mlp, and lm_head to low precision; layernorms stay as f32
    for block in model.transformer.h:
        block.attn.to(dtype)
        block.mlp.to(dtype)
    model.lm_head.to(dtype)

amp_callback(hf_model, torch.float16)
save_pretrained_split(hf_model, './gptj-6b-split')
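
As a rough back-of-the-envelope check on why the cast helps, the sketch below estimates the weight footprint at each precision. It assumes a round figure of 6 billion parameters (the exact count differs slightly) and ignores the layernorm weights that stay in `float32`, as well as activations and any caches. `weight_gib` is a hypothetical helper written for this illustration, not part of `transformers-neuronx`.

```python
# Rough estimate of weight memory at different precisions.
# ASSUMPTION: ~6e9 parameters; layernorms, activations, and caches are ignored.
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_gib(n_params: int, dtype: str) -> float:
    """Return the approximate size of the weights in GiB for the given dtype."""
    return n_params * BYTES_PER_PARAM[dtype] / 2**30


n_params = 6_000_000_000  # assumed round figure, not an exact parameter count
for dtype in ("float32", "float16"):
    print(f"{dtype}: {weight_gib(n_params, dtype):.1f} GiB")
```

Under these assumptions, the `float16` weights take half the space of the `float32` weights.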

You can use more cores by running multiple processes (data parallel).

## Data Parallel Optimization for Throughput

This example shows that the same program can run in multiple processes. For example, running three processes with 8 cores each uses 24 cores instead of the 16 cores used by two processes. Using more cores in this way increases throughput. The script below runs with a batch size of 64.
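
One common way to keep data-parallel processes from competing for the same cores is to give each process its own contiguous core range through the Neuron runtime's `NEURON_RT_VISIBLE_CORES` environment variable. The sketch below only illustrates that idea and is not taken from `gpt-j-dp.py`, which may work differently. `core_ranges` and `launch_workers` are hypothetical helper names, and the script path passed to `launch_workers` is a placeholder.

```python
import os
import subprocess
import sys
from typing import List


def core_ranges(num_processes: int, cores_per_process: int) -> List[str]:
    """Return one contiguous, non-overlapping core range string per process."""
    ranges = []
    for rank in range(num_processes):
        first = rank * cores_per_process
        last = first + cores_per_process - 1
        ranges.append(f"{first}-{last}")
    return ranges


def launch_workers(script: str, num_processes: int, cores_per_process: int) -> List[int]:
    """Start one worker process per core range and return their exit codes.

    ASSUMPTION: each worker runs the same (hypothetical) script and reads
    NEURON_RT_VISIBLE_CORES from its environment to pick its cores.
    """
    procs = []
    for cores in core_ranges(num_processes, cores_per_process):
        env = dict(os.environ, NEURON_RT_VISIBLE_CORES=cores)
        procs.append(subprocess.Popen([sys.executable, script], env=env))
    return [p.wait() for p in procs]


# Three processes with 8 cores each cover cores 0 through 23.
print(core_ranges(3, 8))  # → ['0-7', '8-15', '16-23']
```

With two processes instead of three, `core_ranges(2, 8)` yields only the first two ranges, which matches the 16-core configuration.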

In [None]:
# Print the contents of the Python script that runs the data parallel optimization
with open('gpt-j-dp.py', 'r') as f:
    print(f.read())

In [None]:
# Run the Python script
!python gpt-j-dp.py