# Inference Demo with Llama2 Locally

Once the models you want have been downloaded, you can run them locally using the commands in the sections below.

Different models require different model-parallel (MP) values:


|  Model | MP |
|--------|----|
| 7B     | 1  |
| 13B    | 2  |
| 70B    | 8  |
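
When scripting launches, the table above can be kept as a small lookup. The following is only an illustrative sketch: the names `MP_VALUES` and `mp_for_model` are hypothetical and are not part of the Llama2 repository.

```python
# Model-parallel (MP) values copied from the table above.
MP_VALUES = {"7b": 1, "13b": 2, "70b": 8}


def mp_for_model(size: str) -> int:
    """Return the MP value for a model size such as "7B" or "13b"."""
    key = size.strip().lower()
    if key not in MP_VALUES:
        raise ValueError(
            f"unknown model size {size!r}; expected one of {sorted(MP_VALUES)}"
        )
    return MP_VALUES[key]


if __name__ == "__main__":
    # Print the MP value for each model size in the table.
    for size in ("7B", "13B", "70B"):
        print(size, mp_for_model(size))
```

The returned value is the number to pass to `--nproc_per_node` in the commands that follow.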

## Llama2 7b

In [1]:
%%html

<pre>
   Step 1: Open a terminal session 
   Step 2: Navigate to the directory by executing: <font color="green">cd /data/ai/tutorial/Llama2_on_HPG/llama</font>
   Step 3: Request compute resources by executing: <font color="green">srun -p hpg-ai --gpus=1 --ntasks=1 --cpus-per-task=4 --mem 50gb  --pty -u bash -i</font>
   Step 4: Load the Llama2 module by executing: <font color="green">ml llama/2</font>
   Step 5: Run the pretrained Llama2 7b model on 1 GPU, example 1 (chat completion): <font color="green">
           torchrun --nproc_per_node 1 example_chat_completion.py \
               --ckpt_dir /data/ai/models/nlp/llama/models_llama2/llama-2-7b/ \
               --tokenizer_path /data/ai/models/nlp/llama/models_llama2/llama-2-7b/tokenizer.model \
               --max_seq_len 512 --max_batch_size 6</font>
   Step 6: Run the pretrained Llama2 7b model on 1 GPU, example 2 (text completion): <font color="green">
           torchrun --nproc_per_node 1 example_text_completion.py \
               --ckpt_dir /data/ai/models/nlp/llama/models_llama2/llama-2-7b/ \
               --tokenizer_path /data/ai/models/nlp/llama/models_llama2/llama-2-7b/tokenizer.model \
               --max_seq_len 512 --max_batch_size 6</font>
</pre>

**Note**
- Set `--nproc_per_node` to the `MP` value for the model you are using.
- Adjust the `max_seq_len` and `max_batch_size` parameters as needed.
- This example runs the [example_chat_completion.py](./llama3/example_chat_completion.py) found in this repository, but you can change that to a different .py file.
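
As a rough sketch of how these options fit together, the snippet below assembles the same `torchrun` command used in the steps above from its parts. The helper name `build_torchrun_command` and the constant `MODEL_ROOT` are hypothetical; the paths and flags are the ones shown in this tutorial. The function only builds the argument list and does not run anything.

```python
import shlex

# Model directory root used throughout this tutorial.
MODEL_ROOT = "/data/ai/models/nlp/llama/models_llama2"


def build_torchrun_command(model, nproc_per_node,
                           script="example_chat_completion.py",
                           max_seq_len=512, max_batch_size=6):
    """Return the torchrun argument list for a model directory such as "llama-2-7b"."""
    ckpt_dir = f"{MODEL_ROOT}/{model}/"
    return [
        "torchrun", "--nproc_per_node", str(nproc_per_node), script,
        "--ckpt_dir", ckpt_dir,
        "--tokenizer_path", f"{ckpt_dir}tokenizer.model",
        "--max_seq_len", str(max_seq_len),
        "--max_batch_size", str(max_batch_size),
    ]


if __name__ == "__main__":
    # Print the 13b command (MP = 2) as a single shell line.
    print(shlex.join(build_torchrun_command("llama-2-13b", 2)))
```

Passing a different `script` value, such as `example_text_completion.py`, swaps the example being run while keeping the other arguments unchanged.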

## Llama2 13b

In [2]:
%%html

<pre>
   Step 1: Open a terminal session 
   Step 2: Navigate to the directory by executing: <font color="green">cd /data/ai/tutorial/Llama2_on_HPG/llama</font>
   Step 3: Request compute resources by executing: <font color="green">srun -p hpg-ai --gpus=2 --ntasks=2 --cpus-per-task=4 --mem 50gb  --pty -u bash -i</font>
   Step 4: Load the Llama2 module by executing: <font color="green">ml llama/2</font>
   Step 5: Run the pretrained Llama2 13b model on 2 GPUs, example 1 (chat completion): <font color="green">
           torchrun --nproc_per_node 2 example_chat_completion.py \
               --ckpt_dir /data/ai/models/nlp/llama/models_llama2/llama-2-13b/ \
               --tokenizer_path /data/ai/models/nlp/llama/models_llama2/llama-2-13b/tokenizer.model \
               --max_seq_len 512 --max_batch_size 6</font>
   Step 6: Run the pretrained Llama2 13b model on 2 GPUs, example 2 (text completion): <font color="green">
           torchrun --nproc_per_node 2 example_text_completion.py \
               --ckpt_dir /data/ai/models/nlp/llama/models_llama2/llama-2-13b/ \
               --tokenizer_path /data/ai/models/nlp/llama/models_llama2/llama-2-13b/tokenizer.model \
               --max_seq_len 512 --max_batch_size 6</font>
</pre>

## Llama2 70b

In [3]:
%%html

<pre>
   Step 1: Open a terminal session 
   Step 2: Navigate to the directory by executing: <font color="green">cd /data/ai/tutorial/Llama2_on_HPG/llama</font>
   Step 3: Request compute resources by executing: <font color="green">srun -p hpg-ai --gpus=8 --ntasks=8 --cpus-per-task=4 --mem 50gb  --pty -u bash -i</font>
   Step 4: Load the Llama2 module by executing: <font color="green">ml llama/2</font>
   Step 5: Run the pretrained Llama2 70b model on 8 GPUs, example 1 (chat completion): <font color="green">
           torchrun --nproc_per_node 8 example_chat_completion.py \
               --ckpt_dir /data/ai/models/nlp/llama/models_llama2/llama-2-70b/ \
               --tokenizer_path /data/ai/models/nlp/llama/models_llama2/llama-2-70b/tokenizer.model \
               --max_seq_len 512 --max_batch_size 6</font>
   Step 6: Run the pretrained Llama2 70b model on 8 GPUs, example 2 (text completion): <font color="green">
           torchrun --nproc_per_node 8 example_text_completion.py \
               --ckpt_dir /data/ai/models/nlp/llama/models_llama2/llama-2-70b/ \
               --tokenizer_path /data/ai/models/nlp/llama/models_llama2/llama-2-70b/tokenizer.model \
               --max_seq_len 512 --max_batch_size 6</font>
</pre>

If you encounter `RuntimeError: CUDA error: invalid device ordinal`, check how many GPUs your session can actually see.

* Check available GPUs: Verify the number of GPUs visible on your system so that every local rank falls within the valid range. Run the following in your terminal:

   ```shell
   # Prints the number of GPUs visible to PyTorch in the current session
   python -c "import torch; print(torch.cuda.device_count())"
   ```
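
This error typically appears when a process is assigned a GPU index that does not exist, for example when `--nproc_per_node` is larger than the number of visible GPUs. The sketch below compares the two numbers before launching. The function name `check_gpu_count` is hypothetical, and the value 8 is the MP value for the 70B model from the table at the top of this page.

```python
def check_gpu_count(required: int, available: int) -> str:
    """Compare the model's MP value with the number of visible GPUs."""
    if available >= required:
        return f"OK: {available} GPU(s) visible, {required} required"
    return (
        f"Not enough GPUs: {available} visible, {required} required; "
        "request more GPUs or use a smaller model"
    )


if __name__ == "__main__":
    try:
        import torch  # only needed when running the check for real
        available = torch.cuda.device_count()
    except ImportError:
        # Without PyTorch installed, treat the session as having no GPUs.
        available = 0
    print(check_gpu_count(required=8, available=available))  # 8 = MP for 70B
```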
   
If `srun` cannot allocate 8 GPUs, try allocating the resources from a Jupyter notebook on Open OnDemand (OOD) and run the same check there.

In [4]:
import torch
print(torch.cuda.device_count())

8
