## Running LLM inference on Tenstorrent hardware

If you want to run an LLM, you don’t have to start from scratch! 

Tenstorrent provides optimized implementations of many common models, including several versions of Llama and Qwen.

We also maintain a fork of the inference engine vLLM, which lets you create production deployments!

For a list of supported models and the required hardware, see the table on the [Tenstorrent developers page](https://tenstorrent.com/developers).

In this tutorial, we will be using [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct). We will first run a demo to verify our setup, then deploy the model on an inference server.

### Jupyter terminal tip
To open a terminal when you access the Jupyter server through a browser, click the Jupyter logo to go to the dashboard, then click `New > Other > Terminal`.


### Verifying access

_Llama-3.1-8B-Instruct_ is a gated model, which means you have to accept the license agreement before obtaining access.

At the start of this tutorial, you requested this access. Let's verify that everything has been processed correctly.

Make sure you are logged into your Hugging Face account:

In [None]:
!hf auth whoami

If you are not, log in: copy the following command (without the leading `# !`) into your terminal and follow the instructions.

In [None]:
# !hf auth login

Then download the model weights:

In [None]:
!hf download meta-llama/Llama-3.1-8B-Instruct

Finally, try accessing the config:

In [None]:
from transformers import AutoConfig

config = AutoConfig.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

### Running a demo

Demo scripts are included with your tt-metal installation.

The scripts rely on model implementations from the TT-Transformers kit.
Have a peek at the source!

In [None]:
!head -n 50 $TT_METAL_HOME/models/tt_transformers/tt/model.py

In [None]:
%env HF_MODEL=meta-llama/Llama-3.1-8B-Instruct
# export does not persist across cells
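The comment above is worth unpacking: each `!` command runs in its own subshell, so an `export` made there is gone by the next command, whereas `%env` sets the variable in the kernel process and every later subprocess inherits it. A minimal sketch of the same behavior in plain Python, assuming a POSIX `sh` is available; `DEMO_VAR` is a made-up name used only for illustration:

```python
import os
import subprocess

# DEMO_VAR is a hypothetical variable name used only for this illustration.
os.environ.pop("DEMO_VAR", None)

# An export inside a child shell changes only that child's environment.
subprocess.run(["sh", "-c", "export DEMO_VAR=from_child"], check=True)
seen_after_export = os.environ.get("DEMO_VAR")

# Setting the variable in this process (what %env does in a notebook) makes it
# visible to every child process started afterwards.
os.environ["DEMO_VAR"] = "from_parent"
child_output = subprocess.run(
    ["sh", "-c", "echo $DEMO_VAR"], capture_output=True, text=True, check=True
).stdout.strip()

print("after child export:", seen_after_export)
print("seen by later child:", child_output)
```

The first value shows that the parent never sees the child's `export`; the second shows that a variable set in the parent reaches later children.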

In [None]:
!echo $HF_MODEL
!cd $TT_METAL_HOME && pytest models/tt_transformers/demo/simple_text_demo.py -k "performance and batch-32"

### Building the vllm environment

tt-metal comes pre-built in your environment, but vLLM does not. Start by cloning the repository:

In [None]:
!git clone https://github.com/tenstorrent/vllm.git


In [None]:
!cd vllm && git checkout dev

Note that the Tenstorrent fork's main branch is `dev`. You can find the README [here](https://github.com/tenstorrent/vllm/blob/dev/tt_metal/README.md)

Then build an environment with vLLM and its dependencies by running the following commands **in your terminal**:

```
bash
cd vllm
export vllm_dir=$(pwd)
source $vllm_dir/tt_metal/setup-metal.sh
cd $TT_METAL_HOME
./create_venv.sh
cd $vllm_dir
echo '
# Custom environment variables for tt-metal
export TT_METAL_HOME="'"$TT_METAL_HOME"'"
export PYTHONPATH="${TT_METAL_HOME}"
# HF caching
export HF_HUB_CACHE="'"$(pwd)/../hf_cache"'"
export HF_XET_CACHE="'"$(pwd)/../hf_xet"'"
mkdir -p $HF_XET_CACHE
mkdir -p $HF_HUB_CACHE
export TT_CACHE_PATH="${TT_METAL_HOME}/../tt_cache"
mkdir -p $TT_CACHE_PATH
' >> $PYTHON_ENV_DIR/bin/activate
source $PYTHON_ENV_DIR/bin/activate
pip3 install --upgrade pip
cd $vllm_dir && pip install -e . --extra-index-url https://download.pytorch.org/whl/cpu
pip install -r $TT_METAL_HOME/models/tt_transformers/requirements.txt
echo "vllm environment created in $PYTHON_ENV_DIR"
```
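Since the script above appends several exports to the virtual environment's `activate` file, a quick check after sourcing it can catch a missing variable before you launch anything heavier. A minimal sketch: the helper name `missing_env_vars` is hypothetical, and the list of names is simply the set exported in the script above.

```python
import os

# Variables that the setup script above appends to the activate file.
REQUIRED_VARS = (
    "TT_METAL_HOME",
    "PYTHONPATH",
    "HF_HUB_CACHE",
    "HF_XET_CACHE",
    "TT_CACHE_PATH",
)


def missing_env_vars(env, required=REQUIRED_VARS):
    """Return the names in `required` that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]


missing = missing_env_vars(os.environ)
if missing:
    print("Not set:", ", ".join(missing))
else:
    print("All expected variables are set.")
```

Run it with the activated environment's `python`; an empty result means all five variables reached the process.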

### Running an inference server

Run the following commands **in your terminal**.

Activate the newly created environment:

```
source $TT_METAL_HOME/build/python_env_vllm/bin/activate
cd vllm
```

Test the vllm installation with an offline inference example:

```
HF_MODEL="meta-llama/Llama-3.1-8B-Instruct" python examples/offline_inference_tt.py --measure_perf --model "meta-llama/Llama-3.1-8B-Instruct" --max_model_len 65536
```

If everything worked, you can now start a server with an OpenAI-compatible API (still **in your terminal**):

```
VLLM_RPC_TIMEOUT=100000 HF_MODEL="meta-llama/Llama-3.1-8B-Instruct" MESH_DEVICE=N150 python examples/server_example_tt.py --model "meta-llama/Llama-3.1-8B-Instruct" --max_model_len 65536 --num_scheduler_steps 1
```
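The server needs some time to start, so it helps to wait until it answers before sending requests. The sketch below polls an HTTP endpoint until it returns status 200. It assumes the server listens on `localhost:8000` (the base URL used later in this tutorial) and exposes the `/v1/models` route that OpenAI-compatible servers provide; `wait_for_server` is a hypothetical helper, not part of vLLM.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url, timeout_s=600.0, interval_s=5.0):
    """Poll `url` until it answers with HTTP 200 or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(url, timeout=interval_s) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not reachable yet; try again after a pause
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Usage (assumed URL, matching the base URL used later in this tutorial):
# ready = wait_for_server("http://localhost:8000/v1/models")
# print("server ready:", ready)
```

The function returns `False` rather than raising on timeout, so a notebook cell can decide how to react.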

You can benchmark its performance with another script:

**Start another terminal and activate the vllm virtual environment before running it.**
```
python3 vllm/benchmarks/benchmark_serving.py \
            --backend vllm \
            --model "meta-llama/Llama-3.1-8B-Instruct" \
            --dataset-name random \
            --num-prompts 32 \
            --random-input-len 100 \
            --random-output-len 100 \
            --ignore-eos \
            --percentile-metrics ttft,tpot,itl,e2el
```

Watch the prompts' progress in your previous terminal, where you started the server!
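The `--percentile-metrics` flag in the benchmark command names four latency metrics. As a rough illustration of what each abbreviation measures, the sketch below derives them for a single request from token arrival times. The function and its inputs are hypothetical, and the definitions follow common usage (time to first token, time per output token, inter-token latency, end-to-end latency); the benchmark script's exact formulas may differ in detail.

```python
def request_latency_metrics(request_start, token_times):
    """Compute latency metrics for one request (same time unit as the inputs).

    request_start: time at which the request was sent.
    token_times: arrival time of each output token, in increasing order.
    """
    if not token_times:
        raise ValueError("token_times must contain at least one timestamp")

    ttft = token_times[0] - request_start   # time to first token
    e2el = token_times[-1] - request_start  # end-to-end latency
    # Inter-token latencies: gaps between consecutive output tokens.
    itl = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    # Time per output token: decode time averaged over the tokens after the first.
    if len(token_times) > 1:
        tpot = (e2el - ttft) / (len(token_times) - 1)
    else:
        tpot = 0.0
    return {"ttft": ttft, "tpot": tpot, "itl": itl, "e2el": e2el}


# Made-up timestamps in seconds, for illustration only.
metrics = request_latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25])
print(metrics)
```

In this toy request the first token dominates the latency, which is the pattern to expect when prefill is slow relative to decode.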

### Play with the server!

Here's an example of how you can interact with the server through the API. Experiment with it!

Note: We did not perform a warm-up here, so the first requests will be slow because they trigger kernel compilation.

In [None]:
import os
import math
from enum import Enum

from openai import OpenAI
from pydantic import BaseModel

# Base URL of the local OpenAI-compatible server started in the terminal
tt_base_url = "http://localhost:8000/v1"
model_id = "meta-llama/Llama-3.1-8B-Instruct"

In [None]:
client = OpenAI(
    base_url=tt_base_url,
    api_key="foobar",  # not used in this example, but the client requires a value
)
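Because the first requests trigger kernel compilation, it can be useful to send a few throwaway requests and time them before measuring anything. A minimal sketch: `time_requests` is a hypothetical helper that accepts any zero-argument callable, so it can wrap a call on the `client` created above.

```python
import time


def time_requests(send_request, count=3):
    """Call `send_request` `count` times and return each call's latency in seconds."""
    latencies = []
    for _ in range(count):
        start = time.perf_counter()
        send_request()
        latencies.append(time.perf_counter() - start)
    return latencies


# Usage with the client above (requires the server to be running):
# latencies = time_requests(
#     lambda: client.chat.completions.create(
#         model=model_id,
#         messages=[{"role": "user", "content": "Hello"}],
#         max_tokens=8,
#     )
# )
# print([round(t, 2) for t in latencies])
```

If the warm-up effect is present, the first latency in the list should stand out from the rest.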

In [None]:
class CarType(str, Enum):
    sedan = "sedan"
    suv = "SUV"
    truck = "Truck"
    coupe = "Coupe"


class CarDescription(BaseModel):
    brand: str
    model: str
    car_type: CarType


json_schema = CarDescription.model_json_schema()

completion = client.chat.completions.create(
    model=model_id,
    messages=[
        {
            "role": "user",
            "content": "Generate a JSON with the brand, model and car_type of the second most iconic car from the 90's",
        }
    ],
    extra_body={"guided_json": json_schema},
    logprobs=True,
    top_logprobs=5,
)

# Chat completions return the text in message.content (.text belongs to the legacy completions API)
print(completion.choices[0].message.content)
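Guided decoding constrains the output to the schema, but it is still worth parsing the reply before using it. The sketch below checks a reply with only the standard library; `parse_car_reply` and the sample string are hypothetical, and the allowed values mirror the `CarType` enum above. With pydantic available, `CarDescription.model_validate_json(...)` does the same job in one call.

```python
import json

# Allowed values, mirroring the CarType enum defined above.
ALLOWED_CAR_TYPES = ("sedan", "SUV", "Truck", "Coupe")
REQUIRED_FIELDS = ("brand", "model", "car_type")


def parse_car_reply(text):
    """Parse a JSON reply and check that it has the expected fields and car type."""
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [field for field in REQUIRED_FIELDS if field not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if data["car_type"] not in ALLOWED_CAR_TYPES:
        raise ValueError(f"unexpected car_type: {data['car_type']!r}")
    return data


# Hypothetical reply text, for illustration only (not actual model output).
sample = '{"brand": "ExampleBrand", "model": "ExampleModel", "car_type": "Coupe"}'
car = parse_car_reply(sample)
print(car["brand"], car["model"], car["car_type"])

# With a running server: car = parse_car_reply(completion.choices[0].message.content)
```

The request above also sets `logprobs=True`. If the server returns them, each value is a natural-log probability, so `math.exp(logprob)` converts it back to a probability.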

In [None]:
completion = client.chat.completions.create(
    model=model_id,
    messages=[
        {
            "role": "user",
            "content": "Hi! How are you doing?",
        }
    ],
)

In [None]:
print(completion.choices[0].message.content)