# Querying Chat Models from HF Inference API & Endpoints

This notebook provides simple demonstrations of working with chat models via the HF Inference API and Inference Endpoints.


In [1]:
!pip install --upgrade -q huggingface-hub transformers jinja2 openai ipywidgets


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.0.1[0m[39;49m -> [0m[32;49m24.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


## Setup


In [2]:
from transformers import AutoTokenizer
from huggingface_hub import InferenceClient, interpreter_login, get_token

interpreter_login()

### Instantiate an `InferenceClient`

See [the docs](https://huggingface.co/docs/huggingface_hub/package_reference/inference_client) for details.


In [3]:
# Note that we can optionally specify a model name or Inference Endpoint URL here, or later when calling the model.
client = InferenceClient()

## Generate Text Using a Chat Template

Chat models are trained with different formats for converting multi-turn conversations into a single tokenizable string. Using a format different from the one a model was trained with will usually cause severe, silent performance degradation, so matching the format used during training is extremely important.

For example, Llama2 uses the following prompt structure to delineate between system, user, and assistant dialog turns:

```
<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_prompt} [/INST]{assistant_response}</s><s>[INST]{user_prompt}[/INST] {assistant_response}
```
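
To make this structure concrete, here is a minimal hand-rolled sketch of a Llama 2 style formatter. The function name `build_llama2_prompt` is hypothetical, and the exact whitespace around assistant turns is an assumption; it is meant only to illustrate how a list of messages gets flattened into one string, not to replace the model's real chat template.

```python
def build_llama2_prompt(messages):
    """Simplified Llama 2 chat formatter, for illustration only."""
    messages = list(messages)

    # An optional leading system message is folded into the first user turn.
    system_block = ""
    if messages and messages[0]["role"] == "system":
        system_block = f"<<SYS>>\n{messages[0]['content']}\n<</SYS>>\n\n"
        messages = messages[1:]

    prompt = ""
    for index, message in enumerate(messages):
        content = message["content"]
        if message["role"] == "user":
            if index == 0:
                content = system_block + content
            prompt += f"<s>[INST] {content.strip()} [/INST]"
        elif message["role"] == "assistant":
            # Assumed spacing: one space before the reply and before </s>.
            prompt += f" {content.strip()} </s>"
        else:
            raise ValueError(f"Unexpected role: {message['role']}")
    return prompt


example = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
print(build_llama2_prompt(example))
```

Writing a formatter like this for every model quickly becomes error-prone, which is the motivation for relying on the template shipped with each model instead.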

The special tokens and usage schema vary from model to model. To make sure we're using the correct format, we can use a model's [chat template](https://huggingface.co/docs/transformers/main/en/chat_templating) via its tokenizer.


In [4]:
system_input = (
    "You are a helpful, respectful and honest assistant that speaks like a pirate."
)
user_input = "How many helicopters can a human eat in one sitting?"
messages = [
    {"role": "system", "content": system_input},
    {"role": "user", "content": user_input},
]

model_id = "meta-llama/Llama-2-70b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(f"PROMPT:\n-----\n{prompt}")

PROMPT:
-----
<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant that speaks like a pirate.
<</SYS>>

How many helicopters can a human eat in one sitting? [/INST]


Notice how the `apply_chat_template` method has taken our familiar list of `messages` and converted it into the properly formatted string that our model expects. We can pass this formatted string to our model:


In [5]:
response = client.text_generation(prompt, model=model_id, max_new_tokens=100)
print(response)

  Arrrr, me hearty! I be glad ye asked me that question, but I gotta say, it be a bit of a strange one, savvy? I mean, who's to say how many helicopters a human can eat in one sitting, matey?

Now, I know what ye be thinkin', "But me hearty, what if I be really hungry and I want to eat a whole bunch of helicopters?"


## What about models _without_ a system prompt?

Some models, like Mistral & Mixtral, weren't trained with a system prompt in their prompt structure. Their prompts look like this:

```
<s> [INST] Instruction [/INST] Model answer</s> [INST] Follow-up instruction [/INST]
```

In this case, if we want to use a system prompt, we can prepend it to our first instruction.


In [6]:
system_input = (
    "You are a helpful, respectful and honest assistant that speaks like a pirate."
)
user_input = "How many helicopters can a human eat in one sitting?"
messages = [
    {"role": "user", "content": system_input + " " + user_input},
]

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(f"PROMPT:\n-----\n{prompt}")

PROMPT:
-----
<s>[INST] You are a helpful, respectful and honest assistant that speaks like a pirate. How many helicopters can a human eat in one sitting? [/INST]


In [7]:
response = client.text_generation(prompt, model=model_id, max_new_tokens=500)
print(response)

 Arr matey, I be a pirate assistant, not a doctor of human digestion! In all me years on the seven seas, I never heard of a human eatin' a helicopter, one sitting or otherwise. That be a dangerous and impossible task, as helicopters be made of metal and other such unchewable parts. Best to leave the eatin' to the chickens and pigs, they be more suited for it, I'm sure.
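
If you already have a `messages` list that contains a system entry, the manual concatenation used above can be wrapped in a small helper. The sketch below is one possible approach; `fold_system_prompt` and its `separator` parameter are hypothetical names, not part of `transformers` or `huggingface_hub`.

```python
def fold_system_prompt(messages, separator=" "):
    """Return a copy of `messages` with a leading system message merged
    into the first user message. The input list is left unchanged."""
    if not messages or messages[0]["role"] != "system":
        # Nothing to fold; return a shallow copy of each message.
        return [dict(m) for m in messages]

    system, rest = messages[0], messages[1:]
    if not rest or rest[0]["role"] != "user":
        raise ValueError("A system message must be followed by a user message.")

    merged = {
        "role": "user",
        "content": system["content"] + separator + rest[0]["content"],
    }
    return [merged] + [dict(m) for m in rest[1:]]


chat = [
    {"role": "system", "content": "You speak like a pirate."},
    {"role": "user", "content": "How are you?"},
]
print(fold_system_prompt(chat))
```

The returned list can then be passed to `tokenizer.apply_chat_template` in the same way as the hand-built `messages` above.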


## Using the Messages API

Text Generation Inference (TGI) now offers a [Messages API](https://huggingface.co/blog/tgi-messages-api), making it directly compatible with the OpenAI Chat Completion API. This means that any existing scripts that use OpenAI models (via the OpenAI client library or third-party tools like LangChain or LlamaIndex) can be directly swapped out to use any open LLM running on a TGI endpoint. Let's see how.

_Note: We have [a full example in the Hugging Face Cookbook](https://huggingface.co/learn/cookbook/en/tgi_messages_api_demo) to explain this in more detail._


In [9]:
from openai import OpenAI

# endpoint_url = "<your-endpoint-url>" # if you are using a dedicated Inference Endpoint
model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

client = OpenAI(
    base_url=f"https://api-inference.huggingface.co/models/{model_id}/v1",  # for the serverless Inference API
    # base_url=f"{endpoint_url}/v1/",  # for dedicated Inference Endpoints
    api_key=get_token(),
)


system_input = (
    "You are a helpful, respectful and honest assistant that speaks like a pirate."
)
user_input = "How many helicopters can a human eat in one sitting?"
messages = [
    {"role": "user", "content": system_input + " " + user_input},
]

chat_completion = client.chat.completions.create(
    model=model_id,
    messages=messages,
    stream=True,
    max_tokens=500,
)

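for message in chat_completion:
    choice = message.choices[0]
    # Skip the final chunk (marked with finish_reason "eos_token") and any empty deltas
    if choice.finish_reason != "eos_token" and choice.delta.content:
        print(choice.delta.content, end="")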

 Arr matey, I be a pirate assistant, not a swabbin' nutritionist! However, I can tell ye that it be physically impossible for a human to consume a helicopter, as they be made of steel, glass, and other materials not fit for consumption. So, ye can rest yer doubts, as no amount of helicopters be edible to a human in one sitting or any number of sittings! Arrr!
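
The streaming loop above prints each piece as it arrives. If you would rather collect the full reply as a single string, a sketch along these lines may help. It runs offline: `fake_chunk` is a hypothetical stand-in that only mimics the `choices[0].delta.content` and `choices[0].finish_reason` attributes used in this notebook, and `collect_stream` is an illustrative helper, not part of the `openai` library.

```python
from types import SimpleNamespace


def collect_stream(chunks, stop_reason="eos_token"):
    """Join streamed delta contents into one string, skipping the chunk
    that carries the stop reason as well as any empty deltas."""
    parts = []
    for chunk in chunks:
        choice = chunk.choices[0]
        if choice.finish_reason == stop_reason:
            continue
        if choice.delta.content:
            parts.append(choice.delta.content)
    return "".join(parts)


def fake_chunk(content, finish_reason=None):
    # Hypothetical stand-in for a streamed chunk object (assumed layout).
    delta = SimpleNamespace(content=content)
    choice = SimpleNamespace(delta=delta, finish_reason=finish_reason)
    return SimpleNamespace(choices=[choice])


stream = [
    fake_chunk(" Arr"),
    fake_chunk(" matey"),
    fake_chunk(None),
    fake_chunk("</s>", finish_reason="eos_token"),
]
print(collect_stream(stream))
```

With a live endpoint, the `chat_completion` iterator from the cell above would take the place of the `stream` list.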

**Notes**

- When deploying an Inference Endpoint (IE) with TGI, you must use `task: Text Generation` and call `client.text_generation`. This means you must handle chat template formatting on your own.
- When deploying an IE with TGI using `task: Conversational`, you cannot use the `client.conversational` method. You'll get an error: `Make sure 'conversational' task is supported by the model.` In other words, TGI only supports text generation.
