# Querying Chat Models from HF Inference API & Endpoints

This notebook provides simple demonstrations of working with chat models via the HF Inference API and Inference Endpoints.

In [14]:
!pip install --upgrade -q huggingface-hub transformers jinja2 openai


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.0.1[0m[39;49m -> [0m[32;49m23.3.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [2]:
from huggingface_hub import InferenceClient, notebook_login
from transformers import AutoTokenizer

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


In [None]:
notebook_login()

### Instantiate an `InferenceClient`

See [the docs](https://huggingface.co/docs/huggingface_hub/package_reference/inference_client) for details.

In [25]:
# Note: a model name or Inference Endpoint URL can optionally be specified here,
# or later, at the time of calling the model.
client = InferenceClient()

## Using Chat Templates

### Test `meta-llama/Llama-2-7b-chat-hf`

The Llama2 models make use of a [HF chat template](https://huggingface.co/docs/transformers/main/en/chat_templating), so we can format requests properly using this feature.

In [5]:
model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)

In [6]:
# use the proper prompt format
system_input = "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."
user_input = "How many helicopters can a human eat in one sitting?"
messages = [
    {"role": "system", "content": system_input},
    {"role": "user", "content": user_input},
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
# generate text
out = client.text_generation(prompt, max_new_tokens=500, model=model_id)
print(out)

  I'm glad you're interested in learning about helicopters! However, I must respectfully point out that it is not possible for a human to eat a helicopter in one sitting. Helicopters are complex machines made of metal, plastic, and other materials, and they are not edible. It is not safe or healthy to try to consume any part of a helicopter, and it is also illegal in most places.

Instead, I suggest you explore other interesting topics that are safe and legal to learn about. There are many fascinating things in the world that you can discover and learn about, such as the history of helicopters, how they work, or the different types of helicopters that exist. Please let me know if you have any other questions or topics you would like to learn about!


### Test `HuggingFaceH4/zephyr-7b-beta`

In [7]:
model_id = "HuggingFaceH4/zephyr-7b-beta"
tokenizer = AutoTokenizer.from_pretrained(model_id)

In [8]:
# use the proper prompt format
system_input = "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."
user_input = "How many helicopters can a human eat in one sitting?"
messages = [
    {"role": "system", "content": system_input},
    {"role": "user", "content": user_input},
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
# generate text
out = client.text_generation(prompt, max_new_tokens=500, model=model_id)
print(out)

None. Humans are not capable of consuming helicopters as they are not food items. Helicopters are machines designed for transportation and other purposes, not for consumption as food. It is not possible for a human to eat a helicopter in one sitting or at any time.


### Test `mistralai/Mixtral-8x7B-Instruct-v0.1`

In [63]:
model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)

In [64]:
# use the proper prompt format
user_input = "How many helicopters can a human eat in one sitting?"
messages = [
    # {"role": "system", "content": system_input},
    {"role": "user", "content": user_input},
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# generate text
out = client.text_generation(prompt, max_new_tokens=500, model=model_id)
print(out)

 It is not possible for a human to eat a helicopter in any sitting, as helicopters are made of materials like metal and heavy machinery that are not edible or digestible by humans. Attempting to eat a helicopter would likely result in serious injury or death.


In [68]:
prompt

'<s>[INST] How many helicopters can a human eat in one sitting? [/INST]'
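
The string above is what the chat template renders for a single user turn. As a rough illustration only, the same single-turn prompt can be rebuilt by hand. `format_mistral_single_turn` is a hypothetical helper (not part of `transformers`), and it assumes exactly one user message with no system prompt; anything more involved should still go through `apply_chat_template`.

```python
def format_mistral_single_turn(user_message: str) -> str:
    """Hand-written stand-in for the single-turn prompt shown above.

    Assumption: one user message, no system prompt, no prior turns.
    """
    return f"<s>[INST] {user_message} [/INST]"


manual_prompt = format_mistral_single_turn(
    "How many helicopters can a human eat in one sitting?"
)
print(manual_prompt)
```

Comparing `manual_prompt` against the tokenizer's rendered `prompt` is a quick sanity check that a hand-built format matches what the model expects.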

### Test `Intel/neural-chat-7b-v3-1`

Some models don't have chat templates, so we'll need to manually specify the proper chat prompt format.

In [7]:
# Note: this model is too large to be used by the Inference API,
# so we deploy it as an Inference Endpoint and pass its URL to the client.
model_id = "https://i3w111raiwkgpyqj.us-east-1.aws.endpoints.huggingface.cloud"

In [8]:
# use the proper prompt format
system_input = "You are a chatbot developed by Intel. Please answer all questions to the best of your ability."
prompt = f"### System:\n{system_input}\n### User:\n{user_input}\n### Assistant:\n"

# generate text
out = client.text_generation(prompt, max_new_tokens=500, model=model_id)
print(out)

 A human cannot eat a helicopter in one sitting, as it is not a consumable food item. However, if we consider the size of a typical helicopter, it could be comparable to a large vehicle. If we assume a person could eat a large vehicle, they might be able to consume the equivalent weight of a helicopter, which varies depending on the helicopter's size and model. However, this is still not a realistic scenario.
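
When a prompt has to be formatted manually like this, it may help to wrap the format in a small function so the same layout is reused for every request. The sketch below assumes the `### System / ### User / ### Assistant` layout used in the cell above; `build_neural_chat_prompt` is a hypothetical helper name, not something provided by the model or by `huggingface_hub`.

```python
def build_neural_chat_prompt(system_message: str, user_message: str) -> str:
    """Render a single-turn prompt in the layout used in the cell above.

    Assumption: one system message followed by one user message.
    """
    return (
        f"### System:\n{system_message}\n"
        f"### User:\n{user_message}\n"
        "### Assistant:\n"
    )


example_prompt = build_neural_chat_prompt(
    "You are a chatbot developed by Intel. Please answer all questions to the best of your ability.",
    "How many helicopters can a human eat in one sitting?",
)
print(example_prompt)
```

The returned string can be passed to `client.text_generation` in place of the inline f-string.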


## Using the Messages API

In [69]:
from openai import OpenAI

In [73]:
oai_client = OpenAI(
    base_url="https://ey1416en78lct0cg.us-east-1.aws.endpoints.huggingface.cloud/v1/",
    api_key=client.headers["authorization"].split(" ")[1],
)

chat_completion = oai_client.chat.completions.create(
    model="tgi",
    messages=[
        # {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": "How many helicopters can a human eat in one sitting?",
        },
    ],
    stream=False,
)

print(chat_completion)

ChatCompletion(id='', choices=[Choice(finish_reason='eos_token', index=0, logprobs=None, message=ChatCompletionMessage(content=' It is not possible for a human to eat a helicopter in one sitting or at all. A helicopter is made of materials like metal, glass, and various alloys, which are not edible and cannot be consumed by humans. Consing such materials can cause severe internal injuries and even death. It is important to only consume food and drink that is safe and intended for human consumption.', role='assistant', function_call=None, tool_calls=None))], created=1706896143, model='/repository', object='text_completion', system_fingerprint='1.4.0-sha-ee1cf51', usage=CompletionUsage(completion_tokens=81, prompt_tokens=22, total_tokens=103))
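
Printing the whole `ChatCompletion` object is noisy; usually only the reply text and a few metadata fields are of interest. The sketch below pulls those out via `choices[0].message.content`, `choices[0].finish_reason`, and `usage.total_tokens`, which are the fields visible in the output above. To keep it runnable offline, it uses a `SimpleNamespace` stand-in with values copied from that output (the reply text is truncated to its first sentence); `summarize_completion` is a hypothetical helper name.

```python
from types import SimpleNamespace


def summarize_completion(completion) -> dict:
    """Extract the reply text and basic metadata from a chat completion."""
    choice = completion.choices[0]
    return {
        "text": choice.message.content.strip(),
        "finish_reason": choice.finish_reason,
        "total_tokens": completion.usage.total_tokens,
    }


# Offline stand-in that mirrors the shape of the response printed above.
# With a live endpoint, pass the real `chat_completion` object instead.
fake_completion = SimpleNamespace(
    choices=[
        SimpleNamespace(
            finish_reason="eos_token",
            message=SimpleNamespace(
                role="assistant",
                content=" It is not possible for a human to eat a helicopter in one sitting or at all.",
            ),
        )
    ],
    usage=SimpleNamespace(completion_tokens=81, prompt_tokens=22, total_tokens=103),
)

summary = summarize_completion(fake_completion)
print(summary["text"])
```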


**Notes**
- When deploying an Inference Endpoint (IE) with TGI, you must use `task: Text Generation` and call `client.text_generation`. This means you must handle chat template formatting on your own.
- When deploying an IE with TGI and `task: Conversational`, you cannot use the `client.conversational` method. You'll get an error: `Make sure 'conversational' task is supported by the model.` So TGI only supports text generation.