In [4]:
!pip install -q huggingface_hub python-dotenv

In [2]:
import os

from dotenv import load_dotenv

# Load variables from a local .env file into the process environment.
load_dotenv()

hf_token = os.getenv("HF_TOKEN")

### Serverless API

In the Hugging Face ecosystem, there is a convenient feature called Serverless API that lets you run inference on many models with no installation or deployment required.

In [5]:
from huggingface_hub import InferenceClient

# Pass the token explicitly; assigning None to os.environ would raise a TypeError.
client = InferenceClient("meta-llama/Llama-3.2-3B-Instruct", token=hf_token)

In [6]:
client

<InferenceClient(model='meta-llama/Llama-3.2-3B-Instruct', timeout=None)>

If we just do plain decoding (text completion), the model only stops when it predicts an EOS token or reaches `max_new_tokens`. It does not predict an EOS here, because this is a conversational (chat) model and we didn't apply the chat template it expects, so the output below is cut off mid-sentence.

In [8]:
output = client.text_generation(
    "The capital of pakistan is",
    max_new_tokens=100,
)

print(f"The capital of pakistan is {output}")

The capital of pakistan is  islamabad
The capital of Pakistan is actually Islamabad, not Karachi or Lahore. Islamabad is a planned city located in the north of the country, and it has been the capital of Pakistan since 1959. It was chosen as the capital due to its strategic location and accessibility to the country's northern regions.

Here are some interesting facts about Islamabad:

1. **Planned city**: Islamabad was designed and built as a planned city in the 1960s, with the aim of creating a


If we now add the special tokens that the Llama-3.2-3B-Instruct model expects, the behavior changes: the model answers and then produces the expected EOS token.

In [9]:
prompt="""<|begin_of_text|><|start_header_id|>user<|end_header_id|>
The capital of France is<|eot_id|><|start_header_id|>assistant<|end_header_id|>"""

# <|start_header_id|>user<|end_header_id|> opens the user's turn.
# <|eot_id|> signals the end of the user's turn, so the model can tell where the assistant's response should start.

output = client.text_generation(
    prompt,
    max_new_tokens=100,
)

print(output)



...Paris!
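
The special-token layout of the hand-written prompt above can be sketched as a small helper. The function name `build_llama3_prompt` and its `header_sep` parameter are hypothetical, introduced only for illustration. This is a simplification that mirrors the prompt in this notebook; the template the model actually ships with may differ in whitespace and may add a system message.

```python
def build_llama3_prompt(messages, header_sep="\n"):
    # Hypothetical helper: wrap chat messages in Llama 3 style special tokens.
    # The default header_sep mirrors the hand-written prompt in this notebook.
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each turn starts with a role header and ends with <|eot_id|>.
        parts.append(f"<|start_header_id|>{message['role']}<|end_header_id|>{header_sep}")
        parts.append(f"{message['content']}<|eot_id|>")
    # Open the assistant's turn so the model knows it should answer next.
    parts.append("<|start_header_id|>assistant<|end_header_id|>")
    return "".join(parts)


prompt = build_llama3_prompt(
    [{"role": "user", "content": "The capital of France is"}]
)
print(prompt)
```

Building the string from a list of messages makes it harder to forget a closing `<|eot_id|>` than writing each prompt by hand.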


Using the `chat.completions.create` method is a much more convenient and reliable way to apply chat templates, because the template is applied for you instead of being written by hand:

In [11]:
output = client.chat.completions.create(
    messages=[
        {"role": "user", "content": "The capital of France is"},
    ],
    stream=False,
    max_tokens=1024,
)

In [16]:
print(output)

print('\n')

print(output.choices)

print('\n')

print(output.choices[0])


print('\n')

print(output.choices[0].message)

ChatCompletionOutput(choices=[ChatCompletionOutputComplete(finish_reason='stop', index=0, message=ChatCompletionOutputMessage(role='assistant', content='Paris.', tool_calls=None), logprobs=None)], created=1742811095, id='', model='meta-llama/Llama-3.2-3B-Instruct', system_fingerprint='3.2.1-native', usage=ChatCompletionOutputUsage(completion_tokens=3, prompt_tokens=40, total_tokens=43), object='chat.completion')


[ChatCompletionOutputComplete(finish_reason='stop', index=0, message=ChatCompletionOutputMessage(role='assistant', content='Paris.', tool_calls=None), logprobs=None)]


ChatCompletionOutputComplete(finish_reason='stop', index=0, message=ChatCompletionOutputMessage(role='assistant', content='Paris.', tool_calls=None), logprobs=None)


ChatCompletionOutputMessage(role='assistant', content='Paris.', tool_calls=None)


In [13]:
print(output.choices[0].message.content)

Paris.
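
The fields printed above can be gathered in one place with a small helper. This is a minimal sketch: `summarize_completion` is a hypothetical name, and the `SimpleNamespace` object is only an offline stand-in that copies the attribute layout and values of the response shown above, so the cell runs without calling the API.

```python
from types import SimpleNamespace


def summarize_completion(output):
    # Hypothetical helper: collect the most useful fields of a chat completion.
    choice = output.choices[0]
    return {
        "content": choice.message.content,
        "finish_reason": choice.finish_reason,
        "total_tokens": output.usage.total_tokens,
    }


# Offline stand-in with the same attribute layout as the printed response.
fake_output = SimpleNamespace(
    choices=[
        SimpleNamespace(
            finish_reason="stop",
            index=0,
            message=SimpleNamespace(role="assistant", content="Paris."),
        )
    ],
    usage=SimpleNamespace(completion_tokens=3, prompt_tokens=40, total_tokens=43),
)

summary = summarize_completion(fake_output)
print(summary)
```

In the notebook itself, `summarize_completion(output)` would be called on the real response instead. A `finish_reason` of `'stop'`, as in the output above, suggests the model ended its turn on its own rather than running into `max_tokens`.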
