<a target="_blank" href="https://colab.research.google.com/github/acceleratescience/llms-for-pi/blob/main/intro-to-qwen.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Introduction to Hugging Face
In this notebook, we introduce the basic code required to download a model and generate some text with it. The model that we will use is Qwen2.5-0.5B-Instruct, but the process is much the same for most other text-generation models on Hugging Face.

If you want to check out other models, you can head to [Hugging Face](https://huggingface.co/models) and browse. If you sign up to Hugging Face, you can also click on most of the models and chat with them directly (the chat widget is on the right-hand side of the model page).

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from rich.pretty import pprint

The model has already been downloaded by the `setup.sh` script. If you replace the model name below with another one, that model will have to be downloaded first, which may take some time.

In [18]:
model_name = "Qwen/Qwen2.5-0.5B-Instruct"

The pipeline for Hugging Face text generation has two parts:

- The tokenizer
- The model

We use the built-in `from_pretrained` methods to load the tokenizer and the model into memory.

In [19]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

Remember that chat models expect their input to follow a particular pattern of roles:

system &rarr; user &rarr; assistant &rarr; user &rarr; assistant &rarr; ...

So we need to structure our prompts in the same way.

In [20]:
# prepare the model input
system = "You are a world-class poet."
prompt = "Give me a haiku about a Samurai cat."

messages = [
    {"role": "system", "content": system},
    {"role": "user", "content": prompt}
]
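
The same list structure extends to longer conversations. The sketch below shows how the system → user → assistant → user pattern could be built up turn by turn. The `add_turn` helper, the placeholder assistant text, and the follow-up question are all hypothetical and only there for illustration.

```python
# Illustrative only: the assistant content is a placeholder, not real model output.
system = "You are a world-class poet."

messages = [
    {"role": "system", "content": system},
    {"role": "user", "content": "Give me a haiku about a Samurai cat."},
]


def add_turn(history, role, content):
    """Return a new history list with one more message appended."""
    if role not in {"system", "user", "assistant"}:
        raise ValueError(f"unknown role: {role}")
    return history + [{"role": role, "content": content}]


# After the model replies, store its reply and then add the next user message.
messages = add_turn(messages, "assistant", "<the model's haiku goes here>")
messages = add_turn(messages, "user", "Now write one about a ninja dog.")

roles = [message["role"] for message in messages]
print(roles)
```

Keeping the full history in `messages` is what lets the model see earlier turns when it generates the next reply.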

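We then pass these into the tokenizer's chat template. Almost all chat models have a chat template that matches the format of their training data, and using it is effectively required to get the best out of the model. Here, `tokenize=False` returns the formatted text rather than token IDs, and `add_generation_prompt=True` appends the start of an assistant turn so that the model knows it should respond.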

In [21]:
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

print(text)

<|im_start|>system
You are a world-class poet.<|im_end|>
<|im_start|>user
Give me a haiku about a Samurai cat.<|im_end|>
<|im_start|>assistant
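
To make the printed format less mysterious, here is a minimal hand-written sketch that reproduces the same layout for simple messages. The `render_chatml` function is hypothetical and written only for this illustration; the tokenizer's real chat template is defined by the model authors and may handle more cases than this sketch does.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Format messages in the <|im_start|>role ... <|im_end|> layout shown above."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a world-class poet."},
    {"role": "user", "content": "Give me a haiku about a Samurai cat."},
]

text = render_chatml(messages)
print(text)
```

In practice, prefer `tokenizer.apply_chat_template`, since it always matches the model that was loaded.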



But now we need to tokenize (break up) the text and convert it to numbers. If you remember from the slides, we can think of this as breaking the text into pieces called tokens, which are often whole words or parts of words, and then assigning each token a number.
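
As a toy illustration of that idea, the sketch below splits a string into pieces using a tiny invented vocabulary and looks up each piece's number. The vocabulary, the `toy_tokenize` function, and the greedy longest-match rule are all assumptions made for this example; the real tokenizer uses a much larger vocabulary learned from data and a different splitting algorithm.

```python
# Invented vocabulary for illustration; the IDs here are arbitrary.
toy_vocab = {
    "sam": 0, "urai": 1, " cat": 2,
    "s": 3, "a": 4, "m": 5, "u": 6, "r": 7, "i": 8, " ": 9, "c": 10, "t": 11,
}


def toy_tokenize(text, vocab):
    """Split text greedily, always taking the longest piece found in vocab."""
    pieces = []
    position = 0
    while position < len(text):
        # Try the longest candidate first, then shrink until one matches.
        for end in range(len(text), position, -1):
            candidate = text[position:end]
            if candidate in vocab:
                pieces.append(candidate)
                position = end
                break
        else:
            raise ValueError(f"no token covers {text[position]!r}")
    return pieces


pieces = toy_tokenize("samurai cat", toy_vocab)
ids = [toy_vocab[piece] for piece in pieces]
print(pieces)
print(ids)
```

Note that "samurai" is split into two pieces here, which mirrors how real tokenizers often break less common words into several tokens.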

In [22]:
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

In [23]:
pprint(model_inputs)

We have two parts here:

`input_ids` : the numerical form of the tokenized text.

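`attention_mask` : marks which positions hold real tokens (1) and which hold padding (0). This matters when you stack multiple inputs of different lengths together; with a single input, it is all ones.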
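
The role of the attention mask is easiest to see with two inputs of different lengths. The sketch below uses plain Python lists and a hypothetical `pad_batch` helper with made-up token IDs; it is not the tokenizer's implementation, only an illustration of the idea.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad sequences to equal length and build the matching mask."""
    longest = max(len(sequence) for sequence in sequences)
    input_ids = []
    attention_mask = []
    for sequence in sequences:
        padding = longest - len(sequence)
        input_ids.append(list(sequence) + [pad_id] * padding)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(sequence) + [0] * padding)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token IDs for two prompts of different lengths.
batch = pad_batch([[11, 12, 13, 14], [21, 22]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

This sketch pads on the right for simplicity. When generating text from a batch, padding is often placed on the left instead, so that every prompt ends where generation begins.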

Now we are ready to generate some new text.

In [None]:
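# Generate up to 128 new tokens after the prompt
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=128
)
# The output contains the prompt tokens followed by the new tokens,
# so slice off the prompt to keep only the response
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]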

Just as the model only understands numbers, it also only outputs numbers. We therefore need to decode the output, using our lookup table to map numbers back to text.

In [29]:
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

print(response)

Squinting eyes,
Silent claws, keen sense—
Samurai's loyal friend.
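
The two post-processing steps used above, trimming the prompt tokens and decoding the remainder, can be illustrated with plain lists. Everything in this sketch is made up for the example: the IDs, the four-entry lookup table, and the `toy_decode` function, which only mimics what `skip_special_tokens=True` does.

```python
# Invented lookup table from IDs to tokens.
id_to_token = {0: "<prompt>", 1: "Silent", 2: " claws", 3: "<eos>"}
special_tokens = {"<prompt>", "<eos>"}

# Pretend the prompt was three tokens long and the model added three more.
prompt_ids = [[0, 0, 0]]
output_ids = [[0, 0, 0, 1, 2, 3]]

# Keep only the tokens that come after the prompt, one row per input.
new_ids = [
    output[len(prompt):] for prompt, output in zip(prompt_ids, output_ids)
]


def toy_decode(ids, skip_special_tokens=True):
    """Map IDs back to text, optionally dropping special marker tokens."""
    tokens = [id_to_token[token_id] for token_id in ids]
    if skip_special_tokens:
        tokens = [token for token in tokens if token not in special_tokens]
    return "".join(tokens)


response = toy_decode(new_ids[0])
print(response)
```

Without the trimming step, the decoded text would start with the prompt itself rather than with the model's reply.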


This process is pretty slow. Fortunately, there are some very efficient inference engines available to us. The one that we will try now is called Ollama...