# Lab 1: Hugging Face Transformers — Encoder vs Decoder

Goal: Learn to use the `transformers` library with both encoder-only models (e.g., BERT) and decoder-only models (e.g., Llama 1B).

In this lab we will cover:
- Overview of the library and model types
- Import and use of BERT, focusing on `cache_dir` and loading/inference parameters
- Use of a Llama 1B model, focusing on special tokens, tokenization, and chat template
- Two hands-on exercises: embedding similarity and how to construct effective prompts

## Step 0 - What is Hugging Face

<style>
img[src="images/image4.png"] {
    width: 40%;
    height: auto;
}
</style>
![alt text](images/image4.png)


## Lab overview: encoder-only vs decoder-only

- Encoder-only models (e.g., BERT): produce text representations (embeddings). Ideal for classification, semantic search, and similarity.
- Decoder-only models (e.g., Llama): generate text autoregressively and are suited for dialogue, completion, and extraction via prompting.
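
One way to picture the difference is through the attention mask: in an encoder every token can attend to every other token, while in a decoder each token can attend only to itself and the tokens before it, which is what makes left-to-right generation possible. The sketch below builds both masks with NumPy for a toy sequence of 4 tokens. The helper names are illustrative, and real models also combine these masks with padding masks.

```python
import numpy as np

def bidirectional_mask(seq_len: int) -> np.ndarray:
    """Encoder-style mask: every token may attend to every other token."""
    return np.ones((seq_len, seq_len), dtype=bool)

def causal_mask(seq_len: int) -> np.ndarray:
    """Decoder-style mask: token i may attend only to tokens 0..i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

enc = bidirectional_mask(4)
dec = causal_mask(4)
print(enc.astype(int))  # all ones
print(dec.astype(int))  # lower-triangular
```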

The `transformers` library offers uniform interfaces for tokenizers and models, supports caching with `cache_dir`, and includes utilities such as a chat template for conversation-trained models.

In [1]:
# Setup: import and config
import os, json
import numpy as np
import torch

from transformers import AutoTokenizer, AutoModel, AutoModelForCausalLM

print('Transformers version:', __import__('transformers').__version__)

# Main paths (local cache for models)
BASE_DIR = os.path.join('lab1')
DATA_DIR = os.path.join(BASE_DIR, 'data')
CACHE_DIR = os.path.join(BASE_DIR, 'models_cache')
os.makedirs(CACHE_DIR, exist_ok=True)




Transformers version: 4.56.0


We will use `bert-base-uncased` to generate sentence embeddings. Focus:
- `cache_dir`: local directory to save weights/tokenizer
- Loading parameters: dtype, device map, etc. (here, CPU-only)
- Inference: obtain embeddings from `last_hidden_state` and compute cosine similarity


In [3]:
# Load the BERT tokenizer and model
bert_model_id = 'bert-base-uncased'
tokenizer_bert = AutoTokenizer.from_pretrained(bert_model_id, cache_dir=CACHE_DIR)
model_bert = AutoModel.from_pretrained(bert_model_id, cache_dir=CACHE_DIR)

print('Model structure:\n', model_bert, "\n\n")
print('The model has been loaded in memory with this data type:', next(model_bert.parameters()).dtype)
print('The model has been loaded on this device:', next(model_bert.parameters()).device)

Model structure:
 BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=

Example of how to embed a sequence using BERT:

In [11]:
# Example of how to embed a sequence using BERT
text = "Hello, how are you?"
inputs = tokenizer_bert(text, return_tensors='pt')

print('input: ', inputs)
outputs = model_bert(**inputs)
embeddings = outputs.last_hidden_state
print('\nEmbeddings shape:', embeddings.shape, '\n')
print('Embeddings:', embeddings)

input:  {'input_ids': tensor([[ 101, 7592, 1010, 2129, 2024, 2017, 1029,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}

Embeddings shape: torch.Size([1, 8, 768]) 

Embeddings: tensor([[[-0.0824,  0.0667, -0.2880,  ..., -0.3566,  0.1960,  0.5381],
         [ 0.0310, -0.1448,  0.0952,  ..., -0.1560,  1.0151,  0.0947],
         [-0.8935,  0.3240,  0.4184,  ..., -0.5498,  0.2853,  0.1149],
         ...,
         [-0.2812, -0.8531,  0.6912,  ..., -0.5051,  0.4716, -0.6854],
         [-0.4429, -0.7820, -0.8055,  ...,  0.1949,  0.1081,  0.0130],
         [ 0.5570, -0.1080, -0.2412,  ...,  0.2817, -0.3996, -0.1882]]],
       grad_fn=<NativeLayerNormBackward0>)


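The same can be done by passing a list of strings as input. The two strings here are identical, so they tokenize to the same length; for sentences of different lengths, pass `padding=True` to the tokenizer so they can be stacked into one tensor.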

In [20]:
inputs_of_a_list = tokenizer_bert([text]*2, return_tensors='pt')
outputs_of_a_list = model_bert(**inputs_of_a_list)
embeddings_of_a_list = outputs_of_a_list.last_hidden_state
print('Embeddings of a list shape:', embeddings_of_a_list.shape, '\n')

Embeddings of a list shape: torch.Size([2, 8, 768]) 
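
Note that `last_hidden_state` holds one vector per token (shape `[batch, seq_len, hidden]`), whereas comparing sentences requires one vector per sentence. A common choice, assumed here rather than prescribed by this lab, is mean pooling: average the token vectors while ignoring padding positions through the attention mask. The sketch below uses small dummy arrays in place of real model output, and `mean_pool` is an illustrative helper name.

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors, ignoring padding positions (mask == 0).

    token_embeddings: [batch, seq_len, hidden]
    attention_mask:   [batch, seq_len], 1 for real tokens and 0 for padding
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # [batch, seq_len, 1]
    summed = (token_embeddings * mask).sum(axis=1)                   # [batch, hidden]
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                   # avoid division by zero
    return summed / counts

# Dummy data standing in for model output: 2 sentences, 3 token slots, hidden size 2.
toy_embeddings = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # last slot is padding
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
])
toy_mask = np.array([[1, 1, 0], [1, 1, 1]])

sentence_vectors = mean_pool(toy_embeddings, toy_mask)
print(sentence_vectors.shape)
print(sentence_vectors)
```

With real outputs, the inputs would be `outputs.last_hidden_state.detach().numpy()` and `inputs['attention_mask'].numpy()`. Taking the vector of the first (`[CLS]`) token is another option.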



Text embeddings are commonly used to measure the similarity between text sequences: they are vector representations of the text itself.

The similarity between two sequences can then be computed as the cosine similarity of their embeddings:
<style>
img[src="images/image.png"],
img[src="images/image2.png"] {
    width: 20%;
    height: auto;
}
</style>

![alt text](images/image.png) 
![alt text](images/image2.png)
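
As a minimal sketch, assuming the standard definition cos(a, b) = (a · b) / (‖a‖ ‖b‖), cosine similarity can be computed with NumPy as follows. The function names and toy vectors are illustrative; in practice the inputs would be sentence embeddings.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """cos(a, b) = (a . b) / (||a|| * ||b||), a value in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def pairwise_cosine(vectors: np.ndarray) -> np.ndarray:
    """Cosine similarity between every pair of rows of an [n, d] matrix."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return unit @ unit.T

toy_vectors = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
print(cosine_similarity(toy_vectors[0], toy_vectors[1]))  # same direction
print(cosine_similarity(toy_vectors[0], toy_vectors[2]))  # orthogonal vectors
print(pairwise_cosine(toy_vectors))
```

Because the vectors are normalized, only their direction matters: scaling an embedding does not change its cosine similarity to others.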


## Llama 1B: special tokens, tokenization, chat template, prompting


Preliminary note: some models on Hugging Face require you to accept their terms of use before you can download them. This is not required for this lab, but you are strongly encouraged to sign up for a Hugging Face account.

<style>
img[src="images/image3.png"] {
    width: 30%;
    height: auto;
}
</style>


![alt text](images/image3.png)

When working inside a script or notebook, you can authenticate with your access token (more info [here](https://huggingface.co/docs/hub/security-tokens)) as follows:

```python
from huggingface_hub import login
login(YOUR_ACCESS_TOKEN)
```


In [15]:
# Setup: import and config
import os, json
import numpy as np
import torch

from transformers import AutoTokenizer, AutoModel, AutoModelForCausalLM

print('Transformers version:', __import__('transformers').__version__)

# Main paths (local cache for models)
BASE_DIR = os.path.join('../../lab1')
DATA_DIR = os.path.join(BASE_DIR, 'data')
CACHE_DIR = os.path.join(BASE_DIR, 'models_cache')
os.makedirs(CACHE_DIR, exist_ok=True)

os.environ['CUDA_VISIBLE_DEVICES'] = '6'

Transformers version: 4.52.3


In [2]:
# Alternative: 'unsloth/gemma-3-1B-it', used to avoid having to accept the terms of service for the Gemma model
model_path = 'google/gemma-3-270m-it'
device = 'cuda'
tokenizer = AutoTokenizer.from_pretrained(model_path, cache_dir=CACHE_DIR)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    cache_dir=CACHE_DIR,        # local cache directory defined in the setup cell
    torch_dtype=torch.float16,  # load the weights in half precision
    device_map=device,
)

print('Gemma model:', model, '\n\n')
print('Special tokens:', tokenizer.special_tokens_map, '\n\n')

Gemma model: Gemma3ForCausalLM(
  (model): Gemma3TextModel(
    (embed_tokens): Gemma3TextScaledWordEmbedding(262144, 640, padding_idx=0)
    (layers): ModuleList(
      (0-17): 18 x Gemma3DecoderLayer(
        (self_attn): Gemma3Attention(
          (q_proj): Linear(in_features=640, out_features=1024, bias=False)
          (k_proj): Linear(in_features=640, out_features=256, bias=False)
          (v_proj): Linear(in_features=640, out_features=256, bias=False)
          (o_proj): Linear(in_features=1024, out_features=640, bias=False)
          (q_norm): Gemma3RMSNorm((256,), eps=1e-06)
          (k_norm): Gemma3RMSNorm((256,), eps=1e-06)
        )
        (mlp): Gemma3MLP(
          (gate_proj): Linear(in_features=640, out_features=2048, bias=False)
          (up_proj): Linear(in_features=640, out_features=2048, bias=False)
          (down_proj): Linear(in_features=2048, out_features=640, bias=False)
          (act_fn): PytorchGELUTanh()
        )
        (input_layernorm): Gemma3RMSNor

In [3]:
# Basic inference

prompt = 'What are the best things to do in Padova?'

# first, you need to tokenize the prompt 
tokenized_prompt = tokenizer(
    prompt, 
    return_tensors='pt',
    add_special_tokens=False
).to(device)
print('Tokenized prompt:', tokenized_prompt, '\n\n')

# then, you can generate a response from the model
generated_ids = model.generate(
    **tokenized_prompt,
    max_new_tokens=10,
    temperature=0.001
)
print('Generated token ids:', generated_ids, '\n\n')

# now we need to use the tokenizer to map back the generated token ids to text
generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=False)
print('Generated text:', generated_text, '\n\n')

Setting `pad_token_id` to `eos_token_id`:1 for open-end generation.


Tokenized prompt: {'input_ids': tensor([[  3689,    659,    506,   1791,   2432,    531,    776,    528, 194486,
         236881]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')} 


Generated token ids: tensor([[  3689,    659,    506,   1791,   2432,    531,    776,    528, 194486,
         236881,  21950,  63582,  65785, 125378,  18628, 202891, 202891, 202891,
         202891, 202891]], device='cuda:0') 


Generated text: ['What are the best things to do in Padova?？” moyens”?الجبالключенияключенияключенияключенияключения'] 




This does not work properly because models are trained with specific special tokens that must be inserted in the right positions.

In [None]:
prompt = 'What are the best things to do in Padova?'

# let's turn add_special_tokens to True
tokenized_prompt = tokenizer(
    prompt, 
    return_tensors='pt',
    add_special_tokens=True,
).to(device)

# then, you can generate a response from the model
generated_ids = model.generate(
    **tokenized_prompt,
    max_new_tokens=10,
    temperature=0.001
)
print('Generated token ids:', generated_ids, '\n\n')

# now we need to use the tokenizer to map back the generated token ids to text
generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=False)
print('Generated text:', generated_text, '\n\n')

Moreover, instruction-tuned models have been trained on conversations with a specific chat-like structure, so the prompt should follow that structure as well:

In [None]:
messages = [
    {"role": "user", "content": "What are the best things to do in Padova?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
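
To make the role of the chat template more concrete, the sketch below renders a message list into a single string with turn markers, in plain Python. The marker strings approximate a Gemma-style layout and are an assumption for illustration only. `render_chat` is a hypothetical helper, and the authoritative template is the one shipped with the tokenizer.

```python
def render_chat(messages: list[dict], add_generation_prompt: bool = True) -> str:
    """Hypothetical helper: approximate a Gemma-style chat layout as plain text.

    The role markers below are an assumption for illustration; the real
    template is defined by the tokenizer and may differ in details.
    """
    role_names = {"user": "user", "assistant": "model"}
    parts = ["<bos>"]
    for message in messages:
        role = role_names[message["role"]]
        parts.append(f"<start_of_turn>{role}\n{message['content']}<end_of_turn>\n")
    if add_generation_prompt:
        # Open the model's turn so that generation continues as the assistant.
        parts.append("<start_of_turn>model\n")
    return "".join(parts)

demo_messages = [{"role": "user", "content": "What are the best things to do in Padova?"}]
rendered = render_chat(demo_messages)
print(rendered)
```

To inspect the string the real template produces, call `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` and print the result.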

### Useful generation parameters
- `max_new_tokens`: maximum number of tokens to generate
- `temperature`: values close to 0 give more deterministic output (it only takes effect when sampling, i.e. with `do_sample=True`; use `do_sample=False` for greedy decoding)
- `top_p`/`top_k`: nucleus (top-p) and top-k sampling
- `repetition_penalty`: penalizes repetitions

Note: decoder-only models run more smoothly on a GPU. On CPU, limit `max_new_tokens` and use low temperatures.
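
To make the sampling parameters concrete, the following is a simplified NumPy sketch of how temperature, top-k, and top-p reshape the next-token distribution. It is not the library's implementation, which differs in details. The function names and the toy logits are made up for illustration.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Turn raw scores into probabilities that sum to 1."""
    shifted = logits - logits.max()  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def next_token_distribution(logits, temperature=1.0, top_k=None, top_p=None):
    """Next-token probabilities after temperature scaling and top-k / top-p filtering."""
    probs = softmax(np.asarray(logits, dtype=float) / temperature)
    order = np.argsort(probs)[::-1]  # token ids sorted from most to least likely
    keep = np.ones(len(probs), dtype=bool)
    if top_k is not None:
        keep[order[top_k:]] = False  # keep only the k most likely tokens
    if top_p is not None:
        cumulative = np.cumsum(probs[order])
        # Keep the smallest set of top tokens whose total probability reaches top_p.
        cutoff = int(np.searchsorted(cumulative, top_p)) + 1
        keep[order[cutoff:]] = False
    filtered = np.where(keep, probs, 0.0)
    return filtered / filtered.sum()

toy_logits = [2.0, 1.0, 0.0, -1.0]  # made-up scores for a 4-token vocabulary

print(next_token_distribution(toy_logits))                   # plain softmax
print(next_token_distribution(toy_logits, temperature=0.1))  # sharply peaked
print(next_token_distribution(toy_logits, top_k=2))          # two candidates left
print(next_token_distribution(toy_logits, top_p=0.8))        # nucleus of the top tokens

# Sampling draws a token id from the resulting distribution.
rng = np.random.default_rng(0)
sampled_id = rng.choice(len(toy_logits), p=next_token_distribution(toy_logits, top_k=2))
print(sampled_id)
```

Lowering the temperature concentrates probability on the most likely token, while top-k and top-p remove the unlikely tail before sampling.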

### Exercise 1
You are given a list of sentences in `lab1/data/sentences.txt`. You need to:
- find the pair of sentences with the highest similarity
- find the sentence that has the highest overall similarity to the others. The overall similarity of a string *_s_* can be calculated as the sum of the similarities between *_s_* and all the other strings.

**Hints**
- `np.linalg.norm()` computes the norm of a vector

### Exercise 2

You are given 3 sentences in `lab1/data/target_text.txt`. You need to construct an effective prompt to extract the mentions of people from the text. You can use the examples provided in `lab1/data/few_shot_examples.json`. You are expected to try different prompt setups, including elaborate system prompts and few-shot prompting.

You need to perform inference in each of the following settings:
- just providing the target text and a brief description of the task
- providing a proper system prompt
- providing an example on how the task needs to be performed in the user prompt
- providing an example of how the task needs to be performed as multi-turn conversation
- providing multiple examples of how the task needs to be performed as multi-turn conversation
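
As a starting point, the setups above differ mainly in how the message list is assembled before it is passed to `apply_chat_template`. The sketch below shows one possible structure. `build_messages` is a hypothetical helper, the field names `text` and `people` are assumptions (the real `few_shot_examples.json` may be organized differently), and whether a `system` role is accepted depends on the model's chat template.

```python
import json

def build_messages(task_description, target_text, system_prompt=None, examples=()):
    """Hypothetical helper: assemble a chat message list for one prompt setup.

    Each few-shot example becomes a user turn (task + example text) followed by
    an assistant turn (the expected answer), i.e. a multi-turn conversation.
    """
    messages = []
    if system_prompt is not None:
        messages.append({"role": "system", "content": system_prompt})
    for example in examples:
        messages.append({"role": "user", "content": f"{task_description}\n\nText: {example['text']}"})
        messages.append({"role": "assistant", "content": json.dumps(example["people"])})
    messages.append({"role": "user", "content": f"{task_description}\n\nText: {target_text}"})
    return messages

# Made-up example; the real few_shot_examples.json may use different field names.
toy_examples = [
    {"text": "Ada Lovelace wrote to Charles Babbage.", "people": ["Ada Lovelace", "Charles Babbage"]},
]

few_shot_messages = build_messages(
    task_description="List the people mentioned in the text as a JSON array.",
    target_text="Galileo Galilei taught in Padova.",
    system_prompt="You are a careful information extraction assistant.",
    examples=toy_examples,
)
for message in few_shot_messages:
    print(message["role"], "->", message["content"])
```

Calling the helper without `system_prompt` and `examples` gives the simplest setup (task description plus target text), so the same function can be reused to compare the different configurations.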