## Log in to Hugging Face

Llama models are gated: Meta has to grant access first, and once it is granted we authenticate with our Hugging Face account so the weights can be downloaded.

Model requirements

* Llama 3.1: https://llamaimodel.com/requirements/
* Llama 3.2: https://llamaimodel.com/requirements-3-2/

In [36]:
import os
import torch

from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import pipeline
from huggingface_hub import login

In [35]:
# Check if a GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the Hugging Face token from the environment variable
hf_token = os.getenv("HUGGINGFACE_TOKEN")

# Log in to Hugging Face
if hf_token:
    login(hf_token)
else:
    print("Hugging Face token not found. Make sure it's set in the .zshrc file.")

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /Users/fernando/.cache/huggingface/token
Login successful
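
As a hedged sketch, the token lookup can be made a little more forgiving by checking more than one environment variable. The helper name `resolve_hf_token` is hypothetical; `HUGGINGFACE_TOKEN` is the variable used in this notebook, and `HF_TOKEN` is the variable that `huggingface_hub` reads on its own. The snippet only resolves the token and does not contact the Hub.

```python
import os
from typing import Mapping, Optional


def resolve_hf_token(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the first non-empty token found, or None (hypothetical helper)."""
    for name in ("HUGGINGFACE_TOKEN", "HF_TOKEN"):
        value = env.get(name, "").strip()
        if value:
            return value
    return None


token = resolve_hf_token()
print("token found" if token else "token not found")
```

The resolved value can then be passed to `login(token)` exactly as in the cell above.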


## Load Llama model

In [2]:
model_id = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

tokenizer_config.json:   0%|          | 0.00/50.5k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/301 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/843 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.47G [00:00<?, ?B/s]

Error while downloading from https://cdn-lfs-us-1.hf.co/repos/46/00/46001ae0478831903029d3dfca642974a652dfe8c5a4a525a48a3c700a7a99dd/68a2e4be76fa709455a60272fba8e512c02d81c46e6c671cc9449e374fd6809a?response-content-disposition=... (signed query string truncated)

model.safetensors:  99%|#########8| 2.44G/2.47G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/185 [00:00<?, ?B/s]

## Notes on Embedding Dimension and Tokenizer's Vocabulary

The Llama 3.2 tokenizer defines several special tokens. The first few are:

```
128000: AddedToken("<|begin_of_text|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
 128001: AddedToken("<|end_of_text|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
 128002: AddedToken("<|reserved_special_token_0|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
 128003: AddedToken("<|reserved_special_token_1|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
 128004: AddedToken("<|finetune_right_pad_id|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
 128005: AddedToken("<|reserved_special_token_2|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
 128006: AddedToken("<|start_header_id|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
 128007: AddedToken("<|end_header_id|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
 128008: AddedToken("<|eom_id|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
 128009: AddedToken("<|eot_id|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
 128010: AddedToken("<|python_tag|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
```

Most of these tokens have a defined role. The remaining special tokens are reserved placeholders that appear to have been added mainly to pad the vocabulary to 128,256 entries (128,000 regular tokens plus 256 special ones), likely for numerical reasons:

```
 128011: AddedToken("<|reserved_special_token_3|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
 128012: AddedToken("<|reserved_special_token_4|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
...
```


When we inspect `tokenizer.special_tokens_map`, we see only two of them:

{'bos_token': '<|begin_of_text|>', 'eos_token': '<|end_of_text|>'}


As noted in [this Hugging Face forum discussion](https://discuss.huggingface.co/t/how-to-set-the-pad-token-for-meta-llama-llama-3-models/103418), the Llama tokenizer does not define a padding token by default. This is rarely a problem when running inference on a single prompt, but it matters as soon as sequences of different lengths are batched, as they are during fine-tuning. We recommend using one of the existing special tokens for padding, specifically `"<|finetune_right_pad_id|>"`.
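
To illustrate why a padding token matters once sequences are batched, here is a minimal pure-Python sketch of right-padding with an attention mask. The helper `pad_batch` and the example token IDs are made up for illustration; only the pad ID 128004 comes from the token listing above. In practice the tokenizer does this itself when called with `padding=True`.

```python
from typing import List, Tuple

PAD_ID = 128004  # ID of "<|finetune_right_pad_id|>" in the listing above


def pad_batch(
    batch: List[List[int]], pad_id: int = PAD_ID
) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad every sequence to the longest one and build the attention mask."""
    longest = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in batch]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in batch]
    return input_ids, attention_mask


# Illustrative token IDs, not real tokenizer output.
ids, mask = pad_batch([[128000, 3923, 374], [128000, 9906]])
print(ids)
print(mask)
```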


In [32]:
print(tokenizer.vocab_size)
print(model.get_input_embeddings())

# Set the "<|finetune_right_pad_id|>" as the padding token
tokenizer.pad_token = tokenizer.added_tokens_decoder[128004].content

# Verify that the pad_token_id is correctly set to the ID of "<|finetune_right_pad_id|>"
pad_token_id = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
print(f"Pad token ID is set to: {pad_token_id}")

128000
Embedding(128256, 2048)
Pad token ID is set to: 128004
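
The two sizes printed above differ by exactly the number of special tokens. A quick arithmetic check, using only the values from this output (the variable names are just for illustration):

```python
base_vocab_size = 128_000   # tokenizer.vocab_size, from the output above
embedding_rows = 128_256    # first dimension of Embedding(128256, 2048)

num_special_tokens = embedding_rows - base_vocab_size
print(num_special_tokens)  # 256

# IDs 128000..128255 are therefore the special/reserved range.
special_ids = range(base_vocab_size, embedding_rows)
print(special_ids[0], special_ids[-1])  # 128000 128255
print(128004 in special_ids)  # True: the pad token chosen above

# The padded size is a multiple of 256.
print(embedding_rows % 256 == 0)  # True
```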


## Testing the model

In [49]:
temperature = 0.6  # default
do_sample = True  # without sampling the output looks odd; TODO: check the chat models
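
To make the effect of `temperature` concrete, here is a minimal sketch of temperature-scaled softmax in plain Python. The logits are invented for illustration and `softmax_with_temperature` is a hypothetical helper, not part of `transformers`; it only mirrors the general idea of dividing the logits by the temperature before sampling.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
probs_default = softmax_with_temperature(logits, 1.0)
probs_sharp = softmax_with_temperature(logits, 0.6)

# A temperature below 1 concentrates probability on the top candidate.
print([round(p, 3) for p in probs_default])
print([round(p, 3) for p in probs_sharp])
```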

### Using `generate()`

In [46]:
# Prepare a prompt
prompt = "What is the color of the sky?"

# Tokenize the prompt (padding=True is a no-op for a single prompt)
inputs = tokenizer(prompt, return_tensors="pt", padding=True)

# Generate a response from the model
outputs = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_length=50,  # total length: prompt tokens plus generated tokens
    pad_token_id=pad_token_id,  # Use the custom pad token ID
    temperature=temperature,
    do_sample=do_sample,
)

# Decode the generated tokens to text
response = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(response)

What is the color of the sky? Is it blue? Is it green? Or is it a combination of both? In this article, we will explore the different shades of sky, their meanings, and how to use them in your own art
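
The output above stops mid-sentence, most likely because `max_length` counts the prompt tokens as well as the generated ones. A small sketch of that budget, with a hypothetical helper and an assumed prompt length (the real count depends on the tokenizer):

```python
def new_token_budget(prompt_len: int, max_length: int) -> int:
    """Tokens left for generation when max_length covers prompt + output."""
    return max(max_length - prompt_len, 0)


# Assumption for illustration; in practice use len(inputs["input_ids"][0]).
assumed_prompt_len = 9
print(new_token_budget(assumed_prompt_len, max_length=50))  # 41
```

`generate()` also accepts `max_new_tokens`, which bounds only the generated part and is often easier to reason about.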


### Using `pipeline()`

In [51]:
# Create text generation pipeline
pipe = pipeline(
    "text-generation", 
    model=model, 
    tokenizer=tokenizer, 
    torch_dtype=torch.bfloat16, 
    device=device
)

# Generate text with specified max length
output = pipe(
    prompt, 
    max_length=50, 
    pad_token_id=tokenizer.pad_token_id, 
    truncation=True,
    temperature=temperature,
    do_sample=do_sample,
)

print(output)

[{'generated_text': 'What is the color of the sky? The sky is blue. What is the color of the grass? The grass is green. What is the color of the ocean? The ocean is blue. What is the color of the sky? The sky'}]
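
The pipeline returns a list with one dict per generated sequence, and `generated_text` contains the prompt followed by the continuation. As a sketch, the continuation can be separated with plain string handling. The `output` literal below is copied from the result above so the snippet runs without the model, and `strip_prompt` is a hypothetical helper.

```python
prompt = "What is the color of the sky?"

# Copied from the pipeline result above so this runs without loading the model.
output = [{'generated_text': 'What is the color of the sky? The sky is blue. What is the color of the grass? The grass is green. What is the color of the ocean? The ocean is blue. What is the color of the sky? The sky'}]


def strip_prompt(generated_text: str, prompt: str) -> str:
    """Return only the continuation if the text starts with the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].lstrip()
    return generated_text


continuation = strip_prompt(output[0]["generated_text"], prompt)
print(continuation)
```

The text-generation pipeline also accepts `return_full_text=False`, which omits the prompt from `generated_text` directly.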
