# Llama 3.2 3B Instruct - Hugging Face API

This notebook shows how to use the **meta-llama/Llama-3.2-3B-Instruct** model through the Hugging Face Inference API.

## 1. Install dependencies

In [1]:
!pip install -q transformers huggingface_hub accelerate torch requests

## 2. Configure authentication

You need a Hugging Face token. Get one at: https://huggingface.co/settings/tokens

In [None]:
import os
from huggingface_hub import login

# Option 1: set the HF_TOKEN environment variable before starting the notebook,
# or uncomment the line below and replace the placeholder with your real token
# os.environ["HF_TOKEN"] = "YOUR_HF_TOKEN_HERE"

# Option 2: interactive login (prompts for the token)
# login()

# Read the token from the environment; avoid hard-coding it in public code
HF_TOKEN = os.environ.get("HF_TOKEN")

if HF_TOKEN:
    login(token=HF_TOKEN)
    print("✅ Authenticated with Hugging Face!")
else:
    print("⚠️ Set HF_TOKEN or run login() to authenticate")

If `HF_TOKEN` holds a placeholder or an otherwise invalid token, `login()` fails with:

HTTPError: Invalid user token. The token from HF_TOKEN environment variable is invalid. Note that HF_TOKEN takes precedence over `hf auth login`.
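
One way to avoid the `Invalid user token` error above is to inspect the token before calling `login()`. The sketch below is a hypothetical helper (`resolve_token` is not part of `huggingface_hub`), and it assumes that user access tokens start with `hf_`; adjust the check if your token type differs.

```python
import os

def resolve_token(env=None):
    """Return a usable-looking token from the environment, or None.

    Hypothetical helper: it only inspects the string and never contacts the Hub.
    """
    env = os.environ if env is None else env
    token = (env.get("HF_TOKEN") or "").strip()
    # Assumption: user access tokens begin with "hf_"; placeholders do not.
    if not token.startswith("hf_"):
        return None
    return token

token = resolve_token({"HF_TOKEN": "not-a-real-token"})
print(token)  # None, because the value does not look like a token
```

A `None` result can then be routed to the interactive `login()` path instead of raising.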

## 3. Use via the Inference API (simplest option)

In [None]:
from huggingface_hub import InferenceClient

client = InferenceClient(token=HF_TOKEN)

# Basic chat
response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[
        {"role": "user", "content": "Explain what Deep Learning is in 3 sentences."}
    ]
)

print(response.choices[0].message.content)
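
Calls to a hosted endpoint can fail transiently, for example when a rate limit is hit. A minimal retry sketch with exponential backoff follows; `call_with_retries` is a hypothetical helper, and catching a bare `Exception` is a simplification (in practice you would likely narrow it to the HTTP error type your client raises).

```python
import time

def call_with_retries(fn, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt seconds and retry."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            # Re-raise once the last allowed attempt has failed.
            if attempt == retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

With the `client` defined above, the chat request could be wrapped in a `lambda` and passed as `fn`, so a temporary failure is retried after 1 s and then 2 s before the error is surfaced.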

## 4. Chat with context (multiple messages)

In [None]:
messages = [
    {"role": "system", "content": "You are an assistant specialized in AI. Be concise and technical."},
    {"role": "user", "content": "What is the difference between an RNN and a Transformer?"},
]

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=messages
)

print(response.choices[0].message.content)
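
To carry context across turns, each assistant reply has to be appended to `messages` before the next user message is sent. The sketch below assumes only that the client exposes `chat.completions.create(model=..., messages=...)` as used above; `chat_turn` is a hypothetical helper, and the stub client exists only so the example runs without network access.

```python
from types import SimpleNamespace

MODEL = "meta-llama/Llama-3.2-3B-Instruct"

def chat_turn(client, messages, user_text, model=MODEL):
    """Send one user turn and record both sides of the exchange in messages."""
    messages.append({"role": "user", "content": user_text})
    response = client.chat.completions.create(model=model, messages=messages)
    reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    return reply

# Stand-in for InferenceClient (illustration only): echoes the last user message.
def _fake_create(model, messages):
    text = "echo: " + messages[-1]["content"]
    message = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(message=message)])

fake_client = SimpleNamespace(
    chat=SimpleNamespace(completions=SimpleNamespace(create=_fake_create))
)

history = [{"role": "system", "content": "Be concise."}]
chat_turn(fake_client, history, "What is an RNN?")
chat_turn(fake_client, history, "And a Transformer?")
print(len(history))  # 5: one system message plus two user/assistant pairs
```

Passing the real `client` in place of `fake_client` should behave the same way, with the model seeing the whole `history` on every call.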

## 5. Streaming (real-time response)

In [None]:
print("🤖 Llama 3.2: ", end="")
for chunk in client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "List 3 practical applications of LLMs."}],
    stream=True
):
    # Some chunks may carry no choices or an empty delta; skip those
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()

## 6. With advanced parameters

In [None]:
response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "Generate a creative title for an article about AI."}],
    temperature=0.9,
    max_tokens=50,
    top_p=0.9
)

print(response.choices[0].message.content)
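
To make the effect of `temperature` and `top_p` concrete, here is a toy sketch over a four-token vocabulary. It follows the standard definitions (softmax with temperature, then nucleus filtering); it is not the server's implementation, and the logits are made up for illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose total mass reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalize the surviving probabilities so they sum to 1.
    return {i: probs[i] / mass for i in kept}

logits = [2.0, 1.0, 0.5, -1.0]  # made-up values
sharp = softmax_with_temperature(logits, 0.5)
flat = softmax_with_temperature(logits, 1.5)
print(round(sharp[0], 3), round(flat[0], 3))
print(sorted(top_p_filter(softmax_with_temperature(logits, 1.0), 0.9)))
```

The top token gets more probability at temperature 0.5 than at 1.5, and with `top_p=0.9` the least likely token is dropped before sampling.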

## 7. Use via the REST API directly

In [None]:
import requests

API_URL = "https://api-inference.huggingface.co/models/meta-llama/Llama-3.2-3B-Instruct/v1/chat/completions"

headers = {"Authorization": f"Bearer {HF_TOKEN}"}
payload = {
    "model": "meta-llama/Llama-3.2-3B-Instruct",
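    "messages": [{"role": "user", "content": "Say hello in Portuguese."}],
}

# Fail fast on HTTP errors instead of hitting a KeyError when parsing the body
response = requests.post(API_URL, headers=headers, json=payload, timeout=60)
response.raise_for_status()
result = response.json()
print(result["choices"][0]["message"]["content"])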
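
A REST call can return an error body instead of a completion, in which case indexing `result["choices"]` raises a `KeyError`. The sketch below separates payload construction from response parsing; `build_payload` and `extract_content` are hypothetical helpers, and the assumption that failures arrive as a JSON object with an `error` key should be checked against the responses you actually receive.

```python
def build_payload(prompt, model="meta-llama/Llama-3.2-3B-Instruct", **params):
    """Build a chat-completions payload; extra keyword arguments are passed through."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    payload.update(params)
    return payload

def extract_content(result):
    """Return the assistant text, or raise with the server's message on failure."""
    # Assumption: error responses look like {"error": "..."}.
    if "error" in result:
        raise RuntimeError(f"Inference request failed: {result['error']}")
    return result["choices"][0]["message"]["content"]

payload = build_payload("Say hello in Portuguese.", max_tokens=20)
sample = {"choices": [{"message": {"role": "assistant", "content": "Olá!"}}]}
print(extract_content(sample))  # Olá!
```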

## 8. Load the model locally (requires a GPU)

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the tokenizer and model
model_name = "meta-llama/Llama-3.2-3B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name, token=HF_TOKEN)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    token=HF_TOKEN,
    torch_dtype=torch.float16,
    device_map="auto"
)

print(f"✅ Model loaded: {model_name}")

## 9. Local generation with transformers

In [None]:
# Format the prompt in the Llama 3 style (each header is followed by a blank line)
def format_prompt(user_input, system_prompt=""):
    if system_prompt:
        return f"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{user_input}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    return f"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n{user_input}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"

prompt = format_prompt("Explain transformers in one sentence.", "Be technical and concise.")

# The prompt already starts with <|begin_of_text|>, so do not let the tokenizer add a second BOS token
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    temperature=0.7,
    do_sample=True,
    top_p=0.9
)

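# Decode only the newly generated tokens, not the echoed prompt
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
response = tokenizer.decode(new_tokens, skip_special_tokens=True)
print(response)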

## 10. Simplified pipeline

In [None]:
from transformers import pipeline

# Create the text-generation pipeline
pipe = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.2-3B-Instruct",
    token=HF_TOKEN,
    torch_dtype=torch.float16,
    device_map="auto"
)

messages = [{"role": "user", "content": "What is the capital of Brazil?"}]
outputs = pipe(messages, max_new_tokens=50)
print(outputs[0]["generated_text"][-1]["content"])
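
When a text-generation pipeline is called with a list of chat messages, `generated_text` holds the conversation with the model's reply appended, which is why the cell above indexes `[-1]["content"]`. The mock below mirrors that shape so the indexing can be tried without loading the model; `last_reply` is a hypothetical helper and the reply text is invented.

```python
def last_reply(outputs):
    """Return the content of the final message in the first generated conversation."""
    conversation = outputs[0]["generated_text"]
    return conversation[-1]["content"]

# Mock of the pipeline's return value for a single chat input (illustration only).
mock_outputs = [
    {
        "generated_text": [
            {"role": "user", "content": "What is the capital of Brazil?"},
            {"role": "assistant", "content": "Brasília."},
        ]
    }
]
print(last_reply(mock_outputs))  # Brasília.
```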

## Notes

- **Inference API**: The simplest option; no GPU required, but subject to rate limits
- **Local model**: More control and no rate limits, but requires a GPU (at least 8 GB of VRAM for the 3B model)
- **Token**: You must accept the model's terms at https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct before your token can access it
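
The 8 GB figure can be sanity-checked with back-of-the-envelope arithmetic: weight memory is roughly parameter count times bytes per parameter. The sketch assumes about 3.2 billion parameters for the 3B model and counts weights only; activations and the KV cache need additional headroom.

```python
def weight_memory_gb(n_params, bytes_per_param):
    """Approximate memory for model weights, in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9

N_PARAMS = 3.2e9  # assumption: approximate size of the 3B model

for name, size in [("float32", 4), ("float16", 2), ("int8", 1)]:
    print(f"{name}: {weight_memory_gb(N_PARAMS, size):.1f} GB")
```

At `float16`, the dtype used in the loading cells, the weights alone come to about 6.4 GB, which is consistent with the 8 GB recommendation.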