## 🚀 Tech Challenge - Fine-tuning LLaMA 3.2 3B

This notebook presents a solution to the challenge of fine-tuning a foundation model based on **The AmazonTitles-1.3MM** dataset, using **LoRA** and the **LLaMA 3.2 3B** model through the **Unsloth** package.


### 📦 1. Mount Google Drive
We mount Google Drive to read the input files (the dataset) and to save the trained model.

[Download link for the trained model](https://drive.google.com/drive/folders/15oPPYUnI7rRjFz0DYaTCqIApd3EQpWTo)

In [1]:
# ===============================
# 📦 Mount the Drive
# ===============================
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### 📦 2. Install Dependencies
We install the libraries needed to work with the LLaMA model, run LoRA fine-tuning, and manipulate the data.


In [2]:
# ===============================
# 📦 Install Dependencies
# ===============================
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl==0.15.2 triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1" huggingface_hub hf_transfer
    !pip install --no-deps unsloth


Collecting bitsandbytes
  Downloading bitsandbytes-0.45.5-py3-none-manylinux_2_24_x86_64.whl.metadata (5.0 kB)
Collecting xformers==0.0.29.post3
  Downloading xformers-0.0.29.post3-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (1.0 kB)
Collecting trl==0.15.2
  Downloading trl-0.15.2-py3-none-any.whl.metadata (11 kB)
Collecting cut_cross_entropy
  Downloading cut_cross_entropy-25.1.1-py3-none-any.whl.metadata (9.3 kB)
Collecting unsloth_zoo
  Downloading unsloth_zoo-2025.5.8-py3-none-any.whl.metadata (8.0 kB)
Downloading xformers-0.0.29.post3-cp311-cp311-manylinux_2_28_x86_64.whl (43.4 MB)
Downloading trl-0.15.2-py3-none-any.whl (318 kB)
Downloading bitsandbytes-0.45.5-py3-none-manylinux_2_24_x86_64.whl (76.1 MB)

### 📁 3. Prepare the Dataset

We use the `trn.json` file (and `tst.json` for the test split), extracting the `title` (product title) and `content` (product description) fields.

For each example, we build a prompt in the following style. The prompt text itself is kept in Portuguese ("Question: What is the description of the product '...'?" followed by "Answer:"):
```
Pergunta: Qual a descrição do produto 'Nome do Produto'?
Resposta: (contents of the `content` field)
```
The result is saved in `.jsonl` format, ready for fine-tuning.


In [3]:
DATA_PATH = "/content/drive/MyDrive/TechChallenge3/"
OUTPUT_PATH_DATASET = "/content/drive/MyDrive/TechChallenge3/output/"
MODEL_PATH_DATASET = "/content/drive/MyDrive/TechChallenge3/model/"


In [4]:
with open(f"{DATA_PATH}trn.json", "r") as f:
    first_line = f.readline()
    print(first_line)

{"uid": "0000031909", "title": "Girls Ballet Tutu Neon Pink", "content": "High quality 3 layer ballet tutu. 12 inches in length", "target_ind": [12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 111], "target_rel": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]}



In [None]:
# ===============================
# 🔄 Prepare Training Data
# ===============================
import json

def prepare_dataset_robust(input_file, output_file):
    """Convert raw JSON lines into prompt/response pairs, skipping invalid lines."""
    with open(input_file, "r", encoding="utf-8") as infile, open(output_file, "w", encoding="utf-8") as outfile:
        for line_number, line in enumerate(infile, 1):
            try:
                item = json.loads(line)
            except json.JSONDecodeError:
                print(f"Line {line_number} is invalid, skipped.")
                continue
            title = item.get("title", "").strip()
            content = item.get("content", "").strip()
            # Keep only examples that have a non-empty description
            if content:
                prompt = f"Pergunta: Qual a descrição do produto '{title}'?\nResposta:"
                response = content
                json.dump({"prompt": prompt, "response": response}, outfile, ensure_ascii=False)
                outfile.write("\n")
    print(f"Dataset prepared and saved to {output_file}")

prepare_dataset_robust(f"{DATA_PATH}trn.json", f"{OUTPUT_PATH_DATASET}train_prepared.jsonl")
prepare_dataset_robust(f"{DATA_PATH}tst.json", f"{OUTPUT_PATH_DATASET}test_prepared.jsonl")

Line 83950 is invalid, skipped.
Dataset prepared and saved to /content/drive/MyDrive/TechChallenge3/output/train_prepared.jsonl
Line 660779 is invalid, skipped.
Dataset prepared and saved to /content/drive/MyDrive/TechChallenge3/output/test_prepared.jsonl


In [5]:
import json

with open(f"{OUTPUT_PATH_DATASET}train_prepared.jsonl") as f:
    for i in range(3):
        print(json.loads(f.readline()))

{'prompt': "Pergunta: Qual a descrição do produto 'Girls Ballet Tutu Neon Pink'?\nResposta:", 'response': 'High quality 3 layer ballet tutu. 12 inches in length'}
{'prompt': "Pergunta: Qual a descrição do produto 'Mog's Kittens'?\nResposta:", 'response': 'Judith Kerr&#8217;s best&#8211;selling adventures of that endearing (and exasperating) cat Mog have entertained children for more than 30 years. Now, even infants and toddlers can enjoy meeting this loveable feline. These sturdy little board books&#8212;with their bright, simple pictures, easy text, and hand&#8211;friendly formats&#8212;are just the thing to delight the very young. Ages 6 months&#8211;2 years.'}
{'prompt': "Pergunta: Qual a descrição do produto 'Girls Ballet Tutu Neon Blue'?\nResposta:", 'response': 'Dance tutu for girls ages 2-8 years. Perfect for dance practice, recitals and performances, costumes or just for fun!'}
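
The sample records above still contain HTML entities such as `&#8217;` and `&#8211;`. As an optional cleaning step that is not part of the pipeline recorded in this notebook, a minimal sketch using the standard library's `html.unescape` could decode them before the prompt is built. The helper name `clean_text` is hypothetical.

```python
import html

def clean_text(text: str) -> str:
    """Decode HTML entities and trim surrounding whitespace."""
    return html.unescape(text).strip()

sample = "Judith Kerr&#8217;s best&#8211;selling adventures"
cleaned = clean_text(sample)
print(cleaned)
```

Whether to apply this depends on the goal: decoding gives the model cleaner target text, while leaving the entities reproduces the raw dataset exactly.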


### 🧠 4. Load the LLaMA Model with LoRA

We load the LLaMA 3.2 3B Instruct model from Unsloth, already optimized for fine-tuning, with 4-bit quantization.

We apply LoRA to the attention projections (`q_proj`, `k_proj`, `v_proj`, `o_proj`) with `r=16`, `lora_alpha=16`, and `lora_dropout=0.1`.


In [6]:
# ===============================
# 🧠 Setup to run LLaMA 3.2 3B in 4-bit on a Tesla T4
# ===============================
# Basic imports
# Load the model with LoRA support
from unsloth import FastLanguageModel
from datasets import load_dataset
from transformers import TrainingArguments

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct",  # 3B Instruct variant
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)

# Apply LoRA
model = FastLanguageModel.get_peft_model(
    model,
    r=16, lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.1,
    bias="none",
    use_gradient_checkpointing=True,
    random_state=42,
)

# Prepare the model for training (gradient checkpointing is enabled automatically)
model = FastLanguageModel.for_training(model)


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.5.7: Fast Llama patching. Transformers: 4.51.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/2.35G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/234 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/54.7k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.1.
Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.
Unsloth 2025.5.7 patched 28 layers with 0 QKV layers, 0 O layers and 0 MLP layers.
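
As a sanity check on the LoRA configuration, the number of trainable parameters can be estimated by hand: each adapted linear layer gains two low-rank matrices, for a total of `r * (in_features + out_features)` weights. The sketch below assumes the Llama 3.2 3B dimensions from the public model configuration (hidden size 3072, 28 layers, 8 key/value heads of size 128); these values are not stated in this notebook. Under those assumptions the result matches the 9,175,040 trainable parameters reported in the training log.

```python
# Assumed Llama 3.2 3B dimensions (from the public model config,
# not stated in this notebook).
HIDDEN_SIZE = 3072
NUM_LAYERS = 28
NUM_KV_HEADS = 8
HEAD_DIM = 128
KV_SIZE = NUM_KV_HEADS * HEAD_DIM  # 1024, output size of k_proj and v_proj

RANK = 16  # r=16 in get_peft_model

# (in_features, out_features) of each adapted projection
projections = {
    "q_proj": (HIDDEN_SIZE, HIDDEN_SIZE),
    "k_proj": (HIDDEN_SIZE, KV_SIZE),
    "v_proj": (HIDDEN_SIZE, KV_SIZE),
    "o_proj": (HIDDEN_SIZE, HIDDEN_SIZE),
}

def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in_features) and B (out_features x rank)."""
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, RANK) for i, o in projections.values())
total = per_layer * NUM_LAYERS
print(f"{total:,}")  # 9,175,040
```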


### 🧪 4.1. Inference Test Before Training

In [7]:
# Load examples from the test dataset
test_dataset = load_dataset("json", data_files=f"{OUTPUT_PATH_DATASET}test_prepared.jsonl", split="train")
test_dataset = test_dataset.shuffle(seed=42).select(range(5))  # Select 5 random examples

print("📊 Testing the original model before fine-tuning:\n")
for example in test_dataset:
    prompt = example["prompt"]
    true_response = example["response"]

    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")

    outputs = model.generate(input_ids=inputs, max_new_tokens=100, use_cache=True, temperature=0.7)
    generated = tokenizer.decode(outputs[0], skip_special_tokens=True)

    print("🟡 Question:", prompt)
    print("🟢 Expected answer:", true_response[:200], "...")
    # Keep the text after the last "Resposta:" marker; the "assistant" role header stays in the output
    print("🔵 Model answer:", generated.split("Resposta:")[-1].strip()[:200], "...")
    print("="*80)

Generating train split: 0 examples [00:00, ? examples/s]

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


📊 Testing the original model before fine-tuning:

🟡 Question: Pergunta: Qual a descrição do produto 'Farberware P08-034 side handle for sauce pot 836, 838 &amp; 862.'?
Resposta:
🟢 Expected answer: Farberware P08-034 side handle for sauce pot 836, 838 & 862. ...
🔵 Model answer: assistant

Lamento, mas não consegui encontrar informações sobre o produto "Farberware P08-034 side handle for sauce pot 836, 838 &amp; 862". É possível que seja um produto específico e raro ou que nã ...
🟡 Question: Pergunta: Qual a descrição do produto 'F Is for Fetish (Erotic Alphabet)'?
Resposta:
🟢 Expected answer: "I began to lust for each fetish soon after opening Alison Tyler's recent bookF is for Fetish. A bike mechanic's smudged, strong hands, golden showers from a goddess in a clawfoot bathtub, and decaden ...
🔵 Model answer: assistant

Desculpe, mas não consegui encontrar informações sobre um produto chamado "F Is for Fetish (Erotic Alphabet)". Pode ser que seja um produto específico ou
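
In the output above, each model answer starts with the `assistant` role header because the whole sequence (prompt plus completion) is decoded and then split on a text marker. One common alternative, sketched here with plain lists standing in for token-id tensors, is to slice off the prompt tokens before decoding. The helper name `completion_only` and the toy ids are hypothetical.

```python
def completion_only(output_ids, prompt_length):
    """Return only the token ids generated after the prompt."""
    return output_ids[prompt_length:]

prompt_ids = [101, 7, 8, 9]             # toy prompt token ids
output_ids = prompt_ids + [42, 43, 44]  # generation returns prompt + completion

new_ids = completion_only(output_ids, len(prompt_ids))
print(new_ids)  # [42, 43, 44]
```

With the tensors from the cell above, the equivalent slice would be `outputs[0][inputs.shape[-1]:]`, which can then be passed to `tokenizer.decode`.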

### 📚 5. Prepare the Dataset for Unsloth

Unsloth expects the training text to follow the model's chat template (here the Llama 3 format rather than ChatML). We therefore call `standardize_sharegpt` and map each example into a `text` field that includes the `<|start_header_id|>user` and `<|start_header_id|>assistant` headers.


In [None]:
# ===============================
# 📚 Load Dataset
# ===============================
from datasets import load_dataset

path = f"{OUTPUT_PATH_DATASET}train_prepared.jsonl"

dataset = load_dataset("json", data_files=path, split="train")

Generating train split: 0 examples [00:00, ? examples/s]

In [None]:
from unsloth.chat_templates import standardize_sharegpt
dataset = standardize_sharegpt(dataset)

In [None]:
def format_for_unsloth(example):
    formatted_text = (
        "<|start_header_id|>user<|end_header_id|>\n\n" +
        example["prompt"] + "\n\n" +
        "<|start_header_id|>assistant<|end_header_id|>\n\n" +
        example["response"]
    )
    return {"text": formatted_text}

dataset = dataset.map(format_for_unsloth)


Map:   0%|          | 0/64237 [00:00<?, ? examples/s]
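
The `format_for_unsloth` function above ends each example right after the response, with no end-of-turn marker. For comparison only, here is a hedged sketch of the full Llama 3 turn layout, in which every turn is closed with `<|eot_id|>`. The function name `format_llama3_turns` is hypothetical, and this is not the formatting used in the run recorded in this notebook.

```python
EOT = "<|eot_id|>"  # Llama 3 end-of-turn token

def format_llama3_turns(prompt: str, response: str) -> str:
    """Build one user/assistant exchange in the Llama 3 chat layout."""
    return (
        "<|start_header_id|>user<|end_header_id|>\n\n"
        + prompt + EOT
        + "<|start_header_id|>assistant<|end_header_id|>\n\n"
        + response + EOT
    )

text = format_llama3_turns(
    "Pergunta: Qual a descrição do produto 'Girls Ballet Tutu Neon Pink'?\nResposta:",
    "High quality 3 layer ballet tutu. 12 inches in length",
)
print(text)
```

Including the closing marker in the training target is what teaches a model where an answer should stop; the `<|begin_of_text|>` token is left out here on the assumption that the tokenizer adds it, as the decoded example later in the notebook suggests.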

### ⚙️ 6. Define Training Parameters

We configure the `SFTTrainer` with the following main parameters:

- `per_device_train_batch_size`: 2
- `gradient_accumulation_steps`: 4
- `max_steps`: 60
- `learning_rate`: 2e-4
- `fp16`/`bf16` enabled according to the GPU

The goal was to enable efficient training at a low computational cost.
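
To make the scale of this configuration concrete, the short calculation below works out the effective batch size and how many examples the run processes. It assumes a single GPU and the 64,237 training examples reported in the notebook's logs.

```python
per_device_batch_size = 2
gradient_accumulation_steps = 4
num_gpus = 1            # assumption: a single Tesla T4
max_steps = 60
num_examples = 64_237   # dataset size reported in the logs

# Gradients from several small batches are accumulated before each update
effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_gpus
examples_seen = effective_batch_size * max_steps
fraction_seen = examples_seen / num_examples

print(effective_batch_size)    # 8
print(examples_seen)           # 480
print(f"{fraction_seen:.2%}")  # 0.75%
```

In other words, 60 steps cover well under 1% of the prepared dataset, which is consistent with the stated goal of a low-cost run.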


In [None]:
# ===============================
# ⚙️ Training Parameters
# ===============================
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

# Note: this helper is not referenced by the trainer below; it returns the columns unchanged.
def formatting_prompts_func(examples):
    prompts = []
    responses = []
    for p, r in zip(examples["prompt"], examples["response"]):
        prompts.append(p)
        responses.append(r)
    return {"prompt": prompts, "response": responses}

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",  # the formatted column is now "text"
    max_seq_length=2048,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer),
    dataset_num_proc=2,
    packing=False,
    formatting_func=format_for_unsloth,  # pass the formatting function defined earlier
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=60,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
        report_to="none",
    ),
)

Unsloth: Tokenizing ["text"] (num_proc=2):   0%|          | 0/64237 [00:00<?, ? examples/s]
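
The combination of `warmup_steps=5`, `max_steps=60`, and `lr_scheduler_type="linear"` means the learning rate ramps up linearly to `2e-4` and then decays linearly to zero. A minimal sketch of that schedule, written as a standalone function (the name `linear_lr` is hypothetical), could look like this:

```python
def linear_lr(step, base_lr=2e-4, warmup_steps=5, max_steps=60):
    """Learning rate after `step` updates: linear warmup, then linear decay."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (max_steps - step) / (max_steps - warmup_steps))

for step in (0, 5, 30, 60):
    print(step, linear_lr(step))
```

The peak is reached at step 5, and by step 30 the rate has already dropped to a little over half of its peak value.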

In [None]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)

Map (num_proc=2):   0%|          | 0/64237 [00:00<?, ? examples/s]

In [None]:
tokenizer.decode(trainer.train_dataset[5]["input_ids"])

"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nPergunta: Qual a descrição do produto 'Worship with Don Moen [VHS]'?\nResposta:\n\n<|start_header_id|>assistant<|end_header_id|>\n\nWorship with Don Moen [VHS]"

In [None]:
space = tokenizer(" ", add_special_tokens = False).input_ids[0]
tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]])

'                                   Worship with Don Moen [VHS]'
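
The two cells above show the effect of `train_on_responses_only`: prompt positions in `labels` are set to `-100`, so only the response tokens contribute to the loss. The idea can be sketched with plain lists; the helper name `mask_prompt_labels` and the toy token ids are hypothetical, and the real function locates the response boundary from the `instruction_part` and `response_part` strings rather than from an index.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips

def mask_prompt_labels(input_ids, response_start):
    """Copy input_ids into labels, hiding everything before response_start."""
    return [IGNORE_INDEX] * response_start + list(input_ids[response_start:])

input_ids = [1, 10, 11, 12, 20, 21]  # toy ids: 4 prompt tokens, 2 response tokens
labels = mask_prompt_labels(input_ids, response_start=4)
print(labels)  # [-100, -100, -100, -100, 20, 21]
```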

In [None]:
# @title Show current memory stats
import torch

gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
7.428 GB of memory reserved.
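
To put the numbers above in perspective, the following sketch computes how much of the GPU is already reserved before training starts and how much headroom remains. The values are copied from the output above.

```python
max_memory = 14.741       # GB, total GPU memory reported above
start_gpu_memory = 7.428  # GB reserved before training

reserved_pct = round(start_gpu_memory / max_memory * 100, 2)
free_gb = round(max_memory - start_gpu_memory, 3)
print(f"{reserved_pct}% reserved, {free_gb} GB left for training")
```

Roughly half of the T4's memory is taken before the first step, which is why the small batch size and gradient accumulation matter here.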


### 🔁 7. Run the Fine-Tuning

We start the training process. Every step is logged so that its progress can be followed.


In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 64,237 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 9,175,040/3,000,000,000 (0.31% trained)


Step,Training Loss
1,2.5897
2,3.072
3,2.5449
4,2.9323
5,2.6142
6,2.9644
7,2.9855
8,2.8401
9,2.7194
10,2.8633
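
With `logging_steps=1` and an effective batch of 8, the per-step loss is noisy, so individual values are hard to interpret. A simple moving average, sketched below over the ten values shown above (the helper name `moving_average` is hypothetical), gives a steadier view.

```python
losses = [2.5897, 3.072, 2.5449, 2.9323, 2.6142,
          2.9644, 2.9855, 2.8401, 2.7194, 2.8633]

def moving_average(values, window):
    """Average each run of `window` consecutive values."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]

mean_loss = sum(losses) / len(losses)
smoothed = moving_average(losses, window=5)
print(round(mean_loss, 4))
print([round(v, 4) for v in smoothed])
```

Only the first ten of the 60 steps appear in this excerpt, so no trend should be read from these values alone.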


### 🤖 8. Inference Test

After training, we run an inference test using a real example:

- Input: "Qual é a descrição do produto com o título: Girls Ballet Tutu Neon Pink?" ("What is the description of the product titled: Girls Ballet Tutu Neon Pink?")
- Output: (the model answers based on the fine-tuning data)


In [None]:
from unsloth import FastLanguageModel
FastLanguageModel.for_inference(model)

messages = [
    {"role": "user", "content": "Qual é a descrição do produto com o título: Girls Ballet Tutu Neon Pink?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")

outputs = model.generate(input_ids=inputs, max_new_tokens=150, use_cache=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))


system

Cutting Knowledge Date: December 2023
Today Date: 26 May 2025

user

Qual é a descrição do produto com o título: Girls Ballet Tutu Neon Pink?assistant

Our ballet tutu is designed to be a fun, playful addition to any little girl's ballet routine. The tutu is made of a sturdy netting material that provides the right amount of structure and support for a full skirt. The neon pink color adds a fun pop of color to any ballet routine. The tutu is adjustable in length to fit any child's height, making it a great choice for ballet classes or home practice. The netting material provides a full skirt with a soft, flowing texture that catches the light and adds to the overall look of the tutu. The adjustable straps in the back make it easy to put on and take off, and the sturdy netting material ensures that the tutu will hold its shape and provide support


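### 💾 9. Save the Model and Tokenizer

We save the fine-tuned model and tokenizer to Google Drive for later use. Because the model is wrapped with LoRA adapters, `save_pretrained` stores the adapter weights rather than a merged copy of the base model.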


In [None]:
# ===============================
# 💾 Save the Model and Tokenizer for later use
# ===============================
model.save_pretrained(MODEL_PATH_DATASET)
tokenizer.save_pretrained(MODEL_PATH_DATASET)
print("✅ Model saved to:", MODEL_PATH_DATASET)

✅ Model saved to: /content/drive/MyDrive/TechChallenge3/model/


### 🧪 10. Reload and Use the Trained Model (optional)

Optional demonstration of how to load the saved model and run inference with `TextStreamer` (token streaming).

In [8]:
from transformers import TextStreamer

if True:
  from unsloth import FastLanguageModel

  # Load the fine-tuned model
  model, tokenizer = FastLanguageModel.from_pretrained(
      model_name = f"{MODEL_PATH_DATASET}50k",  # Path to the saved model directory
      max_seq_length = 2048,
      dtype = None,
      load_in_4bit = True,
  )

  # Optimize for inference (enables fast mode)
  FastLanguageModel.for_inference(model)

# Prepare the question
messages = [
    {"role": "user", "content": "Qual é a descrição do produto com o título: Girls Ballet Tutu Neon Pink?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")

# Generation with an optional streamer (prints token by token)
streamer = TextStreamer(tokenizer, skip_prompt=True)

# Generate the answer
_ = model.generate(
    input_ids = inputs,
    streamer = streamer,
    max_new_tokens = 150,
    use_cache = True,
    temperature = 0.7,
)

==((====))==  Unsloth 2025.5.7: Fast Llama patching. Transformers: 4.51.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
This tutu is perfect for any little ballerina.  It is a bright pink tutu that is made of netting and is a great size for a young dancer.  The netting is a bright pink color and is a very pretty shade that is sure to match any outfit.  It is a great size and is long enough to cover the legs and is just the right size to keep the skirt from getting in the way of movement.  I highly recommend this tutu for any young ballerina.  It is a great way to keep her looking cute and stylish while she is dancing.  It is also easy to put on and take of