# Tech Challenge - Phase 3 - Fine-tuning a foundation model

In this phase's Tech Challenge, you must fine-tune a foundation model
(Llama, BERT, Mistral, etc.) on the "The AmazonTitles-1.3MM" dataset.
The trained model must:
* Receive questions with context drawn from the JSON file "trn.json"
included in the dataset.
* Given a prompt built from the user's question about a product title,
generate an answer that draws on the product description learned during
fine-tuning.

In [None]:
#@title Connecting to Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
#@title Installing libraries
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps xformers "trl<0.9.0" peft accelerate bitsandbytes
!pip install transformers datasets

Successfully installed trl-0.8.6 xformers-0.0.28.post3


In [None]:
#@title Configuring Unsloth parameters

from unsloth import FastLanguageModel, is_bfloat16_supported
import torch

max_seq_length = 2048   # Can change to whatever number <= 4096
dtype = None            # None for auto detection.
load_in_4bit = True     # Use 4bit quantization to reduce memory usage. Can be False.
fourbit_models = [      # 4bit pre quantized supported models
    "unsloth/mistral-7b-v0.3-bnb-4bit",
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/llama-3-8b-bnb-4bit",
    "unsloth/llama-3-8b-Instruct-bnb-4bit",
    "unsloth/llama-3-70b-bnb-4bit",
    "unsloth/Phi-3-mini-4k-instruct",
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/gemma-7b-bnb-4bit",
]


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


In [None]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

==((====))==  Unsloth 2024.12.4: Fast Llama patching. Transformers:4.46.3.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



In [None]:
#@title Cleaning and formatting the dataset

import json
import html

# Paths to the JSON files on Google Drive
DATA_PATH = "/content/drive/MyDrive/Postech/Fase 3/tc/trn.json"
OUTPUT_PATH_DATASET = "/content/drive/MyDrive/Postech/Fase 3/tc/formatted_trn.json"

titles = []
contents = []

# Read the dataset file line by line
with open(DATA_PATH, "r", encoding="utf-8") as file:
    for i, line in enumerate(file):
        # Remove surrounding whitespace and the trailing newline
        line = line.strip()

        # Only the first 300,000 lines are considered
        if i == 300000:
            break

        # Extract the "title" and "content" fields by string search.
        # NOTE: each value is cut at the first double quote, so a value that
        # contains an escaped quote (\") is truncated at that point.
        title_start = line.find('"title": "') + len('"title": "')
        title_end = line.find('"', title_start)
        title = line[title_start:title_end]

        content_start = line.find('"content": "') + len('"content": "')
        content_end = line.find('"', content_start)
        content = line[content_start:content_end]

        title_ = title.lower().strip()  # lowercase and strip surrounding whitespace
        content_ = html.unescape(content)  # decode HTML entities (e.g. &amp; -> &)
        content_ = (
            content_.lower().strip()
        )  # lowercase and strip surrounding whitespace

        # The title must have more than 40 characters so that it is distinguishable.
        # The content must have more than 400 characters to count as a substantial "review".
        if len(title_) > 40 and len(content_) > 400:

            # Dropping titles that contain "edition" or "book" removes
            # many books that are repeated often in the dataset.
            if "edition" not in title_ and "book" not in title_:

                # Items containing "&" were dropped because their text was consistently lower quality.
                # Only items containing "review" were kept, since their text read more like an actual book review.
                if "&" not in content_ and "review" in content_:
                    titles.append(title_)
                    contents.append(content_)

# Format the data as instruction / input / output columns
formatted_data = {
    "instruction": ["REVIEW THIS BOOK."] * len(titles),
    "input": titles,
    "output": contents,
}

# Save the formatted data to a JSON file
with open(OUTPUT_PATH_DATASET, "w") as output_file:
    json.dump(formatted_data, output_file, indent=4)

print("Titles:", len(titles))


Titles: 5402
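
The string search in the cell above cuts each value at the first double quote it finds. As a minimal sketch of an alternative, assuming each line of `trn.json` is a standalone JSON object with `"title"` and `"content"` keys (which is what the extraction code implies), the line could be parsed with `json.loads` instead. The helper name `parse_record` and the sample line are hypothetical and are not taken from the dataset. Switching to this approach would probably change the number of titles that pass the filters.

```python
import html
import json


def parse_record(line):
    """Return (title, content) from one JSON line, lowercased and HTML-unescaped."""
    record = json.loads(line)
    title = record.get("title", "").lower().strip()
    content = html.unescape(record.get("content", "")).lower().strip()
    return title, content


# Made-up sample line (not from the dataset) containing an escaped quote
# in the title and an HTML entity in the content.
sample = json.dumps({"title": 'The "Hidden" Cities', "content": "A review &amp; summary."})

title, content = parse_record(sample)
print(title)
print(content)
```

Here the quoted word in the title survives intact, whereas the manual search would have stopped at the first escaped quote.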


In [None]:
# @title Parameter-Efficient Fine-Tuning

model = FastLanguageModel.get_peft_model(
    model,
    # This controls the rank of the low-rank matrices used in the PEFT method.
    r = 16,
    # This specifies which layers or modules within
    # the pre-trained model should be fine-tuned using PEFT.
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    # This is a scaling factor used in LoRA to control the
    # impact of the low-rank matrices on the original weights.
    lora_alpha = 16,
    # This sets the dropout rate for the LoRA modules to 0.
    # Dropout is a regularization technique used to prevent overfitting.
    lora_dropout = 0,
    # This indicates that no bias terms are used in the LoRA modules.
    bias = "none",

    # This enables gradient checkpointing, a technique to reduce memory usage
    # during training, especially for large models. It trades off computation
    # time for memory savings.
    use_gradient_checkpointing = "unsloth",
    # This sets a seed for the random number generator, ensuring reproducibility of results
    random_state = 3407,
    # When set to True, uses Rank-Stabilized LoRA, which sets the adapter scaling
    # factor to lora_alpha/math.sqrt(r). Otherwise, it uses the original
    # default value of lora_alpha/r.
    use_rslora = False,
    # The configuration of LoftQ (LoRA-Fine-Tuning-aware Quantization)
    loftq_config = None,
)

Unsloth 2024.12.4 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


This code uses **PEFT** (*Parameter-Efficient Fine-Tuning*) to adapt a pre-trained Llama language model to a specific task. Instead of updating all of the model's weights, PEFT methods train only a small set of additional parameters (here, LoRA adapters), which reduces memory use and training time.
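
To make "a small set of additional parameters" concrete, the sketch below counts the LoRA weights added by the configuration above. The layer dimensions are assumptions based on the published Llama-3-8B architecture and are not stated in this notebook. Each adapted linear layer gets two low-rank matrices, `A` (`r x in_features`) and `B` (`out_features x r`).

```python
# Assumed Llama-3-8B dimensions (not stated in this notebook).
HIDDEN = 4096          # hidden_size
INTERMEDIATE = 14336   # MLP intermediate_size
KV_DIM = 1024          # 8 key/value heads x 128 head_dim (grouped-query attention)
NUM_LAYERS = 32
R = 16                 # LoRA rank, as passed to get_peft_model

# (in_features, out_features) of each projection listed in target_modules
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, rank):
    # A is (rank x in_features) and B is (out_features x rank)
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, R) for i, o in shapes.values())
total = per_layer * NUM_LAYERS
print(f"{total:,}")
```

Under these assumptions the total is 41,943,040, which matches the number of trainable parameters that the trainer reports when training starts.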

In [None]:
# @title Dataset preparation
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token  # appended so the model learns where a response ends

def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    # `input_text` avoids shadowing the built-in `input`
    for instruction, input_text, output in zip(instructions, inputs, outputs):
        text = alpaca_prompt.format(instruction, input_text, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }

from datasets import load_dataset

dataset = load_dataset("json", data_files=OUTPUT_PATH_DATASET, split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)

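
As an illustration of what the mapping step produces, the sketch below formats a single example with the same Alpaca-style template. It runs without a tokenizer, so `"<EOS>"` is a placeholder for `tokenizer.eos_token`, and the description text is a stand-in rather than real dataset content.

```python
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = "<EOS>"  # placeholder; the notebook uses tokenizer.eos_token

# A one-row batch in the same column layout as formatted_trn.json
batch = {
    "instruction": ["REVIEW THIS BOOK."],
    "input": ["hidden cities: the discovery and loss of ancient north american civilizations"],
    "output": ["(product description text would go here)"],
}

texts = [
    alpaca_prompt.format(instruction, title, description) + EOS_TOKEN
    for instruction, title, description in zip(
        batch["instruction"], batch["input"], batch["output"]
    )
]
print(texts[0])
```

Each training string therefore contains the instruction, the product title, and the description, and ends with the end-of-sequence marker.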

In [None]:
# @title Training the model
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

max_steps is given, it will override any value given in num_train_epochs


TRL is a library for post-training foundation models with techniques such as *Supervised Fine-Tuning* (SFT), *Proximal Policy Optimization* (PPO), and *Direct Preference Optimization* (DPO). Built on top of the *Transformers* ecosystem, TRL supports a variety of model architectures and modalities and can scale across different hardware configurations.
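
The `TrainingArguments` passed to `SFTTrainer` above define a small training budget. The sketch below works through the arithmetic, taking the example count (5,402) and the single GPU from this notebook's outputs. The `linear_schedule_lr` helper is a hypothetical approximation of a linear warmup followed by linear decay, written for illustration only; it is not the library's own implementation.

```python
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_gpus = 1
max_steps = 60
warmup_steps = 5
learning_rate = 2e-4
num_examples = 5402

# Examples consumed per optimizer step, and over the whole run
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
examples_seen = effective_batch * max_steps
fraction_of_epoch = examples_seen / num_examples


def linear_schedule_lr(step):
    """Approximate learning rate after `step` optimizer steps."""
    if step < warmup_steps:
        # Linear warmup from 0 to the peak learning rate
        return learning_rate * step / warmup_steps
    # Linear decay from the peak down to 0 at max_steps
    return learning_rate * max(0.0, (max_steps - step) / (max_steps - warmup_steps))


print(effective_batch, examples_seen, round(fraction_of_epoch, 3))
print([linear_schedule_lr(s) for s in (0, 5, 60)])
```

With these settings the run sees 480 examples, which is less than a tenth of the filtered dataset, so `max_steps` is the first value to raise if a longer run is affordable.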

In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 5,402 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 41,943,040

Step,Training Loss
1,2.3197
2,2.8232
3,2.6568
4,2.4751
5,2.3955
6,2.4746
7,2.5237
8,2.5035
9,2.1526
10,2.0054


In [None]:
# @title Testing the model
FastLanguageModel.for_inference(model)
inputs = tokenizer(
[
    alpaca_prompt.format(
        "REVIEW THIS BOOK.",
        "hidden cities: the discovery and loss of ancient north american civilizations", # input
        "",
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 256)

<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
REVIEW THIS BOOK.

### Input:
hidden cities: the discovery and loss of ancient north american civilizations

### Response:
“a fascinating, well-researched, and beautifully written book. it is a must read for anyone interested in the ancient history of the americas.”—peter n. pericoli, author of the lost city of the incas and the secret of the incas“a lively and fascinating account of the search for america’s prehistoric past.  the author’s writing is lively, and he has a fine gift for describing the archaeology and the personalities involved in the discovery of the new world’s ancient past.”—john h. l. lloyd, author of the mystery of the maya and the mystery of the maya: deciphering the hieroglyphs“a fascinating account of the search for the earliest inhabitants of north america, from the first e
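
The streamed output above echoes the whole prompt before the generated review. When only the review is needed, for example after decoding the result of `model.generate` without a streamer, the text after the response marker can be split off. The sketch below is a minimal illustration: `extract_response` is a hypothetical helper, the sample string is made up, and the end-of-sequence string `<|end_of_text|>` is an assumption about the tokenizer.

```python
RESPONSE_MARKER = "### Response:"


def extract_response(generated_text, eos_token="<|end_of_text|>"):
    """Return the text after the last response marker, without the EOS token."""
    # rpartition returns the whole text as the last element if the marker is absent
    _, _, response = generated_text.rpartition(RESPONSE_MARKER)
    return response.replace(eos_token, "").strip()


# Made-up decoded output in the same layout as the prompt template
generated = (
    "<|begin_of_text|>Below is an instruction that describes a task.\n\n"
    "### Instruction:\nREVIEW THIS BOOK.\n\n"
    "### Input:\nsome product title\n\n"
    "### Response:\na fascinating, well-researched book.<|end_of_text|>"
)

review = extract_response(generated)
print(review)
```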

In [None]:
# @title Save the model to Google Drive
model.save_pretrained("/content/drive/MyDrive/Postech/Fase 3/tc/lora_model") # Local saving
tokenizer.save_pretrained("/content/drive/MyDrive/Postech/Fase 3/tc/lora_model")

('/content/drive/MyDrive/Postech/Fase 3/tc/lora_model/tokenizer_config.json',
 '/content/drive/MyDrive/Postech/Fase 3/tc/lora_model/special_tokens_map.json',
 '/content/drive/MyDrive/Postech/Fase 3/tc/lora_model/tokenizer.json')

In [None]:
# @title Save the model to Hugging Face
import os
hf_token = os.environ["HF_TOKEN"]  # assumes the token was set beforehand; never hard-code it
model.push_to_hub("michaelycus/lora_model", token = hf_token) # Online saving
tokenizer.push_to_hub("michaelycus/lora_model", token = hf_token) # Online saving

Saved model to https://huggingface.co/michaelycus/lora_model


No files have been modified since last commit. Skipping to prevent empty commit.
