# **Fine-tune Llama 3.1 8B**

## 0. Preliminaries

### Install packages

In [1]:
!pip install -qqq "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git" --progress-bar off
from torch import __version__; from packaging.version import Version as V
xformers = "xformers==0.0.27" if V(__version__) < V("2.4.0") else "xformers"
!pip install -qqq --no-deps {xformers} trl peft accelerate bitsandbytes triton --progress-bar off


  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Building wheel for unsloth (pyproject.toml) ... done
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gcsfs 2024.10.0 requires fsspec==2024.10.0, but you have fsspec 2024.9.0 which is incompatible.
grpcio-status 1.62.3 requires protobuf>=4.21.6, but you have protobuf 3.20.3 which is incompatible.

### Load libraries

In [2]:
import torch
from trl import SFTTrainer
from datasets import load_dataset
from transformers import TrainingArguments, TextStreamer
from unsloth.chat_templates import get_chat_template
from unsloth import FastLanguageModel, is_bfloat16_supported

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


## 1. Load model for PEFT


In [3]:
# Load model
max_seq_length = 2048
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8b-bnb-4bit",
    max_seq_length=max_seq_length,  # Maximum sequence length accepted by the model and used during tokenization
    load_in_4bit=True,              # Load the weights quantized to 4 bits
    dtype=None,                     # None lets Unsloth auto-detect the dtype (bfloat16 or float16)
)

==((====))==  Unsloth 2024.12.4: Fast Llama patching. Transformers:4.46.3.
   \\   /|    GPU: NVIDIA L4. Max memory: 22.168 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 8.9. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/230 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/50.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/345 [00:00<?, ?B/s]


- Low-Rank Adaptation (LoRA): a technique for fine-tuning large language models efficiently. Instead of updating every parameter, it trains only low-rank matrices attached to specific layers.

- r: the rank of the LoRA matrices, which sets their size. A higher value gives the adapter more capacity, at a slightly higher computational cost.

- lora_alpha: the scaling factor for the LoRA update, which controls how strongly the learned adjustment affects the original weights. In standard LoRA the low-rank product is scaled by lora_alpha / r before it is added to the frozen weights.

- lora_dropout: the dropout rate applied to the LoRA matrices during training.

- target_modules: the list of model modules that receive LoRA adapters, which determines which parts of the model are tuned. Modules in the list:

  *   q_proj: query projection in the attention mechanism.
  *   v_proj: value projection.
  *   k_proj: key projection.
  *   up_proj: the part of the feed-forward network that expands the dimensions.
  *   down_proj: the part of the feed-forward network that reduces the dimensions.
  *   o_proj: final output projection in the attention mechanism.
  *   gate_proj: gate projection in the feed-forward network.

- use_rslora: enables Rank-Stabilized LoRA, a variant that scales the update by lora_alpha / sqrt(r) instead of lora_alpha / r to improve stability and performance during fine-tuning.

- use_gradient_checkpointing: an optimization that reduces memory use during training by recomputing activations in the backward pass instead of storing them.
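
To make the effect of `use_rslora` concrete, the sketch below compares the two scaling rules for the values used in this notebook (`r=16`, `lora_alpha=16`). The helper name `lora_scaling` is hypothetical. The formulas are the standard LoRA scaling (`lora_alpha / r`) and the rank-stabilized variant (`lora_alpha / sqrt(r)`).

```python
import math

def lora_scaling(lora_alpha: float, r: int, use_rslora: bool) -> float:
    """Return the factor applied to the low-rank update before it is added to the frozen weights."""
    if use_rslora:
        # Rank-stabilized LoRA divides by sqrt(r) instead of r
        return lora_alpha / math.sqrt(r)
    return lora_alpha / r

standard = lora_scaling(16, 16, use_rslora=False)
rank_stabilized = lora_scaling(16, 16, use_rslora=True)
print(standard, rank_stabilized)  # 1.0 4.0
```

With these settings the rank-stabilized update is scaled four times more strongly than the standard one, so `lora_alpha` values tuned for one rule may not transfer directly to the other.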

In [4]:
# Prepare model for PEFT
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=[
        "q_proj",
        "v_proj",
        "k_proj",
        "up_proj",
        "down_proj",
        "o_proj",
        "gate_proj"],
    use_rslora=True,
    use_gradient_checkpointing="unsloth",
    )


Unsloth 2024.12.4 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
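
As a rough check on the adapter size, the sketch below counts the LoRA parameters this configuration adds. The layer dimensions are assumptions taken from the published Llama 3.1 8B configuration, not from this notebook, and the helper `lora_params` is hypothetical. Each adapted linear layer gains two matrices, A (`r x in_features`) and B (`out_features x r`).

```python
# Assumed Llama 3.1 8B dimensions (from the published model config, not from this notebook)
HIDDEN = 4096          # model (hidden) size
INTERMEDIATE = 14336   # feed-forward inner size
KV_DIM = 1024          # 8 key/value heads x 128 dims (grouped-query attention)
NUM_LAYERS = 32
R = 16                 # LoRA rank used in get_peft_model

# (in_features, out_features) of each adapted linear layer
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features: int, out_features: int, r: int) -> int:
    # A has r * in_features entries, B has out_features * r entries
    return r * (in_features + out_features)

per_layer = sum(lora_params(i, o, R) for i, o in MODULE_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{total:,}")  # 41,943,040
```

Under these assumptions the total agrees with the "Number of trainable parameters = 41,943,040" line that Unsloth prints when training starts.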


## 2. Prepare data and tokenizer


In [5]:
"""
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello, how are you?<|im_end|>
<|im_start|>assistant
I'm fine, thank you!<|im_end|>
"""

tokenizer = get_chat_template(
    tokenizer,
    # Template used to structure the messages
    chat_template="chatml",
    # Map the dataset's ShareGPT-style keys and role names to the ones the ChatML template expects
    mapping={"role": "from",
             "content": "value",
             "user": "human",
             "assistant": "gpt"},
)

Unsloth: Will map <|im_end|> to EOS = <|end_of_text|>.
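
The ChatML layout shown in the docstring of the cell above can be approximated in plain Python. This is only an illustrative sketch: the real template is applied by the tokenizer, the function `render_chatml` and the `ROLE_NAMES` table are hypothetical, and the `system` entry is an assumption that does not appear in the mapping passed to `get_chat_template`.

```python
# Mirrors the role mapping passed to get_chat_template; the "system" entry is an assumption
ROLE_NAMES = {"human": "user", "gpt": "assistant", "system": "system"}

def render_chatml(conversation, add_generation_prompt=False):
    """Render ShareGPT-style messages ({'from': ..., 'value': ...}) as ChatML text."""
    parts = []
    for message in conversation:
        role = ROLE_NAMES.get(message["from"], message["from"])
        parts.append(f"<|im_start|>{role}\n{message['value']}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

example = [
    {"from": "human", "value": "Hello, how are you?"},
    {"from": "gpt", "value": "I'm fine, thank you!"},
]
rendered = render_chatml(example)
print(rendered)
```

Training text is rendered without the generation prompt, while inference prompts end with an open assistant turn.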


In [6]:
dataset = load_dataset("mlabonne/FineTome-100k", split="train[:200]")

README.md:   0%|          | 0.00/982 [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/117M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/100000 [00:00<?, ? examples/s]

In [7]:
dataset

Dataset({
    features: ['conversations', 'source', 'score'],
    num_rows: 200
})

In [8]:
def apply_template(example):
    # Extract the list of messages from "conversations"
    messages = example["conversations"]

    # Make sure 'messages' is a valid list of messages
    if isinstance(messages, list):
        text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
    else:
        text = ""  # If it is not valid, return an empty string

    return {"text": text}

dataset = dataset.map(apply_template, batched=False)


Map:   0%|          | 0/200 [00:00<?, ? examples/s]

In [9]:
print(dataset[0])

{'conversations': [{'from': 'human', 'value': 'Explain what boolean operators are, what they do, and provide examples of how they can be used in programming. Additionally, describe the concept of operator precedence and provide examples of how it affects the evaluation of boolean expressions. Discuss the difference between short-circuit evaluation and normal evaluation in boolean expressions and demonstrate their usage in code. \n\nFurthermore, add the requirement that the code must be written in a language that does not support short-circuit evaluation natively, forcing the test taker to implement their own logic for short-circuit evaluation.\n\nFinally, delve into the concept of truthiness and falsiness in programming languages, explaining how it affects the evaluation of boolean expressions. Add the constraint that the test taker must write code that handles cases where truthiness and falsiness are implemented differently across different programming languages.'}, {'from': 'gpt', 'v

In [10]:
dataset

Dataset({
    features: ['conversations', 'source', 'score', 'text'],
    num_rows: 200
})

In [11]:
dataset['text'][0]

'<|im_start|>user\nExplain what boolean operators are, what they do, and provide examples of how they can be used in programming. Additionally, describe the concept of operator precedence and provide examples of how it affects the evaluation of boolean expressions. Discuss the difference between short-circuit evaluation and normal evaluation in boolean expressions and demonstrate their usage in code. \n\nFurthermore, add the requirement that the code must be written in a language that does not support short-circuit evaluation natively, forcing the test taker to implement their own logic for short-circuit evaluation.\n\nFinally, delve into the concept of truthiness and falsiness in programming languages, explaining how it affects the evaluation of boolean expressions. Add the constraint that the test taker must write code that handles cases where truthiness and falsiness are implemented differently across different programming languages.<|im_end|>\n<|im_start|>assistant\nBoolean operato

## 3. Training


In [12]:
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=True,
    args=TrainingArguments(
        learning_rate=2e-4,
        lr_scheduler_type="linear",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        warmup_steps=10,
        output_dir="./output"
    ),
)

Generating train split: 0 examples [00:00, ? examples/s]
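
With `packing=True`, several short texts are concatenated into sequences of up to `max_seq_length` tokens. That is presumably why the trainer later reports fewer training examples than the 200 rows in the dataset. The sketch below illustrates the idea on toy token lists. It is a simplified, hypothetical `pack_sequences` helper, not TRL's actual implementation, and dropping the incomplete tail is a choice made for this sketch.

```python
def pack_sequences(tokenized_texts, block_size, eos_id):
    """Concatenate token lists (each followed by EOS) and split the stream into full blocks."""
    stream = []
    for tokens in tokenized_texts:
        stream.extend(tokens)
        stream.append(eos_id)  # mark the boundary between texts
    n_blocks = len(stream) // block_size  # the incomplete tail is dropped in this sketch
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]

toy = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
blocks = pack_sequences(toy, block_size=4, eos_id=0)
print(len(toy), "texts ->", len(blocks), "blocks")  # 4 texts -> 3 blocks
print(blocks)  # [[1, 2, 3, 0], [4, 5, 0, 6], [7, 8, 9, 0]]
```

Packing avoids spending compute on padding tokens, at the cost of mixing unrelated conversations inside one training sequence.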

In [13]:
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 55 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 4 | Gradient Accumulation steps = 4
\        /    Total batch size = 16 | Total steps = 3
 "-____-"     Number of trainable parameters = 41,943,040
[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

 ··········


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


| Step | Training Loss |
|------|---------------|
| 1    | 1.0649        |
| 2    | 1.1019        |
| 3    | 1.0796        |


TrainOutput(global_step=3, training_loss=1.082162857055664, metrics={'train_runtime': 293.6535, 'train_samples_per_second': 0.187, 'train_steps_per_second': 0.01, 'total_flos': 4451323701362688.0, 'train_loss': 1.082162857055664, 'epoch': 0.8571428571428571})
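
The step count and the fractional epoch in the output above follow from the batch settings. The arithmetic below is a sketch that assumes the trainer rounds the number of batches up and the number of accumulation groups down, which reproduces the logged values. The `warmup_lr` helper is hypothetical and models only the linear warmup phase, not the decay that follows it.

```python
import math

num_examples = 55                  # packed sequences reported by the trainer
per_device_batch_size = 4
gradient_accumulation_steps = 4
warmup_steps = 10
peak_lr = 2e-4

batches_per_epoch = math.ceil(num_examples / per_device_batch_size)          # 14
optimizer_steps = max(batches_per_epoch // gradient_accumulation_steps, 1)   # 3
epoch_fraction = optimizer_steps * gradient_accumulation_steps / batches_per_epoch

def warmup_lr(step: int) -> float:
    # Linear warmup: the rate ramps from 0 to peak_lr over warmup_steps
    return peak_lr * min(step / warmup_steps, 1.0)

print(batches_per_epoch, optimizer_steps, round(epoch_fraction, 4))  # 14 3 0.8571
print(warmup_lr(optimizer_steps) < peak_lr)  # True
```

Because `warmup_steps=10` exceeds the 3 optimizer steps of this run, the learning rate never reaches the configured peak of 2e-4. A longer run or a shorter warmup would be needed for the schedule to take full effect.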

## 4. Inference

In [22]:
# Load model for inference
model = FastLanguageModel.for_inference(model)

messages = [
   {"from": "human",
    "value": "Is 9.11 larger than 9.9?"}
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt").to("cuda")

In [23]:
inputs

tensor([[    27,     91,    318,   5011,     91,     29,    882,    198,   3957,
            220,     24,     13,    806,   8294,   1109,    220,     24,     13,
             24,     30, 128001,    198,     27,     91,    318,   5011,     91,
             29,  78191,    198]], device='cuda:0')

In [24]:
text_streamer = TextStreamer(tokenizer)

In [25]:
_ = model.generate(
    input_ids=inputs,
    streamer=text_streamer,
    max_new_tokens=128,
    use_cache=True)

<|im_start|>user
Is 9.11 larger than 9.9?<|im_end|>
<|im_start|>assistant
Is 9.11 larger than 9.9? The answer is yes.
Is 9.11 larger than 9.9? The answer is yes.
Is 9.11 larger than 9.9? The answer is yes.
Is 9.11 larger than 9.9? The answer is yes.
Is 9.11 larger than 9.9? The answer is yes.
Is 9.11 larger than 9.9? The answer is yes.
Is 9.11 larger than 9.9? The answer is yes.
Is 9.11 larger than 9


## 5. Save trained model



In [26]:
model.save_pretrained_merged(
    "model",
    tokenizer=tokenizer,
    save_method="merged_16bit"
)

Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 5.7G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 30.16 out of 52.96 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


 78%|███████▊  | 25/32 [00:01<00:00, 17.01it/s]
We will save to Disk and not RAM now.
100%|██████████| 32/32 [00:07<00:00,  4.24it/s]


Unsloth: Saving tokenizer... Done.
Done.
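
A back-of-the-envelope estimate shows why the merged 16-bit checkpoint needs much more space than the 4-bit download. The parameter count below is the nominal "8B" figure, an assumption rather than a measured value, and the helper `weights_size_gb` is hypothetical. It ignores quantization metadata and any tensors kept at higher precision.

```python
def weights_size_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate size of the raw weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 8e9  # assumption: nominal "8B" parameter count

size_16bit = weights_size_gb(NUM_PARAMS, 16)
size_4bit = weights_size_gb(NUM_PARAMS, 4)
print(size_16bit, size_4bit)  # 16.0 4.0
```

The 4-bit file downloaded earlier is 5.70 GB rather than 4 GB, presumably because of the overheads this estimate ignores. The merged model should occupy roughly 16 GB on disk.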
