<a href="https://colab.research.google.com/github/wanadzhar913/aitinkerers-hackathon-supa-team-werecooked/blob/master/notebooks-finetuning-models/02_finetune_v3_malaysian_mistral_7b_32k_instructions_v4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook we finetune [mesolitica/malaysian-mistral-7b-32k-instructions-v4](https://huggingface.co/mesolitica/malaysian-mistral-7b-32k-instructions-v4), primarily for a **natural language inference (NLI)** and **reasoning** task. In our case, NLI means determining whether a "hypothesis" is true (*entailment*) or false (*contradiction*) given a question-statement pair, and providing step-by-step reasoning for that choice. We select this model primarily for its:
- **Context length of 32,000.** This is the maximum number of tokens (sub-word units covering words, punctuation, and whitespace) that the model can consider at once when processing input. A long context is important since we'll be doing NLI on text pairs of varying length.
- **Number of monthly downloads on HuggingFace.** A consistently high number of monthly downloads is a rough proxy for model quality.
- **Good ability to comprehend Malay and English text** and to reply in Malay, since it has already been instruction-finetuned.

We train solely on the [Boolq-Malay-With-Chain-of-Thought](https://huggingface.co/datasets/wanadzhar913/boolq-malay-with-chain-of-thought) dataset. It comprises both Malay and English versions of the original [Boolq](https://huggingface.co/datasets/google/boolq) dataset, as well as a Chain-of-Thought reasoning column generated with OpenAI's GPT-4o mini.

We use the following training parameters and obtain the following training results:

- **No. of Epochs:** 1
- **LoRA Rank:** 64
- **Learning Rate:** 2e-4
- **Learning Rate Scheduler Type:** constant
- **Maximum Sequence Length:** 32768
- **Load model in 4-bit Precision:** True
- **bf16 (Brain Floating Point 16-bit):** False
- **Train Loss:** 0.3057 (last value logged to Weights and Biases; the average over the epoch was 0.2719)

The **model** can be found here: https://huggingface.co/wanadzhar913/malaysian-mistral-llmasajudge-v3

The **Weights and Biases training run** can be found here: https://wandb.ai/adzhar-faiq/finetune-malaysian-mistral-llmasajudge-v3

For NLI benchmarks specifically, the **benchmarking notebook** can be found here: https://github.com/wanadzhar913/aitinkerers-hackathon-supa-team-werecooked/blob/master/notebooks-benchmarking-exercises/03_benchmark_malaysian_mistral_llmasajudge_v3.ipynb

In the future, we can do the following to garner better results:
- Set the `bf16` parameter to `True` to improve compute efficiency without significantly sacrificing model accuracy.
- Increase `gradient_accumulation_steps` to work within the memory limits of a small GPU, or increase the `batch_size` if we have access to a larger GPU. The main aim is to avoid [Out of Memory (OOM) errors](https://discuss.huggingface.co/t/batch-size-vs-gradient-accumulation/5260).
- Given more compute resources, we can also increase our `patience` variable and train for more than 10 epochs.
- **Limit the reasoning portion (in the training dataset) to Malay only.** Since the model has been instruction-finetuned to reply mainly in Malay, having it reason in English could be confusing.
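
To make the gradient-accumulation suggestion concrete, here is a minimal pure-Python sketch (not the `Trainer` implementation). It shows that averaging the gradients of several equally sized micro-batches reproduces the gradient of one large batch, assuming a mean-reduced loss. The toy model `y = w * x`, the data, and the helper names are hypothetical.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error of y = w * x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, batch, micro_batch_size):
    """Average per-micro-batch gradients, as gradient accumulation does."""
    chunks = [batch[i:i + micro_batch_size]
              for i in range(0, len(batch), micro_batch_size)]
    # Each micro-batch gradient is scaled by 1 / (number of accumulation steps).
    return sum(grad_mse(w, chunk) / len(chunks) for chunk in chunks)


data = [(1.0, 2.0), (2.0, 3.5), (3.0, 6.5), (4.0, 8.0)]
full = grad_mse(0.5, data)                               # one batch of 4
accum = accumulated_grad(0.5, data, micro_batch_size=2)  # two micro-batches of 2
print(full, accum)
```

With `per_device_train_batch_size = 8`, setting `gradient_accumulation_steps = 4` would give an effective batch size of 32 while only 8 sequences are held in memory at a time.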


In [1]:
!pip install datasets bitsandbytes peft trl wandb -q

In [3]:
# import wandb
# import huggingface_hub
# import pandas as pd
# import datasets
# import torch
# import bitsandbytes
# import peft
# import trl
# import transformers

# print(f"WandB version: {wandb.__version__}")
# print(f"Huggingface Hub version: {huggingface_hub.__version__}")
# print(f"Pandas version: {pd.__version__}")
# print(f"Datasets version: {datasets.__version__}")
# print(f"Torch version: {torch.__version__}")
# print(f"Bitsandbytes version: {bitsandbytes.__version__}")
# print(f"Peft version: {peft.__version__}")
# print(f"TRL version: {trl.__version__}")
# print(f"Transformers version: {transformers.__version__}")

WandB version: 0.19.1
Huggingface Hub version: 0.27.0
Pandas version: 2.2.2
Datasets version: 3.2.0
Torch version: 2.5.1+cu121
Bitsandbytes version: 0.45.0
Peft version: 0.14.0
TRL version: 0.13.0
Transformers version: 4.47.1


In [None]:
import os
import json
import argparse
from random import randint

import wandb
from huggingface_hub import notebook_login

import pandas as pd
from datasets import Dataset, load_dataset

import torch
import bitsandbytes as bnb
from peft import LoraConfig, PeftModel
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM
from transformers.trainer_utils import get_last_checkpoint
from transformers import AutoModelForCausalLM, AutoTokenizer, \
                         BitsAndBytesConfig, TrainingArguments, \
                         logging, pipeline

In [None]:
os.environ["WANDB_PROJECT"]="finetune-malaysian-mistral-llmasajudge-v3"

In [None]:
notebook_login()


In [None]:
wandb.login()

wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc

True

In [None]:
!nvidia-smi

Tue Oct 22 17:12:36 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   33C    P0              48W / 400W |      5MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                    

### 1.0 Load dataset and prepare the prompt input according to the Mistral format

[mesolitica/malaysian-mistral-7b-32k-instructions-v4](https://huggingface.co/mesolitica/malaysian-mistral-7b-32k-instructions-v4) is a conversational chat model, so we can chat with it using the following prompt format:

> \<s> [INST] User Instruction 1 [/INST] Model answer 1\</s> [INST] User instruction 2 [/INST]

For instruction fine-tuning, it is quite common to have two columns inside the dataset: one for the prompt & the other for the response.
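
As an illustration of the prompt format above, the following sketch assembles a multi-turn prompt string from (user, assistant) pairs. `build_mistral_prompt` is a hypothetical helper written for this example, and the exact spacing around the tags is copied from the format line above rather than from the model's chat template.

```python
def build_mistral_prompt(turns, bos_token="<s>", eos_token="</s>"):
    """Format (user, assistant) turns; pass None as the reply still to be generated."""
    prompt = bos_token
    for user_msg, assistant_msg in turns:
        prompt += f" [INST] {user_msg} [/INST]"
        if assistant_msg is not None:
            # A finished assistant turn is closed with the end-of-sequence token.
            prompt += f" {assistant_msg}{eos_token}"
    return prompt


example = build_mistral_prompt([
    ("User Instruction 1", "Model answer 1"),
    ("User instruction 2", None),
])
print(example)
```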

In [None]:
dataset_train = load_dataset("wanadzhar913/boolq-malay-with-chain-of-thought", split='train')

print(f"Train dataset size: {len(dataset_train)}")

Generating train split:   0%|          | 0/18851 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/6540 [00:00<?, ? examples/s]

Train dataset size: 18851


In [None]:
dataset_train[0].keys()

dict_keys(['passage', 'question', 'answer', 'language', 'split', 'reasoning'])

In [None]:
dataset_train[0]

{'passage': 'The Bucks have won one league title (1971), two conference titles (1971 and 1974), and 13 division titles (1971--1974, 1976, 1980--1986, 2001). They have featured such notable players as Kareem Abdul-Jabbar, Sidney Moncrief, Oscar Robertson, Bob Dandridge, Bob Lanier, Glenn Robinson, Ray Allen, Sam Cassell, Junior Bridgeman, Michael Redd, Terry Cummings, Vin Baker, Jon McGlocklin, Marques Johnson, and Brian Winters.',
 'question': 'have the milwaukee bucks ever won a championship',
 'answer': 1,
 'language': 'English',
 'split': 'train',
 'reasoning': 'To determine whether the statement "have the Milwaukee Bucks ever won a championship" is factually consistent with the provided passage, we can follow these steps:\n\n1. **Identify Key Information in the Passage**: The passage states that the Milwaukee Bucks have won "one league title (1971)." In the context of professional sports, a "league title" typically refers to a championship title. Therefore, this indicates that the 

In [None]:
# Define the create_prompt function
def create_prompt(sample):
    bos_token = "<s>"
    eos_token = "</s>"

    passage = sample['passage']
    summary = sample['question']
    answer = sample['answer']
    reasoning = sample['reasoning']

    text_row = f"""[INST] Anda adalah pakar dalam mengesan ketidakkonsistenan fakta dan halusinasi. Anda akan diberi satu dokumen dan satu soalan/kenyataan. Baca
              dokumen dan soalan/kenyataan yang diberikan dengan teliti dan kenal pasti Ketidakkonsistenan Fakta (iaitu mana-mana soalan/kenyataan yang
              tidak disokong atau bercanggah dengan maklumat dalam dokumen).

              Anda perlu memilih antara dua pilihan berikut:
              - Tidak Konsisten dengan Fakta: Jika mana-mana soalan/kenyataan tidak disokong, terjawab atau bercanggah dengan dokumen, labelkannya sebagai 0.
              - Konsisten dengan Fakta: Jika semua soalan/kenyataan disokong/terjawab oleh dokumen, labelkannya sebagai 1.

              Dokumen: {passage}
              Soalan/Kenyataan: {summary}

              Sediakan penjelasan langkah demi langkah untuk pilihan konsistenan berdasarkan Dokumen dan Soalan/Kenyataan yang diberikan.
              Kembalikan jawapan (penjelasan dan konsisten/tak konsisten) dalam format JSON. Sebagai contoh: {{'reasoning': '...', 'consistency': 1}} atau {{'reasoning': '...', 'consistency': 0}}[/INST]"""

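    # Note: `reasoning` is interpolated without quotes or escaping, so the
    # target is JSON-like rather than strictly valid JSON.
    answer_row = f"""{{"reasoning": {reasoning}, "consistency": {answer}}}"""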

    sample["prompt"] = bos_token + text_row
    sample["completion"] = answer_row + eos_token

    return sample
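
Because `answer_row` inserts the reasoning text without quoting or escaping it, the completion is JSON-like rather than strictly parseable JSON. If strictly valid targets are wanted, one option is to serialise with `json.dumps`. This is an assumption on our part and not what this run used; `build_completion` is a hypothetical helper, and changing the target format would require retraining.

```python
import json


def build_completion(reasoning, answer, eos_token="</s>"):
    """Serialise the target as valid JSON, escaping quotes and newlines."""
    payload = {"reasoning": reasoning, "consistency": int(answer)}
    # ensure_ascii=False keeps non-ASCII characters readable instead of \u escapes.
    return json.dumps(payload, ensure_ascii=False) + eos_token


completion = build_completion(
    'The passage says "one league title (1971)".\nSo the answer is yes.', 1
)
# Strip the end-of-sequence token before parsing the JSON back.
parsed = json.loads(completion[: -len("</s>")])
print(parsed["consistency"])
```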

In [None]:
dataset_instruct_format_train = dataset_train.shuffle(seed=42).map(create_prompt, remove_columns=['passage','question','answer','language', 'split'])

# print a random sample (randint is inclusive at both ends, hence the -1)
dataset_instruct_format_train[randint(0, len(dataset_instruct_format_train) - 1)]

{'reasoning': 'Langkah-langkah untuk menentukan konsistensi fakta antara pernyataan dan petikan adalah seperti berikut:\n\n1. **Menganalisis Petikan**: Petikan menyatakan bahawa Asia Tenggara adalah subregion Asia yang terletak di selatan China, timur India, barat New Guinea, dan utara Australia. Ia juga menyebut sempadan Asia Tenggara dengan kawasan lain, termasuk Asia Selatan yang merangkumi India.\n\n2. **Memahami Sempadan Geografi**: Dalam petikan, dinyatakan bahawa Asia Tenggara bersempadan dengan Asia Selatan. Ini menunjukkan bahawa Asia Tenggara dan Asia Selatan adalah dua kawasan yang berbeza.\n\n3. **Menilai Pernyataan**: Pernyataan yang diberikan adalah "adakah india sebahagian daripada asia tenggara". Untuk menjawab soalan ini, kita perlu melihat sama ada India termasuk dalam kawasan Asia Tenggara.\n\n4. **Fakta Mengenai India**: India terletak di Asia Selatan, bukan di Asia Tenggara. Oleh itu, India tidak boleh dianggap sebagai sebahagian daripada Asia Tenggara.\n\n5. **Kes

### 2.0 Prepare the configuration for training the LLM
For background, see [Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA](https://huggingface.co/blog/4bit-transformers-bitsandbytes).
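
As a back-of-the-envelope illustration of why 4-bit loading matters, the sketch below estimates the storage needed for the decoder's linear-layer weights at 16-bit and 4-bit precision. The layer shapes are taken from the `print(base_model)` output later in this notebook. The estimate ignores embeddings, norms, quantization constants and activations, so treat it as an approximation only.

```python
# (in_features, out_features) of the linear layers in one decoder layer,
# as reported by print(base_model).
LINEAR_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
NUM_LAYERS = 32

linear_params = NUM_LAYERS * sum(i * o for i, o in LINEAR_SHAPES.values())


def weight_gib(num_params, bits_per_param):
    """Storage for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


print(f"linear parameters: {linear_params:,}")
print(f"16-bit: {weight_gib(linear_params, 16):.2f} GiB")
print(f" 4-bit: {weight_gib(linear_params, 4):.2f} GiB")
```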

In [None]:
model_id = "mesolitica/malaysian-mistral-7b-32k-instructions-v4"
new_model = "malaysian-mistral-qlora-7b-32k-instructions-llmasajudge" #set the name of the new model

In [None]:
################################################################################
# QLoRA parameters
################################################################################

# LoRA attention dimension
lora_r = 64

# Alpha parameter for LoRA scaling
lora_alpha = 16

# Dropout probability for LoRA layers
lora_dropout = 0.1

################################################################################
# bitsandbytes parameters
################################################################################

# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "bfloat16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = True


################################################################################
# TrainingArguments parameters
################################################################################

# Output directory where the model predictions and checkpoints will be stored
output_dir = "./results"

# Number of training epochs
num_train_epochs = 1

# Enable fp16/bf16 training (set bf16 to True with an A100)
fp16 = False
bf16 = False

# Batch size per GPU for training
per_device_train_batch_size = 8

# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = 1

# Enable gradient checkpointing
gradient_checkpointing = True

# Maximum gradient norm (gradient clipping)
max_grad_norm = 0.3

# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4

# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001

# Optimizer to use
optim = "paged_adamw_32bit"

# Learning rate schedule (constant a bit better than cosine)
lr_scheduler_type = "constant"

# Number of training steps (overrides num_train_epochs)
max_steps = -1

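# Ratio of steps for a linear warmup (from 0 to learning rate)
# Note: the "constant" scheduler applies no warmup, so this value has no effect
# here; "constant_with_warmup" would be needed for it to apply
warmup_ratio = 0.03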

# Group sequences into batches with same length
# Saves memory and speeds up training considerably
group_by_length = True

# Save a checkpoint every X update steps
save_steps = 300

# Log every X update steps
logging_steps = 20

################################################################################
# SFT parameters
################################################################################

# Maximum sequence length to use
max_seq_length = 32768

# Maximum batch size
# dataset_batch_size = 96

# Pack multiple short examples in the same input sequence to increase efficiency
packing = False

# Load the entire model on the GPU 0
device_map = {"": 0}
#device_map = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [None]:
# Load the base model with QLoRA configuration
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map=device_map
)

base_model.config.use_cache = False
base_model.config.pretraining_tp = 1

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = 'right'


In [None]:
print(base_model)

MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x MistralDecoderLayer(
        (self_attn): MistralSdpaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): MistralRotaryEmbedding()
        )
        (mlp): MistralMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): MistralRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): MistralRMSNo

In [None]:
def find_all_linear_names(model):
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, bnb.nn.Linear4bit):
            names = name.split(".")
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])

    if "lm_head" in lora_module_names:  # needed for 16-bit
        lora_module_names.remove("lm_head")
    return list(lora_module_names)

In [None]:
# get lora target modules
modules = find_all_linear_names(base_model)

In [None]:
print(modules)

['v_proj', 'up_proj', 'gate_proj', 'k_proj', 'o_proj', 'down_proj', 'q_proj']


Run inference with the base model before fine-tuning, as a baseline.

In [None]:
eval_prompt = create_prompt(dataset_train[0])["prompt"]
print(eval_prompt)

<s>[INST] Anda adalah pakar dalam mengesan ketidakkonsistenan fakta dan halusinasi. Anda akan diberi satu dokumen dan satu soalan/kenyataan. Baca
              dokumen dan soalan/kenyataan yang diberikan dengan teliti dan kenal pasti Ketidakkonsistenan Fakta (iaitu mana-mana soalan/kenyataan yang
              tidak disokong atau bercanggah dengan maklumat dalam dokumen).

              Anda perlu memilih antara dua pilihan berikut:
              - Tidak Konsisten dengan Fakta: Jika mana-mana soalan/kenyataan tidak disokong, terjawab atau bercanggah dengan dokumen, labelkannya sebagai 0.
              - Konsisten dengan Fakta: Jika semua soalan/kenyataan disokong/terjawab oleh dokumen, labelkannya sebagai 1.

              Dokumen: The Bucks have won one league title (1971), two conference titles (1971 and 1974), and 13 division titles (1971--1974, 1976, 1980--1986, 2001). They have featured such notable players as Kareem Abdul-Jabbar, Sidney Moncrief, Oscar Robertson, Bob Dandridge, B

In [None]:
model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

base_model.eval()
with torch.no_grad():
    # pad_token_id=2 is the id of the </s> token
    generated = base_model.generate(**model_input, max_new_tokens=1024, pad_token_id=2)
    print(tokenizer.decode(generated[0], skip_special_tokens=True))

[INST] Anda adalah pakar dalam mengesan ketidakkonsistenan fakta dan halusinasi. Anda akan diberi satu dokumen dan satu soalan/kenyataan. Baca
              dokumen dan soalan/kenyataan yang diberikan dengan teliti dan kenal pasti Ketidakkonsistenan Fakta (iaitu mana-mana soalan/kenyataan yang
              tidak disokong atau bercanggah dengan maklumat dalam dokumen).

              Anda perlu memilih antara dua pilihan berikut:
              - Tidak Konsisten dengan Fakta: Jika mana-mana soalan/kenyataan tidak disokong, terjawab atau bercanggah dengan dokumen, labelkannya sebagai 0.
              - Konsisten dengan Fakta: Jika semua soalan/kenyataan disokong/terjawab oleh dokumen, labelkannya sebagai 1.

              Dokumen: The Bucks have won one league title (1971), two conference titles (1971 and 1974), and 13 division titles (1971--1974, 1976, 1980--1986, 2001). They have featured such notable players as Kareem Abdul-Jabbar, Sidney Moncrief, Oscar Robertson, Bob Dandridge, Bob 

### 3.0 Train the LLM

We train on completions only; see the [TRL SFT Trainer documentation](https://huggingface.co/docs/trl/en/sft_trainer).

In [None]:
# Set LoRA configuration
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    bias="none",
    task_type="CAUSAL_LM",
)

# Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    report_to="wandb",  # enable logging to W&B
    # run_name=f'{new_model} + v5',  # name of the W&B run (optional)
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    gradient_checkpointing=gradient_checkpointing,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    # max_steps=1000, # the total number of training steps to perform
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
)
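
For intuition about the size of this adapter: LoRA adds two low-rank matrices to each targeted linear layer, with shapes `(r, in_features)` and `(out_features, r)`, and scales their product by `lora_alpha / r`. The sketch below counts the resulting trainable parameters using the layer shapes from the `print(base_model)` output above. It is plain arithmetic, not a call into `peft`.

```python
lora_r, lora_alpha, num_layers = 64, 16, 32

# (in_features, out_features) of each targeted module in one decoder layer.
target_shapes = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}


def lora_params(in_features, out_features, r):
    """Parameters in lora_A (r x in) plus lora_B (out x r)."""
    return r * (in_features + out_features)


trainable = num_layers * sum(
    lora_params(i, o, lora_r) for i, o in target_shapes.values()
)
scaling = lora_alpha / lora_r

print(f"trainable LoRA parameters: {trainable:,}")
print(f"LoRA scaling factor: {scaling}")
```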

In [None]:
def formatting_prompts_func(example):
    output_texts = []
    for i in range(len(example['prompt'])):
        text = f"{example['prompt'][i]}\n\n ### Jawapan: {example['completion'][i]}"
        output_texts.append(text)
    return output_texts

response_template = "### Jawapan:"
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)
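
Conceptually, the completion-only collator masks every token up to and including the response template, so that only the completion contributes to the loss. The following is a simplified sketch of that idea on plain lists of token ids, not the TRL implementation. `mask_prompt_labels` and the token ids are made up for illustration, and the sketch matches the first occurrence of the template.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def mask_prompt_labels(input_ids, template_ids):
    """Copy input_ids to labels, masking everything through the response template."""
    labels = list(input_ids)
    n = len(template_ids)
    for start in range(len(input_ids) - n + 1):
        if input_ids[start:start + n] == template_ids:
            end = start + n
            labels[:end] = [IGNORE_INDEX] * end
            return labels
    # Template not found: mask the whole example so it adds nothing to the loss.
    return [IGNORE_INDEX] * len(input_ids)


# Hypothetical ids: prompt = [1, 10, 11], template = [7, 8], completion = [20, 21, 2]
labels = mask_prompt_labels([1, 10, 11, 7, 8, 20, 21, 2], [7, 8])
print(labels)
```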

In [None]:
# Initialize the SFTTrainer for fine-tuning
trainer = SFTTrainer(
    model=base_model,
    train_dataset=dataset_instruct_format_train,
    formatting_func=formatting_prompts_func,
    data_collator=collator,
    peft_config=peft_config,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing
)


Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.


In [None]:
# Start the training process
trainer.train()

wandb: Currently logged in as: adzhar-faiq. Use `wandb login --relogin` to force relogin

Step,Training Loss
20,0.3238
40,0.4533
60,0.3349
80,0.3109
100,0.4196
120,0.221
140,0.3834
160,0.3319
180,0.2517
200,0.3937



TrainOutput(global_step=2357, training_loss=0.27187459419276383, metrics={'train_runtime': 14072.4905, 'train_samples_per_second': 1.34, 'train_steps_per_second': 0.167, 'total_flos': 9.064403773172122e+17, 'train_loss': 0.27187459419276383, 'epoch': 1.0})
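
The figures reported in `TrainOutput` can be sanity-checked with simple arithmetic from the dataset size, the batch settings and the runtime shown above. This assumes a single GPU and that the final, smaller batch is kept rather than dropped.

```python
import math

num_examples = 18851
per_device_train_batch_size = 8
gradient_accumulation_steps = 1
train_runtime_s = 14072.4905

# One optimizer step consumes batch_size * accumulation_steps examples.
steps_per_epoch = math.ceil(
    num_examples / (per_device_train_batch_size * gradient_accumulation_steps)
)
samples_per_second = num_examples / train_runtime_s
steps_per_second = steps_per_epoch / train_runtime_s
runtime_hours = train_runtime_s / 3600

print(steps_per_epoch)               # matches global_step
print(round(samples_per_second, 2))  # matches train_samples_per_second
print(round(steps_per_second, 3))    # matches train_steps_per_second
print(round(runtime_hours, 2))
```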

In [None]:
wandb.finish()

Run history: sparkline charts for train/epoch, train/global_step, train/grad_norm, train/learning_rate and train/loss (not reproduced here).

Run summary:
metric,value
total_flos,9.064403773172122e+17
train/epoch,1.0
train/global_step,2357.0
train/grad_norm,0.22363
train/learning_rate,0.0002
train/loss,0.3057
train_loss,0.27187
train_runtime,14072.4905
train_samples_per_second,1.34
train_steps_per_second,0.167


In [None]:
# Save the fine-tuned model
trainer.model.save_pretrained(new_model)

In [None]:
checkpoint = get_last_checkpoint('./results')
checkpoint

'./results/checkpoint-2357'

In [None]:
model = AutoModelForCausalLM.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [None]:
model.push_to_hub("loraadapter-malaysian-mistral-llmasajudge-v3", safe_serialization = True)
tokenizer.push_to_hub("loraadapter-malaysian-mistral-llmasajudge-v3", safe_serialization = True)

### 4.0 Merge the trained QLoRA adapter into the base model

In [None]:
base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    device_map=device_map
)

In [None]:
base_model

MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x MistralDecoderLayer(
        (self_attn): MistralSdpaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): MistralRotaryEmbedding()
        )
        (mlp): MistralMLP(
          (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): MistralRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): MistralRMSNorm((4096,), eps=1e-05)
     

In [None]:
merged_model = PeftModel.from_pretrained(base_model, new_model)

In [None]:
merged_model

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): MistralForCausalLM(
      (model): MistralModel(
        (embed_tokens): Embedding(32000, 4096)
        (layers): ModuleList(
          (0-31): 32 x MistralDecoderLayer(
            (self_attn): MistralSdpaAttention(
              (q_proj): lora.Linear(
                (base_layer): Linear(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=64, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=64, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): l

In [None]:
merged_model = merged_model.merge_and_unload()
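
`merge_and_unload()` folds the adapter into the base weights, so the result is a plain model with no LoRA modules. In essence, each adapted weight becomes `W + (lora_alpha / r) * B @ A`. The toy example below illustrates that identity with small hand-written matrices in pure Python. It is a sketch of the idea, not how `peft` implements it, and all values are invented.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, lora_alpha, r):
    """Return W + (lora_alpha / r) * B @ A."""
    scaling = lora_alpha / r
    delta = matmul(lora_b, lora_a)  # (out, r) @ (r, in) -> (out, in)
    return [[w + scaling * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]  # out=2, in=3
A = [[1.0, 2.0, 3.0]]                   # r=1, in=3
B = [[0.5], [-1.0]]                     # out=2, r=1
x = [[1.0], [1.0], [1.0]]               # input as a column vector

merged = merge_lora(W, A, B, lora_alpha=2, r=1)
merged_out = matmul(merged, x)

# The unmerged path: base output plus the scaled adapter output.
adapter_out = [[base[0] + 2.0 * lora[0]]
               for base, lora in zip(matmul(W, x), matmul(B, matmul(A, x)))]
print(merged_out, adapter_out)
```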

In [None]:
merged_model

MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x MistralDecoderLayer(
        (self_attn): MistralSdpaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): MistralRotaryEmbedding()
        )
        (mlp): MistralMLP(
          (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): MistralRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): MistralRMSNorm((4096,), eps=1e-05)
     

### 5.0 Upload model to HuggingFace

In [None]:
merged_model.push_to_hub("malaysian-mistral-llmasajudge-v3", safe_serialization = True)

CommitInfo(commit_url='https://huggingface.co/wanadzhar913/malaysian-mistral-llmasajudge-v3/commit/19b1ee91d6ee18bf6b6b25a0780acd7814b2595a', commit_message='Upload MistralForCausalLM', commit_description='', oid='19b1ee91d6ee18bf6b6b25a0780acd7814b2595a', pr_url=None, pr_revision=None, pr_num=None)

In [None]:
tokenizer.push_to_hub("malaysian-mistral-llmasajudge-v3", safe_serialization = True)

In [None]:
# test the model
eval_prompt = create_prompt(dataset_train[222])["prompt"]

In [None]:
dataset_train[222]['answer']

1

In [None]:
eval_prompt

"<s>[INST] Anda adalah pakar dalam mengesan ketidakkonsistenan fakta dan halusinasi. Anda akan diberi satu dokumen dan satu soalan/kenyataan. Baca\n              dokumen dan soalan/kenyataan yang diberikan dengan teliti dan kenal pasti Ketidakkonsistenan Fakta (iaitu mana-mana soalan/kenyataan yang\n              tidak disokong atau bercanggah dengan maklumat dalam dokumen).\n\n              Anda perlu memilih antara dua pilihan berikut:\n              - Tidak Konsisten dengan Fakta: Jika mana-mana soalan/kenyataan tidak disokong, terjawab atau bercanggah dengan dokumen, labelkannya sebagai 0.\n              - Konsisten dengan Fakta: Jika semua soalan/kenyataan disokong/terjawab oleh dokumen, labelkannya sebagai 1.\n\n              Dokumen: Sperm count, or sperm concentration to avoid confusion with total sperm count, measures the concentration of sperm in a man's ejaculate, distinguished from total sperm count, which is the sperm count multiplied with volume. Over 15 million sperm per

In [None]:
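# Tokenize the prompt and move the tensors to the GPU
model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

merged_model.eval()
with torch.no_grad():
    # pad_token_id=2 is the id of </s> (EOS) in the standard Mistral tokenizer
    generated_ids = merged_model.generate(**model_input, max_new_tokens=512, pad_token_id=2)

# Decode the full sequence (prompt followed by the completion), dropping special tokens
output = tokenizer.decode(generated_ids[0], skip_special_tokens=True)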

In [None]:
output

'[INST] Anda adalah pakar dalam mengesan ketidakkonsistenan fakta dan halusinasi. Anda akan diberi satu dokumen dan satu soalan/kenyataan. Baca\n              dokumen dan soalan/kenyataan yang diberikan dengan teliti dan kenal pasti Ketidakkonsistenan Fakta (iaitu mana-mana soalan/kenyataan yang\n              tidak disokong atau bercanggah dengan maklumat dalam dokumen).\n\n              Anda perlu memilih antara dua pilihan berikut:\n              - Tidak Konsisten dengan Fakta: Jika mana-mana soalan/kenyataan tidak disokong, terjawab atau bercanggah dengan dokumen, labelkannya sebagai 0.\n              - Konsisten dengan Fakta: Jika semua soalan/kenyataan disokong/terjawab oleh dokumen, labelkannya sebagai 1.\n\n              Dokumen: Sperm count, or sperm concentration to avoid confusion with total sperm count, measures the concentration of sperm in a man\'s ejaculate, distinguished from total sperm count, which is the sperm count multiplied with volume. Over 15 million sperm per m
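
The decoded `output` repeats the whole prompt before the model's answer, because `generate` returns the prompt tokens followed by the newly generated ones. Below is a minimal sketch for isolating the completion. It assumes the prompt ends with the `[/INST]` marker and that the passage itself does not contain that marker. The helper name `extract_completion` is hypothetical and not part of the original notebook.

```python
def extract_completion(decoded: str, marker: str = "[/INST]") -> str:
    """Return the text after the first instruction marker.

    If the marker is absent, the whole (stripped) string is returned.
    """
    _, found, tail = decoded.partition(marker)
    return tail.strip() if found else decoded.strip()


# Made-up decoded string for illustration; not an actual model output.
example = '[INST] Dokumen: ... Soalan/Kenyataan: ... [/INST] {"reasoning": "...", "consistency": 1}'
print(extract_completion(example))
```

In the notebook this would be called as `extract_completion(output)`, which leaves only the part that should contain the JSON answer.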

In [None]:
# Let's tweak the prompt
import json

def create_prompt_v2(sample):
    bos_token = "<s>"
    eos_token = "</s>"

    passage = sample['passage']
    summary = sample['question']
    answer = sample['answer']
    reasoning = sample['reasoning']

    # Note: the example at the end of the prompt uses single quotes, which is not strict JSON
    text_row = f"""[INST] Anda adalah pakar dalam mengesan ketidakkonsistenan fakta dan halusinasi. Anda akan diberi satu dokumen dan satu soalan/kenyataan. Baca
              dokumen dan soalan/kenyataan yang diberikan dengan teliti dan kenal pasti Ketidakkonsistenan Fakta (iaitu mana-mana soalan/kenyataan yang
              tidak disokong atau bercanggah dengan maklumat dalam dokumen).

              Anda perlu memilih antara dua pilihan berikut:
              - Tidak Konsisten dengan Fakta: Jika mana-mana soalan/kenyataan tidak disokong, terjawab atau bercanggah dengan dokumen, labelkannya sebagai 0.
              - Konsisten dengan Fakta: Jika semua soalan/kenyataan disokong/terjawab oleh dokumen, labelkannya sebagai 1.

              Dokumen: {passage}
              Soalan/Kenyataan: {summary}

              Sediakan penjelasan langkah demi langkah untuk pilihan konsistenan berdasarkan Dokumen dan Soalan/Kenyataan yang diberikan.

              Kembalikan jawapan (penjelasan/'reasoning' dan kekonsistenan/'consistency') dalam format JSON. Letakkan penjelasan anda
              langkah demi langkah di dalam 'reasoning' JSON tersebut. Jawab dalam Bahasa Melayu.
              Sebagai contoh: {{'reasoning': 'Untuk menentukan sama ada pernyataan ...', 'consistency': 1}} atau {{'reasoning': 'Untuk menentukan sama ada pernyataan ...', 'consistency': 0}}[/INST]"""

    # json.dumps quotes and escapes the reasoning string so the completion is valid JSON;
    # ensure_ascii=False keeps any non-ASCII characters readable
    answer_row = json.dumps({"reasoning": reasoning, "consistency": answer}, ensure_ascii=False)

    sample["prompt"] = bos_token + text_row
    sample["completion"] = answer_row + eos_token

    return sample
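
Since the completion is meant to be a JSON object, it is worth checking that a target string actually parses before training or evaluating on it. Below is a minimal sketch that compares plain f-string interpolation with `json.dumps`. The sample values are made up and the helper `is_valid_json` is hypothetical.

```python
import json


def is_valid_json(text: str) -> bool:
    """Return True if `text` parses as JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Made-up values for illustration; not taken from the dataset.
reasoning = 'Dokumen menyatakan "ya", jadi kenyataan itu disokong.'
answer = 1

# Interpolating the raw string leaves the reasoning text unquoted.
interpolated = f'{{"reasoning": {reasoning}, "consistency": {answer}}}'

# json.dumps adds the quotes and escapes the inner double quotes.
serialized = json.dumps({"reasoning": reasoning, "consistency": answer}, ensure_ascii=False)

print(is_valid_json(interpolated))  # False
print(is_valid_json(serialized))    # True
```

A model finetuned on targets that do not parse is unlikely to produce output that parses, so a check like this is a cheap first diagnostic.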

In [None]:
# test the model
eval_prompt_v2 = create_prompt_v2(dataset_train[8000])["prompt"]

In [None]:
dataset_train[8000]['answer']

1

In [None]:
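# Tokenize the tweaked prompt and move the tensors to the GPU
model_input = tokenizer(eval_prompt_v2, return_tensors="pt").to("cuda")

merged_model.eval()
with torch.no_grad():
    # pad_token_id=2 is the id of </s> (EOS) in the standard Mistral tokenizer
    generated_ids_v2 = merged_model.generate(**model_input, max_new_tokens=512, pad_token_id=2)

# Decode the full sequence (prompt followed by the completion), dropping special tokens
output_v2 = tokenizer.decode(generated_ids_v2[0], skip_special_tokens=True)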

In [None]:
# The model still does not return valid JSON output
output_v2  # sample 8000

'[INST] Anda adalah pakar dalam mengesan ketidakkonsistenan fakta dan halusinasi. Anda akan diberi satu dokumen dan satu soalan/kenyataan. Baca\n              dokumen dan soalan/kenyataan yang diberikan dengan teliti dan kenal pasti Ketidakkonsistenan Fakta (iaitu mana-mana soalan/kenyataan yang\n              tidak disokong atau bercanggah dengan maklumat dalam dokumen).\n\n              Anda perlu memilih antara dua pilihan berikut:\n              - Tidak Konsisten dengan Fakta: Jika mana-mana soalan/kenyataan tidak disokong, terjawab atau bercanggah dengan dokumen, labelkannya sebagai 0.\n              - Konsisten dengan Fakta: Jika semua soalan/kenyataan disokong/terjawab oleh dokumen, labelkannya sebagai 1.\n\n              Dokumen: An application program (app or application for short) is a computer program designed to perform a group of coordinated functions, tasks, or activities for the benefit of the user. Examples of an application include a word processor, a spreadsheet, an a

In [None]:
# The model still does not return valid JSON output
output_v2  # sample 222

'[INST] Anda adalah pakar dalam mengesan ketidakkonsistenan fakta dan halusinasi. Anda akan diberi satu dokumen dan satu soalan/kenyataan. Baca\n              dokumen dan soalan/kenyataan yang diberikan dengan teliti dan kenal pasti Ketidakkonsistenan Fakta (iaitu mana-mana soalan/kenyataan yang\n              tidak disokong atau bercanggah dengan maklumat dalam dokumen).\n\n              Anda perlu memilih antara dua pilihan berikut:\n              - Tidak Konsisten dengan Fakta: Jika mana-mana soalan/kenyataan tidak disokong, terjawab atau bercanggah dengan dokumen, labelkannya sebagai 0.\n              - Konsisten dengan Fakta: Jika semua soalan/kenyataan disokong/terjawab oleh dokumen, labelkannya sebagai 1.\n\n              Dokumen: Sperm count, or sperm concentration to avoid confusion with total sperm count, measures the concentration of sperm in a man\'s ejaculate, distinguished from total sperm count, which is the sperm count multiplied with volume. Over 15 million sperm per m
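
When the reply is not strict JSON, a tolerant parser can still recover the label in many cases. The sketch below assumes it is given only the completion (the text after `[/INST]`), that the reply contains one dict-like span, and that it may use single quotes as in the prompt's example. `parse_judgement` is a hypothetical helper, and falling back to `ast.literal_eval` is a choice made here for illustration, not something the original notebook does.

```python
import ast
import json
import re


def parse_judgement(completion: str):
    """Try to recover a dict with a 'consistency' key from model text.

    Returns the parsed dict, or None if nothing usable is found.
    """
    # Grab the span from the first '{' to the last '}'.
    match = re.search(r"\{.*\}", completion, flags=re.DOTALL)
    if match is None:
        return None
    candidate = match.group(0)

    # Strict JSON first, then Python-literal syntax (handles single quotes).
    for parser in (json.loads, ast.literal_eval):
        try:
            parsed = parser(candidate)
        except (ValueError, SyntaxError, TypeError):
            continue
        if isinstance(parsed, dict) and "consistency" in parsed:
            return parsed
    return None


print(parse_judgement('{"reasoning": "ok", "consistency": 1}'))
print(parse_judgement("Jawapan: {'reasoning': 'ok', 'consistency': 0}"))
print(parse_judgement("tiada JSON di sini"))
```

Returning `None` instead of raising makes it straightforward to count how many replies fail to parse across a set of samples.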