In [39]:
import torch
from trl import SFTTrainer
from datasets import load_dataset, concatenate_datasets
from transformers import TrainingArguments, TextStreamer
from unsloth import FastLanguageModel, is_bfloat16_supported
from evaluate import load
#
from dotenv import load_dotenv
import pandas as pd
from tqdm import tqdm
import os
import wandb
import random

In [1]:
#Set environment variables and device
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
if torch.cuda.is_available():
    print(f"Using GPU: {torch.cuda.get_device_name()}")
else:
    print(f"CUDA not found")


# ADD YOUR KEY IN THIS FILE
load_dotenv("all_keys.txt")

# Register HuggingFace
hf_token = os.getenv("HF_TOKEN")

# initizalie wandb with gradient info as well.
wandb_api_key = os.getenv("WANDB_API_KEY")
wandb.login(key=wandb_api_key)
wandb.init(project="llama_custom_towndata")

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


Using GPU: NVIDIA A100-SXM4-40GB


[34m[1mwandb[0m: Currently logged in as: [33merkara[0m. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /home/ubuntu/.netrc


# Instruction Fine-Tuning with Unsloth

I set out to create a niche dataset all about my tiny home town (not city), [Honaz](https://en.wikipedia.org/wiki/Honaz) in Turkey. This dataset was built entirely from three `Turkish` articles in the [DergiPark](https://dergipark.org.tr/) system. The idea here is simple: take rich, detailed information about a small town in Turkish (because that's the language of the original sources), and see if a model can actually learn from it. Why bother creating such a niche dataset in Turkish? Well, fine-tuning models on localized, specific datasets lets us test if they can adapt to truly unique and specialized topics. This is good because we can easily assess if the model really learns something. 
Here’s the plan:

1. Data Creation: The dataset creation process is detailed in another notebook, where I carefully compiled and cleaned the articles.
   
2. Fine-Tuning: Using the Unsloth library, I fine-tuned Llama-3.2-1B-Instruct on this dataset. The goal? To see if the model can genuinely learn and retain the knowledge in those articles.

3. Evaluation: I put the fine-tuned model to the test with a curated set of questions about Honaz, comparing it against the original model. This is the best way of testing to see if finetuning does anyhting good. In addition, I provide a script to evaluate the BLUE scores on the testing data as well.

Both the dataset and finetuned model can be accessed at:

Dataset: [https://huggingface.co/datasets/erdi28/honaz_instruction_dataset](https://huggingface.co/datasets/erdi28/honaz_instruction_dataset)

Fine-tuned model: [https://huggingface.co/erdi28/finetune_llama_honaz](https://huggingface.co/erdi28/finetune_llama_honaz)

## Dataset

You can find the details in my hub but here is what I simply did. I found three niche articles in Turkish about Honaz and translated them to Turkish via OpenAI APIs, more specifically(GPT4-4o-mini). Articles are about history and local vegitation of Honaz as well as a population exhance event that deeply effected the town in 1920s. Then created an instruction dataset from it based on carefully curated instructions. This process gave us around 1000 instruction-response pairs

Since 1K is not really sufficient to train an LLM, I added 5K more pairs from a public dataset and merged with mine to finalize the dataset. See below

In [3]:
honaz_dataset = load_dataset("erdi28/honaz_instruction_dataset")
honaz_train = honaz_dataset["train"]
honaz_train[0]

{'instruction': 'Answer a question based on the following content. The vegetation of Honaz Mountain and its surroundings generally consists of dry forests dominated by red pines at lower elevations and black pines at higher elevations. The northern slopes of the Honaz massif are influenced by the Mediterranean climate that penetrates along the Büyük Menderes valley, while the interior areas and southern slopes are under the influence of a continental climate. As a result, the vegetation on the northern and southern slopes of the massif differs. On the more humid northern slopes, a richer and more diverse maquis formation has developed, whereas on the southern slopes, a garigue formation consisting of only the most drought-resistant maquis species is prevalent.',
 'output': 'What types of vegetation are found on the northern and southern slopes of Honaz Mountain?\n\nThe northern slopes of Honaz Mountain feature a richer and more diverse maquis formation due to the influence of the Medit

In [4]:
# Load the Alpaca dataset
N = 5000
alpaca_dataset = load_dataset("mlabonne/FineTome-Alpaca-100k",split=f"train[:{N}]")
alpaca_train = alpaca_dataset.remove_columns(['score', 'source'])
alpaca_train[10]

{'instruction': 'Write Python code to solve the task:\nIn programming, hexadecimal notation is often used.\n\nIn hexadecimal notation, besides the ten digits 0, 1, ..., 9, the six letters `A`, `B`, `C`, `D`, `E` and `F` are used to represent the values 10, 11, 12, 13, 14 and 15, respectively.\n\nIn this problem, you are given two letters X and Y. Each X and Y is `A`, `B`, `C`, `D`, `E` or `F`.\n\nWhen X and Y are seen as hexadecimal numbers, which is larger?\n\nConstraints\n\n* Each X and Y is `A`, `B`, `C`, `D`, `E` or `F`.\n\nInput\n\nInput is given from Standard Input in the following format:\n\n\nX Y\n\n\nOutput\n\nIf X is smaller, print `<`; if Y is smaller, print `>`; if they are equal, print `=`.\n\nExamples\n\nInput\n\nA B\n\n\nOutput\n\n<\n\n\nInput\n\nE C\n\n\nOutput\n\n>\n\n\nInput\n\nF F\n\n\nOutput\n\n=',
 'output': "Step 1:  To compare two hexadecimal numbers represented by the letters X and Y, we need to understand the hexadecimal number system and how to compare hexadec

In [5]:
#combine both datassets
dataset = concatenate_datasets([honaz_train, alpaca_train])

## Model Tranining

Unsloth can train 2-5X faster with custom kernels. We will start with it. However since it does not support multi-gpu training, later on, we plan to switch to Axolotl. We will load the model with 4bit quantization to reduce memory usage.

In [6]:
max_seq_length = 2048
ref_model, tokenizer = FastLanguageModel.from_pretrained(model_name = "unsloth/Llama-3.2-1B-Instruct",
                                                     max_seq_length = max_seq_length,
                                                     dtype = None,                          # auto detection
                                                     load_in_4bit = True)                   # check if 

==((====))==  Unsloth 2024.12.4: Fast Llama patching. Transformers:4.46.3.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.381 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Let's go ahead and see if our baseline model knows something about our data. It is clear that it has no idea about it, which is what we want at this point.

In [7]:
# Define the Alpaca prompt template ( we dont have "Input" field)
alpaca_prompt = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{}

### Response:
{}"""

In [8]:
def generate_streaming_text(model, tokenizer, prompt, max_new_tokens=256, prompt_template=alpaca_prompt):
    """
    Generates text from a model with streaming output.
    """
    # format the input and set up the stremaer
    message = prompt_template.format(prompt, "")
    inputs = tokenizer([message], return_tensors="pt").to(device)
    text_streamer = TextStreamer(tokenizer)
    
    # Generate text with streaming
    _ = model.generate(
        **inputs, 
        streamer=text_streamer, 
        max_new_tokens=max_new_tokens, 
        use_cache=True
    )

# test
ref_model = FastLanguageModel.for_inference(ref_model)
prompt = "How did the population of Honaz evolve during the late 19th century?"
generate_streaming_text(ref_model, tokenizer, prompt, max_new_tokens=100)


<|begin_of_text|>Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
How did the population of Honaz evolve during the late 19th century?

### Response:
The population of Honaz, a small town in Afghanistan, likely grew significantly during the late 19th century due to an increase in agricultural production and trade. As Afghanistan was under British influence, there was an influx of British settlers and traders in the region. The British also established trade routes and missions in the area, which led to an expansion of local agriculture. This led to a surge in population as the local population grew to accommodate the increased agricultural production and trade. Additionally, the town's strategic


In [9]:
prompt = "What are the climatic influences on Honaz Mountain’s vegetation?"
generate_streaming_text(ref_model, tokenizer, prompt, device, max_new_tokens=100)

<|begin_of_text|>Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
What are the climatic influences on Honaz Mountain’s vegetation?

### Response:
The climatic influences on Honaz Mountain's vegetation are primarily influenced by temperature and precipitation. The mountain's high elevation, with an average elevation of 4,500 meters, results in cooler temperatures. The cooler temperatures, combined with the cloud cover and humidity, support a diverse range of vegetation, including alpine meadows and forests. The precipitation patterns are also crucial, with the mountain receiving significant snowfall during the winter months. This snowpack supports the growth of certain plant species that are adapted to


- We would like to use LORA for fine-tuning. Unsloth supports several accelerated options, see below. We target all linear layers in LLama model. Note that this is model spesific, depending on the arthitecture, we may need to adjust it. However, we are particularly interested in layers that are part of the attention mechanism (e.g., q_proj for query projection, k_proj for key projection, etc.) and feed-forward layers (up_proj, down_proj)
- `Gradient checkpointing` saves GPU memory by not storing all intermediate activations during the forward pass and instead recomputes them on-the-fly during the backward pass, trading memory for a bit more computation. Unsloth optimizes this further for super long sequences with some strategy to where to apply them.
- `LoftQ (LoRA with Quantization)` applies additional quantization optimizations *specifically to the LoRA* adapter weights during training, even if the main model is already loaded in 4-bit precision. In other words, matrices introduced by LoRA are stored in 4-bit.
- Note that some of these options (i.e use_gradient_checkpointing ) will **override TrainingArguments class** below.

In [10]:
model = FastLanguageModel.get_peft_model(
    ref_model,
    r = 32,            
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 32,
    lora_dropout = 0,         # dropout after adapter, "0" is optimized
    bias = "none",            # biases in the model remain frozen (not updated), "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    use_rslora = False,     # rank stabilized LoRA
    loftq_config = None,    # And LoftQ
    random_state = 1234,
)

Unsloth 2024.12.4 patched 16 layers with 16 QKV layers, 16 O layers and 16 MLP layers.


In [11]:
# Mapping function to format the dataset
EOS_TOKEN = tokenizer.eos_token
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    outputs = examples["output"]
    texts = []
    for instruction, output in zip(instructions, outputs):
        text = alpaca_prompt.format(instruction, output) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}

# Apply the mapping function to the dataset
dataset = dataset.map(formatting_prompts_func, batched=True)

In [12]:
idx = 0
print(dataset[0]['instruction'])
print("=====================================================")
print(dataset[0]['output'])
print("=====================================================")
print(dataset[0]['text'])

Answer a question based on the following content. The vegetation of Honaz Mountain and its surroundings generally consists of dry forests dominated by red pines at lower elevations and black pines at higher elevations. The northern slopes of the Honaz massif are influenced by the Mediterranean climate that penetrates along the Büyük Menderes valley, while the interior areas and southern slopes are under the influence of a continental climate. As a result, the vegetation on the northern and southern slopes of the massif differs. On the more humid northern slopes, a richer and more diverse maquis formation has developed, whereas on the southern slopes, a garigue formation consisting of only the most drought-resistant maquis species is prevalent.
What types of vegetation are found on the northern and southern slopes of Honaz Mountain?

The northern slopes of Honaz Mountain feature a richer and more diverse maquis formation due to the influence of the Mediterranean climate, while the south

In [13]:
# do train-test split
dataset = dataset.train_test_split(test_size=0.10)
print(dataset["train"].num_rows)
print(dataset["test"].num_rows)

5409
601


In [14]:
#wandb.watch(model, log="all", log_freq=2) % this slows down the tranining considerably  

This is the main driver function pretty much all transformers based supervised-fine tuning routines nowadays. The list of arguments are overwhelming but let's touch some of the interesting/less obvious ones:

-  **`dataset_num_proc`**: number of parallel processes(CPU cores) to use for preprocessing (i.e tokenization, formatting the input/output pairs, truncation etc.)
  
-  **`packing`**: combines multiple shorter examples into a single sequence (up to max_seq_length), reducing the amount of padding caused by short sequences. There is a nice [blog post](https://wandb.ai/capecape/alpaca_ft/reports/How-to-Fine-tune-an-LLM-Part-3-The-HuggingFace-Trainer--Vmlldzo1OTEyNjMy) you can check out. This is doing `ConstantLengthDataset` behind the scene.

-  **`per_device_train_batch_size`**: number of samples processed per GPU during training in each forward pass.

-  **`gradient_accumulation_steps`**: accumulates gradients over multiple steps before performing an optimizer update. Basically updates happen in every

                       effective batch size = per_device_train_batch_size * gradient_accumulation_steps

Other arguments are annoated below.

In [15]:
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset["train"],
    eval_dataset = dataset["test"],
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,                 # number of parallel proceses for data preprocessing
    packing = True,                       # Combine short sequences
    args = TrainingArguments(
         # Training hyperparameters
        num_train_epochs=3,               # Train for one epoch
        per_device_train_batch_size=2,    # Batch size per device during training
        per_device_eval_batch_size=2,     # Batch size per device during evaluation
        gradient_accumulation_steps=8,    # Accumulate gradients for larger effective batch size
        gradient_checkpointing=True,      # Save memory by recomputing activations in backprop
        
        # Optimization settings
        learning_rate = 3e-4,
        optim = "adamw_8bit",
        weight_decay = 0.01, 
        lr_scheduler_type = "linear",      
        warmup_steps=10,
        
        # Precision settings
        fp16 = not is_bfloat16_supported(),     # Disable FP16 precision (set True if supported)
        bf16 = is_bfloat16_supported(),         # Disable BF16 precision (use True on A100 GPUs)
        
       # Logging and checkpoints
        save_steps=100,                   # Save checkpoint every 100 steps
        save_total_limit=1,               # Keep only the most recent checkpoint
        logging_steps=20,                 # Log training progress every 10 steps
        eval_strategy="steps",            # Run evaluation at regular intervals, dont wait epochs
        eval_steps=10,                    # Evaluate in every such steps
        output_dir= "output",
        run_name="llama_fine_tune",
        report_to="wandb",                # Report metrics to Weights and Biases
        ),
)

In [16]:
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 1,185 | Num Epochs = 3
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 8
\        /    Total batch size = 16 | Total steps = 222
 "-____-"     Number of trainable parameters = 22,544,384


Step,Training Loss,Validation Loss
10,No log,1.295812
20,1.312000,1.199491
30,1.312000,1.137736
40,1.130500,1.100059
50,1.130500,1.069004
60,1.071000,1.041719
70,1.071000,1.017723
80,1.066500,1.004602
90,1.066500,0.98828
100,0.923900,0.967951


TrainOutput(global_step=222, training_loss=0.9518617015701156, metrics={'train_runtime': 639.9751, 'train_samples_per_second': 5.555, 'train_steps_per_second': 0.347, 'total_flos': 4.34344090927104e+16, 'train_loss': 0.9518617015701156, 'epoch': 2.9949409780775715})

## Inference and Testing

As we see from the responses below(compare with the baseline model above), the model learned great deal about Honaz. It is not making things up anymore. Also carefully inspect the answers from both model in the curated QA samples I created below.

In [19]:
model = FastLanguageModel.for_inference(model)
prompt = "How did the population of Honaz evolve during the late 19th century?"
generate_streaming_text(ref_model, tokenizer, prompt, device, max_new_tokens=100)


<|begin_of_text|>Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
How did the population of Honaz evolve during the late 19th century?

### Response:
The population of Honaz, as with many other areas in the Ottoman Empire, was influenced by a variety of factors, including economic opportunities, social and cultural ties, and the impact of external events such as the Greek population exchange after World War I. 

During the late 19th century, Honaz, like many other regions in the Ottoman Empire, experienced significant economic growth. The Ottoman Empire was characterized by a centralized economy, with extensive infrastructure and trade networks. This led to increased agricultural production,


In [20]:
prompt = "What are the climatic influences on Honaz Mountain’s vegetation?"
generate_streaming_text(ref_model, tokenizer, prompt, device, max_new_tokens=100)

<|begin_of_text|>Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
What are the climatic influences on Honaz Mountain’s vegetation?

### Response:
Honaz Mountain, located in the Aegean region of Turkey, experiences a Mediterranean climate characterized by mild winters and hot, dry summers. The climatic influences on its vegetation are primarily driven by temperature, precipitation, and vegetation patterns.

1. Temperature: The average temperature ranges from 12°C in winter to 26°C in summer, with spring and autumn experiencing mild variations. The temperature influences the distribution of plant species, with the majority being Mediterranean shrubs and trees.

2. Precipitation


**Note**: `Since I spent my 18 years there, I know the temperature values above are pretty accurate(:`

In [40]:
def compare_models(csv_path, model, ref_model, tokenizer, max_new_tokens=256, prompt_template=alpaca_prompt):
    """
    compare both models on QA data
    """

    # get QA dataset
    df = pd.read_csv(csv_path)
    model_answers = []
    ref_model_answers = []

    # Process each question 
    for index, row in tqdm(df.iterrows(), total=len(df), desc="Processing Questions"):
        #  Generate the response from the trained model
        question = row['question']
        message = prompt_template.format(question, "")
        inputs = tokenizer([message], return_tensors="pt").to(device)
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            use_cache=True
        )
        trained_response = tokenizer.decode(outputs[0], skip_special_tokens=True)
        trained_response = trained_response.split("### Response:")[1].strip() if "### Response:" in trained_response else trained_response
        model_answers.append(trained_response)

        # Generate the response from the reference model
        ref_outputs = ref_model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            use_cache=True
        )
        ref_response = tokenizer.decode(ref_outputs[0], skip_special_tokens=True)
        ref_response = ref_response.split("### Response:")[1].strip() if "### Response:" in ref_response else ref_response
        ref_model_answers.append(ref_response)

    # Append the responses
    df['finetuned_model'] = model_answers
    df['original_model'] = ref_model_answers

    return df


In [24]:
#load the base model again to be sure we start fresh--> I cannot confirm exacly but Unsloth may be modifying the base model in some way
ref_model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-1B-Instruct",
    max_seq_length=max_seq_length,
    dtype=None,
    load_in_4bit=True
)
ref_model = FastLanguageModel.for_inference(ref_model)


==((====))==  Unsloth 2024.12.4: Fast Llama patching. Transformers:4.46.3.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.381 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


In [25]:
csv_path = "honaz_questions_answers.csv"
df = compare_models(
    csv_path,
    model=model,  
    ref_model=ref_model,   
    tokenizer=tokenizer,
    max_new_tokens=200
)

Processing Questions: 100%|██████████| 9/9 [01:16<00:00,  8.49s/it]


In [26]:
def inspect_answers(df,idx=1):
    
    # inspect the first question-answer pair
    print(f"Instruction:\n\n{df.iloc[idx]['question']}")
    
    print("=========================================")
    print(f"Reference Answer:\n\n{df.iloc[idx]['answer']}")
    
    print("=========================================")
    print(f"Fine Tuned Model:\n\n{df.iloc[idx]['finetuned_model']}")

    print("=========================================")
    print(f"Original Model:\n\n{df.iloc[idx]['original_model']}")


In [27]:
inspect_answers(df=df,idx=0)

Instruction:

What roles did Honaz historically play in the economic, military, and cultural life of the region? Please provide specific examples.
Reference Answer:

Honaz served as a vital center for trade and military defense due to its strategic location in the Menderes basin. During the Byzantine period, it was a military stronghold overseeing critical routes. Culturally, it was notable for religious significance, with prominent structures like the Church of Archangel Michael.
Fine Tuned Model:

Honaz has been a significant economic, military, and cultural center in the region for centuries. Its strategic location on the Aegean Sea, coupled with its proximity to the Menderes Valley and the Çürüksu River, made it an important hub for trade and commerce.

Historically, Honaz served as a key military location, particularly during the Ottoman era. The city was strategically situated to control the passage of the Çürüksu River to the Aegean Sea, making it an essential location for the O

In [28]:
inspect_answers(df=df,idx=1)

Instruction:

How did the early 20th-century population exchange, involving specific agreements between communities, affect Honaz’s social and economic structure? Be specific about its consequences.
Reference Answer:

The population exchange (1924–1930) resettled 70 households in Honaz, replacing the departing Greek population. These new settlers introduced agricultural techniques, gradually rebuilt homes, and contributed to the homogenization of the town’s community structure.
Fine Tuned Model:

The early 20th-century population exchange, which involved specific agreements between communities in Turkey and Greece, had significant effects on Honaz's social and economic structure. The exchange was part of a broader effort to reduce ethnic and religious tensions in the region and to achieve demographic balance.

One of the primary consequences of the population exchange was the shift in the demographic makeup of Honaz. Historically, Honaz was predominantly Muslim, while the Greek populat

In [32]:
inspect_answers(df=df,idx=7)

Instruction:

How do the current ecological features of Honaz Mountain reflect historical climatic periods? Provide specific examples tied to changes over time.
Reference Answer:

The presence of Black Sea floristic elements in the northern valleys of Honaz Mountain indicates remnants of humid vegetation from the Pleistocene glacial period, reflecting adaptations to historical climatic conditions.
Fine Tuned Model:

Honaz Mountain, located in the Aegean Region of Turkey, has undergone significant ecological changes over the centuries due to shifts in climate. These changes are evident in the vegetation, soil composition, and other ecological features of the area.

One of the earliest changes is the increase in Mediterranean species. These species are adapted to the warmer and drier conditions of the Mediterranean climate. For example, the iconic Mediterranean olive tree (Olivia muscarellifolia) is now widespread in the region. This shift is linked to the warming of the climate, which h

In [37]:
# Save the whole model and push to HuggingFace for further usage
model.save_pretrained_merged("finetune_llama_honaz", tokenizer,save_method="merged_16bit")
model.push_to_hub("erdi28/finetune_llama_honaz", tokenizer,save_method="merged_16bit") 

Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 154.55 out of 216.26 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 16/16 [00:00<00:00, 115.34it/s]

Unsloth: Saving tokenizer...




 Done.
Done.


README.md:   0%|          | 0.00/595 [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/90.2M [00:00<?, ?B/s]

Saved model to https://huggingface.co/erdi28/finetune_llama_honaz


Now lets calculate the infamous metric BLEU(Bilingual Evaluation Understudy). It measures how well n-grams in the prediction match the reference. I am not a big fan of this metric or any other metric in LLMs since I prefer simple human evaluation, we can nevertheless see that the fine-tuned model does a way better job on testing data.

In [41]:
def compute_bleu_scores(model, tokenizer, test_data, alpaca_prompt, num_test_points=10, max_new_tokens=128):
    """
    Compute BLEU scores for a specified number of test data points using Unsloth's inference.
    """
    # Enable optimized inference for the model
    FastLanguageModel.for_inference(model)
    
    # Load BLEU metric
    bleu = load("bleu")
    bleu_scores = []

    # Randomly sample num_test_points from the test_data-
    total_data_points = len(test_data)
    sampled_indices = random.sample(range(total_data_points), min(num_test_points, total_data_points))
    sampled_test_data = test_data.select(sampled_indices)

    # Iterate over test data with progress bar
    for i in tqdm(range(len(sampled_test_data)), desc="Evaluating BLEU"):
        # Get the test example
        example = test_data[i]
        instruction = example["instruction"]
        reference_output = example["output"]

        # Prepare the input prompt
        input_text = alpaca_prompt.format(instruction, "")
        inputs = tokenizer([input_text], return_tensors="pt").to("cuda")

        # Generate predictions using Unsloth's fast inference
        with torch.no_grad():
            output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, use_cache=True)
        generated_output = tokenizer.decode(output_ids[0], skip_special_tokens=True)
        
        #organize the results
        predictions = [generated_output]
        references = [[reference_output]]

        # Compute BLEU score for the current example
        bleu_score = round(bleu.compute(predictions=predictions, references=references)["bleu"], 3)
        bleu_scores.append(bleu_score)

    # save
    average_bleu = sum(bleu_scores) / len(bleu_scores) if bleu_scores else 0

    return {
        "bleu_scores": bleu_scores,
        "average_bleu": average_bleu
    }


In [35]:
# Fine tuned model
num_test_points = 10

# Compute BLEU scores
results = compute_bleu_scores(
    model=model,
    tokenizer=tokenizer,
    test_data=dataset["test"],
    alpaca_prompt=alpaca_prompt,
    num_test_points=num_test_points,
    max_new_tokens=128
)

# Print results
#print("Individual BLEU Scores:", results["bleu_scores"])
print("Average BLEU Score:", results["average_bleu"])


Evaluating BLEU: 100%|██████████| 10/10 [00:31<00:00,  3.19s/it]

Average BLEU Score: 0.1694





In [36]:
# Base model
num_test_points = 10

# Compute BLEU scores
results = compute_bleu_scores(
    model=ref_model,
    tokenizer=tokenizer,
    test_data=dataset["test"],
    alpaca_prompt=alpaca_prompt,
    num_test_points=num_test_points,
    max_new_tokens=128
)

# Print results
#print("Individual BLEU Scores:", results["bleu_scores"])
print("Average BLEU Score:", results["average_bleu"])

Evaluating BLEU: 100%|██████████| 10/10 [00:23<00:00,  2.33s/it]

Average BLEU Score: 0.1167



