<a href="https://www.kaggle.com/code/cemalemrealbayrak/gemma-2-2b-qlora-turkish-fine-tuning-updated?scriptVersionId=215660908" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Gemma 2-2B QLoRA Turkish Fine-Tune Guide

This notebook is a guide to fine-tuning the Gemma 2-2B model with the QLoRA method on a Turkish dataset, prepared for Google's competition.

Prepared by [Emre Albayrak](https://linktr.ee/emre570)

## Contents

The following steps will be followed to fine-tune the model in this notebook:

1. Preparation of the necessary environment
2. Model preparation and application of the QLoRA method
3. Dataset operations (Pulling, editing)
4. Fine-tune process
5. Evaluation of the fine-tuned model

## 1. Preparation of the necessary environment

To carry out our operations, we need to install:

* Hugging Face Transformers for model operations,
* Hugging Face Datasets for dataset operations,
* bitsandbytes and Hugging Face PEFT for QLoRA operations,
* TRL and Accelerate for fine-tuning.

**NOTE:** For training, you must have a CUDA-capable GPU. I used 2x RTX 4090 GPUs for this notebook.

In a Jupyter notebook we can install the libraries with pip using the `!pip` command. The `--quiet` flag suppresses most of the installation output.

In [None]:
# Uncomment the line below and run this cell first.
#!pip install transformers accelerate datasets peft trl bitsandbytes --quiet

If you want to get the Gemma model through Hugging Face, you need to accept the license terms prepared by Google on the model's page. After access is granted, create an Access Token on Hugging Face, then run the cells below and continue.

**NOTE:** If you want to upload the fine-tuned model to Hugging Face, use a token with `write` permission. Otherwise, a token with `read` permission is enough.

You can also add this model directly to your notebook environment via Kaggle. You can learn the details here:

https://www.kaggle.com/models/google/gemma-2

In [1]:
"""from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
hf_token = user_secrets.get_secret("hf_token")"""

# Paste your own token here; never publish a real token in a shared notebook.
hf_token = 'YOUR_HF_TOKEN'

In [2]:
from huggingface_hub import login
login(token=hf_token)
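
A token pasted into a notebook cell is easy to leak when the notebook is shared. As an alternative, here is a minimal sketch that reads the token from an environment variable. The helper name `get_hf_token` is hypothetical, and using `HF_TOKEN` as the variable name is an assumption about how you configure your environment.

```python
import os

def get_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the Hugging Face token stored in an environment variable."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        # Fail early with a clear message instead of a confusing 401 later on.
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return token
```

The returned value can then be passed wherever `hf_token` is used in this notebook.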

## 2. Model Preparation
#### What is QLoRA?

QLoRA, or Quantized Low-Rank Adaptation, is a method for efficiently fine-tuning large-scale language models.

In the LoRA (Low-Rank Adaptation) approach, small low-rank matrices are added alongside selected weight matrices of the model, and only these new matrices are trained, so the model can be customized while changing far fewer parameters. QLoRA takes LoRA one step further by quantizing the frozen base model weights, i.e. storing them in fewer bits (4-bit in this notebook), while the low-rank matrices stay in higher precision and are trained as usual. This mainly reduces memory usage, which makes it possible to fine-tune on smaller GPUs.

In standard fine-tuning without such methods, all the weights of the model are updated, which involves a large number of parameters and increases both the computation time and the amount of memory required.

To put it very simply, let's take the example of a library:

Think of a language model as a big library. In this library there are many books (the weights of the model) and each book contains information about the language. Normally, when we want to teach something new, we need to replace or update all the books. This takes a lot of time and effort.

The LoRA method is like adding just a few new books to the library. These new books work together with the old books to make the library better understand and learn new information. But the old books remain the same, only a few new books are added.

The way these new books are added is special. They are specially designed to use the content of the old books more effectively. In this way, the library retains a wealth of knowledge and can quickly adapt to new information.

So the new books contain new information referenced from the old books.

In QLoRA, the old books are additionally stored in a compressed, 'pocket-size' edition. They hold nearly the same information but take up much less shelf space, so the library (model) fits into a smaller building (GPU memory), while the few new books are still written at full size.

Now, let's create the configs for LoRA and quantization. The `r` parameter sets the rank (size) of the added matrices and `lora_alpha` scales their contribution; together they influence how strongly the model adapts to new information.

In [3]:
from peft import LoraConfig


lora_config = LoraConfig(
    r=128,
    lora_alpha=256,
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
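
To make `r` and `lora_alpha` more concrete, here is a rough back-of-the-envelope sketch in plain Python. The layer shape (2304 x 2304) is an illustrative assumption and is not read from the model config. The scaling factor `lora_alpha / r` is the one used in the standard LoRA formulation.

```python
# Illustrative numbers only: d_in and d_out are assumed, not read from the model.
d_in, d_out = 2304, 2304
r, lora_alpha = 128, 256  # values from lora_config above

full_params = d_in * d_out        # weights touched by full fine-tuning of this layer
lora_params = r * (d_in + d_out)  # A is (r x d_in), B is (d_out x r)
scaling = lora_alpha / r          # factor applied to the low-rank update B @ A

print(f"full: {full_params:,}  lora: {lora_params:,}")
print(f"trainable share: {lora_params / full_params:.1%}, scaling: {scaling}")
```

A smaller `r` shrinks the number of trainable parameters further, at the price of a less expressive update.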

In [4]:
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
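
To see why 4-bit loading matters, the sketch below estimates the memory needed just for the weights at different precisions. The parameter count (about 2.6 billion) is an approximation used for illustration, and the estimate ignores quantization constants, activations, gradients and optimizer state.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

approx_params = 2.6e9  # assumption: rough size of Gemma 2-2B

for name, bits in [("fp32", 32), ("bf16", 16), ("nf4 (4-bit)", 4)]:
    print(f"{name:>12}: ~{weight_memory_gb(approx_params, bits):.1f} GB")
```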

The configs are ready, so now let's load the model in quantized form. The `device_map="auto"` parameter places the model automatically on the available CUDA devices, if any.

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

#modelName = "/kaggle/input/gemma-2/transformers/gemma-2-2b/2/"
modelName = "google/gemma-2-2b"

tokenizer = AutoTokenizer.from_pretrained(modelName, token=hf_token)
model = AutoModelForCausalLM.from_pretrained(modelName, 
                                             quantization_config=bnb_config, 
                                             device_map="auto",
                                             token=hf_token)

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

## 3. Dataset Operations

We got the model and quantized it. Now let's move on to dataset operations. The dataset is a combination of different Turkish datasets. You can have a look at it if you want:

https://huggingface.co/datasets/cenfis/alpaca-turkish-combined

In [6]:
from datasets import load_dataset
dataset = load_dataset("myzens/alpaca-turkish-combined", split="train")
dataset, dataset[0]

(Dataset({
     features: ['input', 'output', 'instruction'],
     num_rows: 82353
 }),
 {'input': '',
  'output': "Fransa'nın başkenti Paris'tir.",
  'instruction': "Fransa'nın başkenti nedir?"})

Language models usually use a prompt template. Since the dataset we compiled earlier follows the Alpaca prompt template, we will modify it slightly so that it matches the format described in Google's Gemma formatting guide (linked below).

The EOS (end-of-sequence) token is important because, without it, the model may not learn when to stop and can keep generating indefinitely. Since the special tokens are already available in the tokenizer, we just need to assign them to variables.

https://ai.google.dev/gemma/docs/formatting

In [7]:
gemma_prompt = """<start_of_turn>user
{}: {}<end_of_turn>
<start_of_turn>model
{}<end_of_turn>"""
gemma_prompt

'<start_of_turn>user\n{}: {}<end_of_turn>\n<start_of_turn>model\n{}<end_of_turn>'

In [8]:
eos_token = tokenizer.eos_token
pad_token = tokenizer.pad_token
tokenizer.padding_side = "right"

eos_token, pad_token

('<eos>', '<pad>')

Let's define a function that inserts each example's fields into the Gemma template, and apply it to the dataset:

In [9]:
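def formatting_prompts_func(examples):
    # Batched map: each field is a list with one entry per example.
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    # `input_text` avoids shadowing Python's built-in `input`.
    for instruction, input_text, output in zip(instructions, inputs, outputs):
        # Fill the Gemma template and append EOS so the model learns where to stop.
        text = gemma_prompt.format(instruction, input_text, output) + eos_token
        texts.append(text)
    return {"text": texts}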

In [10]:
dataset = dataset.map(formatting_prompts_func, batched = True)
dataset

Dataset({
    features: ['input', 'output', 'instruction', 'text'],
    num_rows: 82353
})

In [11]:
print(dataset["text"][2])

<start_of_turn>user
Tek farklı olanı belirleyin.: Twitter, Instagram, Telegram<end_of_turn>
<start_of_turn>model
Telegram<end_of_turn><eos>
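
Note that when `input` is empty, the template leaves a dangling `: ` after the instruction. A possible variant, sketched below, drops the separator in that case. The helper `build_prompt` is hypothetical and was not used for the training run in this notebook; the EOS string is hard-coded to match the tokenizer output shown earlier.

```python
EOS_TOKEN = "<eos>"  # matches tokenizer.eos_token printed above

def build_prompt(instruction: str, input_text: str, output: str) -> str:
    """Fill the Gemma turn template, skipping ': ' when there is no input."""
    if input_text.strip():
        user_turn = f"{instruction}: {input_text}"
    else:
        user_turn = instruction
    return (
        f"<start_of_turn>user\n{user_turn}<end_of_turn>\n"
        f"<start_of_turn>model\n{output}<end_of_turn>" + EOS_TOKEN
    )

print(build_prompt("Fransa'nın başkenti nedir?", "", "Fransa'nın başkenti Paris'tir."))
```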


Before passing the modified dataset to the trainer, let's tokenize it with our model's tokenizer.

In [12]:
def tokenize_function(examples):
    tokenized = tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=1024,
        return_tensors="pt"
    )
    # Labels are identical to input_ids for causal language modeling
    tokenized["labels"] = tokenized["input_ids"].clone()
    return tokenized

print("Tokenizing dataset...")
dataset = dataset.map(tokenize_function, batched=True, remove_columns=["text"])
print("Dataset tokenized:", dataset[0])

Tokenizing dataset...
Dataset tokenized: {'input': '', 'output': "Fransa'nın başkenti Paris'tir.", 'instruction': "Fransa'nın başkenti nedir?", 'input_ids': [2, 106, 1645, 108, 21727, 29541, 235303, 68749, 20074, 235273, 1077, 91278, 7846, 235248, 107, 108, 106, 2516, 108, 21727, 29541, 235303, 68749, 20074, 235273, 1077, 7127, 235303, 6651, 235265, 107, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

Looks good, let's continue to fine-tuning.
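
In the tokenized sample above, the labels are a plain copy of `input_ids`, so the padding positions (id 0) also carry labels. A common refinement, sketched here in plain Python as an assumption rather than what this notebook ran, is to replace the labels at padded positions with -100, the value that PyTorch's cross-entropy loss ignores by default. The helper `mask_padding` and the short token list are hypothetical.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

def mask_padding(input_ids, attention_mask):
    """Copy input_ids as labels, but ignore positions where the mask is 0."""
    return [
        token if keep == 1 else IGNORE_INDEX
        for token, keep in zip(input_ids, attention_mask)
    ]

# Shortened, made-up example: real tokens followed by padding.
input_ids      = [2, 106, 1645, 107, 1, 0, 0, 0]
attention_mask = [1, 1,   1,    1,   1, 0, 0, 0]
labels = mask_padding(input_ids, attention_mask)
print(labels)
```

With this change the loss would be computed only on real tokens, so the reported loss values would not be directly comparable to the ones in this notebook.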

## 4. Fine-tune Operations

SFTTrainer (Supervised Fine-Tuning Trainer) from the TRL library is a customized version of Trainer designed for supervised fine-tuning, i.e. training on prompt and response examples. Because we pass it a LoRA config, only the small adapter matrices are updated rather than the whole model.

Let's define the parameters we will use for fine-tuning and then start the process with SFTTrainer:

Before we continue, let's break down one thing: the numerical precision used for training.

In computing, precision refers to how numbers are represented in a computer. When we deal with floating-point numbers (like 1.23, 0.456), we can use different levels of precision to represent them. Higher precision means more accurate representations, but it also costs more memory and computation.

Regular float numbers, typically referred to as single-precision floating-point numbers (FP32), use 32 bits (4 bytes).

FP16, or 16-bit floating-point, is a compact way to store numbers. It uses 16 bits (binary digits) to represent a number.

BF16, short for bfloat16, is another 16-bit floating-point format. It also uses 16 bits but is structured differently.

BF16 is often preferred on recent NVIDIA GPUs (Ampere and newer, including the Ada Lovelace-based RTX 4090) because it keeps the same exponent range as FP32, which makes training less prone to overflow than FP16, at the cost of fewer mantissa bits. This choice saves memory while leveraging hardware support. I tried both fp16 and bf16 with the same configs, and the bf16 run finished up to 50 minutes faster.

![](https://images.contentstack.io/v3/assets/blt71da4c740e00faaa/blt40c8ab571893763a/65f370cc0c744dfa367c0793/EXX-blog-fp64-fp32-fp-16-5_(3).jpg?format=webp)
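
The trade-off between range and precision follows directly from how the bits are split. The sketch below derives the largest finite value and the spacing just above 1.0 from the number of exponent and mantissa bits of each format. The helper `float_format_stats` is a hypothetical name used only for illustration.

```python
def float_format_stats(exponent_bits: int, mantissa_bits: int):
    """Return (largest finite value, gap between 1.0 and the next number)."""
    bias = 2 ** (exponent_bits - 1) - 1
    max_value = (2 - 2.0 ** -mantissa_bits) * 2.0 ** bias
    epsilon = 2.0 ** -mantissa_bits
    return max_value, epsilon

formats = {
    "fp32": (8, 23),  # 1 sign + 8 exponent + 23 mantissa bits
    "fp16": (5, 10),  # 1 sign + 5 exponent + 10 mantissa bits
    "bf16": (8, 7),   # 1 sign + 8 exponent + 7 mantissa bits
}

for name, (e_bits, m_bits) in formats.items():
    max_value, epsilon = float_format_stats(e_bits, m_bits)
    print(f"{name}: max ~ {max_value:.3e}, epsilon = {epsilon:.3e}")
```

BF16 reaches roughly the same maximum as FP32, while FP16 tops out at 65504 but resolves finer steps near 1.0.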

In the first version I used `max_steps` instead of `num_train_epochs`. With `max_steps`, the model sees only a fixed number of batches; one epoch means the model has seen the whole dataset once. You can tweak these as you like; I initially kept training short to save time.

**Update:** I trained the model again for 3 epochs in order to get better results.

In [13]:
from transformers import TrainingArguments

train_args = TrainingArguments(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    warmup_steps=30,
    #max_steps=2500,
    num_train_epochs=3,
    gradient_checkpointing=True,
    learning_rate=1e-4,
    fp16=False,
    bf16=True,
    logging_steps=1000,
    optim="adamw_8bit",
    weight_decay=0.01,
    lr_scheduler_type="linear",
    output_dir="outputs",    
    report_to="none",
)
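
To relate steps and epochs, here is a rough estimate of the number of optimizer steps for the settings above. It assumes each step consumes `per_device_train_batch_size * gradient_accumulation_steps` samples from a single stream of batches; the exact count can differ with the number of devices and the library version.

```python
import math

num_rows = 82353           # dataset size shown earlier
per_device_batch = 4       # per_device_train_batch_size
grad_accum = 4             # gradient_accumulation_steps
epochs = 3                 # num_train_epochs

batches_per_epoch = math.ceil(num_rows / per_device_batch)
steps_per_epoch = batches_per_epoch // grad_accum
total_steps = steps_per_epoch * epochs

# How much of the dataset the earlier max_steps=2500 setting would cover.
epochs_at_2500_steps = 2500 * per_device_batch * grad_accum / num_rows

print(steps_per_epoch, total_steps, round(epochs_at_2500_steps, 2))
```

Under these assumptions the total agrees with the 15441 steps shown in the training progress bar, and 2500 steps correspond to roughly half an epoch.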

### Data Collators

A data collator is a utility in machine learning that processes and prepares batches of data before they are fed into the model during training or evaluation. Specifically, it takes raw samples (e.g., text, labels, etc.) from the dataset and transforms them into a format suitable for the model.
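
As a simplified illustration of the idea, the sketch below pads a batch of token-id lists to the length of the longest one. It works on plain Python lists and uses the hypothetical helper `pad_batch`; real collators return tensors and handle more cases. The pad id 0 matches the padding seen in the tokenized sample, and -100 is used for padded label positions.

```python
def pad_batch(sequences, pad_id=0, label_pad_id=-100):
    """Right-pad every sequence to the longest one in the batch."""
    longest = max(len(seq) for seq in sequences)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for seq in sequences:
        n_pad = longest - len(seq)
        batch["input_ids"].append(seq + [pad_id] * n_pad)
        batch["attention_mask"].append([1] * len(seq) + [0] * n_pad)
        batch["labels"].append(seq + [label_pad_id] * n_pad)
    return batch

batch = pad_batch([[2, 106, 1645], [2, 106, 1645, 108, 1]])
print(batch["input_ids"])
print(batch["attention_mask"])
```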

Let's start the Trainer and let the magic happen :)

In [15]:
from transformers import DataCollatorForSeq2Seq
from trl import SFTTrainer

# Define a data collator
data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=model,
    padding="longest",
    return_tensors="pt"
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=train_args,
    peft_config=lora_config,
    train_dataset=dataset,
    data_collator=data_collator,
)

trainer.train()

  0%|          | 0/15441 [00:00<?, ?it/s]

{'loss': 0.6427, 'grad_norm': 0.436360627412796, 'learning_rate': 9.370579456232561e-05, 'epoch': 0.19}

{'loss': 0.4624, 'grad_norm': 0.3329216241836548, 'learning_rate': 8.721692297709429e-05, 'epoch': 0.39}

{'loss': 0.4439, 'grad_norm': 0.31345340609550476, 'learning_rate': 8.072805139186296e-05, 'epoch': 0.58}

{'loss': 0.4364, 'grad_norm': 0.31639137864112854, 'learning_rate': 7.423917980663163e-05, 'epoch': 0.78}

{'loss': 0.4258, 'grad_norm': 0.31105419993400574, 'learning_rate': 6.77503082214003e-05, 'epoch': 0.97}

{'loss': 0.3872, 'grad_norm': 0.462255597114563, 'learning_rate': 6.126143663616898e-05, 'epoch': 1.17}

{'loss': 0.382, 'grad_norm': 0.4504398703575134, 'learning_rate': 5.477256505093764e-05, 'epoch': 1.36}

{'loss': 0.379, 'grad_norm': 0.3415434658527374, 'learning_rate': 4.828369346570631e-05, 'epoch': 1.55}

{'loss': 0.3743, 'grad_norm': 0.36263391375541687, 'learning_rate': 4.1794821880474986e-05, 'epoch': 1.75}

{'loss': 0.3706, 'grad_norm': 0.5256677865982056, 'learning_rate': 3.5305950295243654e-05, 'epoch': 1.94}

{'loss': 0.3308, 'grad_norm': 0.3320375680923462, 'learning_rate': 2.881707871001233e-05, 'epoch': 2.14}

{'loss': 0.3143, 'grad_norm': 0.4995846152305603, 'learning_rate': 2.2328207124781e-05, 'epoch': 2.33}

{'loss': 0.3132, 'grad_norm': 0.4464154541492462, 'learning_rate': 1.5839335539549676e-05, 'epoch': 2.53}

{'loss': 0.3102, 'grad_norm': 0.3397451341152191, 'learning_rate': 9.350463954318345e-06, 'epoch': 2.72}

{'loss': 0.309, 'grad_norm': 0.324847936630249, 'learning_rate': 2.8615923690870157e-06, 'epoch': 2.91}

{'train_runtime': 69071.4772, 'train_samples_per_second': 3.577, 'train_steps_per_second': 0.224, 'train_loss': 0.38974324931645055, 'epoch': 3.0}


TrainOutput(global_step=15441, training_loss=0.38974324931645055, metrics={'train_runtime': 69071.4772, 'train_samples_per_second': 3.577, 'train_steps_per_second': 0.224, 'total_flos': 3.324812783608922e+18, 'train_loss': 0.38974324931645055, 'epoch': 2.9995628733789887})

In [16]:
trainer.save_model("gemma-2-2b-tr-3epoch")

In [None]:
# Or you can push the model to Hugging Face Hub
model.push_to_hub("gemma-2-2b-tr-3epoch", token=hf_token, private=True)
tokenizer.push_to_hub("gemma-2-2b-tr-3epoch", token=hf_token, private=True)

## 5. Model Evaluation

We trained our model, so let's test it.

I ran into some problems when evaluating in the same session, so the solution I found was to load the model again from the saved adapter.

If you don't have enough GPU VRAM, you can restart the notebook and start from here. Do not forget to save your fine-tuned model first.

In [1]:
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b")
model = PeftModel.from_pretrained(base_model, "emre570/gemma-2-2b-tr-3epoch").to("cuda")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b")

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

adapter_config.json:   0%|          | 0.00/792 [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/665M [00:00<?, ?B/s]

In [7]:
questions = ["Verilen konu ile ilgili bir şiir yaz: Bahar",
             "Bir zamanlar küçük bir köyde ",
             "Bir üçgenin iç açıları toplamı kaç derecedir?",
             "Mutluluk nedir? Mutluluk, "]

i = 1
for question in questions:
  inputs = tokenizer(question, return_tensors="pt").to("cuda")
  outputs = model.generate(**inputs, max_new_tokens=128,
                          do_sample=True,
                          temperature=1.0,
                          top_p=0.95,
                          top_k=50,
                          repetition_penalty=1.0)
  result = f"""
  Question {i}:
  {tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]}"""

  i+=1
  print(result)


  Question 1:
  Verilen konu ile ilgili bir şiir yaz: Bahar
model
Parlayan güneşin gücüyle,
Doğanın çiçek açtığı çağrı,
Hafif esinti hızla yaprakları fırlatır,
Serin havanın tatlı, tatlı kokusu.

Kelebekler güneşin doğuşunda dans eder,
Gökyüzünde güzelliği yansıtır,
Bulutları karıştırarak hızla geri çekilirler,
Çimenlerde tatlı bir koku açar.

Çimenler o kadar uzun ve süslenmiş,
Sanki canlı olarak parıldarlar,
Baharın neşesi her

  Question 2:
  Bir zamanlar küçük bir köyde 8 yaşındaki bir çocuk yaşarmış. O, ormanda yaşayan efsanevi bir yaratığa dert edildiği için köylüler tarafından korkuluyordu. Umutsuz bir girişimde, çocuk kaçmak için çok uzaklara gitmek zorunda kaldı ve köyü terk etti. Yıllarca seyahat ettikten sonra, çocuk nihayet ailesine kavuştu. Ancak köye geri dönmek için hala çok uzaktaydı ve ailesiyle yeniden bir araya gelme şansını kaçırdığı için pişmanlık duyuyordu.
model
8 yaşında

  Question 3:
  Bir üçgenin iç açıları toplamı kaç derecedir? Üçgenin kenar uzunlukları 3,
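
The prompts above are passed as raw text, while training used the Gemma turn template; the stray `model` lines in the outputs hint at that mismatch. A possible refinement, sketched below as an assumption and not something this notebook ran, wraps each question in the same template and leaves the model turn open. The helper `build_inference_prompt` is hypothetical.

```python
def build_inference_prompt(question: str) -> str:
    """Wrap a question in the Gemma turn template, ending at the model turn."""
    return (
        f"<start_of_turn>user\n{question}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )

questions = [
    "Verilen konu ile ilgili bir şiir yaz: Bahar",
    "Bir üçgenin iç açıları toplamı kaç derecedir?",
]
prompts = [build_inference_prompt(q) for q in questions]
print(prompts[0])
```

Each entry of `prompts` could then be given to the tokenizer in place of the raw question in the generation loop.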

## Conclusion (Updated)

As a native Turkish speaker, I did not find the model's outputs very good. This may be due to the dataset's quality or other factors. I believe our dataset is fine for most cases, and I think that with more steps or epochs we can get a better model with more reasonable answers. Maybe you can try this. To save time, I first trained for only 2500 steps. I may train for one epoch and upload it to the Hub sometime.

**Update:** I trained for 3 epochs, and you can see the results above. They got better.

If you made it this far, please don't forget to upvote this notebook. It is the result of one week of research. Have a great day :)