# Fine-tuning Gemma 2 with Hugging Face for Hungarian Literature

The [Google - Unlock Global Communication with Gemma][1] competition encourages participants to fine-tune [Gemma 2][2] for a specific language or cultural context. This notebook focuses on optimizing the model for translating Hungarian literature. The dataset used here is [Helsinki-NLP/opus_books][3], a collection of copyright-free books aligned by Andras Farkas in multiple languages. The dataset includes many different language pairings, not just English and Hungarian.

Readers are encouraged to use this notebook as a starting point for their own work: with minor changes to the code, you can swap the sentence pairing in the dataset for your own native language or explore the similar datasets listed below.

### Official resources for the Gemma 2 model:

* [google/gemma-2-2b-it Model Card][4]
* [Google - Gemma 2 is now available to researchers and developers][5]
* [Dadashi et al. - Towards Global Understanding – Advancing Multilingual AI with Gemma 2 and a 150K dollar Challenge][6]
* [Google - Tasks in spoken languages with Gemma][7]

### Cookbooks for fine-tuning LLMs using the Hugging Face Trainer API and LoRA:

* [Mohammadreza Esmaeiliyan - Fine-tuning LLM to Generate Persian Product Catalogs in JSON Format][8]
* [Maria Khalusova - Fine-tuning a Code LLM on Custom Code on a single GPU][9]
* [Abid Ali Awan - Fine-Tuning Gemma 2 and Using it Locally][10]

### Other datasets made for translation on the Hugging Face Hub:

* [Helsinki-NLP/open_subtitles][11]: A collection of translated movie subtitles from http://www.opensubtitles.org/
* [aiana94/polynews-parallel][12]: A multilingual parallel dataset containing news titles for 833 language pairs.


[1]: https://www.kaggle.com/competitions/gemma-language-tuning/overview
[2]: https://www.kaggle.com/models/google/gemma-2/pyTorch/gemma-2-9b-pt
[3]: https://huggingface.co/datasets/Helsinki-NLP/opus_books
[4]: https://huggingface.co/google/gemma-2-2b-it
[5]: https://blog.google/technology/developers/google-gemma-2/
[6]: https://developers.googleblog.com/en/advancing-multilingual-ai-with-gemma-2-and-a-150k-challenge/
[7]: https://ai.google.dev/gemma/docs/spoken-language/task-specific-tuning
[8]: https://huggingface.co/learn/cookbook/en/fine_tuning_llm_to_generate_persian_product_catalogs_in_json_format
[9]: https://huggingface.co/learn/cookbook/fine_tuning_code_llm_on_single_gpu
[10]: https://www.datacamp.com/tutorial/fine-tuning-gemma-2
[11]: https://huggingface.co/datasets/Helsinki-NLP/open_subtitles
[12]: https://huggingface.co/datasets/aiana94/polynews-parallel

In [1]:
!pip install peft
!pip install bitsandbytes
!pip install trl

Collecting peft
  Downloading peft-0.13.2-py3-none-any.whl.metadata (13 kB)
Downloading peft-0.13.2-py3-none-any.whl (320 kB)
Installing collected packages: peft
Successfully installed peft-0.13.2
Collecting bitsandbytes
  Downloading bitsandbytes-0.44.1-py3-none-manylinux_2_24_x86_64.whl.metadata (3.5 kB)
Downloading bitsandbytes-0.44.1-py3-none-manylinux_2_24_x86_64.whl (122.4 MB)
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0.44.1
Collecting trl
  Downloading trl-0.12.1-py3-none-any.whl.metadata (10 kB)
Collecting transformers>=4.46.0 (from trl)
  Downloading transformers-4.46.3-py3-none-any.whl.metadata (44 kB)

In [2]:
import torch
import wandb
import bitsandbytes as bnb
from typing import List
from dataclasses import dataclass
from kaggle_secrets import UserSecretsClient
from huggingface_hub import login
from datasets import load_dataset, Dataset
from datasets.formatting.formatting import LazyRow
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer, setup_chat_format

### Using Accelerators

Select the GPU P100 accelerator for this notebook: it was designed around the 16 GB of VRAM that this GPU provides.

In [3]:
assert torch.cuda.is_available(),\
    "Don't forget to enable your GPU on Kaggle!"

assert torch.cuda.get_device_name(0) == 'Tesla P100-PCIE-16GB',\
    "This notebook was designed to work with a P100 GPU"

### Set up your Hugging Face and Weights & Biases API keys as Kaggle secrets and load them in your notebook

* Log in to Hugging Face using your token: https://huggingface.co/docs/hub/en/security-tokens
* Log in to W&B using your API key: https://wandb.ai/authorize
* Save these tokens in your Kaggle secrets: https://www.kaggle.com/discussions/product-feedback/114053


In [4]:
# Load in secrets
user_secrets = UserSecretsClient()
hf_token = user_secrets.get_secret("hf-gemma-2-kaggle-key")
wandb_key = user_secrets.get_secret("wandb_key")

# Log in to Hugging Face and W&B
login(token=hf_token)
wandb.login(key=wandb_key)

[34m[1mwandb[0m: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.


The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to /root/.cache/huggingface/token
Login successful


[34m[1mwandb[0m: W&B API key is configured. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

### Load in the Gemma 2 (2B-Instruct) model and tokenizer

Load the model with [4-bit quantization][1], which reduces its memory footprint without sacrificing too much of its performance.

[1]: https://www.datacamp.com/tutorial/quantization-for-large-language-models
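As a rough back-of-the-envelope illustration of why 4-bit loading helps, the sketch below estimates the storage needed for the weights at two precisions. The parameter count is an assumption (the total that PEFT reports later in this notebook, minus the adapter weights). The estimate is optimistic: bitsandbytes only quantizes linear layers, so embeddings and norms stay in higher precision, and quantization constants, activations, and optimizer state are ignored.

```python
# Rough estimate of weight storage at different precisions.
# ASSUMPTION: about 2.61 billion parameters for the 2B model.
# The real 4-bit footprint is larger, because only linear layers are
# quantized and quantization constants add some overhead.
ASSUMED_PARAM_COUNT = 2_614_341_888


def weight_bytes(num_params: int, bits_per_param: int) -> int:
    """Bytes needed to store num_params values at the given bit width."""
    return num_params * bits_per_param // 8


for label, bits in [("float16", 16), ("4-bit", 4)]:
    gib = weight_bytes(ASSUMED_PARAM_COUNT, bits) / 1024**3
    print(f"{label:>8}: {gib:.2f} GiB")
```

Under these assumptions the 4-bit weights take a quarter of the float16 storage, which leaves more of the 16 GB for activations and optimizer state.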

In [5]:
model_id = "google/gemma-2-2b-it"

# Use 4-bit quantization to reduce memory usage
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # 4-bit NormalFloat
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# Load in the tokenizer and the LM
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=bnb_config,
    attn_implementation="eager",  # "flash_attention_2" needs a GPU with CUDA compute capability >= 8.0
    trust_remote_code=True,
)


### Example of model inference

In [6]:
def chat_with_gemma(chat_message: str, max_new_tokens: int = 256) -> str:
    """
    Chat with the instruct Gemma 2 model.

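    Args:
    - chat_message (str): Your prompt message for the model.
    - max_new_tokens (int): The maximum number of tokens the model can generate. Defaults to 256.

    Returns:
    - str: The decoded sequence, which contains the prompt followed by the model response.
    """
    # Wrap the message in the instruct model's chat template and tokenize it.
    # add_generation_prompt is left at its default (False) here;
    # passing add_generation_prompt=True would additionally open the model turn.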
    messages = [
        {"role": "user", "content": chat_message},
    ]
    input_ids = tokenizer.apply_chat_template(
        messages,
        return_tensors="pt",
        return_dict=True
    ).to("cuda")
    # Generate model response
    outputs = model.generate(**input_ids, max_new_tokens=max_new_tokens)
    # Decode the response with the tokenizer
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

In [7]:
assistant_response = chat_with_gemma(
    chat_message="Translate the following sentence into Hungarian: Oh! Single, my dear, to be sure!"
)

In [8]:
print(assistant_response)

user
Translate the following sentence into Hungarian: Oh! Single, my dear, to be sure!
*Oh! Single, my dear, to be sure!*

Here's the translation:

**"Oh! Személyes, kedvesem, biztosan!"**

**Explanation:**

* **Oh!** - This is a common exclamation in Hungarian, expressing surprise or excitement.
* **Személyes** - This means "single" or "unmarried" in Hungarian.
* **Kedvesem** - This means "my dear" in Hungarian.
* **Biztosan** - This means "surely" or "definitely" in Hungarian.


Let me know if you have any other sentences you'd like me to translate! 



### Load in and preprocess the [Helsinki-NLP/opus_books][1] dataset for fine-tuning

[1]: https://huggingface.co/datasets/Helsinki-NLP/opus_books

In [9]:
# Load in the dataset
opus_books_ds = load_dataset(
    "Helsinki-NLP/opus_books",
    "en-hu",
    split='train',
)
print(f'Length of the opus book dataset: {len(opus_books_ds)}')

Length of the opus book dataset: 137151


In [10]:
# Filter out the rows that only state the source of the books
filtered_opus_books_ds = opus_books_ds.filter(
    lambda example: not example['translation']['en'].startswith("Source")
)
print(f'Length after filtering Source rows: {len(filtered_opus_books_ds)}')

Length after filtering Source rows: 137131


In [11]:
# Filter out chapter titles
filtered_opus_books_ds = filtered_opus_books_ds.filter(
    lambda example: not example['translation']['en'].startswith("Chapter")
)
print(f'Length after filtering Chapter titles: {len(filtered_opus_books_ds)}')

Length after filtering Chapter titles: 136838


In [12]:
def create_prompt(en_text: str, hun_text: str) -> str:
    """
    Create a formatted prompt for a translation task with Gemma 2.

    Args:
    - en_text (str): The English text to be translated.
    - hun_text (str): The Hungarian translation of the English text.

    Returns:
    - str: A formatted prompt string representing the conversation.
    """
    # Construct the prompt with the English text and Hungarian translation
    prompt = (
        "<start_of_turn>user\n"
        "Translate the following sentence into Hungarian:\n"
        f"\"{en_text}\"\n"
        "<end_of_turn>\n"
        "<start_of_turn>model\n"
        f"{hun_text}\n"
        "<end_of_turn>"
    )
    return prompt


def count_number_of_tokens(prompt: str) -> int:
    """
    Count the number of tokens in a given prompt.

    Args:
    - prompt (str): The text prompt to tokenize and analyze.

    Returns:
    - int: The total number of tokens in the prompt.
    """
    # Tokenize the prompt and extract the input IDs
    tokenized = tokenizer(prompt, return_tensors="pt")
    # Return the number of tokens
    return tokenized["input_ids"].shape[1]


In [13]:
# Filter out items whose formatted prompt is 256 tokens or longer
filtered_opus_books_ds = filtered_opus_books_ds.filter(
    lambda example: count_number_of_tokens(
        create_prompt(example['translation']['en'], example['translation']['hu'])
    ) < 256,
)

print(f'Length after filtering long sequences: {len(filtered_opus_books_ds)}')
print(f'The number of filtered items are: {len(opus_books_ds) - len(filtered_opus_books_ds)}')

Length after filtering long sequences: 135391
The number of filtered items are: 1760
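To see how much each filtering step contributed, the short sketch below recomputes the differences from the dataset sizes printed by the cells above. The dictionary keys are labels chosen for illustration only.

```python
# Dataset sizes as printed by the filtering cells above.
sizes = {
    "original": 137_151,
    "after Source filter": 137_131,
    "after Chapter filter": 136_838,
    "after length filter": 135_391,
}

# Rows removed by each step = size before the step minus size after it.
stages = list(sizes.items())
removed = {
    later_name: earlier_count - later_count
    for (_, earlier_count), (later_name, later_count) in zip(stages, stages[1:])
}

for name, count in removed.items():
    print(f"{name}: removed {count} rows")

total_removed = sizes["original"] - sizes["after length filter"]
print(f"total removed: {total_removed} ({total_removed / sizes['original']:.2%})")
```

The length filter accounts for most of the removed rows, yet all three filters together drop only a small fraction of the dataset.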


In [14]:
# Shuffle the items in the dataset
shuffled_opus_books_ds = filtered_opus_books_ds.shuffle(seed=42)

In [15]:
# Examples in the dataset
for idx in range(10):
    print(shuffled_opus_books_ds[idx]['translation']['en'])
    print(shuffled_opus_books_ds[idx]['translation']['hu'])
    print('-'*50)

'IF IT WERE NOT A PITY to give up what has been set going... after spending so much toil... I would throw it all up, sell out and, like Nicholas Ivanich, go away... to hear La belle Hélène,' said the landowner, a pleasant smile lighting up his wise old face.
Csak ne sajnálnám annyira elhagyni azt, a mit eddig csináltam... tömérdek munkám fekszik benne..., fütyülnék mindenre, eladnám mindenemet és elmennék, mint Ivánovics Nikoláj... a Szép Heléná-t hallgatni, - mondotta a birtokos kellemes mosolylyal, mely okos, öreges arczát szinte földerítette.
--------------------------------------------------
Meanwhile the arena was levelled, and slaves began to dig holes one near the other in rows throughout the whole circuit from side to side, so that the last row was but a few paces distant from Cæsar's podium. From outside came the murmur of people, shouts and plaudits, while within they were preparing in hot haste for new tortures.
Ezek elrejtőztek a padok közötti átjárókban vagy az alsóbb hely

### Add a "text" feature to the dataset that contains the formatted prompts

In [16]:
def format_chat_template(row: LazyRow) -> LazyRow:
    """
    Format a row of data into a chat-style template for translation.

    Args:
    - row (LazyRow): A LazyRow object containing 'translation' with 'en' (English) and 'hu' (Hungarian) keys.

    Returns:
    - LazyRow: The updated LazyRow object with the 'text' key containing the formatted chat-style text.
    """
    # Create the chat-style template
    row_template = [
        {"role": "user", "content": f"Translate the following sentence into Hungarian: {row['translation']['en']}"},
        {"role": "assistant", "content": row['translation']['hu']}
    ]
    # Apply the chat template to add the 'text' key
    row["text"] = tokenizer.apply_chat_template(row_template, tokenize=False)
    return row
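To give an idea of what the `text` field looks like, here is a pure-Python approximation of the Gemma turn format. It is only a sketch: `render_gemma_turns` is a hypothetical helper, the authoritative string comes from `tokenizer.apply_chat_template`, and the exact special tokens and whitespace are assumptions about the template shipped with the tokenizer. The Hungarian side is a placeholder rather than a real translation.

```python
from typing import Dict, List

BOS_TOKEN = "<bos>"  # ASSUMPTION: beginning-of-sequence token of the tokenizer


def render_gemma_turns(
    messages: List[Dict[str, str]], add_generation_prompt: bool = False
) -> str:
    """Approximate the Gemma chat format (hypothetical helper, not the real template)."""
    parts = [BOS_TOKEN]
    for message in messages:
        # The "assistant" role is written as "model" in the Gemma format.
        role = "model" if message["role"] == "assistant" else message["role"]
        parts.append(
            f"<start_of_turn>{role}\n{message['content'].strip()}<end_of_turn>\n"
        )
    if add_generation_prompt:
        # Open a model turn so that generation starts with the model's answer.
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


text = render_gemma_turns(
    [
        {
            "role": "user",
            "content": "Translate the following sentence into Hungarian: "
            "Oh! Single, my dear, to be sure!",
        },
        {"role": "assistant", "content": "[Hungarian translation]"},
    ]
)
print(text)
```

For training examples the model turn is already filled in, so no generation prompt is appended. At inference time the conversation would normally end with an open model turn.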

In [17]:
preprocessed_ds = shuffled_opus_books_ds.map(
    format_chat_template,
    num_proc=4,
)

preprocessed_ds

Dataset({
    features: ['id', 'translation', 'text'],
    num_rows: 135391
})

In [18]:
# Train / validation / test split: hold out 10% for evaluation,
# then hold out 10% of the remaining rows for testing
split_ds = preprocessed_ds.train_test_split(test_size=0.1)
train_ds, eval_ds = split_ds['train'], split_ds['test']
split_ds = train_ds.train_test_split(test_size=0.1)
train_ds, test_ds = split_ds['train'], split_ds['test']
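The resulting split sizes can be estimated with a little arithmetic. The sketch below assumes that a fractional `test_size` is rounded up to a whole number of rows and that the rest goes to the training side. `split_sizes` is a hypothetical helper, not part of the `datasets` library.

```python
import math
from typing import Tuple


def split_sizes(n_rows: int, test_size: float) -> Tuple[int, int]:
    """Return (n_train, n_test), ASSUMING the test share is rounded up."""
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test, n_test


n_preprocessed = 135_391  # rows in preprocessed_ds

# First split: 10% of all rows become the evaluation set.
n_train_and_test, n_eval = split_sizes(n_preprocessed, 0.1)
# Second split: 10% of the remaining rows become the test set.
n_train, n_test = split_sizes(n_train_and_test, 0.1)

print(f"train: {n_train}, eval: {n_eval}, test: {n_test}")
```

Because the second split is taken from the remainder, the final proportions are roughly 81% train, 10% evaluation, and 9% test.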

# Set up transfer learning

In [19]:
@dataclass
class TrainingConfig:
    token_limit: int = 256
    lora_rank: int = 4
    lora_alpha: int = 2 * 4
    lora_dropout: float = 0.1
    lora_bias: str = "none"
    lr_value: float = 1e-4
    weight_decay: float = 0.01
    warmup_ratio: float = 0.03
    lr_scheduler_type: str = "cosine"
    train_epoch: int = 1
    per_device_train_batch_size: int = 4
    gradient_accumulation_steps: int = 1
    optim: str = "paged_adamw_32bit"
    save_steps: int = 0
    logging_steps: int = 25
    max_steps: int = -1
    new_model_name: str = "gemma-2-2b-it-hun-opus-books"


training_config = TrainingConfig()
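A few quantities follow directly from these settings. The sketch below estimates the effective batch size, the number of optimizer steps for one epoch, and the number of warmup steps. It assumes a single GPU and a training set of 109,665 rows (the size implied by the two 10% splits above). The rounding may differ between library versions once `gradient_accumulation_steps` is greater than 1.

```python
import math

# Values taken from TrainingConfig above.
per_device_train_batch_size = 4
gradient_accumulation_steps = 1
train_epoch = 1
warmup_ratio = 0.03

# ASSUMPTIONS: one GPU and 109,665 training rows.
num_devices = 1
n_train_rows = 109_665

# Number of examples that contribute to a single optimizer step.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
# The last, partially filled batch still counts as a step.
steps_per_epoch = math.ceil(n_train_rows / effective_batch_size)
total_steps = steps_per_epoch * train_epoch
# The learning rate ramps up during the first warmup_ratio share of the steps.
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(f"effective batch size: {effective_batch_size}")
print(f"total optimizer steps: {total_steps}")
print(f"warmup steps: {warmup_steps}")
```

Raising `gradient_accumulation_steps` is a common way to enlarge the effective batch size without using more VRAM, at the cost of fewer optimizer steps per epoch.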

### Extract the names of all linear modules in the model

Source: [Abid Ali Awan - Fine-Tuning Gemma 2 and Using it Locally][1]

[1]: https://www.datacamp.com/tutorial/fine-tuning-gemma-2

In [20]:
def find_all_linear_names(model: AutoModelForCausalLM) -> list[str]:
    """
    Find all module names corresponding to 4-bit linear layers in a given model.

    Args:
    - model (AutoModelForCausalLM): The model to inspect for 4-bit linear layers.

    Returns:
    - list[str]: A list of unique module names containing 4-bit linear layers, 
      excluding the 'lm_head' module if present.
    """
    # Target class for 4-bit linear layers (bnb is imported at the top of the notebook)
    cls = bnb.nn.Linear4bit
    lora_module_names = set()

    # Iterate through model's modules to identify 4-bit linear layers
    for name, module in model.named_modules():
        if isinstance(module, cls):
            # Extract the module name and add to the set
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])

    # Exclude the 'lm_head' module if it's in the set
    lora_module_names.discard('lm_head')  # Cleaner than conditional removal

    return list(lora_module_names)



target_modules = find_all_linear_names(model)
target_modules

['v_proj', 'q_proj', 'up_proj', 'down_proj', 'o_proj', 'k_proj', 'gate_proj']
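The name handling in this function can be tried without loading a model. The sketch below applies the same "keep the last path component" logic to a handful of made-up module paths. It skips the `isinstance` check for 4-bit linear layers, so it illustrates only the string processing. `leaf_module_names` and the example paths are hypothetical.

```python
from typing import Iterable, List


def leaf_module_names(
    module_names: Iterable[str], exclude: Iterable[str] = ("lm_head",)
) -> List[str]:
    """Collect the unique last components of dotted module paths."""
    leaves = {name.split(".")[-1] for name in module_names}
    return sorted(leaves - set(exclude))


# Hypothetical module paths, shaped like the ones named_modules() yields.
example_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.1.self_attn.q_proj",
    "model.layers.1.mlp.gate_proj",
    "lm_head",
]

print(leaf_module_names(example_names))
```

Collecting the names in a set means that a projection repeated in every layer appears only once in the list of LoRA target modules.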

In [21]:
# Load LoRA configuration
peft_config = LoraConfig(
    r=training_config.lora_rank,
    lora_alpha=training_config.lora_alpha,
    lora_dropout=training_config.lora_dropout,
    bias=training_config.lora_bias,
    task_type="CAUSAL_LM",
    target_modules=target_modules,
)

In [22]:
# Set training parameters
training_arguments = TrainingArguments(
    output_dir=training_config.new_model_name,
    num_train_epochs=training_config.train_epoch,
    per_device_train_batch_size=training_config.per_device_train_batch_size,
    gradient_accumulation_steps=training_config.gradient_accumulation_steps,
    optim=training_config.optim,
    save_steps=training_config.save_steps,
    logging_steps=training_config.logging_steps,
    learning_rate=training_config.lr_value,
    weight_decay=training_config.weight_decay,
    fp16=False,
    bf16=False,
    max_steps=training_config.max_steps,
    warmup_ratio=training_config.warmup_ratio,
    gradient_checkpointing=True,
    group_by_length=True,
    lr_scheduler_type=training_config.lr_scheduler_type,
    report_to="wandb",
)
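To illustrate what `warmup_ratio` and the `"cosine"` scheduler do together, here is a minimal pure-Python sketch of a linear warmup followed by a half-cosine decay. It follows the commonly used formula and is not the library's own implementation. `cosine_lr_with_warmup` is a hypothetical helper, and the step counts are assumptions derived from the configuration above (27,417 steps, of which 823 are warmup).

```python
import math


def cosine_lr_with_warmup(
    step: int, total_steps: int, warmup_steps: int, peak_lr: float
) -> float:
    """Learning rate at a given step: linear warmup, then half-cosine decay."""
    if step < warmup_steps:
        # Linear ramp from 0 up to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the decay phase that has been completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


# ASSUMED step counts, derived from the configuration above.
TOTAL_STEPS, WARMUP_STEPS, PEAK_LR = 27_417, 823, 1e-4

for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    lr = cosine_lr_with_warmup(step, TOTAL_STEPS, WARMUP_STEPS, PEAK_LR)
    print(f"step {step:>6}: lr = {lr:.2e}")
```

The learning rate starts at zero, reaches its peak once the warmup ends, and then decays smoothly back towards zero by the final step.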

In [23]:
# Make the outputs of the input embeddings require gradients.
# This lets gradients reach the adapter weights when gradient checkpointing
# is used while the base model weights stay frozen.
model.enable_input_require_grads()

In [24]:
# Wrap the base model with the LoRA adapter to create a PEFT model.
model = get_peft_model(model, peft_config)

In [25]:
trainer = SFTTrainer(
    model=model,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    dataset_text_field="text",
    peft_config=peft_config,
    max_seq_length=training_config.token_limit,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=False,
)

model.config.use_cache = False


Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.


Map:   0%|          | 0/109665 [00:00<?, ? examples/s]

Map:   0%|          | 0/13540 [00:00<?, ? examples/s]



In [26]:
model.print_trainable_parameters()

trainable params: 5,191,680 || all params: 2,619,533,568 || trainable%: 0.1982
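The trainable parameter count can be reproduced by hand. For a linear layer with `in_features` inputs and `out_features` outputs, LoRA adds two small matrices of shapes `r x in_features` and `out_features x r`, that is `r * (in_features + out_features)` new weights. The layer dimensions and the layer count in the sketch below are assumptions about the Gemma 2 2B architecture and are not stated elsewhere in this notebook.

```python
LORA_RANK = 4  # lora_rank in TrainingConfig

# ASSUMPTIONS about the Gemma 2 2B architecture: number of decoder layers
# and (in_features, out_features) of each targeted projection.
NUM_LAYERS = 26
PROJECTION_SHAPES = {
    "q_proj": (2304, 2048),
    "k_proj": (2304, 1024),
    "v_proj": (2304, 1024),
    "o_proj": (2048, 2304),
    "gate_proj": (2304, 9216),
    "up_proj": (2304, 9216),
    "down_proj": (9216, 2304),
}


def lora_param_count(in_features: int, out_features: int, rank: int) -> int:
    """Weights added by one adapter: A is rank x in, B is out x rank."""
    return rank * (in_features + out_features)


per_layer = sum(
    lora_param_count(n_in, n_out, LORA_RANK)
    for n_in, n_out in PROJECTION_SHAPES.values()
)
total_lora_params = per_layer * NUM_LAYERS
all_params = 2_619_533_568  # "all params" as printed above

print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total_lora_params:,}")
print(f"trainable share: {100 * total_lora_params / all_params:.4f}%")
```

Under these assumptions the total matches the number reported by `print_trainable_parameters()`. Since the count scales linearly with the rank, doubling `lora_rank` would double the number of trainable weights.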


In [27]:
# Display the details of the training with W&B
%wandb

trainer.train()

[34m[1mwandb[0m: Currently logged in as: [33memmermarci[0m ([33mimport_this[0m). Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Tracking run with wandb version 0.18.3
[34m[1mwandb[0m: Run data is saved locally in [35m[1m/kaggle/working/wandb/run-20241202_205131-vs5fb78f[0m
[34m[1mwandb[0m: Run [1m`wandb offline`[0m to turn off syncing.
[34m[1mwandb[0m: Syncing run [33mgemma-2-2b-it-hun-opus-books[0m
[34m[1mwandb[0m: ⭐️ View project at [34m[4mhttps://wandb.ai/import_this/huggingface[0m
[34m[1mwandb[0m: 🚀 View run at [34m[4mhttps://wandb.ai/import_this/huggingface/runs/vs5fb78f[0m
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]


| Step | Training Loss |
|------|---------------|
| 25   | 4.6477        |
| 50   | 6.1958        |
| 75   | 4.2906        |
| 100  | 4.7488        |
| 125  | 3.3523        |
| 150  | 3.0442        |
| 175  | 2.8628        |
| 200  | 2.5199        |
| 225  | 2.6446        |
| 250  | 2.2836        |


TrainOutput(global_step=27417, training_loss=1.1956650689111112, metrics={'train_runtime': 31164.4808, 'train_samples_per_second': 3.519, 'train_steps_per_second': 0.88, 'total_flos': 1.1138521632231629e+17, 'train_loss': 1.1956650689111112, 'epoch': 1.0})
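The throughput figures in this output can be cross-checked with simple arithmetic. The sketch below uses the runtime and step count from the `TrainOutput` above. The number of training rows (109,665) is an assumption taken from the size of the training split.

```python
# Values reported in the TrainOutput above.
train_runtime_seconds = 31164.4808
global_step = 27_417

# ASSUMPTION: number of rows in the training split.
n_train_rows = 109_665

runtime_hours = train_runtime_seconds / 3600
steps_per_second = global_step / train_runtime_seconds
samples_per_second = n_train_rows / train_runtime_seconds

print(f"runtime: {runtime_hours:.2f} hours")
print(f"steps per second: {steps_per_second:.2f}")
print(f"samples per second: {samples_per_second:.3f}")
```

One epoch over the filtered dataset therefore takes most of a working day on a single P100, which is worth keeping in mind when planning around session time limits.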

# Model evaluation

In [28]:
wandb.finish()
model.config.use_cache = True

[34m[1mwandb[0m:                                                                                
[34m[1mwandb[0m: 
[34m[1mwandb[0m: Run history:
[34m[1mwandb[0m:         train/epoch ▁▁▁▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▅▅▅▅▆▆▆▇▇▇▇▇▇▇▇▇██
[34m[1mwandb[0m:   train/global_step ▁▁▁▁▁▂▂▂▂▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▆▆▆▇▇▇▇███
[34m[1mwandb[0m:     train/grad_norm ▃▄█▂▂▄▁▃▂▃▃▂▃▄▃▂▃▁▃▂▂▃▃▃▃▃▂             
[34m[1mwandb[0m: train/learning_rate ▄██████▇▇▇▇▇▆▆▆▆▆▆▅▅▅▄▄▄▄▄▃▃▃▃▃▃▂▂▂▁▁▁▁▁
[34m[1mwandb[0m:          train/loss █▇▇▆▅▆▅▇▅▅▅▇▅▅▇▇▇▇▇▅▇▇▅▇▅▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁
[34m[1mwandb[0m: 
[34m[1mwandb[0m: Run summary:
[34m[1mwandb[0m:               total_flos 1.1138521632231629e+17
[34m[1mwandb[0m:              train/epoch 1
[34m[1mwandb[0m:        train/global_step 27417
[34m[1mwandb[0m:          train/grad_norm nan
[34m[1mwandb[0m:      train/learning_rate 0.0
[34m[1mwandb[0m:               train/loss 0
[34m[1mwandb[0m:               train_loss 1.19567
[34m[1mwandb[0m:    

# Saving the model to the Hugging Face Hub

In [29]:
trainer.model.save_pretrained(training_config.new_model_name)
trainer.model.push_to_hub(training_config.new_model_name, use_temp_dir=False)

adapter_model.safetensors:   0%|          | 0.00/20.8M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/emmermarcell/gemma-2-2b-it-hun-opus-books/commit/ec22344148419fbfd40c468dee7f0e282fbd973f', commit_message='Upload model', commit_description='', oid='ec22344148419fbfd40c468dee7f0e282fbd973f', pr_url=None, repo_url=RepoUrl('https://huggingface.co/emmermarcell/gemma-2-2b-it-hun-opus-books', endpoint='https://huggingface.co', repo_type='model', repo_id='emmermarcell/gemma-2-2b-it-hun-opus-books'), pr_revision=None, pr_num=None)