# How to Finetune Gemma2 for Spoken Language Tasks: Hungarian

The [Google - Unlock Global Communication with Gemma][1] competition encourages participants to fine-tune [Gemma 2][2] for a specific language or cultural context. My idea is to fine-tune the model on a Hungarian dataset.

### Some great existing notebooks on Kaggle for this competition:

* **[bebechien - How to Finetuning Gemma2 for Spoken Language Tasks][3] (Find out more about [LoRA][4])**
* **[bebechien - Translator of Old Korean Literature][5]**
* [Marília Prata - Bhagavad Gita (भगवद्गीता) Gemma2Keras][6]
* [Rishiraj Acharya - Fine-Tuning Gemma 2 for Bengali Poetry Generation][7]
* [Eugenio Schiavoni - All you need is High Qualitity Datasets][8]

### A couple of promising datasets from Hugging Face, Kaggle or anywhere on the internet that can be used to fine-tune Gemma 2 on Hungarian text:

* [matekadlicsko/hungarian-news-translations][9]: Translated Hungarian news articles, clear but not popular
* [batubayk/HU-News][10]: Hungarian news articles without translation
* [Bazsalanszky/reddit_hu][11]: Database of 140,000 Reddit posts from r/hungary and r/askhungary. Likely to be biased!
* **[Helsinki-NLP/opus_books][12]: Collection of copyright-free books aligned by Andras Farkas, in multiple languages, not just Hungarian.**
* [Liling Tan - Old Newspapers][13]: A cleaned subset of HC Corpora newspapers in multiple languages, not just Hungarian.
* [emLam][14] (2017): A Language Modeling Benchmark Corpus for Hungarian, similar to the One Billion Word corpus (Chelba, 2014) for English. 
* **[ELTE Poetry Corpus][15]: Complete poems of 52 Hungarian canonical poets, the sound devices of the poems and the grammatical features of words in XML format**
* **[ELTE Novel Corpus][16]: Contains 400 Hungarian novels.**
* **[ELTE Drama Corpus][17]: contains 74 Hungarian dramas.**
* [SZTAKI-HLT/HunSum-2-abstractive][18]: Hungarian-language dataset containing over 1.8M unique news articles with lead and other metadata. The dataset contains articles from 27 major Hungarian news websites.
* [Hungarian Webcorpus 2.0][19]: Largest Hungarian language corpus scraped from the .hu domain.
* [Hunglish Corpus][20]: Hungarian-English parallel corpus automatically aligned at the sentence level.
* [SzegedParalell Corpus][21]: The English-Hungarian parallel corpus contains texts selected on the basis of grammatical and translational criteria.
* **[Dhruvil Dave et al. - Wikibooks Dataset][27]: Complete text of over 270,000 chapters of Wikibooks in 12 languages**
* **[Andrzej Panczenko - folktales dataset][28]: 2838 folk tales from different nations.**

More Hungarian NLP resources here: [oroszgy/awesome-hungarian-nlp][22]

### Official resources for the Gemma 2 model:

* [google/gemma-2-2b-it Model Card][23]
* [Google - Gemma 2 is now available to researchers and developers][24]
* [Dadashi et al. - Towards Global Understanding – Advancing Multilingual AI with Gemma 2 and a 150K dollar Challenge][25]
* [Google -  Tasks in spoken languages with Gemma][26]

### Hugging Face Open Source AI cookbooks for training LLMs using LoRA:

* [Mohammadreza Esmaeiliyan - Fine-tuning LLM to Generate Persian Product Catalogs in JSON Format][29]
* [Maria Khalusova - Fine-tuning a Code LLM on Custom Code on a single GPU][30]

### Other resources for training Gemma 2:

* [Abid Ali Awan - Fine-Tuning Gemma 2 and Using it Locally][31]


[1]: https://www.kaggle.com/competitions/gemma-language-tuning/overview
[2]: https://www.kaggle.com/models/google/gemma-2/pyTorch/gemma-2-9b-pt
[3]: https://www.kaggle.com/code/bebechien/how-to-finetuning-gemma2-for-spoken-language-tasks
[4]: https://arxiv.org/abs/2106.09685
[5]: https://www.kaggle.com/code/bebechien/translator-of-old-korean-literature
[6]: https://www.kaggle.com/code/mpwolke/bhagavad-gita-gemma2keras
[7]: https://www.kaggle.com/code/rishirajacharya/fine-tuning-gemma-2-for-bengali-poetry-generation
[8]: https://www.kaggle.com/code/eugeniokukes/all-you-need-is-high-qualitity-datasets/notebook
[9]: https://huggingface.co/datasets/matekadlicsko/hungarian-news-translations
[10]: https://huggingface.co/datasets/batubayk/HU-News
[11]: https://huggingface.co/datasets/Bazsalanszky/reddit_hu
[12]: https://huggingface.co/datasets/Helsinki-NLP/opus_books
[13]: https://www.kaggle.com/datasets/alvations/old-newspapers
[14]: https://hlt.bme.hu/en/resources/emLam
[15]: https://github.com/ELTE-DH/poetry-corpus
[16]: https://github.com/ELTE-DH/regenykorpusz
[17]: https://github.com/ELTE-DH/drama-corpus
[18]: https://huggingface.co/datasets/SZTAKI-HLT/HunSum-2-abstractive
[19]: https://hlt.bme.hu/en/resources/webcorpus2
[20]: http://mokk.bme.hu/resources/hunglishcorpus/
[21]: https://rgai.inf.u-szeged.hu/node/163
[22]: https://github.com/oroszgy/awesome-hungarian-nlp
[23]: https://huggingface.co/google/gemma-2-2b-it
[24]: https://blog.google/technology/developers/google-gemma-2/
[25]: https://developers.googleblog.com/en/advancing-multilingual-ai-with-gemma-2-and-a-150k-challenge/
[26]: https://ai.google.dev/gemma/docs/spoken-language/task-specific-tuning
[27]: https://www.kaggle.com/datasets/dhruvildave/wikibooks-dataset
[28]: https://www.kaggle.com/datasets/andrzejpanczenko/folk-tales-dataset
[29]: https://huggingface.co/learn/cookbook/en/fine_tuning_llm_to_generate_persian_product_catalogs_in_json_format
[30]: https://huggingface.co/learn/cookbook/fine_tuning_code_llm_on_single_gpu
[31]: https://www.datacamp.com/tutorial/fine-tuning-gemma-2

In [1]:
%pip install peft bitsandbytes trl

Collecting peft
  Downloading peft-0.13.2-py3-none-any.whl.metadata (13 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.44.1-py3-none-manylinux_2_24_x86_64.whl.metadata (3.5 kB)
Collecting trl
  Downloading trl-0.12.0-py3-none-any.whl.metadata (10 kB)
Collecting transformers (from peft)
  Downloading transformers-4.46.1-py3-none-any.whl.metadata (44 kB)
Downloading peft-0.13.2-py3-none-any.whl (320 kB)
Downloading bitsandbytes-0.44.1-py3-none-manylinux_2_24_x86_64.whl (122.4 MB)
Downloading trl-0.12.0-py3-none-any.whl (310 kB)

In [2]:
import torch
import wandb
from typing import List
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import (
    LoraConfig,
    PeftModel,
    prepare_model_for_kbit_training,
    get_peft_model,
)
from trl import SFTTrainer, setup_chat_format
from datasets import load_dataset, Dataset
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient

In [3]:
class TrainingConfig:
    token_limit: int = 256
    lora_rank: int = 4
    lora_alpha: int = 2 * lora_rank
    lora_dropout: float = 0.1
    lora_bias: str = "none"
    lr_value: float = 1e-4
    weight_decay: float = 0.01
    warmup_ratio: float = 0.03
    lr_scheduler_type: str = "cosine"
    train_epoch: int = 1
    per_device_train_batch_size: int = 4
    gradient_accumulation_steps: int = 1
    optim: str = "paged_adamw_32bit"
    save_steps: int = 0
    logging_steps: int = 25
    max_steps: int = -1
    new_model_name: str = "gemma-2-2b-it-hun-opus"

training_config = TrainingConfig()

### Using accelerators

Please select the GPU P100 accelerator: the notebook relies on its 16 GB of VRAM.

In [4]:
assert torch.cuda.is_available(),\
    "Don't forget to enable your GPU on Kaggle!"

assert torch.cuda.get_device_name(0) == 'Tesla P100-PCIE-16GB',\
    "This notebook was designed to work with a P100 GPU"

In [5]:
# Login to Hugging Face using your token: https://huggingface.co/docs/hub/en/security-tokens
# I've saved this token into my Kaggle secrets: https://www.kaggle.com/discussions/product-feedback/114053
user_secrets = UserSecretsClient()
hf_token = user_secrets.get_secret("hf-gemma-2-kaggle-key")
wandb_key = user_secrets.get_secret("wandb_key")

login(token=hf_token)
wandb.login(key=wandb_key)

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to /root/.cache/huggingface/token
Login successful


[34m[1mwandb[0m: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: W&B API key is configured. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

In [6]:
# Load in the tokenizer and the LLM
model_id = "google/gemma-2-2b-it"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=bnb_config,
    attn_implementation="eager",
    trust_remote_code=True,
)

In [7]:
def chat_with_gemma(chat_message: str, max_new_tokens: int = 256) -> str:
    messages = [
        {"role": "user", "content": chat_message},
    ]
    input_ids = tokenizer.apply_chat_template(
        messages,
        return_tensors="pt",
        return_dict=True
    ).to("cuda")
    outputs = model.generate(**input_ids, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

In [8]:
assistant_response = chat_with_gemma("Translate the following sentence into Hungarian: Oh! Single, my dear, to be sure!")

In [9]:
print(assistant_response)

user
Translate the following sentence into Hungarian: Oh! Single, my dear, to be sure!
*Oh! Single, my dear, to be sure!*

Here's the translation:

**"Oh! Személyes, kedvesem, biztosan!"**

**Explanation:**

* **Oh!** - This is a common exclamation in Hungarian, expressing surprise or excitement.
* **Személyes** - This means "single" or "unmarried" in Hungarian.
* **Kedvesem** - This means "my dear" in Hungarian.
* **Biztosan** - This means "surely" or "definitely" in Hungarian.


Let me know if you have any other sentences you'd like me to translate! 
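
The reply above starts by repeating the sentence instead of going straight to the translation, and the decoded text contains no `model` turn marker. One possible explanation, offered here as an assumption rather than a confirmed diagnosis, is that `chat_with_gemma` calls `apply_chat_template` without `add_generation_prompt=True`, so the prompt ends after the user turn and the model is never cued to begin its own turn. The sketch below is a hand-written approximation of Gemma's turn format (`build_gemma_prompt` is a hypothetical helper, not part of `transformers`) that shows what the flag changes.

```python
from typing import Dict, List


def build_gemma_prompt(
    messages: List[Dict[str, str]], add_generation_prompt: bool = False
) -> str:
    """Hand-rolled approximation of Gemma's chat formatting (illustrative only)."""
    prompt = "<bos>"
    for message in messages:
        # Gemma labels the assistant role "model" in its turn markers
        role = "model" if message["role"] == "assistant" else message["role"]
        prompt += f"<start_of_turn>{role}\n{message['content'].strip()}<end_of_turn>\n"
    if add_generation_prompt:
        # Open the model turn so generation starts with the answer itself
        prompt += "<start_of_turn>model\n"
    return prompt


messages = [
    {
        "role": "user",
        "content": "Translate the following sentence into Hungarian: "
        "Oh! Single, my dear, to be sure!",
    }
]
without_cue = build_gemma_prompt(messages)
with_cue = build_gemma_prompt(messages, add_generation_prompt=True)
print(with_cue)
```

If this is indeed the cause, passing `add_generation_prompt=True` to `tokenizer.apply_chat_template` in `chat_with_gemma` should make the baseline output easier to compare with the fine-tuned model.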



 ### The [Helsinki-NLP/opus_books][1] dataset is great for fine-tuning Gemma 2 for Hungarian and readily available on Hugging Face.

 [1]: https://huggingface.co/datasets/Helsinki-NLP/opus_books

In [10]:
# Load in the dataset
opus_books_ds = load_dataset(
    "Helsinki-NLP/opus_books",
    "en-hu",
    split='train',
    
)
print(f'Length of the opus book dataset: {len(opus_books_ds)}')

Length of the opus book dataset: 137151


In [11]:
def create_prompt(en_text: str, hun_text: str) -> str:
    return (
        f"<start_of_turn>user\n"
        "Translate the following sentence into Hungarian:\n"
        f"\"{en_text}\"\n"
        "<end_of_turn>\n"
        "<start_of_turn>model\n"
        f"{hun_text}\n"
        "<end_of_turn>"
    )

def count_number_of_tokens(prompt: str) -> int:
    return tokenizer(prompt, return_tensors="pt")['input_ids'].shape[1]

In [12]:
# Filter the items in the dataset
filtered_opus_books_ds = opus_books_ds.filter(
    # We don't need the rows that give the source of the books
    lambda example: not example['translation']['en'].startswith("Source")
)
print(f'Length after filtering Source rows: {len(filtered_opus_books_ds)}')

filtered_opus_books_ds = filtered_opus_books_ds.filter(
    # Nor do we need chapter titles
    lambda example: not example['translation']['en'].startswith("Chapter")
)
print(f'Length after filtering Chapter titles: {len(filtered_opus_books_ds)}')


# Filter out items whose full prompt exceeds the token limit used for training
filtered_opus_books_ds = filtered_opus_books_ds.filter(
    lambda example: count_number_of_tokens(
        create_prompt(example['translation']['en'], example['translation']['hu'])
    ) < training_config.token_limit,
)

print(f'Length after filtering long sequences: {len(filtered_opus_books_ds)}')
print(f'The number of filtered items are: {len(opus_books_ds) - len(filtered_opus_books_ds)}')


# Shuffle the items in the dataset
shuffled_opus_books_ds = filtered_opus_books_ds.shuffle(seed=42)

Length after filtering Source rows: 137131
Length after filtering Chapter titles: 136838
Length after filtering long sequences: 135391
The number of filtered items are: 1760
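
The three filter passes can also be expressed as a single predicate, which makes the rule easy to unit-test without loading the model. This is a minimal sketch under stated assumptions: `count_tokens_whitespace` is a stand-in for the Gemma tokenizer, the prompt wrapper from `create_prompt` is left out, and the toy rows are made up for illustration. The last lines only redo the arithmetic on the counts printed above.

```python
from typing import Callable, Dict, List

Row = Dict[str, Dict[str, str]]


def count_tokens_whitespace(text: str) -> int:
    # Stand-in for the real tokenizer (assumption): whitespace-separated words
    return len(text.split())


def keep_row(
    row: Row,
    token_limit: int,
    count_tokens: Callable[[str], int] = count_tokens_whitespace,
) -> bool:
    """Return True when a translation pair should stay in the training set."""
    en = row["translation"]["en"]
    hu = row["translation"]["hu"]
    # Drop book-source rows and chapter titles
    if en.startswith(("Source", "Chapter")):
        return False
    # Drop pairs that are too long (the notebook measures the full prompt instead)
    return count_tokens(f"{en}\n{hu}") < token_limit


# Made-up rows, for illustration only
toy_rows: List[Row] = [
    {"translation": {"en": "Source: example library", "hu": "Source: example library"}},
    {"translation": {"en": "Chapter 1", "hu": "I. FEJEZET"}},
    {"translation": {"en": "It was a dark night.", "hu": "Sötét éjszaka volt."}},
    {"translation": {"en": "word " * 300, "hu": "szó " * 300}},
]
kept = [row for row in toy_rows if keep_row(row, token_limit=256)]
print(len(kept))

# Rows removed by each pass, from the counts printed by the notebook
reported = [137_151, 137_131, 136_838, 135_391]
removed_per_stage = [before - after for before, after in zip(reported, reported[1:])]
print(removed_per_stage, sum(removed_per_stage))
```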


In [13]:
for idx in range(10):
    print(shuffled_opus_books_ds[idx]['translation']['en'])
    print(shuffled_opus_books_ds[idx]['translation']['hu'])
    print('-'*50)

'IF IT WERE NOT A PITY to give up what has been set going... after spending so much toil... I would throw it all up, sell out and, like Nicholas Ivanich, go away... to hear La belle Hélène,' said the landowner, a pleasant smile lighting up his wise old face.
Csak ne sajnálnám annyira elhagyni azt, a mit eddig csináltam... tömérdek munkám fekszik benne..., fütyülnék mindenre, eladnám mindenemet és elmennék, mint Ivánovics Nikoláj... a Szép Heléná-t hallgatni, - mondotta a birtokos kellemes mosolylyal, mely okos, öreges arczát szinte földerítette.
--------------------------------------------------
Meanwhile the arena was levelled, and slaves began to dig holes one near the other in rows throughout the whole circuit from side to side, so that the last row was but a few paces distant from Cæsar's podium. From outside came the murmur of people, shouts and plaudits, while within they were preparing in hot haste for new tortures.
Ezek elrejtőztek a padok közötti átjárókban vagy az alsóbb hely

In [14]:
def format_chat_template(row):
    row_template = [
        {"role": "user", "content": "Translate the following sentence into Hungarian: "+row['translation']['en']},
        {"role": "assistant", "content": row['translation']['hu']}
    ]
    row["text"] = tokenizer.apply_chat_template(row_template, tokenize=False)
    return row

preprocessed_ds = shuffled_opus_books_ds.map(
    format_chat_template,
    num_proc= 4,
)

preprocessed_ds

Dataset({
    features: ['id', 'translation', 'text'],
    num_rows: 135391
})

In [15]:
# train, validation, test split
split_ds = preprocessed_ds.train_test_split(test_size=0.1)
train_ds, eval_ds = split_ds['train'], split_ds['test']
split_ds = train_ds.train_test_split(test_size=0.1)
train_ds, test_ds = split_ds['train'], split_ds['test']
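
The two chained `train_test_split` calls give roughly an 81/10/9 train/eval/test split. The sketch below reproduces the expected sizes, assuming the held-out portion is rounded up; that assumption is consistent with the 109665 and 13540 row counts that appear when the trainer is built later in the notebook. `split_sizes` is a hypothetical helper, not a `datasets` function.

```python
import math
from typing import Tuple


def split_sizes(num_rows: int, test_size: float) -> Tuple[int, int]:
    """Return (kept_rows, held_out_rows), assuming the held-out part is rounded up."""
    held_out = math.ceil(num_rows * test_size)
    return num_rows - held_out, held_out


# First split: carve the evaluation set out of the filtered dataset
remaining_rows, eval_rows = split_sizes(135_391, 0.1)
# Second split: carve the test set out of what is left
train_rows, test_rows = split_sizes(remaining_rows, 0.1)
print(train_rows, eval_rows, test_rows)
```

Neither call fixes a seed, so the rows in each split change between runs; passing `seed=` to `train_test_split` would make the splits reproducible.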

### Set up transfer learning

In [16]:
import bitsandbytes as bnb

def find_all_linear_names(model):
    cls = bnb.nn.Linear4bit
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])
    if 'lm_head' in lora_module_names:  # needed for 16 bit
        lora_module_names.remove('lm_head')
    return list(lora_module_names)

target_modules = find_all_linear_names(model)
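
To see what `find_all_linear_names` collects without loading a quantized model, the name handling can be tried on plain strings. This is a minimal sketch: `leaf_names` is a hypothetical helper, and the dotted names are assumed to follow the usual `model.layers.N...` layout of Gemma checkpoints.

```python
from typing import Iterable, List, Tuple


def leaf_names(
    module_names: Iterable[str], exclude: Tuple[str, ...] = ("lm_head",)
) -> List[str]:
    """Collect the last component of each dotted module name, minus exclusions."""
    # split(".")[-1] also covers names without a dot, so no special case is needed
    leaves = {name.split(".")[-1] for name in module_names}
    return sorted(leaves - set(exclude))


# Assumed example names; the real list comes from model.named_modules()
example_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.1.self_attn.q_proj",
    "model.layers.1.mlp.down_proj",
    "lm_head",
]
print(leaf_names(example_names))
```

Because only the leaf name is kept, LoRA adapters are attached to every layer that has a module with that name.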

In [17]:
# Load LoRA configuration
peft_config = LoraConfig(
    r=training_config.lora_rank,
    lora_alpha=training_config.lora_alpha,
    lora_dropout=training_config.lora_dropout,
    bias=training_config.lora_bias,
    task_type="CAUSAL_LM",
    target_modules=target_modules,
)

In [18]:
# Set training parameters
training_arguments = TrainingArguments(
    output_dir=training_config.new_model_name,
    num_train_epochs=training_config.train_epoch,
    per_device_train_batch_size=training_config.per_device_train_batch_size,
    gradient_accumulation_steps=training_config.gradient_accumulation_steps,
    optim=training_config.optim,
    save_steps=training_config.save_steps,
    logging_steps=training_config.logging_steps,
    learning_rate=training_config.lr_value,
    weight_decay=training_config.weight_decay,
    fp16=False,
    bf16=False,
    max_steps=training_config.max_steps,
    warmup_ratio=training_config.warmup_ratio,
    gradient_checkpointing=True,
    group_by_length=True,
    lr_scheduler_type=training_config.lr_scheduler_type,
    report_to="wandb",
    push_to_hub=True
)
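
With `lr_scheduler_type="cosine"` and `warmup_ratio=0.03`, the learning rate ramps up linearly and then decays along half a cosine wave. The sketch below re-implements that shape in plain Python to show the values involved. It assumes the warmup length is the rounded-up product of ratio and total steps, and it uses the 27417 optimizer steps that the training run reports further down; `lr_at_step` is a hypothetical helper, not a `transformers` function.

```python
import math


def lr_at_step(
    step: int, total_steps: int, base_lr: float = 1e-4, warmup_ratio: float = 0.03
) -> float:
    """Linear warmup followed by cosine decay to zero (illustrative re-implementation)."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the decay phase completed, between 0 and 1
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = 27_417
for step in (0, 400, 823, 14_120, 27_417):
    print(step, f"{lr_at_step(step, total_steps):.2e}")
```

Under these assumptions the peak learning rate is reached after 823 steps, which is past the first few hundred steps shown in the training loss table.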

In [19]:
trainer = SFTTrainer(
    model=model,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    dataset_text_field="text",
    peft_config=peft_config,
    max_seq_length=training_config.token_limit,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=False,
)

model.config.use_cache = False


Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.


Map:   0%|          | 0/109665 [00:00<?, ? examples/s]

Map:   0%|          | 0/13540 [00:00<?, ? examples/s]



In [20]:
# model, tokenizer = setup_chat_format(model, tokenizer)

In [21]:
model.enable_input_require_grads()

In [22]:
model = get_peft_model(model, peft_config)

In [23]:
model.print_trainable_parameters()

trainable params: 5,191,680 || all params: 2,619,533,568 || trainable%: 0.1982
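
The trainable parameter count can be reproduced by hand, which is a useful check that LoRA is attached to the intended layers. Each adapted linear layer gains `rank * (in_features + out_features)` parameters. The layer shapes below are assumptions taken from the published `gemma-2-2b` configuration (hidden size 2304, MLP size 9216, 26 layers, 8 query heads and 4 key/value heads of dimension 256), not values read from this notebook.

```python
from typing import Dict, Tuple

# Assumed gemma-2-2b shapes (see lead-in); adjust for other checkpoints
HIDDEN, INTERMEDIATE, NUM_LAYERS = 2304, 9216, 26
Q_DIM, KV_DIM = 8 * 256, 4 * 256

# (in_features, out_features) of every linear layer that receives an adapter
PROJECTIONS: Dict[str, Tuple[int, int]] = {
    "q_proj": (HIDDEN, Q_DIM),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (Q_DIM, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    # LoRA adds A (rank x in_features) and B (out_features x rank)
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, rank=4) for i, o in PROJECTIONS.values())
trainable = per_layer * NUM_LAYERS
share = 100 * trainable / 2_619_533_568  # total taken from the output above
print(trainable, round(share, 4))
```

Doubling `lora_rank` doubles the trainable parameters, so the count scales linearly with the rank chosen in `TrainingConfig`.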


In [24]:
trainer.train()

[34m[1mwandb[0m: Currently logged in as: [33memmermarci[0m ([33mimport_this[0m). Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Tracking run with wandb version 0.18.3
[34m[1mwandb[0m: Run data is saved locally in [35m[1m/kaggle/working/wandb/run-20241102_184314-mv5r70w2[0m
[34m[1mwandb[0m: Run [1m`wandb offline`[0m to turn off syncing.
[34m[1mwandb[0m: Syncing run [33mgemma-2-2b-it-hun-opus[0m
[34m[1mwandb[0m: ⭐️ View project at [34m[4mhttps://wandb.ai/import_this/huggingface[0m
[34m[1mwandb[0m: 🚀 View run at [34m[4mhttps://wandb.ai/import_this/huggingface/runs/mv5r70w2[0m
  return fn(*args, **kwargs)
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]


| Step | Training Loss |
|------|---------------|
| 25   | 4.819         |
| 50   | 6.4598        |
| 75   | 4.2654        |
| 100  | 4.8156        |
| 125  | 3.3477        |
| 150  | 3.015         |
| 175  | 2.8496        |
| 200  | 2.5345        |
| 225  | 2.6286        |
| 250  | 2.3658        |

TrainOutput(global_step=27417, training_loss=1.8507010411932696, metrics={'train_runtime': 31498.4497, 'train_samples_per_second': 3.482, 'train_steps_per_second': 0.87, 'total_flos': 1.113993796356818e+17, 'train_loss': 1.8507010411932696, 'epoch': 1.0})
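
The numbers in `TrainOutput` can be cross-checked against the configuration. The sketch below assumes a single GPU, so the effective batch size is `per_device_train_batch_size * gradient_accumulation_steps`, and takes the training set size from the `Map` counter printed when the trainer was created.

```python
import math

train_rows = 109_665          # rows in train_ds, from the Map counter above
per_device_batch_size = 4     # TrainingConfig.per_device_train_batch_size
gradient_accumulation = 1     # TrainingConfig.gradient_accumulation_steps
runtime_seconds = 31498.4497  # train_runtime from TrainOutput

# One optimizer step consumes one effective batch (single-GPU assumption)
effective_batch = per_device_batch_size * gradient_accumulation
steps_per_epoch = math.ceil(train_rows / effective_batch)

steps_per_second = steps_per_epoch / runtime_seconds
samples_per_second = train_rows / runtime_seconds
hours = runtime_seconds / 3600

print(steps_per_epoch, round(steps_per_second, 2), round(samples_per_second, 3))
print(f"{hours:.2f} h")
```

One epoch therefore takes close to nine hours on this setup, which matters when planning around Kaggle session limits.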

In [25]:
# assistant_response = chat_with_gemma("Translate the following sentence into Hungarian: Oh! Single, my dear, to be sure!")

# from trl import SFTTrainer, setup_chat_format

### TODO list:

* [x] padding ... https://huggingface.co/learn/nlp-course/chapter3/3
* [ ] Gemma 2 probably already knows the opus books dataset
* [ ] ValueError: The model did not return a loss from the inputs, only the following keys: logits. For reference, the inputs it received are input_ids,attention_mask. -> Let's consider Keras, at least that works...