In this notebook we evaluate the fine-tuned Gemma 2 2B model. Evaluating machine translation is hard: translation is open-ended, and a sentence can have several correct translations. To check that a translation captures the right context, cultural nuances, and idiomatic expressions, it needs to be judged by human evaluators, and the commonly available automatic scores miss these aspects. Human evaluation is expensive, however, so for this task we still rely on the metrics commonly used for translation.

We use the Bilingual Evaluation Understudy (BLEU) score proposed by Papineni, K., et al. to evaluate our fine-tuned model. This metric measures the quality of machine-translated text by comparing it to one or more reference translations. The score ranges from 0 to 1, with 1 indicating a perfect match between the machine translation and the reference translation(s).

To see how much fine-tuning has improved the model, we take a random sample of 1000 examples from the CFILT dataset that were not used in fine-tuning the Gemma model. We compare the BLEU score of the baseline model with that of the fine-tuned model on this sample. We also compare our fine-tuned model against the Nemotron-4-Mini-Hindi-4B-Instruct model released by Nvidia on this test set.



In [3]:
# Install Keras 3 last. See https://keras.io/getting_started/ for more details.
!pip install -q -U keras-nlp datasets
!pip install -q -U keras

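import os

# The Keras backend must be chosen before Keras is first imported.
os.environ["KERAS_BACKEND"] = "jax"

import keras_nlp
import keras
import tensorflow as tf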

import numpy as np
import pandas as pd
from transformers import AutoTokenizer, TrainingArguments, pipeline, AutoModelForCausalLM
import nltk
from nltk.translate.bleu_score import sentence_bleu
from nltk.tokenize import word_tokenize

import warnings
import time

warnings.filterwarnings('ignore')
nltk.download("punkt")

[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [4]:
# Configs
# Set the backend before importing Keras; it has no effect once Keras is loaded.
os.environ["KERAS_BACKEND"] = "jax"
# Avoid memory fragmentation on JAX backend.
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "1.00"

model_id = "gemma2_instruct_2b_en"
token_limit = 256

# Optionally run at half precision (currently disabled).
# keras.config.set_floatx("bfloat16")


The following utility function calculates the BLEU score using the `sentence_bleu` function from the NLTK package. To calculate BLEU, we build n-grams of the translated and reference text and compute the precision of the translated text for each n-gram order (1-gram, 2-gram, 3-gram, and so on; `sentence_bleu` uses orders 1 to 4 with equal weights by default). To prevent the machine translation from getting undue credit for repeating words excessively, the count of each n-gram in the machine-generated translation is limited to the maximum count found in any reference translation. This is known as clipping. The final score is the geometric mean of the precisions across n-gram orders, multiplied by a brevity penalty that penalizes translations shorter than the reference.
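To make the steps above concrete, here is a minimal pure-Python sketch of clipped precision and the resulting score. It is an illustration only, not NLTK's implementation: the function names `ngrams`, `clipped_precision`, and `simple_bleu` are hypothetical, it handles a single reference, and it applies no smoothing.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(reference, candidate, n):
    """Clipped n-gram precision of a candidate against one reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / total


def simple_bleu(reference, candidate, max_n=4):
    """Geometric mean of clipped precisions times a brevity penalty."""
    precisions = [clipped_precision(reference, candidate, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # no smoothing: one n-gram order without matches zeroes the score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # Penalize candidates that are shorter than the reference.
    if len(candidate) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    return brevity_penalty * math.exp(log_mean)


reference = "the cat is on the mat".split()
repeated = "the the the the the the".split()
print(clipped_precision(reference, repeated, 1))  # only 2 of the 6 tokens are credited
print(simple_bleu(reference, reference))
```

Without clipping, the repeated candidate would get a unigram precision of 1.0, because every one of its tokens appears in the reference.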

In [5]:
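def calculate_bleu_score(reference, candidate):
    """Returns the sentence-level BLEU score of a candidate against one reference."""
    reference_tokens = word_tokenize(reference)
    candidate_tokens = word_tokenize(candidate)

    return sentence_bleu([reference_tokens], candidate_tokens)

def clean_model_output(txt):
    """Removes the prompt and control tokens from the model output."""
    # Keep the text after the first "model" marker and drop the last 14
    # characters, which are assumed to hold the trailing control tokens.
    txt = txt[txt.find('model') + 6 : -14]
    return txt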
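The slicing in `clean_model_output` relies on two assumptions: the first occurrence of the word "model" is the turn marker, and the output always ends with 14 characters of control tokens. A source sentence that contains the word "model", or a generation cut off at the token limit, would break either one. A more defensive variant could look like the sketch below. The function name `clean_model_output_by_marker` is hypothetical, and the sketch assumes the generated string contains the prompt built by `convert_message_to_prompt` followed by the model's reply.

```python
MODEL_TURN = "<start_of_turn>model\n"
END_OF_TURN = "<end_of_turn>"


def clean_model_output_by_marker(txt):
    """Return the text of the final model turn, without control tokens."""
    # Search from the right so the word "model" inside the user text is ignored.
    start = txt.rfind(MODEL_TURN)
    reply = txt[start + len(MODEL_TURN):] if start != -1 else txt
    # Cut at the end-of-turn token only if the model actually emitted it.
    end = reply.find(END_OF_TURN)
    if end != -1:
        reply = reply[:end]
    return reply.strip()


raw = (
    "<start_of_turn>user\n Translate the sentence into hindi and only return "
    "the translation. Text :  This model is new<end_of_turn>\n"
    "<start_of_turn>model\nयह मॉडल नया है<end_of_turn>"
)
print(clean_model_output_by_marker(raw))
```

Unlike the original helper, this variant also strips surrounding whitespace, so scores computed with it may differ slightly from the ones reported in this notebook.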

In [6]:
# Helpers : https://github.com/google-gemini/gemma-cookbook/blob/main/Gemma/Advanced_Prompting_Techniques.ipynb
def convert_message_to_prompt(message: str, model_prefix: str = "") -> str:
    """Converts a message to a prompt for a large language model.

    Args:
        message: The message to convert (str).
        model_prefix: An optional prefix to prepend to the model response (str).

    Returns:
        A string containing the prompt for the large language model (str).
    """

    return (
        f"<start_of_turn>user\n Translate the sentence into hindi and only return the translation. Text :  {message}<end_of_turn>\n"
        f"<start_of_turn>model\n{model_prefix}"
    )

## Dataset details

In [7]:
splits = {
    "train": "data/train-00000-of-00001.parquet",
    "validation": "data/validation-00000-of-00001.parquet",
    "test": "data/test-00000-of-00001.parquet",
}

# load the parquet files from huggingface
train = pd.read_parquet(
    "hf://datasets/cfilt/iitb-english-hindi/" + splits["train"]
)
val = pd.read_parquet(
    "hf://datasets/cfilt/iitb-english-hindi/" + splits["validation"]
)
test = pd.read_parquet(
    "hf://datasets/cfilt/iitb-english-hindi/" + splits["test"]
)

print(train.shape)
print(val.shape)
print(test.shape)

(1659083, 1)
(520, 1)
(2507, 1)


## Model details : Gemma 2

Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. They are text-to-text, decoder-only large language models, available in English, with open weights, pre-trained variants, and instruction-tuned variants. Gemma models are well-suited for a variety of text generation tasks, including question answering, summarization, and reasoning. Their relatively small size makes it possible to deploy them in environments with limited resources such as a laptop, desktop or your own cloud infrastructure, democratizing access to state of the art AI models and helping foster innovation for everyone.

Key Architecture Highlights:

1. GeGLU activation
2. RoPE positional embeddings
3. Grouped-Query Attention (GQA)
4. RMSNorm

Training Highlights:

1. Pretraining on 2 trillion tokens
2. SentencePiece tokenizer with byte-level encoding for unknown tokens
3. Teacher-student training (knowledge distillation) with the 27B model
4. Large vocabulary (256k tokens)
5. Instruction fine-tuning

In [11]:
device_name = tf.test.gpu_device_name()
with tf.device('/GPU:0'):
    gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset(model_id)
    gemma_lm.summary()

normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.


In [8]:
tick_start = 0

def tick():
    global tick_start
    tick_start = time.time()

def tock():
    print(f"TOTAL TIME ELAPSED: {time.time() - tick_start:.2f}s")

 
def text_gen(prompt, debug = False):
    input = convert_message_to_prompt(prompt)
    if debug:
        print(f'User input: {input}')
    output = gemma_lm.generate(input, max_length=token_limit)
    if debug:
        print("\nGemma output:")
        print(output)
    return output


## Evaluating base model - BLEU score

In [8]:
test_sample = test.sample(1000, random_state=42)
print(test_sample.shape)
test_sample["english"] = test_sample.translation.apply(lambda x: x.get("en"))
test_sample["hindi"] = test_sample.translation.apply(lambda x: x.get("hi"))

test_sample.head()

(1000, 1)


Unnamed: 0,translation,english,hindi
2121,{'en': 'Due to financial constraints Dhirubhai...,Due to financial constraints Dhirubhai had to ...,आर्थिक तंगी के कारण धीरूभाई को हाईस्कूल के बाद...
56,{'en': 'Or they can choose not to have a devic...,Or they can choose not to have a device at all...,"या फिर से कोई भी डिवाइस न लेना चुन सकते हैं, ब..."
2479,"{'en': 'However, following the recent murder o...","However, following the recent murder of Austra...",वैसे हाल ही में फुकेट में ऑस्ट्रेलियाई ट्रैवेल...
1292,{'en': 'Delta and JetBlue were among the airli...,Delta and JetBlue were among the airliners who...,"डेल्टा और जेटब्लू उन एयरलाइन्स में से हैं, जिन..."
1599,{'en': 'It was simply nothing I could have ima...,It was simply nothing I could have imagined.,सरल तौर कहें तो में इसमें से किसी भी कल्पना भी...


### Model inference 

In [None]:
tick()
test_sample["predictions_woft"] = test_sample.english.apply(
    lambda text: text_gen(f"{text}")
)
tock()


I0000 00:00:1736973306.900328      23 service.cc:145] XLA service 0x589d722899e0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1736973306.900392      23 service.cc:153]   StreamExecutor device (0): Tesla P100-PCIE-16GB, Compute Capability 6.0
I0000 00:00:1736973319.521795      23 device_compiler.h:188] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
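Calling the model once per row, as above, is simple but slow. If the generation call accepts a list of prompts (which is my understanding of `GemmaCausalLM.generate`, though this should be checked against the installed KerasNLP version), the prompts can be sent in batches instead. The sketch below shows only the batching logic. `generate_in_batches` and `fake_generate` are hypothetical names, and the stand-in generator just echoes its input so the example runs without a model.

```python
def generate_in_batches(prompts, generate_fn, batch_size=8):
    """Call generate_fn on consecutive slices of prompts and collect outputs in order."""
    outputs = []
    for start in range(0, len(prompts), batch_size):
        batch = prompts[start:start + batch_size]
        outputs.extend(generate_fn(batch))
    return outputs


def fake_generate(batch):
    """Stand-in for a model call: echoes each prompt with a placeholder."""
    return [f"{prompt} -> <translation>" for prompt in batch]


prompts = [f"sentence {i}" for i in range(5)]
results = generate_in_batches(prompts, fake_generate, batch_size=2)
print(len(results))
```

In this notebook, `generate_fn` could be something like `lambda batch: gemma_lm.generate(batch, max_length=token_limit)`, with the prompts built by `convert_message_to_prompt`. The best batch size depends on the available GPU memory.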


In [15]:
test_sample.head()

Unnamed: 0,translation,english,hindi,predictions_woft
2121,{'en': 'Due to financial constraints Dhirubhai...,Due to financial constraints Dhirubhai had to ...,आर्थिक तंगी के कारण धीरूभाई को हाईस्कूल के बाद...,<start_of_turn>user\n Translate the sentence i...
56,{'en': 'Or they can choose not to have a devic...,Or they can choose not to have a device at all...,"या फिर से कोई भी डिवाइस न लेना चुन सकते हैं, ब...",<start_of_turn>user\n Translate the sentence i...
2479,"{'en': 'However, following the recent murder o...","However, following the recent murder of Austra...",वैसे हाल ही में फुकेट में ऑस्ट्रेलियाई ट्रैवेल...,<start_of_turn>user\n Translate the sentence i...
1292,{'en': 'Delta and JetBlue were among the airli...,Delta and JetBlue were among the airliners who...,"डेल्टा और जेटब्लू उन एयरलाइन्स में से हैं, जिन...",<start_of_turn>user\n Translate the sentence i...
1599,{'en': 'It was simply nothing I could have ima...,It was simply nothing I could have imagined.,सरल तौर कहें तो में इसमें से किसी भी कल्पना भी...,<start_of_turn>user\n Translate the sentence i...


In [17]:
test_sample["predictions_woft_clean"] = test_sample.predictions_woft.apply(
    lambda text: clean_model_output(text)
)

In [18]:
test_sample["BLEU_Score"] = test_sample[
    ["hindi", "predictions_woft_clean"]
].apply(
    lambda row: calculate_bleu_score(row["hindi"], row["predictions_woft_clean"]),
    axis=1,
)

print(f"Average BLEU Score: {test_sample['BLEU_Score'].mean().round(2)}")


Average BLEU Score: 0.3



A BLEU score of 0.3 (or 30 on a 0-100 scale) indicates a moderate level of overlap between the machine translation and the reference translation(s). Note that BLEU is not the fraction of matching n-grams (contiguous sequences of words): it is the geometric mean of the clipped n-gram precisions, scaled by a brevity penalty, so it is best read as a relative measure of overlap when comparing models on the same data.

In [33]:
sample_trn = test_sample[["hindi", "predictions_woft_clean", "english", "BLEU_Score"]].head(1).to_dict()
print(f"Original english sentence :\n{sample_trn['english'][2121]}")
print(f"Hindi ground truth sentence : \n{sample_trn['hindi'][2121]}")
print(f"Hindi translation :\n{sample_trn['predictions_woft_clean'][2121]}")
print(f"BLEU score for this sentence :\n{sample_trn['BLEU_Score'][2121]}")

Original english sentence :
Due to financial constraints Dhirubhai had to drop out after high school.
Hindi ground truth sentence : 
आर्थिक तंगी के कारण धीरूभाई को हाईस्कूल के बाद ही पढ़ाई छोड़ना पड़ गई।
Hindi translation :
धन कमी के कारण दिरुभाई ने हाई स्कूल छोड़ दिया। 
BLEU score for this sentence :
0.25880882365505126


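A BLEU score of about 0.26 here reflects partial n-gram overlap with the reference (for example, the shared phrase "के कारण"); it is not the fraction of tokens that match. The translation conveys broadly the same meaning in different words, which BLEU does not reward.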

## Fine Tuned Model 

### Loading saved weights

In [30]:
with tf.device('/GPU:0'):
    gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma2_instruct_2b_en")

    # Use the same LoRA rank that was used during training.
    gemma_lm.backbone.enable_lora(rank=20)

    # Load the fine-tuned LoRA weights.
    gemma_lm.backbone.load_lora_weights("/kaggle/input/gemma2b-fine-tuned/llm_mark1_20_epoch20.lora (1).h5")


normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.


In [31]:
gemma_lm.compile()  # optionally: sampler="top_k"

gemma_lm.summary()

In [32]:
def text_gen(prompt, debug = False):
    input = convert_message_to_prompt(prompt)
    if debug:
        print(f'User input: {input}')
    output = gemma_lm.generate(input, max_length=token_limit)
    if debug:
        print("\nGemma output:")
        print(output)
    return output


## Inference Test Set - Fine Tuned Model

In [33]:

test_sample["predictions_ft"] = test_sample.english.apply(
    lambda text: text_gen(f"{text}")
)


I0000 00:00:1736979351.890667      23 service.cc:145] XLA service 0x5bfdbe6f5020 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1736979351.891216      23 service.cc:153]   StreamExecutor device (0): Tesla P100-PCIE-16GB, Compute Capability 6.0
I0000 00:00:1736979366.092412      23 device_compiler.h:188] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.


In [34]:
test_sample["predictions_ft_clean"] = test_sample.predictions_ft.apply(
    lambda text: clean_model_output(text)
)

test_sample.head()

Unnamed: 0,translation,english,hindi,predictions_ft,predictions_ft_clean
2121,{'en': 'Due to financial constraints Dhirubhai...,Due to financial constraints Dhirubhai had to ...,आर्थिक तंगी के कारण धीरूभाई को हाईस्कूल के बाद...,<start_of_turn>user\n Translate the sentence i...,दुर्घटना के बैंक को उ उठाना पड़े।
56,{'en': 'Or they can choose not to have a devic...,Or they can choose not to have a device at all...,"या फिर से कोई भी डिवाइस न लेना चुन सकते हैं, ब...",<start_of_turn>user\n Translate the sentence i...,उन्होंने अपने राज्य के द्‌ swagger खाते को चुन...
2479,"{'en': 'However, following the recent murder o...","However, following the recent murder of Austra...",वैसे हाल ही में फुकेट में ऑस्ट्रेलियाई ट्रैवेल...,<start_of_turn>user\n Translate the sentence i...,"हालांकि, 2008 में कोचीन के आगंतुक़ाठुइ ने शाहि..."
1292,{'en': 'Delta and JetBlue were among the airli...,Delta and JetBlue were among the airliners who...,"डेल्टा और जेटब्लू उन एयरलाइन्स में से हैं, जिन...",<start_of_turn>user\n Translate the sentence i...,दूसरी दुनिया पर उड़ानों पर उड़ान की योजनाएं आर...
1599,{'en': 'It was simply nothing I could have ima...,It was simply nothing I could have imagined.,सरल तौर कहें तो में इसमें से किसी भी कल्पना भी...,<start_of_turn>user\n Translate the sentence i...,"यह सिर्फ एक युद्ध नहीं था,Décès मेरें,"


In [35]:
test_sample['pred_len'] = test_sample["predictions_ft_clean"].apply(lambda x: len(x))
test_sample['gt_len'] = test_sample["hindi"].apply(lambda x: len(x))


In [39]:
test_sample["BLEU_Score_ft"] = test_sample[
    ["hindi", "predictions_ft_clean"]
].apply(
    lambda row: calculate_bleu_score(row["hindi"], row["predictions_ft_clean"]),
    axis=1,
)
print(f"Average BLEU Score for LoRA fine tuned model: {test_sample['BLEU_Score_ft'].mean().round(4)}")

Average BLEU Score for LoRA fine tuned model: 0.3468


We see that the average BLEU score on our test sample improved slightly with the fine-tuned model (from 0.30 to 0.35). Below are the sample translations with the highest BLEU scores.
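A small gap between two average scores can come from the particular sentences that were sampled. One common way to check this in machine translation evaluation is paired bootstrap resampling: resample the test sentences with replacement many times and count how often one system's mean beats the other's. The sketch below is a minimal illustration. The function name `paired_bootstrap_win_rate` is hypothetical, and the toy scores are made up for the example; they are not results from this notebook.

```python
import random


def paired_bootstrap_win_rate(scores_a, scores_b, n_resamples=1000, seed=0):
    """Fraction of resamples in which system B's mean score beats system A's."""
    if len(scores_a) != len(scores_b):
        raise ValueError("Both systems must be scored on the same sentences.")
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        # Resample sentence indices with replacement, keeping the pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_b > mean_a:
            wins += 1
    return wins / n_resamples


# Toy scores for illustration only.
baseline = [0.20, 0.35, 0.10, 0.40, 0.25, 0.30]
finetuned = [0.25, 0.40, 0.15, 0.45, 0.30, 0.35]
print(paired_bootstrap_win_rate(baseline, finetuned))
```

With the notebook's data, the two inputs would be `test_sample["BLEU_Score"].tolist()` and `test_sample["BLEU_Score_ft"].tolist()`. A win rate close to 1.0 suggests the improvement holds across resamples, while a value near 0.5 suggests the two systems are hard to tell apart on this sample.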

In [45]:
# Sort by the fine-tuned model's BLEU score, highest first.
top_ft = test_sample.sort_values(by="BLEU_Score_ft", ascending=False)

# Print the best-scoring example in full.
best = top_ft.iloc[0]
print(f"Original english sentence :\n{best['english']}")
print(f"Hindi ground truth sentence : \n{best['hindi']}")
print(f"Hindi translation :\n{best['predictions_ft_clean']}")
print(f"BLEU score for this sentence :\n{best['BLEU_Score_ft']}")

top_ft.head()

Unnamed: 0,translation,english,hindi,predictions_ft,predictions_ft_clean,pred_len,gt_len,BLEU_Score_ft
13,"{'en': 'The technology is there to do it.', 'h...",The technology is there to do it.,ऐसा करने के लिए प्रौद्योगिकी है।,<start_of_turn>user\n Translate the sentence i...,यह ऐसा करने के लिए प्रौद्योगिकी है।,35,32,0.809107
2463,{'en': 'People are joining them due to the fea...,People are joining them due to the fear of Tru...,तृणमूल कांग्रेस के आतंक से डर कर लोग उसमें शाम...,<start_of_turn>user\n Translate the sentence i...,ही वे लोग इस्लाम के ख़ौदा के डर को अपने द्वारा...,66,60,0.759836
907,{'en': 'This time we are warned that an indepe...,This time we are warned that an independent Sc...,इस बार हमें चेतावनी दी गई है कि ई.यू. की सदस्य...,<start_of_turn>user\n Translate the sentence i...,इसके लिए भारत सरकार ने इस बात की पुष्टि दी कि ...,183,172,0.74874
772,"{'en': '""Possibly,"" Martin replied.', 'hi': '""...","""Possibly,"" Martin replied.","""संभवतः,"" मार्टिन ने कहा।",<start_of_turn>user\n Translate the sentence i...,"""मैं तुम्हें यह संभव नही""",25,25,0.73111
1744,{'en': 'There are many sources from which to c...,There are many sources from which to collect t...,शेष राशि के लिए कई स्रोत है।,<start_of_turn>user\n Translate the sentence i...,उनकी कई अन्य योग्य स्रोत होते हैं।,34,28,0.73111


We see that our fine-tuned model is still not able to translate the sentences well: even among the highest-scoring examples above, several translations do not convey the meaning of the source sentence. We need to use a bigger dataset during fine-tuning to further improve the model's performance. For this exercise we used 2000 samples to train the model for 20 epochs, which took around 5 hours in a Kaggle notebook (P100 GPU enabled). In future experiments, we plan to use a bigger dataset and memory-efficient techniques such as quantization to fine-tune the Gemma model.