# Evaluation of finetuned model on benchmark datasets
* Evaluation dataset:
    1. IN22 Gen (https://huggingface.co/datasets/ai4bharat/IN22-Gen)
    2. Tatoeba Challenge (https://github.com/Helsinki-NLP/Tatoeba-Challenge)
* Finetuned model:finetuned-mbart50-en-tel
* Evaluation metrics: BLEU score

## Setup

In [1]:
!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7 dataset sacrebleu

[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
ipython-sql 0.5.0 requires sqlalchemy>=2.0, but you have sqlalchemy 1.4.53 which is incompatible.
kaggle-environments 1.14.15 requires transformers>=4.33.1, but you have transformers 4.31.0 which is incompatible.[0m[31m
[0m

In [2]:
import os
import torch
import pandas as pd
import sacrebleu
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    pipeline
)
from peft import PeftModel
os.environ["WANDB_DISABLED"] = "true"

2024-08-01 12:54:07.949325: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-08-01 12:54:07.949452: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-08-01 12:54:08.065156: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


In [3]:
# LLAMA-2 model
model_name = "meta-llama/Llama-2-7b-hf"
# Fine-tuned model name
new_model = "/kaggle/input/llama2-finetuned/results/finetuned-llama2-7b-en-hi"

In [4]:
# Reload model in FP16 and merge it with LoRA weights
base_model = AutoModelForCausalLM.from_pretrained(
                                                  model_name,
                                                  low_cpu_mem_usage=True,
                                                  return_dict=True,
                                                  torch_dtype=torch.float16,
                                                  device_map="auto",
                                                  use_auth_token='your hf auth token'
                                                  )
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()
# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, use_auth_token='your hf auth token')
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"



config.json:   0%|          | 0.00/609 [00:00<?, ?B/s]



model.safetensors.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/188 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/776 [00:00<?, ?B/s]



tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

# IN22 Gen

In [5]:
# Load the dataset
df = load_dataset('ai4bharat/IN22-Gen', "eng_Latn-hin_Deva", trust_remote_code=True, split='gen')

Downloading builder script:   0%|          | 0.00/7.59k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/7.60k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/3.36M [00:00<?, ?B/s]

Generating gen split: 0 examples [00:00, ? examples/s]

In [6]:
english_sentences = df['sentence_eng_Latn']
hindi_sentences = df['sentence_hin_Deva']

In [7]:
import re
def translate_to_hindi(query, max_length= 128):
    non_english_chars_pattern = re.compile(r'[^a-zA-Z]+')
    system_prompt = "Translate English to Hindi"
    pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=max_length)
    result = pipe(f"[INST] <> {system_prompt} <>{query}[/INST]")
    result = result[0]['generated_text'].split('[/INST]')[1].split('  ')[0]
    return result

In [8]:
translate_to_hindi('Hello, how are you?')

Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.


' नमस्ते, आप कैसे हैं? नहीं जानता हूं कि आप कैसे हैं?'

In [9]:
translations = []
references = []
for i in range(0, len(english_sentences)):
    translations.append(translate_to_hindi(english_sentences[i]))
    references.append([hindi_sentences[i]])
bleu = sacrebleu.corpus_bleu(translations, references)
print(f"BLEU score on IN22 Gen: {bleu.score}") 

Input length of input_ids is 137, but `max_length` is set to 128. This can lead to unexpected behavior. You should consider increasing `max_new_tokens`.
Input length of input_ids is 139, but `max_length` is set to 128. This can lead to unexpected behavior. You should consider increasing `max_new_tokens`.
Input length of input_ids is 220, but `max_length` is set to 128. This can lead to unexpected behavior. You should consider increasing `max_new_tokens`.
Input length of input_ids is 153, but `max_length` is set to 128. This can lead to unexpected behavior. You should consider increasing `max_new_tokens`.
Input length of input_ids is 129, but `max_length` is set to 128. This can lead to unexpected behavior. You should consider increasing `max_new_tokens`.
Input length of input_ids is 165, but `max_length` is set to 128. This can lead to unexpected behavior. You should consider increasing `max_new_tokens`.
Input length of input_ids is 175, but `max_length` is set to 128. This can lead to

BLEU score on IN22 Gen: 25.893729634826876


# Tatoeba Challenge

In [10]:
# Load the dataset
df = pd.read_csv('/kaggle/input/tatoeba-challenge/Tatoeba-Challenge.csv')
english_sentences = df['English']
hindi_sentences = df['Hindi']

In [11]:
translations = []
references = []
for i in range(0, len(english_sentences)):
    translations.append(translate_to_hindi(english_sentences[i]))
    references.append([hindi_sentences[i]])
bleu = sacrebleu.corpus_bleu(translations, references)
print(f"BLEU scoreon Tatoeba Challenge: {bleu.score}") 

BLEU scoreon Tatoeba Challenge: 12.605968092174914
