<a href="https://colab.research.google.com/github/badhon1512/large-language-model-LLM/blob/main/fune_tune_gemini2b_for_english_to_bengali_translation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# In this notebook, I will fune tune Gemini 2b model for bengali transaltion.

## Installing all required dependencies

In [None]:
!pip install -U bitsandbytes accelerate

In [33]:
!pip install -U bitsandbytes peft trl accelerate datasets transformers




## Importing all the libraries

In [35]:
import os
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, GemmaTokenizer, TrainingArguments
from peft import LoraConfig
from datasets import load_dataset
from trl import SFTTrainer


## Model config setup

In [36]:
your_token = "######"
model_id = "google/gemma-2b"
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id, token=your_token)
tokenizer.padding_side = "right"
lora_config = LoraConfig(r=8, target_modules=["q_proj", "k_proj", "v_proj"], task_type="CAUSAL_LM")
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto", token=your_token)


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

 ## Dataset loading


In [41]:
dataset = load_dataset("csebuetnlp/BanglaNMT", split="train[:100000]")

In [42]:
def preprocess_data(examples):
    inputs = [f"translate English to Bengali: {src}" for src in examples["en"]]
    targets = [f"{tgt}" for tgt in examples["bn"]]
    model_inputs = tokenizer(
        inputs, text_target=targets, truncation=True, max_length=512, padding="max_length"
    )
    return model_inputs

In [43]:
tokenized_data = dataset.map(preprocess_data, batched=True)

Map:   0%|          | 0/100000 [00:00<?, ? examples/s]

In [44]:
def translate_single_english_sentence(en_sentence, model):
    input_prompt = f"translate English to Bengali: {en_sentence}"

    # Tokenize the input sentence with truncation
    inputs = tokenizer(
        input_prompt,
        return_tensors="pt",
        truncation=True,        # Truncate if it exceeds max length
        max_length=512,
        padding="max_length"
    ).to("cuda:0")

    # Generate the translation
    outputs = model.generate(
        **inputs,
        max_new_tokens=200,
        num_beams=2,
        early_stopping=True
    )

    # Decode the generated tokens to get the translated text
    translated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return translated_text


## Testing  model before fune tunning


In [45]:
# Example long Bangla sentence
sentence = (
    "I have some news."
)

# Translate the sentence
translated_sentence = translate_single_english_sentence(sentence, model)

# Print the result
print("Eng:")
print(sentence)
print("\nBng:")
print(translated_sentence)


Eng:
I have some news.

Bng:
translate English to Bengali: I have some news.%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%2



> Add blockquote



Isn't the response very poor?

### Let's finetune the model using englis quotes dataset so that we can get perfect quotes

In [49]:
trainer = SFTTrainer(
    model=model,
    train_dataset=tokenized_data,
    args=TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=3,
        warmup_steps=2,
        max_steps=10,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit",
        report_to="none"
    ),
    peft_config=lora_config,
    formatting_func=lambda example: f"translate English to Bengali: {example['bn']}\nEnglish: {example['en']}"
)
trainer.train()



Step,Training Loss
1,14.0131
2,16.9684
3,17.913
4,16.871
5,14.7809
6,12.8211
7,14.5657
8,13.9992
9,11.8179
10,16.1613



Cannot access gated repo for url https://huggingface.co/google/gemma-2b/resolve/main/config.json.
Access to model google/gemma-2b is restricted. You must have access to it and be authenticated to access it. Please log in. - silently ignoring the lookup for the file config.json in google/gemma-2b.


TrainOutput(global_step=10, training_loss=14.99116916656494, metrics={'train_runtime': 18.8643, 'train_samples_per_second': 1.59, 'train_steps_per_second': 0.53, 'total_flos': 182765978910720.0, 'train_loss': 14.99116916656494, 'epoch': 0.0003})

In [51]:
# Example long Bangla sentence
sentence = (
    "I have some news."
)

# Translate the sentence
translated_sentence = translate_single_english_sentence(sentence, model)

# Print the result
print("Eng:")
print(sentence)
print("\nBng:")
print(translated_sentence)


Eng:
I have some news.

Bng:
translate English to Bengali: I have some news.

Answer:

Step 1/3
1. First, we need to identify the subject of the sentence. In this case, the subject is "I".

Step 2/3
2. Next, we need to identify the verb. In this case, the verb is "have".

Step 3/3
3. Finally, we need to identify the object of the sentence. In this case, the object is "some news". So, the Bengali translation of "I have some news" would be: আমি কিছু খবর আছে।


## The performence of fune tuned model is better than pre trained model, but still not perfect.