## Sequence-to-sequence language translation using Gemma  
This notebook trains Gemma to understand Malayalam, a South Indian language, and to translate it into English.  

Tools and libraries used: **PyTorch, bitsandbytes, Hugging Face, LoRA, and PEFT.**  

One of the unique properties of this training is the dataset.  
It is an English-Malayalam parallel corpus containing roughly 3.6 lakh (about 360,000) sentence pairs. 
The English sentences come from the COCO dataset and were translated into Malayalam using the Google API. You can find the trainable text files at: 

[English-Malayalam Translation Corpus](https://github.com/eksubin/English-Malayalam-Translation-Corpus?tab=readme-ov-file).  

If you are unsure about training Gemma using PyTorch, please refer to my previous notebooks on inferencing and fine-tuning Gemma using PyTorch:  
- [Understanding Gemma Fine-tuning Using Pytorch](https://www.kaggle.com/code/subinek/understanding-gemma-finetuning-using-pytorch)  
- [Gemma Inferencing with Pytorch](https://www.kaggle.com/code/subinek/gemma-inferencing-with-pytorch)

Please upvote this notebook if you find it interesting.

### Install all the required libraries
- Hugging Face (`trl`, `transformers`) : For downloading the PyTorch-trainable version of Gemma and running supervised fine-tuning
- bitsandbytes : Quantization backend used to load the model in 4-bit precision
- peft : Parameter-efficient fine-tuning (LoRA)
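
Because `trl` is installed from its development branch below, the package versions in the environment can differ from one run to the next. A minimal sketch for recording them, using only the standard library; the helper name `installed_versions` is illustrative and not part of any of these libraries:

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package name: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Print whatever is currently installed for the libraries used in this notebook
print(installed_versions(["trl", "transformers", "bitsandbytes", "peft"]))
```

Running this right after the install cells makes it easier to reproduce a run later.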

In [1]:
#!pip install -U trl
!pip install git+https://github.com/huggingface/trl.git

Collecting git+https://github.com/huggingface/trl.git
  Cloning https://github.com/huggingface/trl.git to /tmp/pip-req-build-4s_zavbk
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/trl.git /tmp/pip-req-build-4s_zavbk
  Resolved https://github.com/huggingface/trl.git to commit d6a8f2c2f6780affae11e8fbab6101ceeebaf0ca
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting transformers>=4.46.0 (from trl==0.13.0.dev0)
  Downloading transformers-4.46.3-py3-none-any.whl.metadata (44 kB)
Downloading transformers-4.46.3-py3-none-any.whl (10.0 MB)
Building wheel

In [2]:
!pip install -U bitsandbytes

Collecting bitsandbytes
  Downloading bitsandbytes-0.44.1-py3-none-manylinux_2_24_x86_64.whl.metadata (3.5 kB)
Downloading bitsandbytes-0.44.1-py3-none-manylinux_2_24_x86_64.whl (122.4 MB)
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0.44.1


In [3]:
!pip install -U peft

Collecting peft
  Downloading peft-0.13.2-py3-none-any.whl.metadata (13 kB)
Downloading peft-0.13.2-py3-none-any.whl (320 kB)
Installing collected packages: peft
Successfully installed peft-0.13.2


In [4]:
!pip install ipywidgets --upgrade

Collecting ipywidgets
  Downloading ipywidgets-8.1.5-py3-none-any.whl.metadata (2.3 kB)
Collecting widgetsnbextension~=4.0.12 (from ipywidgets)
  Downloading widgetsnbextension-4.0.13-py3-none-any.whl.metadata (1.6 kB)
Collecting jupyterlab-widgets~=3.0.12 (from ipywidgets)
  Downloading jupyterlab_widgets-3.0.13-py3-none-any.whl.metadata (4.1 kB)
Downloading ipywidgets-8.1.5-py3-none-any.whl (139 kB)
Downloading jupyterlab_widgets-3.0.13-py3-none-any.whl (214 kB)
Downloading widgetsnbextension-4.0.13-py3-none-any.whl (2.3 MB)
Installing collected packages: widgetsnbextension, jupyterlab-widgets, ipywidget

In [5]:
import os

import torch
import wandb
from datasets import Dataset
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient
from peft import LoraConfig, get_peft_model
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from trl import SFTTrainer

user_secrets = UserSecretsClient()
huggingface_token = user_secrets.get_secret("HUGGINGFACE_TOKEN")
wandb_api_key = user_secrets.get_secret("WANDB_API_KEY")

login(token=huggingface_token)
os.environ["WANDB_API_KEY"] = wandb_api_key
wandb.init(project="gemma2", name="malayalamTranslation")

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.

wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.

Token is valid (permission: fineGrained).
Your token has been saved to /root/.cache/huggingface/token
Login successful

wandb: Currently logged in as: subinek. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.18.3
wandb: Run data is saved locally in /kaggle/working/wandb/run-20241129_205650-uw0jkd0i
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run malayalamTranslation
wandb: ⭐️ View project at https://wandb.ai/subinek/gemma2
wandb: 🚀 View run at https://wandb.ai/subinek/gemma2/runs/uw0jkd0i


### Load the model
* `model_id` : The model ID used to download Gemma from Hugging Face.
* `BitsAndBytesConfig` : A configuration class from `transformers` that sets up quantization through the bitsandbytes library. Here the model is quantized to 4-bit precision (NF4), with computation carried out in `bfloat16`.
* `AutoTokenizer` : Loads the tokenizer specific to the `model_id`.
* `AutoModelForCausalLM` : Loads the model with the quantization config.
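
To see why 4-bit loading matters on a single Kaggle GPU, the arithmetic below estimates the memory taken by the weights alone at different precisions. This is a rough sketch: the parameter count of about 2.5 billion for `gemma-2b` is an assumption to be checked against the model card, and the real footprint is higher because some layers stay in higher precision and activations, LoRA adapters, and optimizer state also need memory.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# ASSUMPTION: roughly 2.5 billion parameters; check the model card for the exact count
NUM_PARAMS = 2.5e9

for label, bits in [("float32", 32), ("bfloat16", 16), ("4-bit NF4", 4)]:
    print(f"{label:>10}: ~{weight_memory_gb(NUM_PARAMS, bits):.2f} GB")
```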

In [6]:
# Load the model and the tokenizers
model_id = "google/gemma-2b"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map={"":0})


`config.hidden_act` is ignored, you should use `config.hidden_activation` instead.
Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use
`config.hidden_activation` if you want to override this behaviour.
See https://github.com/huggingface/transformers/pull/29402 for more details.



### Inferencing the Gemma Model
- Run a quick test inference with the base model before fine-tuning.

In [7]:
# Perform inference with the Gemma model
text = "Quote: Imagination is more"  # Input text for the model
device = "cuda:0"  # Specify the device to use (GPU)

# Tokenize the input text and convert it to tensor format, moving it to the specified device
inputs = tokenizer(text, return_tensors="pt").to(device)

# Generate output from the model with a limit on the number of new tokens
outputs = model.generate(**inputs, max_new_tokens=20)

# Decode the generated tokens back to text and print the result, skipping special tokens
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Quote: Imagination is more important than knowledge. Knowledge is limited. Imagination encircles the world.

- Albert Einstein

The


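### Training Gemma Using Malayalam
- Download the English-Malayalam translation corpus from the provided GitHub link.
- The dataset contains about 360,000 sentence pairs (358,213 for training and 3,619 for validation after the split used here). Malayalam is the source and English is the target.
- The Malayalam sentences were created using the Google Translate API.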

In [8]:
# Get the english to malayalam corpus
!wget https://raw.githubusercontent.com/eksubin/English-Malayalam-Translation-Corpus/refs/heads/main/english.txt
!wget https://raw.githubusercontent.com/eksubin/English-Malayalam-Translation-Corpus/refs/heads/main/malayalam.txt

  pid, fd = os.forkpty()
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


--2024-11-29 20:59:08--  https://raw.githubusercontent.com/eksubin/English-Malayalam-Translation-Corpus/refs/heads/main/english.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19281987 (18M) [text/plain]
Saving to: 'english.txt'


2024-11-29 20:59:10 (36.5 MB/s) - 'english.txt' saved [19281987/19281987]



huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


--2024-11-29 20:59:12--  https://raw.githubusercontent.com/eksubin/English-Malayalam-Translation-Corpus/refs/heads/main/malayalam.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 58141526 (55M) [application/octet-stream]
Saving to: 'malayalam.txt'


2024-11-29 20:59:17 (61.1 MB/s) - 'malayalam.txt' saved [58141526/58141526]



In [9]:
# Test samples from the dataset
with open('/kaggle/working/english.txt', 'r', encoding='utf-8') as en_file:
    english_lines = en_file.readlines()

with open('/kaggle/working/malayalam.txt', 'r', encoding='utf-8') as ml_file:
    malayalam_lines = ml_file.readlines()

# Example: Print the first 5 lines
print('English:', english_lines[:5])
print('Malayalam:', malayalam_lines[:5])

English: ['A large freight train sits in a train station.\n', 'People boat down a river with flags strung across it.\n', 'A set of railway tracks as seen from a train.\n', 'A canoe with a motor speeding down a river.\n', 'A yellow van with people trying to display something in an empty square.\n']
Malayalam: ['ഒരു വലിയ ചരക്ക് ട്രെയിൻ ഒരു ട്രെയിൻ സ്റ്റേഷനിൽ ഇരിക്കുന്നു.\n', 'ആളുകൾ ഒരു നദിക്കരയിൽ പതാകകൾ പതിച്ചിട്ടുണ്ട്.\n', 'ട്രെയിനിൽ നിന്ന് കാണുന്നതുപോലെ ഒരു കൂട്ടം റെയിൽവേ ട്രാക്കുകൾ.\n', 'ഒരു നദിയിലൂടെ മോട്ടോർ വേഗതയുള്ള ഒരു പീരങ്കി.\n', 'ശൂന്യമായ സ്ക്വയറിൽ എന്തെങ്കിലും പ്രദർശിപ്പിക്കാൻ ശ്രമിക്കുന്ന ആളുകളുള്ള ഒരു മഞ്ഞ വാൻ.\n']


In [10]:
# Load your Malayalam-to-English dataset
def load_data(malayalam_file, english_file):
    with open(malayalam_file, "r", encoding="utf-8") as m_file, open(english_file, "r", encoding="utf-8") as e_file:
        malayalam_lines = m_file.readlines()
        english_lines = e_file.readlines()

    # Ensure both files have the same number of lines
    assert len(malayalam_lines) == len(english_lines), "Mismatch in line counts between source and target files."
    
    # Create a Dataset object
    data = {"source": malayalam_lines, "target": english_lines}
    return Dataset.from_dict(data)

# Paths to your training data
malayalam_file = "malayalam.txt"
english_file = "english.txt"

# Load data
dataset = load_data(malayalam_file, english_file)

# Split into training and validation sets
data = dataset.train_test_split(test_size=0.01)
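
`readlines()` keeps the trailing newline on every sentence (visible as `\n` in the sample printed earlier), so each training sample ends with a newline character. A minimal sketch of stripping the lines and dropping empty pairs before building the dataset; `clean_pairs` is a hypothetical helper, not part of the original notebook:

```python
def clean_pairs(source_lines, target_lines):
    """Strip whitespace from aligned lines and drop pairs where either side is empty."""
    if len(source_lines) != len(target_lines):
        raise ValueError("Source and target must have the same number of lines.")
    sources, targets = [], []
    for src, tgt in zip(source_lines, target_lines):
        src, tgt = src.strip(), tgt.strip()
        if src and tgt:  # keep only pairs where both sentences are present
            sources.append(src)
            targets.append(tgt)
    return {"source": sources, "target": targets}


example = clean_pairs(
    ["ഒരു മഞ്ഞ വാൻ.\n", "\n"],
    ["A yellow van.\n", "A line without a Malayalam counterpart.\n"],
)
print(example)
```

The returned dictionary has the same `source` and `target` keys that `load_data` passes to `Dataset.from_dict`.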

In [11]:
# Provide the LoRA config details (rank-8 adapters on the attention and MLP projections)
lora_config = LoraConfig(
    r=8,
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

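# Formatting function for Malayalam-to-English translation
def formatting_func(example):
    # Tokenize the source (Malayalam) and target (English) texts separately,
    # padding or truncating both to 128 tokens so every sample has the same length
    inputs = tokenizer(example["source"], max_length=128, truncation=True, padding="max_length")
    targets = tokenizer(example["target"], max_length=128, truncation=True, padding="max_length")
    
    # Return tokenized inputs and labels.
    # Caveat: a causal LM expects `labels` to be the same token sequence as
    # `input_ids` (the one-position shift happens inside the model), so labels
    # taken from a separately tokenized target are not aligned with the source tokens.
    return {
        "input_ids": inputs["input_ids"],
        "attention_mask": inputs["attention_mask"],
        "labels": targets["input_ids"],
    }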

# Map the formatting function over the dataset
tokenized_data = data.map(formatting_func, batched=True, remove_columns=["source", "target"])


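# Training arguments for fine-tuning
sft_arguments = TrainingArguments(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,  # Effective batch size: 2 * 8 = 16 sequences per optimizer step
    warmup_steps=10,
    max_steps=100,  # 100 steps * 16 sequences = 1,600 samples, about 0.45% of the training split
    learning_rate=3e-5,
    fp16=True,  # Mixed-precision training for efficiency
    logging_steps=10,
    output_dir="outputs",
    save_steps=50,
    eval_strategy="steps",  # Named `evaluation_strategy` in older transformers releases
    eval_steps=50,
    save_total_limit=3,
)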


# Initialize the trainer
trainer = SFTTrainer(
    model=model,
    train_dataset=tokenized_data["train"],
    eval_dataset=tokenized_data["test"],
    args=sft_arguments,
    peft_config=lora_config,
    tokenizer=tokenizer,
)

# Start fine-tuning
trainer.train()

Map:   0%|          | 0/358213 [00:00<?, ? examples/s]

Map:   0%|          | 0/3619 [00:00<?, ? examples/s]

  trainer = SFTTrainer(
  self.scaler = torch.cuda.amp.GradScaler(**kwargs)
max_steps is given, it will override any value given in num_train_epochs


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 50   | 2.5939        | 2.569399        |
| 100  | 2.4533        | 2.410065        |

TrainOutput(global_step=100, training_loss=2.6951832008361816, metrics={'train_runtime': 1405.8163, 'train_samples_per_second': 1.138, 'train_steps_per_second': 0.071, 'total_flos': 2447388966912000.0, 'train_loss': 2.6951832008361816, 'epoch': 0.004466603761997019})
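
For a decoder-only model such as Gemma, a common alternative to tokenizing source and target separately is to concatenate them into one sequence and compute the loss only on the target part. The sketch below illustrates that idea on plain lists of token IDs so it runs without a tokenizer; it is not the method used in this notebook, and `build_causal_example` is a hypothetical helper. In practice `prompt_ids` would come from tokenizing the Malayalam sentence (with whatever prompt template you choose) and `target_ids` from the English sentence plus an end-of-sequence token.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def build_causal_example(prompt_ids, target_ids, max_length, pad_id):
    """Concatenate prompt and target token IDs and supervise only the target tokens."""
    input_ids = (prompt_ids + target_ids)[:max_length]
    # Prompt positions get IGNORE_INDEX so they do not contribute to the loss
    labels = ([IGNORE_INDEX] * len(prompt_ids) + target_ids)[:max_length]
    attention_mask = [1] * len(input_ids)

    # Right-pad to a fixed length; padded positions are excluded from attention and loss
    pad_len = max_length - len(input_ids)
    input_ids = input_ids + [pad_id] * pad_len
    labels = labels + [IGNORE_INDEX] * pad_len
    attention_mask = attention_mask + [0] * pad_len
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


example = build_causal_example(prompt_ids=[5, 6, 7], target_ids=[8, 9], max_length=8, pad_id=0)
print(example)
```

Whether the trainer keeps a precomputed `labels` column depends on the data collator in use, so that is worth checking for the installed `trl` version before relying on this layout.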

In [12]:
# Save the fine-tuned model (for a PEFT model this writes the LoRA adapter weights) and the tokenizer
model.save_pretrained("custom_model.ckpt")
tokenizer.save_pretrained("custom_tokenizer.model")

('custom_tokenizer.model/tokenizer_config.json',
 'custom_tokenizer.model/special_tokens_map.json',
 'custom_tokenizer.model/tokenizer.model',
 'custom_tokenizer.model/added_tokens.json',
 'custom_tokenizer.model/tokenizer.json')

### Inferencing the trained model

In [13]:
# Function to translate Malayalam text to English
def translate_malayalam_to_english(text, tokenizer=tokenizer, model=model, max_length=128):
    # Tokenize the input text and move the tensors to the same device as the model
    inputs = tokenizer(text, return_tensors="pt", max_length=max_length, truncation=True).to(model.device)
    
    # Generate translation using the model
    output = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=max_length,
        num_beams=4,  # Beam search for better translations
        early_stopping=True
    )
    
    # Decode the output and return the English translation
    translation = tokenizer.decode(output[0], skip_special_tokens=True)
    return translation

# Example Malayalam text for testing
malayalam_text = "ഞാൻ ഇന്ന് രാവിലെ ഉണർന്നു. ഒരു ചായ കുടിച്ചു"  # "I woke up this morning. I had a cup of tea." in Malayalam

# Get the translation
english_translation = translate_malayalam_to_english(malayalam_text, tokenizer, model)

# Print the results
print(f"Malayalam: {malayalam_text}")
print(f"English: {english_translation}")



Malayalam: ഞാൻ ഇന്ന് രാവിലെ ഉണർന്നു. ഒരു ചായ കുടിച്ചു
English: ഞാൻ ഇന്ന് രാവിലെ ഉണർന്നു. ഒരു ചായ കുടിച്ചു.
I woke up this morning. I had a cup of tea.
I woke up this morning. I had a cup of tea.
I woke up this morning. I had a cup of tea.
I woke up this morning. I had a cup of tea.
I woke up this morning. I had a cup of tea.
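
The output above starts with the Malayalam input and then repeats the English sentence. The echo happens because `generate` on a decoder-only model returns the prompt tokens followed by the continuation. A minimal sketch of the post-processing, shown on plain lists so it runs without the model; both helper names are hypothetical, and keeping only the first non-empty line is a heuristic that assumes one translation per line.

```python
def strip_prompt(output_ids, prompt_length):
    """Return only the tokens generated after the prompt."""
    return output_ids[prompt_length:]


def first_nonempty_line(text):
    """Keep the first non-empty line, discarding repeated continuations."""
    for line in text.splitlines():
        if line.strip():
            return line.strip()
    return ""


prompt_ids = [2, 101, 102, 103]          # stand-in for the tokenized Malayalam prompt
generated = prompt_ids + [201, 202, 1]   # stand-in for one row of the generate() output
new_tokens = strip_prompt(generated, len(prompt_ids))
print(new_tokens)

decoded = "\nI woke up this morning. I had a cup of tea.\nI woke up this morning. I had a cup of tea."
print(first_nonempty_line(decoded))
```

With the tensors from the cell above, the equivalent slice would be `output[0][inputs["input_ids"].shape[1]:]` before calling `tokenizer.decode`.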


# Thank You!

Thank you for taking the time to go through my Kaggle notebook! I appreciate your interest and support.

If you found the content helpful or enjoyable, please consider upvoting it. Your positive feedback means a lot to me!

Happy coding!

Best,  
Subin Erattakulangara