# DSN 2025 In-House Hackathon

Develop and Fine-Tune State-of-the-Art Machine Translation Systems for Yoruba, Igbo, and Hausa to English.


## Overview

The world's linguistic diversity is one of its greatest treasures, yet many low-resource languages lack the quality machine translation tools available to high-resource languages. This competition challenges the global data science community to help bridge that gap for three vibrant Nigerian languages: Yoruba, Igbo, and Hausa.

Goal:

Your objective is to develop and/or fine-tune state-of-the-art (SOTA) machine translation systems that achieve the highest possible accuracy when translating Yoruba, Igbo, and Hausa into English.
Competitors are encouraged to use any available external training data alongside the provided datasets. You are also free to run your code on any platform, although the GPU resources provided by Kaggle are typically sufficient for this task.

Use transfer learning, creative data augmentation, and cutting-edge techniques to push the boundaries of translation quality for Nigerian languages. Join us to make a lasting impact on linguistic accessibility in West Africa!

### Team Name: Neural Minds

Team Members 

1) Emmanuel Ebiendele  https://www.kaggle.com/emmydesign
2) Olanudun Oluwapelumi https://www.kaggle.com/oluwapelumiolanudun



### STEP 1: Environment Setup
This step prepares the workspace for model development.  
We install and import essential libraries such as Transformers, Datasets, PEFT, and Accelerate, which handle model loading, tokenization, efficient fine-tuning, and GPU acceleration.  
We also verify that a GPU is available and functioning properly, since it significantly speeds up training.  
In short, this step ensures that all dependencies, frameworks, and hardware resources are correctly configured before proceeding.

In [1]:
# =====================================
# STEP 1: Environment Setup
# =====================================
!pip install -q transformers==4.36.2 accelerate==0.26.1 datasets==2.16.1 \
    sacrebleu==2.3.2 sentencepiece==0.1.99 peft==0.10.0 pandas==2.2.2

# Import library packages (duplicates removed)
import os
import time

import pandas as pd
import torch
from datasets import Dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          BitsAndBytesConfig, DataCollatorForSeq2Seq,
                          Seq2SeqTrainingArguments, Seq2SeqTrainer)
print("GPU available:", torch.cuda.is_available())
!nvidia-smi



GPU available: True
Thu Oct 23 19:04:48 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   41C    P8              9W /   70W |       2MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
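
Because the install line pins exact versions, it can help to confirm that the running kernel actually picked them up (Colab sometimes keeps a preinstalled version until the runtime restarts). The following is a minimal sketch using only the standard library; the helper name `check_versions` is hypothetical and not part of the original notebook.

```python
from importlib import metadata

# Pins copied from the pip install line in the cell above.
PINNED = {
    "transformers": "4.36.2",
    "accelerate": "0.26.1",
    "datasets": "2.16.1",
    "sacrebleu": "2.3.2",
    "sentencepiece": "0.1.99",
    "peft": "0.10.0",
    "pandas": "2.2.2",
}

def check_versions(pins):
    """Return {package: (expected, installed)} for each mismatch or missing package."""
    problems = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # the package is not installed at all
        if installed != expected:
            problems[name] = (expected, installed)
    return problems

# Print one line per package whose installed version differs from the pin.
for name, (expected, installed) in check_versions(PINNED).items():
    print(f"{name}: expected {expected}, found {installed}")
```

An empty result means every pin matches; anything printed is a candidate for a runtime restart or reinstall.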
                            

### STEP 2: Load Data
Here, we load the dataset that will be used for model training and evaluation.  
Using pandas, we import the training and test files (Excel) and the submission template (CSV) and inspect their structure.  
This step confirms which columns each file exposes, such as *input*, *Output*, and *Language* in the training file, and how many rows each contains; empty or null rows are dropped in the next step.  
It sets the foundation by making sure the data loads correctly and is ready for preprocessing.

In [2]:
train_df = pd.read_excel("/content/train.xlsx")
test_df  = pd.read_excel("/content/test.xlsx")
sub_df   = pd.read_csv("/content/Submission_template.csv")

print("train:", train_df.columns.tolist())
print("test:", test_df.columns.tolist())
print("sub :", sub_df.columns.tolist())
print("Train size:", len(train_df), " Test size:", len(test_df))


train: ['Output', 'input', 'Language']
test: ['Competition_ID', 'Input Text', 'Language']
sub : ['ID', 'Output text']
Train size: 135000  Test size: 597
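
The printed column lists show that the two files do not share a naming scheme: the training file calls the source text `input`, while the test file calls it `Input Text`. A small schema check, sketched below in plain Python, makes that kind of mismatch explicit before later cells index into the frames. The `REQUIRED_*` sets and the `missing_columns` helper are assumptions introduced for illustration, based on the columns the later steps read.

```python
# Column lists exactly as printed by the cell above.
TRAIN_COLUMNS = ["Output", "input", "Language"]
TEST_COLUMNS = ["Competition_ID", "Input Text", "Language"]

# Columns the later steps read from each file (assumed from the code that follows).
REQUIRED_TRAIN = {"input", "Output", "Language"}
REQUIRED_TEST = {"Competition_ID", "Input Text", "Language"}

def missing_columns(actual, required):
    """Return the required column names that are absent from `actual`, sorted."""
    return sorted(set(required) - set(actual))

print(missing_columns(TRAIN_COLUMNS, REQUIRED_TRAIN))  # training file vs. its own schema
print(missing_columns(TEST_COLUMNS, REQUIRED_TEST))    # test file vs. its own schema
print(missing_columns(TEST_COLUMNS, REQUIRED_TRAIN))   # test file vs. the training schema
```

The third call illustrates why the inference code has to read `Input Text` rather than `input`: the training schema is not satisfied by the test file.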


###  STEP 3: Data Preprocessing
In this step, we prepare the data for model consumption.  
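We apply a tokenizer to convert text sequences into numerical tokens, the only form the model can understand.  
Empty or null rows are dropped, sentences longer than 128 tokens are truncated, and the original text columns are removed after tokenization; padding to a common length happens later, per batch, in the data collator.  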
The processed dataset is then organized into training and validation splits, ensuring the model can be both trained and evaluated effectively.

In [3]:
# =====================================
# STEP 3 : Data Cleaning and preprocessing 
# =====================================
SRC_TAGS = {"yor": "yor_Latn", "ibo": "ibo_Latn", "hau": "hau_Latn"}
TGT_TAG = "eng_Latn"

MODEL_NAME = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Drop empty / null rows
train_df = train_df.dropna(subset=["input", "Output", "Language"]).copy()
train_df = train_df[train_df["input"].str.strip() != ""]
train_df = train_df[train_df["Output"].str.strip() != ""]

# Normalize language codes
def normalize_lang(x):
    x = str(x).strip().lower()
    if x.startswith("yo"): return "yor"
    if x.startswith("ig"): return "ibo"
    if x.startswith("ha"): return "hau"
    return "yor"  # default fallback

train_df["Language"] = train_df["Language"].map(normalize_lang)

# Preprocess function
def preprocess(batch):
    lang = batch["Language"]
    src_lang = SRC_TAGS.get(lang, "yor_Latn")
    tokenizer.src_lang = src_lang
    tokenizer.tgt_lang = TGT_TAG
    prefix = f"translate {lang} to english: "
    inputs = prefix + str(batch["input"])
    targets = str(batch["Output"])
    model_inputs = tokenizer(inputs, truncation=True, max_length=128)
    labels = tokenizer(text_target=targets, truncation=True, max_length=128)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

dataset = Dataset.from_pandas(train_df)
tokenized = dataset.map(preprocess, remove_columns=dataset.column_names)

split = tokenized.train_test_split(test_size=0.1, seed=42)
train_data, val_data = split["train"], split["test"]

print(" Tokenization complete:")
print("Train:", len(train_data), " Validation:", len(val_data))




Map:   0%|          | 0/132137 [00:00<?, ? examples/s]

 Tokenization complete:
Train: 118923  Validation: 13214
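
One detail of `normalize_lang` worth noting is its fallback: any label that matches none of the three prefixes is silently treated as Yoruba. That includes the function's own output `"ibo"`, which starts with `ib`, not `ig`, so re-running the cell on already normalized labels would relabel Igbo rows. The sketch below is a stricter variant under stated assumptions: the name `normalize_lang_strict` and the extra `ib` prefix are not in the original notebook, and the raw label values in the dataset are not shown here.

```python
def normalize_lang_strict(label):
    """Map a raw language label to 'yor', 'ibo' or 'hau'; return None if unrecognized."""
    text = str(label).strip().lower()
    # The ("ib", "ibo") entry is an added assumption so that "ibo" maps to itself.
    for prefix, code in (("yo", "yor"), ("ig", "ibo"), ("ib", "ibo"), ("ha", "hau")):
        if text.startswith(prefix):
            return code
    return None  # surface unknown labels instead of guessing

# Hypothetical labels for illustration only.
labels = [" Yoruba ", "Igbo", "ibo", "HAUSA", "French"]
unknown = [x for x in labels if normalize_lang_strict(x) is None]
print(unknown)
```

Returning `None` lets the caller decide whether to drop, inspect, or relabel unexpected rows rather than mixing them into the Yoruba data.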


### STEP 4: Training Setup
This step prepares the model for training.  
We load the base NLLB model in 4-bit precision with bitsandbytes and attach LoRA adapters with PEFT, so that only a small fraction of the parameters is trained.  
Hyperparameters such as batch size, learning rate, and number of epochs, along with the data collator, logging interval, and output directory, are passed to the Seq2SeqTrainer in the next step.  
This setup keeps the training process structured, trackable, and reproducible.

In [5]:
!pip install -q --upgrade bitsandbytes==0.43.2


In [6]:
# Step 4: Load model in 4-bit (NF4) and attach LoRA adapters
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForSeq2SeqLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto"
)

model = prepare_model_for_kbit_training(model)

config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="SEQ_2_SEQ_LM"
)
model = get_peft_model(model, config)
model.print_trainable_parameters()





trainable params: 1,179,648 || all params: 616,253,440 || trainable%: 0.1914225419983051
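
The trainable-parameter count printed above can be reproduced by hand, which is a useful sanity check that LoRA wrapped the layers we intended. The sketch below assumes the architecture of `facebook/nllb-200-distilled-600M` is 12 encoder layers, 12 decoder layers, and a hidden size of 1024; those three values are not printed anywhere in this notebook and are stated here as assumptions. The rank and target modules come from the `LoraConfig` above.

```python
# Assumed architecture of facebook/nllb-200-distilled-600M (not shown in this notebook).
D_MODEL = 1024
ENCODER_LAYERS = 12
DECODER_LAYERS = 12

# Values taken from the LoraConfig above.
RANK = 8
TARGETS_PER_ATTENTION = 2  # q_proj and v_proj

def lora_params_per_linear(in_features, out_features, rank):
    """LoRA adds A (rank x in_features) and B (out_features x rank) per wrapped layer."""
    return rank * in_features + out_features * rank

# Each encoder layer has one attention block (self-attention);
# each decoder layer has two (self-attention and cross-attention).
attention_blocks = ENCODER_LAYERS * 1 + DECODER_LAYERS * 2
wrapped_linears = attention_blocks * TARGETS_PER_ATTENTION
trainable = wrapped_linears * lora_params_per_linear(D_MODEL, D_MODEL, RANK)

total_reported = 616_253_440  # "all params" from the output above
print(trainable)
print(100 * trainable / total_reported)
```

Under these assumptions the arithmetic gives 72 wrapped projections and 1,179,648 trainable parameters, matching the figure reported by `print_trainable_parameters()`.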


### STEP 5: Model Training & Evaluation
Here, the model begins learning patterns from the data.  
It processes sentence pairs, compares its predicted translations with the correct targets, and adjusts its parameters to minimize error.  
Throughout the training, we monitor the training loss, which is logged every 50 steps; in-training evaluation is disabled in this run (`evaluation_strategy="no"`), so the validation split is held out but not scored during training.  
The goal is to produce a well-generalized model capable of accurate translation beyond the training data.
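
To make the "error" being minimized concrete: the training loss reported for a seq2seq model of this kind is, as far as we understand it, the mean token-level cross-entropy over target positions, with padded label positions (marked `-100` by the data collator) excluded. The sketch below illustrates that calculation in plain Python on made-up probabilities; the function name `masked_cross_entropy` and the toy numbers are hypothetical and not taken from the model.

```python
import math

IGNORE_INDEX = -100  # label value assigned to padded target positions

def masked_cross_entropy(token_probs, labels):
    """Mean negative log-likelihood over positions whose label is not IGNORE_INDEX.

    token_probs[i] maps candidate token ids to the predicted probability at position i.
    """
    losses = [
        -math.log(probs[label])
        for probs, label in zip(token_probs, labels)
        if label != IGNORE_INDEX
    ]
    if not losses:
        raise ValueError("all positions are masked")
    return sum(losses) / len(losses)

# Hypothetical three-position target; the last position is padding and is skipped.
probs = [{7: 0.5, 9: 0.5}, {7: 0.25, 9: 0.75}, {7: 1.0}]
labels = [7, 7, IGNORE_INDEX]
loss = masked_cross_entropy(probs, labels)
print(loss, math.exp(loss))  # loss and the corresponding perplexity
```

Exponentiating a logged loss gives a perplexity, which can be easier to interpret: a drop in training loss from 2.4402 to 1.961 corresponds to perplexity falling from roughly 11.5 to roughly 7.1.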

In [11]:
# =====================================
# STEP 5: integrating QLoRA-safe Trainer into the model (no graph detach)
# =====================================

OUT_DIR = "/content/results"
os.makedirs(OUT_DIR, exist_ok=True)

# ---- DO NOT cast to float32; keep 4-bit base + LoRA adapters
# model = model.float()  # breaks quant hooks

# Required for training decoder models; also prevents cache/ckpt conflicts
model.config.use_cache = False

# Make sure inputs carry gradient when needed (different Transformers versions handle this differently)
try:
    model.enable_input_require_grads()
except Exception:
    emb = model.get_input_embeddings()
    def _grad_hook(module, inputs, output):
        try:
            output.requires_grad_(True)
        except Exception:
            pass
    emb.register_forward_hook(_grad_hook)

# (Optional) Turn OFF gradient checkpointing while you debug grads.
# If you need it for VRAM later, re-enable after training is confirmed OK.
try:
    model.gradient_checkpointing_disable()
except Exception:
    pass

# Collator (lets it pad + set label pad to -100)
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, padding="longest")

# (Recommended) Fix forced BOS to English for seq2seq models like NLLB
try:
    forced_bos = tokenizer.lang_code_to_id["eng_Latn"]
    model.config.forced_bos_token_id = forced_bos
except Exception:
    pass

# TF32 for speed (only takes effect on Ampere or newer GPUs; harmless but
# without effect on the Turing-based Tesla T4 used in this run)
torch.backends.cuda.matmul.allow_tf32 = True

args = Seq2SeqTrainingArguments(
    output_dir=OUT_DIR,
    num_train_epochs=1.5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=1,
    learning_rate=3e-4,

    # Mixed precision is fine with QLoRA on 4.36
    fp16=True,
    bf16=False,

    # Keep simple to remove moving parts while debugging
    evaluation_strategy="no",
    save_strategy="no",
    logging_steps=50,
    predict_with_generate=False,
    report_to=[],
    dataloader_pin_memory=False,      # avoids sync hiccups
    dataloader_num_workers=2,         # mild speedup without flakiness
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_data,
    eval_dataset=val_data,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

print(" Training start (QLoRA safe mode)...")
t0 = time.time()
trainer.train()
print(f" Done. Train time: {(time.time()-t0)/60:.2f} min")

trainer.save_model(OUT_DIR)
tokenizer.save_pretrained(OUT_DIR)
print("Saved to:", OUT_DIR)


🚀 Training start (QLoRA safe mode)...


| Step | Training Loss |
|------|---------------|
| 50   | 2.4402 |
| 100  | 2.0952 |
| 150  | 2.0702 |
| 200  | 2.0106 |
| 250  | 1.8898 |
| 300  | 1.9226 |
| 350  | 1.8383 |
| 400  | 1.9124 |
| 450  | 1.8947 |
| 500  | 1.961  |


✅ Done. Train time: 113.31 min
💾 Saved to: /content/results
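
The log excerpt above stops at step 500, so it is helpful to estimate how many optimizer steps the full run contained. The sketch below derives that from the training arguments and the dataset size. It assumes a single GPU (one Tesla T4 is listed in Step 1), that the last partial batch is kept, and that the Trainer rounds the total up; the rounding rule reflects our understanding of the Trainer and should be read as an assumption.

```python
import math

train_examples = 118_923   # "Train:" size printed after tokenization
batch_size = 8             # per_device_train_batch_size
grad_accum = 1             # gradient_accumulation_steps
epochs = 1.5               # num_train_epochs
train_minutes = 113.31     # wall-clock time reported above

# Batches per epoch, keeping the final partial batch.
batches_per_epoch = math.ceil(train_examples / batch_size)
# One optimizer update per `grad_accum` batches.
updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
total_steps = math.ceil(epochs * updates_per_epoch)
seconds_per_step = train_minutes * 60 / total_steps

print(batches_per_epoch, total_steps, round(seconds_per_step, 3))
```

Under these assumptions the run comprises a little over 22,000 steps at roughly 0.3 seconds each, so the ten logged values shown cover only the first few percent of training.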


### STEP 6: Inference & Submission
In the final step, we use the trained model to generate translations on unseen text (the test set).  

The output predictions are decoded back into human-readable sentences and stored in a CSV submission file that follows the provided template.

In [12]:
# =====================================
# STEP 6: Inference → submission.csv
# =====================================
import pandas as pd, torch, os, time
from tqdm import tqdm

# Load your test and sample submission files again if needed
TEST_XLSX = '/content/test.xlsx'  # adjust if path differs
SUB_CSV = '/content/Submission_template.csv'

test_df = pd.read_excel(TEST_XLSX)
sub_df = pd.read_csv(SUB_CSV)

# Normalize language labels just like in training
def normalize_lang(x):
    x = str(x).strip().lower()
    if x.startswith("yo"): return "yor"
    if x.startswith("ig"): return "ibo"
    if x.startswith("ha"): return "hau"
    return "yor"

test_df["Language"] = test_df["Language"].map(normalize_lang)

SRC_TAGS = {"yor": "yor_Latn", "ibo": "ibo_Latn", "hau": "hau_Latn"}
TGT_TAG = "eng_Latn"



MODEL_DIR = "/content/results"
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_DIR)
model.to("cuda" if torch.cuda.is_available() else "cpu")
model.eval()

def translate_batch(texts, lang, max_len=128, num_beams=4):
    """Generate translations for a batch of texts."""
    tokenizer.src_lang = SRC_TAGS[lang]
    tokenizer.tgt_lang = TGT_TAG
    prefix = f"translate {lang} to english: "
    results = []
    BS = 32
    device = model.device

    for i in tqdm(range(0, len(texts), BS), desc=f"Translating {lang.upper()}"):
        chunk = [prefix + str(x) for x in texts[i:i+BS]]
        enc = tokenizer(chunk, return_tensors="pt", truncation=True, padding=True, max_length=max_len)
        enc = {k: v.to(device) for k, v in enc.items()}

        with torch.no_grad():
            gen = model.generate(**enc, max_length=max_len, num_beams=num_beams, no_repeat_ngram_size=3)
        decoded = tokenizer.batch_decode(gen, skip_special_tokens=True)
        results.extend(decoded)
    return results

# Perform translations per language
preds = []
for lg in ["yor", "ibo", "hau"]:
    part = test_df[test_df["Language"] == lg]
    if len(part) == 0:
        continue
    translations = translate_batch(part["Input Text"].tolist(), lg)
    preds.append(pd.DataFrame({"ID": part["Competition_ID"].values, "Output text": translations}))

# Combine all results
pred_df = pd.concat(preds, axis=0).reset_index(drop=True)
submission = sub_df[["ID"]].merge(pred_df, on="ID", how="left")
submission["Output text"] = submission["Output text"].fillna("")

# Save to file
SUB_PATH = "/content/neural_minds_translate_.csv"
submission.to_csv(SUB_PATH, index=False)
print("\n Submission file created successfully:", SUB_PATH)


Translating YOR: 100%|██████████| 7/7 [00:17<00:00,  2.44s/it]
Translating IBO: 100%|██████████| 6/6 [00:18<00:00,  3.01s/it]
Translating HAU: 100%|██████████| 8/8 [00:24<00:00,  3.02s/it]


 Submission file created successfully: /content/neural_minds_translate_.csv
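
Because the submission is built with a left merge followed by `fillna("")`, any template ID that received no prediction ends up as an empty string without raising an error. A quick coverage check before saving can catch that. The sketch below works on plain sequences of IDs, so it could be called with `sub_df["ID"]` and `pred_df["ID"]`; the helper name `coverage_report` and the sample IDs are hypothetical.

```python
from collections import Counter

def coverage_report(template_ids, predicted_ids):
    """Compare the IDs the template expects with the IDs that received a prediction."""
    template = list(template_ids)
    predicted = list(predicted_ids)
    counts = Counter(predicted)
    return {
        "missing": sorted(set(template) - set(predicted)),     # would become ""
        "unexpected": sorted(set(predicted) - set(template)),  # dropped by the left merge
        "duplicated": sorted(i for i, n in counts.items() if n > 1),  # would add rows
    }

# Hypothetical IDs for illustration only.
report = coverage_report(["a1", "a2", "a3"], ["a1", "a3", "a3", "zz"])
print(report)
```

All three lists being empty indicates that the merged submission has exactly one non-empty prediction slot per template row.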





After submitting, we placed 8th on the private leaderboard with a private score of 0.72236.

## What to improve

- Hyperparameter tuning for better model generalization.  
- Data augmentation or external corpus integration for richer language coverage.  
- Fine-tuning variants beyond the current setup (LoRA with r=8 on `q_proj` and `v_proj`, fp16 mixed precision) to balance efficiency and translation quality.


### 🤝 Connect & Collaborate
I’m open to collaboration, mentorship, and professional connections in Data Science, Machine Learning, and NLP research.  
Let’s connect, share ideas, and build impactful AI solutions together.

- 💼 LinkedIn: [linkedin.com/in/emmanuel-ebiendele-063ba0255](https://www.linkedin.com/in/emmanuel-ebiendele-063ba0255)  
- ⭐️ GitHub: [github.com/emmanuel-123tech](https://github.com/emmanuel-123tech)  

 *Authored by Emmanuel Ebiendele - Data Scientist & Machine Learning Engineer*