<a href="https://colab.research.google.com/github/fachiny17/machine_learning/blob/main/dsn_inhouse_hackathon/dsn_inhouse_hackathon.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 2025 DSN AI Bootcamp In-House Hackathon

Visit the [Kaggle competition page](https://www.kaggle.com/competitions/dsn-bootcamp-in-house-hackathon/overview) for details about the contest.

In [1]:
# Install all required packages
!pip install transformers datasets sentencepiece accelerate evaluate rouge-score bert-score torchview nltk sacrebleu
!pip install --upgrade transformers datasets

Collecting evaluate
  Downloading evaluate-0.4.6-py3-none-any.whl.metadata (9.5 kB)
Collecting rouge-score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Collecting bert-score
  Downloading bert_score-0.3.13-py3-none-any.whl.metadata (15 kB)
Collecting torchview
  Downloading torchview-0.2.7-py3-none-any.whl.metadata (13 kB)
Collecting sacrebleu
  Downloading sacrebleu-2.5.1-py3-none-any.whl.metadata (51 kB)
Collecting portalocker (from sacrebleu)
  Downloading portalocker-3.2.0-py3-none-any.whl.metadata (8.7 kB)
Collecting colorama (from sacrebleu)
  Downloading colorama-0.4.6-py2.py3-none-any.whl.metadata (17 kB)
Downloading evaluate-0.4.6-py3-none-any.whl (84 kB)
Downloading bert_score-0.3.1

In [2]:
# Install tqdm for progress bars
!pip install tqdm


In [3]:
from google.colab import drive
import pandas as pd
import numpy as np
import os
import torch
import random

In [4]:
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer
)

from datasets import Dataset, load_dataset
import evaluate
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

In [5]:
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [6]:
drive_path = '/content/drive/MyDrive/dsn-inhouse-hackathon-files/'

In [7]:
print("Files in the folder:")
print(os.listdir(drive_path))

Files in the folder:
['train.xlsx', 'test.xlsx', 'Submission_template.csv']


In [8]:
# Load the datasets
train_df = pd.read_excel(drive_path + 'train.xlsx')
test_df = pd.read_excel(drive_path + 'test.xlsx')
sample_df = pd.read_csv(drive_path + 'Submission_template.csv')

In [9]:
train_df.head()

Unnamed: 0,Output,input,Language
0,"So, I find myself, over and over again, thinki...",оооооооооооооооооооооооооооооооооооооооооооооо...,Hausa
1,Especially in things where the connection to G...,"Karịsịa na ihe ebe na njikọ aka Chineke otuto,...",Igbo
2,"12 , 13 . ( a ) What is hyperbole ?\n","12 , 13 . ( a ) Kí ni àbùmọ́ ?\n",Yoruba
3,You and your story have helped me.\n,оооооооооооооооооооооооооооооооооооооооооооооо...,Hausa
4,CAUSE ALL PEOPLE TO BE TREATED EQUALLY,,Igbo


In [10]:
test_df.head()

Unnamed: 0,Competition_ID,Input Text,Language
0,IGB001,Onye ọ bụla tukotara ego iji fu na emeziri ihe...,Igbo
1,IGB002,Anyị bughariri ọrụ ụgwọ metara nile iji debe i...,Igbo
2,IGB003,Emeputara obere akwụkwọ ndekọ ka anyị wee nwee...,Igbo
3,IGB004,Anyị kwekọrịtara ka onye ọ bụla kwụọ ụgwọ ọnụ ...,Igbo
4,IGB005,Echetaram ha na-ntunye ụtụ imezi ihe nke ọma n...,Igbo


In [11]:
sample_df.head()

Unnamed: 0,ID,Output text
0,IGB001,
1,IGB002,
2,IGB003,
3,IGB004,
4,IGB005,


In [12]:
# Set random seeds for reproducibility
def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(42)
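
Seeding matters here because the augmentation step later in the notebook shuffles words with the global random generator. The sketch below shows the effect with the standard library only: re-seeding before a shuffle reproduces the same word order. `shuffled_words` is a hypothetical helper used for illustration, and the sentence is one of the training targets shown earlier.

```python
import random


def shuffled_words(sentence, seed):
    """Shuffle a sentence's words after re-seeding the global generator."""
    random.seed(seed)
    words = sentence.split()
    random.shuffle(words)  # in-place shuffle driven by the seeded generator
    return words


sentence = "You and your story have helped me."
first = shuffled_words(sentence, seed=42)
second = shuffled_words(sentence, seed=42)

# Same seed, same input: the two shuffles agree word for word.
print(first == second)  # True
```

Without the re-seed, the second call would continue from wherever the generator stopped, so the order would generally differ between runs of a cell.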

In [13]:
print("📊 DATA EXPLORATION")
print("=" * 50)

print("\nTraining Data Info:")
print(f"Shape: {train_df.shape}")
print(f"Columns: {list(train_df.columns)}")
print(f"\nMissing values:")
print(train_df.isnull().sum())

print(f"\nLanguage Distribution in Training:")
print(train_df['Language'].value_counts())

print(f"\nLanguage Distribution in Test:")
print(test_df['Language'].value_counts())

print("\nSample training examples:")
for i in range(2):
    lang = train_df['Language'].iloc[i]
    print(f"\n{lang.upper()}:")
    print(f"Source: {train_df['input'].iloc[i]}")
    print(f"Target: {train_df['Output'].iloc[i]}")

📊 DATA EXPLORATION
==================================================

Training Data Info:
Shape: (135000, 3)
Columns: ['Output', 'input', 'Language']

Missing values:
Output      1196
input        266
Language       0
dtype: int64

Language Distribution in Training:
Language
Yoruba    45055
Igbo      45001
Hausa     44944
Name: count, dtype: int64

Language Distribution in Test:
Language
Hausa     229
Yoruba    200
Igbo      168
Name: count, dtype: int64

Sample training examples:

HAUSA:
Source: оооооооооооооооооооооооооооооооооооооооооооооооооооооооовввввввввввввввввввввввв

Target: So, I find myself, over and over again, thinking about my German mother.


IGBO:
Source: Karịsịa na ihe ebe na njikọ aka Chineke otuto, ọ dịghị ka o doo anya.
Target: Especially in things where the connection to God's glory isn't as clear.
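
The Hausa sample above prints as a run of two repeated characters rather than Hausa text, which suggests that some `input` values were damaged before they reached the spreadsheet. A minimal sketch for flagging such rows follows. `looks_garbled` and its thresholds (`min_length`, `max_distinct`) are hypothetical choices for illustration, not part of the original notebook.

```python
def looks_garbled(text, min_length=20, max_distinct=3):
    """Return True when a long string uses only a handful of distinct characters.

    The thresholds are assumptions and should be tuned on real data.
    """
    if not isinstance(text, str):
        return False  # missing values (NaN) are a separate problem for dropna
    stripped = "".join(text.split())  # ignore whitespace when counting
    if len(stripped) < min_length:
        return False  # too short to judge
    return len(set(stripped)) <= max_distinct


samples = [
    "о" * 56 + "в" * 24,  # shaped like the Hausa sample printed above
    "Karịsịa na ihe ebe na njikọ aka Chineke otuto, ọ dịghị ka o doo anya.",
    "12 , 13 . ( a ) Kí ni àbùmọ́ ?",
]
flags = [looks_garbled(s) for s in samples]
print(flags)  # [True, False, False]
```

If the heuristic fits the data, something like `train_df[~train_df['input'].map(looks_garbled)]` would drop the flagged rows before augmentation. The thresholds would need checking against genuine Hausa sentences first.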


In [17]:
print("🔄 Applying data augmentation...")

def simple_augmentation(df, num_augments=1):
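    """Augment by word shuffling.

    Rows missing 'input' or 'Output' are dropped. Every remaining row is kept,
    and each pair with more than three words on both sides also gets
    `num_augments` copies whose source and target words are shuffled
    independently.
    """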
    augmented_rows = []

    # Drop rows with missing values in 'input' or 'Output' columns
    df_cleaned = df.dropna(subset=['input', 'Output']).copy()


    for _, row in tqdm(df_cleaned.iterrows(), total=len(df_cleaned)):
        source_text = row['input']
        target_text = row['Output']
        lang = row['Language']

        # Keep original
        augmented_rows.append({
            #'ID': row['ID'],
            'input': source_text,
            'Output': target_text,
            'Language': lang
        })

        # Create simple variations
        for aug_idx in range(num_augments):
            # Simple word shuffle for augmentation
            words_source = source_text.split()
            words_target = target_text.split()

            if len(words_source) > 3 and len(words_target) > 3:
                # Shuffle words (simple augmentation)
                np.random.shuffle(words_source)
                np.random.shuffle(words_target)

                aug_source = ' '.join(words_source)
                aug_target = ' '.join(words_target)

                augmented_rows.append({
                    #'ID': f"aug_{row['ID']}_{aug_idx}",
                    'input': aug_source,
                    'Output': aug_target,
                    'Language': lang
                })


    return pd.DataFrame(augmented_rows)

# Apply augmentation
original_size = len(train_df)
augmented_train_df = simple_augmentation(train_df, num_augments=1)
print(f"✅ Data augmentation complete!")
print(f"Original size: {original_size}")
print(f"Augmented size: {len(augmented_train_df)}")

🔄 Applying data augmentation...


100%|██████████| 133538/133538 [00:06<00:00, 20402.38it/s]


✅ Data augmentation complete!
Original size: 135000
Augmented size: 225207
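
The sizes above follow from how `simple_augmentation` works: every cleaned row is kept, and a shuffled copy is added only when both sides have more than three words. The sketch below replays that rule on a single pair and then does the arithmetic on the reported counts. `shuffle_pair` is a hypothetical helper, and it uses the standard library's `random.Random` instead of `np.random` so that it runs without NumPy. The short pair is a toy example, not taken from the data.

```python
import random


def shuffle_pair(source_text, target_text, rng):
    """Shuffle the words of both sentences, or return None for short pairs."""
    src_words = source_text.split()
    tgt_words = target_text.split()
    if not (len(src_words) > 3 and len(tgt_words) > 3):
        return None  # same length check as in simple_augmentation
    rng.shuffle(src_words)  # source and target are shuffled independently
    rng.shuffle(tgt_words)
    return " ".join(src_words), " ".join(tgt_words)


rng = random.Random(42)
source = "Karịsịa na ihe ebe na njikọ aka Chineke otuto, ọ dịghị ka o doo anya."
target = "Especially in things where the connection to God's glory isn't as clear."
pair = shuffle_pair(source, target, rng)
short = shuffle_pair("a b c", "w x y z", rng)  # toy pair: source has only 3 words

# Counts reported above: cleaned rows plus one shuffled copy per long pair.
cleaned_rows = 133538
augmented_rows = 225207
shuffled_copies = augmented_rows - cleaned_rows
kept_without_copy = cleaned_rows - shuffled_copies
print(shuffled_copies)    # 91669
print(kept_without_copy)  # 41869
```

So 91,669 of the 133,538 cleaned pairs were long enough to receive a shuffled copy, and the remaining 41,869 appear only in their original form.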


In [20]:
# Model configuration - using the distilled version for faster training
MODEL_NAME = "facebook/nllb-200-distilled-600M"

# Language mapping for NLLB
LANG_MAPPING = {
    'yoruba': 'yor_Latn',
    'igbo': 'ibo_Latn',
    'hausa': 'hau_Latn',
    'english': 'eng_Latn'
}

print(f"🚀 Loading model: {MODEL_NAME}")

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

print("✅ Model loaded successfully!")
print(f"Model parameters: {model.num_parameters():,}")

# Check GPU and move model to device
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
if device == "cuda":
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
    model = model.to(device)

🚀 Loading model: facebook/nllb-200-distilled-600M
✅ Model loaded successfully!
Model parameters: 615,073,792
Using device: cuda
GPU memory: 15.8 GB
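
`LANG_MAPPING` uses lowercase keys, while the `Language` column holds capitalized labels such as `Hausa`. The sketch below shows one way the mapping could be applied, assuming the task is translation into English, as the column layout suggests. `nllb_code` and `translate_to_english` are hypothetical helpers that do not appear in the original notebook. The second one expects the `tokenizer` and `model` loaded above and is only defined here, not called.

```python
LANG_MAPPING = {
    "yoruba": "yor_Latn",
    "igbo": "ibo_Latn",
    "hausa": "hau_Latn",
    "english": "eng_Latn",
}


def nllb_code(language):
    """Map a dataset label such as 'Hausa' to its NLLB code, e.g. 'hau_Latn'."""
    return LANG_MAPPING[language.strip().lower()]


def translate_to_english(text, language, tokenizer, model, max_new_tokens=128):
    """Hypothetical helper: translate one sentence with the loaded NLLB model."""
    tokenizer.src_lang = nllb_code(language)  # declare the source language
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    generated = model.generate(
        **inputs,
        # Force the decoder to start with the English language token.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(LANG_MAPPING["english"]),
        max_new_tokens=max_new_tokens,  # assumed cap, not from the notebook
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]


codes = [nllb_code(name) for name in ("Yoruba", "Igbo", "Hausa")]
print(codes)  # ['yor_Latn', 'ibo_Latn', 'hau_Latn']
```

Normalizing the label in one place avoids a `KeyError` when the capitalized values from `train_df['Language']` are looked up directly.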
