<a href="https://colab.research.google.com/github/bur3hani/mtaalam-training/blob/main/Cleaned_Kiswahili_Model_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Kiswahili Language Model Development**
This notebook focuses on building a **ChatGPT-style Kiswahili model** using deep learning. It includes:
- **Dataset Preparation** (OSCAR corpus)
- **Tokenization** (mT5 tokenizer)
- **Fine-Tuning a Transformer Model**
- **Evaluation and Inference**

**Goal:** Build an interactive chatbot that teaches correct Kiswahili.

---

## **1. Environment Setup**
First, we install and import necessary libraries for deep learning and NLP.

## **2. Load Kiswahili Dataset**
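We use the Kiswahili subset of the **OSCAR corpus** (`unshuffled_deduplicated_sw`) as the primary dataset. OSCAR is extracted from web crawl data, so quality varies between documents; the deduplicated Kiswahili split loaded here contains 24,803 documents.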

## **3. Tokenization**
We tokenize the text with the SentencePiece tokenizer of Google's multilingual **mT5** model (`google/mt5-small`). Each text is truncated or padded to 512 tokens so that it is ready for fine-tuning.
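To make the effect of `padding="max_length"` together with `truncation=True` concrete, here is a minimal pure-Python sketch of what happens to a single list of token ids. The function name `pad_or_truncate` and the example ids are made up for illustration, and the sketch simplifies one detail: the real tokenizer keeps the end-of-sequence token when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Mimic padding="max_length" with truncation=True on one list of token ids.

    pad_id=0 matches the pad token id used by T5/mT5 tokenizers.
    """
    ids = list(token_ids[:max_length])   # truncation: drop tokens past max_length
    mask = [1] * len(ids)                # 1 marks real tokens
    missing = max_length - len(ids)
    ids += [pad_id] * missing            # padding: fill up to max_length
    mask += [0] * missing                # 0 marks padding positions
    return {"input_ids": ids, "attention_mask": mask}


# Hypothetical token ids, chosen only to show both cases
short = pad_or_truncate([259, 1042, 1], max_length=5)
long = pad_or_truncate([259, 1042, 77, 300, 12, 9, 1], max_length=5)
print(short)
print(long)
```

Padding every example to 512 tokens is simple but wasteful for short texts; a smaller `MAX_LENGTH` reduces memory use at the cost of truncating more documents.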

## **4. Model Fine-Tuning**
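We fine-tune `google/mt5-small` on the Kiswahili dataset. The baseline run uses a learning rate of 2e-5, a batch size of 8, 3 epochs, and a weight decay of 0.01; adjust these hyperparameters to fit your hardware and validation results.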

## **5. Evaluation and Inference**
Evaluate the fine-tuned model on the held-out validation split and generate sample responses.

## **6. Deployment Strategy**
Once trained, the model can be deployed as a chatbot using **FastAPI or Telegram bots**.

## **Recommendations & Next Steps**
- **Expand Dataset:** Add more high-quality Kiswahili texts.
- **Optimize Tokenization:** Train a custom Kiswahili tokenizer for better results.
- **Improve Evaluation Metrics:** Use BLEU and perplexity for better assessment.
- **Deploy as an API:** Use Hugging Face Spaces or FastAPI.
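
As an illustration of the perplexity metric mentioned in the list above, here is a minimal sketch. It assumes you already have per-token cross-entropy losses in natural-log units; the function name and the example values are hypothetical.

```python
import math


def perplexity(token_losses):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    if not token_losses:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_losses) / len(token_losses))


# Hypothetical per-token cross-entropy values, for illustration only
example_losses = [2.0, 2.5, 1.5]
print(round(perplexity(example_losses), 2))
```

When the evaluation loss is already a mean per-token cross-entropy, applying `math.exp` to it directly gives the same quantity.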

In [1]:
!pip install transformers datasets torch


In [2]:
from datasets import load_dataset

# Load the Kiswahili corpus (OSCAR, deduplicated "sw" configuration).
# The OSCAR loader runs custom code from the Hub; trust_remote_code=True
# accepts it up front instead of answering the interactive [y/N] prompt.
dataset = load_dataset("oscar", "unshuffled_deduplicated_sw", trust_remote_code=True)

# Print a sample document (the train split holds 24,803 examples)
print(dataset["train"][1]["text"])


Miripuko hiyo inakuja mwanzoni mwa Wiki Takatifu kuelekea Pasaka na ikiwa ni wiki chache tu kabla ya Papa Francis kuanza ziara yake katika nchi hiyo yenye idadi kubwa kabisa ya watu katika ulimwengu wa nchi za Kiarabu.


In [3]:
from transformers import AutoTokenizer

# Load mT5 tokenizer with fast processing
tokenizer = AutoTokenizer.from_pretrained("google/mt5-small", use_fast=True)

# Fix the sequence length: longer texts are truncated, shorter ones are padded
MAX_LENGTH = 512  # Adjust as needed
def tokenize_function(example):
    return tokenizer(example["text"], padding="max_length", truncation=True, max_length=MAX_LENGTH)

# Apply tokenization to dataset
tokenized_datasets = dataset.map(tokenize_function, batched=True)



In [4]:
from datasets import load_dataset, DatasetDict
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, Trainer, TrainingArguments

# Load Kiswahili dataset (trust_remote_code=True accepts the OSCAR loader script without a prompt)
dataset = load_dataset("oscar", "unshuffled_deduplicated_sw", trust_remote_code=True)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("google/mt5-small", use_fast=True)

# Tokenize dataset
MAX_LENGTH = 512  # Longer texts are truncated to this many tokens; shorter ones are padded

def tokenize_function(example):
    return tokenizer(example["text"], padding="max_length", truncation=True, max_length=MAX_LENGTH)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Split dataset into train (90%) and validation (10%); a fixed seed keeps the split reproducible
train_test_split = tokenized_datasets["train"].train_test_split(test_size=0.1, seed=42)
dataset = DatasetDict({
    "train": train_test_split["train"],
    "validation": train_test_split["test"]  # Rename test set as validation
})


In [5]:
# Load mT5-small model
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")



In [7]:
# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    save_total_limit=2,
    logging_dir="./logs",
    logging_steps=500,
)
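
To get a feel for what these arguments imply, the sketch below estimates the number of optimizer steps for the 90/10 split of the 24,803-example corpus. It assumes a single device, no gradient accumulation, and that the 10% validation share is rounded up; the helper name `training_steps` is made up for this example.

```python
import math


def training_steps(num_examples, batch_size, epochs):
    """Optimizer steps on one device with no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # last partial batch still counts
    return steps_per_epoch * epochs


total = 24_803                    # train split size reported when the dataset was loaded
val = math.ceil(0.1 * total)      # assumption: the 10% validation share is rounded up
train = total - val
steps = training_steps(train, batch_size=8, epochs=3)
print(train, val, steps, steps // 500)  # steps // 500 = number of logging events
```

Under these assumptions the run takes roughly 8,400 steps, so `logging_steps=500` produces about 16 loss entries over the three epochs.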


In [8]:
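# Initialize Trainer
# Note: the tokenized splits carry no "labels" column, so target sequences
# must be added before trainer.train() can compute a seq2seq loss.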
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
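
A sequence-to-sequence model needs a target for every input, while the OSCAR rows contain only raw `text`. One simple way to derive pairs, offered here as an assumption rather than as this notebook's method, is to split each document into a prefix and its continuation. The function name `make_continuation_pair` and the `input_fraction` parameter are hypothetical, and the approach is a stand-in for more elaborate objectives.

```python
def make_continuation_pair(text, input_fraction=0.5):
    """Split one document into an (input, target) pair at a word boundary."""
    words = text.split()
    if len(words) < 2:
        return None  # too short to split into two non-empty parts
    # Keep at least one word on each side of the cut
    cut = max(1, min(len(words) - 1, int(len(words) * input_fraction)))
    return {"input": " ".join(words[:cut]), "output": " ".join(words[cut:])}


# First words of the sample document printed earlier in the notebook
pair = make_continuation_pair(
    "Miripuko hiyo inakuja mwanzoni mwa Wiki Takatifu kuelekea Pasaka"
)
print(pair)
```

The resulting dictionaries use `input` and `output` keys, so each one can then be tokenized into `input_ids` and `labels`.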


In [9]:
from transformers import MT5Tokenizer, MT5ForConditionalGeneration, Trainer, TrainingArguments
from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self, data, tokenizer):
        self.data = data
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]
        inputs = self.tokenizer(item["input"], return_tensors="pt", padding="max_length", truncation=True, max_length=128)
        outputs = self.tokenizer(item["output"], return_tensors="pt", padding="max_length", truncation=True, max_length=128)
        return {
            "input_ids": inputs.input_ids.flatten(),
            "attention_mask": inputs.attention_mask.flatten(),
            "labels": outputs.input_ids.flatten(),  # target ids; the model shifts them right to build decoder_input_ids
        }

tokenizer = MT5Tokenizer.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")

data = [{"input":"translate English to German: Hello, world!","output":"Hallo Welt!"}]

dataset = MyDataset(data, tokenizer)

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=1,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
)

trainer.train()

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'T5Tokenizer'. 
The class this function is called from is 'MT5Tokenizer'.

TrainOutput(global_step=1, training_loss=66.59555053710938, metrics={'train_runtime': 377.1683, 'train_samples_per_second': 0.003, 'train_steps_per_second': 0.003, 'total_flos': 132187422720.0, 'train_loss': 66.59555053710938, 'epoch': 1.0})
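
The `labels` built in `MyDataset.__getitem__` are padded to 128 tokens, so the padding positions count toward the loss. A common refinement, sketched here on plain lists as an assumption rather than as part of the run above, is to replace padding ids with `-100`, the value that PyTorch's cross-entropy loss ignores by default. The helper name `mask_padding` and the example ids are made up.

```python
IGNORE_INDEX = -100  # PyTorch cross-entropy skips targets with this value by default


def mask_padding(label_ids, pad_id=0):
    """Replace padding ids so that they do not contribute to the loss.

    pad_id=0 matches the pad token id used by T5/mT5 tokenizers.
    """
    return [IGNORE_INDEX if token == pad_id else token for token in label_ids]


# Hypothetical target ids followed by two padding positions
masked = mask_padding([3155, 4, 1, 0, 0])
print(masked)
```

On the tensors used in the dataset class, the equivalent step would be `labels[labels == tokenizer.pad_token_id] = -100` before returning the item.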

In [10]:
import torch
from transformers import MT5Tokenizer, MT5ForConditionalGeneration, Trainer, TrainingArguments
from torch.utils.data import Dataset
import wandb

class MyDataset(Dataset):
    def __init__(self, data, tokenizer):
        self.data = data
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]
        inputs = self.tokenizer(item["input"], return_tensors="pt", padding="max_length", truncation=True, max_length=128)
        outputs = self.tokenizer(item["output"], return_tensors="pt", padding="max_length", truncation=True, max_length=128)
        return {
            "input_ids": inputs.input_ids.flatten(),
            "attention_mask": inputs.attention_mask.flatten(),
            "labels": outputs.input_ids.flatten(),  # target ids; the model shifts them right to build decoder_input_ids
        }

# Initialize wandb
wandb.init(project="mt5-training")

tokenizer = MT5Tokenizer.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")

data = [{"input":"translate English to German: Hello, world!","output":"Hallo Welt!"},{"input":"translate English to German: How are you?","output":"Wie geht es dir?"}]

dataset = MyDataset(data, tokenizer)

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    logging_dir='./logs',
    logging_steps=10,
    report_to="wandb"  # Enable wandb logging
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
)

trainer.train()

wandb.finish() # finish the wandb run.


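wandb run history:

| Metric | Trend |
| --- | --- |
| train/epoch | ▁▁ |
| train/global_step | ▁█ |

wandb run summary:

| Metric | Value |
| --- | --- |
| total_flos | 264374845440.0 |
| train/epoch | 1.0 |
| train/global_step | 2.0 |
| train_loss | 67.18486 |
| train_runtime | 73.4718 |
| train_samples_per_second | 0.027 |
| train_steps_per_second | 0.027 |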


In [11]:
# Save trained model
model.save_pretrained("kiswahili_grammar_model")
tokenizer.save_pretrained("kiswahili_grammar_model")

('kiswahili_grammar_model/tokenizer_config.json',
 'kiswahili_grammar_model/special_tokens_map.json',
 'kiswahili_grammar_model/spiece.model',
 'kiswahili_grammar_model/added_tokens.json')
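
For the inference step, the saved directory can be loaded back and used to generate text. The following is a minimal sketch, assuming the `kiswahili_grammar_model` directory written above exists; the function name `generate_reply` and the `max_new_tokens` default are illustrative choices.

```python
def generate_reply(prompt, model_dir="kiswahili_grammar_model", max_new_tokens=64):
    """Load the saved checkpoint and generate text for one prompt."""
    # Imported inside the function so that defining it does not load the library
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)

    # Same 128-token limit as the training examples above
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=128)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example call (requires the saved model directory):
# print(generate_reply("translate English to German: How are you?"))
```

This version reloads the model on every call to stay short; a chatbot would load the tokenizer and model once and reuse them across requests.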