# Fine-tuning Automatic Speech Recognition (ASR) models for Kinyarwanda

## Gemma 3n


## 1. Kinyarwanda audio datasets

- The most comprehensive labeled Kinyarwanda audio dataset was created by **Digital Umuganda** with funding from the Gates Foundation.

- It was released in June 2025 for the Kaggle competition "Kinyarwanda Automatic Speech Recognition": https://www.kaggle.com/digitalumuganda/competitions

- It contains about 1,000 hours of transcribed speech and 1,000 hours of unlabeled Kinyarwanda audio.

- The dataset can be found here: https://www.kaggle.com/datasets/digitalumuganda/track-b-kinyarwanda-asr-dataset

- *Badr al-Absi*, the winner of the Kaggle competition (track A), preprocessed the data and released an easier-to-use dataset of 1,000 hours of transcribed audio on Hugging Face: https://huggingface.co/datasets/badrex/kinyarwanda-speech-1000h


## 2. Data preparation

- While every *ASR model* has its own input format (and template), the dataset in Hugging Face format provides a common starting point.

- The data preparation workflow is as follows:

  1. Download the audio datasets locally.
  2. Perform the needed transformations (specific to each model one wishes to fine-tune).


### 2.1 Download datasets

- Audio data can be huge: *badrex/kinyarwanda-speech-1000h* is over 100 GB.

- "Samples" (single data points) in the dataset are about 20 seconds long on average.

- badrex/kinyarwanda-speech-1000h has three splits:

  (1) train: 180k rows (about 1,000 hours of audio)
  (2) test: 9.26k rows (about 50 hours)
  (3) validation: 9.27k rows (about 50 hours)

- For this proof of concept, we will download only part of the data:

  (1) train: 2500 samples, about 12 hours
  (2) test: 600 samples, about 3 hours
  (3) validation: 600 samples, about 3 hours
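
The hour figures in this section follow from simple arithmetic on sample counts and the average clip length. The sketch below is only a rough estimate: `estimate_hours` and `decoded_size_gb` are hypothetical helper names, the 20-second average is the approximate figure quoted in this section, and the size estimate assumes audio is kept as decoded 32-bit floats at 16 kHz (as the download script in this section does), ignoring compression and metadata.

```python
def estimate_hours(n_samples: int, avg_seconds: float = 20.0) -> float:
    """Approximate total audio duration in hours."""
    return n_samples * avg_seconds / 3600


def decoded_size_gb(n_samples: int, avg_seconds: float = 20.0,
                    sampling_rate: int = 16_000, bytes_per_value: int = 4) -> float:
    """Approximate size in GB (10**9 bytes) of decoded mono audio.

    Assumption: one float32 value (4 bytes) per audio sample.
    """
    return n_samples * avg_seconds * sampling_rate * bytes_per_value / 1e9


# Subset sizes used in this proof of concept
subset = {"train": 2500, "test": 600, "validation": 600}
for split, n in subset.items():
    print(f"{split}: ~{estimate_hours(n):.1f} h, ~{decoded_size_gb(n):.1f} GB decoded")
```

With the 20-second average, 2,500 clips come to about 13.9 hours; the figures quoted in this section are rounder estimates, and the true duration depends on the clips actually drawn.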

In [None]:
import os
import time
import numpy as np
from datasets import load_dataset, Dataset, Audio

In [None]:


def process_and_save_split(dataset_name, split_name, max_samples, target_sr, base_output_dir):
    output_path = os.path.join(base_output_dir, split_name)
    os.makedirs(output_path, exist_ok=True)
    
    print(f"\n🎲 Randomizing and processing: {split_name} ({max_samples} samples)")

    def simple_generator():
        # 1. Load stream
        ds = load_dataset(dataset_name, split=split_name, streaming=True)
        
        # 2. Shuffle the stream with a buffer
        # buffer_size=1000 means it pulls 1000 samples and picks randomly from them
        ds = ds.shuffle(seed=42, buffer_size=1000)
        
        # 3. Resample
        ds = ds.cast_column("audio", Audio(sampling_rate=target_sr))
        
        count = 0
        for example in ds:
            if count >= max_samples:
                break
            try:
                audio_array = example['audio']['array'].astype(np.float32)
                
                # Keep it simple: just audio and text
                yield {
                    "audio_id": example['audio_id'],
                    "audio": audio_array,
                    "transcription": example['transcription'],
                    "sampling_rate": target_sr
                }
                
                count += 1
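                if count % 100 == 0:
                    print(f"✅ Processed {count}/{max_samples}...", end="\r")
            except Exception as e:
                # Report the failure instead of silently dropping the sample
                print(f"\n⚠️ Skipping sample due to error: {e}")
                continue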

    # Create the dataset
    processed_ds = Dataset.from_generator(simple_generator, writer_batch_size=100)
    
    print(f"\n💾 Saving to {output_path}...")
    processed_ds.save_to_disk(output_path)
    print(f"🏁 Done: {time.asctime()}")

if __name__ == "__main__":
    CONFIG = {
        "dataset": "badrex/kinyarwanda-speech-1000h",
        "output_dir": "/media/mike/SSD4T/__staging/AI_Training_dset/audio_badrex_kinyarwanda-speech-1000h",
        "sampling_rate": 16000,
        "splits": {
            "train": 2500,
            "test": 600,
            "validation": 600
        }
    }

    for split, count in CONFIG["splits"].items():
        process_and_save_split(
            CONFIG["dataset"], split, count, CONFIG["sampling_rate"], CONFIG["output_dir"]
        )
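
The comment on `buffer_size` in the script above is worth unpacking, because a buffered shuffle over a stream is only approximately random. The following is a simplified pure-Python illustration of the general technique, not the `datasets` library's actual implementation; `buffered_shuffle` is a hypothetical name. It shows that with a buffer of 1000, the first 2500 items emitted can only come from roughly the first 3500 positions of the stream. For sharded datasets the library may additionally shuffle the shard order, which this sketch ignores.

```python
import random
from itertools import islice


def buffered_shuffle(stream, buffer_size, seed=42):
    """Yield items from `stream` in approximately random order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything
            buffer.append(item)
            continue
        # Emit a random buffered item and put the new item in its slot
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Stream exhausted: emit whatever is left, in random order
    rng.shuffle(buffer)
    yield from buffer


# Take 2500 items from a stream of 180,000 positions with a buffer of 1000
first_2500 = list(islice(buffered_shuffle(range(180_000), 1000), 2500))
print("highest stream position reached:", max(first_2500))
```

If a more uniform sample of the full split matters, a larger `buffer_size` helps, at the cost of memory and start-up time.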

## Inspect dataset 

In [1]:
from datasets import load_from_disk
import random

from IPython.display import Audio, display


In [4]:


# Path to your local validation folder
xpath = "/media/mike/SSD4T/__staging/AI_Training_dset/audio_badrex_kinyarwanda-speech-1000h/validation"

# 1. Load the processed dataset from disk
ds = load_from_disk(xpath)

# 2. Shuffle and select
# We use a seed to make it reproducible, or remove it for true randomness
random_5 = ds.shuffle(seed=42).select(range(5))

random_5

Dataset({
    features: ['audio_id', 'audio', 'transcription', 'sampling_rate'],
    num_rows: 5
})

In [5]:
# Structure of a single record (audio array truncated with ...)

{'audio_id': 'vDKxM6DGm60qTOuc2TNw',
 'audio': [0.0,  0.0,  0.0,  0.00054931640625,  0.000335693359375,  -0.000152587890625,  ...],
 'transcription': "Inzu ndende ubona ko ifite ibirahure, uyuriyeho wanyerera, ifite amarangi y'umutuku amagambo yanditseho hejuru, ndetse na harimo ababyeyi, uri kurebera mu madirishya wahita ubabona.",
 'sampling_rate': 16000}




In [6]:
q1 = random_5[3]

print(q1['transcription'])
print('\n--------------------')
display(Audio(data=q1['audio'], 
              rate=q1['sampling_rate']
             ))

iIsosiyete y'ubwishingizi mu Rwanda, iramenyesha abaturarwanda ko bashobora gushinganisha ibihingwa byabo, bikaba bitekanye, aho ushobora guhamagara ijana na mirongo inani na rimwe, zeru, bakaguha ubundi busobanuro bwimbitse.

--------------------
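
Because each record stores the raw samples together with the sampling rate, the clip duration can be derived directly, which is a quick way to sanity-check the saved subset. A minimal sketch, assuming records shaped like the one shown in this section; `clip_duration_seconds` and `summarize_durations` are hypothetical helpers, and the toy records stand in for rows of `ds`.

```python
def clip_duration_seconds(record):
    """Duration of one record: number of samples divided by samples per second."""
    return len(record["audio"]) / record["sampling_rate"]


def summarize_durations(records):
    """Return (count, total_hours, mean_seconds) for an iterable of records."""
    durations = [clip_duration_seconds(r) for r in records]
    if not durations:
        return 0, 0.0, 0.0
    total = sum(durations)
    return len(durations), total / 3600, total / len(durations)


# Toy records standing in for dataset rows (2 s and 4 s of silence at 16 kHz)
toy = [
    {"audio": [0.0] * 32_000, "sampling_rate": 16_000},
    {"audio": [0.0] * 64_000, "sampling_rate": 16_000},
]
count, hours, mean_s = summarize_durations(toy)
print(f"{count} clips, {hours:.4f} h total, {mean_s:.1f} s on average")
```

On the real data, passing `ds` instead of `toy` would iterate over the whole split, which can take a while because every audio list has to be loaded.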


## Gemma 3n format

In [1]:
import numpy as np
import librosa
from datasets import load_from_disk, Dataset
import os
import time

# --- CONFIGURATION ---
DIR_INPUT  = '/media/mike/SSD4T/__staging/AI_Training_dset/audio_badrex_kinyarwanda-speech-1000h/train'
DIR_OUTPUT = '/media/mike/SSD4T/__staging/AI_Training_dset/audio_gemma/train'
TARGET_SR = 16000
MAX_SAMPLES = None  # None processes the full split; set an integer (e.g. 200) for a quick test run

def gemma_generator(input_dir, max_samples, target_sr):
    # Load the source dataset
    ds = load_from_disk(input_dir)
    
    count = 0
    for example in ds:
        if max_samples and count >= max_samples:
            break
            
        try:
            # 1. Handle the raw list/array
            raw_audio = example['audio']
            # Convert list to numpy array if it isn't one
            audio_array = np.array(raw_audio, dtype=np.float32)
            
            # 2. Get the original sampling rate (default to 16k if missing)
            orig_sr = example.get('sampling_rate', 16000)

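            # 3. Manual resample if necessary (librosa resamples along the last axis)
            if orig_sr != target_sr:
                audio_array = librosa.resample(y=audio_array, orig_sr=orig_sr, target_sr=target_sr)

            # 4. Ensure mono. Assumes librosa's (channels, samples) layout,
            #    consistent with the last-axis resampling above.
            if audio_array.ndim > 1:
                audio_array = audio_array.mean(axis=0)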

            transcription = example.get('transcription', "")

            # 5. Yield the Audio Gemma structure
            yield {
                "messages": [
                    {
                        "role": "user",
                        "content": [
                            {"type": "audio", "audio": audio_array},
                            {"type": "text", "text": "transcribe this Kinyarwanda audio into text:"}
                        ]
                    },
                    {
                        "role": "model",
                        "content": [
                            {"type": "text", "text": transcription}
                        ]
                    }
                ]
            }
            
            count += 1
            if count % 50 == 0:
                print(f"✅ Processed {count} samples...", end="\r")

        except Exception as e:
            print(f"\n⚠️ Skipping sample {count} due to error: {e}")
            continue

if __name__ == "__main__":
    print("🚀 Starting transformation from raw list format...")
    print('Start time:', time.asctime())
    
    os.makedirs(DIR_OUTPUT, exist_ok=True)

    # Note: No 'features' defined here so Arrow can infer the schema from the first yield
    mdataset = Dataset.from_generator(
        gemma_generator,
        gen_kwargs={
            "input_dir": DIR_INPUT,
            "max_samples": MAX_SAMPLES,
            "target_sr": TARGET_SR
        },
        writer_batch_size=100
    )

    print(f"\n💾 Saving to {DIR_OUTPUT}...")
    mdataset.save_to_disk(DIR_OUTPUT)
    
    print("✨ Success!")
    print('End time:', time.asctime())

🚀 Starting transformation from raw list format...
Start time: Thu Feb 12 16:57:43 2026


Generating train split: 0 examples [00:00, ? examples/s]

✅ Processed 2500 samples...
💾 Saving to /media/mike/SSD4T/__staging/AI_Training_dset/audio_gemma/train...


Saving the dataset (0/7 shards):   0%|          | 0/2500 [00:00<?, ? examples/s]

✨ Success!
End time: Thu Feb 12 16:59:41 2026
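
Before training, it can be useful to check that every converted row has the expected conversation shape. The sketch below mirrors the structure yielded by `gemma_generator` in this section; `build_example` and `is_valid_example` are hypothetical helpers written for illustration, and the checks encode only what this section's code produces, not an official Gemma schema.

```python
PROMPT = "transcribe this Kinyarwanda audio into text:"


def build_example(audio, transcription, prompt=PROMPT):
    """Build one user/model exchange in the same shape as gemma_generator yields."""
    return {
        "messages": [
            {"role": "user", "content": [
                {"type": "audio", "audio": audio},
                {"type": "text", "text": prompt},
            ]},
            {"role": "model", "content": [
                {"type": "text", "text": transcription},
            ]},
        ]
    }


def is_valid_example(example):
    """Check roles, content types, and that audio and transcription are non-empty."""
    messages = example.get("messages", [])
    if [m.get("role") for m in messages] != ["user", "model"]:
        return False
    user_parts = messages[0].get("content", [])
    model_parts = messages[1].get("content", [])
    if [p.get("type") for p in user_parts] != ["audio", "text"]:
        return False
    if len(model_parts) != 1:
        return False
    audio = user_parts[0].get("audio")
    text = model_parts[0].get("text") or ""
    return audio is not None and len(audio) > 0 and bool(text.strip())


# Toy inputs: a three-sample "clip" with and without a transcription
good = build_example([0.0, 0.1, -0.1], "Muraho")
bad = build_example([0.0, 0.1, -0.1], "")
print(is_valid_example(good), is_valid_example(bad))
```

Rows reloaded from disk may carry extra `None`-valued keys if the storage layer unifies the schema across content parts; the checks above read only the keys they need, so they should tolerate that.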
