In this notebook we:

- check whether Gemma 3n supports Kinyarwanda
- transcribe Kinyarwanda audio to see how the model performs without fine-tuning



### 1. Check Tokenizer 

Check whether the tokenizer supports Kinyarwanda.


To check if a tokenizer supports your low-resource language, you cannot simply check if it "crashes." 

Modern tokenizers (like Gemma's) almost never crash; instead, they fall back to Byte-Level Encoding, which is inefficient and hurts performance.

"Support" really means **Efficiency**.

Here is a Python script to analyze how well the Gemma tokenizer (or any Hugging Face tokenizer) handles your language.

**The "Fertility" Test Script**

This script calculates the **Fertility Score** (tokens per word).

- Low score (~1.0 - 1.5): Native-level support. (Good)
- High score (> 3.0): Poor support. The model is reading your language as raw bytes/garbage, not words.
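To make "byte fallback" concrete, here is a minimal sketch in plain Python that needs no tokenizer download. It renders the UTF-8 bytes of a character in the `<0xNN>` style that Gemma/Llama-family tokenizers use for byte tokens, and shows a strict pattern for recognizing such tokens. The helper names `byte_fallback_tokens` and `is_byte_token` are illustrative, not part of any library.

```python
import re

# Strict pattern for byte-fallback tokens such as '<0xC3>'
BYTE_TOKEN = re.compile(r"<0x[0-9A-Fa-f]{2}>")

def byte_fallback_tokens(text: str) -> list[str]:
    """Render each UTF-8 byte of `text` in the <0xNN> byte-token style."""
    return [f"<0x{b:02X}>" for b in text.encode("utf-8")]

def is_byte_token(token: str) -> bool:
    """True only for tokens that look exactly like <0xNN>."""
    return BYTE_TOKEN.fullmatch(token) is not None

# One accented character becomes two byte tokens
print(byte_fallback_tokens("é"))

# A special token of the same length is not a byte token
print(is_byte_token("<0xC3>"), is_byte_token("<mask>"))
```

A single character that turns into two or more such tokens is what inflates the fertility score for scripts the tokenizer has not learned.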



In [None]:
from transformers import AutoTokenizer

def check_language_support(model_id, sample_text):
    print(f"--- Analyzing: {model_id} ---")
    
    # 1. Load Tokenizer
    try:
        tokenizer = AutoTokenizer.from_pretrained(model_id)
    except Exception as e:
        print(f"Error loading tokenizer: {e}")
        return

    # 2. Tokenize
    tokens = tokenizer.tokenize(sample_text)
    token_ids = tokenizer.encode(sample_text, add_special_tokens=False)
    
    # 3. Calculate Metrics
    word_count = len(sample_text.split())
    token_count = len(tokens)
    fertility_score = token_count / word_count if word_count > 0 else 0
    
    # 4. Check for "Byte Fallback" (The Danger Zone)
    # Gemma/Llama tokenizers represent raw bytes as tokens like '<0xE2>' or hex reps
    # If we see many tokens starting with '<0x', it means the script is unknown.
    # Match only the exact '<0xNN>' shape so special tokens such as '<mask>' are not miscounted.
    byte_fallback_count = sum(1 for t in tokens if len(t) == 6 and t.startswith("<0x") and t.endswith(">"))
    
    print(f"Sample Text:     {sample_text[:50]}...")
    print(f"Word Count:      {word_count}")
    print(f"Token Count:     {token_count}")
    print(f"Fertility Score: {fertility_score:.2f} tokens/word")
    
    # 5. Interpretation
    print("\nVERDICT:")
    if byte_fallback_count > 0:
        print(f"❌ POOR SUPPORT. Detected {byte_fallback_count} byte-fallback tokens.")
        print("   The model treats this language as raw binary data, not text.")
        print("   Training will be slow and context window will fill up 3x faster.")
    elif fertility_score < 1.5:
        print("✅ EXCELLENT. The tokenizer treats this as a native language.")
    elif fertility_score < 2.5:
        print("⚠️ OKAY. The tokenizer splits words into sub-words, but understands the script.")
    else:
        print("❌ POOR. High fragmentation. The model struggles to see whole words.")

    print("-" * 30)
    print(f"First 10 tokens: {tokens[:10]}")

# --- EXAMPLE USAGE ---

# Replace with your specific Gemma checkpoint
# model_name = "unsloth/gemma-3-4b-it"
model_name = "unsloth/gemma-3-270m-it"

# 1. English (Baseline)
english_text = "The quick brown fox jumps over the lazy dog."
check_language_support(model_name, english_text)



xtext_list = ["Mu 2009 yaje gusubira muri Syria ashaka kuhatangira ubucuruzi gusa agongwa n’intambara yabaye muri icyo gihugu, ndetse n’imyaka yagombaga kujya mu gisirikare y’itegeko ku muturage wese wo muri Syria",
              "Perezida Paul Kagame aheruka gukomoza ku makosa benshi bakora agatuma imiryango myinshi isenyuka, agaragaza ko ariko igisubizo kitakabaye gukandukana, kuko ababiri batananirwa gukemura ibyo batumvikanaho",
              "Ubu burwayi ahanini buterwa n'isuku nke yo mu bwiherero, kutamenya uburyo bwo kwikorera isuku igihe urangije gukora ibikomeye nko kwihanagura uhereye inyuma uza imbere no mu mibonano mpuzabitsina."
]


for xtext in xtext_list:
    check_language_support(model_name, xtext )
    print('------------') 
    

    

| Fertility Score | What it means for your project |
|---|---|
| 1.0 - 1.8 | Green light. The model has "seen" this script before. You can fine-tune easily. |
| 1.9 - 2.9 | Yellow light. The model knows the alphabet but not the words. It will break words into meaningless chunks (e.g., "hap-py" → "h-ap-py"). Fine-tuning will work, but you might need more data. |
| 3.0+ (or bytes) | Red light. The tokenizer is failing. It is encoding one character as 3-4 separate byte tokens (e.g., the letter é becoming `[<0xC3>, <0xA9>]`). |
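The per-sentence scores can also be pooled into one corpus-level number and mapped onto the bands above. The sketch below is one way to do that. The function names and `toy_tokenize` (a stand-in that cuts words into 3-character chunks) are hypothetical. In practice you would pass `tokenizer.tokenize` instead. The table leaves small gaps between bands (1.8 to 1.9, 2.9 to 3.0), so the cut-offs at 1.9 and 3.0 are an assumption.

```python
from typing import Callable, Iterable, List

def corpus_fertility(texts: Iterable[str],
                     tokenize: Callable[[str], List[str]]) -> float:
    """Total tokens divided by total whitespace-separated words."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        total_words += len(text.split())
    return total_tokens / total_words if total_words else 0.0

def fertility_band(score: float) -> str:
    """Map a fertility score onto the green / yellow / red bands."""
    if score >= 3.0:
        return "red"
    if score >= 1.9:
        return "yellow"
    return "green"

def toy_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: 3-character chunks per word."""
    return [w[i:i + 3] for w in text.split() for i in range(0, len(w), 3)]

score = corpus_fertility(["Perezida Paul Kagame"], toy_tokenize)
print(f"{score:.2f} tokens/word -> {fertility_band(score)}")
```

Pooling tokens and words before dividing keeps long sentences from being under-weighted, which a plain average of per-sentence scores would do.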

**From "Gemini" ... POOR FERTILITY SCORE**

This is a classic "low-resource language" problem in LLM fine-tuning. You have correctly identified that the tokenizer is inefficient for Kinyarwanda, but your interpretation of "POOR" needs a little nuance, and your model choice for audio transcription specifically might be the bigger hurdle.


Here is an analysis of your situation and a step-by-step advisory.

**1. The Verdict on the Tokenizer**

Don't panic. A fertility score of 2.7 - 2.8 is actually manageable.

- **It is not "broken":** The most critical metric is the `byte_fallback_count`. Yours is 0. This means the tokenizer does recognize the Kinyarwanda script (Latin characters) and is building words out of recognizable sub-word chunks (like `erez`, `ida`, `han`). It is not falling back to raw hex bytes.
- **It is just inefficient:** A score of 2.8 means that for every word of Kinyarwanda, the model uses ~3 tokens.
  - English average: ~1.2 tokens/word.
  - Impact: Your model will have a shorter effective context window (it fills up roughly 2.3x faster than in English, since 2.8 / 1.2 ≈ 2.3) and generation will be slower.
  - Can it learn? Yes. The model can absolutely learn to associate `['▁bur', 'way', 'i']` with the concept of "illness".
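The context-window impact can be checked with quick arithmetic. The sketch below uses the two fertility figures quoted above (2.8 and ~1.2 tokens/word) and, as an illustrative budget, the 2048-token `max_seq_length` set later in this notebook. `words_that_fit` is a hypothetical helper name.

```python
def words_that_fit(context_tokens: int, tokens_per_word: float) -> int:
    """Approximate number of words that fit in a token budget."""
    return int(context_tokens / tokens_per_word)

context_tokens = 2048          # illustrative budget
english_fertility = 1.2        # tokens/word, figure quoted above
kinyarwanda_fertility = 2.8    # tokens/word, measured score

english_words = words_that_fit(context_tokens, english_fertility)
kinyarwanda_words = words_that_fit(context_tokens, kinyarwanda_fertility)
ratio = kinyarwanda_fertility / english_fertility

print(f"English:     ~{english_words} words")
print(f"Kinyarwanda: ~{kinyarwanda_words} words")
print(f"Context fills about {ratio:.1f}x faster")
```

Under these assumptions the same window holds roughly 1700 English words but only about 730 Kinyarwanda words.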

**Tokenizer Score Matters Less For Transcription**

In a text-only model, a "poor" tokenizer breaks the input into fragments, making it hard for the model to understand the meaning.
However, in your Gemma-3n Audio pipeline:

- **Input (audio):** The model uses its audio encoder to process the sound. It does not use the text tokenizer for the audio input. Therefore, the "fragmentation" issue does not affect how well the model "hears" the Kinyarwanda audio.
- **Output (text):** The tokenizer is only used to generate the written transcript. A score of 2.8 means the transcript needs about 2.8 tokens per word, so decoding takes more steps than it would for English. It will not prevent the model from learning the correct spelling or grammar.

Verdict: Proceed with Gemma-3n. The tokenizer is efficient enough for generating transcripts.

### 2. Check Inference


We now check how **gemma-3n-E4B-it** performs at transcribing Kinyarwanda (without fine-tuning).


In [1]:
from datasets import load_from_disk
import random
import os 

from IPython.display import Audio, display

In [2]:
# Take 10 random samples from the dataset
# NOTE: this path points at the train split
xFld_test = '/workspace/work/AI_Training_dset/audio_badrex_kinyarwanda-speech-1000h/train'

# Write the shuffle indices to a temporary cache file
cache_file = "/tmp/shuffled_test.arrow"
ds = load_from_disk(xFld_test)
xdset = ds.shuffle(seed=50, indices_cache_file_name=cache_file).select(range(10))

# Remove the temporary indices cache file
if os.path.exists(cache_file):
    os.remove(cache_file)
    print(f"Deleted {cache_file}")

xdset 

Deleted /tmp/shuffled_test.arrow


Dataset({
    features: ['audio_id', 'audio', 'transcription', 'sampling_rate'],
    num_rows: 10
})

In [3]:
q1 = xdset[3]

print(q1['transcription'])
print('\n--------------------')
display(Audio(data=q1['audio'], 
              rate=q1['sampling_rate']
             ))

Kwa muganga, aho bakirira indembe zikeneye kuvurwa vuba, hari abantu benshi cyane higanjemo ab'igitsina gabo, barembye cyane, baje kwivuza byihutirwa. Hirya haparitse imodoka.

--------------------


In [4]:
from unsloth import FastModel

import torch
import numpy as np
import librosa  # We need this for resampling if your audio isn't 16k
from jiwer import wer, cer
from tqdm import tqdm
import time 

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


In [5]:
# ------------------------------------------------------------------------
# 1. SETUP MODEL
# ------------------------------------------------------------------------
max_seq_length = 2048 
dtype = None 
load_in_4bit = True 

#xmodel = 'unsloth/gemma-3n-E2B-it'
xmodel = 'unsloth/gemma-3n-E4B-it'   # test first the 4B model 

print("Loading model...")
model, processor = FastModel.from_pretrained(
    model_name = xmodel ,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
model.eval()

print('OK')

Loading model...
==((====))==  Unsloth 2026.2.1: Fast Gemma3N patching. Transformers: 4.57.1. vLLM: 0.11.2.
   \\   /|    NVIDIA GeForce RTX 4090. Num GPUs = 1. Max memory: 23.543 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu128. CUDA: 8.9. CUDA Toolkit: 12.8. Triton: 3.5.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.33.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



OK


In [6]:
# ------------------------------------------------------------------------
# 2. DEFINE PROMPTS & RESAMPLER
# ------------------------------------------------------------------------
system_prompt = "You are an assistant that transcribes speech accurately."
user_instruction = "Please transcribe this Kinyarwanda audio."

def ensure_16k(audio_data, source_sr):
    """
    Gemma-3N expects 16kHz audio. 
    If your dataset says sampling_rate is 44100 or 48000, we must resample.
    """
    audio_np = np.array(audio_data, dtype=np.float32)
    
    if source_sr != 16000:
        # Using librosa to resample quickly
        audio_np = librosa.resample(audio_np, orig_sr=source_sr, target_sr=16000)
        
    return audio_np

In [7]:

# 1. CRITICAL: Set Padding Side to LEFT for Generation
# Decoder-only models (like Gemma) generate from left to right. 
# If you pad on the right, the model sees [audio, pad, pad] and gets confused.
# It must be [pad, pad, audio] so generation starts immediately after the audio tokens.
processor.tokenizer.padding_side = "left" 

def run_stable_batch_inference(dataset, batch_size=4):
    # 2. Sort dataset by audio length to minimize padding
    # This reduces the chance of "out of bounds" errors caused by massive padding gaps
    # We add a temporary column for length, sort, and then remove it
    dataset = dataset.map(lambda x: {"len": len(x["audio"])})
    dataset = dataset.sort("len", reverse=True)  # Process longest first (often helps OOM)
    dataset = dataset.remove_columns("len")
    
    all_predictions = []
    all_references = []
    
    # Iterate with the sorted dataset
    for i in tqdm(range(0, len(dataset), batch_size), desc="Processing Batches"):
        batch = dataset[i : i + batch_size]
        
        batch_audio_lists = batch['audio']
        batch_references = batch['transcription']
        batch_srs = batch['sampling_rate']
        
        prompts_text = []
        processed_audios = []
        
        for j, raw_audio_list in enumerate(batch_audio_lists):
            current_sr = batch_srs[j]
            audio_array = ensure_16k(raw_audio_list, current_sr)
            processed_audios.append(audio_array)
            
            messages = [
                {"role": "system", "content": [{"type": "text", "text": system_prompt}]},
                {"role": "user", "content": [
                    {"type": "audio", "audio": audio_array}, 
                    {"type": "text", "text": user_instruction}
                ]}
            ]
            prompts_text.append(
                processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
            )

        # 3. Generate Inputs
        inputs = processor(
            text=prompts_text,
            audio=processed_audios,
            sampling_rate=16000,
            return_tensors="pt",
            padding=True,          # This will now apply LEFT padding
        ).to("cuda")

        # 4. Run Generation
        # We use a try-except block to gracefully handle any lingering shape errors
        try:
            with torch.no_grad():
                generated_ids = model.generate(
                    **inputs,
                    max_new_tokens=256,
                    do_sample=False,
                    use_cache=True,
                )
            
            # Decode output
            input_len = inputs.input_ids.shape[1]
            new_tokens = generated_ids[:, input_len:]
            decoded_output = processor.batch_decode(
                new_tokens, 
                skip_special_tokens=True,
                clean_up_tokenization_spaces=True
            )
            
            all_predictions.extend(decoded_output)
            all_references.extend(batch_references)

        except RuntimeError as e:
            if "setStorage" in str(e):
                print(f"\n[Warning] Batch {i//batch_size} failed with storage error. Retrying with batch_size=1...")
                # Fallback: If a specific batch fails, run its items one by one
                for k in range(len(prompts_text)):
                    single_pred = run_single_inference(
                        prompts_text[k], processed_audios[k], model, processor
                    )
                    all_predictions.append(single_pred)
                    all_references.append(batch_references[k])
            else:
                raise  # Re-raise any other error with its original traceback

    return all_predictions, all_references

# Helper for the fallback mechanism
def run_single_inference(text, audio, model, processor):
    inputs = processor(
        text=[text],
        audio=[audio],
        sampling_rate=16000,
        return_tensors="pt",
        padding=False # No padding needed for single item
    ).to("cuda")
    
    with torch.no_grad():
        generated_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    
    input_len = inputs.input_ids.shape[1]
    return processor.batch_decode(generated_ids[:, input_len:], skip_special_tokens=True)[0]
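The effect of `padding_side = "left"` can be illustrated without a model. The sketch below pads toy token-id lists on either side. `pad_batch`, the pad id `0`, and the ids themselves are made up for illustration. It also shows why slicing at `input_len` isolates the new tokens for every row when padding is on the left.

```python
PAD = 0  # hypothetical pad id for this illustration

def pad_batch(sequences, side="left", pad_id=PAD):
    """Pad variable-length id lists to a common width on the given side."""
    width = max(len(s) for s in sequences)
    padded = []
    for s in sequences:
        fill = [pad_id] * (width - len(s))
        padded.append(fill + list(s) if side == "left" else list(s) + fill)
    return padded

prompts = [[11, 12, 13, 14], [21, 22]]
left = pad_batch(prompts, "left")
right = pad_batch(prompts, "right")
print("left: ", left)
print("right:", right)

# With left padding every prompt ends at the same column, so one slice
# at the shared input width isolates the generated tokens for each row.
input_len = len(left[0])
generated = [row + [99] for row in left]   # pretend each row produced token 99
new_tokens = [row[input_len:] for row in generated]
print("new:  ", new_tokens)
```

With right padding the short prompt would be followed by pad ids, so generation would continue after padding rather than after the prompt.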

In [8]:
# Run batched inference on the 10 sampled clips (xdset)
print('start:', time.asctime())

predictions, references = run_stable_batch_inference(xdset, batch_size=2)

print('end:', time.asctime())


start: Fri Feb 13 07:18:02 2026



Processing Batches: 100%|██████████| 5/5 [00:45<00:00,  9.14s/it]

end: Fri Feb 13 07:18:48 2026





In [9]:
# ------------------------------------------------------------------------
# 5. METRICS
# ------------------------------------------------------------------------
print("\n--- Evaluation ---")
wer_score = wer(references, predictions)
cer_score = cer(references, predictions)

print(f"WER: {wer_score:.4f}")
print(f"CER: {cer_score:.4f}")




--- Evaluation ---
WER: 0.9208
CER: 0.4414
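These scores are computed on raw strings, so differences in casing and punctuation count as errors. The sketch below shows how much that can matter. It is a plain-Python illustration, not jiwer's implementation: `normalize` and `word_error_rate` are hypothetical helpers, the normalization rules are an assumption, and the hypothesis string is made up (the reference comes from the sample shown earlier).

```python
import re

def normalize(text: str) -> str:
    """Lowercase, unify apostrophes, drop other punctuation, collapse spaces."""
    text = text.lower().replace("\u2019", "'")
    text = re.sub(r"[^\w\s']", " ", text)
    return " ".join(text.split())

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref) if ref else 0.0

ref = "Hirya haparitse imodoka."
hyp = "hirya haparitse imodoka"   # hypothetical model output

raw = word_error_rate(ref, hyp)
clean = word_error_rate(normalize(ref), normalize(hyp))
print(f"raw WER:        {raw:.2f}")    # two of three words differ
print(f"normalized WER: {clean:.2f}")  # identical after normalization
```

Applying the same normalization to `references` and `predictions` before calling `wer` and `cer` would separate genuine transcription errors from formatting differences.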


In [10]:
# Optional: Print a few examples
for i in range(min(20, len(predictions))):
    print(f"\nRef:  {references[i]}")
    print(f"Pred: {predictions[i]}")


Ref:  Umumotari wambaye inkweto z'umweru zirimo ibara ry'umukara, yambaye kandi ipantaro yubururu, hejuru yambaye ishati y'icyatsi iciye amaboko ndetse na casque y'icyatsi, imbere ye hari moto iriho ikintu cy'icyatsi ndetse icyo kintu bakibikamo ibiribwa.
Pred: umumotari wambaye inkweto zo mweru zirimwo ibara ry'umukara yambaye kandi ipanaroe y'ubururu hejuru yambaye shati y'icyatsi n'amakoko ndetse na kaske y'icyatsi.
Ambere ye hari moto iriho ikino k'icyatsi ndetse icyo kintu bakibikamo ibiribga.

Ref:  Hari ibyuma bibikwamo amata biri mu ngano zitandukanye yaba ibinini cyangwa ibitoya aho usanga ku makusanyiririzo ahacururizwa amata hari ibyo byuma mu kigero gitandukanye aho bifasha abakiliya cyangwa abacuruzi kuba babika ayo mata bitewe n'ingano bafite bakaba bayabika mu ibyo byuma kandi neza. 
Pred: Hari ibyo biuma bibw'umata biri muunganze bitanukanye y'ibinyini c'ange bitoya. Aho usanga k'umakusanyirizwa aturi zwa mata hari ibyo byuma mu kigero gitanukanye. Aho bifashwa abakiri

In [None]:
import time 