<div align="center">

<img src="https://africa.dlnlp.ai/simba/images/VoC_simba" alt="VoC Simba Models Logo">


[![EMNLP 2025 Paper](https://img.shields.io/badge/EMNLP_2025-Paper-B31B1B?style=for-the-badge&logo=arxiv&logoColor=B31B1B&labelColor=FFCDD2)](https://aclanthology.org/2025.emnlp-main.559/)
[![Official Website](https://img.shields.io/badge/Official-Website-2EA44F?style=for-the-badge&logo=googlechrome&logoColor=2EA44F&labelColor=C8E6C9)](https://africa.dlnlp.ai/simba/)
[![SimbaBench](https://img.shields.io/badge/SimbaBench-Benchmark-8A2BE2?style=for-the-badge&logo=googlecharts&logoColor=8A2BE2&labelColor=E1BEE7)](#simbabench)
[![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-FFD21E?style=for-the-badge&logoColor=black&labelColor=FFF9C4)](https://huggingface.co/collections/UBC-NLP/simba-speech-series)
[![YouTube Video](https://img.shields.io/badge/YouTube-Video-FF0000?style=for-the-badge&logo=youtube&logoColor=FF0000&labelColor=FFCCBC)](#demo)

</div>

## *Bridging the Digital Divide for African AI*

**Voice of a Continent** is an open-source ecosystem for bringing African languages to the forefront of artificial intelligence. It provides a unified suite of benchmarking tools and state-of-the-art models, with the aim of making speech technology inclusive, representative, and accessible to over a billion people.

## Best-in-Class Multilingual Models

Introduced in our EMNLP 2025 paper *[Voice of a Continent](https://aclanthology.org/2025.emnlp-main.559/)*, the **Simba Series** represents the current state-of-the-art for African speech AI.

- **Unified Suite:** Models optimized for African languages.
- **Superior Accuracy:** Outperforms generic multilingual models by leveraging SimbaBench's high-quality, domain-diverse datasets.
- **Multitask Capability:** Designed for high performance in ASR (Automatic Speech Recognition) and TTS (Text-to-Speech).
- **Inclusion-First:** Specifically built to mitigate the "digital divide" by empowering speakers of underrepresented languages.

The **Simba** family is fine-tuned on SimbaBench. Its gains come from dataset quality, domain diversity, and language family relationships.

### 🗣️✍️ Simba-ASR
> **The New Standard for African Speech-to-Text**

**🎯 Task** `Automatic Speech Recognition`: powering high-accuracy transcription across the continent.

**🌍 Language Coverage (43 African languages)**
> **Amharic** (`amh`), **Arabic** (`ara`), **Asante Twi** (`asanti`), **Bambara** (`bam`), **Baoulé** (`bau`), **Bemba** (`bem`), **Ewe** (`ewe`), **Fanti** (`fat`), **Fon** (`fon`), **French** (`fra`), **Ganda** (`lug`), **Hausa** (`hau`), **Igbo** (`ibo`), **Kabiye** (`kab`), **Kinyarwanda** (`kin`), **Kongo** (`kon`), **Lingala** (`lin`), **Luba-Katanga** (`lub`), **Luo** (`luo`), **Malagasy** (`mlg`), **Mossi** (`mos`), **Northern Sotho** (`nso`), **Nyanja** (`nya`), **Oromo** (`orm`), **Portuguese** (`por`), **Shona** (`sna`), **Somali** (`som`), **Southern Sotho** (`sot`), **Swahili** (`swa`), **Swati** (`ssw`), **Tigrinya** (`tir`), **Tsonga** (`tso`), **Tswana** (`tsn`), **Twi** (`twi`), **Umbundu** (`umb`), **Venda** (`ven`), **Wolof** (`wol`), **Xhosa** (`xho`), **Yoruba** (`yor`), **Zulu** (`zul`), **Tamazight** (`tzm`), **Sango** (`sag`), **Dinka** (`din`).

**🏗️ Base Architectures**

  - **Simba-S** (SeamlessM4T-v2-MT), *Top Performer*
  - **Simba-W** (Whisper-v3-large)
  - **Simba-X** (Wav2Vec2-XLS-R-2b)
  - **Simba-M** (MMS-1b-all)
  - **Simba-H** (AfriHuBERT)
      
| **ASR Models** | **Architecture** | **#Parameters** | **🤗 Hugging Face Model Card** | **Status** |
|---------|:------------------:|:------------------:|:------------------:|:------------------:|
| 🔥**Simba-S**🔥 | SeamlessM4T-v2 | 2.3B | 🤗 [https://huggingface.co/UBC-NLP/Simba-S](https://huggingface.co/UBC-NLP/Simba-S) | ✅ Released |
| 🔥**Simba-W**🔥 | Whisper | 1.5B | 🤗 [https://huggingface.co/UBC-NLP/Simba-W](https://huggingface.co/UBC-NLP/Simba-W) | ✅ Released |
| 🔥**Simba-X**🔥 | Wav2Vec2 | 1B | 🤗 [https://huggingface.co/UBC-NLP/Simba-X](https://huggingface.co/UBC-NLP/Simba-X) | ✅ Released |
| 🔥**Simba-M**🔥 | MMS | 1B | 🤗 [https://huggingface.co/UBC-NLP/Simba-M](https://huggingface.co/UBC-NLP/Simba-M) | ✅ Released |
| 🔥**Simba-H**🔥 | HuBERT | 94M | 🤗 [https://huggingface.co/UBC-NLP/Simba-H](https://huggingface.co/UBC-NLP/Simba-H) | ✅ Released |

* **Simba-S** (based on SeamlessM4T-v2-MT) emerged as the best-performing ASR model overall.


## Install and load requirements

!pip install torch torchvision torchaudio
!pip install transformers datasets  huggingface_hub

import torchaudio
from transformers import pipeline
from huggingface_hub import login


## Loading the model


# Step 1: Login with YOUR token (get it from https://huggingface.co/settings/tokens)
login(token="hf_xxxxxxx")  # ← PASTE YOUR TOKEN HERE

# Load Simba model for ASR
asr_pipeline = pipeline(
    "automatic-speech-recognition",
    model="UBC-NLP/Simba-S"  # Simba models: `UBC-NLP/Simba-S`, `UBC-NLP/Simba-W`, `UBC-NLP/Simba-X`, `UBC-NLP/Simba-H`, `UBC-NLP/Simba-M`
)
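
The `login` call above hardcodes the token in the notebook. As a minimal sketch, the token can instead be read from an environment variable so it never lands in a shared file. The helper names `read_hf_token` and `login_from_env` are hypothetical, and `HF_TOKEN` is the variable name assumed here.

```python
import os


def read_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the access token stored in `env_var`, or raise if it is unset."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Hugging Face token."
        )
    return token


def login_from_env() -> None:
    """Log in with the token from the environment (needs network access)."""
    # Imported lazily so read_hf_token() works without huggingface_hub installed.
    from huggingface_hub import login

    login(token=read_hf_token())
```

Calling `login_from_env()` once before building the pipeline would replace the hardcoded `login(token=...)` line.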

# Only for  `UBC-NLP/Simba-M`
asr_pipeline.model.load_adapter("multilingual_african")
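
Because the adapter step applies to one model only, it is easy to forget when switching checkpoints. The sketch below wraps loading in a single hypothetical helper, `load_simba_asr`, that applies the adapter only when needed. The repo IDs and the adapter name `multilingual_african` are taken from this README; the helper and constant names are illustrative assumptions.

```python
# Repo IDs copied from the model table in this README.
SIMBA_ASR_MODELS = {
    "Simba-S": "UBC-NLP/Simba-S",
    "Simba-W": "UBC-NLP/Simba-W",
    "Simba-X": "UBC-NLP/Simba-X",
    "Simba-M": "UBC-NLP/Simba-M",
    "Simba-H": "UBC-NLP/Simba-H",
}

# Per the README, only Simba-M needs the extra adapter step.
ADAPTER_MODELS = {"UBC-NLP/Simba-M"}
ADAPTER_NAME = "multilingual_african"


def resolve_model_id(name: str) -> str:
    """Accept a short name ("Simba-S") or a full repo ID and return the repo ID."""
    if name in SIMBA_ASR_MODELS:
        return SIMBA_ASR_MODELS[name]
    if name in SIMBA_ASR_MODELS.values():
        return name
    raise ValueError(f"Unknown Simba ASR model: {name!r}")


def needs_adapter(model_id: str) -> bool:
    """Return True if the adapter must be loaded after building the pipeline."""
    return model_id in ADAPTER_MODELS


def load_simba_asr(name: str = "Simba-S"):
    """Build the ASR pipeline and load the adapter when the model requires it."""
    from transformers import pipeline  # lazy import: downloads weights when called

    model_id = resolve_model_id(name)
    asr = pipeline("automatic-speech-recognition", model=model_id)
    if needs_adapter(model_id):
        asr.model.load_adapter(ADAPTER_NAME)
    return asr
```

With this in place, `asr_pipeline = load_simba_asr("Simba-M")` would cover both cells above in one call.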


## Transcribe audio file

# Transcribe directly from a local file path or a public URL
result = asr_pipeline("https://africa.dlnlp.ai/simba/audio/afr_Lwazi_afr_test_idx3889.wav")

print(f"Transcription: {result['text']}")

# Alternative: load the audio yourself, resample it, and pass the array
import urllib.request

# Download first: torchaudio.load does not accept URLs on every backend
url = "https://africa.dlnlp.ai/simba/audio/afr_Lwazi_afr_test_idx3889.wav"
local_path, _ = urllib.request.urlretrieve(url, "sample.wav")

# Load and resample to 16 kHz, the rate declared in the pipeline call below
audio, orig_freq = torchaudio.load(local_path)
audio = torchaudio.functional.resample(audio, orig_freq=orig_freq, new_freq=16_000)
print(f"Shape: {audio.shape}")  # (channels, samples)
print(f"Type: {type(audio)}")

# Average the channels to get a 1D mono NumPy array
audio_array = audio.mean(dim=0).numpy()

# Pass both the array and its sampling rate to the pipeline
result = asr_pipeline({
    "array": audio_array,
    "sampling_rate": 16_000
})

print(f"Transcription: {result['text']}")
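
To transcribe several files in one go, the single-file call can be wrapped in a small loop. This is a minimal sketch: `transcribe_many` is a hypothetical helper, and it assumes only that the pipeline is a callable returning a dict with a `"text"` key, as in the examples above.

```python
from typing import Callable, Dict, Iterable


def transcribe_many(asr: Callable[[str], dict], sources: Iterable[str]) -> Dict[str, str]:
    """Run `asr` on each path or URL and map the source to its transcription."""
    transcripts: Dict[str, str] = {}
    for source in sources:
        result = asr(source)  # same call shape as asr_pipeline(path_or_url)
        transcripts[source] = result["text"].strip()
    return transcripts
```

For example, `transcribe_many(asr_pipeline, ["a.wav", "b.wav"])` would return one transcription per file, keyed by path (the file names here are placeholders).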