This repository contains tools and resources for Arabic Speech Self-Supervised Learning (SSL), including modules for data preparation and model training using SpeechBrain. The goal is to facilitate research and development in Arabic speech processing, with a focus on multidialectal Arabic speech representation learning.
This repository includes:
- Data Preparation Module: Scripts to preprocess raw audio and text data into a format compatible with SpeechBrain recipes.
- SpeechBrain Recipes: Custom recipes for training Arabic SSL models with BEST-RQ.
The focus is on handling Arabic dialectal diversity, enabling efficient SSL training with large-scale datasets.
- Ara-BEST-RQ dataset: Elyadata/Ara-Best-RQ_dataset
The Hugging Face dataset provides metadata for the crawled Ara-BEST-RQ data, including YouTube video identifiers and segment boundaries. No audio or video files are distributed as part of the dataset.
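Because only metadata is distributed, a first step is often to check how much audio each video contributes before fetching anything. The sketch below sums segment durations per video. The record layout is an assumption made for illustration: the field names `video_id`, `start`, and `end`, and the use of seconds as the unit, are hypothetical and may not match the actual dataset schema.

```python
from collections import defaultdict


def seconds_per_video(segments):
    """Sum segment durations per video.

    `segments` is an iterable of dicts with the hypothetical keys
    "video_id", "start", and "end" (boundaries assumed to be in seconds).
    """
    totals = defaultdict(float)
    for seg in segments:
        duration = seg["end"] - seg["start"]
        if duration <= 0:
            # Skip malformed records instead of corrupting the totals.
            continue
        totals[seg["video_id"]] += duration
    return dict(totals)


if __name__ == "__main__":
    # Made-up records, for illustration only.
    example = [
        {"video_id": "vid_a", "start": 0.0, "end": 2.0},
        {"video_id": "vid_a", "start": 10.0, "end": 13.0},
        {"video_id": "vid_b", "start": 5.0, "end": 7.0},
    ]
    print(seconds_per_video(example))
```

Replace the key names with the real column names once you have inspected the dataset on Hugging Face.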
The following Ara-BEST-RQ pretrained checkpoints are available on Hugging Face:
| Model | Training data | Hugging Face |
|---|---|---|
| Ara-BEST-RQ-300M-6k | Crawled data | Elyadata/AraBEST-RQ-300M-6k |
| Ara-BEST-RQ-600M-6k | Crawled data | Elyadata/AraBEST-RQ-600M-6k |
| Ara-BEST-RQ-600M-14k | Combined datasets | Elyadata/AraBEST-RQ-600M-14k |
- Support for Arabic speech datasets, including dialectal Arabic datasets.
- Utilities for audio preprocessing, clustering, and segmentation.
- Integration with SpeechBrain for SSL model training.
- Recipes based on SpeechBrain's BEST-RQ implementation.
Ensure the following dependencies are installed:
- Python >= 3.8
- SpeechBrain
- ffmpeg, for audio processing
- PyTorch, with a version compatible with your GPU/CPU setup
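The dependency list above can be checked with a small script before going further. This is a minimal sketch rather than part of the repository: the function name `check_environment` is hypothetical, and it only verifies that the modules are importable and that `ffmpeg` is on the system path, not that the installed versions are compatible with each other.

```python
import importlib.util
import shutil
import sys


def check_environment(min_python=(3, 8),
                      modules=("speechbrain", "torch"),
                      tools=("ffmpeg",)):
    """Return a dict mapping each requirement name to True or False."""
    report = {"python": sys.version_info >= min_python}
    for name in modules:
        # find_spec returns None when a top-level module is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        report[tool] = shutil.which(tool) is not None
    return report


if __name__ == "__main__":
    for requirement, ok in check_environment().items():
        print(f"{requirement}: {'found' if ok else 'MISSING'}")
```

Run it inside the environment you intend to train in, since a different interpreter may see a different set of packages.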
Clone this repository and set up the environment:
# Clone the repository
git clone https://github.com/elyadata/AraBEST-RQ.git
cd AraBEST-RQ
# Create a virtual environment, for example using conda
conda create -n arabic_ssl python=3.11
conda activate arabic_ssl

Install SpeechBrain:
- From PyPI:
pip install speechbrain

- Or from source, which is recommended:
cd .. # Or another root directory of your choice
git clone https://github.com/speechbrain/speechbrain.git
cd speechbrain
pip install -r requirements.txt
pip install --editable .

Install project dependencies:
cd ../AraBEST-RQ # Go back to this project's directory
pip install -r additional_requirements.txt

Ensure ffmpeg is available in your system path.
- Linux:
sudo apt update
sudo apt install ffmpeg
ffmpeg -version # To verify installation

- macOS:
brew install ffmpeg
ffmpeg -version

- Windows:
winget install --id=Gyan.FFmpeg -e # Using WinGet

Note: There are alternative ways to install ffmpeg. You can learn more on the official website.
Generate SpeechBrain-compatible CSV files, with columns such as ID, wav, start, end, duration, and dialect, for more than 20 Arabic speech datasets.
For more details, refer to the data preparation guide. All supported datasets are listed in the catalog.
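To make the manifest format concrete, here is a minimal sketch that writes a CSV with the columns named above using only the standard library. It is not the repository's preparation code: the helper names, the file paths, the dialect labels, and the assumption that `start`, `end`, and `duration` are expressed in seconds are all illustrative. The data preparation guide remains the reference for the exact layout.

```python
import csv
import io

# Column names taken from the description above.
FIELDS = ["ID", "wav", "start", "end", "duration", "dialect"]


def make_row(utt_id, wav_path, start, end, dialect):
    """Build one manifest row; times are assumed to be in seconds."""
    if end <= start:
        raise ValueError(f"segment {utt_id!r} has end <= start")
    return {
        "ID": utt_id,
        "wav": wav_path,
        "start": start,
        "end": end,
        "duration": round(end - start, 3),
        "dialect": dialect,
    }


def write_manifest(rows, stream):
    """Write a header line followed by one CSV line per row."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    # Hypothetical paths and dialect labels, for illustration only.
    rows = [
        make_row("utt_0001", "/data/audio/rec_01.wav", 0.0, 4.25, "TUN"),
        make_row("utt_0002", "/data/audio/rec_01.wav", 4.25, 9.0, "TUN"),
    ]
    buffer = io.StringIO()
    write_manifest(rows, buffer)
    print(buffer.getvalue())
```

To write to disk instead of a string buffer, pass a file opened with `open(path, "w", newline="", encoding="utf-8")` as `stream`.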
Use the provided SpeechBrain recipe to train SSL models for Arabic speech representation learning.
The recipe is based on SpeechBrain's BEST-RQ implementation.
If you use this project, dataset, or pretrained models in your research, please cite:
@misc{elleuch2026arabestrqmultidialectalarabic,
title={Ara-Best-RQ: Multi Dialectal Arabic SSL},
author={Haroun Elleuch and Ryan Whetten and Salima Mdhaffar and Yannick Estève and Fethi Bougares},
year={2026},
eprint={2603.21900},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2603.21900},
}