Ara-BEST-RQ: Arabic Speech SSL

This repository contains tools and resources for Arabic Speech Self-Supervised Learning (SSL), including modules for data preparation and model training using SpeechBrain. The goal is to facilitate research and development in Arabic speech processing, with a focus on multidialectal Arabic speech representation learning.

Overview

This repository includes:

Data Preparation Module: Scripts to preprocess raw audio and text data into a format compatible with SpeechBrain recipes.
SpeechBrain Recipes: Custom recipes for training Arabic SSL models with BEST-RQ.

The focus is on handling Arabic dialectal diversity, enabling efficient SSL training with large-scale datasets.

Released Resources

Dataset

Ara-BEST-RQ dataset: Elyadata/Ara-Best-RQ_dataset

The Hugging Face dataset provides metadata for the crawled Ara-BEST-RQ data, including YouTube video identifiers and segment boundaries. No audio or video files are distributed as part of the dataset.
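
Because only metadata is distributed, each record has to be resolved back to its source video before any audio can be obtained. The sketch below shows one way to turn a record into a timestamped YouTube URL and a segment duration. The field names `video_id`, `start`, and `end`, and the assumption that boundaries are in seconds, are hypothetical; check the dataset card for the actual schema.

```python
from urllib.parse import urlencode


def segment_url(video_id: str, start: float) -> str:
    """Build a YouTube watch URL that opens at the segment start (whole seconds)."""
    query = urlencode({"v": video_id, "t": f"{int(start)}s"})
    return f"https://www.youtube.com/watch?{query}"


def segment_duration(start: float, end: float) -> float:
    """Return the segment length in seconds, rejecting inverted boundaries."""
    if end <= start:
        raise ValueError(f"end ({end}) must be greater than start ({start})")
    return end - start


# Hypothetical record: the field names are assumptions, not the dataset schema.
record = {"video_id": "VIDEO_ID", "start": 12.5, "end": 20.0}
print(segment_url(record["video_id"], record["start"]))
print(segment_duration(record["start"], record["end"]))
```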

Pretrained Models

The following Ara-BEST-RQ pretrained checkpoints are available on Hugging Face:

Model	Training data	Hugging Face
Ara-BEST-RQ-300M-6k	Crawled data	Elyadata/AraBEST-RQ-300M-6k
Ara-BEST-RQ-600M-6k	Crawled data	Elyadata/AraBEST-RQ-600M-6k
Ara-BEST-RQ-600M-14k	Combined datasets	Elyadata/AraBEST-RQ-600M-14k

Features

Support for Arabic speech datasets, including dialectal Arabic datasets.
Utilities for audio preprocessing, clustering, and segmentation.
Integration with SpeechBrain for SSL model training.
Recipes based on SpeechBrain's BEST-RQ implementation.

Installation and Setup

Prerequisites

Ensure the following dependencies are installed:

Python >= 3.8
SpeechBrain
ffmpeg, for audio processing
PyTorch, with a version compatible with your GPU/CPU setup

Setup

Clone this repository and set up the environment:

# Clone the repository
git clone https://github.com/elyadata/AraBEST-RQ.git
cd AraBEST-RQ

# Create a virtual environment, for example using conda
conda create -n arabic_ssl python=3.11
conda activate arabic_ssl

Install SpeechBrain:

From PyPI:

pip install speechbrain

Or from source, which is recommended:

cd .. # Or another root directory of your choice
git clone https://github.com/speechbrain/speechbrain.git
cd speechbrain
pip install -r requirements.txt
pip install --editable .

Install project dependencies:

cd ../AraBEST-RQ # Go back to this project's directory
pip install -r additonal_requirements.txt # File name as spelled in the repository

Ensure ffmpeg is available on your system PATH.

Linux (Debian/Ubuntu):

sudo apt update
sudo apt install ffmpeg
ffmpeg -version # To verify installation

macOS:

brew install ffmpeg
ffmpeg -version

Windows:

winget install --id=Gyan.FFmpeg -e # Using WinGet

Note: ffmpeg can also be installed in other ways. See the official website (https://ffmpeg.org) for details.
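
To confirm from Python that ffmpeg is reachable from the active environment, a small check along these lines may help. It is a minimal sketch that uses only the standard library; the helper name `ffmpeg_version` is illustrative and not part of this repository.

```python
import shutil
import subprocess
from typing import Optional


def ffmpeg_version(executable: str = "ffmpeg") -> Optional[str]:
    """Return the first line of `<executable> -version`, or None if it is not on PATH."""
    path = shutil.which(executable)
    if path is None:
        return None
    result = subprocess.run(
        [path, "-version"], capture_output=True, text=True, check=False
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    version = ffmpeg_version()
    print(version if version else "ffmpeg not found on PATH")
```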

Data Preparation

Generate SpeechBrain-compatible CSV files with columns such as ID, wav, start, end, duration, and dialect for more than 20 Arabic speech datasets.

For more details, refer to the data preparation guide. All supported datasets are listed in the catalog.
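
As an illustration of the manifest layout described above, the following sketch writes a CSV with the listed columns. The column names come from this README; the row values, the dialect labels, the file paths, and the assumption that `start`, `end`, and `duration` are in seconds are placeholders, so the files produced by the data preparation scripts may differ.

```python
import csv
import io

# Column names taken from the README; everything else here is illustrative.
FIELDNAMES = ["ID", "wav", "start", "end", "duration", "dialect"]


def make_row(utt_id, wav_path, start, end, dialect):
    """Build one manifest row; duration is derived from the segment boundaries."""
    if end <= start:
        raise ValueError("end must be greater than start")
    return {
        "ID": utt_id,
        "wav": wav_path,
        "start": start,
        "end": end,
        "duration": round(end - start, 3),
        "dialect": dialect,
    }


def write_manifest(rows, stream):
    """Write a header line followed by one CSV line per row."""
    writer = csv.DictWriter(stream, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


# Placeholder rows: paths and dialect labels are made up for the example.
rows = [
    make_row("utt_0001", "/data/audio/file_a.wav", 0.0, 4.2, "TUN"),
    make_row("utt_0002", "/data/audio/file_a.wav", 4.2, 9.75, "EGY"),
]
buffer = io.StringIO()
write_manifest(rows, buffer)
print(buffer.getvalue())
```

Writing to an in-memory buffer keeps the example side-effect free; passing an open file handle instead would produce a manifest on disk.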

Model Training with SpeechBrain

Use the provided SpeechBrain recipe to train SSL models for Arabic speech representation learning.

The recipe is based on SpeechBrain's BEST-RQ implementation.
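
For readers unfamiliar with BEST-RQ, the sketch below illustrates the random-projection quantizer that gives the method its name: input frames are projected with a frozen random matrix, and the index of the nearest entry in a frozen random codebook becomes the target label that the model learns to predict for masked frames. This is a pure-Python illustration, not the SpeechBrain implementation used by the recipe. The class name, the dimensions, and the use of Gaussian initialization for both the projection and the codebook are simplifying assumptions.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


class RandomProjectionQuantizer:
    """Frozen random projection plus frozen random codebook; neither is trained."""

    def __init__(self, input_dim, code_dim, num_codes, seed=0):
        rng = random.Random(seed)
        # Projection matrix of shape (code_dim, input_dim), fixed after init.
        self.projection = [
            [rng.gauss(0.0, 1.0) for _ in range(input_dim)]
            for _ in range(code_dim)
        ]
        # Codebook entries are L2-normalized once, up front.
        self.codebook = [
            l2_normalize([rng.gauss(0.0, 1.0) for _ in range(code_dim)])
            for _ in range(num_codes)
        ]

    def label(self, frame):
        """Return the index of the codebook entry nearest to the projected frame."""
        projected = l2_normalize(
            [sum(w * x for w, x in zip(row, frame)) for row in self.projection]
        )
        distances = [
            sum((c - p) ** 2 for c, p in zip(code, projected))
            for code in self.codebook
        ]
        return distances.index(min(distances))


# Illustrative sizes only; real feature and codebook sizes are set by the recipe.
quantizer = RandomProjectionQuantizer(input_dim=8, code_dim=4, num_codes=16)
frame = [0.1 * i for i in range(8)]
print(quantizer.label(frame))
```

Because the projected frame is normalized before the nearest-neighbour search, rescaling a frame by a positive constant does not change its label.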

Citation

If you use this project, dataset, or pretrained models in your research, please cite:

@misc{elleuch2026arabestrqmultidialectalarabic,
      title={Ara-Best-RQ: Multi Dialectal Arabic SSL}, 
      author={Haroun Elleuch and Ryan Whetten and Salima Mdhaffar and Yannick Estève and Fethi Bougares},
      year={2026},
      eprint={2603.21900},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2603.21900}, 
}
