
elyadata/AraBEST-RQ


Ara-BEST-RQ: Arabic Speech SSL

This repository contains tools and resources for Arabic Speech Self-Supervised Learning (SSL), including modules for data preparation and model training using SpeechBrain. The goal is to facilitate research and development in Arabic speech processing, with a focus on multidialectal Arabic speech representation learning.




Overview

This repository includes:

  • Data Preparation Module: Scripts to preprocess raw audio and text data into a format compatible with SpeechBrain recipes.
  • SpeechBrain Recipes: Custom recipes for training Arabic SSL models with BEST-RQ.

The focus is on handling Arabic dialectal diversity, enabling efficient SSL training with large-scale datasets.


Released Resources

Dataset

The Hugging Face dataset provides metadata for the crawled Ara-BEST-RQ data, including YouTube video identifiers and segment boundaries. No audio or video files are distributed as part of the dataset.

Pretrained Models

The following Ara-BEST-RQ pretrained checkpoints are available on Hugging Face:

| Model                | Training data     | Hugging Face                 |
|----------------------|-------------------|------------------------------|
| Ara-BEST-RQ-300M-6k  | Crawled data      | Elyadata/AraBEST-RQ-300M-6k  |
| Ara-BEST-RQ-600M-6k  | Crawled data      | Elyadata/AraBEST-RQ-600M-6k  |
| Ara-BEST-RQ-600M-14k | Combined datasets | Elyadata/AraBEST-RQ-600M-14k |
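
As a rough sketch, one way to fetch a checkpoint locally is through the `huggingface_hub` package, which is not listed in the prerequisites and is assumed to be installed separately. The repository IDs come from the table above; the `fetch_checkpoint` helper, the variant keys, and the `pretrained` cache directory are hypothetical names chosen for illustration. How the downloaded files are then loaded depends on the recipe and is not shown here.

```python
from pathlib import Path

# Repository IDs as listed in the table above.
CHECKPOINTS = {
    "300M-6k": "Elyadata/AraBEST-RQ-300M-6k",
    "600M-6k": "Elyadata/AraBEST-RQ-600M-6k",
    "600M-14k": "Elyadata/AraBEST-RQ-600M-14k",
}


def fetch_checkpoint(variant: str, cache_dir: str = "pretrained") -> Path:
    """Download all files of one checkpoint repository and return the local path.

    `variant` is a hypothetical short key into CHECKPOINTS, not an official name.
    """
    if variant not in CHECKPOINTS:
        raise KeyError(f"Unknown variant {variant!r}; choose from {sorted(CHECKPOINTS)}")
    # Imported lazily so the mapping above is usable without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    local_dir = snapshot_download(repo_id=CHECKPOINTS[variant], cache_dir=cache_dir)
    return Path(local_dir)
```

Calling `fetch_checkpoint("300M-6k")` requires network access and returns the directory holding the downloaded snapshot.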

Features

  • Support for Arabic speech datasets, including dialectal Arabic datasets.
  • Utilities for audio preprocessing, clustering, and segmentation.
  • Integration with SpeechBrain for SSL model training.
  • Recipes based on SpeechBrain's BEST-RQ implementation.

Installation and Setup

Prerequisites

Ensure the following dependencies are installed:

  • Python >= 3.8
  • SpeechBrain
  • ffmpeg, for audio processing
  • PyTorch, with a version compatible with your GPU/CPU setup

Setup

Clone this repository and set up the environment:

# Clone the repository
git clone https://github.com/elyadata/AraBEST-RQ.git
cd AraBEST-RQ

# Create a virtual environment, for example using conda
conda create -n arabic_ssl python=3.11
conda activate arabic_ssl

Install SpeechBrain:

  • From PyPI:
pip install speechbrain
  • Or from source, which is recommended:
cd .. # Or another root directory of your choice
git clone https://github.com/speechbrain/speechbrain.git
cd speechbrain
pip install -r requirements.txt
pip install --editable .

Install project dependencies:

cd ../AraBEST-RQ # Go back to this project's directory
pip install -r additional_requirements.txt

Ensure ffmpeg is available in your system path.

  • Linux:
sudo apt update
sudo apt install ffmpeg
ffmpeg -version # To verify installation
  • macOS:
brew install ffmpeg
ffmpeg -version
  • Windows:
winget install --id=Gyan.FFmpeg -e # Using WinGet
ffmpeg -version # Run in a new terminal to verify installation

Note: There are alternative ways to install ffmpeg. You can learn more on the official website.
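
The same check can be done from Python before launching a long preprocessing job. This is a minimal sketch using only the standard library; the `ffmpeg_version` helper is a hypothetical name and is not part of this repository.

```python
import shutil
import subprocess


def ffmpeg_version(executable: str = "ffmpeg"):
    """Return the first line of `<executable> -version`.

    Returns None if the executable is not on PATH or prints nothing.
    """
    path = shutil.which(executable)
    if path is None:
        return None
    result = subprocess.run(
        [path, "-version"], capture_output=True, text=True, check=False
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


version = ffmpeg_version()
if version is None:
    print("ffmpeg not found on PATH; install it before running data preparation.")
else:
    print(version)
```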


Data Preparation

The data preparation module generates SpeechBrain-compatible CSV files for more than 20 Arabic speech datasets. Each file contains columns such as ID, wav, start, end, duration, and dialect.

For more details, refer to the data preparation guide. All supported datasets are listed in the catalog.
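
To make the target format concrete, here is a small sketch that writes a manifest with the columns named above. It is not the repository's preparation script. It assumes that start, end, and duration are expressed in seconds, and the utterance IDs, file paths, and the `TUN` dialect label are made-up values for illustration.

```python
import csv
import io

# Column names as listed in this section.
FIELDS = ["ID", "wav", "start", "end", "duration", "dialect"]


def make_row(utt_id, wav_path, start, end, dialect):
    """Build one manifest row; start and end are assumed to be in seconds."""
    if end <= start:
        raise ValueError(f"Segment {utt_id} has a non-positive duration")
    return {
        "ID": utt_id,
        "wav": wav_path,
        "start": f"{start:.2f}",
        "end": f"{end:.2f}",
        "duration": f"{end - start:.2f}",
        "dialect": dialect,
    }


def write_manifest(rows, handle):
    """Write the header and all rows to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical segments, for illustration only.
rows = [
    make_row("utt_0001", "/data/audio/rec_a.wav", 0.0, 4.5, "TUN"),
    make_row("utt_0002", "/data/audio/rec_a.wav", 4.5, 10.25, "TUN"),
]
buffer = io.StringIO()
write_manifest(rows, buffer)
print(buffer.getvalue())
```

Writing to a real file instead only requires passing `open("train.csv", "w", newline="")` as the handle.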


Model Training with SpeechBrain

Use the provided SpeechBrain recipe to train SSL models for Arabic speech representation learning.

The recipe is based on SpeechBrain's BEST-RQ implementation.
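
For readers unfamiliar with BEST-RQ, its training targets come from a frozen random-projection quantizer: each input frame is projected with a fixed random matrix, and the index of the nearest vector in a fixed random codebook becomes the label that the model predicts for masked frames. The NumPy sketch below illustrates that idea only. It is not SpeechBrain's implementation, and the function names and default sizes are illustrative assumptions rather than this repository's configuration.

```python
import numpy as np


def make_quantizer(feat_dim, code_dim=16, codebook_size=8192, seed=0):
    """Create a frozen random projection and a unit-norm random codebook.

    Neither array is ever updated during training.
    """
    rng = np.random.default_rng(seed)
    # Xavier-uniform style range for the projection matrix.
    limit = np.sqrt(6.0 / (feat_dim + code_dim))
    projection = rng.uniform(-limit, limit, size=(feat_dim, code_dim))
    codebook = rng.standard_normal((codebook_size, code_dim))
    codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)
    return projection, codebook


def quantize(features, projection, codebook):
    """Map features of shape (T, feat_dim) to integer labels of shape (T,)."""
    projected = features @ projection
    norms = np.linalg.norm(projected, axis=1, keepdims=True)
    projected = projected / np.maximum(norms, 1e-12)  # guard against all-zero frames
    # For unit vectors, the nearest neighbour in L2 distance is the one
    # with the largest cosine similarity.
    similarity = projected @ codebook.T
    return similarity.argmax(axis=1)


# Toy example: 10 frames of 80-dimensional features, small codebook.
projection, codebook = make_quantizer(feat_dim=80, codebook_size=32)
frames = np.random.default_rng(1).standard_normal((10, 80))
labels = quantize(frames, projection, codebook)
print(labels.shape)
```

Because the projection and codebook are fixed by the seed, the same features always map to the same labels, which is what lets them serve as stable prediction targets.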


Citation

If you use this project, dataset, or pretrained models in your research, please cite:

@misc{elleuch2026arabestrqmultidialectalarabic,
      title={Ara-Best-RQ: Multi Dialectal Arabic SSL}, 
      author={Haroun Elleuch and Ryan Whetten and Salima Mdhaffar and Yannick Estève and Fethi Bougares},
      year={2026},
      eprint={2603.21900},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2603.21900}, 
}

About

Official implementation of AraBEST-RQ. https://arxiv.org/pdf/2603.21900
