<a href="https://colab.research.google.com/github/cawoylel/nlp4all/blob/main/asr/src/asr_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---

<h1 align="center"><strong>Developing a Speech Recognition System for your language: a practical guide from data acquisition to model training</strong></h1>

<h4 align="center"><strong>Yaya Sy, Dioula Doucouré</strong></h4>

---

<p align="center">
  <img src="https://github.com/cawoylel/nlp4all/blob/main/asr/illustrations/cawoylel.png?raw=true:, width=200" alt="transformer" width=200>
<br>
    <em>
    https://cawoylel.com/
    </em>
</p>

# Introduction

A few months ago, we released __windanam__, a collection of multidialectal speech recognition models for the Fula language, which is underrepresented in NLP solutions. During development, we encountered significant hurdles in gathering data and creating the models. Developing speech recognition systems for such languages presents unique challenges. Automatic Speech Recognition (ASR) relies on supervised learning and therefore requires annotated datasets that include both audio recordings and their corresponding textual transcriptions. However, acquiring this type of data is particularly arduous for languages that are predominantly spoken and either lack a written form or have an unstandardized script.

In this guide, we outline the foundational steps necessary for developing an ASR model for languages that do not have pre-existing annotated datasets for ASR. We detail the strategies and methodologies that contributed to the successful creation of __windanam__, covering everything from the initial data collection phase to the model training stage. This guide marks the beginning of our [_NLP4ALL_](https://github.com/cawoylel/nlp4all) series, which is focused on simplifying the process of building NLP models for underrepresented languages and making it more accessible. Our initiative is not limited to just one language; it embodies a broader mission to promote open-source collaboration and democratize NLP technology for every language. We aim to provide a replicable framework that communities can adapt for their languages, aligning with our vision of making NLP technology widely accessible.

While this tutorial uses the Seereer language as a case study, the outlined process is applicable to any language facing similar challenges of underrepresentation in the NLP domain. Notably, there is a lack of annotated datasets for the development of ASR models in Seereer. By selecting Seereer as our example, we dissect the entire workflow, from the generation of data to the training of the model. This includes audio scraping, alignment of short segments, data processing, and finally building your ASR model by fine-tuning an existing open-source model.

# Table of Contents

1. [Setup and Configuration](#setup)
2. [Getting the ASR Data](#data)
    - 2.1 [Bible Crawler](#crawler)
    - 2.2 [Resampling the Audios](#resampling)
    - 2.3 [Neural Forced Alignment](#aligner)
3. [Data Preprocessing](#datasets)
4. [Fine-Tuning Whisper Large on Seereer ASR Dataset](#finetuning)
    - 4.1 [Data processing](#preprocess)
    - 4.2 [Load Model](#load)
    - 4.3 [Prepare the Pretrained Model for LoRA](#prepare)
    - 4.4 [Attach Adapters to the Pretrained Model](#attach)
    - 4.5 [Data Loader](#collator)
    - 4.6 [Training Arguments](#training_args)
    - 4.7 [Launch Training and Evaluate](#training)
5. [Evaluation](#evaluate)
6. [Challenges and future directions](#challenges)
    - 6.1 [Robustness](#robust)
    - 6.2 [Lexical Diversity](#vocab)
    - 6.3 [Deploy your model](#deploy)


# Setup  <a name="setup"></a>

We use a Google Colab notebook in this tutorial, which runs on a free Ubuntu virtual machine with 2 CPUs provided by Google. This virtual machine also comes with some free hours on a single 16 GB GPU.

Let's quickly set up the Colab virtual machine to ensure that our environment has all the necessary dependencies installed. First, we update the package list with `apt-get update`. Then, we install essential packages required for audio processing, such as `libsox-fmt-all`, `sox`, and `ffmpeg`. Additionally, we install `libicu-dev` and `pkg-config` to handle text processing and Unicode symbols effectively.

In [None]:
!apt-get update
!apt-get install libsox-fmt-all sox ffmpeg # needed for processing audio
!apt install libicu-dev pkg-config # needed for processing text and unicode symbols

Let's now set up the Python environment. We focus on installing the PyTorch nightly version, which is crucial for compatibility with other libraries and tools. We begin by uninstalling existing versions of `torch`, `torchaudio`, and `torchvision`. Subsequently, we install the PyTorch nightly version along with its dependencies.

To further enrich our toolkit, we install several additional libraries that play important roles in different stages of the ASR pipeline. These include:

- **Data Processing**: `sox` (audio processing), `scrapy` (data scraping), `ICU-Tokenizer` (text tokenization), `datasets` (loading training data), `librosa` (audio processing).

- **Model Training and Evaluation**: `transformers` (training the model), `evaluate` (assessing model performance), `jiwer` (calculating Word Error Rate).

- **Efficiency and Optimization**: `accelerate` (for faster training and evaluation), `peft` (lightweight training using LoRA)

In [None]:
!pip uninstall torch torchaudio torchvision -y # we need to install the nightly version of torch
!pip install -q --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118
!pip install -q fairseq # we will use this package for aligning the audio-text
!pip install -q dataclasses
!pip install -q sox # for audio processing
!pip install -q scrapy # for scraping the data
!pip install -q ICU-Tokenizer # for tokenizing the text
!pip install -q transformers # we will use huggingface transformers for training the models
!pip install -q datasets # we will use huggingface datasets for loading the training dataset
!pip install -q librosa # required by huggingface dataset
!pip install -q evaluate # for evaluating the models
!pip install -q jiwer # for computing WER (Word Error Rate)
!pip install -q bitsandbytes # for loading the quantized model
!pip install -q accelerate # for efficient training and evaluation
!pip install -q git+https://github.com/huggingface/peft.git@main # for lightweight training using LoRA

In this step, we clone repositories containing code and resources essential for our ASR project. Specifically, we clone the `cawoylel/nlp4all` repository, which holds the code for this tutorial, the `isi-nlp/uroman` repository, which provides functionalities for romanization of text data, and the `facebookresearch/fairseq` repository, which contains the alignment scripts used later in this tutorial.

In [None]:
!git clone https://github.com/cawoylel/nlp4all.git # repository containing the code of this tutorial

In [None]:
%%shell
git clone https://github.com/isi-nlp/uroman.git
cd uroman  # the checkout must run inside the cloned repository
git checkout 7750feb

In [None]:
%%shell
git clone https://github.com/facebookresearch/fairseq.git
cd fairseq
pip install --editable ./

# Getting the ASR data  <a name="data"></a>

As we said earlier, obtaining ASR data for training is really difficult for many languages. Data scarcity is a major roadblock in developing Natural Language Processing (NLP) tools for underrepresented languages.

To train a Speech Recognition model, you need pairs of spoken utterances and their corresponding textual transcriptions. How do you find this data? One good resource is the Common Voice project, a data collection initiative where speakers record themselves reading sentences. To date, it includes over a hundred languages, and the language you're interested in may be among them. Unfortunately, Seereer is not included.

If your language doesn't have annotated data for ASR, maybe you can find some audiobooks with their transcriptions. One audiobook resource that covers many languages is the Bible. This book has been translated into more than 4,000 languages in audiobook form. These recordings are made by native speakers, offering authentic insights into pronunciation, intonation, and linguistic nuances. However, we only have transcriptions for 1,000 languages. This is really a valuable resource for many languages, and perhaps the language you are interested in is included. You can check for this on the following website: https://www.bible.com/. Fortunately, the Seereer language has audiobooks and the corresponding transcriptions, for example, for the book of Samuel: https://www.bible.com/bible/3751/1SA.1.SRR23.

However, there are a couple of things to keep in mind. First, these recordings are usually done in very quiet, studio-like settings where people speak very clearly. While this ensures high-quality audio, models trained on these data might struggle in noisier, real-world conditions where speech patterns are more varied and spontaneous. Additionally, the majority of these recordings are by adult males, which could limit the model's ability to accurately recognize voices of other genders. The Seereer Bible recordings might not be perfect, but for a language with limited online resources, they provide a starting point for building powerful ASR tools.

## Bible Crawler <a name="crawler"></a>

Let's scrape the Bible data for Seereer.

This section imports the required modules for the scraping process. It includes `CrawlerProcess` from Scrapy for managing the crawling process, `SentSplitter` from `icu_tokenizer` for sentence splitting, and the `BibleScraper` we developed. You can find the code in `scraper.py` in the `asr/src` folder of the `nlp4all` repository.

In [None]:
from scrapy.crawler import CrawlerProcess
from icu_tokenizer import SentSplitter
from nlp4all.asr.src.scraper import BibleScraper

We initialize the `SentSplitter` object. You can also create your own splitter class, as long as it implements a `.split()` method.

In [None]:
SPLITTER = SentSplitter()
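
If ICU sentence splitting does not suit your language, a custom splitter can be used instead. The sketch below is a hypothetical `RegexSplitter` (it is not part of `icu_tokenizer` or `nlp4all`). It assumes that the scraper only calls `.split(text)` and expects a list of sentences back, and that sentences in your text end with `.`, `!` or `?`.

```python
import re
from typing import List


class RegexSplitter:
    """Hypothetical splitter: cuts text after '.', '!' or '?' followed by whitespace."""

    def __init__(self, pattern: str = r"(?<=[.!?])\s+"):
        self._boundary = re.compile(pattern)

    def split(self, text: str) -> List[str]:
        # Split on sentence boundaries and drop empty pieces.
        return [s.strip() for s in self._boundary.split(text) if s.strip()]


sentences = RegexSplitter().split("First sentence. Second one? Third!")
print(sentences)
```

An instance of such a class could then be passed wherever the `SentSplitter` object is expected, provided the assumption about `.split()` holds for your version of the scraper.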

This block of code sets up and executes the scraping process using Scrapy. It initializes the CrawlerProcess object, specifies parameters for the scraping task (such as the name of the scraper, the output folder where the data will be stored, start URLs, language, and language code), and then starts the scraping process.

In [None]:
process = CrawlerProcess()
process.crawl(BibleScraper,
              name="SeereerBible", # the name of scraping process, you can choose any name you want, it's not an important argument.
              output_folder="SeereerBible", # Where to store the scraped data.
              start_urls=["https://www.bible.com/bible/3751/GEN.1.SRR23"], # Look at the bible website and copy here the link of the first chapter of the first book
              language="Sereer-Sine", # TODO: remove some arguments to make this simple to understand
              code="SRR23",
              splitter=SPLITTER) # the sentence splitter created above

In [None]:
process.start()

## Resampling the audios  <a name="resampling"></a>

After downloading the audio files, we need to resample them. Many modern speech models expect audio sampled at *16 kHz* (16,000 samples per second). We will use `ffmpeg` to convert the audio to mono at 16 kHz and save the result as `.wav` files.

In [None]:
%%shell
for f in /content/SeereerBible/raw/Sereer-Sine/*.mp3; do
  filename="$(basename "$f")"
  directory="$(dirname "$f")"
  stem="${filename%.*}"
  ffmpeg -i "$f" -ac 1 -ar 16000 "$directory/$stem.wav"
done
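
After resampling, it can be worth checking that the files really are mono and sampled at 16 kHz. The sketch below uses only Python's standard `wave` module. The helper name `wav_info` is ours, and the script writes a small synthetic tone so that it runs without the scraped data; on Colab you would call `wav_info` on the `.wav` files produced above instead.

```python
import math
import os
import struct
import tempfile
import wave


def wav_info(path):
    """Return (channels, sample_rate, duration_in_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return wav.getnchannels(), rate, frames / rate


# Build a one-second 440 Hz test tone so the sketch runs without the scraped data.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.wav")
with wave.open(demo_path, "wb") as wav:
    wav.setnchannels(1)      # mono, like `-ac 1`
    wav.setsampwidth(2)      # 16-bit PCM
    wav.setframerate(16000)  # 16 kHz, like `-ar 16000`
    samples = (int(10000 * math.sin(2 * math.pi * 440 * n / 16000)) for n in range(16000))
    wav.writeframes(b"".join(struct.pack("<h", s) for s in samples))

channels, rate, duration = wav_info(demo_path)
print(channels, rate, duration)
```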

## Neural Forced Alignment  <a name="aligner"></a>

We cannot directly train a Speech Recognition model on an entire audiobook because the audio is too long to be fed into a model. Ideally, we need smaller segments (up to about 30 seconds each). Splitting the audio while keeping it aligned with the corresponding transcription is not an easy task. We can achieve this by using a forced aligner, a method that aligns short spoken utterances with their corresponding transcriptions.

<p align="center">
  <img src="https://github.com/cawoylel/nlp4all/blob/main/asr/illustrations/forced_aligner.png?raw=true:, width=200" alt="transformer" width=500 class="center">
<br>
    <em>
    Illustration of the forced alignment task
    </em>
</p>
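
To make the idea concrete, here is a toy sketch of what happens once an aligner has found a time span for each sentence: the long recording is simply cut at those timestamps. This is not the MMS aligner itself; the function name `cut_segments`, the spans, and the texts are made up for illustration.

```python
from typing import List, Sequence, Tuple


def cut_segments(samples: Sequence[float], sample_rate: int,
                 spans: List[Tuple[float, float, str]]):
    """Slice `samples` into (segment, text) pairs from (start_sec, end_sec, text) spans."""
    segments = []
    for start_sec, end_sec, text in spans:
        start = int(round(start_sec * sample_rate))
        end = int(round(end_sec * sample_rate))
        segments.append((samples[start:end], text))
    return segments


sample_rate = 16000
audio = [0.0] * (sample_rate * 5)  # five seconds of silence as a stand-in
# Made-up spans of the kind a forced aligner produces for two sentences.
toy_spans = [(0.0, 2.5, "first sentence"), (2.5, 4.0, "second sentence")]

for segment, text in cut_segments(audio, sample_rate, toy_spans):
    print(text, len(segment) / sample_rate, "seconds")
```

The hard part, which the neural aligner solves, is finding the spans; cutting the audio afterwards is straightforward.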

We will use the [MMS](https://github.com/facebookresearch/fairseq/blob/main/examples/mms/README.md) forced aligner to do this. It relies on a multilingual speech model trained on more than a thousand languages. You can check here if your language is included: https://dl.fbaipublicfiles.com/mms/tts/all-tts-languages.html

If you cannot find your language, you can choose the language most similar to yours; this can also give good results. In the case of the Seereer language, we will use the Fula model (code `ful`) to align the Seereer data.

Note: If you use Google Colab, you may see errors such as `cannot open shared object file: No such file or directory` or `Failed to load FFmpeg5 extension` while running the following cell. You can ignore them, as they do not affect the alignment.

In [None]:
%%shell
input_folder=/content/SeereerBible/raw/Sereer-Sine
output_folder=/content/SeereerBible/aligned/Sereer-Sine
cd fairseq/
for audio in $input_folder/*.wav; do
  filename="$(basename "$audio")"
  stem=${filename%.*}
  output_path=$output_folder/$stem
  rm -rf $output_path
  python -m examples.mms.data_prep.align_and_segment \
  --audio_filepath $input_folder/$stem.wav \
  --text_filepath $input_folder/$stem.txt \
  --lang ful \
  --outdir $output_path \
  --uroman /content/uroman/bin
done

# Preparing the HuggingFace dataset  <a name="datasets"></a>

Once you have downloaded the audio data and used neural forced alignment to split it into shorter segments (less than 30 seconds) paired with their transcripts (as explained in the Neural Forced Alignment section), we'll use **Hugging Face Datasets** to create a structured data format for training our ASR model.

If you're unfamiliar with 🤗 [Datasets](https://huggingface.co/docs/datasets/en/index), it's a library provided by Hugging Face for processing large-scale corpora. You can use it to load the datasets available on the Hugging Face Hub (text corpora, classification datasets, translation datasets, question answering datasets, and more). You can also use the Datasets library as a Python API for efficiently processing your data on your local machine. In our case, we'll leverage Hugging Face Datasets to process and organize our Seereer speech data and transcripts into a format readily usable for training our ASR model.

We will first import the `logging` library and disable unnecessary logging messages to keep the console output cleaner.

In [None]:
import logging
logging.disable(logging.DEBUG)
logging.disable(logging.INFO)
logging.disable(logging.WARNING)

We then import the `Path` class from the `pathlib` library, which provides a convenient way to handle file paths. We also import the `datasets` library from Hugging Face.

In [None]:
from pathlib import Path
import datasets

We define the structure of our dataset using the **`Features`** class from the Hugging Face datasets module. This structure specifies the format of each sample in the dataset. It includes two key elements:

- **audio**: This defines an audio feature with a sampling rate of 16 kHz, which is a common standard for speech models.

- **transcription**: This defines a string feature to store the corresponding text transcript for each audio segment.

In [None]:
features = datasets.Features(
    {
        "audio": datasets.features.Audio(sampling_rate=16_000),
        "transcription": datasets.Value("string"),
    }
)

We define a function called **`dataset_generator`** that takes the path to the input folder (containing the processed audio data and transcripts) as input. This function essentially reads information about each audio-text pair from the manifest files and prepares it in a format suitable for the dataset creation process.

The function iterates through each Bible chapter subfolder and reads the corresponding **manifest.json** file, which contains one record per line describing an audio segment and its transcript. We extract the audio file path and text transcript from each line in the manifest file. Finally, the function yields a dictionary with two key-value pairs: **audio** containing the audio file path and **transcription** containing the text transcript.

In [None]:
import json

def dataset_generator(input_folder: str):
    input_folder = Path(input_folder)
    # one subfolder per chapter, each produced by the alignment step
    for chapter in input_folder.glob("*/"):
        with open(chapter / "manifest.json", "r") as manifest:
            # each line of the manifest describes one audio segment
            for line in manifest:
                data = json.loads(line)
                audio_filepath = data["audio_filepath"]
                text = data["text"]
                yield {
                    "audio": audio_filepath,
                    "transcription": text
                }
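
To see what such a generator yields without running the full pipeline, the sketch below builds a toy chapter folder and reads it back. It assumes each manifest line is a JSON object with the `audio_filepath` and `text` keys used above; the helper `iter_manifest`, the folder name, the file names, and the texts are placeholders.

```python
import json
import tempfile
from pathlib import Path


def iter_manifest(chapter_folder: Path):
    """Yield one dict per line of a JSON-lines `manifest.json` file."""
    with open(chapter_folder / "manifest.json", "r", encoding="utf-8") as manifest:
        for line in manifest:
            if line.strip():  # skip blank lines
                yield json.loads(line)


# Toy chapter folder with two made-up entries (file names and texts are placeholders).
chapter = Path(tempfile.mkdtemp()) / "GEN_1"
chapter.mkdir()
entries = [
    {"audio_filepath": str(chapter / "segment0.flac"), "text": "sentence one"},
    {"audio_filepath": str(chapter / "segment1.flac"), "text": "sentence two"},
]
with open(chapter / "manifest.json", "w", encoding="utf-8") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")

rows = [{"audio": e["audio_filepath"], "transcription": e["text"]} for e in iter_manifest(chapter)]
print(rows)
```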

Our final step is to create the Hugging Face Dataset using `datasets.Dataset.from_generator`.

The `from_generator` method in Hugging Face Datasets provides a convenient way to create datasets from Python generators, enabling efficient handling of large or dynamically generated datasets in machine learning projects.

> 🤗
The `from_generator()` method is the most memory-efficient way to create a dataset from a generator due to a generator's iterative behavior. This is especially useful when you’re working with a really large dataset that may not fit in memory, since the dataset is generated on disk progressively and then memory-mapped.

In [None]:
dataset = datasets.Dataset.from_generator(dataset_generator,
                                          features=features,
                                          gen_kwargs={"input_folder": "/content/SeereerBible/aligned/Sereer-Sine"}
                                          ).cast_column("audio", datasets.Audio(sampling_rate=16_000))

The `cast_column("audio", datasets.Audio(sampling_rate=16_000))` call ensures the **audio** column is recognized as audio data by Hugging Face Datasets and decoded at 16 kHz.

We now have short text sentences with their corresponding audio files in the Hugging Face dataset format, and we're ready to process them for a specific model. In this tutorial, we will use Whisper, a pretrained multilingual ASR model. Let's see how to prepare the data for this specific model.

# Finetuning Whisper model on the Seereer data <a name="finetuning"></a>

Whisper is an encoder-decoder model designed for speech processing. The encoder, a Transformer, is trained to represent speech effectively, while the decoder, also a Transformer, is tasked with transcribing the speech based on these representations. Both components are trained simultaneously: the encoder learns to provide accurate representations to the decoder, which in turn learns to generate transcriptions from these representations. Whisper is a multilingual and multitask model, supporting approximately a hundred languages. Beyond speech recognition, it is capable of performing additional tasks, such as translating spoken input. We plan to adapt this model to work with Seereer data.

Since the Whisper model does not directly support the Seereer or Fula languages, we employ a workaround by selecting the closest available language within the model. Several strategies can be used to select the best language for fine-tuning. For Seereer, for instance, we selected the **Hausa** language. This selection process can involve:

- Choosing the language with the best score on a Seereer development corpus.
- Choosing the language that the model most frequently predicts for the Seereer audio.
- Choosing the language that tokenizes Seereer texts most effectively.

Language selection ensures that our ASR model operates effectively despite the absence of direct language support.
We will use the Hausa language, as we found that this language works well for Fula in previous experiments, and we know that Fula is linguistically the closest language to Seereer.
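
The second strategy in the list above amounts to a majority vote over per-clip language predictions. The sketch below shows only the counting step. The list of predicted codes is a placeholder, not a result we obtained; in practice it would come from running the model's language identification on your own recordings, and `most_predicted_language` is a name we made up.

```python
from collections import Counter


def most_predicted_language(predicted_codes):
    """Return the most frequent language code and its share of the predictions."""
    counts = Counter(predicted_codes)
    code, count = counts.most_common(1)[0]
    return code, count / len(predicted_codes)


# Placeholder predictions, one per audio clip; real ones would come from
# running the model's language identification on your own recordings.
toy_predictions = ["ha", "ha", "sw", "ha", "yo", "ha", "sw", "ha"]
code, share = most_predicted_language(toy_predictions)
print(code, share)
```

A low share for the winning language suggests that this criterion alone is not decisive and that the other strategies are worth trying.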

In [None]:
model_name_or_path = "openai/whisper-large-v3"
task = "transcribe"
language = "Hausa"
language_abbr = "ha"

## Preprocess the dataset for the Whisper model <a name="preprocess"></a>

Because the model is an encoder-decoder, with the encoder taking the audio as input and the decoder producing the text as output, we need two processors: one for the speech and one for the text. In NLP, the text processor is a `Tokenizer`, a simple model that splits the text into smaller pieces (words and subwords). The speech processor depends on the model. In the case of Whisper, the speech encoder takes a spectrogram (an image-like representation of the audio) as input, so the audio processor transforms the raw waveform into a spectrogram. This operation is called `FeatureExtraction` here. The `transformers` library offers a convenient way to use these two processors: the `Processor` class encapsulates both in a single object.

In [None]:
import os
from transformers import WhisperFeatureExtractor
from transformers import WhisperTokenizer
from transformers import WhisperProcessor

In [None]:
# feature_extractor = WhisperFeatureExtractor.from_pretrained(model_name_or_path)
# tokenizer = WhisperTokenizer.from_pretrained(model_name_or_path, language=language, task=task)
processor = WhisperProcessor.from_pretrained(model_name_or_path, language=language, task=task)
tokenizer = processor.tokenizer
feature_extractor = processor.feature_extractor

This function will extract speech features of the audio and tokenize the text.

In [None]:
def prepare_dataset(batch):
    # decoded audio: the waveform array and its sampling rate (already 16 kHz after resampling)
    audio = batch["audio"]

    # compute log-Mel input features from input audio array
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]

    # encode target text to label ids
    batch["labels"] = tokenizer(batch["transcription"]).input_ids
    return batch

In [None]:
num_proc = os.cpu_count() # Parallelize the process using multiple CPUs
dataset = dataset.map(prepare_dataset, remove_columns=dataset.column_names, num_proc=num_proc)
dataset = dataset.train_test_split(0.1)

In [None]:
dataset

Now, we're officially done with the data preparation part!

## Load Whisper model <a name="load"></a>

The data is now preprocessed and ready for training. Next, we load the model and prepare it for training. We will use a QLoRA-style setup (LoRA adapters on top of a quantized model), an efficient method for training large Transformer models without large compute resources.

In [None]:
from transformers import WhisperForConditionalGeneration, Seq2SeqTrainer, TrainerCallback, TrainingArguments, TrainerState, TrainerControl
from transformers import Seq2SeqTrainingArguments
from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR
from peft import prepare_model_for_kbit_training
from peft import LoraConfig, PeftModel, LoraModel, get_peft_model
import evaluate
import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union

We load the Whisper model from the Hugging Face Hub. Instead of loading the model in full precision, we load it in 8-bit so it can fit easily on a single small GPU.

In [None]:
model = WhisperForConditionalGeneration.from_pretrained(model_name_or_path, load_in_8bit=True, device_map="auto")

Training in 8-bit differs from training a model in full (FP32) or half (FP16) precision, so we use the PEFT library to prepare the model for 8-bit training.

In [None]:
model = prepare_model_for_kbit_training(model)
model.model.encoder.conv1.register_forward_hook(lambda module, input, output: output.requires_grad_(True))

## Attach LoRA adapters to the pretrained model <a name="attach"></a>

Instead of updating all the model parameters as in conventional fine-tuning, we attach small adapters with few parameters to the model, and only these parameters are updated during fine-tuning. We will use LoRA for this.

In [None]:
config = LoraConfig(r=32, # The rank to use. A big rank will require more GPU memory
                    lora_alpha=64, # used for scaling. A common choice is alpha = 2 * r, as here
                    target_modules=["q_proj", "v_proj"], # which modules of the transformer to adapt.
                    lora_dropout=0.05,
                    bias="none")

model = get_peft_model(model, config)
model.print_trainable_parameters()
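
To get a feel for how small the adapters are, the sketch below estimates the number of trainable LoRA parameters by hand. For a weight matrix of shape `d_out x d_in`, LoRA adds two low-rank matrices with `r * (d_in + d_out)` parameters in total, and their product is scaled by `lora_alpha / r`. The architecture figures (hidden size 1280, 32 encoder and 32 decoder layers) are our assumptions about the large Whisper checkpoint; compare the result with what `print_trainable_parameters()` reports on your machine.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# Assumed architecture of the large Whisper checkpoint (not read from the model).
d_model = 1280
encoder_layers = 32
decoder_layers = 32

r, lora_alpha = 32, 64  # values from the LoraConfig above

# q_proj and v_proj exist once per encoder layer (self-attention) and twice
# per decoder layer (self-attention and cross-attention).
adapted_modules = 2 * encoder_layers + 2 * 2 * decoder_layers
trainable = adapted_modules * lora_param_count(d_model, d_model, r)
scaling = lora_alpha / r

print(adapted_modules, trainable, scaling)
```

Doubling `r` doubles the number of trainable parameters, which is why a larger rank needs more GPU memory.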

## Datacollator <a name="collator"></a>

This class is called when iterating over the batches. It pads the inputs and masks the padded label tokens so that the model does not learn to generate them.

In [None]:
@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # if the bos token was prepended in the previous tokenization step,
        # cut it here as it is added again later anyway
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch

In [None]:
data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)
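
The label masking done by the collator can be illustrated with plain Python lists. In this sketch, sequences are padded to the same length and the padded positions are replaced by `-100`, the index that PyTorch's cross-entropy loss ignores by default. The function `pad_and_mask`, the token ids, and the pad id are made up for illustration.

```python
def pad_and_mask(label_sequences, pad_id, ignore_index=-100):
    """Right-pad label sequences to equal length and mask the padding for the loss."""
    max_len = max(len(seq) for seq in label_sequences)
    padded, masked = [], []
    for seq in label_sequences:
        n_pad = max_len - len(seq)
        padded.append(seq + [pad_id] * n_pad)
        # Padded positions get `ignore_index`, which PyTorch's cross-entropy skips.
        masked.append(seq + [ignore_index] * n_pad)
    return padded, masked


toy_labels = [[11, 12, 13, 14], [21, 22]]  # made-up token ids
padded, masked = pad_and_mask(toy_labels, pad_id=0)
print(padded)
print(masked)
```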

## Training arguments <a name="training_args"></a>

Set your training arguments as you want. Check the arguments for the class [TrainingArguments](https://huggingface.co/docs/transformers/v4.39.1/en/main_classes/trainer#transformers.TrainingArguments).

In [None]:
training_args = Seq2SeqTrainingArguments(
    output_dir="nlp4all/seereer_whisper",  # change to a repo name of your choice
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-3,
    warmup_steps=50,
    num_train_epochs=1,
    evaluation_strategy="steps",
    fp16=True,
    per_device_eval_batch_size=8,
    generation_max_length=128,
    logging_steps=10,
    save_steps=20,
    max_steps=100, # only for testing purposes, remove this from your final run :)
    remove_unused_columns=False,  # required as the PeftModel forward doesn't have the signature of the wrapped model's forward
    label_names=["labels"],  # same reason as above
)
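
The comment on `gradient_accumulation_steps` refers to the effective batch size, that is, the number of examples that contribute to each optimizer step. The short sketch below works through the arithmetic for the values above, assuming a single GPU; `effective_batch_size` is a helper we define here, not a `transformers` function.

```python
def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Number of examples contributing to each optimizer step."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


# Values from the training arguments above, on a single GPU.
batch = effective_batch_size(8, 1)
examples_seen = batch * 100  # max_steps=100
# Halving the per-device batch while doubling accumulation keeps the same effective batch.
same_batch = effective_batch_size(4, 2)
print(batch, examples_seen, same_batch)
```

If you run out of GPU memory, lowering `per_device_train_batch_size` and raising `gradient_accumulation_steps` by the same factor keeps the optimization comparable.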



In [None]:
# This callback helps to save only the adapter weights and remove the base model weights.
class SavePeftModelCallback(TrainerCallback):
    def on_save(
        self,
        args: TrainingArguments,
        state: TrainerState,
        control: TrainerControl,
        **kwargs,
    ):
        checkpoint_folder = os.path.join(args.output_dir, f"{PREFIX_CHECKPOINT_DIR}-{state.global_step}")

        peft_model_path = os.path.join(checkpoint_folder, "adapter_model")
        kwargs["model"].save_pretrained(peft_model_path)

        pytorch_model_path = os.path.join(checkpoint_folder, "pytorch_model.bin")
        if os.path.exists(pytorch_model_path):
            os.remove(pytorch_model_path)
        return control

## Launch the training <a name="training"></a>

In [None]:
metric = evaluate.load("wer") # we will use the Word Error Rate (WER) metric
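
For readers unfamiliar with WER, here is a minimal from-scratch sketch of the computation: the word-level edit distance between reference and hypothesis, divided by the number of reference words. It is only an illustration (whitespace tokenization, no text normalization, non-empty reference assumed); the notebook itself relies on the `wer` metric loaded above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # previous[j] holds the edit distance between the processed prefix of ref and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the) over six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

A WER of 0 means a perfect transcription, and the value can exceed 1 when the hypothesis contains many insertions.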

In [None]:
trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    data_collator=data_collator,
    tokenizer=processor.feature_extractor,
    callbacks=[SavePeftModelCallback],
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!

In [None]:
trainer.train()

After training, the model is saved on your local machine so you can use it later.

In [None]:
trainer.save_model("Seereer_Bible_100steps")

# Evaluation <a name="evaluate"></a>

Once training is done, you can evaluate your model on the test split.

In [None]:
from nlp4all.asr.src.evaluate import evaluate

In [None]:
results = evaluate(processor, dataset, data_collator, language, task, metric, model)

Here is a small comparison between the predicted and reference transcriptions:

In [None]:
print(list(zip(results["unnormalized_predictions"],
               results["unnormalized_references"])))

**...**

# Challenges and future directions <a name="challenges"></a>

## Robustness <a name="robust"></a>

We used the Bible data. As noted in the *Getting the ASR data* section, there are some limitations to take into account when using the Bible in your ASR pipeline (high-quality audio recorded in ideal conditions without any background noise, predominantly male voices, and so on). To address these limitations, we can, for instance, employ **data augmentation** techniques to enhance the robustness of our models. By introducing elements like background noise into the dataset, we can simulate more diverse listening environments, thereby preparing our models to perform reliably in a variety of real-world scenarios.
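
As a minimal sketch of noise augmentation, the snippet below mixes white Gaussian noise into a signal at a chosen signal-to-noise ratio (SNR). It uses a synthetic tone as a stand-in for a speech segment and only the standard library; `add_white_noise` is a name we made up. Real pipelines usually mix in recorded background noise rather than white noise, so treat this as an illustration of the principle only.

```python
import math
import random


def add_white_noise(samples, snr_db, seed=0):
    """Return `samples` mixed with Gaussian noise at the requested signal-to-noise ratio."""
    rng = random.Random(seed)
    signal_power = sum(s * s for s in samples) / len(samples)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise_std = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, noise_std) for s in samples]


# One second of a 220 Hz tone at 16 kHz as a stand-in for a speech segment.
clean = [0.5 * math.sin(2 * math.pi * 220 * n / 16000) for n in range(16000)]
noisy = add_white_noise(clean, snr_db=10)

# Measure the SNR that was actually obtained.
noise = [n - c for n, c in zip(noisy, clean)]
measured_snr = 10 * math.log10(sum(c * c for c in clean) / sum(x * x for x in noise))
print(round(measured_snr, 1))
```

Lower `snr_db` values give noisier audio; training on a range of SNRs is one way to expose the model to varied conditions.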

## Lexical diversity <a name="vocab"></a>

The Bible also employs a limited vocabulary, which is distinctly religious and different from words used in everyday life. If your language has texts from domains other than religious, you can attempt to synthesize these texts using a pretrained model like MMS or by training your own Text-to-Speech model using the data we've generated in this tutorial. Additionally, you can explore adversarial techniques to adapt your Whisper model to other diverse text sources (see this [paper](https://isl.anthropomatik.kit.edu/downloads/Text%20and%20Synthetic%20Data%20for%20Domain%20Adaptation%20in%20End-to-End%20Speech%20Recognition.pdf) for an example).

## Deploy your model <a name="deploy"></a>

For your model to be useful, you must publish it and create applications that Seereer speakers can use and interact with.

You can deploy your model for free on CPU instances at HuggingFace Spaces. An example of a space we've created for deploying Windanam can be found here: https://huggingface.co/spaces/cawoylel/MMS-ASR-Fula. However, this can be quite slow as the model will be hosted on CPUs rather than GPUs.
Alternatively, you can deploy your model using traditional compute providers like Google Cloud Platform, AWS, or Azure.
It all depends on the number of people who will use your model and the budget you have.