<a href="https://colab.research.google.com/github/cawoylel/nlp4all/blob/main/KallaamaSpeechProcessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---

<h1 align="center"><strong>
Kallaama project speech data processing</strong></h1>

<h4 align="center"><strong>Dioula Doucouré</strong></h4>

---

<p align="center">
  <img src="https://github.com/cawoylel/nlp4all/blob/main/asr/illustrations/cawoylel.png?raw=true" alt="Cawoylel" width=200>
<br>
</p>

This notebook is a template for preprocessing the **Kallaama project** speech corpora and building a Hugging Face 🤗 dataset for training ASR models.

[About the Kallaama Project](https://www.openslr.org/151/):

*Description taken from Kallaama Open SLR page*

> Kallaama project is funded by Lacuna Fund for 1 year, in 2023. The recordings are about agriculture. The recorded consist of farmers, agricultural advisers, and agri-food business managers. Type of recordings comprise interactive radio programmes, focus groups, voice messages, push messages and interviews. Therefore, spontaneous speech is prevailing. Quality of audio may vary depending on the type of programme.

- speech_dataset_wol.tar.gz: Wolof (ISO Code 639-2: wol) speech dataset contains 55 hours of transcribed speech, including almost 13 hours of validated content check by an expert. It also contains a XSAMPA lexicon (49,132 phonetised entries) and a text corpus (1,140,508 words).

- speech_dataset_fuc.tar.gz: Pulaar (ISO Code 639-2: fuc) speech dataset contains nearly 32 hours of transcribed speech, including around 11 hours of validated content check by an expert. It also contains a text corpus (742,024 words).

- speech_dataset_srr.tar.gz: Sereer (ISO Code 639-2: srr) speech dataset contains 38 hours of transcribed speech, including nearly 11 hours of validated content check by an expert.

You can visit their repo by clicking on the following link: [kallaama github repository](https://github.com/gauthelo/kallaama-speech-dataset). Don't forget to star their repo.

## **Get data**

**Download data**

We first download the data. Copy the archive link from the Kallaama project Open SLR page and paste it into the cell below. Here we use the Pulaar archive.

In [1]:
!wget https://www.openslr.org/resources/151/speech_dataset_fuc.tar.gz

--2024-05-07 16:41:44--  https://www.openslr.org/resources/151/speech_dataset_fuc.tar.gz
Resolving www.openslr.org (www.openslr.org)... 46.101.158.64
Connecting to www.openslr.org (www.openslr.org)|46.101.158.64|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://openslr.elda.org/resources/151/speech_dataset_fuc.tar.gz [following]
--2024-05-07 16:41:44--  https://openslr.elda.org/resources/151/speech_dataset_fuc.tar.gz
Resolving openslr.elda.org (openslr.elda.org)... 141.94.109.138, 2001:41d0:203:ad8a::
Connecting to openslr.elda.org (openslr.elda.org)|141.94.109.138|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3143491621 (2.9G) [application/x-gzip]
Saving to: ‘speech_dataset_fuc.tar.gz’


2024-05-07 16:43:43 (25.4 MB/s) - ‘speech_dataset_fuc.tar.gz’ saved [3143491621/3143491621]



Once the data is downloaded, extract the tar.gz archive as follows:

In [None]:
!tar xzvf speech_dataset_fuc.tar.gz
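
Outside a notebook, where the `!tar` shell escape is not available, the same step could be done with Python's standard `tarfile` module. The sketch below is only an illustration: `extract_archive` is a hypothetical helper name, not part of the Kallaama tooling, and it assumes the archive comes from a trusted source.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path: str, dest_dir: str = ".") -> list[str]:
    """Extract a .tar.gz archive into dest_dir and return the member names.

    Hypothetical helper: only use it on archives you trust.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, mode="r:gz") as archive:
        names = archive.getnames()
        archive.extractall(path=dest)
    return names
```

For this notebook the call would be `extract_archive("speech_dataset_fuc.tar.gz")`, which unpacks into the current directory just like the `tar` command above.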

**Read sample audio**

In [None]:
import soundfile as sf

In [None]:
# sf.read returns the samples and the sampling rate of the recording
audio_file = "/content/clean_dataset_ready4release/pulaar/speech_dataset/1/fuc_4110.wav"
audio, sr = sf.read(audio_file)

In [None]:
onset, offset = int(229.514  * sr), int(234.971 * sr)
utterance = audio[onset:offset]

In [None]:
from IPython.display import Audio
Audio(utterance, rate=sr)
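
The cells above cut an utterance out of the recording by multiplying times in seconds by the sampling rate. A minimal sketch of that conversion follows, with two additions that are assumptions rather than part of the notebook: the indices are rounded instead of truncated, and they are clamped to the length of the recording. `seconds_to_samples` is a hypothetical helper name.

```python
def seconds_to_samples(onset_s: float, offset_s: float, sr: int, n_samples: int) -> tuple[int, int]:
    """Convert a time span in seconds to sample indices inside [0, n_samples]."""
    if offset_s < onset_s:
        raise ValueError("offset must not be smaller than onset")
    # Rounding avoids losing a sample to floating-point error (an assumption;
    # the notebook itself truncates with int()).
    start = min(n_samples, max(0, round(onset_s * sr)))
    end = min(n_samples, max(0, round(offset_s * sr)))
    return start, end


# Example: a 0.5 s span in a 2.5 s recording sampled at 16 kHz
start, end = seconds_to_samples(1.5, 2.0, sr=16_000, n_samples=40_000)
print(start, end, (end - start) / 16_000)
```

The returned pair can be used directly as a slice, for example `audio[start:end]`.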

## **Process speech data**

We will first create a simple function to extract the transcription of an audio recording.

The downloaded data provides the transcriptions in two formats: `.trs`, and `.stm` in a subfolder called `stm_format`. We use the `.stm` files here.

In [3]:
import re
def extract_info_from_stem_file(file_path: str) -> list:
    """
    Extracts information from the STM file containing the transcription.

    Args:
        file_path (str): The path to the STM file to be processed.

    Returns:
        A list of dictionaries containing information extracted from the STM file.
        Each dictionary represents a segment of the STM file and contains the following keys:
            - 'onset' (float): The onset time of the segment, in seconds.
            - 'offset' (float): The offset time of the segment, in seconds.
            - 'transcription' (str): The transcription associated with the segment.
    """
    line_regex = re.compile(r"fuc_\d+(.*)$")
    onset_regex = re.compile(r"\d+\.\d+")
    offset_regex = re.compile(r"\d+\.\d+\s+(\d+\.\d+)")
    transcription_regex = re.compile(r"<o(?:,.+)?>(.*)$")

    extracted_info = []

    # The transcriptions contain non-ASCII characters, so read them as UTF-8
    with open(file_path, "r", encoding="utf-8") as file:
        for line in file:
            filename_match = line_regex.search(line)
            if filename_match:
                onset_match = onset_regex.search(line)
                offset_match = offset_regex.search(line)
                transcription_match = transcription_regex.search(line)

                # Skip lines that lack a timestamp pair or a `<o...>` tag
                if not (onset_match and offset_match and transcription_match):
                    continue

                onset = float(onset_match.group(0))
                offset = float(offset_match.group(1))
                transcription = transcription_match.group(1).strip()

                if not (
                    "ignore_time_segment" in transcription or "musique" in transcription
                ):
                    transcription = transcription.replace(":fra", "").replace(
                        ": fra", ""
                    )
                    extracted_info.append(
                        {
                            "onset": onset,
                            "offset": offset,
                            "transcription": transcription,
                        }
                    )

    return extracted_info

The output of the above function looks like this

In [4]:
# Test the function

file_path = "/content/clean_dataset_ready4release/pulaar/speech_dataset/1/stm_format/fuc_4110.stm"
result = extract_info_from_stem_file(file_path)

In [9]:
result[:3]

[{'onset': 0.0,
  'offset': 3.214,
  'transcription': 'maɗe puɗata a awat ɗo ladde nde foof jiiya guerteɗe mankka yaade'},
 {'onset': 3.214,
  'offset': 5.136,
  'transcription': 'pasque temps  kateɗe dose  o ɓuurti'},
 {'onset': 5.136,
  'offset': 9.463,
  'transcription': 'yaaha haɗi han yaaha wara jalluguel ngel wona jalluguel ngel wawata ɗumande'}]
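
To see how the three regular expressions in the function pick a line apart, the sketch below applies them to a single made-up line. The field layout of `sample_line` (file name, channel, speaker, onset, offset, tag, text) is an assumption for illustration only; check a real `.stm` file for the exact format.

```python
import re

# Hypothetical STM-style line; the field layout is an assumption.
sample_line = "fuc_4110 1 spk1 0.000 3.214 <o,f0,male> maɗe puɗata"

# Same patterns as in extract_info_from_stem_file
onset_regex = re.compile(r"\d+\.\d+")                 # first decimal number on the line
offset_regex = re.compile(r"\d+\.\d+\s+(\d+\.\d+)")   # decimal number right after the onset
transcription_regex = re.compile(r"<o(?:,.+)?>(.*)$")  # everything after the <o...> tag

sample_onset = float(onset_regex.search(sample_line).group(0))
sample_offset = float(offset_regex.search(sample_line).group(1))
sample_text = transcription_regex.search(sample_line).group(1).strip()

print(sample_onset, sample_offset, sample_text)
```

Note that the onset pattern simply takes the first decimal number it finds, so it relies on no earlier field containing one.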

The next function builds the transcription path from the audio path. Each `.stm` file is stored in a `stm_format` subfolder of the folder that holds its audio file.

In [10]:
import os

def get_transcription_path(path: str) -> str:
    """
    Get transcription path given the audio path
    """
    dir_path, file_name = os.path.split(path)
    # The .stm files live in a "stm_format" subfolder next to the audio
    transcription_folder = os.path.join(dir_path, "stm_format")
    transcription_path = os.path.join(transcription_folder, file_name.replace(".wav", ".stm"))
    return transcription_path
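
The same mapping can also be written with `pathlib`. This is an alternative sketch, not the notebook's own function: `transcription_path_for` is a hypothetical name, and `PurePosixPath` is used on the assumption that paths are POSIX-style, as they are on Colab.

```python
from pathlib import PurePosixPath


def transcription_path_for(audio_path: str) -> str:
    """Map <dir>/<name>.wav to <dir>/stm_format/<name>.stm."""
    audio = PurePosixPath(audio_path)
    # with_suffix only touches the final extension of the file name
    return str(audio.parent / "stm_format" / audio.with_suffix(".stm").name)


print(transcription_path_for(
    "/content/clean_dataset_ready4release/pulaar/speech_dataset/1/fuc_4110.wav"
))
```

One difference from `str.replace(".wav", ".stm")` is that `with_suffix` changes only the trailing extension, so a `.wav` appearing elsewhere in the file name would be left alone.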

Get the audio and transcription

In [None]:
from pathlib import Path
import uuid
import glob
import os
import soundfile as sf
import pandas as pd
import logging

audio_folder = Path("Audio")
audio_folder.mkdir(exist_ok=True, parents=True)

search_path = os.path.join("/content/clean_dataset_ready4release", "**/*." + "wav")

data = {"filename": [], "transcription":[]}

for fname in glob.iglob(search_path, recursive=True):
    file_path = os.path.realpath(fname)
    try:
        transcription_path = get_transcription_path(file_path)
        audio, sr = sf.read(file_path)
        extracted_info = extract_info_from_stem_file(transcription_path)
        for dictionary in extracted_info:
            # Give each utterance a unique file name
            filename = f"{uuid.uuid1()}.wav"
            # Convert the segment boundaries from seconds to sample indices
            onset, offset = int(dictionary["onset"] * sr), int(dictionary["offset"] * sr)
            utterance = audio[onset:offset]
            output_file = audio_folder / filename
            sf.write(output_file, utterance, sr)

            data["filename"].append(output_file)
            data["transcription"].append(dictionary["transcription"])

    except Exception as e:
        logging.error(f"An exception occurred for the following file: {file_path}. The error is: {e}")
        continue
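
Segments that are extremely short or very long can be awkward for ASR training, so it may be worth filtering on duration before writing the audio. The sketch below works on the list of dictionaries returned by `extract_info_from_stem_file`; the helper name `filter_by_duration` and the thresholds of 0.5 and 30 seconds are assumptions, not values taken from the Kallaama project.

```python
def filter_by_duration(segments, min_s=0.5, max_s=30.0):
    """Keep segments whose duration in seconds lies within [min_s, max_s].

    The default thresholds are illustrative assumptions.
    """
    kept = []
    for segment in segments:
        duration = segment["offset"] - segment["onset"]
        if min_s <= duration <= max_s:
            kept.append(segment)
    return kept


example_segments = [
    {"onset": 0.0, "offset": 3.214, "transcription": "kept"},
    {"onset": 3.214, "offset": 3.3, "transcription": "too short"},
    {"onset": 10.0, "offset": 55.0, "transcription": "too long"},
]
print([s["transcription"] for s in filter_by_duration(example_segments)])
```

In the loop above, this would mean iterating over `filter_by_duration(extracted_info)` instead of `extracted_info`.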

## Create dataset

Once each audio segment is aligned with its transcription, you can build a Hugging Face dataset and use it to train your ASR model. If you train with fairseq, refer to its documentation for data preparation. You can also find all the scripts in the [Cawoylel github repository](https://github.com/cawoylel/windanam/tree/main).

In [14]:
df_audio = pd.DataFrame.from_dict(data)
df_audio['filename'] = df_audio['filename'].apply(lambda x: str(x))

In [15]:
df_audio.head()

| | filename | transcription |
|---|---|---|
| 0 | Audio/c82660e2-0c94-11ef-8614-0242ac1c000c.wav | banggue produit garnoɗo rawade o aprostar o o... |
| 1 | Audio/c8287c56-0c94-11ef-8614-0242ac1c000c.wav | bonni gueese amen foof produit aprostar garno... |
| 2 | Audio/c82a81ae-0c94-11ef-8614-0242ac1c000c.wav | taw basalal ngal aɗa anda kayi an deemowo o ha... |
| 3 | Audio/c82c23d8-0c94-11ef-8614-0242ac1c000c.wav | on saaha noon minen kam walli min e produit k... |
| 4 | Audio/c82d8d18-0c94-11ef-8614-0242ac1c000c.wav | min kam komi almudo makko ko biyanmi komi almu... |


Install and import the `datasets` library

In [16]:
%%capture
!pip install datasets

In [17]:
import datasets

Dataset generator

In [18]:
def generator(dataset):
    for dico in dataset.itertuples():
        _, path, transcription = dico
        yield({
            "audio": path,
            "transcription": transcription
        })

Dataset features

In [19]:
features = datasets.Features(
      {
          "audio": datasets.features.Audio(sampling_rate=16_000),
          "transcription": datasets.Value("string")
      })

In [20]:
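dataset_dict = {
    # `features` already declares the 16 kHz audio column, so no extra cast is
    # needed; casting to `datasets.Audio()` would drop the sampling rate setting
    "train": datasets.Dataset.from_generator(
        generator, features=features, gen_kwargs={"dataset": df_audio}
    )
}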


Create dataset

In [21]:
kallaama_fula_dataset = datasets.DatasetDict(dataset_dict)

In [22]:
kallaama_fula_dataset["train"][1]

{'audio': {'path': 'Audio/c8287c56-0c94-11ef-8614-0242ac1c000c.wav',
  'array': array([-0.00216675, -0.00036621,  0.00057983, ..., -0.01419067,
         -0.01611328, -0.02041626]),
  'sampling_rate': 16000},
 'transcription': 'bonni gueese amen foof produit  aprostar garnoɗo ɗo rawane o noon on kam vraiment min jiiye makko yitere par  ce  que  kala nde gaw ɗa ƴettuɗa badɗamo e awdi ma awdi ndi hay sinno hen abbere nde memani leydi ni nde memani faandu woni dow ene wawi wonde ɗon hakke lewru taw ala ko memata ɗum kadi yeehi hankkadi ha wasagogo mawni ha gawri ndi fuɗɗi ha wasagogo mawnoyi kadi ne wawi jogade ɗum fodde ɗon hedde lewru faawndu'}

In [23]:
from IPython.display import Audio
Audio(kallaama_fula_dataset["train"][1]['audio']['array'], rate=16000)
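
The sample transcription above still contains runs of consecutive spaces (for example `par  ce  que`). If you want single spaces in the training text, a small normalization step is enough. This is an optional sketch; `normalize_whitespace` is a hypothetical helper name and is not part of the original pipeline.

```python
def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace into single spaces and trim both ends."""
    return " ".join(text.split())


print(normalize_whitespace("pasque temps  kateɗe dose  o ɓuurti"))
```

It could be applied to the `transcription` column before the dataset is built, for example with `df_audio["transcription"].apply(normalize_whitespace)`.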

You are now ready to train your model. We have written a step-by-step guide that walks you through the whole ASR model development process: [asr-tutorial](https://github.com/cawoylel/nlp4all/blob/main/asr/src/asr_tutorial.ipynb). With the Kallaama dataset, you can skip the data collection part of the tutorial and start directly with fine-tuning.
