<a href="https://colab.research.google.com/github/cahya-wirawan/luganda-asr/blob/main/Luganda_ASR.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Automatic Speech Recognition  (ASR) for Luganda
This is the Python Notebook to describe how to train ASR model for Luganda, evaluate it and create submission file for the [Mozilla Luganda Automatic Speech Recognition](https://zindi.africa/competitions/mozilla-luganda-automatic-speech-recognition/).

We use several python scripts to do the training/fine-tuning, evaluation, submission file creation:
- run_finetuning.py
- run_evaluation.py
- run_submission.py 

## Check the availability of the GPU

In [1]:
!nvidia-smi
!free -h

Thu Jan 20 13:04:10 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.46       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   51C    P8    10W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## Preparation of the Luganda ASR  source code 

Clone the source code from https://github.com/cahya-wirawan/luganda-asr.git

In [2]:
#!rm -rf /content/luganda-asr
!git clone https://github.com/cahya-wirawan/luganda-asr.git

Cloning into 'luganda-asr'...
remote: Enumerating objects: 97, done.[K
remote: Counting objects: 100% (97/97), done.[K
remote: Compressing objects: 100% (78/78), done.[K
remote: Total 97 (delta 48), reused 52 (delta 16), pack-reused 0[K
Unpacking objects: 100% (97/97), done.


In [3]:
%cd /content/luganda-asr
!git pull

/content/luganda-asr
Already up to date.


Install the requirements

In [4]:
%%capture

!pip install -r requirements.txt

## Data Preparation

Since we can't provide the Mozilla Common Voice dataset directly here due to its License, we ask you to download it from https://commonvoice.mozilla.org/lg/datasets directly and the same with test dataset from Zindi which can be downloaded from https://zindi.africa/competitions/mozilla-luganda-automatic-speech-recognition/data

Please save all of it in your Google Drive directory, which we will mount it in this notebook. The directory structure in Google Drive should look as following after the mounting:

/content/drive/MyDrive/Luganda
- cv-corpus-7.0-2021-07-21-lg.tar.gz
- SampleSubmission.csv
- test_audio.zip
- Test.csv


In [5]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [6]:
luganda_data_dir = "/content/drive/MyDrive/Luganda"
luganda_cv_corpus = f"{luganda_data_dir}/cv-corpus-7.0-2021-07-21-lg.tar.gz"
luganda_test_audio_file = f"{luganda_data_dir}/test_audio.zip"
luganda_test_file = f"{luganda_data_dir}/Test.csv"

!ls -l $luganda_data_dir

total 1914072
-rw------- 1 root root 1722860000 Jan 18 18:05 cv-corpus-7.0-2021-07-21-lg.tar.gz
-rw------- 1 root root      91886 Jan 18 16:41 SampleSubmission.csv
-rw------- 1 root root  236764554 Jan 18 16:39 test_audio.zip
-rw------- 1 root root     292195 Jan 18 16:29 Test.csv


In [7]:
%%capture

# Uncompressing the test_audio.zip and cv-corpus-7.0-2021-07-21-lg.tar.gz to the 
# directory /content/data.
# It takes around 40s

%cd /content/

## clean up the directory /content/data/
!rm -rf /content/data

!mkdir -p /content/data/zindi
%cd /content/data/zindi
!unzip -o $luganda_test_audio_file && cp $luganda_test_file .
%cd /content/data
!tar xvzf $luganda_cv_corpus 

In [8]:
!echo "Content of /content/data:" && ls -l /content/data
!echo
!echo "Content of /content/data/zindi:" && ls -l /content/data/zindi

Content of /content/data:
total 8
drwxr-xr-x 3 root root 4096 Jan 20 13:04 cv-corpus-7.0-2021-07-21
drwxr-xr-x 4 root root 4096 Jan 20 13:04 zindi

Content of /content/data/zindi:
total 556
drwxr-xr-x 3 root root   4096 Jan 20 13:04 __MACOSX
drwxr-xr-x 2 root root 266240 Oct 22 06:14 test_audio
-rw------- 1 root root 292195 Jan 20 13:04 Test.csv


## Model Training

We have fine-tuned the Facebook Wav2Vec2 model with the Luganda Common Voice dataset and stored it in https://huggingface.co/indonesian-nlp/wav2vec2-luganda. The model has been trained for 200 epochs in 4 hours and 30 minutes using 8 GPUs.

We use the script run_finetuning.py to train the model. 
Due to resource and time limitation in Google Colab,
we skip the model training here, but we run the evaluation and creation of the submission file. However, feel free to run following command for testing purpose:

```!python run_finetuning.py finetuning_common_voice_1epoch.json```

It will run the training for only one epoch which will take around 2 hours and 40 minutes in Google Colab.

### Usage
For training using single GPU
``` 
% python run_finetuning.py <argument json file>
``` 
For training using multi GPUs, for example 8 GPUs.
``` 
% python -m torch.distributed.launch --nproc_per_node=8 run_finetuning.py <argument json file>
``` 
Our model "indonesian-nlp/wav2vec2-luganda" has been trained using following command:
``` 
% python -m torch.distributed.launch --nproc_per_node=8 run_finetuning.py finetuning_common_voice.json
```

List of possible arguments:
``` 
% python run_finetuning.py -h

```

In [9]:
%%time

# We disable here the training to save times, but feel free to run it by 
# uncommenting the command "!python run_finetuning.py finetuning_common_voice_1epoch.json"
%cd /content/luganda-asr

!mkdir -p output
#!python run_finetuning.py finetuning_common_voice_1epoch.json

/content/luganda-asr
CPU times: user 4.46 ms, sys: 7.3 ms, total: 11.8 ms
Wall time: 110 ms


## N-grams Language Model

We use the n-grams language model [KenLM](https://github.com/kpu/kenlm) to reduce further the Word Error Rate (WER).

First, we install the necessary linux packages to build the KenLM library

In [10]:
%%capture

!sudo apt-get update
!sudo apt-get install build-essential libboost-all-dev cmake zlib1g-dev libbz2-dev liblzma-dev
!sudo apt-get install libboost-all-dev libeigen3-dev

Then we clone KenLM source and build it. The executables will be built in the directory "kenlm/build/bin"

In [11]:
%%capture

%cd /content/
!if [ ! -d kenlm ]; then git clone https://github.com/kpu/kenlm; fi
!cd kenlm && mkdir build && cd build && cmake .. && make -j 4

Now we can prepare the text from the CV corpus and build the KenLM binary "5gram.bin"

In [37]:
%cd /content/luganda-asr/

!python run_lm_preparation.py -n dataset/common_voice -c lg -d /content/data/cv-corpus-7.0-2021-07-21 -o 5gram.txt
!../kenlm/build/bin/lmplz -o 6 < "5gram.txt" > "5gram.arpa"
!../kenlm/build/bin/build_binary "5gram.arpa" "5gram.bin"

/content/luganda-asr
Using custom data configuration lg-b5dd5bec651497f8
Reusing dataset common_voice (/root/.cache/huggingface/datasets/common_voice/lg-b5dd5bec651497f8/7.0.0/d2815414a38db497e900b5788f75958a15449c23acb8371ded02d3ed746d6206)
=== 1/5 Counting and sorting n-grams ===
Reading /content/luganda-asr/5gram.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
tcmalloc: large alloc 3449143296 bytes == 0x563e9278c000 @  0x7f80b19cf1e7 0x563e91b517e2 0x563e91aec4fe 0x563e91acb2eb 0x563e91ab7066 0x7f80afb68bf7 0x563e91ab8baa
tcmalloc: large alloc 18395406336 bytes == 0x563f600e8000 @  0x7f80b19cf1e7 0x563e91b517e2 0x563e91b4080a 0x563e91b41248 0x563e91acb308 0x563e91ab7066 0x7f80afb68bf7 0x563e91ab8baa
****************************************************************************************************
Unigram tokens 315562 types 29210
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:350520 2:1345806592 3:2523387

## Evaluation
To test our fine-tuned model, we evaluate it with test split of Mozilla Common Voice dataset version 6.1 and 7.0. The evaluation also uses the Ken Language Model (KenLM) 5gram.bin we created from the text of Common Voice 7.0.


### Usage
Following is the command to evaluate test split of Mozilla Common Voice dataset using our model "indonesian-nlp/wav2vec2-luganda" and using the KenLM:
``` 
% python run_evaluation.py -m indonesian-nlp/wav2vec2-luganda -n common_voice -c lg -k 5gram.bin --test_pct 100
```
List of possible arguments:
``` 
% python run_evaluation.py -h
usage: run_evaluation.py [-h] -m MODEL_NAME -n NAME -c CONFIG_NAME [-d DATA_DIR] [-b BATCH_SIZE] [-k KENLM] [--num_workers NUM_WORKERS] [-w BEAM_WIDTH] [--test_pct TEST_PCT] [--cpu]

optional arguments:
  -h, --help            show this help message and exit
  -m MODEL_NAME, --model_name MODEL_NAME
                        The wav2vec2 model name
  -n NAME, --name NAME  The name of dataset
  -c CONFIG_NAME, --config_name CONFIG_NAME
                        The config name of the dataset
  -d DATA_DIR, --data_dir DATA_DIR
                        The directory contains the dataset
  -b BATCH_SIZE, --batch_size BATCH_SIZE
                        Batch size
  -k KENLM, --kenlm KENLM
                        Path to KenLM model
  --num_workers NUM_WORKERS
                        KenLM's number of workers
  -w BEAM_WIDTH, --beam_width BEAM_WIDTH
                        KenLM's beam width
  --test_pct TEST_PCT   Percentage of the test set
  --cpu                 Force to use CPU

```

### Evaluation on Common Voice version 6.1

We evaluate the ASR model "indonesian-nlp/wav2vec2-luganda" on the test split of Common Voice Dataset for Luganda (version 6.1). It takes around 6 minutes. We get here the Word Error Rate (WER) of **7.37%**.

In [13]:
%%time

%cd /content/luganda-asr
!python run_evaluation.py -m indonesian-nlp/wav2vec2-luganda -n common_voice -c lg -k 5gram.bin --test_pct 100

/content/luganda-asr
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Unigrams not provided and cannot be automatically determined from LM file (only arpa format). Decoding accuracy might be reduced.
No known unigrams provided, decoding results might be a lot worse.
Reusing dataset common_voice (/root/.cache/huggingface/datasets/common_voice/lg/6.1.0/5693bfc0feeade582a78c2fb250bc88f52bd86f0a7f1bb22bfee67e715de30fd)
100% 584/584 [00:06<00:00, 83.93ex/s]
100% 584/584 [01:08<00:00,  8.49ex/s]
tcmalloc: large alloc 1253105664 bytes == 0x55e61f20c000 @  0x7f7ed9857615 0x55e4b392c4cc 0x55e4b3a0c47a 0x55e4b392f2ed 0x55e4b392f240 0x55e4b39a2973 0x55e4b3930afa 0x55e4b39a2d00 0x55e4b3930afa 0x55e4b399e915 0x55e4b399d9ee 0x55e4b3930bda 0x55e4b399e915 0x55e4b3930afa 0x55e4b399e915 0x55e4b399d9ee 0x55e4b3930bda 0x55e4b399e915 0x55e4b399d9ee 0x55e4b393148c 0x55e4b3931698 0x55e4b399ffe4 0x55e4b399d9ee 0x55e4b3930bda 0x55e4b399e915 0

### Evaluation on Common Voice version 7.0

We evaluate the ASR model "indonesian-nlp/wav2vec2-luganda" on the test split of Common Voice Dataset for Luganda (version 7.0). It takes around 25 minutes. We get here the Word Error Rate (WER) of **7.53%**.

In [14]:
%%time

%cd /content/luganda-asr
!python run_evaluation.py -m indonesian-nlp/wav2vec2-luganda -n ./dataset/common_voice -c lg --data_dir /content/data/cv-corpus-7.0-2021-07-21 -k 5gram.bin --test_pct 100

/content/luganda-asr
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Unigrams not provided and cannot be automatically determined from LM file (only arpa format). Decoding accuracy might be reduced.
No known unigrams provided, decoding results might be a lot worse.
Using custom data configuration lg-b5dd5bec651497f8
Reusing dataset common_voice (/root/.cache/huggingface/datasets/common_voice/lg-b5dd5bec651497f8/7.0.0/d2815414a38db497e900b5788f75958a15449c23acb8371ded02d3ed746d6206)
100% 4276/4276 [00:39<00:00, 109.16ex/s]
100% 4276/4276 [01:41<00:00, 42.28ex/s]
tcmalloc: large alloc 1253105664 bytes == 0x55d96d552000 @  0x7f20c5048615 0x55d820d7a4cc 0x55d820e5a47a 0x55d820d7d2ed 0x55d820d7d240 0x55d820df0973 0x55d820d7eafa 0x55d820df0d00 0x55d820d7eafa 0x55d820dec915 0x55d820deb9ee 0x55d820d7ebda 0x55d820dec915 0x55d820d7eafa 0x55d820dec915 0x55d820deb9ee 0x55d820d7ebda 0x55d820dec915 0x55d820deb9ee 0x55d820d7f48c 0x

## Submission

We will create the submission file "submissions/luganda-asr.csv"
using the ASR model "indonesian-nlp/wav2vec2-luganda" on the test set
provided by Zindi.
It takes around 40 minutes

### Usage
```
$ python run_submission.py -h
usage: run_submission.py [-h] -m MODEL_NAME -d DATA_DIR -o OUTPUT_FILE [-b BATCH_SIZE] [-k KENLM] [-n NUM_WORKERS] [-w BEAM_WIDTH] [--test_pct TEST_PCT]

optional arguments:
  -h, --help            show this help message and exit
  -m MODEL_NAME, --model_name MODEL_NAME
                        The wav2vec2 model name
  -d DATA_DIR, --data_dir DATA_DIR
                        The directory contains the Zindi dataset (Train.csv, Test.csv and validated_dataset)
  -o OUTPUT_FILE, --output_file OUTPUT_FILE
                        The file name of the prediction result
  -b BATCH_SIZE, --batch_size BATCH_SIZE
                        Batch size
  -k KENLM, --kenlm KENLM
                        Path to KenLM model
  -n NUM_WORKERS, --num_workers NUM_WORKERS
                        KenLM's number of workers
  -w BEAM_WIDTH, --beam_width BEAM_WIDTH
                        KenLM's beam width
  --test_pct TEST_PCT   Percentage of the test set
```

In [15]:
%%time

%cd /content/luganda-asr
!mkdir -p submissions
# For testing purpose, we can create submission file for only 1% of test dataset
# which takes around 44s.
#!python run_submission.py -o submissions/luganda-asr.csv -m indonesian-nlp/wav2vec2-luganda --data_dir /content/data/zindi -k 5gram.bin --test_pct 1

# Following command creates submission file for 100% of test dataset which takes
# around 40 minutes.
!python run_submission.py -o submissions/luganda-asr.csv -m indonesian-nlp/wav2vec2-luganda --data_dir /content/data/zindi -k 5gram.bin



/content/luganda-asr
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Unigrams not provided and cannot be automatically determined from LM file (only arpa format). Decoding accuracy might be reduced.
No known unigrams provided, decoding results might be a lot worse.
Using custom data configuration lg-e6ac6d8c877e86bb
Downloading and preparing dataset zindi/lg to /root/.cache/huggingface/datasets/zindi/lg-e6ac6d8c877e86bb/1.0.0/48535cc6e254ea4fe8bd529ebd18452c88df0776030e5c69a33ea03b6bcf7436...
manual_dir: /content/data/zindi
datadir: /content/data/zindi
Dataset zindi downloaded and prepared to /root/.cache/huggingface/datasets/zindi/lg-e6ac6d8c877e86bb/1.0.0/48535cc6e254ea4fe8bd529ebd18452c88df0776030e5c69a33ea03b6bcf7436. Subsequent calls will reuse this data.
100% 7067/7067 [01:02<00:00, 113.82ex/s]
  return (input_length - kernel_size) // stride + 1
100% 442/442 [20:25<00:00,  2.77s/it]

The prediction result has b