<a href="https://colab.research.google.com/github/cahya-wirawan/luganda-asr/blob/main/Luganda_ASR.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Automatic Speech Recognition (ASR) for Luganda
This notebook describes how to train an ASR model for Luganda, evaluate it, and create a submission file for the [Mozilla Luganda Automatic Speech Recognition](https://zindi.africa/competitions/mozilla-luganda-automatic-speech-recognition/) competition.

We use three Python scripts for training/fine-tuning, evaluation, and submission file creation:
- run_finetuning.py
- run_evaluation.py
- run_submission.py 

## Check the availability of the GPU

In [1]:
!nvidia-smi
!free -h

Wed Jan 19 09:47:05 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.46       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   41C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
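
The same check can also be done from Python, which is handy when a later cell should fail early on a CPU-only runtime. The following is a minimal sketch, assuming only that `nvidia-smi` is on the PATH whenever a GPU is attached; `query_gpus` is a hypothetical helper name, not part of the repository.

```python
import shutil
import subprocess


def query_gpus():
    """Return one 'name, memory' string per visible NVIDIA GPU, or [] if none."""
    # nvidia-smi is absent on CPU-only runtimes, so look for it first.
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    # A non-zero exit code usually means the driver is not usable.
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpus = query_gpus()
print(gpus if gpus else "No GPU detected; switch the Colab runtime type to GPU.")
```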

## Preparation of the Luganda ASR source code

Clone the source code from https://github.com/cahya-wirawan/luganda-asr.git

In [3]:
#!rm -rf /content/luganda-asr
!git clone https://github.com/cahya-wirawan/luganda-asr.git

Cloning into 'luganda-asr'...
remote: Enumerating objects: 67, done.
remote: Counting objects: 100% (67/67), done.
remote: Compressing objects: 100% (52/52), done.
remote: Total 67 (delta 28), reused 43 (delta 12), pack-reused 0
Unpacking objects: 100% (67/67), done.


In [4]:
%cd /content/luganda-asr
!git pull

/content/luganda-asr
Already up to date.


Install the requirements

In [5]:
%%capture

!pip install -r requirements.txt

## Data Preparation

We load the training and test data from a Google Drive directory with the following file structure:

/content/drive/MyDrive/Luganda
- cv-corpus-7.0-2021-07-21-lg.tar.gz
- SampleSubmission.csv
- test_audio.zip
- Test.csv


In [4]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [6]:
luganda_data_dir = "/content/drive/MyDrive/Luganda"
luganda_cv_corpus = f"{luganda_data_dir}/cv-corpus-7.0-2021-07-21-lg.tar.gz"
luganda_test_audio_file = f"{luganda_data_dir}/test_audio.zip"
luganda_test_file = f"{luganda_data_dir}/Test.csv"

!ls -l $luganda_data_dir

total 1914072
-rw------- 1 root root 1722860000 Jan 18 18:05 cv-corpus-7.0-2021-07-21-lg.tar.gz
-rw------- 1 root root      91886 Jan 18 16:41 SampleSubmission.csv
-rw------- 1 root root  236764554 Jan 18 16:39 test_audio.zip
-rw------- 1 root root     292195 Jan 18 16:29 Test.csv
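
Before starting the long-running steps, it can help to confirm that all four files are really present. The sketch below is one way to do that with the standard library; the file names come from the listing above, while `missing_files` and `EXPECTED_FILES` are hypothetical names introduced here.

```python
from pathlib import Path

# File names taken from the directory listing above.
EXPECTED_FILES = [
    "cv-corpus-7.0-2021-07-21-lg.tar.gz",
    "SampleSubmission.csv",
    "test_audio.zip",
    "Test.csv",
]


def missing_files(data_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in data_dir."""
    root = Path(data_dir)
    return [name for name in expected if not (root / name).is_file()]
```

Calling `missing_files(luganda_data_dir)` should return an empty list when the Drive folder is complete.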


In [7]:

%%capture

# Uncompress test_audio.zip and cv-corpus-7.0-2021-07-21-lg.tar.gz into the
# directory /content/data.
# This takes around 40 s.

%cd /content/

# Clean up the directory /content/data/
!rm -rf /content/data

!mkdir -p /content/data/zindi
%cd /content/data/zindi
!unzip -o $luganda_test_audio_file && cp $luganda_test_file .
%cd /content/data
!tar xvzf $luganda_cv_corpus 
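
Outside Colab, where the `!unzip` and `!tar` shell magics may not be available, the extraction can be approximated in plain Python. This is a sketch under that assumption, not a replacement for the cell above: `unpack` is a hypothetical helper, and it neither deletes /content/data first nor copies Test.csv.

```python
import shutil
from pathlib import Path


def unpack(archive, dest_dir):
    """Unpack one archive (.zip, .tar.gz, ...) into dest_dir and return that path."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # shutil picks the unpacker from the file extension.
    shutil.unpack_archive(str(archive), str(dest))
    return dest
```

With the variables defined earlier, the two calls would be `unpack(luganda_test_audio_file, "/content/data/zindi")` and `unpack(luganda_cv_corpus, "/content/data")`.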

In [7]:
!echo "Content of /content/data:" && ls -l /content/data
!echo
!echo "Content of /content/data/zindi:" && ls -l /content/data/zindi

Content of /content/data:
total 8
drwxr-xr-x 3 root root 4096 Jan 19 08:11 cv-corpus-7.0-2021-07-21
drwxr-xr-x 4 root root 4096 Jan 19 08:11 zindi

Content of /content/data/zindi:
total 556
drwxr-xr-x 3 root root   4096 Jan 19 08:11 __MACOSX
drwxr-xr-x 2 root root 266240 Oct 22 06:14 test_audio
-rw------- 1 root root 292195 Jan 19 08:11 Test.csv


## Model Training

We have fine-tuned the Facebook Wav2Vec2 model on the Luganda Common Voice dataset and published it at https://huggingface.co/indonesian-nlp/wav2vec2-luganda. The model was trained for 200 epochs in 4 hours and 30 minutes using 8 GPUs.

We use the script run_finetuning.py to train the model.
Due to the resource and time limits in Google Colab,
we skip the model training here and only run the evaluation and the creation of the submission file. However, feel free to run the following command for testing purposes:

```!python run_finetuning.py finetuning_common_voice_1epoch.json```

This runs the training for only one epoch, which takes around 2 hours and 40 minutes in Google Colab.

### Usage
To train on a single GPU:
``` 
% python run_finetuning.py <argument json file>
``` 
To train on multiple GPUs, for example 8:
``` 
% python -m torch.distributed.launch --nproc_per_node=8 run_finetuning.py <argument json file>
``` 
Our model "indonesian-nlp/wav2vec2-luganda" was trained with the following command:
``` 
% python -m torch.distributed.launch --nproc_per_node=8 run_finetuning.py finetuning_common_voice.json
```
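
The single-GPU and multi-GPU invocations above differ only in the launcher prefix. If the training is started from a Python script rather than a shell, that difference can be captured in a small helper. This is an illustrative sketch: `finetuning_command` is a hypothetical name, and it only assembles the commands already shown above.

```python
import sys


def finetuning_command(config_file, num_gpus=1):
    """Build the argument list for run_finetuning.py on one or several GPUs."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    command = [sys.executable]
    if num_gpus > 1:
        # Same launcher and flag as in the usage section above.
        command += ["-m", "torch.distributed.launch", f"--nproc_per_node={num_gpus}"]
    return command + ["run_finetuning.py", config_file]
```

The result can be passed to `subprocess.run(finetuning_command("finetuning_common_voice.json", 8), check=True)` from the repository directory.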

List of possible arguments:
``` 
% python run_finetuning.py -h

```

In [8]:
%%time

# Training is disabled here to save time, but feel free to run it by
# uncommenting the command "!python run_finetuning.py finetuning_common_voice_1epoch.json"
%cd /content/luganda-asr

!mkdir -p output
#!python run_finetuning.py finetuning_common_voice_1epoch.json

Streaming output truncated to the last 5000 lines.
 15% 320/2132 [00:38<03:55,  7.70it/s]
 ...
 16% 345/2132 [00:41<03:14,  9.19it/s]

## Evaluation
To test our fine-tuned model, we evaluate it on the test split of the Mozilla Common Voice dataset, versions 6.1 and 7.0. The evaluation also uses a 5-gram KenLM language model (5gram.bin) that we created from the text of Common Voice 7.0.
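
The name 5gram.bin refers to the model order: a 5-gram model conditions each word on the four words before it. As a rough illustration of what that means (this is not KenLM's API, and `word_ngrams` is a hypothetical helper), the sketch below lists the word windows of a given order in a token sequence.

```python
def word_ngrams(tokens, n):
    """Return all contiguous n-word windows of tokens as tuples."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence shorter than n yields no windows.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

For a six-word sentence, `word_ngrams(sentence.split(), 5)` yields two windows; KenLM estimates smoothed probabilities over such windows (and the shorter ones) from the training text.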


### Usage
The following command evaluates the test split of the Mozilla Common Voice dataset using our model "indonesian-nlp/wav2vec2-luganda" and the KenLM model:
``` 
% python run_evaluation.py -m indonesian-nlp/wav2vec2-luganda -n common_voice -c lg -k 5gram.bin --test_pct 100
```
List of possible arguments:
``` 
% python run_evaluation.py -h
usage: run_evaluation.py [-h] -m MODEL_NAME -n NAME -c CONFIG_NAME [-d DATA_DIR] [-b BATCH_SIZE] [-k KENLM] [--num_workers NUM_WORKERS] [-w BEAM_WIDTH] [--test_pct TEST_PCT] [--cpu]

optional arguments:
  -h, --help            show this help message and exit
  -m MODEL_NAME, --model_name MODEL_NAME
                        The wav2vec2 model name
  -n NAME, --name NAME  The name of dataset
  -c CONFIG_NAME, --config_name CONFIG_NAME
                        The config name of the dataset
  -d DATA_DIR, --data_dir DATA_DIR
                        The directory contains the dataset
  -b BATCH_SIZE, --batch_size BATCH_SIZE
                        Batch size
  -k KENLM, --kenlm KENLM
                        Path to KenLM model
  --num_workers NUM_WORKERS
                        KenLM's number of workers
  -w BEAM_WIDTH, --beam_width BEAM_WIDTH
                        KenLM's beam width
  --test_pct TEST_PCT   Percentage of the test set
  --cpu                 Force to use CPU

```

### Evaluation on Common Voice version 6.1

We evaluate the ASR model "indonesian-nlp/wav2vec2-luganda" on the test split of the Common Voice dataset for Luganda (version 6.1). It takes around 9 minutes. We obtain a Word Error Rate (WER) of **7.37%**.
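
For reference, WER counts the word-level substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. The sketch below is a minimal pure-Python version for a single sentence pair. The function name `word_error_rate` is ours, and this is not necessarily how run_evaluation.py computes its score: text normalization and corpus-level aggregation may differ.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] = edit distance between the reference prefix seen so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)
```

For example, a three-word reference with one wrong word in the hypothesis gives a WER of 1/3.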

In [None]:
%%time

!python run_evaluation.py -m indonesian-nlp/wav2vec2-luganda -n common_voice -c lg -k 5gram.bin --test_pct 100

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Downloading: 100% 1.18G/1.18G [00:32<00:00, 38.5MB/s]
Unigrams not provided and cannot be automatically determined from LM file (only arpa format). Decoding accuracy might be reduced.
No known unigrams provided, decoding results might be a lot worse.
Reusing dataset common_voice (/root/.cache/huggingface/datasets/common_voice/lg/6.1.0/5693bfc0feeade582a78c2fb250bc88f52bd86f0a7f1bb22bfee67e715de30fd)
100% 584/584 [00:08<00:00, 70.98ex/s]
100% 584/584 [01:31<00:00,  6.37ex/s]
Downloading: 4.49kB [00:00, 773kB/s]        
tcmalloc: large alloc 1253105664 bytes == 0x55aea7cd0000 @  0x7f944d519615 0x55ad576814cc 0x55ad5776147a 0x55ad576842ed 0x55ad57684240 0x55ad576f7973 0x55ad57685afa 0x55ad576f7d00 0x55ad57685afa 0x55ad576f3915 0x55ad576f29ee 0x55ad57685bda 0x55ad576f3915 0x55ad57685afa 0x55ad576f3915 0x55ad576f29ee 0x55ad57685bda 0x55ad576f3915 0x55ad576f29ee 0x55ad5768648

### Evaluation on Common Voice version 7.0

We evaluate the ASR model "indonesian-nlp/wav2vec2-luganda" on the test split of the Common Voice dataset for Luganda (version 7.0). It takes around 25 minutes. We obtain a Word Error Rate (WER) of **7.53%**.

In [None]:
%%time

!python run_evaluation.py -m indonesian-nlp/wav2vec2-luganda -n ./dataset/common_voice -c lg --data_dir /content/data/cv-corpus-7.0-2021-07-21 -k 5gram.bin --test_pct 100

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Unigrams not provided and cannot be automatically determined from LM file (only arpa format). Decoding accuracy might be reduced.
No known unigrams provided, decoding results might be a lot worse.
Using custom data configuration lg-b5dd5bec651497f8
Reusing dataset common_voice (/root/.cache/huggingface/datasets/common_voice/lg-b5dd5bec651497f8/7.0.0/d2815414a38db497e900b5788f75958a15449c23acb8371ded02d3ed746d6206)
100% 4276/4276 [00:48<00:00, 88.69ex/s] 
100% 4276/4276 [02:15<00:00, 31.65ex/s]
tcmalloc: large alloc 1253105664 bytes == 0x5590e86d8000 @  0x7f08bad88615 0x558fbd8524cc 0x558fbd93247a 0x558fbd8552ed 0x558fbd855240 0x558fbd8c8973 0x558fbd856afa 0x558fbd8c8d00 0x558fbd856afa 0x558fbd8c4915 0x558fbd8c39ee 0x558fbd856bda 0x558fbd8c4915 0x558fbd856afa 0x558fbd8c4915 0x558fbd8c39ee 0x558fbd856bda 0x558fbd8c4915 0x558fbd8c39ee 0x558fbd85748c 0x558fbd857698 0x558fbd

## Submission

We create the submission file "submissions/luganda-asr.csv"
using the ASR model "indonesian-nlp/wav2vec2-luganda" on the test set
provided by Zindi.
It takes around 40 minutes.

### Usage
```
$ python run_submission.py -h
usage: run_submission.py [-h] -m MODEL_NAME -d DATA_DIR -o OUTPUT_FILE [-b BATCH_SIZE] [-k KENLM] [-n NUM_WORKERS] [-w BEAM_WIDTH] [--test_pct TEST_PCT]

optional arguments:
  -h, --help            show this help message and exit
  -m MODEL_NAME, --model_name MODEL_NAME
                        The wav2vec2 model name
  -d DATA_DIR, --data_dir DATA_DIR
                        The directory contains the Zindi dataset (Train.csv, Test.csv and validated_dataset)
  -o OUTPUT_FILE, --output_file OUTPUT_FILE
                        The file name of the prediction result
  -b BATCH_SIZE, --batch_size BATCH_SIZE
                        Batch size
  -k KENLM, --kenlm KENLM
                        Path to KenLM model
  -n NUM_WORKERS, --num_workers NUM_WORKERS
                        KenLM's number of workers
  -w BEAM_WIDTH, --beam_width BEAM_WIDTH
                        KenLM's beam width
  --test_pct TEST_PCT   Percentage of the test set
```

In [None]:
%%time

!mkdir -p submissions
# For testing purposes, we can create the submission file for only 1% of the test dataset,
# which takes around 44 s.
#!python run_submission.py -o submissions/luganda-asr.csv -m indonesian-nlp/wav2vec2-luganda --data_dir /content/data/zindi -k 5gram.bin --test_pct 1

# The following command creates the submission file for 100% of the test dataset, which takes
# around 40 minutes.
!python run_submission.py -o submissions/luganda-asr.csv -m indonesian-nlp/wav2vec2-luganda --data_dir /content/data/zindi -k 5gram.bin



Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Unigrams not provided and cannot be automatically determined from LM file (only arpa format). Decoding accuracy might be reduced.
No known unigrams provided, decoding results might be a lot worse.
Using custom data configuration lg-e6ac6d8c877e86bb
Reusing dataset zindi (/root/.cache/huggingface/datasets/zindi/lg-e6ac6d8c877e86bb/1.0.0/48535cc6e254ea4fe8bd529ebd18452c88df0776030e5c69a33ea03b6bcf7436)
100% 71/71 [00:00<00:00, 104.72ex/s]
  return (input_length - kernel_size) // stride + 1
100% 5/5 [00:21<00:00,  4.25s/it]

The prediction result has been saved to submissions/luganda-asr.csv
CPU times: user 320 ms, sys: 80.5 ms, total: 400 ms
Wall time: 43.5 s
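
Before uploading, a quick sanity check of the generated file can catch an incomplete run. The sketch below reads only the header and the row count, so it makes no assumption about the column names used by Zindi; `summarize_csv` is a hypothetical helper.

```python
import csv


def summarize_csv(path):
    """Return (column names, number of data rows) for a CSV file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        # Iterating consumes the header first, then counts the data rows.
        row_count = sum(1 for _ in reader)
        return reader.fieldnames, row_count
```

For a full run, `summarize_csv("submissions/luganda-asr.csv")` would be expected to match `summarize_csv(f"{luganda_data_dir}/SampleSubmission.csv")` in both columns and row count.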
