# Tutorial on ASR Finetuning with a CTC Model
Let's finetune a pretrained ASR model!

Here we provide a pretrained speech recognition model, trained with CTC loss on many open-source datasets. Details can be found in [Rethinking Evaluation in ASR: Are Our Models Robust Enough?](https://arxiv.org/abs/2010.11745)

## Step 1: Install `Flashlight`
First, we install `Flashlight` and its dependencies. Flashlight is built from source with either the CPU or the CUDA backend, and installation takes **~16 minutes**.

To install outside of a Colab notebook, follow the [build instructions](https://github.com/fairinternal/flashlight#building).




In [None]:
# First, choose backend to build with
backend = 'CUDA' #@param ["CPU", "CUDA"]
# Clone Flashlight
!git clone https://github.com/flashlight/flashlight.git
# install all dependencies for colab notebook
!source flashlight/scripts/colab/colab_install_deps.sh


Build the CPU or CUDA backend of `Flashlight`:
- Builds from the pinned commit `d2e1924cb2a2b32b48cc326bb7e332ca3ea54f67` checked out in the cell below.
- Builds the ASR app.
- Places the resulting binaries in `/content/flashlight/build/bin/asr`.

If you are using a GPU Colab runtime, build the CUDA backend; otherwise, build the CPU backend.
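
The choice between the two backends can also be suggested automatically. The sketch below is a heuristic and not part of Flashlight: `pick_backend` is a hypothetical helper that assumes a GPU runtime exposes the `nvidia-smi` command on the `PATH`.

```python
import shutil

def pick_backend(which=shutil.which):
    """Return "CUDA" if `nvidia-smi` is found on PATH, otherwise "CPU"."""
    # Assumption: a GPU runtime ships the `nvidia-smi` command.
    # This is only a hint; the Flashlight build does not perform this check.
    return "CUDA" if which("nvidia-smi") else "CPU"

suggested_backend = pick_backend()
print(f"Suggested backend: {suggested_backend}")
```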

In [None]:
# export necessary env variables
%env MKLROOT=/opt/intel/mkl
%env ArrayFire_DIR=/opt/arrayfire/share/ArrayFire/cmake
%env DNNL_DIR=/opt/dnnl/dnnl_lnx_2.0.0_cpu_iomp/lib/cmake/dnnl

if backend == "CUDA":
  # Total time: ~13 minutes
  !cd flashlight && git checkout d2e1924cb2a2b32b48cc326bb7e332ca3ea54f67 && mkdir -p build && cd build && \
  cmake .. -DCMAKE_BUILD_TYPE=Release \
           -DFL_BUILD_TESTS=OFF \
           -DFL_BUILD_EXAMPLES=OFF \
           -DFL_BUILD_APP_ASR=ON && \
  make -j$(nproc)
elif backend == "CPU":
  # Total time: ~14 minutes
  !cd flashlight && git checkout d2e1924cb2a2b32b48cc326bb7e332ca3ea54f67 &&  mkdir -p build && cd build && \
  cmake .. -DFL_ARRAYFIRE_USE_CPU=ON \
           -DFL_USE_ARRAYFIRE=ON \
           -DFL_BUILD_TESTS=OFF \
           -DFL_BUILD_EXAMPLES=OFF \
           -DFL_BUILD_APP_ASR=ON && \
  make -j$(nproc)
else:
  raise ValueError(f"Unknown backend {backend}")

Let's take a look around.

In [None]:
# Binaries are located in
!ls flashlight/build/bin/asr

fl_asr_align   fl_asr_tutorial_finetune_ctc
fl_asr_decode  fl_asr_tutorial_inference_ctc
fl_asr_test    fl_asr_voice_activity_detection_ctc
fl_asr_train


## Step 2: Setup Finetuning

#### Downloading the model files

First, let's download the pretrained models for finetuning. 

For the acoustic model, you can choose from:

>Architecture | # Params | Criterion | Model Name | Arch Name 
>---|---|:---|:---:|:---:
> Transformer|70Mil|CTC|am_transformer_ctc_stride3_letters_70Mparams.bin |am_transformer_ctc_stride3_letters_70Mparams.arch
> Transformer|300Mil|CTC|am_transformer_ctc_stride3_letters_300Mparams.bin | am_transformer_ctc_stride3_letters_300Mparams.arch
> Conformer|25Mil|CTC|am_conformer_ctc_stride3_letters_25Mparams.bin|am_conformer_ctc_stride3_letters_25Mparams.arch
> Conformer|87Mil|CTC|am_conformer_ctc_stride3_letters_87Mparams.bin|am_conformer_ctc_stride3_letters_87Mparams.arch
> Conformer|300Mil|CTC|am_conformer_ctc_stride3_letters_300Mparams.bin| am_conformer_ctc_stride3_letters_300Mparams.arch

For demonstration, we will use the model in the first row and download both the model and its arch file.

In [None]:
!wget -nv --continue -o /dev/null https://dl.fbaipublicfiles.com/wav2letter/rasr/tutorial/am_transformer_ctc_stride3_letters_70Mparams.bin -O model.bin # acoustic model
!wget -nv --continue -o /dev/null https://dl.fbaipublicfiles.com/wav2letter/rasr/tutorial/am_transformer_ctc_stride3_letters_70Mparams.arch -O arch.txt # model architecture file

Along with the acoustic model, we will also download the tokens file and the lexicon file.

In [None]:
!wget -nv --continue -o /dev/null https://dl.fbaipublicfiles.com/wav2letter/rasr/tutorial/tokens.txt -O tokens.txt # tokens (defines predicted tokens)
!wget -nv --continue -o /dev/null https://dl.fbaipublicfiles.com/wav2letter/rasr/tutorial/lexicon.txt -O lexicon.txt # lexicon file (maps each word to its token sequence)

#### Downloading the dataset

For finetuning the model, we provide a limited supervision dataset based on the [AMI Corpus](http://groups.inf.ed.ac.uk/ami/corpus/). It consists of 10 min, 1 hr and 10 hr subsets, organized as follows.

```
dev.lst           # development set 
test.lst          # test set 
train_10min_0.lst # first 10 min fold
train_10min_1.lst
train_10min_2.lst
train_10min_3.lst
train_10min_4.lst
train_10min_5.lst
train_9hr.lst     # remaining data of the 10h split (10h=1h+9h)
```
The 10h split is created by combining the data from the 9h split and the 1h split. The 1h split is itself made of 6 folds of 10 min splits.

The recipe used for preparing this corpus can be found [here](https://github.com/flashlight/wav2letter/tree/master/data/ami). 

**You can also use your own dataset to finetune the model instead of AMI Corpus.**

In [None]:
!rm ami_limited_supervision.tar.gz 
!wget -nv --continue -o /dev/null https://dl.fbaipublicfiles.com/wav2letter/rasr/tutorial/ami_limited_supervision.tar.gz -O ami_limited_supervision.tar.gz
!tar -xf ami_limited_supervision.tar.gz 
!ls ami_limited_supervision

rm: cannot remove 'ami_limited_supervision.tar.gz': No such file or directory
audio	  train_10min_0.lst  train_10min_3.lst	train_9hr.lst
dev.lst   train_10min_1.lst  train_10min_4.lst
test.lst  train_10min_2.lst  train_10min_5.lst


### Get baseline WER before finetuning

Before proceeding to finetuning, let's measure the Viterbi WER on the AMI test set so that we have a baseline to compare against after finetuning.
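
As a reminder of what the metric measures: word error rate is the word-level edit distance between the reference and the prediction, divided by the number of reference words. The sketch below is a minimal illustration and not the code used by `fl_asr_test`; the hypothesis string is invented for the example.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def error_rate(ref, hyp):
    """Edit distance as a percentage of the reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)

reference = "so this is our kick off meeting"
hypothesis = "so this is a kick off meeting"  # invented prediction
# One substituted word out of seven reference words.
print(error_rate(reference.split(), hypothesis.split()))
```

Applying the same function to token sequences instead of word sequences gives a token-level error rate.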

In [None]:
! ./flashlight/build/bin/asr/fl_asr_test --am model.bin --datadir '' --emission_dir '' --uselexicon false \
            --test ami_limited_supervision/test.lst --tokens tokens.txt --lexicon lexicon.txt --show

Streaming output truncated to the last 100 lines.
[sample: EN2002b_H02_FEO072_4.48_4.59, WER: 100%, TER: 100%, total WER: 26.5703%, total TER: 13.0473%, progress (thread 0): 99.7389%]
|T|: y e p
|P|: 
[sample: EN2002b_H03_MEE073_307.97_308.08, WER: 100%, TER: 100%, total WER: 26.5712%, total TER: 13.0479%, progress (thread 0): 99.7468%]
|T|: h m m
|P|: 
[sample: EN2002b_H03_MEE073_1578.82_1578.93, WER: 100%, TER: 100%, total WER: 26.572%, total TER: 13.0485%, progress (thread 0): 99.7547%]
|T|: h m m
|P|: 
[sample: EN2002c_H03_MEE073_1446.36_1446.47, WER: 100%, TER: 100%, total WER: 26.5728%, total TER: 13.0491%, progress (thread 0): 99.7627%]
|T|: o h
|P|: 
[sample: EN2002d_H01_FEO072_118.11_118.22, WER: 100%, TER: 100%, total WER: 26.5736%, total TER: 13.0495%, progress (thread 0): 99.7706%]
|T|: s o
|P|: 
[sample: EN2002d_H02_MEE071_2107.05_2107.16, WER: 100%, TER: 100%, total WER: 26.5744%, total TER: 13.0499%, progress (thread 0): 99.7785%]
|T|: t
|P|: 
[sample: ES20

We can see that the Viterbi WER is 26.6% before finetuning.
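
The running totals can be read from a saved log instead of by eye. The sketch below assumes the per-sample line format shown in the output above; `last_totals` is a hypothetical helper, and the two log lines are copied from that output.

```python
import re

# Matches the running totals printed on each per-sample line.
TOTAL_RE = re.compile(r"total WER: ([\d.]+)%, total TER: ([\d.]+)%")

def last_totals(log_text):
    """Return (WER, TER) from the last per-sample line, or None if absent."""
    matches = TOTAL_RE.findall(log_text)
    if not matches:
        return None
    wer, ter = matches[-1]
    return float(wer), float(ter)

log = (
    "[sample: EN2002d_H01_FEO072_118.11_118.22, WER: 100%, TER: 100%, "
    "total WER: 26.5736%, total TER: 13.0495%, progress (thread 0): 99.7706%]\n"
    "[sample: EN2002d_H02_MEE071_2107.05_2107.16, WER: 100%, TER: 100%, "
    "total WER: 26.5744%, total TER: 13.0499%, progress (thread 0): 99.7785%]\n"
)
print(last_totals(log))
```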

## Step 3: Run Finetuning


Now, let's run finetuning with the AMI Corpus to see if we can improve the WER. 

Important parameters for `fl_asr_tutorial_finetune_ctc`:

`--train`, `--valid` - list files for training and validation sets respectively. Use comma to separate multiple files

`--datadir` - [optional] base path to be used for `--train`, `--valid` flags

`--lr` - learning rate for SGD

`--momentum` - SGD momentum 

`--lr_decay` - epoch at which learning rate decay starts

`--lr_decay_step` - the learning rate is halved at this epoch interval, starting from the epoch given by `--lr_decay`

`--arch` - architecture file. Tune dropout if necessary.

`--tokens` - tokens file 

`--batchsize` - batchsize per process

`--lexicon` - lexicon file 

`--rundir` - path to store checkpoints and logs

`--reportiters` - number of updates after which we run evaluation on the validation data and save the model; if 0, this happens only at the end of each epoch
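
The two decay flags above combine into a step schedule. The sketch below is one reading of those descriptions, not Flashlight's implementation: `decayed_lr` is a hypothetical helper, and it assumes the first halving takes effect at the epoch given by `--lr_decay`.

```python
def decayed_lr(base_lr, epoch, lr_decay, lr_decay_step):
    """Learning rate at `epoch` when it is halved every `lr_decay_step` epochs."""
    if epoch < lr_decay:
        return base_lr
    # Assumption: the first halving happens at epoch == lr_decay.
    halvings = (epoch - lr_decay) // lr_decay_step + 1
    return base_lr * 0.5 ** halvings

# Values mirror the flags used in this tutorial: --lr 0.025 --lr_decay 100 --lr_decay_step 50
for epoch in (1, 99, 100, 149, 150, 200):
    print(epoch, decayed_lr(0.025, epoch, lr_decay=100, lr_decay_step=50))
```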


>Amount of train data | Config to use 
>---|---|
> 10 min| --train train_10min_0.lst
> 1 hr| --train train_10min_0.lst,train_10min_1.lst,train_10min_2.lst,train_10min_3.lst,train_10min_4.lst,train_10min_5.lst
> 10 hr| --train train_10min_0.lst,train_10min_1.lst,train_10min_2.lst,train_10min_3.lst,train_10min_4.lst,train_10min_5.lst,train_9hr.lst
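
The `--train` values in the table can also be generated rather than typed by hand. `train_lists` below is a hypothetical helper; the file names are the ones from the dataset listing above.

```python
def train_lists(amount):
    """Comma-separated `--train` value for "10min", "1hr" or "10hr"."""
    folds = [f"train_10min_{i}.lst" for i in range(6)]  # six 10 min folds = 1 hr
    if amount == "10min":
        files = folds[:1]
    elif amount == "1hr":
        files = folds
    elif amount == "10hr":
        files = folds + ["train_9hr.lst"]  # 10 hr = 1 hr + 9 hr
    else:
        raise ValueError(f"Unknown amount {amount!r}")
    return ",".join(files)

print(train_lists("10hr"))
```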

Let's run finetuning with the 10 hr AMI data (**~7 min** for 1000 updates, including evaluation on the dev set).


In [None]:
! ./flashlight/build/bin/asr/fl_asr_tutorial_finetune_ctc model.bin \
  --datadir ami_limited_supervision \
  --train train_10min_0.lst,train_10min_1.lst,train_10min_2.lst,train_10min_3.lst,train_10min_4.lst,train_10min_5.lst,train_9hr.lst \
  --valid dev:dev.lst \
  --arch arch.txt \
  --tokens tokens.txt \
  --lexicon lexicon.txt \
  --rundir checkpoint \
  --lr 0.025 \
  --netoptim sgd \
  --momentum 0.8 \
  --reportiters 1000 \
  --lr_decay 100 \
  --lr_decay_step 50 \
  --iter 25000 \
  --batchsize 4 \
  --warmup 0

I1224 06:39:48.599629 11517 FinetuneCTC.cpp:76] Parsing command line flags
Initialized NCCL 2.7.8 successfully!
I1224 06:39:49.002488 11517 FinetuneCTC.cpp:106] Gflags after parsing 
--flagfile=; --fromenv=; --tryfromenv=; --undefok=; --tab_completion_columns=80; --tab_completion_word=; --help=false; --helpfull=false; --helpmatch=; --helpon=; --helppackage=false; --helpshort=false; --helpxml=false; --version=false; --adambeta1=0.94999999999999996; --adambeta2=0.98999999999999999; --am=; --am_decoder_tr_dropout=0.20000000000000001; --am_decoder_tr_layerdrop=0.20000000000000001; --am_decoder_tr_layers=6; --arch=arch.txt; --attention=keyvalue; --attentionthreshold=2147483647; --attnWindow=softPretrain; --attnconvchannel=0; --attnconvkernel=0; --attndim=0; --batching_max_duration=0; --batching_strategy=none; --batchsize=4; --beamsize=2500; --beamsizetoken=250000; --beamthreshold=25; --channels=1; --criterion=ctc; --critoptim=adagrad; --datadir=ami_limited_supervision; --decoderattnround=1;

## Step 4: Run Decoding 

#### Viterbi decoding
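
For a CTC model, best-path decoding without a language model is commonly described as: take the most likely token in each frame, merge consecutive repeats, then drop the blank symbol. The sketch below illustrates only that collapse step. It is not Flashlight's implementation, and the `<blank>` label and the frame sequence are placeholders chosen for the example.

```python
def ctc_collapse(frame_labels, blank="<blank>"):
    """Merge consecutive repeated labels, then remove blanks."""
    output = []
    previous = None
    for label in frame_labels:
        # Emit a label only when it differs from the previous frame
        # and is not the blank symbol.
        if label != previous and label != blank:
            output.append(label)
        previous = label
    return output

# Hypothetical per-frame best labels for a short utterance.
frames = ["<blank>", "y", "y", "<blank>", "e", "e", "e", "p", "<blank>"]
print(ctc_collapse(frames))
```

A blank between two identical labels keeps them separate, which is how a CTC model can emit doubled letters.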


In [None]:
! ./flashlight/build/bin/asr/fl_asr_test --am checkpoint/001_model_dev.bin --datadir ''  --emission_dir '' --uselexicon false \
            --test ami_limited_supervision/test.lst --tokens tokens.txt --lexicon lexicon.txt --show

Streaming output truncated to the last 100 lines.
[sample: EN2002b_H02_FEO072_4.48_4.59, WER: 100%, TER: 100%, total WER: 19.4745%, total TER: 8.68993%, progress (thread 0): 99.7389%]
|T|: y e p
|P|: m
[sample: EN2002b_H03_MEE073_307.97_308.08, WER: 100%, TER: 100%, total WER: 19.4754%, total TER: 8.69056%, progress (thread 0): 99.7468%]
|T|: h m m
|P|: m
[sample: EN2002b_H03_MEE073_1578.82_1578.93, WER: 100%, TER: 66.6667%, total WER: 19.4763%, total TER: 8.69097%, progress (thread 0): 99.7547%]
|T|: h m m
|P|: m e
[sample: EN2002c_H03_MEE073_1446.36_1446.47, WER: 100%, TER: 66.6667%, total WER: 19.4772%, total TER: 8.69137%, progress (thread 0): 99.7627%]
|T|: o h
|P|: h
[sample: EN2002d_H01_FEO072_118.11_118.22, WER: 100%, TER: 50%, total WER: 19.4781%, total TER: 8.69156%, progress (thread 0): 99.7706%]
|T|: s o
|P|: m
[sample: EN2002d_H02_MEE071_2107.05_2107.16, WER: 100%, TER: 100%, total WER: 19.479%, total TER: 8.69199%, progress (thread 0): 99.7785%]
|T|: t
|P|: 

The Viterbi WER improved from 26.6% to 19.5% after 1 epoch of finetuning.
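
In absolute terms that is a drop of about 7.1 WER points, or roughly 27% relative to the baseline. The arithmetic, using the two rounded figures quoted above:

```python
def relative_reduction(before, after):
    """Relative error reduction, in percent of the starting value."""
    return 100.0 * (before - after) / before

baseline_wer = 26.6   # Viterbi WER before finetuning
finetuned_wer = 19.5  # Viterbi WER after finetuning

print(f"absolute: {baseline_wer - finetuned_wer:.1f} points")
print(f"relative: {relative_reduction(baseline_wer, finetuned_wer):.1f}%")
```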

#### Beam Search decoding with a language model 

To do this, download the finetuned model and use the [Inference CTC tutorial](https://colab.research.google.com/github/flashlight/flashlight/blob/master/flashlight/app/asr/tutorial/notebooks/InferenceAndAlignmentCTC.ipynb)

## Step 5: Running with your own data 

To finetune on your own data, create `train`, `dev` and `test` list files and run the finetuning step. 

Each list file consists of multiple lines, with each line describing one sample in the following format:
```
<sample_id> <path_to_audio_file> <duration> <transcript>
```
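
A list file in this format can be read with a few lines of Python. The sketch below assumes whitespace-separated fields, audio paths without spaces, and durations in milliseconds; `Sample`, `parse_list_line` and `total_seconds` are hypothetical names, and the two sample lines are taken from the AMI `dev.lst` file.

```python
from typing import NamedTuple

class Sample(NamedTuple):
    sample_id: str
    audio_path: str
    duration_ms: float
    transcript: str

def parse_list_line(line):
    """Split one list-file line into its four fields."""
    # maxsplit=3 keeps the spaces inside the transcript intact.
    parts = line.strip().split(None, 3)
    if len(parts) < 3:
        raise ValueError(f"Malformed line: {line!r}")
    transcript = parts[3] if len(parts) == 4 else ""
    return Sample(parts[0], parts[1], float(parts[2]), transcript)

def total_seconds(lines):
    """Sum of sample durations, converted from milliseconds to seconds."""
    return sum(parse_list_line(line).duration_ms for line in lines) / 1000.0

lines = [
    "ES2011a_H00_FEE041_34.27_37.14\tami_limited_supervision/audio/ES2011a/ES2011a_H00_FEE041_34.27_37.14.flac\t2870.0\there we go\n",
    "ES2011a_H00_FEE041_37.14_39.15\tami_limited_supervision/audio/ES2011a/ES2011a_H00_FEE041_37.14_39.15.flac\t2010.0\twelcome everybody\n",
]
print(parse_list_line(lines[0]))
print(total_seconds(lines))
```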

For example, let's take a look at the `dev.lst` file from AMI corpus.

In [None]:
! head ami_limited_supervision/dev.lst 

ES2011a_H00_FEE041_34.27_37.14	ami_limited_supervision/audio/ES2011a/ES2011a_H00_FEE041_34.27_37.14.flac	2870.0	here we go
ES2011a_H00_FEE041_37.14_39.15	ami_limited_supervision/audio/ES2011a/ES2011a_H00_FEE041_37.14_39.15.flac	2010.0	welcome everybody
ES2011a_H00_FEE041_43.32_44.39	ami_limited_supervision/audio/ES2011a/ES2011a_H00_FEE041_43.32_44.39.flac	1070.0	you can call me abbie
ES2011a_H00_FEE041_39.15_43.32	ami_limited_supervision/audio/ES2011a/ES2011a_H00_FEE041_39.15_43.32.flac	4170.0	um i'm abigail claflin
ES2011a_H00_FEE041_46.43_47.63	ami_limited_supervision/audio/ES2011a/ES2011a_H00_FEE041_46.43_47.63.flac	1200.0	's see
ES2011a_H00_FEE041_51.33_55.53	ami_limited_supervision/audio/ES2011a/ES2011a_H00_FEE041_51.33_55.53.flac	4200.0	so this is our kick off meeting
ES2011a_H00_FEE041_55.53_56.85	ami_limited_supervision/audio/ES2011a/ES2011a_H00_FEE041_55.53_56.85.flac	1320.0	um
ES2011a_H00_FEE041_47.63_50.2	ami_limited_supervision/audio/ES2011a/ES2011a_H00_FEE041_47.63_50.2.fl

#### Recording your own audio

For example, you can record your own audio and finetune the model on it.

First, install a few packages:

In [None]:
!apt-get install sox
!pip install ffmpeg-python sox

In [None]:
from flashlight.scripts.colab.record import record_audio

**Now let's record the following sentences:**

**1:** A flashlight or torch is a small, portable spotlight. 


In [None]:
record_audio("recorded_audio_1")

**2:** Its function is a beam of light which helps to see and it usually requires batteries.

In [None]:
record_audio("recorded_audio_2")

**3:** In 1896, the first dry cell battery was invented. 

In [None]:
record_audio("recorded_audio_3")

**4:** Unlike previous batteries, it used a paste electrolyte instead of a liquid.

In [None]:
record_audio("recorded_audio_4")

**5:** This was the first battery suitable for portable electrical devices, as it did not spill or break easily and worked in any orientation.

In [None]:
record_audio("recorded_audio_5")

### Now create new training/dev lists:

(yes, you need to edit the transcriptions below to match your recordings)
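
Transcriptions should only contain symbols the model can predict. The sketch below assumes the token set is lowercase letters plus the apostrophe, as the AMI transcripts above suggest; check `tokens.txt` to confirm. `normalize_transcript` is a hypothetical helper.

```python
import re

def normalize_transcript(text):
    """Lowercase the text and keep only letters, apostrophes and single spaces."""
    text = text.lower()
    # Assumption: anything outside a-z, apostrophe and space is not a valid token.
    text = re.sub(r"[^a-z' ]+", " ", text)
    return " ".join(text.split())  # collapse repeated spaces

print(normalize_transcript("A flashlight or torch is a small, portable spotlight."))
```

Digits are dropped by this helper, so numbers such as "1896" still need to be spelled out by hand.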

In [None]:
import sox
transcriptions = [
   "a flashlight or torch is a small portable spotlight",
   "its function is a beam of light which helps to see and it usually requires batteries",
   "in eighteen ninety six the first dry cell battery was invented",
   "unlike previous batteries it used a paste electrolyte instead of a liquid",
   "this was the first battery suitable for portable electrical devices as it did not spill or break easily and worked in any orientation"
]
with open("own_train.lst", "w") as f_train, open("own_dev.lst", "w") as f_dev:
  for index, transcription in enumerate(transcriptions):
    fname = "recorded_audio_" + str(index + 1) + ".wav"
    duration_ms = sox.file_info.duration(fname) * 1000
    if index % 2 == 0:
      f_train.write("{}\t{}\t{}\t{}\n".format(
          index + 1, fname, duration_ms, transcription))
    else:
      f_dev.write("{}\t{}\t{}\t{}\n".format(
          index + 1, fname, duration_ms, transcription))

### First, check model quality on dev before finetuning

In [None]:
! ./flashlight/build/bin/asr/fl_asr_test --am model.bin --datadir ''  --emission_dir '' --uselexicon false \
            --test own_dev.lst --tokens tokens.txt --lexicon lexicon.txt --show 

### Finetune on recorded audio samples

Adjust the parameters if needed.

In [None]:
! ./flashlight/build/bin/asr/fl_asr_tutorial_finetune_ctc model.bin \
  --datadir= \
  --train own_train.lst \
  --valid dev:own_dev.lst \
  --arch arch.txt \
  --tokens tokens.txt \
  --lexicon lexicon.txt \
  --rundir own_checkpoint \
  --lr 0.025 \
  --netoptim sgd \
  --momentum 0.8 \
  --reportiters 1000 \
  --lr_decay 100 \
  --lr_decay_step 50 \
  --iter 25000 \
  --batchsize 4 \
  --warmup 0

### Test finetuned model

(You are unlikely to see a significant improvement with just five phrases, but let's check!)

In [None]:
! ./flashlight/build/bin/asr/fl_asr_test --am own_checkpoint/001_model_dev.bin --datadir ''  --emission_dir '' --uselexicon false \
            --test own_dev.lst --tokens tokens.txt --lexicon lexicon.txt --show 