# Tutorial on ASR Finetuning with a CTC Model
Here we provide pre-trained speech recognition models, trained with the CTC loss on many open-source datasets. Details can be found in [Rethinking Evaluation in ASR: Are Our Models Robust Enough?](https://arxiv.org/abs/2010.11745).

## Step 1: Install Flashlight
First, we need to install Flashlight and its dependencies. Flashlight is built from source with either the CPU or the CUDA backend, so you can tweak the code and recompile if needed. Installation takes **~16 minutes**.

For instructions on installing outside of a Colab notebook, see the [build instructions](https://github.com/fairinternal/flashlight#building).




#### Choose a backend

In [1]:
backend = 'CUDA' #@param ["CPU", "CUDA"]

#### Install Dependencies
Install FFTW3, libsndfile, glog, Boost, OpenMPI, KenLM, ArrayFire v3.7.1, Intel MKL 2020.0-088, and CMake 3.10.2.

In [2]:
# Total time: ~5 minutes
# Install dependencies from apt
!sudo apt-get install -y libfftw3-dev libsndfile1-dev libgoogle-glog-dev libopenmpi-dev libboost-all-dev
# Install Kenlm
!cd /tmp && git clone https://github.com/kpu/kenlm && cd kenlm && mkdir build && cd build && cmake .. -DCMAKE_BUILD_TYPE=Release && make install -j$(nproc)
# Download and unpack ArrayFire v3.7.1
!cd /tmp && wget https://arrayfire.s3.amazonaws.com/3.7.1/ArrayFire-v3.7.1-1_Linux_x86_64.sh
!mkdir -p /opt/arrayfire
!bash /tmp/ArrayFire-v3.7.1-1_Linux_x86_64.sh --skip-license --prefix=/opt/arrayfire
!rm /tmp/ArrayFire-v3.7.1-1_Linux_x86_64.sh 
# Remove some downloaded libs from the ArrayFire installer to avoid double linkage
!rm /opt/arrayfire/lib64/libnvrtc* /opt/arrayfire/lib64/libcu* /opt/arrayfire/lib64/libiomp*
# Install Intel MKL 2020
!cd /tmp && wget https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS-2019.PUB && \
    apt-key add GPG-PUB-KEY-INTEL-SW-PRODUCTS-2019.PUB 
!sh -c 'echo deb https://apt.repos.intel.com/mkl all main > /etc/apt/sources.list.d/intel-mkl.list' && \
    apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends intel-mkl-64bit-2020.0-088
# Remove existing MKL libs to avoid double linkage
!rm -rf /usr/local/lib/libmkl*
# Grab CMake 3.10.2
!cd /opt  && wget https://github.com/Kitware/CMake/releases/download/v3.10.2/cmake-3.10.2-Linux-x86_64.tar.gz && \
    tar -xzf cmake-3.10.2-Linux-x86_64.tar.gz && rm -f /usr/local/bin/cmake && ln -s /opt/cmake-3.10.2-Linux-x86_64/bin/cmake /usr/local/bin/cmake

Reading package lists... Done
Building dependency tree       
Reading state information... Done
libboost-all-dev is already the newest version (1.65.1.0ubuntu1).
libopenmpi-dev is already the newest version (2.1.1-8).
libsndfile1-dev is already the newest version (1.0.28-4ubuntu0.18.04.1).
The following additional packages will be installed:
  libfftw3-bin libfftw3-long3 libfftw3-quad3 libfftw3-single3 libgflags-dev
  libgflags2.2 libgoogle-glog0v5
Suggested packages:
  libfftw3-doc
The following NEW packages will be installed:
  libfftw3-bin libfftw3-dev libfftw3-long3 libfftw3-quad3 libfftw3-single3
  libgflags-dev libgflags2.2 libgoogle-glog-dev libgoogle-glog0v5
0 upgraded, 9 newly installed, 0 to remove and 15 not upgraded.
Need to get 4,048 kB of archives.
After this operation, 22.6 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/main amd64 libfftw3-long3 amd64 3.3.7-1 [308 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic/main amd64 libfft

### Build CPU/CUDA Backend of Flashlight
Build from the current master branch. This builds the ASR app; the resulting binaries are in `/content/flashlight/build/bin/asr`.

If using a GPU Colab runtime, build the CUDA backend; otherwise build the CPU backend. Only one backend can be built at a time.
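
If you are unsure which runtime is attached, the choice could be automated. The following is a minimal sketch, not part of the original notebook: `detect_backend` is a hypothetical helper, and it assumes that a working `nvidia-smi -L` call is a good enough signal that a CUDA build makes sense.

```python
import shutil
import subprocess


def detect_backend():
    """Return "CUDA" if an NVIDIA GPU appears to be usable, else "CPU".

    Assumption: `nvidia-smi -L` lists one line per GPU and exits with 0
    when the driver is working.
    """
    if shutil.which("nvidia-smi") is None:
        return "CPU"
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"],
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            timeout=30,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return "CPU"
    listed = result.stdout.decode("utf-8", errors="replace")
    return "CUDA" if result.returncode == 0 and "GPU" in listed else "CPU"


# Hypothetical usage: pick the value for the `backend` variable.
print(detect_backend())
```

The returned string uses the same two values as the `backend` variable chosen earlier, so it could be assigned to it directly.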

In [3]:
if backend == "CUDA":
  # Total time: ~13 minutes
  # Use CUDA 10.0 - symlink to /usr/local/cuda
  !rm /usr/local/cuda && ln -s /usr/local/cuda-10.0 /usr/local/cuda
  # Clone and build flashlight from source
  !git clone https://github.com/facebookresearch/flashlight.git
  !cd flashlight && mkdir -p build && cd build && export MKLROOT=/opt/intel/mkl && \
  cmake .. -DArrayFire_DIR=/opt/arrayfire/share/ArrayFire/cmake \
              -DCMAKE_BUILD_TYPE=Release \
              -DFL_BUILD_TESTS=OFF \
              -DFL_BUILD_EXAMPLES=OFF \
              -DFL_BUILD_APP_IMGCLASS=OFF \
              -DFL_BUILD_APP_LM=OFF && \
  make -j$(nproc)
elif backend == "CPU":
  # Total time: ~14 minutes
  # Download oneDNN/dnnl
  !mkdir -p /opt/dnnl && cd /opt/dnnl && \
      wget https://github.com/oneapi-src/oneDNN/releases/download/v2.0/dnnl_lnx_2.0.0_cpu_iomp.tgz && \
      tar -xf dnnl_lnx_2.0.0_cpu_iomp.tgz
  # Download and install Gloo
  !cd /tmp && git clone https://github.com/facebookincubator/gloo.git && cd gloo && \
      mkdir -p build && cd build && cmake .. -DUSE_MPI=ON && make install -j$(nproc)
  # Clone and build flashlight from source
  !git clone https://github.com/facebookresearch/flashlight.git
  !cd flashlight && mkdir -p build && cd build && export MKLROOT=/opt/intel/mkl && \
  cmake .. -DArrayFire_DIR=/opt/arrayfire/share/ArrayFire/cmake \
              -DDNNL_DIR=/opt/dnnl/dnnl_lnx_2.0.0_cpu_iomp/lib/cmake/dnnl \
              -DFL_BACKEND=CPU \
              -DCMAKE_BUILD_TYPE=Release \
              -DFL_BUILD_TESTS=OFF \
              -DFL_BUILD_EXAMPLES=OFF \
              -DFL_BUILD_APP_IMGCLASS=OFF \
              -DFL_BUILD_APP_LM=OFF && \
  make -j$(nproc)
else:
  raise ValueError(f"Unknown backend {backend}")

Cloning into 'flashlight'...
remote: Enumerating objects: 39, done.
remote: Counting objects: 100% (39/39), done.
remote: Compressing objects: 100% (31/31), done.
remote: Total 17079 (delta 4), reused 25 (delta 3), pack-reused 17040
Receiving objects: 100% (17079/17079), 7.87 MiB | 23.29 MiB/s, done.
Resolving deltas: 100% (12040/12040), done.
-- The CXX compiler identification is GNU 7.5.0
-- The C compiler identification is GNU 7.5.0
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- The CUDA compiler identification is NVIDIA 10

### Let's take a look around.

In [4]:
# Binaries are located in
!ls flashlight/build/bin/asr

fl_asr_decode  fl_asr_train		     fl_asr_tutorial_inference_ctc
fl_asr_test    fl_asr_tutorial_finetune_ctc
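
Since the build takes a while, it can be worth confirming that it produced everything before moving on. The sketch below is an illustration only: `missing_binaries` is a hypothetical helper, and the expected names are simply the ones shown in the listing above.

```python
import os

# Binary names taken from the `ls` output above.
EXPECTED_BINARIES = [
    "fl_asr_train",
    "fl_asr_test",
    "fl_asr_decode",
    "fl_asr_tutorial_finetune_ctc",
    "fl_asr_tutorial_inference_ctc",
]


def missing_binaries(bin_dir, names=EXPECTED_BINARIES):
    """Return the names that are absent or not executable in bin_dir."""
    return [
        name
        for name in names
        if not os.access(os.path.join(bin_dir, name), os.X_OK)
    ]


# An empty list means every expected binary is present and executable.
print(missing_binaries("flashlight/build/bin/asr"))
```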


## Step 2: Setup Finetuning

#### Downloading the model files

First, let's download the pretrained models for finetuning. 

For the acoustic model, you can choose from:

>Architecture | # Params | Criterion | Model Name | Arch Name 
>---|---|:---|:---:|:---:
> Transformer|70Mil|CTC|am_transformer_ctc_stride3_letters_70Mparams.bin |am_transformer_ctc_stride3_letters_70Mparams.arch
> Transformer|300Mil|CTC|am_transformer_ctc_stride3_letters_300Mparams.bin | am_transformer_ctc_stride3_letters_300Mparams.arch
> Conformer|25Mil|CTC|am_conformer_ctc_stride3_letters_25Mparams.bin|am_conformer_ctc_stride3_letters_25Mparams.arch
> Conformer|87Mil|CTC|am_conformer_ctc_stride3_letters_87Mparams.bin|am_conformer_ctc_stride3_letters_87Mparams.arch
> Conformer|300Mil|CTC|am_conformer_ctc_stride3_letters_300Mparams.bin|

For demonstration, we will use the model in the first row and download the model and its arch file.

In [5]:
!wget -nv --continue -o /dev/null https://dl.fbaipublicfiles.com/wav2letter/rasr/tutorial/am_transformer_ctc_stride3_letters_70Mparams.bin -O model.bin # acoustic model
!wget -nv --continue -o /dev/null https://dl.fbaipublicfiles.com/wav2letter/rasr/tutorial/am_transformer_ctc_stride3_letters_70Mparams.arch -O arch.txt # model architecture file

Along with the acoustic model, we will also download the tokens file and the lexicon file.

In [6]:
!wget -nv --continue -o /dev/null https://dl.fbaipublicfiles.com/wav2letter/rasr/tutorial/tokens.txt -O tokens.txt # tokens (defines predicted tokens)
!wget -nv --continue -o /dev/null https://dl.fbaipublicfiles.com/wav2letter/tutorial_tmp/lexicon.txt -O lexicon.txt # lexicon file (defines the mapping between words and token sequences)

#### Downloading the dataset

To finetune the model, we provide a limited-supervision dataset built from the [AMI Corpus](http://groups.inf.ed.ac.uk/ami/corpus/). It consists of 10 min, 1 hr, and 10 hr subsets, organized as follows.

```
dev.lst           # development set 
test.lst          # test set 
train_10min_0.lst # first 10 min fold
train_10min_1.lst
train_10min_2.lst
train_10min_3.lst
train_10min_4.lst
train_10min_5.lst
train_9hr.lst     # remaining data of the 10h split (10h=1h+9h)
```
The 10h split is created by combining the data from the 9h split and the 1h split. The 1h split is itself made of 6 folds of 10 min splits.

The recipe used for preparing this corpus can be found [here](https://github.com/facebookresearch/wav2letter/tree/master/data/ami). 

You can also use your own dataset to finetune the model instead of AMI Corpus. 

In [33]:
!rm -f ami_limited_supervision.tar.gz
!wget -nv --continue -o /dev/null https://dl.fbaipublicfiles.com/wav2letter/rasr/tutorial/ami_limited_supervision.tar.gz -O ami_limited_supervision.tar.gz
!tar -xf ami_limited_supervision.tar.gz 
!ls ami_limited_supervision

audio	  train_10min_0.lst  train_10min_3.lst	train_9hr.lst
dev.lst   train_10min_1.lst  train_10min_4.lst
test.lst  train_10min_2.lst  train_10min_5.lst


### Get baseline WER before finetuning

Before proceeding to the finetuning stage, we measure the Viterbi WER on the AMI test set so that we can quantify the improvement from finetuning.

In [13]:
! ./flashlight/build/bin/asr/fl_asr_test --am model.bin --datadir '' --emission_dir '' \
            --test ami_limited_supervision/test.lst --tokens tokens.txt --lexicon lexicon.txt --show 

Streaming output truncated to the last 5000 lines.
|T|: m m
|P|: h m
[sample: ES2004c_H03_FEE016_1046.72_1047.07, WER: 100%, TER: 50%, total WER: 26.2114%, total TER: 12.6062%, progress (thread 0): 86.8275%]
|T|: m m
|P|: m m
[sample: ES2004c_H01_FEE013_1294.34_1294.69, WER: 100%, TER: 0%, total WER: 26.2123%, total TER: 12.6061%, progress (thread 0): 86.8354%]
|T|: m m h m m
|P|: m h m
[sample: ES2004c_H03_FEE016_1302.15_1302.5, WER: 100%, TER: 40%, total WER: 26.2131%, total TER: 12.6065%, progress (thread 0): 86.8434%]
|T|: y e a h
|P|: y e a h
[sample: ES2004c_H02_MEE014_1515.72_1516.07, WER: 0%, TER: 0%, total WER: 26.2128%, total TER: 12.6063%, progress (thread 0): 86.8513%]
|T|: y e a h
|P|: y e a h
[sample: ES2004c_H02_MEE014_1690.13_1690.48, WER: 0%, TER: 0%, total WER: 26.2125%, total TER: 12.6062%, progress (thread 0): 86.8592%]
|T|: m m h m m
|P|: m h m
[sample: ES2004c_H03_FEE016_2078.48_2078.83, WER: 100%, TER: 40%, total WER: 26.2134%, total TER: 12.6065%, 

We can see that the Viterbi WER is 26.9% before finetuning.
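
To make the per-sample WER and TER numbers in the log concrete, here is a minimal sketch of an edit-distance-based error rate. It is not Flashlight's implementation; `edit_distance` and `error_rate` are hypothetical helpers, and treating each letter as one token is an assumption based on how the `|T|` and `|P|` lines are printed.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or tokens)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def error_rate(ref, hyp):
    """Error rate in percent: edits divided by the reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


# Target "m m h m m" vs. prediction "m h m" from the log above.
print(error_rate(["mmhmm"], ["mhm"]))        # word level
print(error_rate(list("mmhmm"), list("mhm")))  # token (letter) level
```

With these inputs the word-level rate is 100% and the token-level rate is 40%, which matches the corresponding sample in the log.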

## Step 3: Run Finetuning


Now, we will run finetuning with the AMI Corpus to see if we can improve the WER. 

Important parameters for `fl_asr_tutorial_finetune_ctc`:

`--train`, `--valid` - list files for the training and validation sets, respectively. Use commas to separate multiple files.

`--datadir` - [optional] base path to be used for the `--train` and `--valid` flags

`--lr` - learning rate for SGD

`--momentum` - SGD momentum

`--lr_decay` - epoch at which learning rate decay starts

`--lr_decay_step` - the learning rate halves after every interval of this many epochs, starting from the epoch given by `--lr_decay`

`--arch` - architecture file. Tune dropout if necessary.

`--tokens` - tokens file

`--batchsize` - batch size per process

`--lexicon` - lexicon file

`--rundir` - path where checkpoints and logs are stored

`--reportiters` - number of updates after which we run evaluation on the validation data and save the model; if 0, this is done only at the end of each epoch
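
The interaction between `--lr_decay` and `--lr_decay_step` can be sketched as follows. This is one reading of the descriptions above, not code taken from Flashlight: `lr_at_epoch` is a hypothetical helper, and it assumes the first halving happens at the epoch given by `--lr_decay` itself.

```python
def lr_at_epoch(base_lr, epoch, lr_decay, lr_decay_step):
    """Learning rate at a given epoch under a step-wise halving schedule.

    Assumption: no decay before `lr_decay`; one halving at `lr_decay`
    and one more after every further `lr_decay_step` epochs.
    """
    if epoch < lr_decay:
        return base_lr
    halvings = 1 + (epoch - lr_decay) // lr_decay_step
    return base_lr * 0.5 ** halvings


# Hypothetical usage with --lr 0.025 --lr_decay 100 --lr_decay_step 50.
for epoch in (99, 100, 149, 150):
    print(epoch, lr_at_epoch(0.025, epoch, lr_decay=100, lr_decay_step=50))
```

Under this reading, a short run that never reaches the `--lr_decay` epoch trains at the constant base learning rate.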


>Amount of train data | Config to use 
>---|---|
> 10 min| --train train_10min_0.lst
> 1 hr| --train train_10min_0.lst,train_10min_1.lst,train_10min_2.lst,train_10min_3.lst,train_10min_4.lst,train_10min_5.lst
> 10 hr| --train train_10min_0.lst,train_10min_1.lst,train_10min_2.lst,train_10min_3.lst,train_10min_4.lst,train_10min_5.lst,train_9hr.lst
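
The table can also be expressed as a small helper that builds the value for `--train`. This is only a convenience sketch; `train_flag` is a hypothetical name, and the file names are the ones from the dataset listing.

```python
# The six 10-minute folds that together make up the 1 hr split.
FOLDS = [f"train_10min_{i}.lst" for i in range(6)]


def train_flag(amount):
    """Return the comma-separated list files for "10min", "1hr" or "10hr"."""
    if amount == "10min":
        files = FOLDS[:1]
    elif amount == "1hr":
        files = FOLDS
    elif amount == "10hr":
        files = FOLDS + ["train_9hr.lst"]
    else:
        raise ValueError(f"Unknown amount of train data: {amount}")
    return ",".join(files)


print(train_flag("1hr"))
```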

Let's run finetuning with 10hr AMI data 


In [None]:
! ./flashlight/build/bin/asr/fl_asr_tutorial_finetune_ctc model.bin \
  --datadir ami_limited_supervision \
  --train train_10min_0.lst,train_10min_1.lst,train_10min_2.lst,train_10min_3.lst,train_10min_4.lst,train_10min_5.lst,train_9hr.lst \
  --valid dev:dev.lst \
  --arch arch.txt \
  --tokens tokens.txt \
  --lexicon lexicon.txt \
  --rundir checkpoint \
  --lr 0.025 \
  --netoptim sgd \
  --momentum 0.8 \
  --reportiters 1000 \
  --lr_decay 100 \
  --lr_decay_step 50 \
  --iter 25000 \
  --batchsize 5 \
  --warmup 0

I1223 19:54:35.112895  9521 FinetuneCTC.cpp:76] Parsing command line flags
Initialized NCCL 2.7.8 successfully!
I1223 19:54:35.565654  9521 FinetuneCTC.cpp:106] Gflags after parsing 
--flagfile=; --fromenv=; --tryfromenv=; --undefok=; --tab_completion_columns=80; --tab_completion_word=; --help=false; --helpfull=false; --helpmatch=; --helpon=; --helppackage=false; --helpshort=false; --helpxml=false; --version=false; --adambeta1=0.94999999999999996; --adambeta2=0.98999999999999999; --am=; --am_decoder_tr_dropout=0.20000000000000001; --am_decoder_tr_layerdrop=0.20000000000000001; --am_decoder_tr_layers=6; --arch=arch.txt; --attention=keyvalue; --attentionthreshold=2147483647; --attnWindow=softPretrain; --attnconvchannel=0; --attnconvkernel=0; --attndim=0; --batching_max_duration=0; --batching_strategy=none; --batchsize=5; --beamsize=2500; --beamsizetoken=250000; --beamthreshold=25; --channels=1; --criterion=ctc; --critoptim=adagrad; --datadir=ami_limited_supervision; --decoderattnround=1;

## Step 4: Run Decoding 

#### Viterbi decoding


In [None]:
! ./flashlight/build/bin/asr/fl_asr_test --am checkpoint/001_model_dev.bin --datadir ''  --emission_dir '' \
            --test ami_limited_supervision/test.lst --tokens tokens.txt --lexicon lexicon.txt --show 

Viterbi WER improved from 26.9% to 19.6% with finetuning.
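
As a quick piece of arithmetic on the two WER figures quoted in this tutorial, the relative reduction can be computed like this (`relative_wer_reduction` is a hypothetical helper name):

```python
def relative_wer_reduction(before, after):
    """Relative WER reduction in percent."""
    return 100.0 * (before - after) / before


# WER before and after finetuning, as reported in this tutorial.
print(round(relative_wer_reduction(26.9, 19.6), 1))  # 27.1
```

That is an absolute drop of 7.3 points, or roughly a 27% relative reduction.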

#### Beam Search decoding with language model 

To do this, download the finetuned model and follow the Inference CTC tutorial.

## Step 5: Running with your own data 

To run finetuning on your own data, create `train`, `dev`, and `test` list files and run the finetuning step.

Each list file consists of multiple lines, with each line describing one sample in the following format:
```
<sample_id> <path_to_audio_file> <duration> <transcript>
```

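For example, let's take a look at the `dev.lst` file from the AMI corpus. The fields are tab-separated here, and the duration appears to be given in milliseconds (the first sample spans 34.27 s to 37.14 s and is listed as 2870.0).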

In [34]:
! head ami_limited_supervision/dev.lst 

ES2011a_H00_FEE041_34.27_37.14	ami_limited_supervision/audio/ES2011a/ES2011a_H00_FEE041_34.27_37.14.flac	2870.0	here we go
ES2011a_H00_FEE041_37.14_39.15	ami_limited_supervision/audio/ES2011a/ES2011a_H00_FEE041_37.14_39.15.flac	2010.0	welcome everybody
ES2011a_H00_FEE041_43.32_44.39	ami_limited_supervision/audio/ES2011a/ES2011a_H00_FEE041_43.32_44.39.flac	1070.0	you can call me abbie
ES2011a_H00_FEE041_39.15_43.32	ami_limited_supervision/audio/ES2011a/ES2011a_H00_FEE041_39.15_43.32.flac	4170.0	um i'm abigail claflin
ES2011a_H00_FEE041_46.43_47.63	ami_limited_supervision/audio/ES2011a/ES2011a_H00_FEE041_46.43_47.63.flac	1200.0	's see
ES2011a_H00_FEE041_51.33_55.53	ami_limited_supervision/audio/ES2011a/ES2011a_H00_FEE041_51.33_55.53.flac	4200.0	so this is our kick off meeting
ES2011a_H00_FEE041_55.53_56.85	ami_limited_supervision/audio/ES2011a/ES2011a_H00_FEE041_55.53_56.85.flac	1320.0	um
ES2011a_H00_FEE041_47.63_50.2	ami_limited_supervision/audio/ES2011a/ES2011a_H00_FEE041_47.63_50.2.fl
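
A list file for your own recordings could be generated along these lines. This is a minimal sketch under stated assumptions: `wav_duration_ms` and `make_list_line` are hypothetical helpers, the audio is assumed to be PCM WAV (the standard-library `wave` module does not read FLAC), and the tab separator and millisecond durations mirror the `dev.lst` output above.

```python
import wave


def wav_duration_ms(path):
    """Duration of a PCM WAV file in milliseconds (WAV only, not FLAC)."""
    with wave.open(path, "rb") as wav:
        return 1000.0 * wav.getnframes() / wav.getframerate()


def make_list_line(sample_id, audio_path, duration_ms, transcript):
    """Build one tab-separated list-file line.

    Assumption: sample_id and audio_path contain no whitespace.
    """
    # Collapse runs of whitespace so the transcript stays on one line.
    clean = " ".join(transcript.split())
    return "\t".join([sample_id, audio_path, f"{duration_ms:.1f}", clean])


# Hypothetical sample id and path, with the duration given explicitly.
print(make_list_line("utt_0001", "my_audio/utt_0001.wav", 2870.0, "here we go"))
```

For real files, `wav_duration_ms(audio_path)` can supply the third field, and the resulting lines can be written to `train`, `dev`, and `test` list files one per line.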

#### Recording your own audio [TODO]

For example, you can record your own audio and finetune the model...

Installing a few packages first...

In [24]:
!apt-get install sox
!pip install ffmpeg-python sox google-colab 

Reading package lists... Done
Building dependency tree       
Reading state information... Done
sox is already the newest version (14.4.2-3ubuntu0.18.04.1).
0 upgraded, 0 newly installed, 0 to remove and 16 not upgraded.


In [33]:
from flashlight.scripts.colab.record import record_audio
from google.colab.output import eval_js
record_audio("recorded_audio")