# Tutorial on ASR inference and alignment with a CTC model
Let's play with the pre-trained speech recognition model!

Here we provide pre-trained speech recognition models trained with CTC loss on several open-source datasets. Details can be found in [Rethinking Evaluation in ASR: Are Our Models Robust Enough?](https://arxiv.org/abs/2010.11745)

## Install `Flashlight`
First we need to install `Flashlight` and its dependencies. `Flashlight` is built from source, which takes **~16 minutes**.

To install outside of a Colab notebook, follow the [build instructions](https://github.com/facebookresearch/flashlight#building).

In [None]:
# First, choose backend to build with
backend = 'CUDA' #@param ["CPU", "CUDA"]
# Clone Flashlight
!git clone https://github.com/facebookresearch/flashlight.git
# install all dependencies for colab notebook
!source flashlight/scripts/colab/colab_install_deps.sh

Build the ASR app from the current master. The resulting binaries are placed in `/content/flashlight/build/bin/asr`.

If using a GPU Colab runtime, build the CUDA backend; else build the CPU backend.
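
If you are unsure which runtime you are on, the choice can be guessed programmatically. The sketch below is only a heuristic and not part of Flashlight: it assumes that a GPU runtime exposes the `nvidia-smi` tool on `PATH`, and the helper name `detect_backend` is made up for this example.

```python
import shutil


def detect_backend(which=shutil.which):
    """Guess the backend: "CUDA" if nvidia-smi is on PATH, otherwise "CPU".

    Assumption: a GPU runtime ships the nvidia-smi command-line tool.
    `which` is injectable so the logic can be checked without a GPU.
    """
    return "CUDA" if which("nvidia-smi") else "CPU"


suggested_backend = detect_backend()
print("Suggested backend:", suggested_backend)
```

Treat the result as a suggestion only and keep setting `backend` explicitly in the form above if the guess looks wrong.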

In [None]:
# export necessary env variables
%env MKLROOT=/opt/intel/mkl
%env ArrayFire_DIR=/opt/arrayfire/share/ArrayFire/cmake
%env DNNL_DIR=/opt/dnnl/dnnl_lnx_2.0.0_cpu_iomp/lib/cmake/dnnl

if backend == "CUDA":
  # Total time: ~13 minutes
  !cd flashlight && mkdir -p build && cd build && \
  cmake .. -DCMAKE_BUILD_TYPE=Release \
           -DFL_BUILD_TESTS=OFF \
           -DFL_BUILD_EXAMPLES=OFF \
           -DFL_BUILD_APP_IMGCLASS=OFF \
           -DFL_BUILD_APP_LM=OFF && \
  make -j$(nproc)
elif backend == "CPU":
  # Total time: ~14 minutes
  !cd flashlight && mkdir -p build && cd build && \
  cmake .. -DFL_BACKEND=CPU \
           -DCMAKE_BUILD_TYPE=Release \
           -DFL_BUILD_TESTS=OFF \
           -DFL_BUILD_EXAMPLES=OFF \
           -DFL_BUILD_APP_IMGCLASS=OFF \
           -DFL_BUILD_APP_LM=OFF && \
  make -j$(nproc)
else:
  raise ValueError(f"Unknown backend {backend}")


Let's take a look around.

In [None]:
# Binaries are located in
!ls flashlight/build/bin/asr

fl_asr_decode  fl_asr_train		     fl_asr_tutorial_inference_ctc
fl_asr_test    fl_asr_tutorial_finetune_ctc


## Inference: preparation steps


### Download Models
Download the acoustic model, the language model, the tokens file (defines the set of predicted tokens), and the lexicon file (maps each word to its token sequence and restricts the beam search to words from the lexicon).
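
To make the role of the lexicon concrete, here is a minimal sketch of how such a file could be read into a word-to-spellings mapping. It assumes each line holds a word followed by its whitespace-separated tokens; the two sample lines are invented for illustration and are not copied from the downloaded `lexicon.txt`.

```python
from collections import defaultdict


def parse_lexicon(lines):
    """Map each word to the list of its token spellings.

    Assumed line format: <word> <token> <token> ...
    A word may appear on several lines (one line per spelling).
    """
    lexicon = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word, tokens = parts[0], parts[1:]
        lexicon[word].append(tokens)
    return dict(lexicon)


# Invented sample lines, for illustration only
sample_lines = ["hello\th e l l o |", "hi\th i |", ""]
lexicon = parse_lexicon(sample_lines)
print(lexicon["hello"])
```

With a mapping like this, a lexicon-constrained beam search can only emit token sequences that spell one of the listed words.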

In [None]:
!wget https://dl.fbaipublicfiles.com/wav2letter/rasr/tutorial/am_transformer_ctc_stride3_letters_300Mparams.bin
!wget https://dl.fbaipublicfiles.com/wav2letter/rasr/tutorial/am_conformer_ctc_stride3_letters_25Mparams.bin
!wget https://dl.fbaipublicfiles.com/wav2letter/rasr/tutorial/tokens.txt
!wget https://dl.fbaipublicfiles.com/wav2letter/rasr/tutorial/lexicon.txt
!wget https://dl.fbaipublicfiles.com/wav2letter/rasr/tutorial/lm_common_crawl_small_4gram_prun0-6-15_200kvocab.bin
!mkdir audio
for i in range(5):
  path = "https://dl.fbaipublicfiles.com/wav2letter/rasr/tutorial/audio/116-288045-000{}.flac".format(i)
  !cd audio && wget $path

--2020-12-23 20:36:28--  https://dl.fbaipublicfiles.com/wav2letter/rasr/tutorial/am_transformer_ctc_stride3_letters_300Mparams.bin
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 104.22.75.142, 104.22.74.142, 172.67.9.4, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|104.22.75.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2147270937 (2.0G) [application/octet-stream]
Saving to: ‘am_transformer_ctc_stride3_letters_300Mparams.bin’


2020-12-23 20:37:45 (27.0 MB/s) - ‘am_transformer_ctc_stride3_letters_300Mparams.bin’ saved [2147270937/2147270937]

--2020-12-23 20:37:45--  https://dl.fbaipublicfiles.com/wav2letter/rasr/tutorial/am_conformer_ctc_stride3_letters_25Mparams.bin
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 104.22.75.142, 104.22.74.142, 172.67.9.4, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|104.22.75.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 

### Install dependencies to record/process audio

In [None]:
!apt-get install sox
!pip install ffmpeg-python sox

### Helper functions for inference

Define helper functions to run the inference binary as a subprocess.

In [None]:
import os
import signal
from subprocess import Popen, PIPE  


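def read_current_output(process):
    # Echo the binary's stderr until it prompts for the next audio path
    while True:
        output = process.stderr.readline()
        if not output:
            # EOF: the process has exited, so stop instead of looping forever
            break
        print(output.decode().strip())
        if "Waiting the input in the format" in output.decode():
            break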


def create_process(cmd):
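    # os.setsid starts a new process group so that the whole group
    # can be terminated later with os.killpg
    process = Popen(cmd,
                    stdin=PIPE, stdout=PIPE, stderr=PIPE,
                    shell=True, preexec_fn=os.setsid)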
    read_current_output(process)
    return process


def run_inference(audio_path, process):
    process.stdin.write("{}\n".format(audio_path).encode())
    process.stdin.flush()
    read_current_output(process)
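
The helpers rely on a simple handshake: the binary logs to stderr, prints a "Waiting the input in the format" prompt, and then reads one audio path per line from stdin. The sketch below reproduces that handshake with a stand-in child process written in Python, so it can be tried before the real binary is built. The child script and the name `read_until_prompt` are made up for this illustration, and the stand-in only echoes the path instead of running a model.

```python
import subprocess
import sys

PROMPT = "Waiting the input in the format"

# Stand-in for the inference binary: prompt on stderr, echo each received path
CHILD = r'''
import sys
print("Waiting the input in the format [audio_path].", file=sys.stderr, flush=True)
for line in sys.stdin:
    print("predicted output for " + line.strip(), file=sys.stderr, flush=True)
    print("Waiting the input in the format [audio_path].", file=sys.stderr, flush=True)
'''


def read_until_prompt(proc):
    """Collect stderr lines until the prompt appears or the child exits."""
    lines = []
    while True:
        raw = proc.stderr.readline()
        if not raw:  # EOF: the child has exited
            break
        text = raw.decode().strip()
        lines.append(text)
        if PROMPT in text:
            break
    return lines


proc = subprocess.Popen([sys.executable, "-c", CHILD],
                        stdin=subprocess.PIPE, stderr=subprocess.PIPE)
startup = read_until_prompt(proc)          # wait for the first prompt
proc.stdin.write(b"audio/example.flac\n")  # send one path, newline-terminated
proc.stdin.flush()
reply = read_until_prompt(proc)            # read everything up to the next prompt
proc.stdin.close()                         # closing stdin lets the child exit
proc.wait()
proc.stderr.close()
print(reply)
```

Keeping one long-lived process avoids reloading the model for every audio file, which is why the tutorial talks to the binary over pipes instead of launching it per request.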

### Run the inference process with a model

We use the best parameters we found on the validation sets of the training data with the language model provided in this tutorial. You can experiment with `beam_size` (increasing it also increases inference time), `lm_weight`, and `word_score`.
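
As a rough intuition for `lm_weight` and `word_score`, a beam search decoder of this kind ranks hypotheses by combining the acoustic score with a weighted language model score plus a per-word bonus. The sketch below is a simplified illustration, not Flashlight's actual implementation (real decoders typically include further terms), and all numbers in it are invented.

```python
def hypothesis_score(am_score, lm_log_prob, num_words,
                     lm_weight=1.5, word_score=0.0):
    """Simplified hypothesis score: acoustic + weighted LM + per-word bonus."""
    return am_score + lm_weight * lm_log_prob + word_score * num_words


# Two invented hypotheses: (acoustic score, LM log-probability, word count)
hyp_a = (-10.0, -4.0, 2)
hyp_b = (-9.5, -6.0, 3)

for lm_weight in (0.0, 1.5):
    score_a = hypothesis_score(*hyp_a, lm_weight=lm_weight)
    score_b = hypothesis_score(*hyp_b, lm_weight=lm_weight)
    best = "A" if score_a > score_b else "B"
    print(f"lm_weight={lm_weight}: A={score_a}, B={score_b}, best={best}")
```

In this toy example hypothesis B wins when the language model is ignored, while A wins at `lm_weight=1.5`, which is the kind of trade-off these parameters control.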

In [None]:
# To try the small model, switch to am_conformer_ctc_stride3_letters_25Mparams.bin
# and also set lm_weight=2 and word_score=0 for it
inference_cmd = """./flashlight/build/bin/asr/fl_asr_tutorial_inference_ctc \
  --am_path=am_transformer_ctc_stride3_letters_300Mparams.bin \
  --tokens_path=tokens.txt \
  --lexicon_path=lexicon.txt \
  --lm_path=lm_common_crawl_small_4gram_prun0-6-15_200kvocab.bin \
  --logtostderr=true \
  --sample_rate=16000 \
  --beam_size=50 \
  --beam_size_token=30 \
  --beam_threshold=100 \
  --lm_weight=1.5 \
  --word_score=0"""
inference_process = create_process(inference_cmd)

I1223 21:06:02.916659  9835 InferenceCTC.cpp:65] Gflags after parsing
--flagfile=;--fromenv=;--tryfromenv=;--undefok=;--tab_completion_columns=80;--tab_completion_word=;--help=false;--helpfull=false;--helpmatch=;--helpon=;--helppackage=false;--helpshort=false;--helpxml=false;--version=false;--am_path=am_transformer_ctc_stride3_letters_300Mparams.bin;--audio_list=;--beam_size=50;--beam_size_token=30;--beam_threshold=100;--lexicon_path=lexicon.txt;--lm_path=lm_common_crawl_small_4gram_prun0-6-15_200kvocab.bin;--lm_weight=1.5;--sample_rate=16000;--tokens_path=tokens.txt;--word_score=0;--alsologtoemail=;--alsologtostderr=false;--colorlogtostderr=false;--drop_log_memory=true;--log_backtrace_at=;--log_dir=;--log_link=;--log_prefix=true;--logbuflevel=0;--logbufsecs=30;--logemaillevel=999;--logfile_mode=436;--logmailer=/bin/mail;--logtostderr=true;--max_log_size=1800;--minloglevel=0;--stderrthreshold=2;--stop_logging_if_full_disk=false;--symbolize_stacktrace=true;--v=0;--vmodule=;
I1223 21:06:

## Inference: record audio from your microphone and run inference




### Let's record!

In [None]:
from flashlight.scripts.colab.record import record_audio
record_audio("recorded_audio")

output_file: recorded_audio.wav already exists and will be overwritten on build


### Let's run inference on the audio file you have just recorded

In [None]:
run_inference("recorded_audio.wav", inference_process)

I1223 21:06:12.590214  9835 InferenceCTC.cpp:284] [Inference tutorial for CTC]: predicted output for recorded_audio.wav
my my
I1223 21:06:12.590270  9835 InferenceCTC.cpp:233] [Inference tutorial for CTC]: Waiting the input in the format [audio_path].
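
If you want the transcription as a Python value rather than as printed log text, the output can be parsed. The sketch below assumes the layout shown above: a log line ending in `predicted output for <path>`, followed by one line with the transcription. The helper name `parse_predictions` is made up for this example.

```python
import re

# Matches the log line that announces a prediction and captures the audio path
PREDICTION_RE = re.compile(r"predicted output for (?P<path>\S+)\s*$")


def parse_predictions(log_lines):
    """Return {audio_path: transcription} from the inference log lines."""
    predictions = {}
    pending_path = None
    for line in log_lines:
        match = PREDICTION_RE.search(line)
        if match:
            pending_path = match.group("path")
        elif pending_path is not None:
            # The line right after the announcement holds the transcription
            predictions[pending_path] = line.strip()
            pending_path = None
    return predictions


log_lines = [
    "I1223 21:06:12.590214  9835 InferenceCTC.cpp:284] [Inference tutorial for CTC]: predicted output for recorded_audio.wav",
    "my my",
    "I1223 21:06:12.590270  9835 InferenceCTC.cpp:233] [Inference tutorial for CTC]: Waiting the input in the format [audio_path].",
]
print(parse_predictions(log_lines))
```

To use it with the helpers in this notebook, `read_current_output` would need to collect the lines it reads instead of only printing them.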


### Finish the process to release memory

You can skip this step if you still want to use this process.

In [None]:
os.killpg(os.getpgid(inference_process.pid), signal.SIGTERM)

## Inference: run inference on a set of audio files listed in a text file



### First, prepare a file listing all audio paths

In [None]:
!ls audio/*.flac > audio.lst

In [None]:
!cat audio.lst

audio/116-288045-0000.flac
audio/116-288045-0001.flac
audio/116-288045-0002.flac
audio/116-288045-0003.flac
audio/116-288045-0004.flac
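
The same list can be produced in Python, which is handy when the paths need filtering or sorting beyond what `ls` offers. This is a minimal sketch; `write_audio_list` is a made-up helper name, and the demo runs in a temporary directory with empty placeholder files so it does not touch the downloaded audio.

```python
import tempfile
from pathlib import Path


def write_audio_list(audio_dir, list_path, pattern="*.flac"):
    """Write one audio path per line, sorted by name, and return the paths."""
    paths = sorted(str(p) for p in Path(audio_dir).glob(pattern))
    Path(list_path).write_text("".join(p + "\n" for p in paths))
    return paths


# Demo with empty placeholder files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "audio"
    demo_dir.mkdir()
    for name in ("b.flac", "a.flac", "notes.txt"):
        (demo_dir / name).touch()
    listed = write_audio_list(demo_dir, Path(tmp) / "audio.lst")
    contents = (Path(tmp) / "audio.lst").read_text()
    print([Path(p).name for p in listed])
```

In this notebook the call would be `write_audio_list("audio", "audio.lst")`.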


### Run inference on all audio files from this list

In [None]:
!./flashlight/build/bin/asr/fl_asr_tutorial_inference_ctc \
  --am_path=am_transformer_ctc_stride3_letters_300Mparams.bin \
  --tokens_path=tokens.txt \
  --lexicon_path=lexicon.txt \
  --lm_path=lm_common_crawl_small_4gram_prun0-6-15_200kvocab.bin \
  --logtostderr=true \
  --sample_rate=16000 \
  --beam_size=50 \
  --beam_size_token=30 \
  --beam_threshold=100 \
  --lm_weight=1.5 \
  --word_score=0 \
  --audio_list=audio.lst

I1223 21:06:26.336019  9848 InferenceCTC.cpp:65] Gflags after parsing
--flagfile=;--fromenv=;--tryfromenv=;--undefok=;--tab_completion_columns=80;--tab_completion_word=;--help=false;--helpfull=false;--helpmatch=;--helpon=;--helppackage=false;--helpshort=false;--helpxml=false;--version=false;--am_path=am_transformer_ctc_stride3_letters_300Mparams.bin;--audio_list=audio.lst;--beam_size=50;--beam_size_token=30;--beam_threshold=100;--lexicon_path=lexicon.txt;--lm_path=lm_common_crawl_small_4gram_prun0-6-15_200kvocab.bin;--lm_weight=1.5;--sample_rate=16000;--tokens_path=tokens.txt;--word_score=0;--alsologtoemail=;--alsologtostderr=false;--colorlogtostderr=false;--drop_log_memory=true;--log_backtrace_at=;--log_dir=;--log_link=;--log_prefix=true;--logbuflevel=0;--logbufsecs=30;--logemaillevel=999;--logfile_mode=436;--logmailer=/bin/mail;--logtostderr=true;--max_log_size=1800;--minloglevel=0;--stderrthreshold=2;--stop_logging_if_full_disk=false;--symbolize_stacktrace=true;--v=0;--vmodule=;
I12

## Congrats, you reached the end!
![title](https://media1.giphy.com/media/3otPoS81loriI9sO8o/giphy.gif)
## Happy Holidays!


## Bonus Alignment: coming soon