<a href="https://colab.research.google.com/github/bhattacharya5/SpeechUnderstanding/blob/main/3_MMS_ASR_Inference_Colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Running MMS-ASR inference in Colab

This notebook shows how to run simple ASR inference with the MMS ASR model.

Credit to epk2112 [(github)](https://github.com/epk2112/fairseq_meta_mms_Google_Colab_implementation)

## 1. Clone fairseq and install the latest version

In [1]:
!mkdir "temp_dir"
!git clone https://github.com/pytorch/fairseq

# Change current working directory
!pwd
%cd "/content/fairseq"
!pip install --editable ./
!pip install tensorboardX


Cloning into 'fairseq'...
remote: Enumerating objects: 35073, done.
remote: Counting objects: 100% (12/12), done.
remote: Compressing objects: 100% (12/12), done.
remote: Total 35073 (delta 0), reused 3 (delta 0), pack-reused 35061
Receiving objects: 100% (35073/35073), 25.12 MiB | 21.64 MiB/s, done.
Resolving deltas: 100% (25481/25481), done.
/content
/content/fairseq
Obtaining file:///content/fairseq
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Collecting hydra-core<1.1,>=1.0.7 (from fairseq==0.12.2)
  Downloading hydra_core-1.0.7-py3-none-any.whl (123 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 123.8/123.8 kB 4.4 MB/s eta 0:00:00
Collecting omegaconf<2.1 (from fairseq==0.12.2)
  Downloading omegaconf-2.0.6-py3-no

## 2. Download MMS model
Uncomment the line for the model you want to download.
In this example, we use MMS-1B:FL102 for demo purposes.
For better model quality and language coverage, you can use the MMS-1B-all model instead. It requires more RAM, so use Colab Pro rather than the free tier.
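
Because these checkpoints are several gigabytes, an interrupted download can leave a truncated file that only fails later, when the model is loaded. The following is a minimal sketch of a size check. The helper name `download_is_complete` is hypothetical, and the expected byte count is simply the `Length:` value that `wget` reports for `mms1b_fl102.pt` in this run, so it is an assumption that the hosted file has not changed since.

```python
import os


def download_is_complete(path: str, expected_bytes: int) -> bool:
    """Return True if `path` is a file of exactly `expected_bytes` bytes."""
    return os.path.isfile(path) and os.path.getsize(path) == expected_bytes


# Size reported by wget for mms1b_fl102.pt in this run
# (assumption: the hosted checkpoint is unchanged).
EXPECTED_FL102_BYTES = 4851043301

print(
    "checkpoint complete:",
    download_is_complete("./models_new/mms1b_fl102.pt", EXPECTED_FL102_BYTES),
)
```

If the check prints `False`, re-running the `wget` cell (optionally with `-c` to resume) is usually enough.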


In [2]:
# MMS-1B:FL102 model - 102 Languages - FLEURS Dataset
!wget -P ./models_new 'https://dl.fbaipublicfiles.com/mms/asr/mms1b_fl102.pt'

# # MMS-1B:L1107 - 1107 Languages - MMS-lab Dataset
# !wget -P ./models_new 'https://dl.fbaipublicfiles.com/mms/asr/mms1b_l1107.pt'

# # MMS-1B-all - 1162 Languages - MMS-lab + FLEURS + CV + VP + MLS
# !wget -P ./models_new 'https://dl.fbaipublicfiles.com/mms/asr/mms1b_all.pt'

--2024-01-28 07:27:14--  https://dl.fbaipublicfiles.com/mms/asr/mms1b_fl102.pt
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 13.227.219.70, 13.227.219.10, 13.227.219.59, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|13.227.219.70|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4851043301 (4.5G) [binary/octet-stream]
Saving to: ‘./models_new/mms1b_fl102.pt’


2024-01-28 07:30:05 (27.3 MB/s) - ‘./models_new/mms1b_fl102.pt’ saved [4851043301/4851043301]



## 3. Prepare audio file
Create a folder at `/content/audio_samples/` and upload the .wav audio files you want to transcribe, e.g. `/content/audio_samples/audio.wav`.

Note: Make sure the audio you use has a sample rate of 16 kHz. You can do this with FFmpeg, as in the example below, which converts an .mp3 file to .wav and sets the sample rate.

Here, we use a FLEURS English MP3 audio file as the example.
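
To confirm that a converted file really is 16 kHz before running inference, the sample rate can be read from the WAV header. This is a small sketch using Python's standard `wave` module. The function names are hypothetical, and it assumes a PCM-encoded WAV file, which is what `ffmpeg` writes by default for `.wav` output.

```python
import wave


def wav_sample_rate(path: str) -> int:
    """Return the sample rate (in Hz) stored in a PCM WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def is_16khz(path: str) -> bool:
    """Return True if the WAV file at `path` is sampled at 16 kHz."""
    return wav_sample_rate(path) == 16000
```

For example, `is_16khz("/content/audio_samples/1.wav")` should return `True` once the `ffmpeg -ar 16000` conversion has been run.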

In [4]:
#!wget -P ./audio_samples/ 'https://datasets-server.huggingface.co/assets/google/fleurs/--/en_us/train/0/audio/audio.mp3'
#!ffmpeg -y -i ./audio_samples/audio.mp3 -ar 16000 ./audio_samples/audio.wav
! mkdir -p /content/audio_samples/

In [8]:
#for key in ["en_us", "hi_in", "cmn_hans_cn"]:
#  !wget -O /content/audio_samples/tmp.mp3 /content/audio_samples/1.mp3
!ffmpeg -hide_banner -loglevel error -y -i   /content/audio_samples/1.mp3 -ar 16000 /content/audio_samples/1.wav
!ffmpeg -hide_banner -loglevel error -y -i   /content/audio_samples/2.mp3 -ar 16000 /content/audio_samples/2.wav

## 4. Run inference and transcribe your audio file(s)


In the example below, we transcribe a sentence in English.

To transcribe other languages:
1. Go to the [MMS README ASR section](https://github.com/facebookresearch/fairseq/tree/main/examples/mms#asr)
2. Open the Supported languages link
3. Find your target language in the Language Name column
4. Copy the corresponding ISO code
5. Replace `--lang "eng"` with the new ISO code

To improve transcription quality, you can use language-model (LM) decoding by following the instructions in [ASR LM decoding](https://github.com/facebookresearch/fairseq/tree/main/examples/mms#asr)
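
When switching between languages it is easy to end up with two `--lang` flags on the same command line, in which case only one of them takes effect. As a sketch, the command can be assembled in Python so that each flag appears exactly once. The helper name `build_mms_infer_command` is hypothetical; the script path and the `--model`, `--lang` and `--audio` flags are the ones used in the cells of this notebook.

```python
import shlex


def build_mms_infer_command(model_path, lang, audio_paths):
    """Assemble the mms_infer.py command for one language and one or more audio files."""
    if not audio_paths:
        raise ValueError("at least one audio path is required")
    return [
        "python",
        "examples/mms/asr/infer/mms_infer.py",
        "--model", model_path,
        "--lang", lang,            # exactly one ISO code per run
        "--audio", *audio_paths,   # one or more 16 kHz .wav files
    ]


cmd = build_mms_infer_command(
    "/content/fairseq/models_new/mms1b_fl102.pt",
    "hin",
    ["/content/audio_samples/1.wav"],
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True, cwd="/content/fairseq")` after setting the environment variables shown in the inference cell.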

In [10]:
import os

os.environ["TMPDIR"] = '/content/temp_dir'
os.environ["PYTHONPATH"] = "."
os.environ["PREFIX"] = "INFER"
os.environ["HYDRA_FULL_ERROR"] = "1"
os.environ["USER"] = "micro"

!python examples/mms/asr/infer/mms_infer.py --model "/content/fairseq/models_new/mms1b_fl102.pt" --lang "eng" --audio "/content/audio_samples/2.wav"


>>> preparing tmp manifest dir ...
>>> loading model & running inference ...
2024-01-28 07:53:41.703729: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-28 07:53:41.703783: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-28 07:53:41.711538: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-28 07:53:41.731857: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with 

In [11]:
# Set --lang to the ISO code of the language spoken in the audio (for example "hin" for Hindi).
!python examples/mms/asr/infer/mms_infer.py --model "/content/fairseq/models_new/mms1b_fl102.pt" --lang "eng" --audio "/content/audio_samples/1.wav"

>>> preparing tmp manifest dir ...
>>> loading model & running inference ...
2024-01-28 07:56:46.169254: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-28 07:56:46.169308: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-28 07:56:46.176140: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-28 07:56:46.192534: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with 

In [12]:
# Set --lang to the ISO code of the language spoken in the audio (for example "hin" for Hindi).
!python examples/mms/asr/infer/mms_infer.py --model "/content/fairseq/models_new/mms1b_fl102.pt" --lang "eng" --audio "/content/audio_samples/2_TTS_Interferance.wav"

>>> preparing tmp manifest dir ...
>>> loading model & running inference ...
2024-01-28 07:59:00.768186: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-28 07:59:00.768237: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-28 07:59:00.774992: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-28 07:59:00.792074: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with 

## 5. Beam search decoding using a language model and transcribing audio file(s)


Since MMS is a CTC model, we can further improve the accuracy by running beam search decoding using a language model.
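
To see why a language model can help, it is useful to recall what plain (greedy) CTC decoding does: it takes the most likely token for each audio frame, collapses consecutive repeats, and drops the blank symbol. No knowledge of word sequences is involved. The sketch below illustrates that rule on a toy example. The vocabulary and the blank index `0` are made up for illustration and are not the MMS dictionary.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, blank_id=0):
    """Collapse consecutive repeats, then drop blanks (CTC best-path rule)."""
    collapsed = [token for token, _ in groupby(frame_ids)]
    return [token for token in collapsed if token != blank_id]


# Toy per-frame argmax output over a hypothetical vocabulary.
vocab = {0: "<blank>", 1: "c", 2: "a", 3: "t"}
frames = [1, 1, 0, 2, 2, 2, 0, 0, 3, 3]
print("".join(vocab[i] for i in ctc_greedy_decode(frames)))  # prints "cat"
```

Beam search with an LM instead keeps several candidate transcripts alive and ranks them using both the acoustic scores and the language model's score.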

While we have not open-sourced the language models used in MMS (yet!), we have provided details of the data and the commands used to train the LMs in the Appendix of our paper.


For this tutorial, we will use an alternative English language model based on Common Crawl data, which has been made publicly available through the efforts of [Likhomanenko, Tatiana, et al. "Rethinking evaluation in asr: Are our models robust enough?."](https://arxiv.org/abs/2010.11745). The language model can be accessed from the GitHub repository [here](https://github.com/flashlight/wav2letter/tree/main/recipes/rasr).

In [7]:
!mkdir -p /content/lmdecode

!wget -P /content/lmdecode  https://dl.fbaipublicfiles.com/wav2letter/rasr/tutorial/lm_common_crawl_small_4gram_prun0-6-15_200kvocab.bin # smaller LM
!wget -P /content/lmdecode  https://dl.fbaipublicfiles.com/wav2letter/rasr/tutorial/lexicon.txt

--2024-01-28 07:32:39--  https://dl.fbaipublicfiles.com/wav2letter/rasr/tutorial/lm_common_crawl_small_4gram_prun0-6-15_200kvocab.bin
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 13.227.219.10, 13.227.219.33, 13.227.219.70, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|13.227.219.10|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2627163608 (2.4G) [application/octet-stream]
Saving to: ‘/content/lmdecode/lm_common_crawl_small_4gram_prun0-6-15_200kvocab.bin’


2024-01-28 07:34:09 (27.9 MB/s) - ‘/content/lmdecode/lm_common_crawl_small_4gram_prun0-6-15_200kvocab.bin’ saved [2627163608/2627163608]

--2024-01-28 07:34:09--  https://dl.fbaipublicfiles.com/wav2letter/rasr/tutorial/lexicon.txt
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 13.227.219.10, 13.227.219.33, 13.227.219.59, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|13.227.219.10|:443... connected.
HTTP request sent, awaiting response...


Install decoder bindings from [flashlight](https://github.com/flashlight/flashlight)


In [None]:
# Taken from https://github.com/flashlight/flashlight/blob/main/scripts/colab/colab_install_deps.sh
# Install dependencies from apt
! sudo apt-get install -y libfftw3-dev libsndfile1-dev libgoogle-glog-dev libopenmpi-dev libboost-all-dev
# Install KenLM
! cd /tmp && git clone https://github.com/kpu/kenlm && cd kenlm && mkdir build && cd build && cmake .. -DCMAKE_BUILD_TYPE=Release && make install -j$(nproc)

# Install Intel MKL 2020
! cd /tmp && wget https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS-2019.PUB && \
    apt-key add GPG-PUB-KEY-INTEL-SW-PRODUCTS-2019.PUB
! sh -c 'echo deb https://apt.repos.intel.com/mkl all main > /etc/apt/sources.list.d/intel-mkl.list' && \
    apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends intel-mkl-64bit-2020.0-088
# Remove existing MKL libs to avoid double linkage
! rm -rf /usr/local/lib/libmkl*


Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
libboost-all-dev is already the newest version (1.74.0.3ubuntu7).
libopenmpi-dev is already the newest version (4.1.2-2ubuntu1).
libsndfile1-dev is already the newest version (1.0.31-2ubuntu0.1).
The following additional packages will be installed:
  libfftw3-bin libfftw3-double3 libfftw3-long3 libfftw3-quad3 libfftw3-single3
  libgflags-dev libgflags2.2 libgoogle-glog0v5 libunwind-dev
Suggested packages:
  libfftw3-doc
The following NEW packages will be installed:
  libfftw3-bin libfftw3-dev libfftw3-double3 libfftw3-long3 libfftw3-quad3
  libfftw3-single3 libgflags-dev libgflags2.2 libgoogle-glog-dev
  libgoogle-glog0v5 libunwind-dev
0 upgraded, 11 newly installed, 0 to remove and 31 not upgraded.
Need to get 6,861 kB of archives.
After this operation, 32.4 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/main amd64 libfftw3-double3 amd64 3.3.8-2ubunt

In [None]:
! rm -rf flashlight
! git clone --recursive https://github.com/flashlight/flashlight.git
%cd flashlight
! git checkout 035ead6efefb82b47c8c2e643603e87d38850076
%cd bindings/python
! python3 setup.py install

%cd /content/fairseq

Cloning into 'flashlight'...
remote: Enumerating objects: 25857, done.
remote: Counting objects: 100% (45/45), done.
remote: Compressing objects: 100% (39/39), done.
remote: Total 25857 (delta 11), reused 40 (delta 6), pack-reused 25812
Receiving objects: 100% (25857/25857), 15.82 MiB | 20.66 MiB/s, done.
Resolving deltas: 100% (18540/18540), done.
/content/fairseq/flashlight
Note: switching to '035ead6efefb82b47c8c2e643603e87d38850076'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at 035ead6e Advan

Next, we download an audio file from the [People's Speech](https://huggingface.co/datasets/MLCommons/peoples_speech) dataset. We will use an audio sample from its 'dirty' subset, which is more challenging for the ASR model.

In [None]:
#!wget -O ./audio_samples/tmp.wav 'https://datasets-server.huggingface.co/assets/MLCommons/peoples_speech/--/dirty/train/0/audio/audio.wav'
!ffmpeg -y -i /content/audio_samples/1.wav -ar 16000 /content/audio_samples/audio_noisy1.wav
!ffmpeg -y -i /content/audio_samples/2.wav -ar 16000 /content/audio_samples/audio_noisy2.wav


ffmpeg version 4.4.2-0ubuntu0.22.04.1 Copyright (c) 2000-2021 the FFmpeg developers
  built with gcc 11 (Ubuntu 11.2.0-19ubuntu1)
  configuration: --prefix=/usr --extra-version=0ubuntu0.22.04.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enab

Let's listen to the audio file


In [None]:
import IPython
IPython.display.display(IPython.display.Audio("./audio_samples/audio_noisy.wav"))
print("Transcript: limiting emotions that we experience mainly in our childhood which stop us from living our life just open freedom i mean trust and")

Transcript: limiting emotions that we experience mainly in our childhood which stop us from living our life just open freedom i mean trust and


Run inference with both greedy decoding and LM decoding
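
The LM decoding run in the cell below passes its settings to the script through `--extra-infer-args` as space-separated `decoding.key=value` pairs. As a sketch, the same string could be built from a dictionary, which keeps the options easy to edit and avoids the leading and trailing spaces that the triple-quoted string produces. The helper name `build_decoding_overrides` is hypothetical; the keys and values are copied from the cell below. In lexicon-based beam search decoders of this kind, `lmweight` typically scales the language model score and `wordscore` is a bonus added per emitted word, which is why both need tuning for each LM.

```python
def build_decoding_overrides(options):
    """Join decoding options into space-separated `decoding.key=value` pairs."""
    return " ".join(f"decoding.{key}={value}" for key, value in options.items())


# Values copied from the LM decoding cell; tune lmweight and wordscore per LM.
lm_options = {
    "type": "kenlm",
    "beam": 500,
    "beamsizetoken": 50,
    "lmweight": 2.69,
    "wordscore": 2.8,
    "lmpath": "/content/lmdecode/lm_common_crawl_small_4gram_prun0-6-15_200kvocab.bin",
    "lexicon": "/content/lmdecode/lexicon.txt",
}
overrides = build_decoding_overrides(lm_options)
print(overrides)
```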

In [None]:
import os

os.environ["TMPDIR"] = '/content/temp_dir'
os.environ["PYTHONPATH"] = "."
os.environ["PREFIX"] = "INFER"
os.environ["HYDRA_FULL_ERROR"] = "1"
os.environ["USER"] = "micro"

print("======= WITHOUT LM DECODING=======")

!python examples/mms/asr/infer/mms_infer.py --model "/content/fairseq/models_new/mms1b_fl102.pt" --lang "eng" --audio "/content/fairseq/audio_samples/audio.wav" "/content/fairseq/audio_samples/audio_noisy.wav"

print("\n\n\n======= WITH LM DECODING=======")

# Note that lmweight and wordscore need to be tuned for each LM
# Reusing the values below with a different LM may not be optimal
decoding_cmds = """
decoding.type=kenlm
decoding.beam=500
decoding.beamsizetoken=50
decoding.lmweight=2.69
decoding.wordscore=2.8
decoding.lmpath=/content/lmdecode/lm_common_crawl_small_4gram_prun0-6-15_200kvocab.bin
decoding.lexicon=/content/lmdecode/lexicon.txt
""".replace("\n", " ")
!python examples/mms/asr/infer/mms_infer.py --model "/content/fairseq/models_new/mms1b_fl102.pt" --lang "eng" --audio "/content/fairseq/audio_samples/audio.wav" "/content/fairseq/audio_samples/audio_noisy.wav" \
    --extra-infer-args '{decoding_cmds}'


>>> preparing tmp manifest dir ...
>>> loading model & running inference ...
2024-01-27 18:45:24.496849: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-27 18:45:24.496964: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-27 18:45:24.500326: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-27 18:45:24.521853: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appr