# Fusing LM with Whisper for lower WER
The aim is to fuse BPE-level LM scores with the scores of the generated tokens during beam-search decoding in Whisper.

## **MILESTONE 1**:
Instantiate a language model to be integrated with Whisper.
The chosen LM is an n-gram language model trained with the [KenLM](https://github.com/kpu/kenlm) library.

### Step 1:
Write code that runs an existing LM standalone and scores any input sequence.

**Chosen Model**: [Riva ASR Hindi LM](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/models/speechtotext_hi_in_lm/files?version=deployable_v3.1)


Download and build the KenLM toolkit

In [None]:
!wget -O - https://kheafield.com/code/kenlm.tar.gz | tar xz
!mkdir kenlm/build && cd kenlm/build && cmake .. && make -j2
!ls kenlm/build/bin

--2023-11-04 13:14:44--  https://kheafield.com/code/kenlm.tar.gz
Resolving kheafield.com (kheafield.com)... 35.196.63.85
Connecting to kheafield.com (kheafield.com)|35.196.63.85|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 491888 (480K) [application/x-gzip]
Saving to: ‘STDOUT’


2023-11-04 13:14:45 (1.68 MB/s) - written to stdout [491888/491888]

  Compatibility with CMake < 3.5 will be removed from a future version of
  CMake.

  Update the VERSION argument <min> value or use a ...<max> suffix to tell
  CMake that the project does not need compatibility with older versions.

-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check fo

Install the KenLM Python library

❗❗**Don't forget to restart the runtime after running this cell** ❗❗


In [5]:
!pip install https://github.com/kpu/kenlm/archive/master.zip

Collecting https://github.com/kpu/kenlm/archive/master.zip
  Downloading https://github.com/kpu/kenlm/archive/master.zip
     553.6 kB 3.8 MB/s 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: kenlm
  Building wheel for kenlm (pyproject.toml) ... [?25l[?25hdone
  Created wheel for kenlm: filename=kenlm-0.2.0-cp310-cp310-linux_x86_64.whl size=3184306 sha256=6a7e20994a5b1af9df08981a429943e3ea93f0c66de5ea8750ce83297ba946de
  Stored in directory: /tmp/pip-ephem-wheel-cache-_66qe1hq/wheels/a5/73/ee/670fbd0cee8f6f0b21d10987cb042291e662e26e1a07026462
Successfully built kenlm
Installing collected packages: kenlm
Successfully installed kenlm-0.2.0


In [None]:
# Download the binary LM from Gdrive
import gdown
url = "https://drive.google.com/uc?id=1-AspJVZRXcrFMuLKx8C4N7uo9PnQytcB"
output = "language_model_3p0.bin"
gdown.download(url, output, quiet=False)

Downloading...
From: https://drive.google.com/uc?id=1-AspJVZRXcrFMuLKx8C4N7uo9PnQytcB
To: /content/language_model_3p0.bin
100%|██████████| 280M/280M [00:03<00:00, 78.8MB/s]


'language_model_3p0.bin'

In [None]:
import kenlm
model = kenlm.LanguageModel('/content/language_model_3p0.bin')
print("This is a {}-gram model".format(model.order))

This is a 4-gram model


In [None]:
# Two pairs of Hindi sentences follow (the second pair is in the next cell). The sentences in each pair
# sound exactly the same, but one of them is incorrect (lexically or semantically).
# Generated using Bing Chat
book_correct = "मुझे यह किताब पसंद है।"
book_incorrect = "मुझे यह किताब पसन्द है।"

correct_score = model.score(book_correct)
incorrect_score = model.score(book_incorrect)
assert correct_score > incorrect_score
print(correct_score, incorrect_score)

-20.935253143310547 -21.663846969604492


In [None]:
sings_correct = "वह बहुत अच्छा गाता है।"
sings_incorrect = "वह बहुत अच्छा घाता है।"


correct_score = model.score(sings_correct)
incorrect_score = model.score(sings_incorrect)
assert correct_score > incorrect_score
print(correct_score, incorrect_score)

-19.827430725097656 -22.76061248779297
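
`model.score` returns the total log10 probability of a sentence, so the raw numbers above are only directly comparable because both sentences in a pair have the same number of words. A per-word perplexity is easier to compare across lengths. The sketch below derives it from the scores printed above; `perplexity_from_score` is a hypothetical helper, and it assumes the score was computed with KenLM's default `bos=True, eos=True`. KenLM's Python module also exposes a `perplexity` method that should give the same number when called on a loaded model.

```python
def perplexity_from_score(log10_score, sentence):
    """Convert a KenLM sentence score (log10 probability) to per-word perplexity."""
    n_words = len(sentence.split()) + 1  # +1 for the end-of-sentence token
    return 10.0 ** (-log10_score / n_words)


# Sentences and scores copied from the cells above.
book_correct = "मुझे यह किताब पसंद है।"
book_incorrect = "मुझे यह किताब पसन्द है।"
ppl_correct = perplexity_from_score(-20.935253143310547, book_correct)
ppl_incorrect = perplexity_from_score(-21.663846969604492, book_incorrect)
print(ppl_correct, ppl_incorrect)
```

Lower perplexity means the LM finds the sentence more plausible, so the ordering here matches the `assert` on the raw scores.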


In [10]:
# download the original arpa LM file for inspection
url = "https://drive.google.com/uc?id=1-4xQ3YCtsyONtpccGjOD1s9FtHqBX7RL"
output = "language_model_3p0.arpa"
gdown.download(url, output, quiet=False)

Downloading...
From: https://drive.google.com/uc?id=1-4xQ3YCtsyONtpccGjOD1s9FtHqBX7RL
To: /content/language_model_3p0.arpa
100%|██████████| 3.18G/3.18G [00:36<00:00, 88.3MB/s]


'language_model_3p0.arpa'

In [11]:
# inspect the last 20 lines inside the LM source
!tail -20 language_model_3p0.arpa

-1.2548329	<s> दोनों ही झल्लाये
-0.2760634	<s> चौधरी के अशुभचिंतकों
-0.04973512	डालियों पर बैठी शुकमंडली
-0.07060567	मनुष्यों को उन्हें बेमुरौवत
-0.049646165	और कड़क कर बोलेमेरी
-0.04038189	निराश हो कर कहानहीं
-0.08863469	पड़ते ही वह अव्यवस्थितचित्त
-0.19321889	दोनों पक्षों से सवालजवाब
-0.051110353	झगड़ू साहु ने कहासमझू
-0.20675866	करें तो उनकी भलमनसी
-0.04876329	नीति को सराहता थाइसे
-0.06436408	<s> मित्रता की मुरझायी
-0.23735626	की गहराई से उपजतें
-0.17502813	पूर्णता की ओर बढातें
-0.18197767	जहाँ से अच्छा हिन्दोसिताँ
-0.04437429	हैं इसकी यह गुलसिताँ
-0.06926097	संतरी हमारा वह पासबाँ
-0.09434804	जिनके दम से रश्कएजनाँ

\end\
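
Each line in an n-gram section of an ARPA file holds a log10 probability, the n-gram itself, and (for all but the highest order) a backoff weight, separated by tabs. The lines printed above are 4-grams, the highest order of this model, which is why no backoff column appears. A minimal parsing sketch, where `parse_arpa_ngram` is a hypothetical helper and the tab separator is assumed from the ARPA convention:

```python
def parse_arpa_ngram(line):
    """Split one ARPA n-gram line into (log10_prob, tokens, backoff)."""
    fields = line.rstrip("\n").split("\t")
    log10_prob = float(fields[0])
    tokens = fields[1].split(" ")
    # The backoff column is absent for the highest-order n-grams.
    backoff = float(fields[2]) if len(fields) > 2 else None
    return log10_prob, tokens, backoff


prob, tokens, backoff = parse_arpa_ngram("-1.2548329\t<s> दोनों ही झल्लाये")
print(len(tokens), prob, backoff)  # 4 -1.2548329 None
```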


In [None]:
# some useful KenLM commands for future reference
# generate binary
# !kenlm/build/bin/build_binary dataset_tokenized_3gram.arpa dataset_tokenized_3gram.binary
# create a new LM
# !kenlm/build/bin/lmplz -o 3 --text dataset_tokenized.txt --arpa dataset_tokenized_3gram.arpa --discount_fallback
# !tail -20 dataset_tokenized_3gram.arpa
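
`lmplz` reads whitespace-separated tokens, one sentence per line. For the BPE-level LM this notebook is aiming at, one option is to write every sentence as a sequence of BPE token IDs so that KenLM treats each ID as a "word". The sketch below shows only that text preparation step. `to_lm_line` and `toy_encode` are hypothetical names; in practice `encode` would be a real BPE encoder, for example the `encode` method of the tokenizer returned by `whisper.tokenizer.get_tokenizer` (an assumption about how the pieces would be wired together, not something this notebook does).

```python
def to_lm_line(text, encode):
    """Render one sentence as space-separated BPE token IDs for lmplz."""
    return " ".join(str(token_id) for token_id in encode(text))


def toy_encode(text):
    # Stand-in for a real BPE encoder: one ID per UTF-8 byte.
    return list(text.encode("utf-8"))


lines = [to_lm_line(sentence, toy_encode) for sentence in ["ab", "abc"]]
print(lines)  # ['97 98', '97 98 99']
```

An LM trained this way must be queried with the same ID strings at decoding time, since its vocabulary consists of token IDs rather than words.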

# Integrating the LM with Whisper

In [2]:
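# NOTE: stock openai-whisper's DecodingOptions has no `withlm`, `lm_path`, `lm_alpha`, `lm_beta` or `nbest` fields;
# the decoding cells below need a Whisper fork that adds them (possibly the one commented out here).
# !pip install https://github.com/HKAB/whisper/archive/master.zip
!pip install openai-whisper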

Collecting openai-whisper
  Downloading openai-whisper-20231106.tar.gz (798 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 798.6/798.6 kB 9.3 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting triton==2.0.0 (from openai-whisper)
  Downloading triton-2.0.0-1-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (63.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 63.3/63.3 MB 14.0 MB/s eta 0:00:00
Collecting tiktoken (from openai-whisper)
  Downloading tiktoken-0.5.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.0/2.0 MB 77.1 MB/s eta 0:00:00
Collecting lit (from triton==2.0.0->openai-whisper)
  Downloading lit-17.0.4.tar.gz (153 kB)

In [1]:
import whisper
import torch
import kenlm

In [2]:
model = whisper.load_model("small")

In [3]:
# Download a sample audio file from Gdrive
import gdown
url = "https://drive.google.com/uc?id=1kKeSvrZo8z5Rsp1q-h3GXpG7vHctKMcG"
output = "sample.wav"
gdown.download(url, output, quiet=False)

transcription = "ब्रूड बॉक्स लैंगस्ट्रॉथ छत्ते का एक अनिवार्य हिस्सा है।"

Downloading...
From: https://drive.google.com/uc?id=1kKeSvrZo8z5Rsp1q-h3GXpG7vHctKMcG
To: /content/sample.wav
100%|██████████| 197k/197k [00:00<00:00, 84.9MB/s]


In [3]:
audio = whisper.load_audio("/content/sample.wav")
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio).to(model.device)


## Baseline Whisper
Without n-best output or LM integration.

In [5]:

options = whisper.DecodingOptions(fp16=False, beam_size=5, without_timestamps=True)
result = whisper.decode(model, mel, options)
baseline = result.text
baseline

'ब्रूद बाँक्ष लंश्टोट छत्ते का एक अनिवार्य हिस्सा है'
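
To put a number on how far the baseline is from the reference, the word error rate can be computed with a word-level edit distance. The sketch below is a plain-Python implementation, not a library call; `wer` is a hypothetical helper, the reference is the `transcription` string from the sample cell, and the hypothesis is the baseline output above.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)  # assumes a non-empty reference


reference = "ब्रूड बॉक्स लैंगस्ट्रॉथ छत्ते का एक अनिवार्य हिस्सा है।"
hypothesis = "ब्रूद बाँक्ष लंश्टोट छत्ते का एक अनिवार्य हिस्सा है"
print(wer(reference, hypothesis))
```

The comparison is exact string matching per word, so the trailing danda in the reference counts as a mismatch unless punctuation is normalized first.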

# Decoding with LM integration

In [None]:
# Download the binary LM (same file as in Milestone 1)
import gdown
url = "https://drive.google.com/uc?id=1-AspJVZRXcrFMuLKx8C4N7uo9PnQytcB"
output = "language_model_3p0.bin"
gdown.download(url, output, quiet=False)

In [11]:
options = whisper.DecodingOptions(
    fp16=False, withlm=True, beam_size=5, patience=1.0,
    lm_path="/content/language_model_3p0.bin", lm_alpha=1.0, lm_beta=0.0,
    without_timestamps=True,
)
decoding_withLM = whisper.decode(model, mel, options)


In [12]:
decoding_withLM.text

'ब्रूद बाँक्ष लंश्टोट छत्ते का एक अनिवार्य हिस्सा है'
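
The names `lm_alpha` and `lm_beta` match the convention commonly used for shallow fusion, where each beam hypothesis is ranked by its ASR log-probability plus a weighted LM log-probability and a length bonus. The sketch below illustrates that scoring rule in isolation. It is an assumption about the convention, not the actual implementation behind the `withlm` option; `fused_score` is a hypothetical helper and the beam entries are made-up numbers. KenLM returns log10 probabilities while Whisper's token log-probabilities are natural logs, so the sketch converts before mixing.

```python
import math

LN10 = math.log(10.0)


def fused_score(asr_logprob, lm_log10prob, n_tokens, lm_alpha=1.0, lm_beta=0.0):
    """Shallow-fusion score for one hypothesis (illustrative only)."""
    lm_logprob = lm_log10prob * LN10  # log10 -> natural log
    return asr_logprob + lm_alpha * lm_logprob + lm_beta * n_tokens


# Toy beam: (text, ASR log-prob, LM log10-prob, token count). Numbers are made up.
beam = [
    ("hypothesis A", -4.0, -12.0, 9),
    ("hypothesis B", -4.2, -9.0, 9),
]
for alpha in (0.0, 0.1):
    best = max(beam, key=lambda h: fused_score(h[1], h[2], h[3], lm_alpha=alpha))
    print(alpha, best[0])
```

With `lm_alpha=0.0` the ASR score alone decides and hypothesis A wins; with `lm_alpha=0.1` the LM preference is enough to promote hypothesis B. The LM can only change the output when the beam contains a hypothesis it actually prefers.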

# Decoding with nbest

In [13]:
options = whisper.DecodingOptions(fp16=False, beam_size=5, nbest=True, without_timestamps=True)
nbest = whisper.decode(model, mel, options)

for candidate in nbest:
    print(candidate.text, candidate.avg_logprob)

ब्रूद बाँक्ष लंश्टोट छत्ते का एक अनिवार्य हिस्सा है -0.45290601664576036
ब्रूद बाँक्ष लांश्टोट छत्ते का एक अनिवार्य हिस्सा है -0.4552239240226099
ब्रूद बाँक्ष लंश्टोट छत्ते का एक अनिवार्य हिस्सा है. -0.4729596477443889
ब्रूद बाँक्ष लंश्टोट छत्ते का एक अनिवार्य हिस्सा हैं -0.4831788014557402
ब्रूद बाँक्ष लांश्टोट छत्टे का एक अनिवार्य हिस्सा है -0.46668367385864257


In [14]:
assert nbest[0].text == baseline

### Adding LM rescoring

In [16]:
lm_model = kenlm.LanguageModel('/content/language_model_3p0.bin')
print("This is a {}-gram model".format(lm_model.order))

This is a 4-gram model


In [22]:
nbest_with_lm_score = [(c.text, c.avg_logprob, lm_model.score(c.text)) for c in nbest]
nbest_with_lm_score

[('ब्रूद बाँक्ष लंश्टोट छत्ते का एक अनिवार्य हिस्सा है',
  -0.45290601664576036,
  -36.06303787231445),
 ('ब्रूद बाँक्ष लांश्टोट छत्ते का एक अनिवार्य हिस्सा है',
  -0.4552239240226099,
  -36.06303787231445),
 ('ब्रूद बाँक्ष लंश्टोट छत्ते का एक अनिवार्य हिस्सा है.',
  -0.4729596477443889,
  -44.10427474975586),
 ('ब्रूद बाँक्ष लंश्टोट छत्ते का एक अनिवार्य हिस्सा हैं',
  -0.4831788014557402,
  -37.30411148071289),
 ('ब्रूद बाँक्ष लांश्टोट छत्टे का एक अनिवार्य हिस्सा है',
  -0.46668367385864257,
  -37.54087829589844)]

In [24]:
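# Whisper's avg_logprob is a per-token natural-log average; the KenLM score is a log10 sentence total, so lm_weight also absorbs the scale difference.
lm_weight = 0.01
combined_scores = [(text, whisper_score + lm_weight * lm_score) for text, whisper_score, lm_score in nbest_with_lm_score]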
combined_scores.sort(key=lambda t: t[1], reverse=True)
combined_scores

[('ब्रूद बाँक्ष लंश्टोट छत्ते का एक अनिवार्य हिस्सा है', -0.8135363953689049),
 ('ब्रूद बाँक्ष लांश्टोट छत्ते का एक अनिवार्य हिस्सा है', -0.8158543027457544),
 ('ब्रूद बाँक्ष लांश्टोट छत्टे का एक अनिवार्य हिस्सा है', -0.8420924568176269),
 ('ब्रूद बाँक्ष लंश्टोट छत्ते का एक अनिवार्य हिस्सा हैं', -0.8562199162628692),
 ('ब्रूद बाँक्ष लंश्टोट छत्ते का एक अनिवार्य हिस्सा है.', -0.9140023952419476)]
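
A single `lm_weight` value says little about how sensitive the ranking is, so it helps to sweep the weight. The sketch below wraps the combination in a hypothetical `rescore` helper and reuses the Whisper and KenLM scores printed above, with short labels standing in for the candidate texts (`cand-1` is the first entry of the n-best list, and so on).

```python
def rescore(candidates, lm_weight):
    """Rank (label, asr_score, lm_score) triples by asr_score + lm_weight * lm_score."""
    scored = [(label, asr + lm_weight * lm) for label, asr, lm in candidates]
    return sorted(scored, key=lambda t: t[1], reverse=True)


# (label, Whisper avg_logprob, KenLM score) copied from the n-best output above.
candidates = [
    ("cand-1", -0.45290601664576036, -36.06303787231445),
    ("cand-2", -0.4552239240226099, -36.06303787231445),
    ("cand-3", -0.4729596477443889, -44.10427474975586),
    ("cand-4", -0.4831788014557402, -37.30411148071289),
    ("cand-5", -0.46668367385864257, -37.54087829589844),
]

for lm_weight in (0.0, 0.01, 0.1, 1.0):
    ranking = [label for label, _ in rescore(candidates, lm_weight)]
    print(lm_weight, ranking)
```

For this particular n-best list the first candidate has the best Whisper score and ties for the best LM score, so no non-negative weight can change the winner. Rescoring only helps when the list contains a candidate the LM clearly prefers.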