<a href="https://colab.research.google.com/github/chandan110791/hindiWhisper/blob/main/Whisper_LM_fusion_final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fusing LM with Whisper for lower WER
The aim is to fuse BPE-level LM scores with the scores of the generated tokens during beam-search decoding in Whisper.

## **MILESTONE 1**:
Instantiate a Language Model to be integrated with Whisper.
The chosen LM is an n-gram language model, trained with the [KenLM](https://github.com/kpu/kenlm) library.

### Step 1:
Write code to run an already available LM in a standalone manner and score any input sequence with it.
**Chosen Model**: [Riva ASR Hindi LM](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/models/speechtotext_hi_in_lm/files?version=deployable_v3.1)


Download and build the KenLM toolkit

In [2]:
!wget -O - https://kheafield.com/code/kenlm.tar.gz | tar xz
!mkdir kenlm/build && cd kenlm/build && cmake .. && make -j2
!ls kenlm/build/bin

--2023-11-16 09:27:07--  https://kheafield.com/code/kenlm.tar.gz
Resolving kheafield.com (kheafield.com)... 35.196.63.85
Connecting to kheafield.com (kheafield.com)|35.196.63.85|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 491888 (480K) [application/x-gzip]
Saving to: ‘STDOUT’


2023-11-16 09:27:07 (974 KB/s) - written to stdout [491888/491888]

mkdir: cannot create directory ‘kenlm/build’: File exists
build_binary  filter	kenlm_benchmark  phrase_table_vocab	       query
count_ngrams  fragment	lmplz		 probing_hash_table_benchmark


Download the KenLM Python library

❗❗**Don't forget to restart the runtime after running this cell** ❗❗


In [26]:
!pip install https://github.com/kpu/kenlm/archive/master.zip

Collecting https://github.com/kpu/kenlm/archive/master.zip
  Using cached https://github.com/kpu/kenlm/archive/master.zip (553 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


## ***Pointer 1: Please add comments on the model being downloaded and what its purpose is***

The next cell downloads the Hindi ASR n-gram language model from Nvidia, which can be found [here](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/models/speechtotext_hi_in_lm/files?version=deployable_v3.1).

It will be used for fusion with Whisper.
I uploaded the binary version to a shareable location on my Google Drive.

In [24]:
# Download the binary LM from Gdrive
import gdown
url = "https://drive.google.com/uc?id=1-AspJVZRXcrFMuLKx8C4N7uo9PnQytcB"
output = "language_model_3p0.bin"
gdown.download(url, output, quiet=False)

Downloading...
From: https://drive.google.com/uc?id=1-AspJVZRXcrFMuLKx8C4N7uo9PnQytcB
To: /content/language_model_3p0.bin
100%|██████████| 280M/280M [00:02<00:00, 119MB/s]


'language_model_3p0.bin'

In [4]:
import kenlm
model = kenlm.LanguageModel('/content/language_model_3p0.bin')
print("This is a {}-gram model".format(model.order))

This is a 4-gram model


In [5]:
# Below are 2 pairs of sentences that sound exactly the same in Hindi, but one sentence in each pair is incorrect (lexically or semantically)
# Generated using Bing Chat
book_correct = "मुझे यह किताब पसंद है।"
book_incorrect = "मुझे यह किताब पसन्द है।"

correct_score = model.score(book_correct)
incorrect_score = model.score(book_incorrect)
assert correct_score > incorrect_score
print(correct_score, incorrect_score)

-20.935253143310547 -21.663846969604492


In [None]:
sings_correct = "वह बहुत अच्छा गाता है।"
sings_incorrect = "वह बहुत अच्छा घाता है।"


correct_score = model.score(sings_correct)
incorrect_score = model.score(sings_incorrect)
assert correct_score > incorrect_score
print(correct_score, incorrect_score)

-19.827430725097656 -22.76061248779297
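To see why one sentence of a pair scores lower, it can help to check which words the LM treats as out-of-vocabulary. The sketch below assumes the `full_scores` method of the KenLM Python bindings, which yields one `(log10 probability, n-gram length, OOV flag)` tuple per word plus a final one for the end-of-sentence token. `oov_words` is a hypothetical helper, not part of KenLM.

```python
def oov_words(lm, sentence):
    """Return the words of `sentence` that the LM flags as out-of-vocabulary.

    `lm` is assumed to expose KenLM's `full_scores(sentence)` generator.
    """
    words = sentence.split()
    # One flag per word, followed by one extra entry for </s>.
    flags = [is_oov for _, _, is_oov in lm.full_scores(sentence)]
    # zip() stops at the shorter sequence, which drops the </s> entry.
    return [word for word, is_oov in zip(words, flags) if is_oov]
```

With the model loaded above, `oov_words(model, sings_incorrect)` would list the words the LM has never seen. Punctuation attached to a word (such as a trailing "।") may make that word OOV, depending on how the LM's training text was normalized.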


## ***Pointer 2: Please elaborate on the model being downloaded and what its purpose is. Also, why are we doing it twice?***

This download is unnecessary and can be skipped: it fetches the original ARPA source of the same LM, only for inspection.

In [None]:
# # download the original arpa LM file for inspection
# url = "https://drive.google.com/uc?id=1-4xQ3YCtsyONtpccGjOD1s9FtHqBX7RL"
# output = "language_model_3p0.arpa"
# gdown.download(url, output, quiet=False)

Downloading...
From: https://drive.google.com/uc?id=1-4xQ3YCtsyONtpccGjOD1s9FtHqBX7RL
To: /content/language_model_3p0.arpa
100%|██████████| 3.18G/3.18G [00:33<00:00, 95.6MB/s]


'language_model_3p0.arpa'

**Question:** Please elaborate on whether we need to prepare a new ARPA file for our model.

We can use this model for the time being and adapt it later with the text data from our training set.
However, I doubt we can build a better model ourselves: this one was built by a team at Nvidia, and our training set was probably part of its training data.

In [6]:
# inspect the last 20 lines inside the LM source
!tail -20 language_model_3p0.arpa

tail: cannot open 'language_model_3p0.arpa' for reading: No such file or directory


In [None]:
# some useful KenLM commands for future reference
# generate binary
# !kenlm/build/bin/build_binary dataset_tokenized_3gram.arpa dataset_tokenized_3gram.binary
# create a new LM
# !kenlm/build/bin/lmplz -o 3 --text dataset_tokenized.txt --arpa dataset_tokenized_3gram.arpa --discount_fallback
# !tail -20 dataset_tokenized_3gram.arpa
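If we later train our own LM with `lmplz`, the `--text` input is a plain text file with one whitespace-tokenized sentence per line. A minimal sketch for writing such a file from the training transcripts follows. The helper names and the punctuation set are assumptions; the normalization should match whatever is applied to the ASR transcripts.

```python
import re

# Punctuation to drop before training (an assumption, adjust as needed).
PUNCTUATION = "।.,!?\"'"

def normalize_sentence(sentence, punctuation=PUNCTUATION):
    """Remove the given punctuation characters and collapse whitespace."""
    table = str.maketrans("", "", punctuation)
    return re.sub(r"\s+", " ", sentence.translate(table)).strip()

def write_lm_corpus(sentences, path):
    """Write one normalized sentence per line; return the number of lines written."""
    written = 0
    with open(path, "w", encoding="utf-8") as f:
        for sentence in sentences:
            line = normalize_sentence(sentence)
            if line:  # skip sentences that are empty after normalization
                f.write(line + "\n")
                written += 1
    return written
```

The resulting file could then be passed to the `lmplz` command shown in the cell above in place of `dataset_tokenized.txt`.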

# Integrating the LM with Whisper

In [7]:
# !pip install openai-whisper
!pip install https://github.com/chandan110791/hindiWhisper/archive/master.zip

Collecting https://github.com/chandan110791/hindiWhisper/archive/master.zip
  Downloading https://github.com/chandan110791/hindiWhisper/archive/master.zip
     \ 7.4 MB 16.5 MB/s 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting triton==2.0.0 (from openai-whisper==20231106)
  Downloading triton-2.0.0-1-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (63.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 63.3/63.3 MB 9.5 MB/s eta 0:00:00
Collecting tiktoken (from openai-whisper==20231106)
  Downloading tiktoken-0.5.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.0/2.0 MB 66.0 MB/s eta 0:00:00
Collecting lit (from triton==2.0.0->openai-whisper==2

In [8]:
import whisper
import torch
import kenlm

In [None]:
model = whisper.load_model("small")

100%|████████████████████████████████████████| 461M/461M [00:04<00:00, 112MiB/s]


In [9]:
# Download a sample audio file from Gdrive
import gdown
url = "https://drive.google.com/uc?id=1kKeSvrZo8z5Rsp1q-h3GXpG7vHctKMcG"
output = "sample.wav"
gdown.download(url, output, quiet=False)

transcription = "ब्रूड बॉक्स लैंगस्ट्रॉथ छत्ते का एक अनिवार्य हिस्सा है।"

Downloading...
From: https://drive.google.com/uc?id=1kKeSvrZo8z5Rsp1q-h3GXpG7vHctKMcG
To: /content/sample.wav
100%|██████████| 197k/197k [00:00<00:00, 79.9MB/s]


In [12]:
audio = whisper.load_audio("/content/sample.wav")
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio).to(model.device)


AttributeError: ignored

## Baseline Whisper
Without nbest or LM integration

In [13]:
options = whisper.DecodingOptions(fp16 = False, beam_size=5, without_timestamps=True, language="hi")
result = whisper.decode(model, mel, options)
baseline = result.text
baseline

NameError: ignored

# Decoding with LM integration

In [14]:
# adding the LM
import gdown
url = "https://drive.google.com/uc?id=1-AspJVZRXcrFMuLKx8C4N7uo9PnQytcB"
output = "language_model_3p0.bin"
gdown.download(url, output, quiet=False)

Downloading...
From: https://drive.google.com/uc?id=1-AspJVZRXcrFMuLKx8C4N7uo9PnQytcB
To: /content/language_model_3p0.bin
100%|██████████| 280M/280M [00:01<00:00, 233MB/s]


'language_model_3p0.bin'

## ***Pointer 4: Getting an error here while trying to run the code. Please elaborate a bit on what is happening here. Can you please refer to the function with the changes made?***

The error occurred because the `whisper` package imported here was the original one, which doesn't include the changes that we made to the decoding options.

Replaced this:
```
!pip install openai-whisper
```
with this:
```
!pip install https://github.com/chandan110791/hindiWhisper/archive/master.zip
```

and the error is fixed.

In [16]:
!pip install https://github.com/chandan110791/hindiWhisper/archive/master.zip

Collecting https://github.com/chandan110791/hindiWhisper/archive/master.zip
  Using cached https://github.com/chandan110791/hindiWhisper/archive/master.zip
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [17]:
options = whisper.DecodingOptions(fp16 = False, withlm=True, beam_size=5,
        patience=1.0, lm_path="/content/language_model_3p0.bin", lm_alpha=1.0, lm_beta=0.0,
        without_timestamps=True, language="hi")
decoding_withLM = whisper.decode(model, mel, options)


NameError: ignored

In [None]:
decoding_withLM.text

'ब्रूद बाँक्ष लंश्टोट छत्ते का एक अनिवार्य हिस्सा है'

# Decoding with nbest (beam search)

In [None]:
options = whisper.DecodingOptions(fp16 = False, beam_size=5, return_nbest = True, without_timestamps=True, language="hi")
nbest = whisper.decode(model, mel, options)


for candidate in nbest:
  print(candidate.text, candidate.avg_logprob)

ब्रूद बाँक्ष लंश्टोट छत्ते का एक अनिवार्य हिस्सा है -0.45291578358617324
ब्रूद बाँक्ष लांश्टोट छत्ते का एक अनिवार्य हिस्सा है -0.45523316981428763
ब्रूद बाँक्ष लंश्टोट छत्ते का एक अनिवार्य हिस्सा है. -0.47296941078315347
ब्रूद बाँक्ष लंश्टोट छत्ते का एक अनिवार्य हिस्सा हैं -0.48318856449450476
ब्रूद बाँक्ष लांश्टोट छत्टे का एक अनिवार्य हिस्सा है -0.46669308344523114


### Adding LM rescoring

In [None]:
lm_model = kenlm.LanguageModel('/content/language_model_3p0.bin')
print("This is a {}-gram model".format(lm_model.order))

This is a 4-gram model


In [None]:
nbest_with_lm_score = [(c.text, c.avg_logprob, lm_model.score(c.text)) for c in nbest]
nbest_with_lm_score

[('ब्रूद बाँक्ष लंश्टोट छत्ते का एक अनिवार्य हिस्सा है',
  -0.45291578358617324,
  -36.06303787231445),
 ('ब्रूद बाँक्ष लांश्टोट छत्ते का एक अनिवार्य हिस्सा है',
  -0.45523316981428763,
  -36.06303787231445),
 ('ब्रूद बाँक्ष लंश्टोट छत्ते का एक अनिवार्य हिस्सा है.',
  -0.47296941078315347,
  -44.10427474975586),
 ('ब्रूद बाँक्ष लंश्टोट छत्ते का एक अनिवार्य हिस्सा हैं',
  -0.48318856449450476,
  -37.30411148071289),
 ('ब्रूद बाँक्ष लांश्टोट छत्टे का एक अनिवार्य हिस्सा है',
  -0.46669308344523114,
  -37.54087829589844)]

In [None]:
lm_weight = 0.01
combined_scores = [(text, whisper_score + lm_score*lm_weight) for text, whisper_score, lm_score in nbest_with_lm_score]
combined_scores.sort(key=lambda t: t[1], reverse=True)
combined_scores

[('ब्रूद बाँक्ष लंश्टोट छत्ते का एक अनिवार्य हिस्सा है', -0.8135461623093178),
 ('ब्रूद बाँक्ष लांश्टोट छत्ते का एक अनिवार्य हिस्सा है', -0.8158635485374321),
 ('ब्रूद बाँक्ष लांश्टोट छत्टे का एक अनिवार्य हिस्सा है', -0.8421018664042155),
 ('ब्रूद बाँक्ष लंश्टोट छत्ते का एक अनिवार्य हिस्सा हैं', -0.8562296793016337),
 ('ब्रूद बाँक्ष लंश्टोट छत्ते का एक अनिवार्य हिस्सा है.', -0.9140121582807121)]
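The rescoring step above can be wrapped in a small reusable function. This is a sketch: `rescore_nbest` is a hypothetical helper, and it takes the LM scorer as a callable so that it works with `lm_model.score` or any stand-in. Note that Whisper's `avg_logprob` is a per-token average of natural-log probabilities, while KenLM's `score` is a total log10 probability for the whole sentence, so `lm_weight` also has to absorb the difference in scale.

```python
def rescore_nbest(candidates, lm_score, lm_weight=0.01):
    """Combine ASR and LM scores and sort the hypotheses, best first.

    candidates: iterable of (text, asr_score) pairs
    lm_score:   callable mapping a text to its LM score
    """
    combined = [
        (text, asr_score + lm_weight * lm_score(text))
        for text, asr_score in candidates
    ]
    # Higher (less negative) combined score means a better hypothesis.
    return sorted(combined, key=lambda pair: pair[1], reverse=True)
```

For the N-best list above, this could be called as `rescore_nbest([(c.text, c.avg_logprob) for c in nbest], lm_model.score, lm_weight=0.01)`.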

# Decoding with nbest (best of N hypotheses)

In [None]:
options = whisper.DecodingOptions(fp16 = False, best_of=10, return_nbest=True, without_timestamps=True, temperature=0.3, language="hi")
nbest_best_of_n_hyp = whisper.decode(model, mel, options)


In [None]:
for candidate in nbest_best_of_n_hyp:
  print(candidate.text, candidate.avg_logprob)

ब्रूद बाँच लंश्टोट छत्ते का एक अनिवार्य हिस्सा है -0.45742596898760113
ब्रूद बाओ्छ लंश्टोट छत्ते का एक अनिवार्ये हिस्चा है -0.5202431113032971
ब्रूड़ ब्रूड़ लंच्टोट चद्ते का एक अनीवार्य लिए हिस्चा है -0.5065439448637121
ब्रुध बाँच लंश्टोट चते का एक अनिवार्य हिस morally part of the Landstorch Chhattey. -0.9538334877260269
ब्रुद बाँच लांश्टोट छद्टे का एक अनिवार्य हिस्चा है -0.5101523081461589
ब्रूद बाँच लंच्टोड छत्टे का एक अनिवार्य लिए हिस्सा है -0.4865361798194147
ब्रूद बाँच लांश्टोट छत्ते का एक अनीवार्य हिस्था है -0.49694071144893254
ब्रूद बाँच लन्च्टोट चते का एक अनिवार्य हिस्चा है -0.5291237149919782
ब्रूद भोग्ष लंग स्थ्टोड शथ्टे का एक अनिवार्य हिस्सा है -0.5426437135726686
ब्रूद बाँच लंश्टोट छद्टे का एक अनिवारे हिस्चा है -0.49345003325363684


In [None]:
lm_model = kenlm.LanguageModel('/content/language_model_3p0.bin')
nbest_with_lm_score = [(c.text, c.avg_logprob, lm_model.score(c.text)) for c in nbest_best_of_n_hyp]
nbest_with_lm_score


[('ब्रूद बाँच लंश्टोट छत्ते का एक अनिवार्य हिस्सा है',
  -0.45742596898760113,
  -35.82866668701172),
 ('ब्रूद बाओ्छ लंश्टोट छत्ते का एक अनिवार्ये हिस्चा है',
  -0.5202431113032971,
  -49.43550491333008),
 ('ब्रूड़ ब्रूड़ लंच्टोट चद्ते का एक अनीवार्य लिए हिस्चा है',
  -0.5065439448637121,
  -54.91597366333008),
 ('ब्रुध बाँच लंश्टोट चते का एक अनिवार्य हिस morally part of the Landstorch Chhattey.',
  -0.9538334877260269,
  -81.96270751953125),
 ('ब्रुद बाँच लांश्टोट छद्टे का एक अनिवार्य हिस्चा है',
  -0.5101523081461589,
  -46.626407623291016),
 ('ब्रूद बाँच लंच्टोड छत्टे का एक अनिवार्य लिए हिस्सा है',
  -0.4865361798194147,
  -46.603515625),
 ('ब्रूद बाँच लांश्टोट छत्ते का एक अनीवार्य हिस्था है',
  -0.49694071144893254,
  -49.201133728027344),
 ('ब्रूद बाँच लन्च्टोट चते का एक अनिवार्य हिस्चा है',
  -0.5291237149919782,
  -46.626407623291016),
 ('ब्रूद भोग्ष लंग स्थ्टोड शथ्टे का एक अनिवार्य हिस्सा है',
  -0.5426437135726686,
  -42.747047424316406),
 ('ब्रूद बाँच लंश्टोट छद्टे का एक अनिव

In [None]:
lm_weight = 0.01
combined_scores_bestofNSampling = [(text, whisper_score + lm_score*lm_weight) for text, whisper_score, lm_score in nbest_with_lm_score]
combined_scores_bestofNSampling.sort(key=lambda t: t[1], reverse=True)
combined_scores_bestofNSampling

[('ब्रूद बाँच लंश्टोट छत्ते का एक अनिवार्य हिस्सा है', -0.8157126358577182),
 ('ब्रूद बाँच लंच्टोड छत्टे का एक अनिवार्य लिए हिस्सा है',
  -0.9525713360694148),
 ('ब्रूद भोग्ष लंग स्थ्टोड शथ्टे का एक अनिवार्य हिस्सा है',
  -0.9701141878158327),
 ('ब्रुद बाँच लांश्टोट छद्टे का एक अनिवार्य हिस्चा है', -0.976416384379069),
 ('ब्रूद बाँच लांश्टोट छत्ते का एक अनीवार्य हिस्था है', -0.9889520487292061),
 ('ब्रूद बाँच लन्च्टोट चते का एक अनिवार्य हिस्चा है', -0.9953877912248883),
 ('ब्रूद बाँच लंश्टोट छद्टे का एक अनिवारे हिस्चा है', -1.0002398129167227),
 ('ब्रूद बाओ्छ लंश्टोट छत्ते का एक अनिवार्ये हिस्चा है', -1.014598160436598),
 ('ब्रूड़ ब्रूड़ लंच्टोट चद्ते का एक अनीवार्य लिए हिस्चा है',
  -1.0557036814970129),
 ('ब्रुध बाँच लंश्टोट चते का एक अनिवार्य हिस morally part of the Landstorch Chhattey.',
  -1.7734605629213394)]

# **RUN THE NOTEBOOK STARTING HERE**

# The next sections include:
- Importing a finetuned huggingface model to our codebase
- Running the evaluation of the 4 decoding variants on the imported dataset
- Computing the WER and CER for each of the decoding variants

## The decoding strategies are:
1. baseline decoding without LM
2. Deep fusion of the LM with the token probabilities during beam search decoding (try to find the optimal value for lm_alpha)
3. Shallow fusion by rescoring the N best candidates generated through beam search (try to find the optimal value for lm_weight)
4. Shallow fusion by rescoring the N best candidates generated through greedy decoding using best of N sampling (try to find the optimal values for lm_weight and temperature)
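For strategy 2, the `lm_alpha` and `lm_beta` options suggest a log-linear combination of the two models. How the fork combines them internally is not shown in this notebook, so the formula below is an assumption based on the common form `asr + alpha * lm + beta * length`. `fused_score` is a hypothetical helper, and the log10-to-natural-log conversion is also an assumption.

```python
import math

def fused_score(asr_logprob, lm_log10prob, n_words, lm_alpha=1.0, lm_beta=0.0):
    """Log-linear interpolation of an ASR score and an LM score.

    asr_logprob:  natural-log probability from the acoustic model
    lm_log10prob: log10 probability from the n-gram LM (KenLM convention)
    n_words:      hypothesis length, used as a word-insertion bonus
    """
    lm_logprob = lm_log10prob * math.log(10)  # bring both scores to natural log
    return asr_logprob + lm_alpha * lm_logprob + lm_beta * n_words
```

With `lm_alpha=0` and `lm_beta=0` this reduces to the baseline score, which gives a simple reference point when tuning `lm_alpha`.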





In [1]:
!pip install https://github.com/chandan110791/hindiWhisper/archive/master.zip
!pip install https://github.com/kpu/kenlm/archive/master.zip
#!pip install transformers

Collecting https://github.com/chandan110791/hindiWhisper/archive/master.zip
  Downloading https://github.com/chandan110791/hindiWhisper/archive/master.zip
     - 7.4 MB 10.5 MB/s 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting triton==2.0.0 (from openai-whisper==20231106)
  Downloading triton-2.0.0-1-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (63.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 63.3/63.3 MB 28.6 MB/s eta 0:00:00
Collecting tiktoken (from openai-whisper==20231106)
  Downloading tiktoken-0.5.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.0/2.0 MB 72.1 MB/s eta 0:00:00
Collecting lit (from triton==2.0.0->openai-whisper==

## ***Pointer 5: It is looking for a file named pytorch_model.bin, which does not exist, so I am not able to integrate the model CKSINGH/whisper-small-hi-iiib, fine-tuned at: https://colab.research.google.com/github/chandan110791/HindiSpeechRecognition/blob/main/fine_tune_whisper_iiitb.ipynb#scrollTo=d7030622-caf7-4039-939b-6195cdaa2585. Any input?***

The reason is that the fine-tuned model must be saved in pickle format,
so that pytorch_model.bin can be found at this path:
https://huggingface.co/CKSINGH/whisper-small-hi-iiib/tree/main

For now, I will use this one https://huggingface.co/sanchit-gandhi/whisper-small-hi/tree/main to make the required changes to the code.

In [2]:
# quote the requirement: an unquoted ">=" is treated by the shell as an output redirection
!pip install "datasets>=2.6.1"
!pip install git+https://github.com/huggingface/transformers
# !pip install librosa
# !pip install "evaluate>=0.30"
!pip install jiwer
!pip install gradio
!pip install accelerate -U

Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-gee0q1dm
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers /tmp/pip-req-build-gee0q1dm
  Resolved https://github.com/huggingface/transformers to commit 85fde09c97213bf7e8625f83096bb2a9e183f987
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting tokenizers<0.19,>=0.14 (from transformers==4.36.0.dev0)
  Downloading tokenizers-0.15.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.8/3.8 MB 43.8 MB/s eta 0:00:00
Collecting safetensors>=0.3.1 (from transformers==4.36.0.dev0)
  Downloading safetensors-0.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 M

In [3]:
from transformers import WhisperForConditionalGeneration

In [20]:
# quote the requirements: an unquoted ">=" is treated by the shell as an output redirection
!pip install "datasets>=2.6.1"
!pip install git+https://github.com/huggingface/transformers
!pip install librosa
!pip install "evaluate>=0.30"
!pip install jiwer
!pip install gradio
!pip install accelerate -U

Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-j2eve0re
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers /tmp/pip-req-build-j2eve0re
  Resolved https://github.com/huggingface/transformers to commit 06343b06335a1f8417bd32d3ffc7cf2cca9a24ac
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


## In the next cell, change the repo id to your own repo once you have generated pytorch_model.bin (the pickle format of the fine-tuned model)

In [4]:
import whisper
import kenlm
import torch
import jiwer
from tqdm import tqdm
import os
import pandas as pd
# using pickle to serialize the map_dict
import pickle

from huggingface_hub import hf_hub_download
filename = "pytorch_model.bin"


# hf_hub_download(repo_id="CKSINGH/whisper-small-hi-iiib", filename=filename, local_dir="/content/")
hf_hub_download(repo_id="sanchit-gandhi/whisper-small-hi", filename=filename, local_dir="/content/")


pytorch_model.bin:   0%|          | 0.00/967M [00:00<?, ?B/s]

'/content/pytorch_model.bin'

# Here you need to change the model size from "small" to "medium" if you finetuned the whisper-medium (as the one in this repo: https://huggingface.co/CKSINGH/whisper-small-hi-iiib/tree/main)

In [5]:
# to enable verbose printing of exceptions (+ layers matching name)
DEBUG = False

# set to True if the keys of your custom HF checkpoint carry a "model." prefix
# the original note attributes this to training with DDP on a multi-GPU machine;
# checkpoints saved from WhisperForConditionalGeneration also appear to carry it,
# since the base model is stored in its .model attribute
DDP_TRAINED = True

# if set, we have to add the prefix to match the keys of the fine-tuned state_dict
if DDP_TRAINED:
    PREFIX = "model."
else:
    PREFIX = ""

MODEL_SIZE = "small"

# the device where you're running this code
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# the name of the file with your fine-tuned model
FINETUNED_MODEL = "pytorch_model.bin"


# the name of the file for the serialized map_dict
# use a different name for each model size to avoid overwriting an existing file
FILE_DICT = "map_dict_7.pkl"
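Instead of setting the prefix by hand, it could be detected from the checkpoint keys. The sketch below is a hypothetical helper that tries a few candidate prefixes against one known base-model key. The candidate list is an assumption and can be extended.

```python
def detect_prefix(checkpoint_keys, base_key,
                  candidates=("", "model.", "module.", "module.model.")):
    """Return the first prefix p such that p + base_key is a checkpoint key."""
    keys = set(checkpoint_keys)
    for prefix in candidates:
        if prefix + base_key in keys:
            return prefix
    raise KeyError(f"no candidate prefix matches {base_key!r}")
```

For example, assuming `encoder.conv1.weight` is one of the base-model keys, `detect_prefix(torch.load(FINETUNED_MODEL, map_location="cpu").keys(), "encoder.conv1.weight")` would return the value to use as `PREFIX`.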


In [6]:

def import_hf_model(finetuned_model, debug=False, model_size="small", device="cpu", file_dict="map_dict_7.pkl"):

  def has_numbers(inputString):
      return any(char.isdigit() for char in inputString)

  # next functions are used to make sanity checks for the mappings

  # get if it is encoder or decoder
  def extract_function(key_name):
      # encoder or decoder is the first part of the key
      first_part = key_name.split(".")[0]

      key_func = None
      if first_part in ["encoder", "decoder"]:
          key_func = first_part

      return key_func

  def extract_layer_num(key_name):
      # layer num is the third piece
      layer_num = None

      if has_numbers(key_name):
          layer_num = key_name.split(".")[2]

      return layer_num

  # check that the two keys are for layers
  # with the same function
  # (both encoder or both decoder)
  # and have the same layer number
  # this way we are super-safe (I think)
  def sanity_check(key1, key2):
      is_ok = True

      # check same func (encoder or decoder)
      func1 = extract_function(key1)
      func2 = extract_function(key2)

      if func1 != func2:
          print(f"Warning: layers seem to have different functions: {key1},{key2}")
          is_ok = False

      # check same layer_num
      layer1 = extract_layer_num(key1)
      layer2 = extract_layer_num(key2)

      if layer1 != layer2:
          print(f"Warning: layers seem to have different numbers: {key1},{key2}")
          is_ok = False

      return is_ok

  if not os.path.isfile(file_dict):
    # Vanilla means: not custom trained
    print()
    print("Loading vanilla Whisper model")
    model = whisper.load_model(model_size, device=device)

    print("Loading vanilla HF Model")
    hugging_face_model = WhisperForConditionalGeneration.from_pretrained(
        "openai/whisper-" + model_size
    )

    # extract state-dict from both
    state_d_openai = model.state_dict()
    state_d_huggingface = hugging_face_model.model.state_dict()

    # build the mapping between keys...
    map_dict = {}
    print("Matching layers...")

    # for every layer in OpenAI model
    n_sanity_ok = 0

    #
    # here we're considering the cartesian product of the two state dict and try to match
    # rules applied:
    # 1. the two layers have the same shape
    # 2. the two layer have the same parameters' values
    # 3. we apply sanity check (see function above)
    #
    for k in tqdm(state_d_openai):
        # find a layer in the HF model, check with j
        for j in state_d_huggingface:
            # where parameters have same shape and same values
            if state_d_huggingface[j].shape == state_d_openai[k].shape:
                if torch.all(torch.eq(state_d_huggingface[j], state_d_openai[k])).item():
                    # found, register the mapping
                    map_dict[k] = j
                    # run the sanity check, which prints a warning on a mismatch
                    if sanity_check(k, j):
                        n_sanity_ok += 1

                        # if you enable this print you can see the names of the layers
                        # chosen in the match and you will see that they have the same function
                        if debug:
                            print(k, j)

                    break


    # check if we have matched every entry
    print("Check if we have matched every entry in state_dict...")
    print()
    print(f"Number of keys: {len(map_dict.keys())}")
    print(f"Number of keys: {len(state_d_openai.keys())}")
    # assert len(map_dict.keys()) == len(state_d_openai.keys()), "The match is not complete !"

    print(f"Number of sanity_check ok: {n_sanity_ok}")
    print()

    print("Match is complete !!!")
    print()


    # serialize the map_dict to file
    print("Serializing map_dict...")

    # the with-block closes the file automatically
    with open(file_dict, "wb") as f:
        pickle.dump(map_dict, f)

    print(f"map_dict saved as: {file_dict}...")
    print()


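  else:
    # map_dict was already serialized by a previous run: reload it and
    # instantiate the vanilla Whisper model that will receive the fine-tuned weights
    print("Reloading map_dict...")
    print()
    model = whisper.load_model(model_size, device=device)
    with open(file_dict, "rb") as f:
        map_dict = pickle.load(f)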

  # loading fine-tuned dict
  print("Loading fine tuned dict...")

  # added map_location to handle the fact that the custom model has been trained on GPU
  state_dict_finetuned = torch.load(finetuned_model, map_location=torch.device(device))

  # build the state_dict to be used
  # take the key name from standard (OpenAI) and the value from finetuned (HF)
  print("Rebuild the state dict...")
  new_state_dict = {}
  n_except = 0
  for k in tqdm(map_dict.keys()):
      try:
          # PREFIX ("model.") must be prepended when the fine-tuned HF checkpoint
          # stores its keys with that prefix (see DDP_TRAINED above);
          # the vanilla HF keys used to build map_dict do not have it
          new_state_dict[k] = state_dict_finetuned[PREFIX + map_dict[k]]
      except KeyError:
          n_except += 1

          if debug:
              print(PREFIX + map_dict[k])




  msg_err = f"Rebuild state dict failed, {n_except} pick failed"
  assert n_except == 0, msg_err



  print()
  print("Loading the final model...")
  model.load_state_dict(new_state_dict)
  return model

In [17]:
!pip show typing_extensions

Name: typing_extensions
Version: 4.8.0
Summary: Backported and Experimental Type Hints for Python 3.8+
Home-page: 
Author: 
Author-email: "Guido van Rossum, Jukka Lehtosalo, Łukasz Langa, Michael Lee" <levkivskyi@gmail.com>
License: 
Location: /usr/local/lib/python3.10/dist-packages
Requires: 
Required-by: arviz, chex, fastapi, flax, gradio, gradio_client, huggingface-hub, ibis-framework, inflect, librosa, orbax-checkpoint, panel, polars, pydantic, pydantic_core, pymc, pytensor, python-utils, qudida, SQLAlchemy, tensorflow, tensorflow-probability, torch, typer, uvicorn


In [28]:
!pip install typing-extensions --upgrade



In [18]:
from transformers import WhisperForConditionalGeneration

In [7]:
model = import_hf_model(finetuned_model=FINETUNED_MODEL)


Loading vanilla Whisper model


100%|███████████████████████████████████████| 461M/461M [00:07<00:00, 67.8MiB/s]


Loading vanilla HF Model


(…)i/whisper-small/resolve/main/config.json:   0%|          | 0.00/1.97k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/967M [00:00<?, ?B/s]

(…)mall/resolve/main/generation_config.json:   0%|          | 0.00/3.84k [00:00<?, ?B/s]

Matching layers...


100%|██████████| 479/479 [00:06<00:00, 78.04it/s]


Check if we have matched every entry in state_dict...

Number of keys: 479
Number of keys: 479
Number of sanity_check ok: 479

Match is complete !!!

Serializing map_dict...
map_dict saved as: map_dict_7.pkl...

Loading fine tuned dict...
Rebuild the state dict...


100%|██████████| 479/479 [00:00<00:00, 271606.27it/s]


Loading the final model...





In [8]:
model

Whisper(
  (encoder): AudioEncoder(
    (conv1): Conv1d(80, 768, kernel_size=(3,), stride=(1,), padding=(1,))
    (conv2): Conv1d(768, 768, kernel_size=(3,), stride=(2,), padding=(1,))
    (blocks): ModuleList(
      (0-11): 12 x ResidualAttentionBlock(
        (attn): MultiHeadAttention(
          (query): Linear(in_features=768, out_features=768, bias=True)
          (key): Linear(in_features=768, out_features=768, bias=False)
          (value): Linear(in_features=768, out_features=768, bias=True)
          (out): Linear(in_features=768, out_features=768, bias=True)
        )
        (attn_ln): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): Sequential(
          (0): Linear(in_features=768, out_features=3072, bias=True)
          (1): GELU(approximate='none')
          (2): Linear(in_features=3072, out_features=768, bias=True)
        )
        (mlp_ln): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      )
    )
    (ln_post): LayerNorm((768,), eps=1e-0

In [None]:
# code I used for testing purposes
# # Download a sample audio file from Gdrive
# import gdown
# url = "https://drive.google.com/uc?id=1kKeSvrZo8z5Rsp1q-h3GXpG7vHctKMcG"
# output = "sample.wav"
# gdown.download(url, output, quiet=False)

# transcription = "ब्रूड बॉक्स लैंगस्ट्रॉथ छत्ते का एक अनिवार्य हिस्सा है।"
# # adding the LM
# url = "https://drive.google.com/uc?id=1-AspJVZRXcrFMuLKx8C4N7uo9PnQytcB"
# output = "language_model_3p0.bin"
# gdown.download(url, output, quiet=False)


# audio = whisper.load_audio("/content/sample.wav")
# audio = whisper.pad_or_trim(audio)
# mel = whisper.log_mel_spectrogram(audio).to(model.device)

# res = decode_baseline(model, mel, 5)
# res

# lm_path = '/content/language_model_3p0.bin'
# lm_alpha = 1.0
# beam_size = 5
# res = decode_deep_fusion(model, mel, beam_size=beam_size, lm_path=lm_path, lm_alpha=lm_alpha)
# res

# beam_size = 5
# lm_weight = 0.01
# res = decode_shallow_fusion_beam_search(model, mel, beam_size=beam_size, lm_path=lm_path, lm_weight=lm_weight, debug=True)
# res

# best_of = 10
# temperature = 0.3

# res = decode_shallow_fusion_nbest(model, mel, best_of=best_of, lm_path=lm_path, temperature=temperature, lm_weight=lm_weight, debug=False)
# res

In [None]:
# quote the requirements: an unquoted ">=" is treated by the shell as an output redirection
!pip install "datasets>=2.6.1"
!pip install git+https://github.com/huggingface/transformers
!pip install librosa
!pip install "evaluate>=0.30"
!pip install jiwer
!pip install gradio
!pip install accelerate -U

Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-fhheg_im
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers /tmp/pip-req-build-fhheg_im
  Resolved https://github.com/huggingface/transformers to commit 78f6ed6c70b29c1560780e3869a7ad4c6b3d2710
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [None]:
# from huggingface_hub import notebook_login
# notebook_login()

## ***Pointer 6: We need to evaluate our final model on the shared test data and calculate the WER and CER. Below is some boilerplate code to do so.***
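As a reference for what the two metrics measure, here is a dependency-free sketch of corpus-level WER and CER based on edit distance. `edit_distance` and `corpus_error_rate` are hypothetical helpers. The notebook imports `jiwer` for the actual evaluation, and its results may differ from this sketch depending on text normalization.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def corpus_error_rate(references, hypotheses, unit="word"):
    """Total edits divided by total reference length over the whole corpus."""
    total_edits = 0
    total_length = 0
    for ref, hyp in zip(references, hypotheses):
        ref_units = ref.split() if unit == "word" else list(ref)
        hyp_units = hyp.split() if unit == "word" else list(hyp)
        total_edits += edit_distance(ref_units, hyp_units)
        total_length += len(ref_units)
    return total_edits / total_length
```

Calling it with `unit="word"` gives the WER and with `unit="char"` the CER. Summing edits over the corpus, rather than averaging per-utterance rates, keeps short utterances from dominating the result.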



In [9]:
# Prepare the test data for calculating WER and CER

import datasets
# import the load_dataset function
from datasets import load_dataset

# specify the URL directory and the data files
# load the dataset from the URL directory


# datasets.config.DEFAULT_MAX_BATCH_SIZE = 10
test_dataset = load_dataset("/content/datadownload.py")

Downloading data:   0%|          | 0.00/453M [00:00<?, ?B/s]

<datasets.download.download_manager.ArchiveIterable object at 0x7e879420eec0>


Generating train split: 0 examples [00:00, ? examples/s]

<datasets.download.download_manager.ArchiveIterable object at 0x7e879420eec0>


Generating test split: 0 examples [00:00, ? examples/s]

<datasets.download.download_manager.ArchiveIterable object at 0x7e879423aaa0>


Generating validation split: 0 examples [00:00, ? examples/s]

<datasets.download.download_manager.ArchiveIterable object at 0x7e879423a350>


Generating other split: 0 examples [00:00, ? examples/s]

<datasets.download.download_manager.ArchiveIterable object at 0x7e879423a4d0>


Generating validated split: 0 examples [00:00, ? examples/s]

<datasets.download.download_manager.ArchiveIterable object at 0x7e879423a590>


Generating invalidated split: 0 examples [00:00, ? examples/s]

<datasets.download.download_manager.ArchiveIterable object at 0x7e879423a860>


In [10]:
test_dataset = test_dataset.remove_columns(["age", "client_id", "down_votes", "gender", "locale", "path", "segment", "up_votes","accents","variant"])


In [26]:
test_dataset

DatasetDict({
    train: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 4630
    })
    test: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 3072
    })
    validation: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 2416
    })
    other: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 3767
    })
    validated: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 10173
    })
    invalidated: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 757
    })
})

In [11]:
combined_dataset = datasets.concatenate_datasets([test_dataset["test"]])


In [14]:
from datasets import load_dataset, concatenate_datasets, DatasetDict

# Define the split ratios
validation_split = 0.99  # 99% of the data
test_split = 0.01  # remaining 1% of the data (informational; the test size is derived below)

# Compute the number of samples for each split
num_samples = len(combined_dataset)
num_validation = int(validation_split * num_samples)
num_test = num_samples - num_validation  # Remaining ~1%

# Split the combined dataset
validation_dataset = combined_dataset.select(indices=list(range(num_validation)))
test_dataset = combined_dataset.select(indices=list(range(num_validation, num_samples)))

# If you want to organize the split datasets in a DatasetDict for convenience:
split_test_datasets = DatasetDict({
    'validation': validation_dataset,
    'test': test_dataset
})

# Verify the resulting datasets
print(f'Validation Dataset: {len(validation_dataset)} samples')
print(f'Test Dataset: {len(test_dataset)} samples')

Validation Dataset: 3041 samples
Test Dataset: 31 samples
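
The sample counts above follow directly from the index arithmetic in the split cell. A minimal sketch of that arithmetic in plain Python, assuming the 3072-row `test` split shown earlier; the helper name `split_counts` is hypothetical and not part of the notebook:

```python
from typing import Tuple


def split_counts(num_samples: int, validation_fraction: float) -> Tuple[int, int]:
    """Return (num_validation, num_test) for a contiguous, unshuffled split."""
    num_validation = int(validation_fraction * num_samples)  # floor, as in the cell above
    return num_validation, num_samples - num_validation


num_validation, num_test = split_counts(3072, 0.99)

# The first block of rows goes to validation, the tail goes to test.
validation_indices = range(num_validation)
test_indices = range(num_validation, num_validation + num_test)

print(num_validation, num_test)  # 3041 31
```

Because the rows are taken in order without shuffling, the 31 test utterances are simply the last 31 rows of the original `test` split.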


In [13]:
## Note: the split above keeps only about 1% of the test data (31 samples) for evaluation

In [15]:
split_test_datasets["test"][0]


{'audio': {'path': '/root/.cache/huggingface/datasets/downloads/extracted/9274dfa08465814d9b528cc161d00ba351615d0fffa64a36a31ba4eef3620161/cv-corpus-15.0-2023-09-08/hi/Zclips/common_voice_hi_36482276.mp3',
  'array': array([-8.88178420e-16, -1.50990331e-14, -4.44089210e-15, ...,
          2.29964789e-05,  2.15262316e-05,  1.06726802e-05]),
  'sampling_rate': 48000},
 'sentence': 'हरियाणा: मनोहर लाल खट्टर की कैबिनेट में गोपाल कांडा को नहीं मिली जगह'}

Evaluation Metrics on the Test data

In [16]:
class customDataset(torch.utils.data.Dataset):
    """
    A simple class to wrap the Common Voice Hindi data and trim/pad the audio to 30 seconds.
    It will drop the last few seconds of a very small portion of the utterances.
    """
    def __init__(self, dataset, device=DEVICE):
        self.dataset = dataset
        self.device = device

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, item):
        sample = self.dataset[item]
        sentence = sample['sentence']
        # Reload from disk so that Whisper resamples the 48 kHz clip to 16 kHz mono
        audio = whisper.load_audio(sample['audio']['path'])
        audio = whisper.pad_or_trim(audio)
        # Use the device passed to the constructor rather than the global model
        mel = whisper.log_mel_spectrogram(audio).to(self.device)
        return (mel, sentence)

dataset = customDataset(split_test_datasets["test"])
loader = torch.utils.data.DataLoader(dataset, batch_size=1)
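
The trim/pad step in the dataset class fixes every clip to 30 seconds at 16 kHz, that is 480000 samples. A minimal pure-Python sketch of that behaviour, written only for illustration; `pad_or_trim_list` is a hypothetical stand-in and not Whisper's own implementation, which operates on arrays and tensors:

```python
SAMPLE_RATE = 16000      # Hz, the rate Whisper works at
CHUNK_SECONDS = 30       # fixed window length
N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS  # 480000


def pad_or_trim_list(samples, length=N_SAMPLES):
    """Truncate to `length` samples, or right-pad with zeros up to `length`."""
    if len(samples) >= length:
        return list(samples[:length])
    return list(samples) + [0.0] * (length - len(samples))


short_clip = [0.1] * (5 * SAMPLE_RATE)    # 5 seconds of audio
long_clip = [0.1] * (40 * SAMPLE_RATE)    # 40 seconds of audio

print(len(pad_or_trim_list(short_clip)))  # 480000, zero-padded
print(len(pad_or_trim_list(long_clip)))   # 480000, last 10 seconds dropped
```

This is why the docstring warns that a few long utterances lose their final seconds: anything beyond the 30-second window never reaches the model.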

In [17]:
def decode_baseline(model, mel, beam_size):
  """
    This function performs the transcription with Whisper to provide a baseline without LM fusion

    - model: The Whisper model used for transcription.
    - mel: Represents Mel-spectrogram data, likely input features for the model.
    - beam_size: An integer specifying the size of the beam for beam search decoding.
  """
  options = whisper.DecodingOptions(fp16 = False, beam_size=beam_size, without_timestamps=True, language="hi")
  result = whisper.decode(model, mel, options)
  result = [r.text for r in result]
  return result

def decode_deep_fusion(model, mel, beam_size, lm_path, lm_alpha):
  """
    This function performs the deep fusion of the LM with Whisper during the beam search decoding step

    - model: The Whisper model used for transcription.
    - mel: Represents Mel-spectrogram data, likely input features for the model.
    - beam_size: An integer specifying the size of the beam for beam search decoding.
    - lm_path: A string representing the path to the language model file used for fusion.
    - lm_alpha: A numerical value representing the weight assigned to the language model scores during deep fusion.
  """
  options = whisper.DecodingOptions(fp16 = False, withlm=True, beam_size=beam_size,
        patience=1.0, lm_path=lm_path, lm_alpha=lm_alpha, lm_beta=0.0,
        without_timestamps=True, language="hi")
  result = whisper.decode(model, mel, options)
  result = [r.text for r in result]
  return result

def decode_shallow_fusion_beam_search(model, mel, beam_size, lm_path, lm_weight, debug=False):
  """
    This function performs shallow fusion by rescoring the n-best hypotheses produced by beam search:
    it combines the Whisper score with the language model score (weighted by the lm_weight factor).

    - model: The Whisper model used for transcription.
    - mel: Represents Mel-spectrogram data, likely input features for the model.
    - beam_size: An integer specifying the size of the beam for beam search decoding.
    - lm_path: A string representing the path to the language model file used for fusion.
    - lm_weight: A numerical value representing the weight assigned to the language model scores during shallow fusion.
    - debug: A boolean flag (optional) indicating whether to print debug information (default is False).
      Useful for inspecting the outputs with different lm_weights for finding the optimal value for lm_weight

  """
  options = whisper.DecodingOptions(fp16 = False, beam_size=beam_size, return_nbest = True, without_timestamps=True, language="hi")
  nbests = whisper.decode(model, mel, options)

  lm_model = kenlm.LanguageModel(lm_path)
  combined_scores = []
  # if testing with a single utterance without a dataloader
  if not type(nbests) == list:
    nbests = list(nbests)
  for nbest in nbests:
    nbest_with_lm_score = [(c.text, c.avg_logprob, lm_model.score(c.text)) for c in nbest]
    combined_score = [(text, whisper_score + lm_score*lm_weight, whisper_score, lm_score) for text, whisper_score, lm_score in nbest_with_lm_score]
    combined_score.sort(key=lambda t: t[1], reverse=True)
    combined_scores.append(combined_score)
  if debug:
    print(combined_scores)
  # each entry is (text, final_score, whisper_score, lm_score), sorted best first
  # return the text of the highest-scoring hypothesis for each input in the batch
  result = [combined_score[0][0] for combined_score in combined_scores]
  return result

def decode_shallow_fusion_nbest(model, mel, best_of, lm_path, temperature, lm_weight, debug=False):
  """
    This function performs shallow fusion by rescoring the best-of-N hypotheses obtained by sampling with a temperature:
    it combines the Whisper score with the language model score (weighted by the lm_weight factor).

    - model: The Whisper model used for transcription.
    - mel: Represents Mel-spectrogram data, likely input features for the model.
    - best_of: An integer specifying the number of best hypotheses to consider during decoding.
    - lm_path: A string representing the path to the language model used for fusion.
    - temperature: A numerical value indicating the temperature parameter used during sampling. Higher temperature corresponds to more variation in the n best list
    - lm_weight: A numerical value representing the weight assigned to the language model scores during shallow fusion.
    - debug: A boolean flag (optional) indicating whether to print debug information (default is False).
      Useful when finding the optimal value for the lm_weight
  """
  options = whisper.DecodingOptions(fp16 = False, best_of=best_of, return_nbest=True, without_timestamps=True, temperature=temperature, language="hi")
  nbests = whisper.decode(model, mel, options)

  lm_model = kenlm.LanguageModel(lm_path)
  combined_scores = []
  # if testing with a single utterance without a dataloader
  if not type(nbests) == list:
    nbests = list(nbests)
  for nbest in nbests:
    nbest_with_lm_score = [(c.text, c.avg_logprob, lm_model.score(c.text)) for c in nbest]
    combined_score = [(text, whisper_score + lm_score*lm_weight, whisper_score, lm_score) for text, whisper_score, lm_score in nbest_with_lm_score]
    combined_score.sort(key=lambda t: t[1], reverse=True)
    combined_scores.append(combined_score)
  if debug:
    print(combined_scores)
  # each entry is (text, final_score, whisper_score, lm_score), sorted best first
  # return the text of the highest-scoring hypothesis for each input in the batch
  result = [combined_score[0][0] for combined_score in combined_scores]
  return result
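
Both shallow-fusion functions share the same rescoring rule: `final_score = whisper_score + lm_weight * lm_score`, followed by a sort. The sketch below isolates that rule so it can be tried without Whisper or KenLM. The function `toy_lm_score`, its vocabulary, and the scores in `nbest` are made up for illustration; in the notebook the LM score comes from `kenlm.LanguageModel.score`.

```python
from typing import Callable, List, Tuple

Hypothesis = Tuple[str, float]  # (text, ASR score such as avg_logprob)


def rescore_nbest(nbest: List[Hypothesis],
                  lm_score: Callable[[str], float],
                  lm_weight: float) -> List[Tuple[str, float, float, float]]:
    """Return (text, final, asr, lm) tuples sorted by final score, best first."""
    scored = []
    for text, asr in nbest:
        lm = lm_score(text)
        scored.append((text, asr + lm_weight * lm, asr, lm))
    return sorted(scored, key=lambda t: t[1], reverse=True)


# Hypothetical stand-in for a real LM: known words cost -1, unknown words -5.
TOY_VOCAB = {"चैनल", "चार", "लगाइये।"}


def toy_lm_score(text: str) -> float:
    return sum(-1.0 if word in TOY_VOCAB else -5.0 for word in text.split())


nbest = [("चैनर चार लगाइए।", -0.20),   # ASR prefers this one
         ("चैनल चार लगाइये।", -0.25)]

print(rescore_nbest(nbest, toy_lm_score, lm_weight=0.0)[0][0])   # ASR-only choice
print(rescore_nbest(nbest, toy_lm_score, lm_weight=0.01)[0][0])  # LM flips the ranking
```

The two scores live on different scales (a per-token average from the recogniser versus a whole-sentence LM log-probability), which is one reason a small `lm_weight` such as 0.01 can already change the ranking.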



# Decoding without LM as a baseline

In [18]:
## Decode the test set with the baseline model and collect hypotheses and references for WER and CER

hypotheses = []
references = []

# decoding parameters you can try playing around with to reach the optimal WER
beam_size = 5

for mel, text in tqdm(loader):
    # results = model.decode(mels, options)
    # hypotheses.extend([result.text for result in results])
    result = decode_baseline(model, mel, beam_size)
    hypotheses.extend(result)
    references.extend(text)

baseline_df = pd.DataFrame(dict(hypothesis=hypotheses, reference=references))
baseline_df

100%|██████████| 31/31 [05:28<00:00, 10.58s/it]


Unnamed: 0,hypothesis,reference
0,हरियाणाः मनोहरलाल खट्टर की कैबिनेट में गोपाल-क...,हरियाणा: मनोहर लाल खट्टर की कैबिनेट में गोपाल ...
1,"कासकंज हेंसाः आरोपी राहत कोरैश्टी भी गिरफ्तार,...","कासगंज हिंसा: आरोपी राहत कुरैशी भी गिरफ्तार, प..."
2,क्या आपने देखा राधिका आपते के फिल्म फोबिया काट...,क्या आपने देखा राधिका आप्टे की फिल्म 'फोबिया' ...
3,इनफोसिव सॉफिस में महिला इंजीनियर की कंप्लूटर क...,इंफोसिस ऑफिस में महिला इंजीनियर की कंप्यूटर के...
4,कॉम्रिस की उम्मीदें परवान चलाने को पद्यात्रा क...,कांग्रेस की उम्मीदें परवान चढ़ाने को पदयात्रा ...
5,मध्य प्रदेशः चव्थी ट्रेन में यूवती से रेप,मध्य प्रदेश: चलती ट्रेन में युवती से रेप
6,"दिल्लीः महेराशेफ से रेप की कोशिश, नाकाम हुआ तो...","दिल्ली: महिला शेफ से रेप की कोशिश, नाकाम हुआ त..."
7,टॉम ने तो दरवाज़ा बंद तक नहीं किया।,टॉम ने तो दरवाज़ा बंद तक नहीं किया।
8,पति के खार्राटों के कारण पत्नी गांवाती है तीन ...,पति के खर्राटों के कारण पत्नी गंवाती है तीन सप...
9,चैनर चार लगाइए।,चैनल चार लगाइये।


In [22]:
wer = jiwer.wer(list(baseline_df["reference"]), list(baseline_df["hypothesis"]))
cer = jiwer.cer(list(baseline_df["reference"]), list(baseline_df["hypothesis"]))

print(f"WER: {wer * 100:.2f} %")
print(f"CER: {cer * 100:.2f} %")

WER: 37.93 %
CER: 13.06 %
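
For intuition about what these two numbers measure, here is a minimal sketch of corpus-level WER and CER as edit distance divided by reference length. It is an illustration only: `jiwer` applies its own default transforms, so its values can differ slightly from this simplified version, and the helper names below are hypothetical.

```python
from typing import Callable, List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) between token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (r != h)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1]


def error_rate(references: List[str], hypotheses: List[str],
               tokenize: Callable[[str], Sequence[str]]) -> float:
    """Total edit errors over all utterances divided by total reference tokens."""
    errors = sum(edit_distance(tokenize(r), tokenize(h))
                 for r, h in zip(references, hypotheses))
    total = sum(len(tokenize(r)) for r in references)
    return errors / total


refs = ["चैनल चार लगाइये।"]
hyps = ["चैनर चार लगाइए।"]

wer = error_rate(refs, hyps, str.split)  # word tokens
cer = error_rate(refs, hyps, list)       # one token per Unicode code point
print(f"WER: {wer * 100:.2f} %")         # WER: 66.67 %
```

Two of the three reference words differ, so the WER for this single utterance is 2/3, even though most characters are correct. That gap is typical of the table above, where CER is much lower than WER.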


# Decoding with LM deep fusion

In [None]:
## Decode the test set with LM deep fusion and collect hypotheses and references for WER and CER

hypotheses = []
references = []

lm_path = '/content/language_model_3p0.bin'

# decoding parameters you can try playing around with to reach the optimal WER
beam_size = 5
lm_alpha = 1.0

for mel, text in tqdm(loader):
    # results = model.decode(mels, options)
    # hypotheses.extend([result.text for result in results])
    # result = decode_baseline(model, mel, beam_size)
    result = decode_deep_fusion(model, mel, beam_size=beam_size, lm_path=lm_path, lm_alpha=lm_alpha)
    hypotheses.extend(result)
    references.extend(text)

deep_fusion_df = pd.DataFrame(dict(hypothesis=hypotheses, reference=references))
deep_fusion_df

 10%|▉         | 3/31 [01:28<13:32, 29.01s/it]

In [None]:
wer = jiwer.wer(list(deep_fusion_df["reference"]), list(deep_fusion_df["hypothesis"]))
cer = jiwer.cer(list(deep_fusion_df["reference"]), list(deep_fusion_df["hypothesis"]))

print(f"WER: {wer * 100:.2f} %")
print(f"CER: {cer * 100:.2f} %")

# Decoding with shallow fusion (Beam Search)

In [None]:
## Decode the test set with shallow fusion (beam search) and collect hypotheses and references for WER and CER

hypotheses = []
references = []

lm_path = '/content/language_model_3p0.bin'

# decoding parameters you can try playing around with to reach the optimal WER
beam_size = 5
lm_weight = 0.01

for mel, text in tqdm(loader):
    # results = model.decode(mels, options)
    # hypotheses.extend([result.text for result in results])
    result = decode_shallow_fusion_beam_search(model, mel, beam_size=beam_size, lm_path=lm_path, lm_weight=lm_weight)
    hypotheses.extend(result)
    references.extend(text)

shallow_fusion_beam_search_df = pd.DataFrame(dict(hypothesis=hypotheses, reference=references))
shallow_fusion_beam_search_df

In [None]:
wer = jiwer.wer(list(shallow_fusion_beam_search_df["reference"]), list(shallow_fusion_beam_search_df["hypothesis"]))
cer = jiwer.cer(list(shallow_fusion_beam_search_df["reference"]), list(shallow_fusion_beam_search_df["hypothesis"]))

print(f"WER: {wer * 100:.2f} %")
print(f"CER: {cer * 100:.2f} %")
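
Since `lm_weight` only affects the rescoring step, one way to tune it is to decode once, cache each utterance's n-best list with both scores, and then sweep the weight offline. The sketch below shows that idea on made-up data. The cached tuples, the candidate weights, and the use of sentence error rate as the metric are all assumptions for illustration; in practice the metric would be WER on a held-out set.

```python
from typing import Dict, List, Tuple

# One cached candidate: (text, whisper_score, lm_score)
Candidate = Tuple[str, float, float]


def pick_best(nbest: List[Candidate], lm_weight: float) -> str:
    """Return the text maximising whisper_score + lm_weight * lm_score."""
    return max(nbest, key=lambda c: c[1] + lm_weight * c[2])[0]


def sweep_lm_weight(cached: List[Tuple[List[Candidate], str]],
                    weights: List[float]) -> Dict[float, float]:
    """Map each weight to the fraction of utterances whose top pick is not the reference."""
    results = {}
    for weight in weights:
        wrong = sum(pick_best(nbest, weight) != ref for nbest, ref in cached)
        results[weight] = wrong / len(cached)
    return results


# Toy cache: two utterances, two candidates each, plus the reference text.
cached = [
    ([("a b", -0.20, -11.0), ("a c", -0.25, -3.0)], "a c"),
    ([("x y", -0.10, -4.0), ("x z", -0.30, -3.0)], "x y"),
]

results = sweep_lm_weight(cached, [0.0, 0.01, 1.0])
best_weight = min(results, key=results.get)
print(results)       # {0.0: 0.5, 0.01: 0.0, 1.0: 0.5}
print(best_weight)   # 0.01
```

In this toy case a weight of 0 ignores the LM and a weight of 1.0 lets it override the recogniser, so the error is lowest in between. The weight should be chosen on the validation portion, not on the 31 test utterances.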

# Decoding with shallow fusion (Best of N hypothesis)

In [None]:
## Decode the test set with shallow fusion (best of N) and collect hypotheses and references for WER and CER

hypotheses = []
references = []

lm_path = '/content/language_model_3p0.bin'

# decoding parameters you can try playing around with to reach the optimal WER
best_of = 10
temperature = 0.3
lm_weight = 0.01

for mel, text in tqdm(loader):
    # results = model.decode(mels, options)
    # hypotheses.extend([result.text for result in results])
    result = decode_shallow_fusion_nbest(model, mel, best_of=best_of, lm_path=lm_path, temperature=temperature, lm_weight=lm_weight, debug=False)
    hypotheses.extend(result)
    references.extend(text)

shallow_fusion_nbest_df = pd.DataFrame(dict(hypothesis=hypotheses, reference=references))
shallow_fusion_nbest_df

In [None]:
wer = jiwer.wer(list(shallow_fusion_nbest_df["reference"]), list(shallow_fusion_nbest_df["hypothesis"]))
cer = jiwer.cer(list(shallow_fusion_nbest_df["reference"]), list(shallow_fusion_nbest_df["hypothesis"]))

print(f"WER: {wer * 100:.2f} %")
print(f"CER: {cer * 100:.2f} %")

# **Pointer 7: Final code to push our model to Hugging Face**

In [None]:
## Code to push the final model to Hugging Face