<a href="https://colab.research.google.com/github/chandan110791/hindiWhisper/blob/main/Whisper_LM_fusion_final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fusing LM with Whisper for lower WER
The aim is to fuse a BPE-level LM scores with the generated tokens scores while beam-search decoding in Whisper.

## **MILESTONE 1**:
Instantiate a Language Model to be integrated with Whisper.
The chosen LM is an n-gram language model, trained with [KenLM](https://github.com/kpu/kenlm) library.

### Step 1:
Write code to run an already available LM in a standalone manner and be able to give a score any input sequence.
**Chosen Model**: [Riva ASR Hindi LM](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/models/speechtotext_hi_in_lm/files?version=deployable_v3.1)


Download and build the KenLM toolkit

In [None]:
!wget -O - https://kheafield.com/code/kenlm.tar.gz | tar xz
!mkdir kenlm/build && cd kenlm/build && cmake .. && make -j2
!ls kenlm/build/bin

--2023-11-13 11:27:39--  https://kheafield.com/code/kenlm.tar.gz
Resolving kheafield.com (kheafield.com)... 35.196.63.85
Connecting to kheafield.com (kheafield.com)|35.196.63.85|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 491888 (480K) [application/x-gzip]
Saving to: ‘STDOUT’


2023-11-13 11:27:40 (1.67 MB/s) - written to stdout [491888/491888]

  Compatibility with CMake < 3.5 will be removed from a future version of
  CMake.

  Update the VERSION argument <min> value or use a ...<max> suffix to tell
  CMake that the project does not need compatibility with older versions.

[0m
-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check fo

Download the KenLM Python library

❗❗**Don't forget to restart the runtime after running this cell** ❗❗


In [None]:
!pip install https://github.com/kpu/kenlm/archive/master.zip

Collecting https://github.com/kpu/kenlm/archive/master.zip
  Downloading https://github.com/kpu/kenlm/archive/master.zip (553 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m553.6/553.6 kB[0m [31m6.2 MB/s[0m eta [36m0:00:00[0m
[?25h  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
Building wheels for collected packages: kenlm
  Building wheel for kenlm (pyproject.toml) ... [?25l[?25hdone
  Created wheel for kenlm: filename=kenlm-0.2.0-cp310-cp310-linux_x86_64.whl size=3184345 sha256=6cc9a5db8180947baa0ae0b8fd35c040b77ff8e307bf437b2fafd6295101aa79
  Stored in directory: /tmp/pip-ephem-wheel-cache-9az3r8uo/wheels/a5/73/ee/670fbd0cee8f6f0b21d10987cb042291e662e26e1a07026462
Successfully built kenlm
Installing collected packages: kenlm
Successfully installed kenlm-0.2.0


## ***Pointer 1:Please add comments on the model being downloaded and whats its purpose ***

Downloading the Hindi ASR n-gram language model from Nvidia which can be found [here](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/models/speechtotext_hi_in_lm/files?version=deployable_v3.1)

This will be used for fusion with Whisper.
I uploaded the binary version to a shareable location in my Gdrive.

In [None]:
# Download the binary LM from Gdrive
import gdown
url = "https://drive.google.com/uc?id=1-AspJVZRXcrFMuLKx8C4N7uo9PnQytcB"
output = "language_model_3p0.bin"
gdown.download(url, output, quiet=False)

Downloading...
From: https://drive.google.com/uc?id=1-AspJVZRXcrFMuLKx8C4N7uo9PnQytcB
To: /content/language_model_3p0.bin
100%|██████████| 280M/280M [00:01<00:00, 147MB/s]


'language_model_3p0.bin'

In [None]:
import kenlm
model = kenlm.LanguageModel('/content/language_model_3p0.bin')
print("This is a {}-gram model".format(model.order))

This is a 4-gram model


In [None]:
# Below are 2 pairs of sentences that sound exactly the same in hindi but one of them is incorrect (lexically or semantically)
# Generated using Bing Chat
book_correct = "मुझे यह किताब पसंद है।"
book_incorrect = "मुझे यह किताब पसन्द है।"

correct_score = model.score(book_correct)
incorrect_score = model.score(book_incorrect)
assert correct_score > incorrect_score
print(correct_score, incorrect_score)

-20.935253143310547 -21.663846969604492


In [None]:
sings_correct = "वह बहुत अच्छा गाता है।"
sings_incorrect = "वह बहुत अच्छा घाता है।"


correct_score = model.score(sings_correct)
incorrect_score = model.score(sings_incorrect)
assert correct_score > incorrect_score
print(correct_score, incorrect_score)

-19.827430725097656 -22.76061248779297


## ***Pointer 2:Please elaborate on the model being downloaded and whats its purpose . Also , why are we doing it two times  ***

This is unnecessary and can be skipped.

In [None]:
# # download the original arpa LM file for inspection
# url = "https://drive.google.com/uc?id=1-4xQ3YCtsyONtpccGjOD1s9FtHqBX7RL"
# output = "language_model_3p0.arpa"
# gdown.download(url, output, quiet=False)

Downloading...
From: https://drive.google.com/uc?id=1-4xQ3YCtsyONtpccGjOD1s9FtHqBX7RL
To: /content/language_model_3p0.arpa
100%|██████████| 3.18G/3.18G [00:33<00:00, 95.6MB/s]


'language_model_3p0.arpa'

$$:Please elaborate if we need to prepare a new arpa file for our model

We can use this model for the time being, and finetune it later with the text data from our training set.
However, I think it's not likely to come up with a better model than this, as this model is built by a team in Nvidia and it's probable that our training set is included in training this model.

In [None]:
# inspect the last 20 lines inside the LM source
!tail -20 language_model_3p0.arpa

-1.2548329	<s> दोनों ही झल्लाये
-0.2760634	<s> चौधरी के अशुभचिंतकों
-0.04973512	डालियों पर बैठी शुकमंडली
-0.07060567	मनुष्यों को उन्हें बेमुरौवत
-0.049646165	और कड़क कर बोलेमेरी
-0.04038189	निराश हो कर कहानहीं
-0.08863469	पड़ते ही वह अव्यवस्थितचित्त
-0.19321889	दोनों पक्षों से सवालजवाब
-0.051110353	झगड़ू साहु ने कहासमझू
-0.20675866	करें तो उनकी भलमनसी
-0.04876329	नीति को सराहता थाइसे
-0.06436408	<s> मित्रता की मुरझायी
-0.23735626	की गहराई से उपजतें
-0.17502813	पूर्णता की ओर बढातें
-0.18197767	जहाँ से अच्छा हिन्दोसिताँ
-0.04437429	हैं इसकी यह गुलसिताँ
-0.06926097	संतरी हमारा वह पासबाँ
-0.09434804	जिनके दम से रश्कएजनाँ

\end\


In [None]:
# some useful KenLM commands for future reference
# generate binary
# !kenlm/build/bin/build_binary dataset_tokenized_3gram.arpa dataset_tokenized_3gram.binary
# create a new LM
# !kenlm/build/bin/lmplz -o 3 --text dataset_tokenized.txt --arpa dataset_tokenized_3gram.arpa --discount_fallback
# !tail -20 dataset_tokenized_3gram.arpa

# Integrating the LM with Whisper

In [None]:
# !pip install openai-whisper
!pip install https://github.com/chandan110791/hindiWhisper/archive/master.zip

Collecting https://github.com/chandan110791/hindiWhisper/archive/master.zip
  Downloading https://github.com/chandan110791/hindiWhisper/archive/master.zip
[2K     [32m\[0m [32m7.4 MB[0m [31m16.9 MB/s[0m [33m0:00:00[0m
[?25h  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
Collecting triton==2.0.0 (from openai-whisper==20231106)
  Downloading triton-2.0.0-1-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (63.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m63.3/63.3 MB[0m [31m8.5 MB/s[0m eta [36m0:00:00[0m
Collecting tiktoken (from openai-whisper==20231106)
  Downloading tiktoken-0.5.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.0 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.0/2.0 MB[0m [31m74.0 MB/s[0m eta [36m0:00:00[0m
Collecting lit (from triton==2.0.0->openai-whisper==2

In [None]:
import whisper
import torch
import kenlm

In [None]:
model = whisper.load_model("small")

100%|████████████████████████████████████████| 461M/461M [00:04<00:00, 112MiB/s]


In [None]:
# Download a sample audio file from Gdrive
import gdown
url = "https://drive.google.com/uc?id=1kKeSvrZo8z5Rsp1q-h3GXpG7vHctKMcG"
output = "sample.wav"
gdown.download(url, output, quiet=False)

transcription = "ब्रूड बॉक्स लैंगस्ट्रॉथ छत्ते का एक अनिवार्य हिस्सा है।"

Downloading...
From: https://drive.google.com/uc?id=1kKeSvrZo8z5Rsp1q-h3GXpG7vHctKMcG
To: /content/sample.wav
100%|██████████| 197k/197k [00:00<00:00, 77.9MB/s]


In [None]:
audio = whisper.load_audio("/content/sample.wav")
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio).to(model.device)


In [None]:
# Download a sample audio file from Gdrive
import gdown
url = "https://drive.google.com/uc?id=1kKeSvrZo8z5Rsp1q-h3GXpG7vHctKMcG"
output = "sample.wav"
gdown.download(url, output, quiet=False)

transcription = "ब्रूड बॉक्स लैंगस्ट्रॉथ छत्ते का एक अनिवार्य हिस्सा है।"


In [None]:
audio = whisper.load_audio("/content/sample.wav")
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio).to(model.device)
options = whisper.DecodingOptions(fp16 = False, beam_size=5, without_timestamps=True, language="hi")
result = whisper.decode(model, mel, options)
baseline = result.text
baseline

## Baseline Whisper
Without nbest or LM integration

In [None]:

options = whisper.DecodingOptions(fp16 = False, beam_size=5, without_timestamps=True, language="hi")
result = whisper.decode(model, mel, options)
baseline = result.text
baseline

'ब्रूद बाँक्ष लंश्टोट छत्ते का एक अनिवार्य हिस्सा है'

# Decoding with LM integration

In [None]:
# adding the LM
import gdown
url = "https://drive.google.com/uc?id=1-AspJVZRXcrFMuLKx8C4N7uo9PnQytcB"
output = "language_model_3p0.bin"
gdown.download(url, output, quiet=False)

Downloading...
From: https://drive.google.com/uc?id=1-AspJVZRXcrFMuLKx8C4N7uo9PnQytcB
To: /content/language_model_3p0.bin
100%|██████████| 280M/280M [00:04<00:00, 67.3MB/s]


'language_model_3p0.bin'

$$:

## ***Pointer 4: Getting an error here while trying to run the code . Elaborate a bit on the what happening here . Can you please refer to the function with changes done  ***

The error was due to the fact that the imported whisper here is the original one, which doesn't include the changed that we did to the decoding options.

Replaced this:
```
!pip install openai-whisper
```
with this:
```
!pip install https://github.com/chandan110791/hindiWhisper/archive/master.zip
```

and the error is fixed.

In [None]:
options = whisper.DecodingOptions(fp16 = False, withlm=True, beam_size=5,
        patience=1.0, lm_path="/content/language_model_3p0.bin", lm_alpha=1.0, lm_beta=0.0,
        without_timestamps=True, language="hi")
decoding_withLM = whisper.decode(model, mel, options)


In [None]:
decoding_withLM.text

'ब्रूद बाँक्ष लंश्टोट छत्ते का एक अनिवार्य हिस्सा है'

# Decoding with nbest (beam search)

In [None]:
options = whisper.DecodingOptions(fp16 = False, beam_size=5, return_nbest = True, without_timestamps=True, language="hi")
nbest = whisper.decode(model, mel, options)


for candidate in nbest:
  print(candidate.text, candidate.avg_logprob)

ब्रूद बाँक्ष लंश्टोट छत्ते का एक अनिवार्य हिस्सा है -0.45291578358617324
ब्रूद बाँक्ष लांश्टोट छत्ते का एक अनिवार्य हिस्सा है -0.45523316981428763
ब्रूद बाँक्ष लंश्टोट छत्ते का एक अनिवार्य हिस्सा है. -0.47296941078315347
ब्रूद बाँक्ष लंश्टोट छत्ते का एक अनिवार्य हिस्सा हैं -0.48318856449450476
ब्रूद बाँक्ष लांश्टोट छत्टे का एक अनिवार्य हिस्सा है -0.46669308344523114


### Adding LM rescoring

In [None]:
lm_model = kenlm.LanguageModel('/content/language_model_3p0.bin')
print("This is a {}-gram model".format(lm_model.order))

This is a 4-gram model


In [None]:
nbest_with_lm_score = [(c.text, c.avg_logprob, lm_model.score(c.text)) for c in nbest]
nbest_with_lm_score

[('ब्रूद बाँक्ष लंश्टोट छत्ते का एक अनिवार्य हिस्सा है',
  -0.45291578358617324,
  -36.06303787231445),
 ('ब्रूद बाँक्ष लांश्टोट छत्ते का एक अनिवार्य हिस्सा है',
  -0.45523316981428763,
  -36.06303787231445),
 ('ब्रूद बाँक्ष लंश्टोट छत्ते का एक अनिवार्य हिस्सा है.',
  -0.47296941078315347,
  -44.10427474975586),
 ('ब्रूद बाँक्ष लंश्टोट छत्ते का एक अनिवार्य हिस्सा हैं',
  -0.48318856449450476,
  -37.30411148071289),
 ('ब्रूद बाँक्ष लांश्टोट छत्टे का एक अनिवार्य हिस्सा है',
  -0.46669308344523114,
  -37.54087829589844)]

In [None]:
lm_weight = 0.01
combined_scores = [(text, whisper_score + lm_score*lm_weight) for text, whisper_score, lm_score in nbest_with_lm_score]
combined_scores.sort(key=lambda t: t[1], reverse=True)
combined_scores

[('ब्रूद बाँक्ष लंश्टोट छत्ते का एक अनिवार्य हिस्सा है', -0.8135461623093178),
 ('ब्रूद बाँक्ष लांश्टोट छत्ते का एक अनिवार्य हिस्सा है', -0.8158635485374321),
 ('ब्रूद बाँक्ष लांश्टोट छत्टे का एक अनिवार्य हिस्सा है', -0.8421018664042155),
 ('ब्रूद बाँक्ष लंश्टोट छत्ते का एक अनिवार्य हिस्सा हैं', -0.8562296793016337),
 ('ब्रूद बाँक्ष लंश्टोट छत्ते का एक अनिवार्य हिस्सा है.', -0.9140121582807121)]

# Decoding with nbest (best of N hypothesis)

In [None]:
options = whisper.DecodingOptions(fp16 = False, best_of=10, return_nbest=True, without_timestamps=True, temperature=0.3, language="hi")
nbest_best_of_n_hyp = whisper.decode(model, mel, options)


In [None]:
for candidate in nbest_best_of_n_hyp:
  print(candidate.text, candidate.avg_logprob)

ब्रूद बाँच लंश्टोट छत्ते का एक अनिवार्य हिस्सा है -0.45742596898760113
ब्रूद बाओ्छ लंश्टोट छत्ते का एक अनिवार्ये हिस्चा है -0.5202431113032971
ब्रूड़ ब्रूड़ लंच्टोट चद्ते का एक अनीवार्य लिए हिस्चा है -0.5065439448637121
ब्रुध बाँच लंश्टोट चते का एक अनिवार्य हिस morally part of the Landstorch Chhattey. -0.9538334877260269
ब्रुद बाँच लांश्टोट छद्टे का एक अनिवार्य हिस्चा है -0.5101523081461589
ब्रूद बाँच लंच्टोड छत्टे का एक अनिवार्य लिए हिस्सा है -0.4865361798194147
ब्रूद बाँच लांश्टोट छत्ते का एक अनीवार्य हिस्था है -0.49694071144893254
ब्रूद बाँच लन्च्टोट चते का एक अनिवार्य हिस्चा है -0.5291237149919782
ब्रूद भोग्ष लंग स्थ्टोड शथ्टे का एक अनिवार्य हिस्सा है -0.5426437135726686
ब्रूद बाँच लंश्टोट छद्टे का एक अनिवारे हिस्चा है -0.49345003325363684


In [None]:
lm_model = kenlm.LanguageModel('/content/language_model_3p0.bin')
nbest_with_lm_score = [(c.text, c.avg_logprob, lm_model.score(c.text)) for c in nbest_best_of_n_hyp]
nbest_with_lm_score


[('ब्रूद बाँच लंश्टोट छत्ते का एक अनिवार्य हिस्सा है',
  -0.45742596898760113,
  -35.82866668701172),
 ('ब्रूद बाओ्छ लंश्टोट छत्ते का एक अनिवार्ये हिस्चा है',
  -0.5202431113032971,
  -49.43550491333008),
 ('ब्रूड़ ब्रूड़ लंच्टोट चद्ते का एक अनीवार्य लिए हिस्चा है',
  -0.5065439448637121,
  -54.91597366333008),
 ('ब्रुध बाँच लंश्टोट चते का एक अनिवार्य हिस morally part of the Landstorch Chhattey.',
  -0.9538334877260269,
  -81.96270751953125),
 ('ब्रुद बाँच लांश्टोट छद्टे का एक अनिवार्य हिस्चा है',
  -0.5101523081461589,
  -46.626407623291016),
 ('ब्रूद बाँच लंच्टोड छत्टे का एक अनिवार्य लिए हिस्सा है',
  -0.4865361798194147,
  -46.603515625),
 ('ब्रूद बाँच लांश्टोट छत्ते का एक अनीवार्य हिस्था है',
  -0.49694071144893254,
  -49.201133728027344),
 ('ब्रूद बाँच लन्च्टोट चते का एक अनिवार्य हिस्चा है',
  -0.5291237149919782,
  -46.626407623291016),
 ('ब्रूद भोग्ष लंग स्थ्टोड शथ्टे का एक अनिवार्य हिस्सा है',
  -0.5426437135726686,
  -42.747047424316406),
 ('ब्रूद बाँच लंश्टोट छद्टे का एक अनिव

In [None]:
lm_weight = 0.01
combined_scores_bestofNSampling = [(text, whisper_score + lm_score*lm_weight) for text, whisper_score, lm_score in nbest_with_lm_score]
combined_scores_bestofNSampling.sort(key=lambda t: t[1], reverse=True)
combined_scores_bestofNSampling

[('ब्रूद बाँच लंश्टोट छत्ते का एक अनिवार्य हिस्सा है', -0.8157126358577182),
 ('ब्रूद बाँच लंच्टोड छत्टे का एक अनिवार्य लिए हिस्सा है',
  -0.9525713360694148),
 ('ब्रूद भोग्ष लंग स्थ्टोड शथ्टे का एक अनिवार्य हिस्सा है',
  -0.9701141878158327),
 ('ब्रुद बाँच लांश्टोट छद्टे का एक अनिवार्य हिस्चा है', -0.976416384379069),
 ('ब्रूद बाँच लांश्टोट छत्ते का एक अनीवार्य हिस्था है', -0.9889520487292061),
 ('ब्रूद बाँच लन्च्टोट चते का एक अनिवार्य हिस्चा है', -0.9953877912248883),
 ('ब्रूद बाँच लंश्टोट छद्टे का एक अनिवारे हिस्चा है', -1.0002398129167227),
 ('ब्रूद बाओ्छ लंश्टोट छत्ते का एक अनिवार्ये हिस्चा है', -1.014598160436598),
 ('ब्रूड़ ब्रूड़ लंच्टोट चद्ते का एक अनीवार्य लिए हिस्चा है',
  -1.0557036814970129),
 ('ब्रुध बाँच लंश्टोट चते का एक अनिवार्य हिस morally part of the Landstorch Chhattey.',
  -1.7734605629213394)]

# **RUN THE NOTEBOOK STARTING HERE**

# The next sections include:
- Importing a finetuned huggingface model to our codebase
- Running the evaluation of the 4 decoding variants on the imported dataset
- Computing the WER and CER for each of the decoding variants

## The decoding strategies are:
1. baseline decoding without LM
2. Deep fusion of the LM with the token probabilities during beam search decoding (try to find the optimal value for lm_alpha)
3. Shallow fusion by rescoring the N best candidates generated through beam search (try to find the optimal value for lm_weight)
4. Shallow fusion by rescoring the N best candidates generated through greedy decoding using best of N sampling (try to find the optimal value for lm_weight and temperature)





In [61]:
!pip install https://github.com/chandan110791/hindiWhisper/archive/master.zip
!pip install https://github.com/kpu/kenlm/archive/master.zip
!pip install transformers
!pip install datasets>=2.6.1
!pip install git+https://github.com/huggingface/transformers
# !pip install librosa
# !pip install evaluate>=0.30
!pip install jiwer
!pip install gradio
!pip install accelerate -U

Collecting https://github.com/chandan110791/hindiWhisper/archive/master.zip
  Downloading https://github.com/chandan110791/hindiWhisper/archive/master.zip
[2K     [32m|[0m [32m7.5 MB[0m [31m14.0 MB/s[0m [33m0:00:00[0m
[?25h  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
Collecting https://github.com/kpu/kenlm/archive/master.zip
  Using cached https://github.com/kpu/kenlm/archive/master.zip (553 kB)
  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-yvrqh2x6
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers /tmp/pip-req-build-yvrqh2x6
  Resolved https://gith

In [1]:
!pip install https://github.com/chandan110791/hindiWhisper/archive/master.zip

Collecting https://github.com/chandan110791/hindiWhisper/archive/master.zip
  Using cached https://github.com/chandan110791/hindiWhisper/archive/master.zip
  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone


$$

$$:

## ***Pointer 5: its looking for a file named as pytorchmodel.bin , it does not exists hence note  able to integrate the model : CKSINGH/whisper-small-hi-iiib tuned at : https://colab.research.google.com/github/chandan110791/HindiSpeechRecognition/blob/main/fine_tune_whisper_iiitb.ipynb#scrollTo=d7030622-caf7-4039-939b-6195cdaa2585  . Any inputs ***

The reason for this is that the model file after finetuning must be pickled:
so pytorch_model.bin should be found in this path:
https://huggingface.co/CKSINGH/whisper-small-hi-iiib/tree/main

Will use this one https://huggingface.co/sanchit-gandhi/whisper-small-hi/tree/main to make the required changes to the code.

## In the next cell, you need to change the repo id to your repo after generating the pytorch_model.bin (pickle format of the finetuned model)

In [2]:
import whisper
import kenlm
from transformers import WhisperForConditionalGeneration
import torch
import jiwer
from tqdm import tqdm
import os
import pandas as pd
# using pickle to serialize the map_dict
import pickle

from huggingface_hub import hf_hub_download
filename = "pytorch_model.bin"


hf_hub_download(repo_id="CKSINGH/whisper-small-hi-iiib", filename=filename, local_dir="/content/")
# hf_hub_download(repo_id="sanchit-gandhi/whisper-small-hi", filename=filename, local_dir="/content/")


'/content/pytorch_model.bin'

# Here you need to change the model size from "small" to "medium" if you finetuned the whisper-medium (as the one in this repo: https://huggingface.co/CKSINGH/whisper-small-hi-iiib/tree/main)

In [3]:
# to enable verbose printing of exceptions (+ layers matching name)
DEBUG = False

# set to True if your custom model has been trained using DDP (multi-gpu)
# as in my case, in the custom HF model, keys have a prefix (model.)
# it should come from the fact that I have trained on a milti-gpu machine, using DDP
DDP_TRAINED = False

# if DDP we have to add a prefix to match with the HF state_dict
if DDP_TRAINED:
    PREFIX = "model."
else:
    PREFIX = ""

MODEL_SIZE = "small"

# the device where you're running this code
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# the name of the file with your fine-tuned model
FINETUNED_MODEL = "pytorch_model.bin"

# the name of the file for the serialized map_dict
# a different name, to avoid overwrite it
FILE_DICT = MODEL_SIZE + "_map_dict8.pkl"


In [4]:

def import_hf_model(finetuned_model, model_size, device, file_dict, debug=False):

  def has_numbers(inputString):
      return any(char.isdigit() for char in inputString)

  # next functions are used to make sanity checks for the mappings

  # get if it is encoder or decoder
  def extract_function(key_name):
      # encoder or decoder is the first part of the key
      first_part = key_name.split(".")[0]

      key_func = None
      if first_part in ["enconder", "decoder"]:
          key_func = first_part

      return key_func

  def extract_layer_num(key_name):
      # layer num is the third piece
      layer_num = None

      if has_numbers(key_name):
          layer_num = key_name.split(".")[2]

      return layer_num

  # check that the two keys are for layers
  # with the same function
  # (both encoder or both decoder)
  # and have the same layer number
  # this way we are super-safe (I think)
  def sanity_check(key1, key2):
      is_ok = True

      # check same func (encoder or decoder)
      func1 = extract_function(key1)
      func2 = extract_function(key2)

      if func1 != func2:
          print(f"Warning: layers seem to have different functions: {key1},{key2}")
          is_ok = False

      # check same layer_num
      layer1 = extract_layer_num(key1)
      layer2 = extract_layer_num(key2)

      if layer1 != layer2:
          print(f"Warning: layers seem to have different numbers: {key1},{key2}")
          is_ok = False

      return is_ok

  if not os.path.isfile(file_dict):
    # Vanilla means: not custom trained
    print()
    print("Loading vanilla Whisper model")
    model = whisper.load_model(model_size, device=device)

    print("Loading vanilla HF Model")
    hugging_face_model = WhisperForConditionalGeneration.from_pretrained(
        "openai/whisper-" + model_size
    ).to(device)

    # extract state-dict from both
    state_d_openai = model.state_dict()
    state_d_huggingface = hugging_face_model.model.state_dict()

    # build the mapping between keys...
    map_dict = {}
    print("Matching layers...")

    # for every layer in OpenAI model
    n_sanity_ok = 0

    #
    # here we're considering the cartesian product of the two state dict and try to match
    # rules applied:
    # 1. the two layers have the same shape
    # 2. the two layer have the same parameters' values
    # 3. we apply sanity check (see function above)
    #
    for k in tqdm(state_d_openai):
        # find a layer in the HF model, check with j
        for j in state_d_huggingface:
            # where parameters have same shape and same values
            if state_d_huggingface[j].shape == state_d_openai[k].shape:
                if torch.all(torch.eq(state_d_huggingface[j], state_d_openai[k])).item():
                    # found, register the mapping
                    map_dict[k] = j
                    # make some check and eventually print a warning
                    if sanity_check(k, j) == True:
                        n_sanity_ok += 1

                        # if you enable thsi print you can see the name of the layer
                        # chosen in the match and you will se that they have the same functions
                        if debug:
                            print(k, j)

                    break


    # check if we have matched every entry
    print("Check if we have matched every entry in state_dict...")
    print()
    print(f"Number of keys: {len(map_dict.keys())}")
    assert len(map_dict.keys()) == len(state_d_openai.keys()), "The match is not complete !"

    print(f"Number of sanity_check ok: {n_sanity_ok}")
    print()

    print("Match is complete !!!")
    print()


    # serialize the map_dict to file
    print("Serializing map_dict...")

    with open(file_dict, "wb") as f:
        pickle.dump(map_dict, f)
        f.close()

    print(f"map_dict saved as: {file_dict}...")
    print()

  else:
    # loading with match keys
    # restart from pickle file
    print("Reloading map_dict...")
    print()
    with open(file_dict, "rb") as f:
        map_dict = pickle.load(f)

  # loading fine-tuned dict
  print("Loading fine tuned dict...")

  # added map_location to handle the fact that the custom model has been trained on GPU
  state_dict_finetuned = torch.load(finetuned_model, map_location=torch.device(device))

  print(state_dict_finetuned.keys())
  # build the state_dict to be used
  # take the key name from standard (OpenAI) and the value from finetuned (HF)
  print("Rebuild the state dict...")
  new_state_dict = {}
  n_except = 0
  for k in tqdm(map_dict.keys()):
      try:
        # You must add "model." if you have used DDP in custom training
        # see DDP_TRAINED above
        # PREFIX is added to a HF fine-tuned 8with DDP). It is not in vanulla HF models
        new_state_dict[k] = state_dict_finetuned[PREFIX + map_dict[k]]
      except Exception as ex:
        n_except += 1

        if debug:
            print("exception")
            print(PREFIX + map_dict[k])




  msg_err = f"Rebuild state dict failed, {n_except} pick failed"
  assert n_except == 0, msg_err



  print()
  print("Loading the final model...")
  model.load_state_dict(new_state_dict)
  return model

In [5]:
model = import_hf_model(finetuned_model=FINETUNED_MODEL, debug= False, model_size=MODEL_SIZE, device=DEVICE, file_dict=FILE_DICT)


Loading vanilla Whisper model
Loading vanilla HF Model
Matching layers...


100%|██████████| 479/479 [00:01<00:00, 265.04it/s]


Check if we have matched every entry in state_dict...

Number of keys: 479
Number of sanity_check ok: 479

Match is complete !!!

Serializing map_dict...
map_dict saved as: small_map_dict8.pkl...

Loading fine tuned dict...
odict_keys(['encoder.conv1.weight', 'encoder.conv1.bias', 'encoder.conv2.weight', 'encoder.conv2.bias', 'encoder.embed_positions.weight', 'encoder.layers.0.self_attn.k_proj.weight', 'encoder.layers.0.self_attn.v_proj.weight', 'encoder.layers.0.self_attn.v_proj.bias', 'encoder.layers.0.self_attn.q_proj.weight', 'encoder.layers.0.self_attn.q_proj.bias', 'encoder.layers.0.self_attn.out_proj.weight', 'encoder.layers.0.self_attn.out_proj.bias', 'encoder.layers.0.self_attn_layer_norm.weight', 'encoder.layers.0.self_attn_layer_norm.bias', 'encoder.layers.0.fc1.weight', 'encoder.layers.0.fc1.bias', 'encoder.layers.0.fc2.weight', 'encoder.layers.0.fc2.bias', 'encoder.layers.0.final_layer_norm.weight', 'encoder.layers.0.final_layer_norm.bias', 'encoder.layers.1.self_attn.k_pr

100%|██████████| 479/479 [00:00<00:00, 1075520.14it/s]


Loading the final model...





In [6]:
model

Whisper(
  (encoder): AudioEncoder(
    (conv1): Conv1d(80, 768, kernel_size=(3,), stride=(1,), padding=(1,))
    (conv2): Conv1d(768, 768, kernel_size=(3,), stride=(2,), padding=(1,))
    (blocks): ModuleList(
      (0-11): 12 x ResidualAttentionBlock(
        (attn): MultiHeadAttention(
          (query): Linear(in_features=768, out_features=768, bias=True)
          (key): Linear(in_features=768, out_features=768, bias=False)
          (value): Linear(in_features=768, out_features=768, bias=True)
          (out): Linear(in_features=768, out_features=768, bias=True)
        )
        (attn_ln): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): Sequential(
          (0): Linear(in_features=768, out_features=3072, bias=True)
          (1): GELU(approximate='none')
          (2): Linear(in_features=3072, out_features=768, bias=True)
        )
        (mlp_ln): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      )
    )
    (ln_post): LayerNorm((768,), eps=1e-0

In [None]:
from huggingface_hub import notebook_login
# #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
notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

$$:

$$

$$:

## ***Pointer 6: We need to evaluate and calculate the WER,CER  on the Test data shared using our final model . Below are some boiler plates code to do the same  ***



In [7]:
#Prepare Test Data for calculating Wer and CER

import datasets
# import the load_dataset function
from datasets import load_dataset

# specify the URL directory and the data files
# load the dataset from the URL directory


# datasets.config.DEFAULT_MAX_BATCH_SIZE = 10
test_dataset = load_dataset("/content/datadownload.py")

In [8]:
test_dataset = test_dataset.remove_columns(["age", "client_id", "down_votes", "gender", "locale", "path", "segment", "up_votes","accents","variant"])


In [None]:
test_dataset

DatasetDict({
    train: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 4630
    })
    test: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 3072
    })
    validation: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 2416
    })
    other: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 3767
    })
    validated: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 10173
    })
    invalidated: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 757
    })
})

In [9]:
combined_dataset = datasets.concatenate_datasets([test_dataset["test"]])


In [10]:
from datasets import load_dataset, concatenate_datasets, DatasetDict, load_metric

# Define the split ratios
#train_split = 0.75  # 80% of the data
validation_split = 0.99  # 10% of the data
test_split = 0.1  # 10% of the data

# Compute the number of samples for each split
num_samples = len(combined_dataset)
num_validation = int(validation_split * num_samples)
num_test = num_samples - num_validation  # Remaining 10%

# Split the combined dataset
validation_dataset = combined_dataset.select(indices=list(range(num_validation)))
test_dataset = combined_dataset.select(indices=list(range(num_validation, num_samples)))

# If you want to organize the split datasets in a DatasetDict for convenience:
split_test_datasets = DatasetDict({
    'validation': validation_dataset,
    'test': test_dataset
})

# Verify the resulting datasets
print(f'Validation Dataset: {len(validation_dataset)} samples')
print(f'Test Dataset: {len(test_dataset)} samples')

Validation Dataset: 3041 samples
Test Dataset: 31 samples


In [None]:
## you can use only 10% of the data to perform evaluation

In [31]:
split_test_datasets["test"][0]


{'audio': {'path': '/root/.cache/huggingface/datasets/downloads/extracted/9274dfa08465814d9b528cc161d00ba351615d0fffa64a36a31ba4eef3620161/cv-corpus-15.0-2023-09-08/hi/Zclips/common_voice_hi_36482276.mp3',
  'array': array([-8.88178420e-16, -1.50990331e-14, -4.44089210e-15, ...,
          2.29964789e-05,  2.15262316e-05,  1.06726802e-05]),
  'sampling_rate': 48000},
 'sentence': 'हरियाणा: मनोहर लाल खट्टर की कैबिनेट में गोपाल कांडा को नहीं मिली जगह'}

Evaluation Metrics on the Test data

In [11]:
class customDataset(torch.utils.data.Dataset):
    """
    A simple class to wrap LibriSpeech and trim/pad the audio to 30 seconds.
    It will drop the last few seconds of a very small portion of the utterances.
    """
    def __init__(self, dataset, device=DEVICE):
        self.dataset = dataset
        self.device = device

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, item):
        # audio, sample_rate, text, _, _, _ = self.dataset[item]
        # assert sample_rate == 16000
        # audio = whisper.pad_or_trim(audio.flatten()).to(self.device)
        # mel = whisper.log_mel_spectrogram(audio)
        audio = self.dataset[item]['audio']
        sentence = self.dataset[item]['sentence']
        path = audio['path']
        # array = audio['array']
        # sampling_rate = audio['sampling_rate']
        audio = whisper.load_audio(path)
        audio = whisper.pad_or_trim(audio)
        mel = whisper.log_mel_spectrogram(audio).to(model.device)
        return (mel, sentence)

dataset = customDataset(split_test_datasets["test"])
loader = torch.utils.data.DataLoader(dataset, batch_size=1)

In [62]:
def decode_baseline(model, mel, beam_size):
  """
    This function performs the transcription with Whisper to provide a baseline without LM fusion

    - model: The Whisper model used for transcription.
    - mel: Represents Mel-spectrogram data, likely input features for the model.
    - beam_size: An integer specifying the size of the beam for beam search decoding.
  """
  if mel.shape[0] != 1:
    mel = torch.unsqueeze(mel, 0)
  options = whisper.DecodingOptions(fp16 = False, beam_size=beam_size, without_timestamps=True, language="hi")
  result = whisper.decode(model, mel, options)
  result = [r.text for r in result]
  return result

def decode_deep_fusion(model, mel, beam_size, lm_path, lm_alpha):
  """
    This function performs the deep fusion of the LM with Whisper during the beam search decoding step

    - model: The Whisper model used for transcription.
    - mel: Represents Mel-spectrogram data, likely input features for the model.
    - beam_size: An integer specifying the size of the beam for beam search decoding.
    - lm_path: A string representing the path to the language model file used for fusion.
    - lm_alpha: A numerical value representing the weight assigned to the language model scores during deep fusion.
  """
  if mel.shape[0] != 1:
    mel = torch.unsqueeze(mel, 0)
  options = whisper.DecodingOptions(fp16 = False, withlm=True, beam_size=beam_size,
        patience=1.0, lm_path=lm_path, lm_alpha=lm_alpha, lm_beta=0.0,
        without_timestamps=True, language="hi")
  result = whisper.decode(model, mel, options)
  result = [r.text for r in result]
  return result

def decode_shallow_fusion_beam_search(model, mel, beam_size, lm_path, lm_weight, debug=False):
  """
    this function performs shallow fusion using best of N hypothesis (decoding)
    by combining the scores of whisper and the language model score (which gets weighted by the lm_weight factor)

    - model: The Whisper model used for transcription.
    - mel: Represents Mel-spectrogram data, likely input features for the model.
    - beam_size: An integer specifying the size of the beam for beam search decoding.
    - lm_path: A string representing the path to the language model file used for fusion.
    - lm_weight: A numerical value representing the weight assigned to the language model scores during shallow fusion.
    - debug: A boolean flag (optional) indicating whether to print debug information (default is False).
      Useful for inspecting the outputs with different lm_weights for finding the optimal value for lm_weight

  """
  # if testing with a single utterance without a dataloader
  if mel.shape[0] != 1:
    mel = torch.unsqueeze(mel, 0)
  options = whisper.DecodingOptions(fp16 = False, beam_size=beam_size, return_nbest = True, without_timestamps=True, language="hi")
  nbests = whisper.decode(model, mel, options)

  lm_model = kenlm.LanguageModel(lm_path)
  combined_scores = []

  for nbest in nbests:
    nbest_with_lm_score = [(c.text, c.avg_logprob, lm_model.score(c.text)) for c in nbest]
    combined_score = [(text, whisper_score + lm_score*lm_weight, whisper_score, lm_score) for text, whisper_score, lm_score in nbest_with_lm_score]
    combined_score.sort(key=lambda t: t[1], reverse=True)
    combined_scores.append(combined_score)
  if debug:
    print(combined_scores)
  # text, final_score, whisper_score, lm_score = combined_scores[0]
  # return the highest score element for each input in the batch
  #result = [combined_score[0] for combined_score in combined_scores]
  result = [hyp for hyp, comb_score, w_score, lm_score in combined_scores]
  #return text
  return result[0]

def decode_shallow_fusion_nbest(model, mel, best_of, lm_path, temperature, lm_weight, debug=False):
  """
    this function performs shallow fusion using best of N hypothesis (greedy decoding)
    by combining the scores of whisper and the language model score (which gets weighted by the lm_weight factor)

    - model: The Whisper model used for transcription.
    - mel: Represents Mel-spectrogram data, likely input features for the model.
    - best_of: An integer specifying the number of best hypotheses to consider during decoding.
    - lm_path: A string representing the path to the language model used for fusion.
    - temperature: A numerical value indicating the temperature parameter used during sampling. Higher temperature corresponds to more variation in the n best list
    - lm_weight: A numerical value representing the weight assigned to the language model scores during shallow fusion.
    - debug: A boolean flag (optional) indicating whether to print debug information (default is False).
      Useful when finding the optimal value for the lm_weight
  """
  # if testing with a single utterance without a dataloader
  if mel.shape[0] != 1:
    mel = torch.unsqueeze(mel, 0)
  options = whisper.DecodingOptions(fp16 = False, best_of=best_of, return_nbest=True, without_timestamps=True, temperature=temperature, language="hi")
  nbests = whisper.decode(model, mel, options)

  lm_model = kenlm.LanguageModel(lm_path)
  combined_scores = []

  for nbest in nbests:
    nbest_with_lm_score = [(c.text, c.avg_logprob, lm_model.score(c.text)) for c in nbest]
    combined_score = [(text, whisper_score + lm_score*lm_weight, whisper_score, lm_score) for text, whisper_score, lm_score in nbest_with_lm_score]
    combined_score.sort(key=lambda t: t[1], reverse=True)
    combined_scores.append(combined_score)
  if debug:
    print(combined_scores)
  # text, final_score, whisper_score, lm_score = combined_scores[0]
  # return the highest score element for each input in the batch
  result = [combined_score[0] for combined_score in combined_scores]
  #return text
  return result[0]



# Decoding without LM as a baseline

In [None]:
##Code to perform call compute metrics on test dataset using our final model and calculate the WER and CER

hypotheses = []
references = []

# decoding parameters you can try playing around with to reach the optimal WER
beam_size = 5

for mel, text in tqdm(loader):
    # results = model.decode(mels, options)
    # hypotheses.extend([result.text for result in results])
    result = decode_baseline(model, mel, beam_size)
    hypotheses.extend(result)
    references.extend(text)

baseline_df = pd.DataFrame(dict(hypothesis=hypotheses, reference=references))
baseline_df

100%|██████████| 31/31 [00:53<00:00,  1.72s/it]


Unnamed: 0,hypothesis,reference
0,हरियाणा मनोहरलाल खट्टर की कैबिनेट में गोपाल का...,हरियाणा: मनोहर लाल खट्टर की कैबिनेट में गोपाल ...
1,कासकंज हिंसा आरोपी राहत कोरैशी भी गिरफ्तार पुल...,"कासगंज हिंसा: आरोपी राहत कुरैशी भी गिरफ्तार, प..."
2,क्या आपने देखा राधि का आपदे के फ़िल्म फोबिया क...,क्या आपने देखा राधिका आप्टे की फिल्म 'फोबिया' ...
3,इन्फोसिव सॉफिस में महिला इंजीनियर की कंप्यूटर ...,इंफोसिस ऑफिस में महिला इंजीनियर की कंप्यूटर के...
4,कांग्रेस की उम्मीदें परवान चढ़ाने को पदयात्रा ...,कांग्रेस की उम्मीदें परवान चढ़ाने को पदयात्रा ...
5,मध्य प्रदेश चलती ट्रेन में युगती से रेप,मध्य प्रदेश: चलती ट्रेन में युवती से रेप
6,दिल्ली महिरा सेफ से रेब की कोशिश नाकाम हुआ तो ...,"दिल्ली: महिला शेफ से रेप की कोशिश, नाकाम हुआ त..."
7,टॉम ने तो दरवाजा बंद तट नहीं किया,टॉम ने तो दरवाज़ा बंद तक नहीं किया।
8,पति के खर्राटों के कारण पत्नी गंवाती है तीन सप...,पति के खर्राटों के कारण पत्नी गंवाती है तीन सप...
9,चैनल चार लगाइए,चैनल चार लगाइये।


In [None]:

wer = jiwer.wer(list(baseline_df["reference"]), list(baseline_df["hypothesis"]))
cer = jiwer.cer(list(baseline_df["reference"]), list(baseline_df["hypothesis"]))

print(f"WER: {wer * 100:.2f} %")
print(f"CER: {cer * 100:.2f} %")

WER: 37.59 %
CER: 10.83 %


# Decoding with LM deep fusion

In [None]:
# Download the binary LM from Gdrive
import gdown
url = "https://drive.google.com/uc?id=1-AspJVZRXcrFMuLKx8C4N7uo9PnQytcB"
output = "language_model_3p0.bin"
gdown.download(url, output, quiet=False)

Downloading...
From: https://drive.google.com/uc?id=1-AspJVZRXcrFMuLKx8C4N7uo9PnQytcB
To: /content/language_model_3p0.bin
100%|██████████| 280M/280M [00:03<00:00, 81.6MB/s]


'language_model_3p0.bin'

In [None]:
##Code to perform call compute metrics on test dataset using our final model and calculate the WER and CER

hypotheses = []
references = []

lm_path = '/content/language_model_3p0.bin'

# decoding parameters you can try playing around with to reach the optimal WER
beam_size = 20
lm_alpha = 0.2

for mel, text in tqdm(loader):
    # results = model.decode(mels, options)
    # hypotheses.extend([result.text for result in results])
    # result = decode_baseline(model, mel, beam_size)
    result = decode_deep_fusion(model, mel, beam_size=beam_size, lm_path=lm_path, lm_alpha=lm_alpha)
    hypotheses.extend(result)
    references.extend(text)

deep_fusion_df = pd.DataFrame(dict(hypothesis=hypotheses, reference=references))
deep_fusion_df

100%|██████████| 31/31 [23:30<00:00, 45.51s/it]


Unnamed: 0,hypothesis,reference
0,हरियाणा मनोहर लाल खट्टर की कैबिनेट में गोपाल क...,हरियाणा: मनोहर लाल खट्टर की कैबिनेट में गोपाल ...
1,कासकंज हिंसा आरोपी राहत कोरैशी भी गिरफ्तार पुल...,"कासगंज हिंसा: आरोपी राहत कुरैशी भी गिरफ्तार, प..."
2,क्या आपने देखा राधि का आपदे के फ़िल्म फोबिया क...,क्या आपने देखा राधिका आप्टे की फिल्म 'फोबिया' ...
3,इन्फोसिव सॉफिस में महिला इंजीनियर की कंप्यूटर ...,इंफोसिस ऑफिस में महिला इंजीनियर की कंप्यूटर के...
4,कांग्रेस की उम्मीदें परवान चढ़ाने को पदयात्रा ...,कांग्रेस की उम्मीदें परवान चढ़ाने को पदयात्रा ...
5,मध्य प्रदेश चलती ट्रेन में युगती से रेप,मध्य प्रदेश: चलती ट्रेन में युवती से रेप
6,दिल्ली महिरा सेफ से रेब की कोशिश नाकाम हुआ तो ...,"दिल्ली: महिला शेफ से रेप की कोशिश, नाकाम हुआ त..."
7,टॉम ने तो दरवाजा बंद तट नहीं किया,टॉम ने तो दरवाज़ा बंद तक नहीं किया।
8,पति के खर्राटों के कारण पत्नी गंवाती है तीन सप...,पति के खर्राटों के कारण पत्नी गंवाती है तीन सप...
9,चैनल चार लगाइए,चैनल चार लगाइये।


# decoding parameters you can try playing around with to reach the optimal WER
beam_size = 15
lm_alpha = 0.5

WER: 37.24 %
CER: 10.83 %


# decoding parameters you can try playing around with to reach the optimal WER
beam_size = 15
lm_alpha = 1.5

WER: 37.24 %
CER: 10.83 %


In [None]:
wer = jiwer.wer(list(deep_fusion_df["reference"]), list(deep_fusion_df["hypothesis"]))
cer = jiwer.cer(list(deep_fusion_df["reference"]), list(deep_fusion_df["hypothesis"]))

print(f"WER: {wer * 100:.2f} %")
print(f"CER: {cer * 100:.2f} %")

WER: 36.55 %
CER: 10.76 %


# Decoding with shallow fusion (Beam Search)

In [13]:
# Download the binary LM from Gdrive
import gdown
url = "https://drive.google.com/uc?id=1-AspJVZRXcrFMuLKx8C4N7uo9PnQytcB"
output = "language_model_3p0.bin"
gdown.download(url, output, quiet=False)

Downloading...
From: https://drive.google.com/uc?id=1-AspJVZRXcrFMuLKx8C4N7uo9PnQytcB
To: /content/language_model_3p0.bin
100%|██████████| 280M/280M [00:01<00:00, 164MB/s]


'language_model_3p0.bin'

In [53]:
##Code to perform call compute metrics on test dataset using our final model and calculate the WER and CER

hypotheses = []
references = []

lm_path = '/content/language_model_3p0.bin'

# # decoding parameters you can try playing around with to reach the optimal WER
beam_size = 20
lm_weight = 0.05

for mel, text in tqdm(loader):
    # results = model.decode(mels, options)
    # hypotheses.extend([result.text for result in results])
    result = decode_shallow_fusion_beam_search(model, mel, beam_size=beam_size, lm_path=lm_path, lm_weight=lm_weight)
    hypotheses.extend(result)
    references.extend(text)
    #exit()



100%|██████████| 31/31 [01:56<00:00,  3.75s/it]


In [54]:
import jiwer

shallow_fusion_beam_search_df = pd.DataFrame(dict(hypothesis=hypotheses, reference=references))
wer = jiwer.wer(list(shallow_fusion_beam_search_df["reference"]), list(shallow_fusion_beam_search_df["hypothesis"]))
cer = jiwer.cer(list(shallow_fusion_beam_search_df["reference"]), list(shallow_fusion_beam_search_df["hypothesis"]))

print(f"WER: {wer * 100:.2f} %")
print(f"CER: {cer * 100:.2f} %")

ValueError: ignored

In [None]:
shallow_fusion_beam_search_df

Unnamed: 0,hypothesis,reference
0,तमिलनाडु में आतंकियों की मदद के आरोप में संदिग...,तमिलनाडु में आतंकियों की मदद के आरोप में संदिग...
1,"सेमीफाइनल में भी दिखेगा युवराज का जलवा, फिट घोषित","सेमीफाइनल में भी दिखेगा युवराज का जलवा, फिट घोषित"
2,"मोर्गन नॉटिंघम वनडे से निलंबित, बेयरस्टो को लग...","मोर्गन नॉटिंघम वनडे से निलंबित, बेयरस्टो को लग..."
3,"नोटबंदी पर 'पलटूराम': पहले किया सपोर्ट, फिर मा...","नोटबंदी पर 'पलटूराम': पहले किया सपोर्ट, फिर मा..."
4,"आंखों के लिए फायदेमंद है योग, ऐसे करें अभ्यास","आंखों के लिए फायदेमंद है योग, ऐसे करें अभ्यास"
...,...,...
763,"गाजियाबाद: लड़की की मिली लाश, रेप की आशंका","गाजियाबाद: लड़की की मिली लाश, रेप की आशंका"
764,जेट एयरवेज के कार्यवाहक सीईओ ने इस्तीफा दिया,जेट एयरवेज के कार्यवाहक सीईओ ने इस्तीफा दिया
765,"दिल्लीः लिफ्ट के बहाने कार लूट की कोशिश, ड्राइ...","दिल्लीः लिफ्ट के बहाने कार लूट की कोशिश, ड्राइ..."
766,डोकलाम विवाद के बाद भारत से लंबी दोस्ती करना च...,डोकलाम विवाद के बाद भारत से लंबी दोस्ती करना च...


In [None]:
!pip install jiwer




# Decoding with shallow fusion (Best of N hypothesis)

In [56]:
##Code to perform call compute metrics on test dataset using our final model and calculate the WER and CER

hypotheses = []
references = []

lm_path = '/content/language_model_3p0.bin'

# decoding parameters you can try playing around with to reach the optimal WER
best_of = 10
temperature = 0.3
lm_weight = 0.01

for mel, text in tqdm(loader):
    # results = model.decode(mels, options)
    # hypotheses.extend([result.text for result in results])
    result = decode_shallow_fusion_nbest(model, mel, best_of=best_of, lm_path=lm_path, temperature=temperature, lm_weight=lm_weight, debug=False)
    hypotheses.extend(result)
    references.extend(text)


shallow_fusion_nbest_df = pd.DataFrame(dict(hypothesis=hypotheses, reference=references))
shallow_fusion_nbest_df

100%|██████████| 31/31 [00:54<00:00,  1.75s/it]


Unnamed: 0,hypothesis,reference
0,हरियाणा: मनोहर लाल खट्टर की कैबिनेट में गोपाल ...,हरियाणा: मनोहर लाल खट्टर की कैबिनेट में गोपाल ...
1,"कासगंज हिंसा: आरोपी राहत कुरैशी भी गिरफ्तार, प...","कासगंज हिंसा: आरोपी राहत कुरैशी भी गिरफ्तार, प..."
2,क्या आपने देखा राधिका आप्टे की फिल्म 'फोबिया' ...,क्या आपने देखा राधिका आप्टे की फिल्म 'फोबिया' ...
3,इंफोसिस ऑफिस में महिला इंजीनियर की कंप्यूटर के...,इंफोसिस ऑफिस में महिला इंजीनियर की कंप्यूटर के...
4,कांग्रेस की उम्मीदें परवान चढ़ाने को पदयात्रा ...,कांग्रेस की उम्मीदें परवान चढ़ाने को पदयात्रा ...
5,मध्य प्रदेश: चलती ट्रेन में युवती से रेप,मध्य प्रदेश: चलती ट्रेन में युवती से रेप
6,"दिल्ली: महिला शेफ से रेप की कोशिश, नाकाम हुआ त...","दिल्ली: महिला शेफ से रेप की कोशिश, नाकाम हुआ त..."
7,टॉम ने तो दरवाज़ा बंद तक नहीं किया।,टॉम ने तो दरवाज़ा बंद तक नहीं किया।
8,पति के खर्राटों के कारण पत्नी गंवाती है तीन सप...,पति के खर्राटों के कारण पत्नी गंवाती है तीन सप...
9,चैनल चार लगाइये।,चैनल चार लगाइये।


In [47]:
len(references)

31

In [48]:
references

['हरियाणा: मनोहर लाल खट्टर की कैबिनेट में गोपाल कांडा को नहीं मिली जगह',
 'कासगंज हिंसा: आरोपी राहत कुरैशी भी गिरफ्तार, पुलिस पूछताछ में जुटी',
 "क्या आपने देखा राधिका आप्टे की फिल्म 'फोबिया' का टीजर",
 'इंफोसिस ऑफिस में महिला इंजीनियर की कंप्यूटर के तार से गला घोंटकर हत्या',
 'कांग्रेस की उम्मीदें परवान चढ़ाने को पदयात्रा करेंगे राहुल',
 'मध्य प्रदेश: चलती ट्रेन में युवती से रेप',
 'दिल्ली: महिला शेफ से रेप की कोशिश, नाकाम हुआ तो चौथी मंजिल से फेंका',
 'टॉम ने तो दरवाज़ा बंद तक नहीं किया।',
 'पति के खर्राटों के कारण पत्नी गंवाती है तीन सप्ताह की नींद',
 'चैनल चार लगाइये।',
 'आधी रात को दिल्ली हवाई अड्डे पर फायरिंग से सनसनी',
 'घुसपैठ पर सेना प्रमुख के बयान से मचा सियासी बवाल',
 'फोन पर पिता से इटली बात करते-करते बेटे ने दी जान',
 '...तो इस वजह से कपिल के शो पर नहीं जाएंगे ‘सुल्तान’',
 'कॉलेज ऑफ इवेंट्स एंड मीडिया, पुणे',
 "'मणिशंकर अय्यर टी स्टाल' में लीजिए 'नमो चाय' की चुस्कियां",
 'मुझे लगा मैं यहां अकेला हूं।',
 'अपनों के ही जाल में फंसते भुजबल',
 'श्रीदेवी की मौत की जांच संबंधी या

In [52]:
hypotheses

['हरियाणा मनोहर लाल खट्टर की कैबिनेट में गोपाल कांडा को नहीं मिली जगहा',
 -0.41793007135391236,
 -0.010053408145904542,
 -40.78766632080078,
 'कासकंज हिंसा आरोपी राहत कोरैशी भी गिरफ्तार पुलिस पूँछ ताछ में जुड़ी',
 -0.628240288016084,
 -0.018196228262665985,
 -61.0044059753418,
 'क्या आपने देखा राधि का आपदे के फ़िल्म फोबिया का टीजर',
 -0.4918291026970436,
 -0.020716126622824835,
 -47.111297607421875,
 'इन्फोसिव सॉफिस में महिला इंजीनियर की कंप्यूटर के तार से गला घोटकर हत्या',
 -0.5373387240074776,
 -0.04611252771841513,
 -49.12261962890625,
 'कांग्रेस की उम्मीदें परवान चढ़ाने को पदयात्रा करेंगे राहुल',
 -0.329933881405741,
 -0.006149167660623789,
 -32.37847137451172,
 'मध्य प्रदेश चलती ट्रेन में युगती से रेप',
 -0.2934632857008414,
 -0.005557707087560134,
 -28.790557861328125,
 'दिल्ली महिरा शेख से रेब की कोशिश नाकाम हुआ तो चौकी मंजर से फेंका',
 -0.623125996518491,
 -0.04700256817376436,
 -57.612342834472656,
 'टॉम ने तो दरवाजा बंद तट नहीं किया',
 -0.2719667205061668,
 -0.002617431565737

In [58]:
wer = jiwer.wer(list(shallow_fusion_nbest_df["reference"]), list(shallow_fusion_nbest_df["hypothesis"]))
cer = jiwer.cer(list(shallow_fusion_nbest_df["reference"]), list(shallow_fusion_nbest_df["hypothesis"]))

print(f"WER: {wer * 100:.2f} %")
print(f"CER: {cer * 100:.2f} %")

WER: 0.00 %
CER: 0.00 %


# **Pointer 7 ## Final code t o push our model to hugging face **

In [None]:
##Code to push the final model to hugging face

In [None]:
kwargs = {
    "language": "hi",
    "model_name": "Whisper integrated LM 2011",  # a 'pretty' name for our model
    "finetuned_from": "openai/whisper-medium",
    "tasks": "automatic-speech-recognition",
    "tags": "hf-asr-leaderboard",
}

In [None]:
model.push_to_hub(**kwargs)

AttributeError: ignored

**##Inferences of model vs BenchMarkModels -wave2wec and GoogleUSM**

**##Orignal  Whisper Model without any integration**

In [None]:
!pip install openai-whisper


Collecting openai-whisper
  Downloading openai-whisper-20231117.tar.gz (798 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m798.6/798.6 kB[0m [31m9.8 MB/s[0m eta [36m0:00:00[0m
[?25h  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
Collecting tiktoken (from openai-whisper)
  Downloading tiktoken-0.5.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.0 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.0/2.0 MB[0m [31m63.1 MB/s[0m eta [36m0:00:00[0m
Building wheels for collected packages: openai-whisper
  Building wheel for openai-whisper (pyproject.toml) ... [?25l[?25hdone
  Created wheel for openai-whisper: filename=openai_whisper-20231117-py3-none-any.whl size=801358 sha256=dd836e0126dac7c20ed84b460d598c223253b0b252cf89883aa6e023ff0895f8
  Stored in directory: /root/.cache/pip/wheels/d0/85/e1/9361b4cb

In [None]:
import whisper

# Load the model
model = whisper.load_model("small")  # You can choose different model sizes like "tiny", "small", "medium", "large"

# Transcribe the audio file
result = model.transcribe("/content/audio.wav",language="hi")

# Print the transcription
print(result["text"])


 अमाना नाम चंदिन कमार स्टिंग है, और हम राची से चारकंच्टेड है, आमार.


[link text](https://)**##Orignal Integrated Whisper Model without fusion**

In [None]:
#inferencing using audio files
##Code to perform call compute metrics on test dataset using our final model and calculate the WER and CER

hypotheses = []
references = []

lm_path = '/content/language_model_3p0.bin'

# # decoding parameters you can try playing around with to reach the optimal WER
beam_size = 20
lm_weight = 0.05
audio = whisper.load_audio("/content/audio.wav")
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio).to(model.device)
options = whisper.DecodingOptions(fp16 = False, beam_size=5, without_timestamps=True, language="hi")
result = whisper.decode(model, mel, options)
baseline = result.text
baseline

'हमारा नाम चंदन कुमार सिंह और हम राँची से झारखंड स्टेट याम'

**#Orignal Integrated Whisper Model with shallow fusion and beam search**

In [63]:
hypotheses = []
references = []

lm_path = '/content/language_model_3p0.bin'
# # decoding parameters you can try playing around with to reach the optimal WER
beam_size = 20
lm_weight = 0.01
audio = whisper.load_audio("/content/audio_2.wav")
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio).to(model.device)
# decode_baseline decode_shallow_fusion_beam_search
result = decode_shallow_fusion_beam_search(model, mel, beam_size=beam_size, lm_path=lm_path, lm_weight=lm_weight,debug=True)
result

[[('हमारा नाम चंदन कुमार सिंह और हम राँची से झारखंड स्टेट', -0.4159112863373338, -0.07522887514348615, -34.068241119384766), ('हमारा नाम चंदन कुमार सिंह और हम राज्य से झारखंड स्टेट याम', -0.4201896166801453, -0.06313570737838745, -35.70539093017578), ('हमारा नाम चंदन कुमार सिंह और हम राँची से झारखंड स्टेट याम', -0.4260845669762033, -0.019786234761847824, -40.62983322143555), ('हमारा नाम चंदन कुमार सिंह और हम रांची से झारखंड स्टेट याम', -0.43641326427459715, -0.0646193265914917, -37.17939376831055), ('हमारा नाम चंदन कुमार सिंह और हम राँची से झारखंड स्टेट हैम', -0.4581754334767659, -0.05169571240743001, -40.647972106933594), ('हमारा नाम चंदन कुमार सिंह और हम राँजी से झारखंड स्टेट याम', -0.45854578831156745, -0.049068630718794026, -40.947715759277344), ('हमारा नाम चन्दन कुमार सिंह और हम राँची से झारखंड स्टेट हैम', -0.4752447270565346, -0.06497426502040175, -41.02704620361328), ('हमारा नाम चंदन कुमार सिंह और हम राँजी से झारखंड स्टेट हैम', -0.4932946713765462, -0.08363612492879231, -40.9658

ValueError: ignored

**#Orignal Integrated Whisper Model with shallow fusion and greedy decoding:**

In [32]:
hypotheses = []
references = []

best_of = 10
temperature = 1.0
lm_weight = 1.0

lm_path = '/content/language_model_3p0.bin'
# # decoding parameters you can try playing around with to reach the optimal WER
beam_size = 20
lm_weight = 1.0
audio = whisper.load_audio("/content/audio_2.wav")
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio).to(model.device)
# decode_baseline decode_shallow_fusion_beam_search
result = decode_shallow_fusion_nbest(model, mel, best_of=best_of, lm_path=lm_path, temperature=temperature, lm_weight=lm_weight, debug=False)
result

('हमारा नाम चन्दन कुमार सिंह और हम रांची से झारखंड स्टेट याम',
 -37.63978749415914,
 -0.08131962916890129,
 -37.558467864990234)

**##Orignal Wave2wec**

In [None]:
import soundfile as sf
from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer
import torch

# Load pre-trained model and tokenizer
tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# Function to read and preprocess the audio file
def speech_file_to_array_fn(path):
    speech_array, sampling_rate = sf.read(path)
    return speech_array

# Function to perform inference
def asr_transcript(audio_file):
    # Load and preprocess the audio
    speech = speech_file_to_array_fn(audio_file)
    input_values = tokenizer(speech, return_tensors="pt", padding="longest").input_values

    # Perform inference
    with torch.no_grad():
        logits = model(input_values).logits

    # Get predicted IDs and decode them to text
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = tokenizer.batch_decode(predicted_ids)

    return transcription[0]

# Path to your audio file
audio_file = "/content/audio.wav"

# Perform inference and print the result
print("*******************wave 2 wec transcription***************** ")
print(asr_transcript(audio_file))


The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'Wav2Vec2CTCTokenizer'. 
The class this function is called from is 'Wav2Vec2Tokenizer'.
Some weights of the model checkpoint at facebook/wav2vec2-base-960h were not used when initializing Wav2Vec2ForCTC: ['wav2vec2.encoder.pos_conv_embed.conv.weight_g', 'wav2vec2.encoder.pos_conv_embed.conv.weight_v']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2

*******************wave 2 wec transcription***************** 
CHAMMUSHAA U CHACHA TRU I'L SHAR


**Orignal Google USM**

In [None]:
#Google USM

!pip install google-cloud-speech


Collecting google-cloud-speech
  Downloading google_cloud_speech-2.22.0-py2.py3-none-any.whl (275 kB)
[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/275.2 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [91m━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━[0m [32m163.8/275.2 kB[0m [31m4.6 MB/s[0m eta [36m0:00:01[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m275.2/275.2 kB[0m [31m5.6 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: google-cloud-speech
Successfully installed google-cloud-speech-2.22.0


##Google USM Transcrption

In [None]:
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/content/drivedownloaduplaod-eed71df0371c.json"


In [None]:
!pip install pydub


Collecting pydub
  Downloading pydub-0.25.1-py2.py3-none-any.whl (32 kB)
Installing collected packages: pydub
Successfully installed pydub-0.25.1


In [60]:
import torchaudio
import requests



# Load the audio file
waveform, sample_rate = torchaudio.load('/content/audio_4.wav')

# Resample to 16kHz
resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
resampled_waveform = resampler(waveform)

# Save the resampled audio
torchaudio.save('resampled_audio.wav', resampled_waveform, 16000)


OSError: ignored

In [None]:
from pydub import AudioSegment

# Load the WAV file
wav_audio = AudioSegment.from_file("/content/resampled_audio.wav", format="wav")

# Export as FLAC
wav_audio.export("/content/audio.flac", format="flac")


<_io.BufferedRandom name='/content/audio.flac'>

In [None]:
!curl -s -H "Content-Type: application/json" \
    -H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
    https://speech.googleapis.com/v1/speech:recognize \
    -d @sync-request.json

{
  "results": [
    {
      "alternatives": [
        {
          "transcript": "हमारा नाम चंदन कुमार सिंह है और हम रांची से झारखंड स्टेट नाम है",
          "confidence": 0.8868657
        }
      ],
      "resultEndTime": "6.760s",
      "languageCode": "hi-in"
    }
  ],
  "totalBilledTime": "7s",
  "requestId": "9093070027649121811"
}


 ('हमारा नाम चंदन कुमार सिंह और हम राँजी से झारखंड स्टेट हैम',