# Fusing LM with Whisper for lower WER
The aim is to fuse BPE-level LM scores with Whisper's generated-token scores during beam-search decoding.
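
A common way to write this kind of fusion (often called shallow fusion) is shown below. This is a sketch of the general idea, not a confirmed description of the implementation used here: the symbols $\alpha$ and $\beta$ are assumed to correspond to the `lm_alpha` and `lm_beta` options that appear later in the notebook, and $|y|$ is the number of tokens in the hypothesis $y$ for audio $x$.

$$
\text{score}(y \mid x) = \log P_{\text{Whisper}}(y \mid x) + \alpha \, \log P_{\text{LM}}(y) + \beta \, |y|
$$

With $\alpha = 0$ and $\beta = 0$ this reduces to plain Whisper beam search; a larger $\alpha$ gives the LM more influence, and $\beta$ acts as a length bonus or penalty.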

## **MILESTONE 1**:
Instantiate a language model to be integrated with Whisper.
The chosen LM is an n-gram language model trained with the [KenLM](https://github.com/kpu/kenlm) library.

### Step 1:
Write code to run an already available LM standalone and score any input sequence.
**Chosen Model**: [Riva ASR Hindi LM](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/models/speechtotext_hi_in_lm/files?version=deployable_v3.1)


Download and build the KenLM toolkit

In [1]:
!wget -O - https://kheafield.com/code/kenlm.tar.gz | tar xz
!mkdir kenlm/build && cd kenlm/build && cmake .. && make -j2
!ls kenlm/build/bin

--2023-11-13 11:27:39--  https://kheafield.com/code/kenlm.tar.gz
Resolving kheafield.com (kheafield.com)... 35.196.63.85
Connecting to kheafield.com (kheafield.com)|35.196.63.85|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 491888 (480K) [application/x-gzip]
Saving to: ‘STDOUT’


2023-11-13 11:27:40 (1.67 MB/s) - written to stdout [491888/491888]

  Compatibility with CMake < 3.5 will be removed from a future version of
  CMake.

  Update the VERSION argument <min> value or use a ...<max> suffix to tell
  CMake that the project does not need compatibility with older versions.

-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check fo

Install the KenLM Python library

❗❗**Don't forget to restart the runtime after running this cell** ❗❗


In [1]:
!pip install https://github.com/kpu/kenlm/archive/master.zip

Collecting https://github.com/kpu/kenlm/archive/master.zip
  Downloading https://github.com/kpu/kenlm/archive/master.zip
     - 553.6 kB 1.5 MB/s 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: kenlm
  Building wheel for kenlm (pyproject.toml) ... done
  Created wheel for kenlm: filename=kenlm-0.2.0-cp310-cp310-linux_x86_64.whl size=3184349 sha256=5395b5cef89bc2727c1e2fd3f8bae5336a3279e7331625c40addd1434cec92a8
  Stored in directory: /tmp/pip-ephem-wheel-cache-zwysov5w/wheels/a5/73/ee/670fbd0cee8f6f0b21d10987cb042291e662e26e1a07026462
Successfully built kenlm
Installing collected packages: kenlm
Successfully installed kenlm-0.2.0


$1: Please elaborate on the model being downloaded and what its purpose is.

In [2]:
# Download the binary LM from Gdrive
import gdown
url = "https://drive.google.com/uc?id=1-AspJVZRXcrFMuLKx8C4N7uo9PnQytcB"
output = "language_model_3p0.bin"
gdown.download(url, output, quiet=False)

Downloading...
From: https://drive.google.com/uc?id=1-AspJVZRXcrFMuLKx8C4N7uo9PnQytcB
To: /content/language_model_3p0.bin
100%|██████████| 280M/280M [00:06<00:00, 45.5MB/s]


'language_model_3p0.bin'

In [4]:
import kenlm
model = kenlm.LanguageModel('/content/language_model_3p0.bin')
print("This is a {}-gram model".format(model.order))

This is a 4-gram model


In [5]:
# Two pairs of sentences (this cell and the next) that sound exactly the same in Hindi, but one sentence in each pair is incorrect (lexically or semantically)
# Generated using Bing Chat
book_correct = "मुझे यह किताब पसंद है।"
book_incorrect = "मुझे यह किताब पसन्द है।"

correct_score = model.score(book_correct)
incorrect_score = model.score(book_incorrect)
assert correct_score > incorrect_score
print(correct_score, incorrect_score)

-20.935253143310547 -21.663846969604492


In [None]:
sings_correct = "वह बहुत अच्छा गाता है।"
sings_incorrect = "वह बहुत अच्छा घाता है।"


correct_score = model.score(sings_correct)
incorrect_score = model.score(sings_incorrect)
assert correct_score > incorrect_score
print(correct_score, incorrect_score)

-19.827430725097656 -22.76061248779297
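
The scores above are easier to compare once their scale is explicit. KenLM's `score` returns a total base-10 log-probability for the sentence, by default including the begin- and end-of-sentence markers. The sketch below converts such a score to a natural log and to a per-word perplexity; the helper names `log10_to_ln` and `perplexity` are made up for this example, and the word count uses plain whitespace splitting.

```python
import math


def log10_to_ln(score_log10: float) -> float:
    """Convert a base-10 log-probability to a natural-log probability."""
    return score_log10 * math.log(10)


def perplexity(score_log10: float, sentence: str) -> float:
    """Per-word perplexity from a total log10 sentence score.

    The +1 accounts for the end-of-sentence token, which is scored too.
    """
    n_words = len(sentence.split()) + 1
    return 10.0 ** (-score_log10 / n_words)


# Scores printed above for the "sings" pair
print(perplexity(-19.827430725097656, "वह बहुत अच्छा गाता है।"))
print(perplexity(-22.76061248779297, "वह बहुत अच्छा घाता है।"))
```

A lower perplexity means the LM finds the sentence more plausible, so the correct sentence should print the smaller value. The natural-log conversion matters later, because Whisper reports natural-log probabilities.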


$2: Please elaborate on the model being downloaded and what its purpose is. Also, why are we downloading it twice?

In [7]:
# download the original arpa LM file for inspection
url = "https://drive.google.com/uc?id=1-4xQ3YCtsyONtpccGjOD1s9FtHqBX7RL"
output = "language_model_3p0.arpa"
gdown.download(url, output, quiet=False)

Downloading...
From: https://drive.google.com/uc?id=1-4xQ3YCtsyONtpccGjOD1s9FtHqBX7RL
To: /content/language_model_3p0.arpa
100%|██████████| 3.18G/3.18G [00:33<00:00, 95.6MB/s]


'language_model_3p0.arpa'

$$: Please elaborate on whether we need to prepare a new ARPA file for our model.

In [8]:
# inspect the last 20 lines inside the LM source
!tail -20 language_model_3p0.arpa

-1.2548329	<s> दोनों ही झल्लाये
-0.2760634	<s> चौधरी के अशुभचिंतकों
-0.04973512	डालियों पर बैठी शुकमंडली
-0.07060567	मनुष्यों को उन्हें बेमुरौवत
-0.049646165	और कड़क कर बोलेमेरी
-0.04038189	निराश हो कर कहानहीं
-0.08863469	पड़ते ही वह अव्यवस्थितचित्त
-0.19321889	दोनों पक्षों से सवालजवाब
-0.051110353	झगड़ू साहु ने कहासमझू
-0.20675866	करें तो उनकी भलमनसी
-0.04876329	नीति को सराहता थाइसे
-0.06436408	<s> मित्रता की मुरझायी
-0.23735626	की गहराई से उपजतें
-0.17502813	पूर्णता की ओर बढातें
-0.18197767	जहाँ से अच्छा हिन्दोसिताँ
-0.04437429	हैं इसकी यह गुलसिताँ
-0.06926097	संतरी हमारा वह पासबाँ
-0.09434804	जिनके दम से रश्कएजनाँ

\end\
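
To read the listing above: in the ARPA format each n-gram line holds a base-10 log-probability, a tab, the space-separated words, and optionally a tab plus a backoff weight. Entries of the highest order carry no backoff weight, which matches the 4-gram lines shown. The parser below is a minimal sketch for inspection only; `parse_arpa_ngram` is a name made up for this example.

```python
from typing import List, Optional, Tuple


def parse_arpa_ngram(line: str) -> Tuple[float, List[str], Optional[float]]:
    """Split one ARPA n-gram line into (log10 prob, words, backoff or None)."""
    fields = line.rstrip("\n").split("\t")
    log10_prob = float(fields[0])
    words = fields[1].split(" ")
    # The backoff weight is absent for the highest-order n-grams
    backoff = float(fields[2]) if len(fields) > 2 else None
    return log10_prob, words, backoff


# One of the lines from the tail of the file above
logp, words, backoff = parse_arpa_ngram("-0.04973512\tडालियों पर बैठी शुकमंडली")
print(len(words), logp, backoff)
```

The entries are whole words rather than BPE tokens, which is worth keeping in mind when this LM is combined with Whisper's token-level scores.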


In [None]:
# some useful KenLM commands for future reference
# generate binary
# !kenlm/build/bin/build_binary dataset_tokenized_3gram.arpa dataset_tokenized_3gram.binary
# create a new LM
# !kenlm/build/bin/lmplz -o 3 --text dataset_tokenized.txt --arpa dataset_tokenized_3gram.arpa --discount_fallback
# !tail -20 dataset_tokenized_3gram.arpa
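
If a BPE-level LM is needed, as the introduction states, the text passed to `lmplz` would first have to be converted into space-separated token ids, one sentence per line. The sketch below shows one way to do that. It is an assumption about how `dataset_tokenized.txt` could be produced, not the procedure actually used; `to_token_line` and `write_tokenized_corpus` are hypothetical helpers, and the encoder is passed in so the example runs without Whisper installed.

```python
from typing import Callable, Iterable, List


def to_token_line(text: str, encode: Callable[[str], List[int]]) -> str:
    """Render one sentence as space-separated token ids."""
    return " ".join(str(tok) for tok in encode(text))


def write_tokenized_corpus(
    sentences: Iterable[str],
    encode: Callable[[str], List[int]],
    path: str,
) -> None:
    """Write one tokenized sentence per line, skipping empty lines."""
    with open(path, "w", encoding="utf-8") as f:
        for sentence in sentences:
            sentence = sentence.strip()
            if sentence:
                f.write(to_token_line(sentence, encode) + "\n")


# Stand-in encoder for illustration only: one id per character
toy_encode = lambda text: [ord(ch) for ch in text]
print(to_token_line("ab", toy_encode))
```

With Whisper installed, `whisper.tokenizer.get_tokenizer(multilingual=True).encode` could be passed as `encode` in place of the toy encoder.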

# Integrating the LM with Whisper

In [9]:
# !pip install https://github.com/HKAB/whisper/archive/master.zip
!pip install openai-whisper

Collecting openai-whisper
  Downloading openai-whisper-20231106.tar.gz (798 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 798.6/798.6 kB 6.7 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting triton==2.0.0 (from openai-whisper)
  Downloading triton-2.0.0-1-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (63.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 63.3/63.3 MB 10.6 MB/s eta 0:00:00
Collecting tiktoken (from openai-whisper)
  Downloading tiktoken-0.5.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.0/2.0 MB 95.1 MB/s eta 0:00:00
Collecting lit (from triton==2.0.0->openai-whisper)
  Downloading lit-17.0.4.tar.gz (153 kB)

In [10]:
import whisper
import torch
import kenlm

In [11]:
model = whisper.load_model("small")

100%|███████████████████████████████████████| 461M/461M [00:06<00:00, 76.3MiB/s]


In [12]:
# Download a sample audio file from Gdrive
import gdown
url = "https://drive.google.com/uc?id=1kKeSvrZo8z5Rsp1q-h3GXpG7vHctKMcG"
output = "sample.wav"
gdown.download(url, output, quiet=False)

transcription = "ब्रूड बॉक्स लैंगस्ट्रॉथ छत्ते का एक अनिवार्य हिस्सा है।"

Downloading...
From: https://drive.google.com/uc?id=1kKeSvrZo8z5Rsp1q-h3GXpG7vHctKMcG
To: /content/sample.wav
100%|██████████| 197k/197k [00:00<00:00, 3.28MB/s]


In [13]:
audio = whisper.load_audio("/content/sample.wav")
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio).to(model.device)


## Baseline Whisper
Without nbest or LM integration

In [14]:

options = whisper.DecodingOptions(fp16=False, beam_size=5, without_timestamps=True)
result = whisper.decode(model, mel, options)
baseline = result.text
baseline

'ब्रूद बाँक्ष लंश्टोट छत्ते का एक अनिवार्य हिस्सा है'

# Decoding with LM integration

In [15]:
# adding the LM
import gdown
url = "https://drive.google.com/uc?id=1-AspJVZRXcrFMuLKx8C4N7uo9PnQytcB"
output = "language_model_3p0.bin"
gdown.download(url, output, quiet=False)

Downloading...
From: https://drive.google.com/uc?id=1-AspJVZRXcrFMuLKx8C4N7uo9PnQytcB
To: /content/language_model_3p0.bin
100%|██████████| 280M/280M [00:02<00:00, 103MB/s]


'language_model_3p0.bin'

$$: Elaborate a bit on what is happening here. Please refer to the file and function with the change done and

In [17]:
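# NOTE: withlm, lm_path, lm_alpha and lm_beta are not fields of the stock
# openai-whisper DecodingOptions installed above, hence the TypeError below;
# they presumably require the fork whose install line is commented out in In [9]
options = whisper.DecodingOptions(fp16=False, withlm=True, beam_size=5,
        patience=1.0, lm_path="/content/language_model_3p0.bin", lm_alpha=1.0, lm_beta=0.0,
        without_timestamps=True)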
decoding_withLM = whisper.decode(model, mel, options)


TypeError: ignored

In [18]:
decoding_withLM.text

NameError: ignored
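
Since the cell above did not run, here is a toy sketch of what LM fusion inside one beam-search step could look like: every candidate extension is scored by the ASR log-probability plus a weighted LM log-probability plus a per-token bonus. This illustrates the general technique only, not the code path behind `withlm`. The function `fuse_step` and both scoring callbacks are hypothetical, and the probabilities are invented for the demonstration.

```python
import math
from typing import Callable, Dict, List, Tuple

Beam = Tuple[Tuple[str, ...], float]  # (tokens so far, cumulative score)


def fuse_step(
    beams: List[Beam],
    asr_next: Callable[[Tuple[str, ...]], Dict[str, float]],
    lm_logprob: Callable[[Tuple[str, ...], str], float],
    alpha: float = 1.0,
    beta: float = 0.0,
    beam_size: int = 2,
) -> List[Beam]:
    """Expand every beam by one token and keep the best `beam_size` results."""
    candidates: List[Beam] = []
    for tokens, score in beams:
        for token, asr_lp in asr_next(tokens).items():
            # Fused increment: ASR log-prob + weighted LM log-prob + length bonus
            step = asr_lp + alpha * lm_logprob(tokens, token) + beta
            candidates.append((tokens + (token,), score + step))
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:beam_size]


# Toy models: the ASR slightly prefers "a", the LM strongly prefers "b"
toy_asr = lambda tokens: {"a": math.log(0.6), "b": math.log(0.4)}
toy_lm = lambda tokens, token: math.log(0.1) if token == "a" else math.log(0.9)

start = [((), 0.0)]
print(fuse_step(start, toy_asr, toy_lm, alpha=0.0, beam_size=1)[0][0])
print(fuse_step(start, toy_asr, toy_lm, alpha=1.0, beam_size=1)[0][0])
```

With `alpha=0.0` the ASR preference decides the step; with `alpha=1.0` the LM overturns it. Both log-probabilities should be on the same base before they are added.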

# Decoding with nbest

In [19]:
# NOTE: nbest is not a field of the stock openai-whisper DecodingOptions (see the TypeError below)
options = whisper.DecodingOptions(fp16=False, beam_size=5, nbest=True, without_timestamps=True)
nbest = whisper.decode(model, mel, options)

for candidate in nbest:
  print(candidate.text, candidate.avg_logprob)

TypeError: ignored

In [20]:
assert nbest[0].text == baseline

NameError: ignored

### Adding LM rescoring

In [21]:
lm_model = kenlm.LanguageModel('/content/language_model_3p0.bin')
print("This is a {}-gram model".format(lm_model.order))

This is a 4-gram model


In [22]:
nbest_with_lm_score = [(c.text, c.avg_logprob, lm_model.score(c.text)) for c in nbest]
nbest_with_lm_score

NameError: ignored

In [23]:
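# NOTE: avg_logprob is a per-token average of natural-log probabilities, while
# KenLM's score is a total log10 probability, so lm_weight also absorbs the scale difference
lm_weight = 0.01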
combined_scores = [(text, whisper_score + lm_score*lm_weight) for text, whisper_score, lm_score in nbest_with_lm_score]
combined_scores.sort(key=lambda t: t[1], reverse=True)
combined_scores

NameError: ignored
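
Because the cells above failed, the rescoring idea can be illustrated on invented data. The sketch below puts both scores on a comparable scale before mixing them: the LM's total log10 score is converted to a natural log and divided by the word count, then added to Whisper's per-token average with a weight. This normalisation is a design choice for the sketch, not what the cell above does. The function `rescore_nbest`, the candidate texts and all numbers are hypothetical.

```python
import math
from typing import Callable, List, Tuple


def rescore_nbest(
    nbest: List[Tuple[str, float]],
    lm_score_log10: Callable[[str], float],
    lm_weight: float = 0.1,
) -> List[Tuple[str, float]]:
    """Re-rank (text, asr_avg_logprob) pairs with a length-normalised LM score."""
    rescored = []
    for text, asr_avg_logprob in nbest:
        n_words = len(text.split()) + 1  # +1 for the end-of-sentence token
        lm_ln_per_word = lm_score_log10(text) * math.log(10) / n_words
        rescored.append((text, asr_avg_logprob + lm_weight * lm_ln_per_word))
    rescored.sort(key=lambda item: item[1], reverse=True)
    return rescored


# Invented n-best list and LM scores, for illustration only
toy_nbest = [("bad txet", -0.30), ("good text", -0.31)]
toy_lm_scores = {"bad txet": -9.0, "good text": -2.0}

for text, score in rescore_nbest(toy_nbest, toy_lm_scores.__getitem__):
    print(text, round(score, 4))
```

In the notebook, `lm_model.score` would take the place of the toy lookup, and `lm_weight` would be tuned on held-out data.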

## Importing a HF finetuned model

In [24]:
!pip install openai-whisper
!pip install transformers

Collecting transformers
  Downloading transformers-4.35.0-py3-none-any.whl (7.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.9/7.9 MB 21.1 MB/s eta 0:00:00
Collecting huggingface-hub<1.0,>=0.16.4 (from transformers)
  Downloading huggingface_hub-0.19.0-py3-none-any.whl (311 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 311.2/311.2 kB 36.7 MB/s eta 0:00:00
Collecting tokenizers<0.15,>=0.14 (from transformers)
  Downloading tokenizers-0.14.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.8/3.8 MB 56.7 MB/s eta 0:00:00
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.3/1.3 MB 57.6 MB/s eta 0:00:00
Col

$$: Not able to integrate the model CKSINGH/whisper-small-hi-iiib, fine-tuned at: https://colab.research.google.com/github/chandan110791/HindiSpeechRecognition/blob/main/fine_tune_whisper_iiitb.ipynb#scrollTo=d7030622-caf7-4039-939b-6195cdaa2585

In [25]:
import whisper
from transformers import WhisperForConditionalGeneration
import torch
from tqdm import tqdm
import os

# using pickle to serialize the map_dict
import pickle

from huggingface_hub import hf_hub_download
filename = "pytorch_model.bin"
hf_hub_download(repo_id="CKSINGH/whisper-small-hi-iiib", filename=filename, local_dir="/content/")


EntryNotFoundError: ignored

In [26]:
# to enable verbose printing of exceptions (+ layers matching name)
DEBUG = False

# set to True if your custom model has been trained using DDP (multi-GPU);
# in my case, the keys of the custom HF model have a prefix ("model."),
# which should come from having trained on a multi-GPU machine using DDP
DDP_TRAINED = True

# if DDP we have to add a prefix to match with the HF state_dict
if DDP_TRAINED:
    PREFIX = "model."
else:
    PREFIX = ""

# for now, tested only with medium
MODEL_SIZE = "small"

# the device where you're running this code
DEVICE = "cpu"

# the name of the file with your fine-tuned model
FINETUNED_MODEL = "pytorch_model.bin"


# the name of the file for the serialized map_dict
# a different name, to avoid overwriting it
FILE_DICT = "map_dict.pkl"


In [27]:

def import_hf_model(finetuned_model, debug=False, model_size="small", device="cpu", file_dict="map_dict.pkl"):

  def has_numbers(inputString):
      return any(char.isdigit() for char in inputString)

  # next functions are used to make sanity checks for the mappings

  # get if it is encoder or decoder
  def extract_function(key_name):
      # encoder or decoder is the first part of the key
      first_part = key_name.split(".")[0]

      key_func = None
      if first_part in ["encoder", "decoder"]:
          key_func = first_part

      return key_func

  def extract_layer_num(key_name):
      # layer num is the third piece
      layer_num = None

      if has_numbers(key_name):
          layer_num = key_name.split(".")[2]

      return layer_num

  # check that the two keys are for layers
  # with the same function
  # (both encoder or both decoder)
  # and have the same layer number
  # this way we are super-safe (I think)
  def sanity_check(key1, key2):
      is_ok = True

      # check same func (encoder or decoder)
      func1 = extract_function(key1)
      func2 = extract_function(key2)

      if func1 != func2:
          print(f"Warning: layers seem to have different functions: {key1},{key2}")
          is_ok = False

      # check same layer_num
      layer1 = extract_layer_num(key1)
      layer2 = extract_layer_num(key2)

      if layer1 != layer2:
          print(f"Warning: layers seem to have different numbers: {key1},{key2}")
          is_ok = False

      return is_ok

  if not os.path.isfile(file_dict):
    # Vanilla means: not custom trained
    print()
    print("Loading vanilla Whisper model")
    model = whisper.load_model(model_size, device=device)

    print("Loading vanilla HF Model")
    hugging_face_model = WhisperForConditionalGeneration.from_pretrained(
        "openai/whisper-" + model_size
    )

    # extract state-dict from both
    state_d_openai = model.state_dict()
    state_d_huggingface = hugging_face_model.model.state_dict()

    # build the mapping between keys...
    map_dict = {}
    print("Matching layers...")

    # for every layer in OpenAI model
    n_sanity_ok = 0

    #
    # here we're considering the cartesian product of the two state dict and try to match
    # rules applied:
    # 1. the two layers have the same shape
    # 2. the two layer have the same parameters' values
    # 3. we apply sanity check (see function above)
    #
    for k in tqdm(state_d_openai):
        # find a layer in the HF model, check with j
        for j in state_d_huggingface:
            # where parameters have same shape and same values
            if state_d_huggingface[j].shape == state_d_openai[k].shape:
                if torch.all(torch.eq(state_d_huggingface[j], state_d_openai[k])).item():
                    # found, register the mapping
                    map_dict[k] = j
                    # run the sanity check, which prints a warning if it fails
                    if sanity_check(k, j):
                        n_sanity_ok += 1

                        # if you enable this print you can see the names of the layers
                        # chosen in the match and you will see that they have the same functions
                        if debug:
                            print(k, j)

                    break


    # check if we have matched every entry
    print("Check if we have matched every entry in state_dict...")
    print()
    print(f"Number of keys: {len(map_dict.keys())}")
    assert len(map_dict.keys()) == len(state_d_openai.keys()), "The match is not complete !"

    print(f"Number of sanity_check ok: {n_sanity_ok}")
    print()

    print("Match is complete !!!")
    print()


    # serialize the map_dict to file
    print("Serializing map_dict...")

    with open(file_dict, "wb") as f:
        pickle.dump(map_dict, f)

    print(f"map_dict saved as: {file_dict}...")
    print()


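  else:
    # map_dict already exists on disk: reload it and the vanilla model,
    # otherwise `map_dict` and `model` would be undefined below
    print("Reloading map_dict...")
    print()
    with open(file_dict, "rb") as f:
        map_dict = pickle.load(f)
    model = whisper.load_model(model_size, device=device)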

  # loading fine-tuned dict
  print("Loading fine tuned dict...")

  # added map_location to handle the fact that the custom model has been trained on GPU
  state_dict_finetuned = torch.load(finetuned_model, map_location=torch.device(device))

  # build the state_dict to be used
  # take the key name from standard (OpenAI) and the value from finetuned (HF)
  print("Rebuild the state dict...")
  new_state_dict = {}
  n_except = 0
  for k in tqdm(map_dict.keys()):
      try:
          # PREFIX ("model.") must be prepended for an HF model fine-tuned with DDP
          # (see DDP_TRAINED above); it is not present in vanilla HF models
          new_state_dict[k] = state_dict_finetuned[PREFIX + map_dict[k]]
      except KeyError:
          n_except += 1

          if debug:
              print(PREFIX + map_dict[k])




  msg_err = f"Rebuild state dict failed, {n_except} pick failed"
  assert n_except == 0, msg_err



  print()
  print("Loading the final model...")
  model.load_state_dict(new_state_dict)
  return model
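
The core trick in `import_hf_model` is to pair parameter names across two checkpoints of the same pretrained weights by comparing values rather than names. The toy sketch below reproduces that idea with plain Python lists in place of tensors. It also reports source keys whose values match more than one target key, a case in which "first match wins" could silently pick the wrong layer. `match_keys_by_value` is a hypothetical helper and the values are invented.

```python
from typing import Dict, List, Tuple


def match_keys_by_value(
    source: Dict[str, List[float]],
    target: Dict[str, List[float]],
) -> Tuple[Dict[str, str], List[str]]:
    """Map each source key to the first target key holding identical values.

    Returns the mapping plus the source keys that matched several targets.
    """
    mapping: Dict[str, str] = {}
    ambiguous: List[str] = []
    for src_key, src_val in source.items():
        hits = [
            tgt_key
            for tgt_key, tgt_val in target.items()
            if len(tgt_val) == len(src_val) and tgt_val == src_val
        ]
        if hits:
            mapping[src_key] = hits[0]
        if len(hits) > 1:
            ambiguous.append(src_key)
    return mapping, ambiguous


# Toy "state dicts" with invented values
openai_like = {
    "encoder.blocks.0.attn.query.weight": [0.1, 0.2],
    "encoder.ln_post.bias": [0.0, 0.0],
}
hf_like = {
    "encoder.layers.0.self_attn.q_proj.weight": [0.1, 0.2],
    "encoder.layer_norm.bias": [0.0, 0.0],
    "decoder.layer_norm.bias": [0.0, 0.0],
}

mapping, ambiguous = match_keys_by_value(openai_like, hf_like)
print(mapping)
print("ambiguous:", ambiguous)
```

In this toy data the all-zero bias matches two target keys, so it is flagged as ambiguous. A similar check on real checkpoints would show whether the name-based sanity check is doing any real work.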

In [28]:
model = import_hf_model(finetuned_model=FINETUNED_MODEL)


Loading vanilla Whisper model
Loading vanilla HF Model


Downloading (…)lve/main/config.json:   0%|          | 0.00/1.97k [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/967M [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/3.84k [00:00<?, ?B/s]

Matching layers...


100%|██████████| 479/479 [00:18<00:00, 26.45it/s]

Check if we have matched every entry in state_dict...

Number of keys: 479
Number of sanity_check ok: 479

Match is complete !!!

Serializing map_dict...
map_dict saved as: map_dict.pkl...

Loading fine tuned dict...





FileNotFoundError: ignored

In [29]:
model

Whisper(
  (encoder): AudioEncoder(
    (conv1): Conv1d(80, 768, kernel_size=(3,), stride=(1,), padding=(1,))
    (conv2): Conv1d(768, 768, kernel_size=(3,), stride=(2,), padding=(1,))
    (blocks): ModuleList(
      (0-11): 12 x ResidualAttentionBlock(
        (attn): MultiHeadAttention(
          (query): Linear(in_features=768, out_features=768, bias=True)
          (key): Linear(in_features=768, out_features=768, bias=False)
          (value): Linear(in_features=768, out_features=768, bias=True)
          (out): Linear(in_features=768, out_features=768, bias=True)
        )
        (attn_ln): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): Sequential(
          (0): Linear(in_features=768, out_features=3072, bias=True)
          (1): GELU(approximate='none')
          (2): Linear(in_features=3072, out_features=768, bias=True)
        )
        (mlp_ln): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      )
    )
    (ln_post): LayerNorm((768,), eps=1e-0

In [32]:
# quote the version specifiers so the shell does not treat ">" as an output redirect
!pip install "datasets>=2.6.1"
!pip install git+https://github.com/huggingface/transformers
!pip install librosa
!pip install "evaluate>=0.30"
!pip install jiwer
!pip install gradio
!pip install accelerate -U

Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-w7d6ohf2
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers /tmp/pip-req-build-w7d6ohf2
  Resolved https://github.com/huggingface/transformers to commit 8f577dca4f2e9153d152afffe209fee643a90124
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: transformers
  Building wheel for transformers (pyproject.toml) ... done
  Created wheel for transformers: filename=transformers-4.36.0.dev0-py3-none-any.whl size=7987444 sha256=9274d0d5e4bcaab19e0f3c00e914551faf1edd87195bd5d0b7fff38fa71aca66
  Stored in directory: /tmp/pip-ephem-wheel-cache-yman986d/wheels/c0/14/d6/6c9a5582d2ac191ec0a483be151a4495fe1eb2a6706ca49f1b
Successfully built transformers

Collecting jiwer
  Downloading jiwer-3.0.3-py3-none-any.whl (21 kB)
Collecting rapidfuzz<4,>=3 (from jiwer)
  Downloading rapidfuzz-3.5.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.3/3.3 MB 13.6 MB/s eta 0:00:00
Installing collected packages: rapidfuzz, jiwer
Successfully installed jiwer-3.0.3 rapidfuzz-3.5.2
Collecting gradio
  Downloading gradio-4.2.0-py3-none-any.whl (15.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 15.4/15.4 MB 32.0 MB/s eta 0:00:00
Collecting aiofiles<24.0,>=22.0 (from gradio)
  Downloading aiofiles-23.2.1-py3-none-any.whl (15 kB)
Collecting fastapi (from gradio)
  Downloading fastapi-0.104.1-py3-none-any.whl (92 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 92.9/92.9 kB 8.1 MB/s eta 0:00:00
Collecting ffmpy (from gradio)
  Downloading ffmpy-0.3.1

In [33]:
from huggingface_hub import notebook_login
# paste your access token into the prompt; never store it in the notebook
notebook_login()


$$: We need to calculate the WER and CER on the shared test data using our final model.

In [36]:
# Prepare test data for calculating WER and CER

import datasets
from datasets import load_dataset

# load the dataset through the local loading script


datasets.config.DEFAULT_MAX_BATCH_SIZE = 10
test_dataset = load_dataset("/content/datadownload.py")

In [37]:
test_dataset = test_dataset.remove_columns(["age", "client_id", "down_votes", "gender", "locale", "path", "segment", "up_votes","accents","variant"])


In [38]:
test_dataset

DatasetDict({
    train: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 4630
    })
    test: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 3072
    })
    validation: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 2416
    })
    other: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 3767
    })
    validated: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 10173
    })
    invalidated: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 757
    })
})

In [39]:
combined_dataset = datasets.concatenate_datasets([test_dataset["test"]])


In [44]:
from datasets import load_dataset, concatenate_datasets, DatasetDict

# Define the split ratios
validation_split = 0.80  # 80% of the data
test_split = 0.20  # 20% of the data

# Compute the number of samples for each split
num_samples = len(combined_dataset)
num_validation = int(validation_split * num_samples)
num_test = num_samples - num_validation  # remaining 20%

# Split the combined dataset
validation_dataset = combined_dataset.select(indices=list(range(num_validation)))
test_dataset = combined_dataset.select(indices=list(range(num_validation, num_samples)))

# If you want to organize the split datasets in a DatasetDict for convenience:
split_test_datasets = DatasetDict({
    'validation': validation_dataset,
    'test': test_dataset
})

# Verify the resulting datasets
print(f'Validation Dataset: {len(validation_dataset)} samples')
print(f'Test Dataset: {len(test_dataset)} samples')

Validation Dataset: 2457 samples
Test Dataset: 615 samples


In [45]:
split_test_datasets["test"][0]


{'audio': {'path': '/root/.cache/huggingface/datasets/downloads/extracted/9274dfa08465814d9b528cc161d00ba351615d0fffa64a36a31ba4eef3620161/cv-corpus-15.0-2023-09-08/hi/Zclips/common_voice_hi_27096805.mp3',
  'array': array([-4.54747351e-13, -1.13686838e-12, -1.36424205e-12, ...,
          2.98086479e-06,  3.94603649e-06,  1.65145821e-06]),
  'sampling_rate': 48000},
 'sentence': 'श्रीलंका में राष्ट्रपति चुनाव के लिए कल वोटिंग, इन दो उम्मीदवारों के बीच टक्कर'}

Evaluation Metrics on the Test data

In [46]:
import evaluate

metric = evaluate.load("wer")
metric_c = evaluate.load("cer")

Downloading builder script:   0%|          | 0.00/4.49k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/5.60k [00:00<?, ?B/s]

In [47]:
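def compute_metrics(pred):
    # NOTE: relies on a global `tokenizer` (e.g. a Whisper tokenizer) that is
    # not defined anywhere in this notebook and must be created beforehand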
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    # we do not want to group tokens when computing the metrics
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)
    cer = 100 * metric_c.compute(predictions=pred_str, references=label_str)


    return {"wer": wer, "cer": cer}
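
To make explicit what the two metrics measure, here is a dependency-free sketch: WER is the word-level edit distance divided by the number of reference words, and CER is the same at character level. It is not the implementation behind `evaluate.load("wer")`; the function names are made up for this example, and characters are counted as Unicode code points, so Devanagari combining marks count individually.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate of one hypothesis against one reference."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, counting Unicode code points."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Reference and baseline output of the sample audio used earlier
reference = "ब्रूड बॉक्स लैंगस्ट्रॉथ छत्ते का एक अनिवार्य हिस्सा है।"
baseline = "ब्रूद बाँक्ष लंश्टोट छत्ते का एक अनिवार्य हिस्सा है"
# 4 of the 9 reference words differ (the first three, and the last one
# only because of the trailing danda)
print(wer(reference, baseline), cer(reference, baseline))
```

The last mismatch comes only from the trailing "।", which suggests normalising punctuation in both strings before scoring.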


$$: Code to call compute_metrics on the test dataset using our final model

In [48]:
## Code to call compute_metrics on the test dataset using our final model and calculate the WER and CER
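
One possible shape for this evaluation step is sketched below, under stated assumptions: the model is wrapped in a `transcribe` callable that maps a sample's `audio` entry to text, and each metric is a callable taking lists of predictions and references. `evaluate_transcriber` is a hypothetical helper, and the stubs in the demonstration stand in for the real model and metrics. Note that the sample shown earlier has a 48 kHz sampling rate, while Whisper expects 16 kHz audio, so a real `transcribe` would need to resample first.

```python
from typing import Any, Callable, Dict, Iterable, List

Metric = Callable[[List[str], List[str]], float]


def evaluate_transcriber(
    samples: Iterable[Dict[str, Any]],
    transcribe: Callable[[Any], str],
    metrics: Dict[str, Metric],
) -> Dict[str, float]:
    """Transcribe every sample and report each metric as a percentage."""
    predictions: List[str] = []
    references: List[str] = []
    for sample in samples:
        predictions.append(transcribe(sample["audio"]))
        references.append(sample["sentence"])
    return {name: 100 * fn(predictions, references) for name, fn in metrics.items()}


# Stub data, model and metric, for illustration only
toy_samples = [{"audio": 1, "sentence": "a"}, {"audio": 2, "sentence": "b"}]
toy_transcribe = lambda audio: "a"
toy_mismatch = lambda preds, refs: sum(p != r for p, r in zip(preds, refs)) / len(refs)

print(evaluate_transcriber(toy_samples, toy_transcribe, {"mismatch": toy_mismatch}))
```

In the notebook, the metrics could be passed as `{"wer": lambda p, r: metric.compute(predictions=p, references=r), "cer": lambda p, r: metric_c.compute(predictions=p, references=r)}` with `split_test_datasets["test"]` as the samples.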

In [49]:
## Code to push the final model to Hugging Face