# Whisper ASR Transcriptor Pipeline for generating Weakly Labelled Training Data
**with Gemini powered Translation, Summarization for Transliteration**

---

Author:
*Fyroz S Dadapeer*       
*MTech CSE (Specialization in AI and DS)*        
*Indian Institute of Information Technology, Kota*

---

Whisper is a state-of-the-art (SOTA), multilingual, general-purpose speech recognition model that can transcribe and translate speech in more than 96 languages. OpenAI trained it on 680,000 hours of weakly supervised audio across many languages and released checkpoints that can be fine-tuned for low-resource languages. Fine-tuning is still needed to reach a lower word error rate (WER) on low-resource languages, but Whisper performs well out of the box compared with other ASR models.


This project runs ASR with the multilingual Whisper `large-v3` model on any audio/video file or YouTube URL given as input. It then uses Google's Gemini API to translate, summarize, or answer questions about the transcription.

The pipeline layers several modules on top of vanilla Whisper in an attempt to improve transcription quality: Faster-Whisper (which runs inference on CTranslate2), audio preprocessing (audio normalization, dynamic range compression, noise reduction using spectral gating, and a Wiener filter), a speaker diarization pipeline, Silero VAD (to skip segments without voice activity), Demucs (to isolate the voice), and custom prompting.

The pipeline is discussed in detail in the cells below.

To run the app, first set the runtime to `T4 GPU` and then run all the setup cells below.

## Run Full Setup 🏗️
Click the play button below. Setting up the entire app takes approximately 5 minutes.

In [None]:
#@markdown # **Install libraries**
#@markdown This cell can take a little while (~5 mins) to download and install several packages/libraries, including Whisper.

#@markdown ---

%%capture
# ^ Magic command to not show cell output

# System Modules
import io
import os
import gc
import sys
import subprocess
import importlib

# External Packages
# Function to install a package in the kernel **without restarting session**
def install_package(package):
    subprocess.check_call([sys.executable, "-m", "pip", "install", package])

install_package("pyannote.audio") # !pip install pyannote.audio
install_package("stable-ts") # !pip install stable-ts
install_package("pytube") # !pip install pytube
install_package("noisereduce") # !pip install noisereduce
install_package("pydub") # !pip install pydub
install_package("faster-whisper") # !pip install faster-whisper

!pip install demucs
!pip install gradio
!pip install -q -U google-generativeai

!mkdir -p generated_data

# External Modules
from pytube import YouTube # Library to download youtube video - Alternatives: youtube_dl, google API, pafy
import noisereduce as nr
import stable_whisper
from faster_whisper import WhisperModel

from pydub import AudioSegment
from pydub.effects import normalize
from pydub.effects import compress_dynamic_range

from pyannote.audio import Pipeline
from pyannote.audio.pipelines.utils.hook import ProgressHook
from pyannote.core import Annotation

from scipy.io import wavfile
from scipy.signal import wiener

# Pytorch Modules
import torch

device_id = "cuda" if torch.cuda.is_available() else "cpu"
device = torch.device(device_id)

import torchaudio
import torchaudio.transforms as T
from torchaudio.utils import download_asset # Utility function that downloads the audio file from the URL

# String Modules
import re
import json
import random
import string
import mimetypes
from urllib.parse import urlparse
import pathlib
import textwrap

from IPython.display import display
from IPython.display import Markdown

from bs4 import BeautifulSoup
from markdown import markdown

# Google Dev Module
import google.generativeai as genai

# Login Modules
from huggingface_hub import notebook_login
from google.colab import userdata

# Gradio
import gradio as gr
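
The setup cell above reinstalls every package on each run. As a minimal sketch (not part of the original notebook), the helper below skips packages whose import name is already available; `is_installed` and `ensure_package` are hypothetical names, and the mapping from pip name to import name is an assumption that must be supplied for packages where the two differ (for example `stable-ts` is imported as `stable_whisper`).

```python
import importlib
import importlib.util
import subprocess
import sys


def is_installed(module_name):
    """Return True if `module_name` can be imported in the current kernel."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ModuleNotFoundError, ValueError):
        # Raised for dotted names (e.g. "pyannote.audio") when the parent
        # package is missing, or for modules without a usable spec
        return False


def ensure_package(pip_name, module_name=None):
    """Install `pip_name` with pip only if its import name is not available.

    Returns True if an installation was performed, False otherwise.
    """
    module_name = module_name or pip_name.replace("-", "_")
    if is_installed(module_name):
        return False  # Already importable, nothing to do
    subprocess.check_call([sys.executable, "-m", "pip", "install", pip_name])
    importlib.invalidate_caches()  # Let the import system see the new package
    return True
```

A call such as `ensure_package("stable-ts", "stable_whisper")` would then only invoke pip on a fresh runtime.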

### Functions and Classes Definitions

Code Structure



In [None]:
#@markdown #### **Utility Functions**
#@markdown * `log_file()` - This function logs audio files as changes are made to it (Needed for debugging)
#@markdown * `json_to_dict()` - This function converts a JSON file to dict
#@markdown * `dict_to_json()` - This function converts Python dictionary to JSON file and saves it
#@markdown * `get_file_type()` - This function checks if a file is an audio file or video file
#@markdown * `check_yt_link()` - This function checks if a URL is a youtube URL or not
#@markdown * `to_markdown()` - This function converts a text file to markdown object
#@markdown * `markdown_to_plain()` - This function converts a markdown style text to plain text
#@markdown * `personify()` - Extracts the last int at the end of string SPEAKER_00 -> 0
#@markdown ---

def log_file(updated_file_path, log_file_name):
  '''Logs a file by copying it into a `logs` folder with the `cp` command in a subprocess (computationally cheap)
      Parameters:
        - updated_file_path - The path of the file that needs to be logged
        - log_file_name - The log file name

      Returns: Does not return anything; prints the error in case of an exception
      Note: The log file is saved in a `logs` folder next to the file being logged.
  '''
  try:
    log_path = os.path.join(os.path.dirname(updated_file_path), 'logs') # Location where all the logs are stored
    os.makedirs(log_path, exist_ok=True) # Make sure the logs folder exists before copying

    # Copies the file into the logs folder under its log name
    subprocess.run(["cp", updated_file_path, os.path.join(log_path, log_file_name)], check=True)

  except Exception as e:
    print(f"Error logging: {e}")


def json_to_dict(file_path):
    '''Loads a JSON file into a Python dict
       Parameters: file_path - path to the JSON file
       Returns: dict with the parsed JSON content
    '''
    with open(file_path, 'r', encoding='utf-8') as f:
        data = json.load(f)
    return data


def dict_to_json(data, file_name):
  '''Writes a Python Dictionary to a file
      Parameters:
        - data - JSON dictionary to save
        - file_name - File in which the dictionary needs to be saved

      Returns: None
  '''
  with open(file_name, 'w', encoding='utf-8') as json_file:
    json.dump(data, json_file, ensure_ascii=False, indent=4)


def get_file_type(file_path):
  '''This function checks whether the file at file_path is an audio file, a video file, or neither
      Parameters: file_path - Path of the file to check
      Returns: str -> 'audio', 'video' or 'unknown'
  '''
  mime_type, _ = mimetypes.guess_type(file_path)

  if mime_type:
      if mime_type.startswith('audio'):
          return 'audio'
      elif mime_type.startswith('video'):
          return 'video'
  return 'unknown'

def check_yt_link(video_url):
  ''' Checks whether a URL is a YouTube link
      Parameters: video_url - URL of the YouTube video/audio
      Returns: Bool -> True for a valid YouTube URL, else False
  '''
  # The URL is treated as valid if its hostname is one of the known YouTube hosts
  parsed_url = urlparse(video_url)
  if parsed_url.hostname in ("www.youtube.com", "youtube.com", "youtu.be", "m.youtube.com"):
    print("Valid Youtube URL!")
    return True
  else:
    print("Invalid Youtube URL!")
    return False

def to_markdown(text):
  '''Converts text to markdown format
      Parameters: text -> string that needs to be converted to markdown
      Returns: Markdown object that has the string in markdown format
  '''
  text = text.replace('•', '  *')
  return Markdown(textwrap.indent(text, '> ', predicate=lambda _: True))

def markdown_to_plain(text):
  '''Converts markdown text to plain text
    Parameters: text -> Text string that has markdown syntax
    Returns: Text (str) without markdown symbols
  '''
  html = markdown(text)
  text = ''.join(BeautifulSoup(html, 'html.parser').find_all(string=True)) # Explicit parser avoids the "no parser specified" warning
  return text

def personify(text):
  '''Extracts the last int at the end of string SPEAKER_00 -> 0
  '''
  match = re.search(r'\d+$', text)
  if match:
      return str(int(match.group()))
  else:
      return None

# Language list and the corresponding codes (taken from the Whisper documentation)
# Required for language -> code conversion in below code
languages = {'english': 'en', 'chinese': 'zh', 'german': 'de', 'spanish': 'es', 'russian': 'ru', 'korean': 'ko', 'french': 'fr', 'japanese': 'ja', 'portuguese': 'pt', 'turkish': 'tr', 'polish': 'pl', 'catalan': 'ca', 'dutch': 'nl', 'arabic': 'ar', 'swedish': 'sv', 'italian': 'it', 'indonesian': 'id', 'hindi': 'hi', 'finnish': 'fi', 'vietnamese': 'vi', 'hebrew': 'he', 'ukrainian': 'uk', 'greek': 'el', 'malay': 'ms', 'czech': 'cs', 'romanian': 'ro', 'danish': 'da', 'hungarian': 'hu', 'tamil': 'ta', 'norwegian': 'no', 'thai': 'th', 'urdu': 'ur', 'croatian': 'hr', 'bulgarian': 'bg', 'lithuanian': 'lt', 'latin': 'la', 'maori': 'mi', 'malayalam': 'ml', 'welsh': 'cy', 'slovak': 'sk', 'telugu': 'te', 'persian': 'fa', 'latvian': 'lv', 'bengali': 'bn', 'serbian': 'sr', 'azerbaijani': 'az', 'slovenian': 'sl', 'kannada': 'kn', 'estonian': 'et', 'macedonian': 'mk', 'breton': 'br', 'basque': 'eu', 'icelandic': 'is', 'armenian': 'hy', 'nepali': 'ne', 'mongolian': 'mn', 'bosnian': 'bs', 'kazakh': 'kk', 'albanian': 'sq', 'swahili': 'sw', 'galician': 'gl', 'marathi': 'mr', 'punjabi': 'pa', 'sinhala': 'si', 'khmer': 'km', 'shona': 'sn', 'yoruba': 'yo', 'somali': 'so', 'afrikaans': 'af', 'occitan': 'oc', 'georgian': 'ka', 'belarusian': 'be', 'tajik': 'tg', 'sindhi': 'sd', 'gujarati': 'gu', 'amharic': 'am', 'yiddish': 'yi', 'lao': 'lo', 'uzbek': 'uz', 'faroese': 'fo', 'haitian creole': 'ht', 'pashto': 'ps', 'turkmen': 'tk', 'nynorsk': 'nn', 'maltese': 'mt', 'sanskrit': 'sa', 'luxembourgish': 'lb', 'myanmar': 'my', 'tibetan': 'bo', 'tagalog': 'tl', 'malagasy': 'mg', 'assamese': 'as', 'tatar': 'tt', 'hawaiian': 'haw', 'lingala': 'ln', 'hausa': 'ha', 'bashkir': 'ba', 'javanese': 'jw', 'sundanese': 'su', 'cantonese': 'yue'}
language_list = list(languages.keys())
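
As a quick illustration of how the utility checks above behave, here is a self-contained sketch that mirrors their logic under different names (`media_kind`, `is_youtube_host`, and `speaker_index` are hypothetical stand-ins, and the file names and URLs are made-up examples). Note that a URL without a scheme has no parsed hostname and is therefore rejected.

```python
import mimetypes
import re
from urllib.parse import urlparse

YOUTUBE_HOSTS = {"www.youtube.com", "youtube.com", "m.youtube.com", "youtu.be"}


def media_kind(file_path):
    """Classify a path as 'audio', 'video' or 'unknown' from its MIME type."""
    mime_type, _ = mimetypes.guess_type(file_path)
    if mime_type:
        for kind in ("audio", "video"):
            if mime_type.startswith(kind):
                return kind
    return "unknown"


def is_youtube_host(url):
    """Return True if the URL's hostname is a known YouTube host."""
    return urlparse(url).hostname in YOUTUBE_HOSTS


def speaker_index(label):
    """Extract the trailing integer of a label, e.g. 'SPEAKER_07' -> '7'."""
    match = re.search(r"\d+$", label)
    return str(int(match.group())) if match else None


for path in ("talk.mp3", "clip.mp4", "notes.txt"):
    print(path, "->", media_kind(path))

for url in ("https://youtu.be/abc123", "youtu.be/abc123", "https://example.com/watch"):
    print(url, "->", is_youtube_host(url))

print(speaker_index("SPEAKER_07"))
```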

In [None]:
#@markdown #### **Youtube Video Downloader Function**
#@markdown * `YTVideoDownloader()` - This function given the youtube URL downloads the audio file and sets up the file for processing

#@markdown ---

def YTVideoDownloader(wrapper_object, video_url):
  '''
   Given a YouTube URL, this function downloads the audio file and sets up the file for processing
   Parameters:
    - wrapper_object -> An instance of the SemanticWrapper() class that stores all the metadata, including the transcription, for a specific audio file
    - video_url -> URL of the YouTube video
   Returns: [Bool]
     True - in case the audio file is downloaded and all the necessary setup is done successfully
     False - in case any error occurs during download or setup, or in case the YouTube link is invalid
  '''

  if not check_yt_link(video_url):
    return False

  try:
    # Creates the pytube media object
    media_stream = YouTube(video_url)
    media_stream.bypass_age_gate()

    # Captions (if human-annotated captions were uploaded by the video uploader) can also be downloaded using pytube
    # These can also serve as loosely labelled training data for transcription/translation

    print("Caption languages available for this video (can be used as transcription/translation training data): " + ", ".join([ re.search(r'lang="([^"]*)"', str(caption)).group(1) for caption in media_stream.caption_tracks]))

    # Downloading the audio file
    update_path = wrapper_object.info['update_path']

    audio_object = media_stream.streams.filter(only_audio=True, adaptive=True).desc()
    audio_object = audio_object[1] # Picks the second stream of the reversed list, intended to be the audio stream with the best kbps

    audio_file_path = audio_object.download(output_path=update_path, filename= "latest_audio_file")

    log_file(audio_file_path, "Downloaded_file")

    print('Audio downloaded successfully!')

    # Note:
    # Chose to work with the .mp3 format (lossy) rather than the .wav format (lossless), since .wav required ~500 MB of storage compared with ~30 MB for .mp3 for only ~30 mins of audio data
    # Also, the performance improvement of .wav over .mp3 is probably small.
    # And since the Whisper model uses 16-bit depth, there may not be any advantage to the .wav format.

    # The below is the default file-name to the recently updated file in the pipeline process
    wrapper_object.info['audio_file_path'] = update_path + "/latest_audio_file.mp3"       # Storing the file path of the updated file in the wrapper object

    # Convert them to .mp3 file format
    print('Converting to .mp3 format......')
    audio = AudioSegment.from_file(audio_file_path)
    audio.export(wrapper_object.info['audio_file_path'], format="mp3")

    print('Converted successfully to .mp3 format!')

    # Since we work with a locally downloaded version of the Whisper model, there is no need to chunk files to <25 MB as required by OpenAI's Whisper API (cloud deployed)

    log_file(wrapper_object.info['audio_file_path'], "to_mp3_converted.mp3")

    return True

  except Exception as e:
    print(f"Error during download: {e}")

    return False
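
The storage trade-off mentioned in the notes above can be estimated with simple arithmetic. The sketch below is illustrative only: the sample rates, channel counts, and the 128 kbps bitrate are assumptions, not measurements from this notebook, and `pcm_size_mb` and `cbr_size_mb` are hypothetical helper names.

```python
def pcm_size_mb(duration_s, sample_rate_hz, bit_depth, channels):
    """Size of uncompressed PCM audio (e.g. WAV payload) in megabytes."""
    bytes_per_sample = bit_depth // 8
    return duration_s * sample_rate_hz * bytes_per_sample * channels / 1e6


def cbr_size_mb(duration_s, bitrate_kbps):
    """Size of constant-bitrate compressed audio (e.g. MP3) in megabytes."""
    return duration_s * bitrate_kbps * 1000 / 8 / 1e6


duration = 30 * 60  # 30 minutes in seconds

print(f"WAV 44.1 kHz stereo 16-bit: {pcm_size_mb(duration, 44100, 16, 2):.1f} MB")
print(f"WAV 16 kHz mono 16-bit:     {pcm_size_mb(duration, 16000, 16, 1):.1f} MB")
print(f"MP3 128 kbps:               {cbr_size_mb(duration, 128):.1f} MB")
```

Under these assumptions the uncompressed file is roughly an order of magnitude larger than the MP3, which matches the direction of the observation in the notes; the exact figures depend on the source audio.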

In [None]:
#@markdown #### **Audio Resampling Utility Function**
#@markdown * `resample_rebit_audio()` - This function converts the audio file to a 16 kHz sample rate and 16-bit depth if it is not already in that format (as required by the Whisper transcription model)
#@markdown ---

def resample_rebit_audio(input_file_path, output_file_path):
  ''' This is a utility function that converts a given audio file to a 16 kHz sample rate with 16-bit depth, which is required by the Whisper transcription model
  Parameters:
    - input_file_path - Path of the audio file
    - output_file_path - Path of the audio file where the converted file will be stored
  Returns: None

  '''

  # Load the audio file
  waveform, sample_rate = torchaudio.load(input_file_path)

  # Get audio file metadata
  info = torchaudio.info(input_file_path)

  # Check if the audio needs to be resampled
  if sample_rate != 16000 or info.bits_per_sample != 16:
      # Resample to 16kHz if needed
      if sample_rate != 16000:
          resampler = T.Resample(orig_freq=sample_rate, new_freq=16000)
          waveform = resampler(waveform)

      # Convert to 16-bit depth if needed
      if info.bits_per_sample != 16:
          # Normalize to [-1, 1] range (guarding against division by zero for silent audio) and then scale to 16-bit PCM range
          waveform = waveform / waveform.abs().max().clamp(min=1e-8)
          waveform = (waveform * 32767).short()

      # Save the resampled audio file
      torchaudio.save(output_file_path, waveform, 16000, encoding="PCM_S", bits_per_sample=16)
  else:
      print("The audio file already meets the required conditions.")
      # Save the original file if no conversion is needed
      torchaudio.save(output_file_path, waveform, sample_rate, encoding="PCM_S", bits_per_sample=16)
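
The bit-depth conversion step above (peak-normalize, then scale to the signed 16-bit range) can be illustrated without torch. This is a minimal pure-Python sketch on a list of floats, not the notebook's implementation; `float_to_int16` is a hypothetical name, and truncation toward zero is used to mirror what a cast to a 16-bit integer does.

```python
def float_to_int16(samples):
    """Peak-normalize float samples and scale them to the 16-bit PCM range.

    Returns a list of ints in [-32767, 32767]. Silent input (all zeros)
    is returned as zeros instead of dividing by zero.
    """
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return [0] * len(samples)
    # int() truncates toward zero, like casting a float tensor to int16
    return [int(s / peak * 32767) for s in samples]


print(float_to_int16([0.0, 0.5, -1.0]))
print(float_to_int16([0.0, 0.0]))
```

Note that peak normalization changes the loudness of quiet recordings; whether that is desirable before a separate normalization stage is a design choice.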


In [None]:
#@markdown #### **Video To Audio Converter Utility Function**
#@markdown * `audio_video_converter()` - This function takes an audio/video file as input, performs checks, and outputs an audio file that meets the requirements of the model
#@markdown ---

def audio_video_converter(wrapper_object, file_path):
  '''This function takes an audio/video file as input, performs checks, and outputs an audio file that meets the requirements of the model
  Parameters:
    - wrapper_object -> An instance of the SemanticWrapper() class that stores all the metadata, including the transcription, for a specific audio file
    - file_path -> Path (`str`) of the file to convert (either video or audio)
  Returns:
    - True -> In case the file is converted as required and no error occurs
    - False -> In case the file type is unknown
    - Note: Changes are made on the .info object of wrapper_object
  '''

  wrapper_object.info['audio_file_path'] = wrapper_object.info['update_path'] + "/latest_audio_file.mp3" # New File Path

  if get_file_type(file_path) == 'audio':

    resample_rebit_audio(file_path, wrapper_object.info['audio_file_path']) # Resampling to 16KHz
    log_file(wrapper_object.info['audio_file_path'], "resampled_audio_after_aud_to_aud.mp3") # Logging

    print("Resampling done.")

    return True

  elif get_file_type(file_path) == 'video':

    converted_path = wrapper_object.info['update_path'] + '/latest_audio_file'

    # ffmpeg command to take a file and copy only the audio part into another file
    command = [
        'ffmpeg',
        '-i', file_path,
        '-vn', # No video
        '-acodec', 'copy',
        converted_path
    ]
    subprocess.run(command, check=True)
    log_file(converted_path, "converted_vid_to_aud") # Logging

    resample_rebit_audio(converted_path, wrapper_object.info['audio_file_path']) # Resampling to 16KHz
    log_file(wrapper_object.info['audio_file_path'], "resampled_audio_after_vid_to_aud.mp3") # Logging

    return True

  else: # Neither audio nor video (get_file_type() returned 'unknown')
    print("Unknown file type!")
    return False
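
The ffmpeg call above stream-copies the audio track (`-acodec copy`), so the output file has to use a container that matches the source codec. An alternative, shown here only as a hedged sketch and not as the notebook's approach, is to decode straight to 16 kHz mono 16-bit PCM in a WAV file. `build_extract_audio_command` is a hypothetical helper and the file names are placeholders; the flags themselves (`-y`, `-vn`, `-ac`, `-ar`, `-c:a pcm_s16le`) are standard ffmpeg options.

```python
def build_extract_audio_command(video_path, audio_path, sample_rate=16000):
    """Build an ffmpeg command that extracts mono 16-bit PCM audio."""
    return [
        "ffmpeg",
        "-y",                     # Overwrite the output file if it exists
        "-i", video_path,         # Input video
        "-vn",                    # Drop the video stream
        "-ac", "1",               # Downmix to a single channel
        "-ar", str(sample_rate),  # Resample to the target rate
        "-c:a", "pcm_s16le",      # Encode as signed 16-bit little-endian PCM
        audio_path,               # Output path; the .wav extension selects the container
    ]


command = build_extract_audio_command("input_video.mp4", "latest_audio_file.wav")
print(" ".join(command))
# To actually run it (requires ffmpeg on the PATH):
# import subprocess; subprocess.run(command, check=True)
```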

In [None]:
#@markdown #### **Audio preprocessing Techniques**
#@markdown `Class AudioPreprocessingTechniques`
#@markdown * `normalize_audio()` - Uses pydub's `normalize` effect to normalize the audio
#@markdown * `dynamic_range_compression()` - Uses pydub's `compress_dynamic_range` function to reduce the dynamic range (highs, lows) of the audio file
#@markdown * `noisereduce()` - Uses the noisereduce library to apply spectral gating and reduce the noise in the audio file
#@markdown * `weiner_filter()` - Uses SciPy's Wiener filter, which minimizes the mean square error of the estimated signal, to reduce the noise in the audio file
#@markdown ---

class AudioPreprocessingTechniques:
  '''This class contains the definitions of several audio preprocessing techniques.
    The methods of this class are:
    1. .normalize_audio()
    2. .dynamic_range_compression()
    3. .noisereduce()
    4. .weiner_filter()
  '''

  def normalize_audio(self, wrapper_object): # https://github.com/jiaaro/pydub
    ''' This function uses pydub's `normalize` effect to normalize the audio
        Parameters:
          - wrapper_object -> Instance of SemanticWrapper() that will contain all the information about the audio file (where it is stored and where it needs to be updated)
        Returns: None
    '''

    print("Normalizing audio......")

    audio_file_path = wrapper_object.info['audio_file_path']

    audio = AudioSegment.from_file(audio_file_path)                                       # Load audio file
    normalized_audio = normalize(audio)                                                   # Normalize audio
    normalized_audio.export(audio_file_path, format="mp3")                                # Export normalized audio (overwrite previous file)

    log_file(audio_file_path, "normalized_audio.mp3")

    print("Audio Normalization done!")



  def dynamic_range_compression(self, wrapper_object): # https://github.com/jiaaro/pydub
    ''' This function uses pydub's `compress_dynamic_range` function to reduce the dynamic range (highs, lows) of the audio file
        Parameters:
          - wrapper_object -> Instance of SemanticWrapper() that will contain all the information about the audio file (where it is stored and where it needs to be updated)
        Returns: None
    '''
    print("Doing Dynamic range compression of the Audio file....")

    audio_file_path = wrapper_object.info['audio_file_path']

    audio = AudioSegment.from_file(audio_file_path)                           # Load the audio file
    compressed_audio = compress_dynamic_range(audio)                          # Apply dynamic range compression
    compressed_audio.export(audio_file_path, format="mp3")                    # Export the compressed audio

    log_file(audio_file_path, "dynamic-range-compressed.mp3")                 # Log

    print("Dynamic Range Compression Completed!")



  def noisereduce(self, wrapper_object): # https://pypi.org/project/noisereduce/
    ''' This function uses the noisereduce library feature to apply spectral gating and reduce the noise in the audio file
        Parameters:
          - wrapper_object -> Instance of SemanticWrapper() that will contain all the information about the audio file (where it is stored and where it needs to be updated)
        Returns: None
    '''
    print("Applying Spectral Gating/ Noise Gating using Noise Reduce Library......")

    audio_file_path = wrapper_object.info['audio_file_path']

    rate, data = wavfile.read(audio_file_path)                                            # load data
    reduced_noise = nr.reduce_noise(y=data, sr=rate, torch_flag=True, device=device_id)   # perform noise reduction on GPU
    wavfile.write(audio_file_path, rate, reduced_noise)                                   # Export the file
    # gc.collect(); torch.cuda.empty_cache();                                             # Clear the GPU memory

    log_file(audio_file_path, "noise_reduced.mp3")

    print("Noise reduction using spectral gating done!")


  def weiner_filter(self, wrapper_object): # https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.wiener.html
    ''' This function uses SciPy's Wiener filter, which minimizes the mean square error of the estimated signal, to reduce the noise in the audio file
        Parameters:
          - wrapper_object -> Instance of SemanticWrapper() that will contain all the information about the audio file (where it is stored and where it needs to be updated)
        Returns: None
    '''
    print("Applying weiner filter on audio file....")

    audio_file_path = wrapper_object.info['audio_file_path']

    waveform, sample_rate = torchaudio.load(audio_file_path)                        # Load audio file

    if waveform.ndim > 1:                                                           # Check if multi-channel audio
        waveform = waveform.mean(dim=0)                                             # Convert to mono if stereo
    denoised = wiener(waveform.numpy(), mysize=None, noise=None)                    # Apply Wiener filter
    denoised_waveform = torch.tensor(denoised)                                      # Transfer to pytorch tensor to save the file

    torchaudio.save(audio_file_path, denoised_waveform.unsqueeze(0), sample_rate, encoding="PCM_S", bits_per_sample=16)  # Export the file

    log_file(audio_file_path, "weiner_filtered.mp3")

    print("Completed applying weiner filter on audio file to reduce noise!")
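
To make the idea behind dynamic range compression concrete, the sketch below implements only the static gain curve of a compressor: levels below a threshold pass through unchanged, and levels above it are reduced by a ratio. It is a simplification that ignores attack and release smoothing and is not pydub's implementation; `compressed_level_db` is a hypothetical name, and the threshold and ratio values are illustrative assumptions.

```python
def compressed_level_db(level_db, threshold_db=-20.0, ratio=4.0):
    """Static compressor curve: map an input level (dBFS) to an output level.

    Below the threshold the level is unchanged; above it, every `ratio` dB
    of input over the threshold yields only 1 dB of output over it.
    """
    if level_db <= threshold_db:
        return level_db
    return threshold_db + (level_db - threshold_db) / ratio


for level in (-30.0, -20.0, -10.0, 0.0):
    print(f"{level:6.1f} dBFS -> {compressed_level_db(level):6.1f} dBFS")
```

With these values, a 30 dB spread between the quietest and loudest example levels shrinks to 15 dB, which is the "reduced highs and lows" effect described above.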

In [None]:
#@markdown #### **Whisper Transcripter Class Definition**
#@markdown `Class whisper_transcripter`
#@markdown * `__init__()` - called when the whisper transcripter model is first initialised
#@markdown * `transcribe()` - This is the function that will be called with the audio file. This function will return the transcription output
#@markdown * `flush_memory()` - This will destroy the model initialization, releasing the RAM and GPU memory it was taking
#@markdown ---

class whisper_transcripter:
  ''' This class holds the initialization code for Stable Whisper (a superset of Whisper) and the transcribe function that will call the actual transcriber with the configuration
      The methods of this class are:
        1. __init__() - called when the whisper transcripter model is first initialised
            Note: This needs to be called only once in the app so as not overload the GPU with multiple instances of the weights of the model
        2. transcribe() - This is the function that will be called with the audio file. This function will return the transcription output
        3. flush_memory() - This will destroy the model initialization, releasing the RAM and GPU memory it was taking
  '''

  def __init__(self, model_size, device_id, compute_type):
    '''This function initializes the Whisper transcription model in CPU and GPU RAM
      Parameters:
        - model_size : Size of the Whisper transcription model
        - device_id : The device ("cuda" or "cpu") on which the model is initialized
        - compute_type : The numeric precision used for the weights and operations (lower precision reduces memory usage)
      Returns: None
    '''
    self.model = stable_whisper.load_faster_whisper(model_size, device=device_id, compute_type=compute_type)

  def transcribe(self, wrapper_object):
    '''This is the function that performs the actual transcription
    Parameters:
      - wrapper_object -> An instance of SemanticWrapper() that holds all the information about the audio file.
    Returns: None (the transcript is saved as JSON and its path is stored in wrapper_object.info['transcript_file_path'])
    Arguments predefined for transcription:
      - regroup=True -> A feature of stable-ts (stable_whisper), which builds on faster-whisper, which in turn builds on vanilla Whisper
                     -> Splits the transcript into understandable semantic chunks with respect to voice breaks in the audio file
      - denoiser='demucs' -> Demucs is a library from Meta that isolates voice from music. Although our inputs are not music files, Demucs works as a denoiser: it isolates the voice signal, effectively removing noise from the data
      - vad=True -> A feature of stable-ts which uses Silero VAD to cut the audio file into chunks where voice activity is detected. Transcription is not applied to chunks where no voice is detected
      - beam_size=3 -> Beam search decoding that keeps the 3 best candidate sequences at each step
      - condition_on_previous_text=True -> Lets the transcriber predict the current tokens conditioned on the previously generated text. This has been found to produce hallucinations in some audio.
      - vad_filter=True -> Same idea as vad (applies Silero VAD)
      - vad_parameters=dict(min_silence_duration_ms=1000) -> Applies VAD chunking when non-voice activity lasts longer than 1s
      - initial_prompt=None -> Can be used to steer how we want the transcript to look [Future Implementation]
      - temperature=0 -> To reduce hallucination
      - suppress_tokens=[] (currently commented out) -> Would allow tokens such as um, hmm, ha, which Whisper suppresses by default
      - hallucination_silence_threshold=1 -> Remedies a Whisper issue where repetitive words are produced during non-voice activity
      - hotwords=None -> Can be used to enter domain-specific words or names, providing a custom vocabulary to the transcriber [Future Implementation]
      - language -> The language to transcribe to (if None, it is auto-detected from the first 30 seconds) (what about multilingual audio files? [Future Implementation])
      - task -> transcribe/translate -> Whisper can directly translate to English. However, it is less trained for this task (only checkpoints are available)
    Options considered but not passed in the current call:
      - no_repeat_ngram_size=9 -> Blocks repeated n-grams of this size; a large value still allows genuine short repetitions that occur in the audio
      - nonspeech_error=0 -> Another parameter that controls how much error between words is tolerated
    '''

    result = self.model.transcribe_stable(audio=wrapper_object.info['audio_file_path'],
                                          regroup=True,
                                          denoiser='demucs',
                                          vad=True,
                                          beam_size=3,
                                          condition_on_previous_text=True,
                                          vad_filter=True,
                                          vad_parameters=dict(min_silence_duration_ms=1000),
                                          initial_prompt=None,
                                          temperature=0,
                                          # suppress_tokens=[],
                                          hallucination_silence_threshold=1,
                                          hotwords=None,
                                          language=wrapper_object.info['language_to_transcribe_to'], # The language code must be from one of the valid ones... (need to create a dropdown list)
                                          task='transcribe'
                                          )

    transcript_file_path = wrapper_object.info['update_path'] + '/transcript.json'

    result.save_as_json(transcript_file_path)
    # wrapper_object.info['whisper_transcript_unmodified'] = json_to_dict(transcript_file_path)

    wrapper_object.info['transcript_file_path'] = transcript_file_path

  def flush_memory(self,):
    '''
    This function will flush the GPU and CPU memory using gc (garbage collector) and using torch.cuda.empty_cache()
    Parameters: None
    Returns: None
    '''
    self.model = None
    gc.collect();
    torch.cuda.empty_cache();
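
The `language` argument passed to the transcriber has to be a valid Whisper language code, or `None` for auto-detection. As a minimal sketch of how user input could be validated before it is stored in the wrapper object, the helper below accepts a language name, a code, or nothing. `resolve_language` is a hypothetical function, and `LANGUAGES` is a small excerpt of the `languages` dictionary defined earlier, repeated here so the example is self-contained.

```python
LANGUAGES = {"english": "en", "hindi": "hi", "tamil": "ta", "telugu": "te"}  # Excerpt only


def resolve_language(user_input, languages=LANGUAGES):
    """Map a language name or code to a Whisper language code.

    Returns None for empty input so that the language is auto-detected,
    and raises ValueError for anything that is not a known name or code.
    """
    if user_input is None or not user_input.strip():
        return None
    key = user_input.strip().lower()
    if key in languages:           # Full name, e.g. "Hindi"
        return languages[key]
    if key in languages.values():  # Already a code, e.g. "hi"
        return key
    raise ValueError(f"Unsupported language: {user_input!r}")


print(resolve_language("Hindi"))
print(resolve_language("ta"))
print(resolve_language(""))
```

Failing early with a clear error is usually easier to debug than letting an invalid code reach the model.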

In [None]:
#@markdown #### **Speaker Diarization Class Definition**
#@markdown `class diarization_pipeline`
#@markdown * `__init__()` - called when the speaker diarization pipeline is first initialised
#@markdown * `diarize()` - This is the function that will be called with the audio file. It stores the diarization output (the timestamps at which each speaker spoke) in the wrapper object
#@markdown * `flush_memory()` - This will destroy the model initialization, releasing the RAM and GPU memory it was taking
#@markdown ---


class diarization_pipeline:

  def __init__(self, ):
    '''This function will initialize the Diarization model from Pyannote Audio that will segment the audio file w.r.to the speaker
    Parameters: None
    Returns: None
    '''
    self.diarization = None

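    # The Hugging Face access token is read from Colab secrets instead of being hard-coded; the secret name "HF_TOKEN" is an assumption
    self.pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1", use_auth_token=userdata.get("HF_TOKEN"))
    self.pipeline.to(device) # Uses the GPU when available, otherwise the CPU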

  def diarize(self, wrapper_object):
    '''This function will perform the speaker diarization and output the timestamps at which different speaker spoke in the audio
    Parameters:
      - wrapper_object -> An instance of SemanticWrapper() which will hold all the information with respect to the audio file
    '''
    waveform, sample_rate = torchaudio.load(wrapper_object.info['audio_file_path'])

    with ProgressHook() as hook:
      self.diarization = self.pipeline({"waveform": waveform, "sample_rate": sample_rate}, hook=hook, min_speakers=1, max_speakers=wrapper_object.info['max_no_of_speakers']) # Add parameter to dynamically add no of speakers
    assert isinstance(self.diarization, Annotation)

    wrapper_object.info['diarization_output'] = self.diarization

    print("Speaker Diarization Done!")

  def flush_memory(self,):
    '''
    This function will flush the GPU and CPU memory using gc (garbage collector) and using torch.cuda.empty_cache()
    Parameters: None
    Returns: None
    '''
    self.pipeline = None
    self.diarization = None
    gc.collect();
    torch.cuda.empty_cache();
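
Diarization tends to split one speaker's speech into several turns whenever there is a pause. The post-processing cell that follows describes combining segments from the same speaker when the gap between them is less than 2s. The sketch below illustrates that merging rule on plain `(start, end, speaker)` tuples; it is a simplified, hypothetical helper (`merge_speaker_turns`) with made-up timestamps, not the notebook's own implementation.

```python
def merge_speaker_turns(turns, max_gap=2.0):
    """Merge consecutive turns of the same speaker separated by a short gap.

    `turns` is a list of (start, end, speaker) tuples sorted by start time.
    Two neighbouring turns are merged when they share a speaker and the
    silence between them is shorter than `max_gap` seconds.
    """
    merged = []
    for start, end, speaker in turns:
        if merged and merged[-1][2] == speaker and start - merged[-1][1] < max_gap:
            prev_start, prev_end, _ = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), speaker)
        else:
            merged.append((start, end, speaker))
    return merged


turns = [
    (0.0, 4.0, "SPEAKER_00"),
    (4.5, 9.0, "SPEAKER_00"),    # 0.5s gap, same speaker: merged
    (9.2, 12.0, "SPEAKER_01"),   # Speaker change: new turn
    (15.0, 18.0, "SPEAKER_01"),  # 3.0s gap, same speaker: kept separate
]
for turn in merge_speaker_turns(turns):
    print(turn)
```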

In [None]:
#@markdown #### **Transcript Postprocessing Script Function Definition**
#@markdown `prepare_transcript()`
#@markdown * This function prepares, from the segmented transcript, a concatenated transcript with speaker annotations.
#@markdown * It combines the outputs of Whisper transcription and speaker diarization to produce the final transcript.
#@markdown * Method:
#@markdown  1. For every word in the transcript, the speaker diarization timestamps are used to assign the word to a speaker, creating speaker-based chunks.
#@markdown  2. Speaker diarization produces speaker segments based on VAD, so the same speaker is split into several segments when there are gaps. Speaker segments are therefore combined if the gap between them is less than 2s.
#@markdown  3. Finally, all transcript segments are concatenated, annotated with the speaker, to produce the final transcript.

#@markdown ---

def prepare_transcript(wrapper_object):
  ''' This function prepares, from the segmented transcript, a concatenated transcript with speaker annotations.
  It combines the outputs of Whisper transcription and speaker diarization to produce the final transcript
  Parameters: wrapper_object -> instance of SemanticWrapper() that holds all the information about the audio file
  Returns: None
  Method:
  1. For every word in the transcript, the speaker diarization timestamps are used to assign the word to a speaker, creating speaker-based chunks
  2. Speaker diarization produces speaker segments based on VAD, so the same speaker is split into several segments when there are gaps. Speaker segments are therefore combined if the gap between them is less than 2s.
  3. Finally, all transcript segments are concatenated, annotated with the speaker, to produce the final transcript.

  '''
  # Takes the wrapper object
  # Takes its transcription, and diarization data and prepares the transcript....
  # And stores the final transcript in itself

  # Get Data
  transcription = json_to_dict(wrapper_object.info['transcript_file_path'])
  diarization = wrapper_object.info['diarization_output']

  # Code idea from : https://github.com/thomasmol/cog-whisper-diarization

  init_segments = transcription['segments']

  # Initialize variables to keep track of the current position in both lists
  margin = 0.1  # 0.1 seconds margin

  # Initialize an empty list to hold the final segments with speaker info
  final_segments = []

  diarization_list = list(diarization.itertracks(yield_label=True))
  unique_speakers = {
      speaker for _, _, speaker in diarization.itertracks(yield_label=True)
  }
  detected_num_speakers = len(unique_speakers)

  speaker_idx = 0
  n_speakers = len(diarization_list)

  # Iterate over each segment
  for segment in init_segments:
      segment_start = segment["start"]
      segment_end = segment["end"]
      segment_text = []
      segment_words = []

      # Iterate over each word in the segment
      for word in segment["words"]:
          word_start = word["start"] - margin
          word_end = word["end"] + margin

          while speaker_idx < n_speakers:
              turn, _, speaker = diarization_list[speaker_idx]

              if turn.start <= word_end and turn.end >= word_start:
                  # Add word without modifications
                  segment_text.append(word["word"])

                  # Strip here for individual word storage
                  word["word"] = word["word"].strip()
                  #word["word"] = word["word"]
                  segment_words.append(word)

                  if turn.end <= word_end:
                      speaker_idx += 1

                  break
              elif turn.end < word_start:
                  speaker_idx += 1
              else:
                  break

      if segment_text:
          combined_text = " ".join(segment_text)
          cleaned_text = combined_text
          #cleaned_text = re.sub("  ", " ", combined_text).strip()
          new_segment = {
              "avg_logprob": segment["avg_logprob"],
              "start": segment_start,
              "end": segment_end,
              "speaker": speaker,
              "text": cleaned_text,
              "words": segment_words,
          }
          final_segments.append(new_segment)


  print("Starting cleaning")

  dict_to_json(final_segments, wrapper_object.info['update_path'] + '/logs/' + 'before_cleaning.json') # Log

  # TODO: split the text if a segment is longer than 15 seconds.
  # The speaker diarization timestamps and the chunk boundaries do not line up exactly;
  # an offset (maybe 300 ms) may be needed when evaluating the speaker diarization timestamps.

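  # Nothing to group if no transcript segment could be matched to a speaker turn
  if not final_segments:
      wrapper_object.info['final_transcript'] = ""
      return

  # Make output
  output = []  # Initialize an empty list for the output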

  # Initialize the first group with the first segment
  current_group = {
      "start": str(final_segments[0]["start"]),
      "end": str(final_segments[0]["end"]),
      "speaker": final_segments[0]["speaker"],
      "avg_logprob": final_segments[0]["avg_logprob"],
  }

  current_group["text"] = final_segments[0]["text"]

  current_group["words"] = final_segments[0]["words"]

  for i in range(1, len(final_segments)):
      # Assuming all segments are <= 15 seconds from the previous splitting....
      # Calculate time gap between consecutive segments
      time_gap = final_segments[i]["start"] - final_segments[i - 1]["end"]
      # If the current segment's speaker is the same as the previous segment's speaker,
      # and the time gap is less than or equal to 2 seconds, group them
      if final_segments[i]["speaker"] == final_segments[i - 1]["speaker"] and time_gap <= 2:
          current_group["end"] = str(final_segments[i]["end"])
          current_group["text"] += " " + final_segments[i]["text"]
          current_group.setdefault("words", []).extend(final_segments[i]["words"])
      else:
          # Add the current_group to the output list
          output.append(current_group)

          # Start a new group with the current segment
          current_group = {
              "start": str(final_segments[i]["start"]),
              "end": str(final_segments[i]["end"]),
              "speaker": final_segments[i]["speaker"],
              "avg_logprob": final_segments[i]["avg_logprob"],
          }
          current_group["text"] = final_segments[i]["text"]
          current_group["words"] = final_segments[i]["words"]

  # Add the last group to the output list
  output.append(current_group)

  dict_to_json(output, wrapper_object.info['update_path'] + '/logs/' + 'after_cleaning.json') # Log

  transcript = " "

  for segment in output:
    transcript += '[' + 'Person ' + personify(segment['speaker']) + '] '
    transcript += segment['text'] + " "
    transcript += '\n'

  wrapper_object.info['final_transcript'] = transcript
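
The method described in `prepare_transcript()` can be sketched on plain Python data. This is a simplified illustration, not the function above: the tuples stand in for pyannote turns and Whisper words, the names `assign_speakers`, `merge_segments` and `format_transcript` are hypothetical, and each word is labelled individually rather than per Whisper segment.

```python
from typing import Dict, List, Tuple

Turn = Tuple[float, float, str]  # (start, end, speaker): stand-in for a diarization turn
Word = Tuple[float, float, str]  # (start, end, text): stand-in for a Whisper word

def assign_speakers(words: List[Word], turns: List[Turn], margin: float = 0.1) -> List[Dict]:
    """Label each word with the first speaker turn that overlaps it.

    Both lists are assumed to be sorted by start time, so the pointer into
    `turns` only moves forward. Words with no overlapping turn are dropped.
    """
    labelled = []
    idx = 0
    for start, end, text in words:
        lo, hi = start - margin, end + margin  # widen the word by the margin
        while idx < len(turns) and turns[idx][1] < lo:
            idx += 1  # this turn ended before the word began
        if idx < len(turns) and turns[idx][0] <= hi:
            labelled.append({"start": start, "end": end,
                             "text": text, "speaker": turns[idx][2]})
    return labelled

def merge_segments(segments: List[Dict], max_gap: float = 2.0) -> List[Dict]:
    """Merge consecutive segments that share a speaker and are at most max_gap apart."""
    merged: List[Dict] = []
    for seg in segments:
        prev = merged[-1] if merged else None
        if (prev is not None
                and prev["speaker"] == seg["speaker"]
                and seg["start"] - prev["end"] <= max_gap):
            prev["end"] = seg["end"]
            prev["text"] += " " + seg["text"]
        else:
            merged.append(dict(seg))  # copy so the input is not mutated
    return merged

def format_transcript(groups: List[Dict]) -> str:
    # One line per group, tagged with the raw speaker label
    return "\n".join(f"[{g['speaker']}] {g['text']}" for g in groups)

turns = [(0.0, 2.0, "SPEAKER_00"), (2.5, 5.0, "SPEAKER_01")]
words = [(0.2, 0.6, "hello"), (1.0, 1.5, "there"), (3.0, 3.4, "hi"), (9.0, 9.3, "lost")]

labelled = assign_speakers(words, turns)
groups = merge_segments(labelled)
transcript = format_transcript(groups)
print(transcript)
```

The forward-only pointer keeps the matching linear in the number of words plus turns, which is the same idea as `speaker_idx` in the function above.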


In [None]:
#@markdown #### **Generative AI [LLM - Translation, Summarization, Question Answering] Class Definition**
#@markdown `class gen_ai_model`
#@markdown * `__init__()` - Configures Google's GenAI client to access the Gemini 1.5 Flash model.
#@markdown * `get_word_meaning_pairs()` - Defines a prompt that generates word-meaning pairs for important words in the transcript, to help cross-lingual translation and summarization. The word is in the source language in which the transcription was produced, and the meaning is in the target language into which the transcript needs to be translated or summarized.
#@markdown * `translate()` - Defines a prompt for the Gemini API that takes the generated transcript and the word-meaning pairs generated by `get_word_meaning_pairs()` and produces the translation.
#@markdown * `summarize()` - Similar to `translate()`, this function defines a prompt that summarizes the transcript using the transcript and the word-meaning pairs.
#@markdown * `ask_question()` - Defines a prompt for asking a question about the transcript.
#@markdown ---
#@markdown * `gen_ai_model_instance` - Instance of `gen_ai_model`, created here because it is lightweight (only configuration, no model loading).
#@markdown ---

class gen_ai_model:
  '''This class defines the methods used for Translation/Summarization/Question-Answering through Google's Gemini API
      1. __init__() - Configures Google's GenAI client to access the Gemini 1.5 Flash model
      2. get_word_meaning_pairs() - Defines a prompt that generates word-meaning pairs for important words in the transcript, to help cross-lingual translation and summarization. The word is in the source language in which the transcription was produced, and the meaning is in the target language into which the transcript needs to be translated or summarized.
      3. translate() - Defines a prompt for the Gemini API that takes the generated transcript and the word-meaning pairs generated by get_word_meaning_pairs() and produces the translation
      4. summarize() - Similar to translate(), this function defines a prompt that summarizes the transcript using the transcript and the word-meaning pairs.
      5. ask_question() - Defines a prompt for asking a question about the transcript
  '''

  def __init__(self,):
    ''' This function initializes the Google's GenAI model
    Parameters: None, Returns: None
    '''
    genai.configure(api_key=os.environ.get("GOOGLE_API_KEY"))  # Assumes the key is exported as GOOGLE_API_KEY; never hard-code a key in the notebook
    self.model = genai.GenerativeModel('gemini-1.5-flash')

  def get_word_meaning_pairs(self, transcript, source_language, target_language):
    '''This function defines a prompt that generates word-meaning pairs for important words in the transcript, to help cross-lingual translation and summarization. The word is in the source language in which the transcription was produced, and the meaning is in the target language into which the transcript needs to be translated or summarized.
        Note: This is based on the assumption that giving word-meaning pairs will improve the downstream task (Translation/Summarization/Question Answering)
        Parameters: (These parameters are provided by the translate() and summarize() functions)
          - transcript - the transcription output from the whisper model
          - source_language - the language of the transcript
          - target_language - the language in which the meanings should be written
    '''
    try:
      prompt = [
        f"You are a multilingual translator. Given the Transcript in the {source_language} language, try to create word: meaning pairs for the most important words in the transcript, where word is in the {source_language} language from the transcript and meaning is in the {target_language} language.",
        f"Try to write meanings of words, understanding the semantic context of the transcript. Generate up to 30 words which you find most important for contextual understanding of transcript.",
        f"input: [Start of Transcript]: {transcript} [End of Transcript]",
        "output: ಅಮ್ಮ (Amma) - माँ (Maa): वह महिला जिसने जन्म दिया है, माता।,\n ಮೂಕ (Mooka) - गूंगा (Goonga): वह व्यक्ति जो बोलने में असमर्थ है।\n Love - प्यार (Pyaar): एक गहरा, कोमल और प्रेमपूर्ण भावना।,\n Flower - फूल (Phool): पौधों के रंगीन और खुशबूदार अंग, जो पत्तियों के बीच खिलते हैं।\n ನೀರು (Neeru) - Water: A transparent, colorless liquid that is essential for life. \n ಮನೆ (Mane) - House: The place where people live, a dwelling. ",
      ]

      # print(self.model.count_tokens(" ".join(prompt))) # To check if the no of tokens has crossed 1million...

      # Generation configuration
      generation_config = {
        "temperature": 0.8,  # Sampling temperature; lower values make the output more deterministic
        "top_p": 0.6,  # Nucleus sampling cutoff; lower values restrict sampling to the more likely tokens
        "top_k": 30,  # Sample only from the 30 most likely tokens
        "max_output_tokens": 8192,
        "response_mime_type": "text/plain"
      }

      response = self.model.generate_content(prompt, generation_config=generation_config)

      data = response.to_dict() # Accessing the answer
      print(data)
      text = markdown_to_plain(data['candidates'][0]['content']['parts'][0]['text'])

      return text

    except Exception as e:
      print(f'{type(e).__name__}: {e}')
      return "Error : Please Rerun"

  def translate(self, transcript, source_language, target_language, progress):
    '''This function defines a prompt for the Gemini API that takes the generated transcript and the word-meaning pairs generated by get_word_meaning_pairs() and produces the translation
        Parameters: (These parameters are provided by the gradio app)
          - transcript - the transcription output from the whisper model
          - source_language - the language of the transcript
          - target_language - the language into which the transcript is translated
          - progress - checkpoints for the gradio app
    '''
    try:
      progress(.2, desc=f"Generating Word Meaning Pairs for Translation word is in {source_language} and meaning is in {target_language}") # This is a progress checkpoint for Gradio app

      # This function will get the word_meaning_pairs()
      word_meaning_pairs = self.get_word_meaning_pairs(transcript, source_language, target_language)

      print(word_meaning_pairs)

      prompt = [
        f"You are a helpful Multilingual Translation Assistant. Given the Transcript in the {source_language} language, and some word-meaning pairs where the word is in {source_language} and meaning is in {target_language}, try to generate accurate translation of the transcript in the {target_language}",
        f"You should make sure to write the translation as accurately as possible taking the understanding from the word-meaning pairs and the entire transcript for context and semantic understanding. Note that if there are no word meaning pairs then try to translate directly as accurately as possible without them.",
        f"input1: [Start of Transcript]: {transcript} [End of Transcript]",
        f"input2: [Word Meaning Pairs]: {word_meaning_pairs} [End of Word Meaning Pairs]",
        f"output: The transcript translation in the {target_language}",
      ]

      # print(self.model.count_tokens(" ".join(prompt))) # To check if the no of tokens has crossed 1million...

      progress(.3, desc=f"Generating Translation in the {target_language}")

      # Generation configuration for Gemini Model
      generation_config = {
        "temperature":0.8,
        "top_p": 0.6,
        "top_k": 30,
        "max_output_tokens": 8192,
        "response_mime_type": "text/plain"
      }

      response = self.model.generate_content(prompt, generation_config=generation_config)

      progress(.9, desc=f"Returning Generated Translation")

      data = response.to_dict()
      text = markdown_to_plain(data['candidates'][0]['content']['parts'][0]['text'])

      return text

    except Exception as e:
      print(f'{type(e).__name__}: {e}')
      return "Error : Please Rerun Translation"

  def summarize(self, transcript, source_language, target_language, progress):
    '''Similar to translate(), this function defines a prompt that summarizes the transcript using the transcript and the word-meaning pairs.
        Parameters: (These parameters are provided by the gradio app)
          - transcript - the transcription output from the whisper model
          - source_language - the language of the transcript
          - target_language - the language in which the summary is written
          - progress - checkpoints for the gradio app
    '''

    try:

      progress(.2, desc=f"Generating Word Meaning Pairs for Summarization word is in {source_language} and meaning is in {target_language}")

      word_meaning_pairs = self.get_word_meaning_pairs(transcript, source_language, target_language)

      prompt = [
        f"You are a multilingual summarizer. Given the Transcript in the {source_language} language, and some word-meaning pairs where the word is in {source_language} and meaning is in {target_language}, try to generate summary of the transcript in the {target_language}. Avoid writing [Person 1] tags. Summarize the transcript like an article. Focus on what was spoken from higher level to lower level.",
        f"Note that if there are no word meaning pairs then try to summarize directly as accurately as possible without them.",
        f"input1: [Start of Transcript]: {transcript} [End of Transcript]",
        f"input2: [Word Meaning Pairs]: {word_meaning_pairs} [End of Word Meaning Pairs]",
        f"output: The summary in the {target_language}",
      ]

      print(self.model.count_tokens(" ".join(prompt))) # To check if the no of tokens has crossed 1million...

      progress(.3, desc=f"Generating Summary in the {target_language}")

      generation_config = {
        "temperature":0.8,
        "top_p": 0.6,
        "top_k": 30,
        "max_output_tokens": 8192,
        "response_mime_type": "text/plain"
      }

      response = self.model.generate_content(prompt, generation_config=generation_config)

      progress(.9, desc=f"Returning Generated Summary")

      data = response.to_dict()
      print(str(data))
      text = markdown_to_plain(data['candidates'][0]['content']['parts'][0]['text'])

      return text

    except Exception as e:
      print(f'{type(e).__name__}: {e}')
      return "Error: Please Rerun Summarization"

  def ask_question(self, transcript, question, progress):
    '''This function defines a prompt for asking a question about the transcript
        Parameters: (These parameters are provided by the gradio app)
          - transcript - the transcription output from the whisper model
          - question - the question being asked about the transcript
          - progress - checkpoints for the gradio app
    '''
    try:
      prompt = [
        f"You are a helpful Question Answering Assistant. Given the Transcript, try to answer the question from understanding the given transcript.",
        f"input1: [Start of Transcript]: {transcript} [End of Transcript]",
        f"input2: [Question] {question} [End of Question]",
        f"output: Generate the answer for the question",
      ]

      progress(.3, desc=f"Answering Question")

      print(self.model.count_tokens(" ".join(prompt))) # To check if the no of tokens has crossed 1million...

      generation_config = {
        "temperature":0.8,
        "top_p": 0.6,
        "top_k": 30,
        "max_output_tokens": 8192,
        "response_mime_type": "text/plain"
      }

      response = self.model.generate_content(prompt, generation_config=generation_config)


      progress(.9, desc=f"Returning Generated Answer")

      data = response.to_dict()
      text = markdown_to_plain(data['candidates'][0]['content']['parts'][0]['text'])

      return text

    except Exception as e:
      print(f'{type(e).__name__}: {e}')
      return "Error: Please Rerun Question Answering"

# Initialise the GenAI model here (since it's lightweight)
gen_ai_model_instance = gen_ai_model()
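
Two details of the class above are easy to get wrong: each prompt part needs its own list element, and the response dictionary may not contain any candidates (for example when a request is blocked). A minimal sketch of both points, using only plain Python; `build_prompt` and `extract_text` are hypothetical helpers, and `fake_response` only mirrors the keys that the class reads from `response.to_dict()`.

```python
def build_prompt(instruction, transcript, extra=None):
    """Return the prompt as a list of separate parts."""
    parts = [
        instruction,
        f"input: [Start of Transcript]: {transcript} [End of Transcript]",
    ]
    if extra:
        parts.append(extra)
    return parts

def extract_text(data, default=""):
    """Walk candidates -> content -> parts -> text, returning `default` if any level is missing."""
    try:
        return data["candidates"][0]["content"]["parts"][0]["text"]
    except (KeyError, IndexError, TypeError):
        return default

# Adjacent string literals without a comma are silently joined by Python,
# so a forgotten comma merges two prompt parts into one.
joined = ["first part." "second part"]
separate = ["first part.", "second part"]

# Hand-written stand-in for a response dictionary (not real API output)
fake_response = {"candidates": [{"content": {"parts": [{"text": "hello"}]}}]}

prompt = build_prompt("Answer the question.", "some transcript", "input2: [Question] Who spoke?")
print(len(joined), len(separate), extract_text(fake_response))
```

Returning a default instead of raising lets the caller decide how to report an empty answer, rather than relying on a broad `except Exception`.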

In [None]:
#@markdown #### **Semantic Wrapper Class**
#@markdown `class SemanticWrapper`
#@markdown * `__init__()` - Initializes a wrapper with a randomly generated file name and folder in which all the audio files, logs and JSON files are stored.
#@markdown * The object also holds the semantic-level metadata as well as pipeline-specific data.
#@markdown * The idea behind using an object: Python passes object references to functions, so every function that receives the wrapper reads and updates the same object, while each function keeps its own independent functionality.
#@markdown * This keeps the different functionalities independent of each other, reducing coupling and helping with debugging.
#@markdown * Each functionality can also be worked on and updated independently, without relying on the rest of the pipeline.
#@markdown * Scaling is also easier, since one object can be created per file and handled independently in batch processing, parallel processing or multithreading.
#@markdown ---

class SemanticWrapper:
  def __init__(self,):
    '''This function initialises an instance of SemanticWrapper with a folder path and file name under which all the audio, log and JSON files are stored.
    The instance also holds all the metadata about the audio file.
    Parameter: None
    Returns: None
    '''
    self.info = {}
    self.info['update_path'] = "generated_data/" + ''.join(random.choices(string.ascii_uppercase + string.digits, k=3))
    self.info['file_name'] = os.path.basename(self.info['update_path'])
    os.makedirs(self.info['update_path'] + '/logs')  # Creates generated_data/, the run folder and logs/ in one call
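
The wrapper design relies on how Python hands objects to functions: the function receives a reference to the same object, so mutations are visible to the caller, while rebinding the parameter name is not. A small sketch, assuming a reduced stand-in class `Wrapper` (hypothetical, not the real `SemanticWrapper`) that also uses `uuid` for the folder name to make collisions less likely than with three random characters.

```python
import os
import tempfile
import uuid

class Wrapper:
    """Hypothetical, reduced stand-in for SemanticWrapper."""
    def __init__(self, root):
        # Eight hex characters from a UUID as the run folder name
        self.info = {"update_path": os.path.join(root, uuid.uuid4().hex[:8])}
        os.makedirs(os.path.join(self.info["update_path"], "logs"))

def add_transcript(wrapper):
    # Mutates the dict that the caller also holds; nothing needs to be returned
    wrapper.info["final_transcript"] = "[Person 1] hello"

def rebind(wrapper):
    # Rebinding the local name does not affect the caller's object
    wrapper = None

root = tempfile.mkdtemp()  # temporary directory so the example leaves the project folder untouched
w = Wrapper(root)
add_transcript(w)
rebind(w)
print(w.info["final_transcript"])
```

This is why the pipeline stages can return `None` and still communicate through `wrapper_object.info`.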

In [None]:
#@markdown #### **PipeLine Function**
#@markdown `demoPipeline()`
#@markdown * This function is the pipeline that calls the different stages one after the other.
#@markdown ---
#@markdown Pipeline Order:
#@markdown 1. First Audio Preprocessing Techniques are applied on the audio
#@markdown  * 1. Normalization of Audio
#@markdown  * 2. Dynamic Range Compression on the audio file (currently commented out in the code)
#@markdown  * 3. Noise Reduce (using Spectral Gating) (currently commented out in the code)
#@markdown  * 4. Wiener Filter (mean average between sample bits)
#@markdown 2. Transcription of the audio file using Multilingual Whisper
#@markdown 3. Segmentation of the audio into speaker turns using the Pyannote Speaker Diarization Model
#@markdown 4. Combine Diarization Output + Transcript to Final Transcript
#@markdown 5. Save to JSON file
#@markdown ---

def demoPipeline(wrapperObject, progress):
  ''' This function is the pipeline that calls the different stages one after the other. The order of the stages matters, since each stage depends on the output of the previous ones.
      Parameters:
        - wrapperObject -> Instance of SemanticWrapper() that holds the audio file path and all data produced for it
        - progress - checkpoints for the gradio app
      Returns: None

      Pipeline Order:
      1. First Audio Preprocessing Techniques are applied on the audio
          - 1. Normalization of Audio
          - 2. Dynamic Range Compression on the audio file (currently commented out in the code)
          - 3. Noise Reduce (using Spectral Gating) (currently commented out in the code)
          - 4. Wiener Filter (mean average between sample bits)
      2. Transcription of the audio file using Multilingual Whisper
      3. Segmentation of the audio into speaker turns using the Pyannote Speaker Diarization Model
      4. Combine Diarization Output + Transcript to Final Transcript
      5. Save to JSON file

  '''

  progress(.1, desc="File Setup Complete -> Starting Audio Preprocessing - Normalizing Audio, Dynamic Range Compression, Noise Reduction, Wiener Filter")

  preprocessing = AudioPreprocessingTechniques()

  print("---------------------------------------------------------------------------------------------------")
  print("Starting Audio Preprocessing Techniques - Normalizing Audio, Dynamic Range Compression, Noise Reduction, Wiener Filter")
  print("---------------------------------------------------------------------------------------------------")

  preprocessing.normalize_audio(wrapperObject)
  # preprocessing.dynamic_range_compression(wrapperObject)                    # Is taking too much time - need to find efficient way to do this...
  # preprocessing.noisereduce(wrapperObject)                                  # noisereduce only supports .wav format as input. Conversion from .mp3 to .wav required.
  preprocessing.weiner_filter(wrapperObject)

  print("Audio Preprocessing Done!")

  progress(.3, desc="Audio Preprocessing Done! -> Initialising Whisper Model...")

  print("---------------------------------------------------------------------------------------------------")
  print("Starting Whisper Transcription")
  print("---------------------------------------------------------------------------------------------------")

  print("----------Initializing Whisper Transcription Model")

  whisper = whisper_transcripter("large-v3", device_id, "float16")


  progress(.4, desc="Whisper Model Initialised -> Starting Whisper Transcription...")

  print("----------Models Initialization Done!")

  whisper.transcribe(wrapperObject)

  print("---------------------------------------------------------------------------------------------------")
  print("Whisper Transcription Complete")
  print("---------------------------------------------------------------------------------------------------")

  whisper.flush_memory()

  progress(.7, desc="Whisper Transcription complete -> Initializing speaker diarization model...")

  print("---------------------------------------------------------------------------------------------------")
  print("Starting Speaker Diarization")
  print("---------------------------------------------------------------------------------------------------")

  diarization = diarization_pipeline()


  progress(.8, desc="Speaker Diarization Model Initialized -> Starting speaker diarization...")

  diarization.diarize(wrapperObject)

  diarization.flush_memory()

  progress(.9, desc="Speaker Diarization Complete -> Starting Transcript Preparation...")

  print("---------------------------------------------------------------------------------------------------")
  print("Starting Transcript Preparation")
  print("---------------------------------------------------------------------------------------------------")

  prepare_transcript(wrapperObject)

  wrapperObject.info['diarization_output'] = None
  dict_to_json(wrapperObject.info, wrapperObject.info['update_path'] + '/'+ "complete_data.json") # Log

  print("---------------------------------------------------------------------------------------------------")
  print("Pipeline Completed!")
  print("---------------------------------------------------------------------------------------------------")

  progress(.99, desc="Transcript Complete...")
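
The shape of `demoPipeline()` (report progress, run a stage on the shared wrapper, repeat) can be sketched independently of the audio models. Everything below is a stand-in: `run_pipeline`, the three stage functions and `record_progress` are hypothetical, and the callback only imitates how `progress(fraction, desc=...)` is called in this notebook.

```python
def run_pipeline(state, stages, progress):
    """Run (fraction, description, function) stages in order on a shared state."""
    for fraction, description, stage in stages:
        progress(fraction, desc=description)  # report before the stage starts
        stage(state)
    progress(0.99, desc="Done")
    return state

# Stand-in stages; the real ones preprocess audio, transcribe and diarize
def preprocess(state):
    state["audio"] = "clean"

def transcribe(state):
    state["segments"] = ["hello"]

def diarize(state):
    state["speakers"] = ["SPEAKER_00"]

log = []

def record_progress(fraction, desc=None):
    # Collects the checkpoints instead of updating a progress bar
    log.append((fraction, desc))

state = run_pipeline(
    {},
    [(0.1, "Preprocessing", preprocess),
     (0.4, "Transcribing", transcribe),
     (0.8, "Diarizing", diarize)],
    record_progress,
)
print(log)
```

A recording callback like this is also a convenient way to exercise a pipeline function outside the Gradio app.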



In [None]:
#@markdown #### **Gradio Connector Functions for Transcription Pipeline**
#@markdown `transcribe_file()` and `transcribe_yt()`
#@markdown *  These are connector functions that will connect the Gradio APP with the entire pipeline.
#@markdown ---

def transcribe_file(file_path, language, slider_max, progress=gr.Progress()):
    ''' This is a connector function that connects the Gradio app with the entire pipeline.
    It sets up the wrapper object and all the paths needed to start and complete the transcription pipeline.
    This function is specific to file inputs.
    Parameters:
      - file_path -> Path to the file uploaded through the gradio app to transcribe
      - language -> Language selected by the user, describing the language of the audio in the file
      - slider_max -> int value chosen by the user, giving the maximum number of speakers present in the audio file
      - progress -> checkpoints for Gradio app
    Returns:
      - Transcript -> Transcript generated by the pipeline
      - "result" -> str : This is just a string that denotes a state change in the Gradio app
      - "Select language for Translation" -> str: This will be updated in the Translation tab of the Gradio app
      - "Select Language for Summarization" -> str: This will be updated in the Summarization tab of the Gradio app
      - "File Details" -> This will be updated in the File Details text box
      - "Please Ask a Question" -> str : This will be updated in the Question-Answering tab
      - Language -> Language code selected for transcription, or None if no language was given
    '''

    progress(.01, desc="Extracting audio file....")

    wrapperObject = SemanticWrapper()
    wrapperObject.info['url_or_file_path'] = file_path
    audio_video_converter(wrapperObject, file_path)

    if language is not None:
      wrapperObject.info['language_to_transcribe_to'] = languages[language]
    else:
      wrapperObject.info['language_to_transcribe_to'] = None

    wrapperObject.info['max_no_of_speakers'] = slider_max

    # Start Pipeline
    demoPipeline(wrapperObject, progress)

    print("Transcription complete.")

    return wrapperObject.info['final_transcript'], "result", "Select language for Translation", "Select Language for Summarization",  f"File Path: {os.path.basename(wrapperObject.info['url_or_file_path'])}, \n File Name: {wrapperObject.info['file_name']}", "Please ask a Question", wrapperObject.info['language_to_transcribe_to']  # Returning a state to trigger interface change

def transcribe_yt(video_url, language, slider_max, progress=gr.Progress()):
    ''' This is a connector function that connects the Gradio app with the entire pipeline.
    It sets up the wrapper object and all the paths needed to start and complete the transcription pipeline.
    This function is specific to YouTube URLs as input.
    Parameters:
      - video_url -> URL of the YouTube video on which the transcription pipeline is run
      - language -> Language selected by the user, describing the language of the audio in the video
      - slider_max -> int value chosen by the user, giving the maximum number of speakers present in the audio
      - progress -> checkpoints for Gradio app
    Returns:
      - Transcript -> Transcript generated by the pipeline
      - "result" -> str : This is just a string that denotes a state change in the Gradio app
      - "Select language for Translation" -> str: This will be updated in the Translation tab of the Gradio app
      - "Select Language for Summarization" -> str: This will be updated in the Summarization tab of the Gradio app
      - "File Details" -> This will be updated in the File Details text box
      - "Please Ask a Question" -> str : This will be updated in the Question-Answering tab
      - Language -> Language code selected for transcription, or None if no language was given
    '''

    progress(.01, desc="Downloading audio file....")

    wrapperObject = SemanticWrapper()
    wrapperObject.info['url_or_file_path'] = video_url
    YTVideoDownloader(wrapperObject, video_url)

    if language is not None:
      wrapperObject.info['language_to_transcribe_to'] = languages[language]
    else:
      wrapperObject.info['language_to_transcribe_to'] = None

    wrapperObject.info['max_no_of_speakers'] = slider_max

    # Start Pipeline
    demoPipeline(wrapperObject, progress)

    print("Transcription complete.")
    return wrapperObject.info['final_transcript'], "result", "Select language for Translation", "Select Language for Summarization",  f"URL Path: {os.path.basename(wrapperObject.info['url_or_file_path'])}, \n File Name: {wrapperObject.info['file_name']}", "Please ask a Question", wrapperObject.info['language_to_transcribe_to']  # Returning a state to trigger interface change
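
`transcribe_file()` and `transcribe_yt()` store the same three fields on the wrapper before starting the pipeline. A sketch of that shared setup as one helper, under stated assumptions: `resolve_language` and `build_job` are hypothetical names, and the two-entry `languages` dictionary is an invented subset standing in for the notebook's real name-to-code mapping.

```python
def resolve_language(name, languages):
    """Map a dropdown name to a language code; None is passed through unchanged."""
    if name is None:
        return None
    return languages[name]

def build_job(source, language, max_speakers, languages):
    # Collects the fields that both connector functions store on the wrapper
    return {
        "url_or_file_path": source,
        "language_to_transcribe_to": resolve_language(language, languages),
        "max_no_of_speakers": int(max_speakers),
    }

languages = {"english": "en", "kannada": "kn"}  # assumed subset, for illustration only

job = build_job("talk.mp3", "kannada", 3, languages)
auto = build_job("talk.mp3", None, 6, languages)
print(job)
```

Keeping the setup in one place means a change to the stored fields only has to be made once for both input types.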



In [None]:
#@markdown #### **Gradio Connector Functions for Gen AI Pipeline**
#@markdown `translate()`, `summarize()` and `question_answer()`
#@markdown *  These are connector functions that will connect the Gradio APP with the Gemini API Pipeline
#@markdown ---

def translate(transcript, source_language, target_language, progress=gr.Progress()):
    '''This function connects the gradio app with the gemini translation prompt
    Parameters:
    - transcript -> Transcript generated by the pipeline
    - source_language -> Language of the transcript
    - target_language -> Language in which the translation is requested
    - progress -> checkpoints for Gradio app
    Returns: Translated Text [str]
    '''
    translated_text = gen_ai_model_instance.translate(transcript, source_language, target_language, progress)

    return translated_text

def summarize(transcript, source_language, target_language, progress=gr.Progress()):
    '''This function connects the gradio app with the gemini summarization prompt
    Parameters:
    - transcript -> Transcript generated by the pipeline
    - source_language -> Language of the transcript
    - target_language -> Language in which the summary is requested
    - progress -> checkpoints for Gradio app
    Returns: Summary [str]
    '''
    summarized_text = gen_ai_model_instance.summarize(transcript, source_language, target_language, progress)

    return summarized_text

def question_answer(transcript, question, progress=gr.Progress()):
    '''This function connects the gradio app with the gemini question answering prompt
    Parameters:
    - transcript -> Transcript generated by the pipeline
    - question -> Question asked by the user
    - progress -> checkpoints for Gradio app
    Returns: answer [str]
    '''
    answer = gen_ai_model_instance.ask_question(transcript, question, progress)

    return answer

    # Multi turn question answering is not implemented yet...
    # Cross language question answering is not implemented yet....

In [None]:
#@markdown #### **Gradio App Code**
#@markdown `reload_gradio_app()`
#@markdown *  This cell contains the code for the entire Gradio App
#@markdown ---

app = None

def reload_gradio_app():

    global app

    with gr.Blocks(theme=gr.themes.Base()) as app_instance:  # Note that even though the app is defined inside the function, we define it on the global namespace at the later part of function code.

      gr.Markdown(
        """
        # MultiLingual Whisper Transcription with *Gemini* powered Translation and Summarization
        #
        """)

      state = gr.State(value="transcription")  # Initial state
      with gr.Row():

          with gr.Column(scale=1) as transcription_interface:
              with gr.Tabs():

                  with gr.TabItem("Upload Audio/Video File"):
                      upload_file = gr.File(label="Upload File")
                      dropdown_file = gr.Dropdown(choices=language_list, label="Audio Language [Optional: Default -> English]", value="english")
                      slider_max_file = gr.Slider(minimum=1, maximum=10, label="Max No. of Speakers [Optional: Default -> 6]", step=1, value=6)
                      submit_button_file = gr.Button("Transcribe")

                  with gr.TabItem("Upload YouTube Link"):
                      youtube_link = gr.Textbox(label="YouTube Link")
                      dropdown_yt = gr.Dropdown(choices=language_list, label="Audio Language [Optional: Default -> English]", value="english")
                      slider_max_yt = gr.Slider(minimum=1, maximum=10, label="Max No. of Speakers [Optional: Default -> 6]", step=1, value=6)
                      submit_button_yt = gr.Button("Transcribe")

          with gr.Column(visible=False, scale=1) as result_interface:

              with gr.Column():

                  with gr.Column():
                      refresh = gr.Button("Upload New File")
                      gr.Markdown(f""" *** """)

                  with gr.Column("File Details"):
                      file_details = gr.Textbox(label="File Details", lines=4, value="Please first upload a file")
                      gr.Markdown(f""" *** """)
                      language_detected = gr.Textbox(visible=False)

              with gr.Column("Translate"):
                  target_language_translate = gr.Dropdown(choices=language_list, label="Translate to Language [Default -> english]", value="english")
                  translate_button = gr.Button("Translate")

              with gr.Column("Summarize"):
                  target_language_summarize = gr.Dropdown(choices=language_list, label="Summarize in Language [Default -> english]", value="english")
                  summarize_button = gr.Button("Summarize")

          with gr.Column(scale=2) as result_boxes:
              # with gr.Column("File Details"):
              #     file_details = gr.Textbox(label="File Details", lines=3, value="Please first upload a file")
              with gr.Tabs():
                  with gr.TabItem("Transcription"):
                      transcription_output = gr.Textbox(label="Transcription Result", lines=20, show_copy_button=True, value="Please provide a YouTube link or upload an audio/video file")
                  with gr.TabItem("Translation"):
                      translation_output = gr.Textbox(label="Translation Result", lines=20, show_copy_button=True, value="Please first transcribe an audio file")
                  with gr.TabItem("Summarization"):
                      summarization_output = gr.Textbox(label="Summarization Result", lines=20, show_copy_button=True, value="Please first transcribe an audio file")
                  with gr.TabItem("Question Answering"):
                      question_input = gr.Textbox(label="Question", lines=3, show_copy_button=True, value="Please first transcribe an audio file")
                      question_button = gr.Button("Ask")
                      answer_output = gr.Textbox(label="Answer", lines=20, show_copy_button=True, value="")

      # Transition logic based on state
      def switch_to_result(state):
          if state == "result":
              return gr.update(visible=False), gr.update(visible=True), gr.update(visible=True)
          else:
              return gr.update(visible=True), gr.update(visible=False), gr.update(visible=True)


      state.change(fn=switch_to_result, inputs=state, outputs=[transcription_interface, result_interface, result_boxes])

      submit_button_file.click(
          fn=transcribe_file,
          inputs=[upload_file, dropdown_file, slider_max_file],
          outputs=[transcription_output, state, translation_output, summarization_output, file_details, question_input, language_detected]
      )

      submit_button_yt.click(
          fn=transcribe_yt,
          inputs=[youtube_link, dropdown_yt, slider_max_yt],
          outputs=[transcription_output, state, translation_output, summarization_output, file_details, question_input, language_detected]
      )

      translate_button.click(
          translate,
          inputs=[transcription_output, language_detected, target_language_translate],
          outputs=translation_output
      )

      summarize_button.click(
          summarize,
          inputs=[transcription_output, language_detected, target_language_summarize],
          outputs=summarization_output
      )

      question_button.click(
          question_answer,
          inputs=[transcription_output, question_input],
          outputs=answer_output
      )

      refresh.click(None, js="window.location.reload()")

      switch_to_result(state.value)  # Initialize the correct interface visibility

    app = app_instance
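
The visibility rule in `switch_to_result` can be checked without launching the app. The sketch below is a hypothetical pure-Python mirror of that rule: `panel_visibility` is an illustrative name that is not part of the app, and it returns plain booleans where the app returns `gr.update(visible=...)`.

```python
def panel_visibility(state):
    """Return visibility flags for
    (transcription_interface, result_interface, result_boxes)."""
    show_results = state == "result"
    # The upload panel and the result controls are mutually exclusive;
    # the output boxes stay visible in both states.
    return (not show_results, show_results, True)


print(panel_visibility("transcription"))
print(panel_visibility("result"))
```

Keeping the rule in one small function makes it easy to confirm that any state other than `"result"` falls back to the upload view.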

## Run App 🚀

In [None]:
reload_gradio_app()
app.launch(debug=True)

Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
Running on public URL: https://7a0e9bdd470fff20be.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


Resampling done.
---------------------------------------------------------------------------------------------------
Starting Audio Preprocessing Techniques - Normalizing Audio, Dynamic Range Compression, Noise Reduction, Weiner Filter
---------------------------------------------------------------------------------------------------
Normalizing audio......
Audio Normalization done!
Applying weiner filter on audio file....
Completed applying weiner filter on audio file to reduce noise!
Audio Preprocessing Done!
---------------------------------------------------------------------------------------------------
Starting Whisper Transcription
---------------------------------------------------------------------------------------------------
----------Initializaing Whisper Transcription Model


config.json:   0%|          | 0.00/2.39k [00:00<?, ?B/s]

vocabulary.json:   0%|          | 0.00/1.07M [00:00<?, ?B/s]

model.bin:   0%|          | 0.00/3.09G [00:00<?, ?B/s]

preprocessor_config.json:   0%|          | 0.00/340 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.48M [00:00<?, ?B/s]

----------Models Initialization Done!


Downloading: "https://dl.fbaipublicfiles.com/demucs/hybrid_transformer/955717e8-8726e21a.th" to /root/.cache/torch/hub/checkpoints/955717e8-8726e21a.th
100%|██████████| 80.2M/80.2M [00:01<00:00, 57.7MB/s]
Demucs: 100%|██████████| 340.11/340.11 [00:15<00:00, 21.90sec/s]


Detected Language: english
Transcribing with faster-whisper (large-v3)...

Transcribe: 100%|██████████| 340.11/340.11 [00:30<00:00, 11.01sec/s]
Downloading: "https://github.com/snakers4/silero-vad/zipball/master" to /root/.cache/torch/hub/master.zip
VAD: 100%|██████████| 340.11/340.11 [00:17<00:00, 19.48sec/s]
Adjustment: 100%|██████████| 339.99/339.99 [00:00<00:00, 13293.26sec/s]


Saved: /content/generated_data/0IB/transcript.json
---------------------------------------------------------------------------------------------------
Whisper Transcription Complete
---------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------
Starting Speaker Diarization
---------------------------------------------------------------------------------------------------


config.yaml:   0%|          | 0.00/469 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/5.91M [00:00<?, ?B/s]

config.yaml:   0%|          | 0.00/399 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/26.6M [00:00<?, ?B/s]

config.yaml:   0%|          | 0.00/221 [00:00<?, ?B/s]

Speaker Diarization Done!
---------------------------------------------------------------------------------------------------
Starting Transcript Preparation
---------------------------------------------------------------------------------------------------
Starting cleaning
---------------------------------------------------------------------------------------------------
Pipeline Completed!
---------------------------------------------------------------------------------------------------
Transcription complete.
{'candidates': [{'content': {'parts': [{'text': "Here are some important word: meaning pairs from the transcript, focusing on the context and Hindi translation:\n\n**English** | **Hindi** | **Meaning**\n------- | -------- | --------\nCongratulations | बधाई | A formal expression of joy or good wishes\nThank you | धन्यवाद | An expression of gratitude\nJoining | शामिल होना | To become part of something\nOver to you | आपका बारी |  A phrase used to pass the conversation to someone



In [None]:
# Stop the above running cell before running this cell
app.close()

Closing server running on port: 7860


In [None]:
gc.collect();
torch.cuda.empty_cache();
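
The two calls above first run Python's garbage collector and then release PyTorch's cached GPU memory. A hedged sketch of wrapping them in one helper follows; `free_memory` is a hypothetical name, and the guards around `torch` and CUDA are assumptions added so the helper also runs on a CPU-only runtime.

```python
import gc


def free_memory():
    """Run the garbage collector, then release cached GPU memory if possible.

    Returns the number of unreachable objects found by the collector.
    """
    collected = gc.collect()
    try:
        import torch  # assumption: torch may be missing on some runtimes
    except ImportError:
        return collected
    if torch.cuda.is_available():
        # Hands cached, unused blocks back to the GPU driver.
        torch.cuda.empty_cache()
    return collected


print("Objects collected:", free_memory())
```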

# --------------------------------------------------------------------------------------------

## Evaluation

A comprehensive evaluation of an ASR model requires large datasets (at least ~1 GB) and GPU compute to run. For this assignment, we evaluate the ASR pipeline on audio files for which a 'gold' transcript is available.

For English:
* Audio file (15 mins): [yt-link](https://www.youtube.com/watch?v=ysLiABvVos8)
* Gold transcript: [Transcript Link](https://www.newsonair.gov.in/bulletins-detail/english-morning-news-7/)

For Hindi:
* Audio file (15 mins): [yt-link](https://www.youtube.com/watch?v=wDz9BNWXfn8)
* Gold transcript: [Transcript Link](https://www.newsonair.gov.in/bulletins-detail/%e0%a4%b8%e0%a4%ae%e0%a4%be%e0%a4%9a%e0%a4%be%e0%a4%b0-%e0%a4%b8%e0%a4%82%e0%a4%a7%e0%a5%8d%e0%a4%af%e0%a4%be-77/)

I intended to assess the ASR system on Kannada and Telugu as well, but was not able to find gold transcripts for these languages in the short time available.
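
Comparing a pipeline transcript with a gold transcript is commonly summarized as word error rate: WER = (S + D + I) / N, where S, D and I are the word substitutions, deletions and insertions needed to turn the reference into the hypothesis, and N is the number of reference words. The sketch below assumes WER is the metric of interest and that both texts are already normalized and whitespace-tokenized; `word_error_rate` is a hypothetical helper, not the notebook's evaluation code.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] is the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the) over six words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.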

### Evaluation Setup
Run this section only after running the Full Setup 🏗️ cells above.

#### Gold Transcript and testing dataset


In [None]:
# Gold Transcript for video link https://www.youtube.com/watch?v=ysLiABvVos8
g_transcript_eng = """
Presents Morning News. Good Morning, I'am Sairah Mujtabha. The headlines:
Campaigning for remaining phases of Lok Sabha elections intensifies.
President Droupadi Murmu to confer Padma Awards at second investiture ceremony in Rashtrapati Bhavan today.
Tri-service Conference, ‘Parivartan Chintan’, chaired by Chief of Defence Staff General Anil Chauhan to be held in New Delhi.
Unrest erupts in Pakistan Occupied Kashmir as Pakistan security forces deployed ahead of protests.
IMD forecasts fresh spell of heat wave to continue over Rajasthan and Madhya Pradesh.
In Table Tennis, Manika Batra reaches quarterfinals at Saudi Smash in Jeddah.

And now the news in detail:
Campaigning for the remaining phases of Lok Sabha elections has intensified. Voting in the fourth phase will take place on 13th of this month in 96 seats spread across ten states and union territory. As many as 1,717 candidates from 10 states and union territory are in the fray for this phase.
In Telangana, electioneering reached its peak in the state with only three days remaining for the campaigning before the May 13th polling. Top leaders of all major parties are making a beeline seeking support from the electors for their candidates.
BJP Senior Leader and Union Minister, Amit Shah will be taking part in electioneering in Bhongir Lok Sabha Constituency later this morning. He will be campaigning in support of party candidate Dr Boora Narsaiah Goud.
AICC senior leader, Rahul Gandhi will take part in electioneering in Telangana today and participate in Jana Jatara Sabha at Narsapur and attend a public meeting at Saroor Nagar on the outskirts of Hyderabad this evening. BRS Chief and former Chief Minister K Chandrasekhar Rao will hold a roadshow at Karimnagar later today.
Earlier, Prime Minister Narendra Modi addressed two public meetings yesterday in Karimnagar and Warangal Lok Sabha constituencies. Mr Modi said the bond between the Congress and the BRS was inseparable with regard to corruption. He alleged that the parties become a syndicate behind the screen though they criticise each other in the open.
BRS Chief and former Chief Minister K Chandrasekhar Rao said the BJP was invoking Pakistan for its electoral gains and neglecting real issues concerning the nation. Addressing people at a roadshow held at Patancheru, he said the ruling party had also maligned the nation on international platforms through undemocratic decisions like the arrest of Delhi Chief Minister Arvind Kejriwal apart from targeting whoever opposed it.

Meanwhile, BJP leader and Union Minister Nitin Gadkari said that unwavering public support has helped the BJP in implementing impactful schemes and projects at Centre. Addressing a public meeting in Delhi yesterday, in support of BJP candidate from South Delhi Lok Sabha seat, Ramvir Singh Bidhuri, Mr Gadkari said, the Union Government has spent over one lakh crore rupees in various projects over the past ten years that has transformed Delhi. He added that the government has also boosted water supply in the national capital and cleaning of Yamuna River.

In Uttar Pradesh, hectic election campaigning is scheduled for today.

To make impeccable strategy for Raebareli elections, Senior BJP leader and party president Jagat Prakash Nadda will be in state capital Lucknow today evening. He will hold meeting of senior party functionaries of Lucknow, Unnao, Mohanlal Ganj and Raebareli Loksabha constituencies. He will also head a meeting of the MLAs and MLCs of the constituencies which are going for poll in fourth phase of election. Prior to the meetings he will address political rallies in Chitrakoot and Fatehpur districts. On the other hand Samajwadi party chief Akhilesh Yadav will campaign in Bahraich, Dhaurahra and  Lakhimpur khiri constituencies. Congress General Secretary Priyanka Gandhi Vadra will also campaign in Raebareli constituency today.  SUSHIL CHANDRA TIWARI AKASHVANI NEWS LUCKNOW
In the fourth phase of Lok Sabha elections, the nomination process has been completed for 13 seats in Uttar Pradesh.  Polling on these seats will be held on the 13th of May. This also includes the important seat of Etawah, which is considered to be the stronghold of Samajwadi Party. BJP has been winning this seat for the last two Lok Sabha elections and this time the party has once again nominated sitting MP Ramshankar Katheria as its candidate. Jitendra Dohre from SP and former Hathras MP Sarika Singh Baghel from BSP are in the fray.

Five assembly constituencies of three districts Etawah, Auraiya and Kanpur Dehat come under Etawah Lok Sabha seat. This includes Etawah, Auraiya, Bharthana, Dibiyapur and Sikandra. Etawah got political recognition from the late Mulayam Singh Yadav, who was the head of the Samajwadi Party. Being the largest political family, Samajwadi Party has dominated the Etawah Lok Sabha seat. SP is the only party to win this seat three times in a row. SP had won from here in 1999, 2004 and 2009. After that, Ashok Dohre of BJP in 2014 and Ramshankar Katheria of BJP in 2019 became MP from here. BJP is trying to score a hattrick of victory on this seat by making Ramshankar Katheria its candidate. On the other hand, SP has fielded Jitendra Dohre to regain its seat. BSP candidate and former MP from Hathras, Sarika Singh Baghel is making the contest interesting. All parties have intensified campaigning in support of their candidates.Om Awasthi, Akashvani News, Lucknow.

President Droupadi Murmu will present the Padma Awards at the second Civil Investiture Ceremony at Rashtrapati Bhavan in Delhi today.
Padma Awards – one of the highest civilian Awards of the country are conferred in three categories – Padma Vibhushan, Padma Bhushan and Padma Shri. The Awards are given in various disciplines and fields of activities including art, social work, public affairs, science and engineering, trade and industry, medicine, literature and education, sports, and civil service. The awards are announced on the occasion of Republic Day every year.
Last month, the President presented three Padma Vibhushan, eight Padma Bhushan and 55 Padma Shri Awards on 22nd April.

Chief of Defence Staff General Anil Chauhan will chair a two-day conference- ‘Parivartan Chintan-2’ on jointness and integration of Armed Forces in New Delhi beginning today. Parivartan Chintan is a pioneering conference for the heads of all Tri-services Institutes and aims at generating novel reformative ideas and initiatives.
Members from all the sub committees of the apex Chiefs of Staff Committee; the CDS as its permanent Chairman and the three Service Chiefs, shall review the progress made in multiple domains, and ideate on the necessary reforms to achieve the desired end state towards transformation through jointness and integration.

External Affairs Minister  Dr. S Jaishankar has said that every political party in the country is committed to ensure that Pakistan Occupied Kashmir (PoK), which is a part of India, returns to India. Interacting with the students at the Gargi College in New Delhi, Mr. Jaishankar said, the PoK issue has come to the front of people’s thinking after revocation of Article 370 in Jammu and Kashmir.

The minister said that people assumed Article 370 could not be changed as the politics of the day had driven it deeply into the public consciousness.

Meanwhile, in the wake of the deteriorating economic situation, tensions are rising in Pakistan-occupied Jammu and Kashmir (PoJK) as the Pakistani administration has been deploying troops from Punjab Province to suppress the upcoming protest in the region on Saturday this week.

According to a joint statement issued by the UKPNP and JAAC, the demonstrators are protesting an unjustified taxation, high electricity bills, uncontrolled inflation, and severe shortages of essentials. Additionally, the people also demand ownership of local land and water resources and royalties to locals over the hydroelectric power produced in dams located in PoJK and Pakistan-Occupied Gilgit Baltistan.

India’s Foreign Secretary, Vinay Mohan Kwatra arrived in Dhaka on a short trip to Bangladesh, yesterday. In a social media post, Bangladesh Foreign Ministry shared details about Kwatra’s visit saying that he is expected to meet PM Sheikh Hasina, Foreign Minister Hasan Mahmud and also hold meeting with the Foreign Secretary of Bangladesh, Masud Bin Momen today.

Recently, in February, External Affairs Minister Dr. S Jaishankar extended a warm welcome to Bangladesh Foreign Minister Hasan Mahmud when he arrived in India, emphasising that their discussions aim to fortify the enduring friendship between India and Bangladesh.

In Punjab, wheat harvesting is going on at a rapid pace. The incidents of burning wheat residue after harvesting are also increasing day by day in various districts, due to which, the environment is being affected.

The number of incidents of wheat residue being set on fire in fields in various districts of the state has reached 2 thousand 612 as of yesterday evening. Out of it, the maximum number of 133 incidents has been reported in the border district of Firozpur. Apart from this, Fazilka, Bathinda, Gurdaspur, Mansa, Muktsar, Sangrur, and Tarn Taran are some of the districts where the number of these incidents is increasing rapidly. However, the authorities are also making farmers aware not to set the fire but despite this, these incidents are continuously increasing. Shishu Sharma Shantal, Akashvani News, Jalandhar

In Kerala, many Air India Express flights to the  GCC countries remain cancelled today as well, disrupting travel plans of many.

Flights from Kuwait, Ras Al Khaimah, Doha, Muscat and Sharjah to Kannur International Airport were cancelled by the airline. Flight services from the airport to Sharjah, Muscat, Dammam and Abu Dhabi also stand cancelled. Airport authorities have asked air travellers to first confirm the flight status before embarking on their journey.

The India Meteorological Department (IMD) has forecast that a fresh spell of heat wave will continue to prevail over Rajasthan and Madhya Pradesh till tomorrow. The IMD said, hot and humid weather will prevail over coastal Gujarat during the next 5 days and over Kerala and Mahe till tomorrow. The weather department also predicted light to moderate rainfall with thunderstorm, lightning and squally winds over North-East India till the 11th of this month.

Star Indian table tennis player Manika Batra stormed into the quarterfinals of the Saudi Smash in Jeddah, beating world No. 14 Nina Mittelham of Germany. Batra defeated Mittelham 11-6, 11-9, 11-7 to seal her maiden last eight entry in the event. This is Batra’s first win over Mittelham in four attempts. Now the Indian will face another formidable rival world No. 5 Hina Hayata of Japan for a place in the semifinals, today.

In IPL Cricket, Sunrisers Hyderabad defeated Lucknow Super Giants by ten wickets at the Rajiv Gandhi International Stadium in Hyderabad last night.

Today, Punjab Kings will take on Royal Challengers Bengaluru at Himachal Pradesh Cricket Association Stadium in Dharamshala.

"""

In [None]:
g_transcript_hin = """
लोकसभा चुनाव के सातवें और अंतिम चरण का मतदान सम्‍पन्‍न; 59 प्रतिशत से अधिक मतदान।

अरूणाचल प्रदेश और सिक्किम विधानसभा चुनाव की मतगणना कल।

भारतीय जनता पार्टी और आईएनडीए गठबंधन, दोनों के नेताओं ने लोकसभा चुनाव में बहुमत का दावा किया।

अफगानिस्‍तान में नाव दुर्घटना में 20 लोगों की मृत्‍यु।

ताइवान एथेलेटिक्‍स ओपन टूर्नामेंट में भारत के डी पी मनु ने भाला फेंक स्‍पर्धा में स्‍वर्ण पदक जीता।

समाचार संध्या के


लोकसभा चुनाव के सातवें और अंतिम चरण में रात आठ बजे तक 59 प्रतिशत से अधिक मतदान दर्ज किया गया। इस चरण में सात राज्‍यों और एक केन्‍द्र शासित प्रदेश की 57 संसदीय सीटों के लिए वोट डाले गए। इनमें 13-13 सीटें उत्‍तर प्रदेश और पंजाब की, नौ सीटें पश्चिम बंगाल, आठ- बिहार, छह- ओडिसा, चार- हिमाचल प्रदेश, तीन- झारखंड और एक- चंडीगढ की है। बिहार में लगभग 50 प्रतिशत, चंडीगढ में 63 प्रतिशत और झारखंड में 69 प्रतिशत से अधिक मतदान दर्ज हुआ। पंजाब में लगभग 56 प्रतिशत वोट डाले गए, उत्‍तर प्रदेश में 56 प्रतिशत और पश्चिम बंगाल में 70 प्रतिशत मतदान हुआ। ओडिसा में 64 प्रतिशत मतदान दर्ज हुआ, यहां 42 विधानसभा सीटों के लिए भी साथ ही चुनाव कराया गया।

लोकसभा चुनाव के सातवें और अंतिम चरण में 904 उम्‍मीदवार चुनाव मैदान में थे। प्रमुख उम्‍मीदवारों में प्रधानमंत्री और भारतीय जनता पार्टी के वरिष्‍ठ नेता नरेन्‍द्र मोदी, पार्टी नेता अनुराग सिंह ठाकुर, रवि शंकर प्रसाद और कंगना रनौत, कांग्रेस नेता मनीष तिवारी और विक्रमादित्‍य सिंह, राष्‍ट्रीय जनता दल की मीसा भारती और तृणमूल कांग्रेस के अभिषेक बनर्जी शामिल हैं।

चंडीगढ से हमारे संवाददाता ने बताया है कि पंजाब में लगभग दो करोड़ 14 लाख मतदाताओं में से लगभग 55 दशमलव 6 प्रतिशत मतदाताओं ने वोट डाले।

पंजाब के मतदाताओं ने आमतौर शांतिपूर्ण मतदान करने की अपनी परंपरा को इस बार भी बनाए रखते हुए 328 उम्मीदवारों के भाग्य को वोटिंग मशीनों में बंद कर दिया है। भीषण गर्मी और लू के थपेड़ों की परवाह न करते हुए हर उम्र वर्ग के लोग निकले। खासतौर पर सुबह तो मतदान शुरू होने से काफी समय पहले ही मतदान केंद्रों पर मतदाताओं की कतारें देखने को मिलींI मतदान प्रक्रिया के सफल संचालन में चुनाव आयोग द्वारा किए गए व्यापक प्रबंधों का भी अहम योगदान रहा है। शिशु शर्मा शांतल, आकाशवाणी समाचार, जालंधर।

सातवें और अंतिम चरण में उत्तर प्रदेश की 13 लोकसभा सीटों पर मतदान शांतिपूर्ण संपन्न हो गया। हमारे संवाददाता ने बताया है कि सोनभद्र जिले की दुद्धी विधानसभा सीट के उपचुनाव के लिए भी आज वोट डाले गए।

चिलचिलाती गर्मी से बचने के लिए सुबह से ही मतदान केंदो पर वोटरों की लंबी कटारे नजर आई। भीषण गर्मी की स्थिति के कारण चुनाव आयोग ने नींबू पानी, गुड़ ठंडा पानी, टेंट ओआरएस और अन्य जीवन रक्षक दवाओं की व्यवस्था की थी। कुछ स्थानों पर ईवीएम में गड़बड़ी के कारण मतदान थोड़ी देर से शुरू हुआ। पहली बार वोट डालने वाले मतदाता बेहद उत्साहित नजर आए। इन 13 सीटों पर प्रधानमंत्री और बीजेपी के वरिष्ठ नेता नरेंद्र मोदी, केंद्रीय मंत्री पंकज चौधरी, अनुप्रिया पटेल, महेंद्र नाथ पांडे, फिल्म स्टार काजल निषाद और रवि किशन, कांग्रेस नेता अजय राय, समाजवादी पार्टी के उम्मीदवार अफजाल अंसारी, रमेश बिंद और पूर्व प्रधानमंत्री चन्द्रशेखर के बेटे नीरज शेखर समेत कई दिग्गजों की किस्मत का फैसला ईवीएम में बंद हो गया। सुशील चंद्र तिवारी आकाशवाणी समाचार वाराणसी।

पटना से हमारे संवाददाता ने बताया कि बिहार में भी सातवें चरण का मतदान शांतिपूर्वक सम्‍पन्‍न हो गया।



बिहार में आठ संसदीय क्षेत्रों में आज बारह महिला प्रत्याशियों सहित 134 उम्मीदवारों की चुनावी किस्मत ईवीएम में बंद हो गयी। आरा संसदीय सीट से केंद्रीय मंत्री आर के सिंह और पटना साहिब संसदीय क्षेत्र से पूर्व केंद्रीय मंत्री रविशंकर प्रसाद इस चरण के प्रमुख उम्मीदवार हैं। वहीं दो पूर्व केंद्रीय मंत्रियों उपेंद्र कुशवाहा की चुनावी किस्मत काराकाट और रामकृपाल यादव का भाग्य पाटलिपुत्र संसदीय क्षेत्र से ईवीएम में बंद हो गया। राष्ट्रीय जनता दल के प्रमुख लालू प्रसाद की पुत्री मीसा भारती और चर्चित भोजपुरी गायक पवन सिंह भी इस चरण के प्रमुख उम्मीदवार थे। शहरी क्षेत्रों में जहां लोगों की मतदान के प्रति उदासीनता देखी गयी वहीं ग्रामीण क्षेत्रों में भीषण गर्मी के बावजूद महिला और बुजुर्ग वोटर बड़ी संख्या में मतदान करने के लिए बाहर निकले। धर्मेन्द्र कुमार राय, आकाशवाणी समाचार पटना

झारखंड के तीन निर्वाचन क्षेत्रों दुमका, गोड्डा और राजमहाल में कड़े सुरक्षा प्रबंधों के बीच मतदान शांतिपूर्वक सम्‍पन्‍न हो गया। मुख्‍य निर्वाचन अधिकारी रवि कुमार ने बताया कि आज के चुनाव में 67 दशमलव नौ-पांच प्रतिशत मतदान दर्ज किया गया है। हमारे संवाददाता ने बताया है कि शहरी क्षेत्रों की तुलना में ग्रामीण क्षेत्रों में अधिक मतदान दर्ज किया गया है।

2019 में हुए लोकसभा चुनाव के मुकाबले इस बार मतदान का प्रतिशत तीन दशमलव छह –तीन प्रतिशत कम रहा। दुमका, गोड्डा और राजमहल में हुए आज का मतदान एनडीए और इंडिया गठबंधन दोनों के लिए काफी अहम है। जेएमएम सुप्रीमो सीबू सोरेन की पुत्रवधु सीता सोरेन दुमाका सीट पर बीजेपी की टिकट पर चुनावी मुकाबले में है। जेएमएम के नलिन सोरेन उन्‍हें चुनौती दे रहे हैं। राजमहल में जेएमएम के विजय हांसा और भाजपा के ताला मरांडी आमने-सामने हैं। गोड्डा से भाजपा के निशिकांत दुबे और आईएनसी के प्रदीप यादव के बीच सीधा मुकालबा है। आकाशवाणी समाचार के लिए रांची से कृष्‍ण कुमार लाल।

हिमाचल प्रदेश में आज चार संसदीय सीटों के लिए 67 प्रतिशत से अधिक लोगों ने मतदान किया। महिलाएं, युवा और बुजुर्ग सभी बड़ी संख्या में शामिल हुए। मुख्य मुकाबला भाजपा और कांग्रेस के बीच था।हिमाचल प्रदेश में आज चार संसदीय सीटों के लिए 67 प्रतिशत से अधिक लोगों ने मतदान किया। महिलाएं, युवा और बुजुर्ग सभी बड़ी संख्या में शामिल हुए। मुख्य मुकाबला भाजपा और कांग्रेस के बीच था।


पश्चिम बंगाल में आज नौ संसदीय क्षेत्रों में मतदान के दौरान ईवीएम में खराबी, हिंसा, मतदाताओं को डराने-धमकाने की घटनाएं और सत्तारूढ़ तथा विपक्षी दलों के कार्यकर्ताओं के बीच झड़पें हुईं। जादवपुर के भांगर में एक पुलिस अधिकारी घायल हो गया।

देशभर में लोकसभा चुनाव की मतगणना की तैयारियां जोरों पर हैं। मतगणना मंगलवार को होगी। तेलंगाना में 34 स्‍थानों पर लगभग दस हजार अधिकारी तैनात किये गए हैं। मुख्‍य निर्वाचन अधिकारी विकास राज ने हैदराबाद में बताया कि सभी मतगणना केन्‍द्रों पर निषेधाज्ञा लागू करने के साथ ही तीन स्‍तरीय सुर‍क्षा व्‍यवस्‍था की गई है। केन्‍द्रीय सशस्‍त्र पुलिस बल की 12 कंपनियां तैनात की गई हैं।

अरूणाचल प्रदेश और स‍िक्किम विधानसभा चुनाव की कल होने वाली मतगणना की तैयारियां पूरी कर ली गई हैं। मतगणना सवेरे 6 बजे शुरू होगी। पहले डाक मत पत्रों की गिनती होगी। उसके बाद इलेक्‍ट्रोनिक पोस्‍टल बैलेट और ईवीएम के मतों की गिनती की जाएगी।

अरूणाचल प्रदेश की 60 सदस्‍यों की विधानसभा में भारतीय जनता पार्टी दस सीट र्निविरोध जीत चुकी है। स‍िक्किम में 32 सदस्‍यों की विधानसभा के लिए 146 उम्‍मीदवार मैदान में हैं।

इन विधानसभाओं का कार्यकाल कल यानी 2 जून को समाप्‍त हो रहा है। इन दोनों राज्‍यों में विधानसभा और लोकसभा चुनाव के लिए 19 अप्रैल को वोट डाले गए थे।

भारतीय जनता पार्टी और आई एन डी आई गठबंधन के नेताओं ने लोकसभा चुनाव में बहुमत प्राप्‍त करने का दावा किया है। भारतीय जनता पार्टी अध्‍यक्ष जगत प्रकाश नड्डा ने विश्‍वास व्‍यक्‍त किया है कि लोकसभा चुनाव में उनकी पार्टी को 370 से अधिक सीटे मिलेंगी और राष्ट्रीय जनतांत्रिक गठबंधन 400 से अधिक सीट जीतेगा। उन्‍होंने कहा कि देश के लोग तुष्टिकरण, भाई-भतीजावाद और भ्रष्‍टाचार की राजनीति को नकारेंगे तथा सशक्‍त, विकसित और आत्‍मनिर्भर भारत को अपना समर्थन देंगे। श्री नड्डा ने सभी मतदाताओं को धन्‍यवाद दिया।

सबसे बड़े लोकतंत्र का चुनाव सम्‍पन्‍न हुआ है तो मैं अपने मतदाताओं को बहुत-बहुत धन्‍यवाद देता हूं। जिन्‍होंने इतनी भीषण गर्मी में सातों फेज में मतदान किया। मैं विशेषकर के अपने युवा साथी नवमतदाताओं को भी बधाई देना चाहता हूं जिन्‍होंने पहली बार अपने मत का प्रयोग किया।

विपक्षी आईएनडीआईए गठबंधन ने लोकसभा चुनाव की मंगलवार को होने वाली मतगणना के पहले आज नई दिल्‍ली में बैठक की। यह बैठक भविष्‍य की रणनीति की रूपरेखा तैयार करने के लिए नई दिल्‍ली में कांग्रेस अध्‍यक्ष मल्लिकार्जुन खरगे के आवास पर हुई। गठबंधन ने 295 से अधिक लोकसभा सीटों पर जीत दर्ज करने का विश्‍वास व्‍यक्‍त किया। बैठक के बाद कांग्रेस अध्‍यक्ष मल्लिकार्जुन खरगे ने कहा कि इसमें चुनाव और आगे की चुनौतियों सहित कई मुद्दों पर चर्चा की गई।

आज हम डिटेल चर्चा करके काउंटिंग के दिन जो हमको केयर लेना चाहिए वो तय करके अपने कैडर को बहुत बार बोले हैं आज मिलकर सबको एक इशारा देना था कि हम किस ढंग से इस चुनाव के दिन लडना चाहिए और बोलना चाहिए। ये चीज पर हमने चर्चा की है।

पूर्वी अफगानिस्‍तान के नांगरहार प्रांत में एक नाव डूबने से 20 लोगों की मौत हो गई है। मोमांद दारा जिले के बसावुल इलाके में आज सुबह सात बजे 25 लोगों को ले जा रही नाव नदी में डूब गई। पांच लोगों को बचा लिया गया। अब तक पांच शव निकाले जा चुके हैं। दुर्घटना स्‍थल पर शवों की तलाश के लिए आपात राहत टीम भेजी गई हैं। हादसे के कारणों का पता लगाया जा रहा है।

केरल में तेज वर्षा जारी है। त्रिशूर, मलप्पुरम और कोझिकोड जिलों में रेड अलर्ट और इडुक्की, पलक्काड़, वायनाड जिलों में ऑरेंज अलर्ट जारी किया गया हैं। अलाप्पुझा, कोट्टायम, एर्नाकुलम, पत्तनमथिट्टा, कन्नूर और कासरगोड जिलों में येलो अलर्ट है।

मणिपुर में अचानक आई बाढ़ की स्थिति में सुधार हो रहा है। अधिकांश निचले क्षेत्रों में जलभराव का स्‍तर कम हो रहा है। ख्वायरमबंद इम्‍फाल बाजार के कुछ हिस्‍सें अभी भी बाढ़ से प्रभावित हैं। पिछले तीन दिनों से यहां की सभी दुकानें और व्‍यावसायिक प्रतिष्‍ठान बंद पड़े हैं।

ताइवान एथलेटिक्स ओपन में भारत के डीपी मनु ने स्‍वर्ण और नित्या रामराज ने रजत पदक जीता। पुरुषों की भाला फेंक में मनु ने 81 दशमलव 58 मीटर के अपने सर्वश्रेष्ठ प्रयास के साथ स्वर्ण पदक जीता। महिलाओं की 100 मीटर बाधा दौड़ में भारत की नित्या रामराज ने 13 दशमलव 23 सेकेंड के समय के साथ रजत पदक जीता।

लंदन में खेली जा रही एफआईएच प्रो हॉकी लीग में आज भारतीय पुरूष हॉकी टीम ने जर्मनी को 3-0 से हरा दिया। आज ही एफआईएच प्रो हॉकी लीग महिला वर्ग में जमर्नी और भारत के बीच मैच खेला जाएगा।

"""

In [None]:
# Testing Dataset
# English Video, Hindi Video
testing_dataset = {}

testing_dataset['vids'] = ["https://www.youtube.com/watch?v=ysLiABvVos8", "https://www.youtube.com/watch?v=wDz9BNWXfn8"]
testing_dataset['languages'] = ['english', 'hindi']
testing_dataset['gold_transcripts'] = [g_transcript_eng, g_transcript_hin]
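
Because `testing_dataset` stores three parallel lists, the i-th entry of each list describes the same sample. A minimal sketch of walking the samples with `zip` follows; `iter_samples` is a hypothetical helper, and the URLs and texts in `demo_dataset` are placeholders standing in for the real entries.

```python
def iter_samples(dataset):
    """Yield one (video_url, language, gold_transcript) tuple per sample."""
    keys = ("vids", "languages", "gold_transcripts")
    lengths = {len(dataset[key]) for key in keys}
    if len(lengths) != 1:
        raise ValueError("parallel lists in the dataset have different lengths")
    yield from zip(*(dataset[key] for key in keys))


# Placeholder data with the same layout as testing_dataset.
demo_dataset = {
    "vids": ["https://example.com/a", "https://example.com/b"],
    "languages": ["english", "hindi"],
    "gold_transcripts": ["gold text a", "gold text b"],
}

for url, language, gold in iter_samples(demo_dataset):
    print(language, url, len(gold.split()))
```

The length check guards against adding a video without its language or gold transcript, which `zip` alone would silently truncate.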

#### Normalizer Code from OpenAI Whisper
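
The word map in the next cell maps spelling variants (mostly British forms) to one canonical spelling, so that a variant such as "analyse" is not counted as an error against "analyze" in the gold text. A minimal sketch of applying such a map is shown here, assuming lowercase, whitespace-separated words; `normalize_spelling` and the three-entry `demo_map` are illustrative only, and Whisper's own normalizer performs further steps beyond this lookup.

```python
def normalize_spelling(text, word_map):
    """Replace each whitespace-separated word that has an entry in word_map."""
    return " ".join(word_map.get(word, word) for word in text.split())


# Three entries copied from the full map, for illustration.
demo_map = {"analyse": "analyze", "aluminium": "aluminum", "ageing": "aging"}

print(normalize_spelling("we analyse ageing aluminium parts", demo_map))
```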

In [None]:
#@markdown #### **English word maps**
#@markdown * Used by the normalizer code below
#@markdown * Taken from https://github.com/openai/whisper/blob/main/whisper/normalizers/english.json
#@markdown * `dict` - english_word_maps
#@markdown ---

english_word_maps = {
    "accessorise": "accessorize",
    "accessorised": "accessorized",
    "accessorises": "accessorizes",
    "accessorising": "accessorizing",
    "acclimatisation": "acclimatization",
    "acclimatise": "acclimatize",
    "acclimatised": "acclimatized",
    "acclimatises": "acclimatizes",
    "acclimatising": "acclimatizing",
    "accoutrements": "accouterments",
    "aeon": "eon",
    "aeons": "eons",
    "aerogramme": "aerogram",
    "aerogrammes": "aerograms",
    "aeroplane": "airplane",
    "aeroplanes": "airplanes",
    "aesthete": "esthete",
    "aesthetes": "esthetes",
    "aesthetic": "esthetic",
    "aesthetically": "esthetically",
    "aesthetics": "esthetics",
    "aetiology": "etiology",
    "ageing": "aging",
    "aggrandisement": "aggrandizement",
    "agonise": "agonize",
    "agonised": "agonized",
    "agonises": "agonizes",
    "agonising": "agonizing",
    "agonisingly": "agonizingly",
    "almanack": "almanac",
    "almanacks": "almanacs",
    "aluminium": "aluminum",
    "amortisable": "amortizable",
    "amortisation": "amortization",
    "amortisations": "amortizations",
    "amortise": "amortize",
    "amortised": "amortized",
    "amortises": "amortizes",
    "amortising": "amortizing",
    "amphitheatre": "amphitheater",
    "amphitheatres": "amphitheaters",
    "anaemia": "anemia",
    "anaemic": "anemic",
    "anaesthesia": "anesthesia",
    "anaesthetic": "anesthetic",
    "anaesthetics": "anesthetics",
    "anaesthetise": "anesthetize",
    "anaesthetised": "anesthetized",
    "anaesthetises": "anesthetizes",
    "anaesthetising": "anesthetizing",
    "anaesthetist": "anesthetist",
    "anaesthetists": "anesthetists",
    "anaesthetize": "anesthetize",
    "anaesthetized": "anesthetized",
    "anaesthetizes": "anesthetizes",
    "anaesthetizing": "anesthetizing",
    "analogue": "analog",
    "analogues": "analogs",
    "analyse": "analyze",
    "analysed": "analyzed",
    "analyses": "analyzes",
    "analysing": "analyzing",
    "anglicise": "anglicize",
    "anglicised": "anglicized",
    "anglicises": "anglicizes",
    "anglicising": "anglicizing",
    "annualised": "annualized",
    "antagonise": "antagonize",
    "antagonised": "antagonized",
    "antagonises": "antagonizes",
    "antagonising": "antagonizing",
    "apologise": "apologize",
    "apologised": "apologized",
    "apologises": "apologizes",
    "apologising": "apologizing",
    "appal": "appall",
    "appals": "appalls",
    "appetiser": "appetizer",
    "appetisers": "appetizers",
    "appetising": "appetizing",
    "appetisingly": "appetizingly",
    "arbour": "arbor",
    "arbours": "arbors",
    "archeological": "archaeological",
    "archaeologically": "archeologically",
    "archaeologist": "archeologist",
    "archaeologists": "archeologists",
    "archaeology": "archeology",
    "ardour": "ardor",
    "armour": "armor",
    "armoured": "armored",
    "armourer": "armorer",
    "armourers": "armorers",
    "armouries": "armories",
    "armoury": "armory",
    "artefact": "artifact",
    "artefacts": "artifacts",
    "authorise": "authorize",
    "authorised": "authorized",
    "authorises": "authorizes",
    "authorising": "authorizing",
    "axe": "ax",
    "backpedalled": "backpedaled",
    "backpedalling": "backpedaling",
    "bannister": "banister",
    "bannisters": "banisters",
    "baptise": "baptize",
    "baptised": "baptized",
    "baptises": "baptizes",
    "baptising": "baptizing",
    "bastardise": "bastardize",
    "bastardised": "bastardized",
    "bastardises": "bastardizes",
    "bastardising": "bastardizing",
    "battleax": "battleaxe",
    "baulk": "balk",
    "baulked": "balked",
    "baulking": "balking",
    "baulks": "balks",
    "bedevilled": "bedeviled",
    "bedevilling": "bedeviling",
    "behaviour": "behavior",
    "behavioural": "behavioral",
    "behaviourism": "behaviorism",
    "behaviourist": "behaviorist",
    "behaviourists": "behaviorists",
    "behaviours": "behaviors",
    "behove": "behoove",
    "behoved": "behooved",
    "behoves": "behooves",
    "bejewelled": "bejeweled",
    "belabour": "belabor",
    "belaboured": "belabored",
    "belabouring": "belaboring",
    "belabours": "belabors",
    "bevelled": "beveled",
    "bevvies": "bevies",
    "bevvy": "bevy",
    "biassed": "biased",
    "biassing": "biasing",
    "bingeing": "binging",
    "bougainvillaea": "bougainvillea",
    "bougainvillaeas": "bougainvilleas",
    "bowdlerise": "bowdlerize",
    "bowdlerised": "bowdlerized",
    "bowdlerises": "bowdlerizes",
    "bowdlerising": "bowdlerizing",
    "breathalyse": "breathalyze",
    "breathalysed": "breathalyzed",
    "breathalyser": "breathalyzer",
    "breathalysers": "breathalyzers",
    "breathalyses": "breathalyzes",
    "breathalysing": "breathalyzing",
    "brutalise": "brutalize",
    "brutalised": "brutalized",
    "brutalises": "brutalizes",
    "brutalising": "brutalizing",
    "busses": "buses",
    "bussing": "busing",
    "caesarean": "cesarean",
    "caesareans": "cesareans",
    "calibre": "caliber",
    "calibres": "calibers",
    "calliper": "caliper",
    "callipers": "calipers",
    "callisthenics": "calisthenics",
    "canalise": "canalize",
    "canalised": "canalized",
    "canalises": "canalizes",
    "canalising": "canalizing",
    "cancelation": "cancellation",
    "cancelations": "cancellations",
    "cancelled": "canceled",
    "cancelling": "canceling",
    "candour": "candor",
    "cannibalise": "cannibalize",
    "cannibalised": "cannibalized",
    "cannibalises": "cannibalizes",
    "cannibalising": "cannibalizing",
    "canonise": "canonize",
    "canonised": "canonized",
    "canonises": "canonizes",
    "canonising": "canonizing",
    "capitalise": "capitalize",
    "capitalised": "capitalized",
    "capitalises": "capitalizes",
    "capitalising": "capitalizing",
    "caramelise": "caramelize",
    "caramelised": "caramelized",
    "caramelises": "caramelizes",
    "caramelising": "caramelizing",
    "carbonise": "carbonize",
    "carbonised": "carbonized",
    "carbonises": "carbonizes",
    "carbonising": "carbonizing",
    "carolled": "caroled",
    "carolling": "caroling",
    "catalogue": "catalog",
    "catalogued": "cataloged",
    "catalogues": "catalogs",
    "cataloguing": "cataloging",
    "catalyse": "catalyze",
    "catalysed": "catalyzed",
    "catalyses": "catalyzes",
    "catalysing": "catalyzing",
    "categorise": "categorize",
    "categorised": "categorized",
    "categorises": "categorizes",
    "categorising": "categorizing",
    "cauterise": "cauterize",
    "cauterised": "cauterized",
    "cauterises": "cauterizes",
    "cauterising": "cauterizing",
    "cavilled": "caviled",
    "cavilling": "caviling",
    "centigramme": "centigram",
    "centigrammes": "centigrams",
    "centilitre": "centiliter",
    "centilitres": "centiliters",
    "centimetre": "centimeter",
    "centimetres": "centimeters",
    "centralise": "centralize",
    "centralised": "centralized",
    "centralises": "centralizes",
    "centralising": "centralizing",
    "centre": "center",
    "centred": "centered",
    "centrefold": "centerfold",
    "centrefolds": "centerfolds",
    "centrepiece": "centerpiece",
    "centrepieces": "centerpieces",
    "centres": "centers",
    "channelled": "channeled",
    "channelling": "channeling",
    "characterise": "characterize",
    "characterised": "characterized",
    "characterises": "characterizes",
    "characterising": "characterizing",
    "cheque": "check",
    "chequebook": "checkbook",
    "chequebooks": "checkbooks",
    "chequered": "checkered",
    "cheques": "checks",
    "chilli": "chili",
    "chimaera": "chimera",
    "chimaeras": "chimeras",
    "chiselled": "chiseled",
    "chiselling": "chiseling",
    "circularise": "circularize",
    "circularised": "circularized",
    "circularises": "circularizes",
    "circularising": "circularizing",
    "civilise": "civilize",
    "civilised": "civilized",
    "civilises": "civilizes",
    "civilising": "civilizing",
    "clamour": "clamor",
    "clamoured": "clamored",
    "clamouring": "clamoring",
    "clamours": "clamors",
    "clangour": "clangor",
    "clarinettist": "clarinetist",
    "clarinettists": "clarinetists",
    "collectivise": "collectivize",
    "collectivised": "collectivized",
    "collectivises": "collectivizes",
    "collectivising": "collectivizing",
    "colonisation": "colonization",
    "colonise": "colonize",
    "colonised": "colonized",
    "coloniser": "colonizer",
    "colonisers": "colonizers",
    "colonises": "colonizes",
    "colonising": "colonizing",
    "colour": "color",
    "colourant": "colorant",
    "colourants": "colorants",
    "coloured": "colored",
    "coloureds": "coloreds",
    "colourful": "colorful",
    "colourfully": "colorfully",
    "colouring": "coloring",
    "colourize": "colorize",
    "colourized": "colorized",
    "colourizes": "colorizes",
    "colourizing": "colorizing",
    "colourless": "colorless",
    "colours": "colors",
    "commercialise": "commercialize",
    "commercialised": "commercialized",
    "commercialises": "commercializes",
    "commercialising": "commercializing",
    "compartmentalise": "compartmentalize",
    "compartmentalised": "compartmentalized",
    "compartmentalises": "compartmentalizes",
    "compartmentalising": "compartmentalizing",
    "computerise": "computerize",
    "computerised": "computerized",
    "computerises": "computerizes",
    "computerising": "computerizing",
    "conceptualise": "conceptualize",
    "conceptualised": "conceptualized",
    "conceptualises": "conceptualizes",
    "conceptualising": "conceptualizing",
    "connexion": "connection",
    "connexions": "connections",
    "contextualise": "contextualize",
    "contextualised": "contextualized",
    "contextualises": "contextualizes",
    "contextualising": "contextualizing",
    "cosier": "cozier",
    "cosies": "cozies",
    "cosiest": "coziest",
    "cosily": "cozily",
    "cosiness": "coziness",
    "cosy": "cozy",
    "councillor": "councilor",
    "councillors": "councilors",
    "counselled": "counseled",
    "counselling": "counseling",
    "counsellor": "counselor",
    "counsellors": "counselors",
    "crenelated": "crenellated",
    "criminalise": "criminalize",
    "criminalised": "criminalized",
    "criminalises": "criminalizes",
    "criminalising": "criminalizing",
    "criticise": "criticize",
    "criticised": "criticized",
    "criticises": "criticizes",
    "criticising": "criticizing",
    "crueller": "crueler",
    "cruellest": "cruelest",
    "crystallisation": "crystallization",
    "crystallise": "crystallize",
    "crystallised": "crystallized",
    "crystallises": "crystallizes",
    "crystallising": "crystallizing",
    "cudgelled": "cudgeled",
    "cudgelling": "cudgeling",
    "customise": "customize",
    "customised": "customized",
    "customises": "customizes",
    "customising": "customizing",
    "cypher": "cipher",
    "cyphers": "ciphers",
    "decentralisation": "decentralization",
    "decentralise": "decentralize",
    "decentralised": "decentralized",
    "decentralises": "decentralizes",
    "decentralising": "decentralizing",
    "decriminalisation": "decriminalization",
    "decriminalise": "decriminalize",
    "decriminalised": "decriminalized",
    "decriminalises": "decriminalizes",
    "decriminalising": "decriminalizing",
    "defence": "defense",
    "defenceless": "defenseless",
    "defences": "defenses",
    "dehumanisation": "dehumanization",
    "dehumanise": "dehumanize",
    "dehumanised": "dehumanized",
    "dehumanises": "dehumanizes",
    "dehumanising": "dehumanizing",
    "demeanour": "demeanor",
    "demilitarisation": "demilitarization",
    "demilitarise": "demilitarize",
    "demilitarised": "demilitarized",
    "demilitarises": "demilitarizes",
    "demilitarising": "demilitarizing",
    "demobilisation": "demobilization",
    "demobilise": "demobilize",
    "demobilised": "demobilized",
    "demobilises": "demobilizes",
    "demobilising": "demobilizing",
    "democratisation": "democratization",
    "democratise": "democratize",
    "democratised": "democratized",
    "democratises": "democratizes",
    "democratising": "democratizing",
    "demonise": "demonize",
    "demonised": "demonized",
    "demonises": "demonizes",
    "demonising": "demonizing",
    "demoralisation": "demoralization",
    "demoralise": "demoralize",
    "demoralised": "demoralized",
    "demoralises": "demoralizes",
    "demoralising": "demoralizing",
    "denationalisation": "denationalization",
    "denationalise": "denationalize",
    "denationalised": "denationalized",
    "denationalises": "denationalizes",
    "denationalising": "denationalizing",
    "deodorise": "deodorize",
    "deodorised": "deodorized",
    "deodorises": "deodorizes",
    "deodorising": "deodorizing",
    "depersonalise": "depersonalize",
    "depersonalised": "depersonalized",
    "depersonalises": "depersonalizes",
    "depersonalising": "depersonalizing",
    "deputise": "deputize",
    "deputised": "deputized",
    "deputises": "deputizes",
    "deputising": "deputizing",
    "desensitisation": "desensitization",
    "desensitise": "desensitize",
    "desensitised": "desensitized",
    "desensitises": "desensitizes",
    "desensitising": "desensitizing",
    "destabilisation": "destabilization",
    "destabilise": "destabilize",
    "destabilised": "destabilized",
    "destabilises": "destabilizes",
    "destabilising": "destabilizing",
    "dialled": "dialed",
    "dialling": "dialing",
    "dialogue": "dialog",
    "dialogues": "dialogs",
    "diarrhoea": "diarrhea",
    "digitise": "digitize",
    "digitised": "digitized",
    "digitises": "digitizes",
    "digitising": "digitizing",
    "disc": "disk",
    "discolour": "discolor",
    "discoloured": "discolored",
    "discolouring": "discoloring",
    "discolours": "discolors",
    "discs": "disks",
    "disembowelled": "disemboweled",
    "disembowelling": "disemboweling",
    "disfavour": "disfavor",
    "dishevelled": "disheveled",
    "dishonour": "dishonor",
    "dishonourable": "dishonorable",
    "dishonourably": "dishonorably",
    "dishonoured": "dishonored",
    "dishonouring": "dishonoring",
    "dishonours": "dishonors",
    "disorganisation": "disorganization",
    "disorganised": "disorganized",
    "distil": "distill",
    "distils": "distills",
    "dramatisation": "dramatization",
    "dramatisations": "dramatizations",
    "dramatise": "dramatize",
    "dramatised": "dramatized",
    "dramatises": "dramatizes",
    "dramatising": "dramatizing",
    "draught": "draft",
    "draughtboard": "draftboard",
    "draughtboards": "draftboards",
    "draughtier": "draftier",
    "draughtiest": "draftiest",
    "draughts": "drafts",
    "draughtsman": "draftsman",
    "draughtsmanship": "draftsmanship",
    "draughtsmen": "draftsmen",
    "draughtswoman": "draftswoman",
    "draughtswomen": "draftswomen",
    "draughty": "drafty",
    "drivelled": "driveled",
    "drivelling": "driveling",
    "duelled": "dueled",
    "duelling": "dueling",
    "economise": "economize",
    "economised": "economized",
    "economises": "economizes",
    "economising": "economizing",
    "edoema": "edema",
    "editorialise": "editorialize",
    "editorialised": "editorialized",
    "editorialises": "editorializes",
    "editorialising": "editorializing",
    "empathise": "empathize",
    "empathised": "empathized",
    "empathises": "empathizes",
    "empathising": "empathizing",
    "emphasise": "emphasize",
    "emphasised": "emphasized",
    "emphasises": "emphasizes",
    "emphasising": "emphasizing",
    "enamelled": "enameled",
    "enamelling": "enameling",
    "enamoured": "enamored",
    "encyclopaedia": "encyclopedia",
    "encyclopaedias": "encyclopedias",
    "encyclopaedic": "encyclopedic",
    "endeavour": "endeavor",
    "endeavoured": "endeavored",
    "endeavouring": "endeavoring",
    "endeavours": "endeavors",
    "energise": "energize",
    "energised": "energized",
    "energises": "energizes",
    "energising": "energizing",
    "enrol": "enroll",
    "enrols": "enrolls",
    "enthral": "enthrall",
    "enthrals": "enthralls",
    "epaulette": "epaulet",
    "epaulettes": "epaulets",
    "epicentre": "epicenter",
    "epicentres": "epicenters",
    "epilogue": "epilog",
    "epilogues": "epilogs",
    "epitomise": "epitomize",
    "epitomised": "epitomized",
    "epitomises": "epitomizes",
    "epitomising": "epitomizing",
    "equalisation": "equalization",
    "equalise": "equalize",
    "equalised": "equalized",
    "equaliser": "equalizer",
    "equalisers": "equalizers",
    "equalises": "equalizes",
    "equalising": "equalizing",
    "eulogise": "eulogize",
    "eulogised": "eulogized",
    "eulogises": "eulogizes",
    "eulogising": "eulogizing",
    "evangelise": "evangelize",
    "evangelised": "evangelized",
    "evangelises": "evangelizes",
    "evangelising": "evangelizing",
    "exorcise": "exorcize",
    "exorcised": "exorcized",
    "exorcises": "exorcizes",
    "exorcising": "exorcizing",
    "extemporisation": "extemporization",
    "extemporise": "extemporize",
    "extemporised": "extemporized",
    "extemporises": "extemporizes",
    "extemporising": "extemporizing",
    "externalisation": "externalization",
    "externalisations": "externalizations",
    "externalise": "externalize",
    "externalised": "externalized",
    "externalises": "externalizes",
    "externalising": "externalizing",
    "factorise": "factorize",
    "factorised": "factorized",
    "factorises": "factorizes",
    "factorising": "factorizing",
    "faecal": "fecal",
    "faeces": "feces",
    "familiarisation": "familiarization",
    "familiarise": "familiarize",
    "familiarised": "familiarized",
    "familiarises": "familiarizes",
    "familiarising": "familiarizing",
    "fantasise": "fantasize",
    "fantasised": "fantasized",
    "fantasises": "fantasizes",
    "fantasising": "fantasizing",
    "favour": "favor",
    "favourable": "favorable",
    "favourably": "favorably",
    "favoured": "favored",
    "favouring": "favoring",
    "favourite": "favorite",
    "favourites": "favorites",
    "favouritism": "favoritism",
    "favours": "favors",
    "feminise": "feminize",
    "feminised": "feminized",
    "feminises": "feminizes",
    "feminising": "feminizing",
    "fertilisation": "fertilization",
    "fertilise": "fertilize",
    "fertilised": "fertilized",
    "fertiliser": "fertilizer",
    "fertilisers": "fertilizers",
    "fertilises": "fertilizes",
    "fertilising": "fertilizing",
    "fervour": "fervor",
    "fibre": "fiber",
    "fibreglass": "fiberglass",
    "fibres": "fibers",
    "fictionalisation": "fictionalization",
    "fictionalisations": "fictionalizations",
    "fictionalise": "fictionalize",
    "fictionalised": "fictionalized",
    "fictionalises": "fictionalizes",
    "fictionalising": "fictionalizing",
    "fillet": "filet",
    "filleted": "fileted",
    "filleting": "fileting",
    "fillets": "filets",
    "finalisation": "finalization",
    "finalise": "finalize",
    "finalised": "finalized",
    "finalises": "finalizes",
    "finalising": "finalizing",
    "flautist": "flutist",
    "flautists": "flutists",
    "flavour": "flavor",
    "flavoured": "flavored",
    "flavouring": "flavoring",
    "flavourings": "flavorings",
    "flavourless": "flavorless",
    "flavours": "flavors",
    "flavoursome": "flavorsome",
    "flyer / flier": "flier / flyer",
    "foetal": "fetal",
    "foetid": "fetid",
    "foetus": "fetus",
    "foetuses": "fetuses",
    "formalisation": "formalization",
    "formalise": "formalize",
    "formalised": "formalized",
    "formalises": "formalizes",
    "formalising": "formalizing",
    "fossilisation": "fossilization",
    "fossilise": "fossilize",
    "fossilised": "fossilized",
    "fossilises": "fossilizes",
    "fossilising": "fossilizing",
    "fraternisation": "fraternization",
    "fraternise": "fraternize",
    "fraternised": "fraternized",
    "fraternises": "fraternizes",
    "fraternising": "fraternizing",
    "fulfil": "fulfill",
    "fulfilment": "fulfillment",
    "fulfils": "fulfills",
    "funnelled": "funneled",
    "funnelling": "funneling",
    "galvanise": "galvanize",
    "galvanised": "galvanized",
    "galvanises": "galvanizes",
    "galvanising": "galvanizing",
    "gambolled": "gamboled",
    "gambolling": "gamboling",
    "gaol": "jail",
    "gaolbird": "jailbird",
    "gaolbirds": "jailbirds",
    "gaolbreak": "jailbreak",
    "gaolbreaks": "jailbreaks",
    "gaoled": "jailed",
    "gaoler": "jailer",
    "gaolers": "jailers",
    "gaoling": "jailing",
    "gaols": "jails",
    "gasses": "gases",
    "gage": "gauge",
    "gaged": "gauged",
    "gages": "gauges",
    "gaging": "gauging",
    "generalisation": "generalization",
    "generalisations": "generalizations",
    "generalise": "generalize",
    "generalised": "generalized",
    "generalises": "generalizes",
    "generalising": "generalizing",
    "ghettoise": "ghettoize",
    "ghettoised": "ghettoized",
    "ghettoises": "ghettoizes",
    "ghettoising": "ghettoizing",
    "gipsies": "gypsies",
    "glamorise": "glamorize",
    "glamorised": "glamorized",
    "glamorises": "glamorizes",
    "glamorising": "glamorizing",
    "glamor": "glamour",
    "globalisation": "globalization",
    "globalise": "globalize",
    "globalised": "globalized",
    "globalises": "globalizes",
    "globalising": "globalizing",
    "glueing": "gluing",
    "goitre": "goiter",
    "goitres": "goiters",
    "gonorrhoea": "gonorrhea",
    "gramme": "gram",
    "grammes": "grams",
    "gravelled": "graveled",
    "grey": "gray",
    "greyed": "grayed",
    "greying": "graying",
    "greyish": "grayish",
    "greyness": "grayness",
    "greys": "grays",
    "grovelled": "groveled",
    "grovelling": "groveling",
    "groyne": "groin",
    "groynes": "groins",
    "gruelling": "grueling",
    "gruellingly": "gruelingly",
    "gryphon": "griffin",
    "gryphons": "griffins",
    "gynaecological": "gynecological",
    "gynaecologist": "gynecologist",
    "gynaecologists": "gynecologists",
    "gynaecology": "gynecology",
    "haematological": "hematological",
    "haematologist": "hematologist",
    "haematologists": "hematologists",
    "haematology": "hematology",
    "haemoglobin": "hemoglobin",
    "haemophilia": "hemophilia",
    "haemophiliac": "hemophiliac",
    "haemophiliacs": "hemophiliacs",
    "haemorrhage": "hemorrhage",
    "haemorrhaged": "hemorrhaged",
    "haemorrhages": "hemorrhages",
    "haemorrhaging": "hemorrhaging",
    "haemorrhoids": "hemorrhoids",
    "harbour": "harbor",
    "harboured": "harbored",
    "harbouring": "harboring",
    "harbours": "harbors",
    "harmonisation": "harmonization",
    "harmonise": "harmonize",
    "harmonised": "harmonized",
    "harmonises": "harmonizes",
    "harmonising": "harmonizing",
    "homoeopath": "homeopath",
    "homoeopathic": "homeopathic",
    "homoeopaths": "homeopaths",
    "homoeopathy": "homeopathy",
    "homogenise": "homogenize",
    "homogenised": "homogenized",
    "homogenises": "homogenizes",
    "homogenising": "homogenizing",
    "honour": "honor",
    "honourable": "honorable",
    "honourably": "honorably",
    "honoured": "honored",
    "honouring": "honoring",
    "honours": "honors",
    "hospitalisation": "hospitalization",
    "hospitalise": "hospitalize",
    "hospitalised": "hospitalized",
    "hospitalises": "hospitalizes",
    "hospitalising": "hospitalizing",
    "humanise": "humanize",
    "humanised": "humanized",
    "humanises": "humanizes",
    "humanising": "humanizing",
    "humour": "humor",
    "humoured": "humored",
    "humouring": "humoring",
    "humourless": "humorless",
    "humours": "humors",
    "hybridise": "hybridize",
    "hybridised": "hybridized",
    "hybridises": "hybridizes",
    "hybridising": "hybridizing",
    "hypnotise": "hypnotize",
    "hypnotised": "hypnotized",
    "hypnotises": "hypnotizes",
    "hypnotising": "hypnotizing",
    "hypothesise": "hypothesize",
    "hypothesised": "hypothesized",
    "hypothesises": "hypothesizes",
    "hypothesising": "hypothesizing",
    "idealisation": "idealization",
    "idealise": "idealize",
    "idealised": "idealized",
    "idealises": "idealizes",
    "idealising": "idealizing",
    "idolise": "idolize",
    "idolised": "idolized",
    "idolises": "idolizes",
    "idolising": "idolizing",
    "immobilisation": "immobilization",
    "immobilise": "immobilize",
    "immobilised": "immobilized",
    "immobiliser": "immobilizer",
    "immobilisers": "immobilizers",
    "immobilises": "immobilizes",
    "immobilising": "immobilizing",
    "immortalise": "immortalize",
    "immortalised": "immortalized",
    "immortalises": "immortalizes",
    "immortalising": "immortalizing",
    "immunisation": "immunization",
    "immunise": "immunize",
    "immunised": "immunized",
    "immunises": "immunizes",
    "immunising": "immunizing",
    "impanelled": "impaneled",
    "impanelling": "impaneling",
    "imperilled": "imperiled",
    "imperilling": "imperiling",
    "individualise": "individualize",
    "individualised": "individualized",
    "individualises": "individualizes",
    "individualising": "individualizing",
    "industrialise": "industrialize",
    "industrialised": "industrialized",
    "industrialises": "industrializes",
    "industrialising": "industrializing",
    "inflexion": "inflection",
    "inflexions": "inflections",
    "initialise": "initialize",
    "initialised": "initialized",
    "initialises": "initializes",
    "initialising": "initializing",
    "initialled": "initialed",
    "initialling": "initialing",
    "instal": "install",
    "instalment": "installment",
    "instalments": "installments",
    "instals": "installs",
    "instil": "instill",
    "instils": "instills",
    "institutionalisation": "institutionalization",
    "institutionalise": "institutionalize",
    "institutionalised": "institutionalized",
    "institutionalises": "institutionalizes",
    "institutionalising": "institutionalizing",
    "intellectualise": "intellectualize",
    "intellectualised": "intellectualized",
    "intellectualises": "intellectualizes",
    "intellectualising": "intellectualizing",
    "internalisation": "internalization",
    "internalise": "internalize",
    "internalised": "internalized",
    "internalises": "internalizes",
    "internalising": "internalizing",
    "internationalisation": "internationalization",
    "internationalise": "internationalize",
    "internationalised": "internationalized",
    "internationalises": "internationalizes",
    "internationalising": "internationalizing",
    "ionisation": "ionization",
    "ionise": "ionize",
    "ionised": "ionized",
    "ioniser": "ionizer",
    "ionisers": "ionizers",
    "ionises": "ionizes",
    "ionising": "ionizing",
    "italicise": "italicize",
    "italicised": "italicized",
    "italicises": "italicizes",
    "italicising": "italicizing",
    "itemise": "itemize",
    "itemised": "itemized",
    "itemises": "itemizes",
    "itemising": "itemizing",
    "jeopardise": "jeopardize",
    "jeopardised": "jeopardized",
    "jeopardises": "jeopardizes",
    "jeopardising": "jeopardizing",
    "jewelled": "jeweled",
    "jeweller": "jeweler",
    "jewellers": "jewelers",
    "jewellery": "jewelry",
    "judgement": "judgment",
    "kilogramme": "kilogram",
    "kilogrammes": "kilograms",
    "kilometre": "kilometer",
    "kilometres": "kilometers",
    "labelled": "labeled",
    "labelling": "labeling",
    "labour": "labor",
    "laboured": "labored",
    "labourer": "laborer",
    "labourers": "laborers",
    "labouring": "laboring",
    "labours": "labors",
    "lacklustre": "lackluster",
    "legalisation": "legalization",
    "legalise": "legalize",
    "legalised": "legalized",
    "legalises": "legalizes",
    "legalising": "legalizing",
    "legitimise": "legitimize",
    "legitimised": "legitimized",
    "legitimises": "legitimizes",
    "legitimising": "legitimizing",
    "leukaemia": "leukemia",
    "levelled": "leveled",
    "leveller": "leveler",
    "levellers": "levelers",
    "levelling": "leveling",
    "libelled": "libeled",
    "libelling": "libeling",
    "libellous": "libelous",
    "liberalisation": "liberalization",
    "liberalise": "liberalize",
    "liberalised": "liberalized",
    "liberalises": "liberalizes",
    "liberalising": "liberalizing",
    "licence": "license",
    "licenced": "licensed",
    "licences": "licenses",
    "licencing": "licensing",
    "likeable": "likable",
    "lionisation": "lionization",
    "lionise": "lionize",
    "lionised": "lionized",
    "lionises": "lionizes",
    "lionising": "lionizing",
    "liquidise": "liquidize",
    "liquidised": "liquidized",
    "liquidiser": "liquidizer",
    "liquidisers": "liquidizers",
    "liquidises": "liquidizes",
    "liquidising": "liquidizing",
    "litre": "liter",
    "litres": "liters",
    "localise": "localize",
    "localised": "localized",
    "localises": "localizes",
    "localising": "localizing",
    "louvre": "louver",
    "louvred": "louvered",
    "louvres": "louvers",
    "lustre": "luster",
    "magnetise": "magnetize",
    "magnetised": "magnetized",
    "magnetises": "magnetizes",
    "magnetising": "magnetizing",
    "manoeuvrability": "maneuverability",
    "manoeuvrable": "maneuverable",
    "manoeuvre": "maneuver",
    "manoeuvred": "maneuvered",
    "manoeuvres": "maneuvers",
    "manoeuvring": "maneuvering",
    "manoeuvrings": "maneuverings",
    "marginalisation": "marginalization",
    "marginalise": "marginalize",
    "marginalised": "marginalized",
    "marginalises": "marginalizes",
    "marginalising": "marginalizing",
    "marshalled": "marshaled",
    "marshalling": "marshaling",
    "marvelled": "marveled",
    "marvelling": "marveling",
    "marvellous": "marvelous",
    "marvellously": "marvelously",
    "materialisation": "materialization",
    "materialise": "materialize",
    "materialised": "materialized",
    "materialises": "materializes",
    "materialising": "materializing",
    "maximisation": "maximization",
    "maximise": "maximize",
    "maximised": "maximized",
    "maximises": "maximizes",
    "maximising": "maximizing",
    "meagre": "meager",
    "mechanisation": "mechanization",
    "mechanise": "mechanize",
    "mechanised": "mechanized",
    "mechanises": "mechanizes",
    "mechanising": "mechanizing",
    "mediaeval": "medieval",
    "memorialise": "memorialize",
    "memorialised": "memorialized",
    "memorialises": "memorializes",
    "memorialising": "memorializing",
    "memorise": "memorize",
    "memorised": "memorized",
    "memorises": "memorizes",
    "memorising": "memorizing",
    "mesmerise": "mesmerize",
    "mesmerised": "mesmerized",
    "mesmerises": "mesmerizes",
    "mesmerising": "mesmerizing",
    "metabolise": "metabolize",
    "metabolised": "metabolized",
    "metabolises": "metabolizes",
    "metabolising": "metabolizing",
    "metre": "meter",
    "metres": "meters",
    "micrometre": "micrometer",
    "micrometres": "micrometers",
    "militarise": "militarize",
    "militarised": "militarized",
    "militarises": "militarizes",
    "militarising": "militarizing",
    "milligramme": "milligram",
    "milligrammes": "milligrams",
    "millilitre": "milliliter",
    "millilitres": "milliliters",
    "millimetre": "millimeter",
    "millimetres": "millimeters",
    "miniaturisation": "miniaturization",
    "miniaturise": "miniaturize",
    "miniaturised": "miniaturized",
    "miniaturises": "miniaturizes",
    "miniaturising": "miniaturizing",
    "minibusses": "minibuses",
    "minimise": "minimize",
    "minimised": "minimized",
    "minimises": "minimizes",
    "minimising": "minimizing",
    "misbehaviour": "misbehavior",
    "misdemeanour": "misdemeanor",
    "misdemeanours": "misdemeanors",
    "misspelt": "misspelled",
    "mitre": "miter",
    "mitres": "miters",
    "mobilisation": "mobilization",
    "mobilise": "mobilize",
    "mobilised": "mobilized",
    "mobilises": "mobilizes",
    "mobilising": "mobilizing",
    "modelled": "modeled",
    "modeller": "modeler",
    "modellers": "modelers",
    "modelling": "modeling",
    "modernise": "modernize",
    "modernised": "modernized",
    "modernises": "modernizes",
    "modernising": "modernizing",
    "moisturise": "moisturize",
    "moisturised": "moisturized",
    "moisturiser": "moisturizer",
    "moisturisers": "moisturizers",
    "moisturises": "moisturizes",
    "moisturising": "moisturizing",
    "monologue": "monolog",
    "monologues": "monologs",
    "monopolisation": "monopolization",
    "monopolise": "monopolize",
    "monopolised": "monopolized",
    "monopolises": "monopolizes",
    "monopolising": "monopolizing",
    "moralise": "moralize",
    "moralised": "moralized",
    "moralises": "moralizes",
    "moralising": "moralizing",
    "motorised": "motorized",
    "mould": "mold",
    "moulded": "molded",
    "moulder": "molder",
    "mouldered": "moldered",
    "mouldering": "moldering",
    "moulders": "molders",
    "mouldier": "moldier",
    "mouldiest": "moldiest",
    "moulding": "molding",
    "mouldings": "moldings",
    "moulds": "molds",
    "mouldy": "moldy",
    "moult": "molt",
    "moulted": "molted",
    "moulting": "molting",
    "moults": "molts",
    "moustache": "mustache",
    "moustached": "mustached",
    "moustaches": "mustaches",
    "moustachioed": "mustachioed",
    "multicoloured": "multicolored",
    "nationalisation": "nationalization",
    "nationalisations": "nationalizations",
    "nationalise": "nationalize",
    "nationalised": "nationalized",
    "nationalises": "nationalizes",
    "nationalising": "nationalizing",
    "naturalisation": "naturalization",
    "naturalise": "naturalize",
    "naturalised": "naturalized",
    "naturalises": "naturalizes",
    "naturalising": "naturalizing",
    "neighbour": "neighbor",
    "neighbourhood": "neighborhood",
    "neighbourhoods": "neighborhoods",
    "neighbouring": "neighboring",
    "neighbourliness": "neighborliness",
    "neighbourly": "neighborly",
    "neighbours": "neighbors",
    "neutralisation": "neutralization",
    "neutralise": "neutralize",
    "neutralised": "neutralized",
    "neutralises": "neutralizes",
    "neutralising": "neutralizing",
    "normalisation": "normalization",
    "normalise": "normalize",
    "normalised": "normalized",
    "normalises": "normalizes",
    "normalising": "normalizing",
    "odour": "odor",
    "odourless": "odorless",
    "odours": "odors",
    "oesophagus": "esophagus",
    "oesophaguses": "esophaguses",
    "oestrogen": "estrogen",
    "offence": "offense",
    "offences": "offenses",
    "omelette": "omelet",
    "omelettes": "omelets",
    "optimise": "optimize",
    "optimised": "optimized",
    "optimises": "optimizes",
    "optimising": "optimizing",
    "organisation": "organization",
    "organisational": "organizational",
    "organisations": "organizations",
    "organise": "organize",
    "organised": "organized",
    "organiser": "organizer",
    "organisers": "organizers",
    "organises": "organizes",
    "organising": "organizing",
    "orthopaedic": "orthopedic",
    "orthopaedics": "orthopedics",
    "ostracise": "ostracize",
    "ostracised": "ostracized",
    "ostracises": "ostracizes",
    "ostracising": "ostracizing",
    "outmanoeuvre": "outmaneuver",
    "outmanoeuvred": "outmaneuvered",
    "outmanoeuvres": "outmaneuvers",
    "outmanoeuvring": "outmaneuvering",
    "overemphasise": "overemphasize",
    "overemphasised": "overemphasized",
    "overemphasises": "overemphasizes",
    "overemphasising": "overemphasizing",
    "oxidisation": "oxidization",
    "oxidise": "oxidize",
    "oxidised": "oxidized",
    "oxidises": "oxidizes",
    "oxidising": "oxidizing",
    "paederast": "pederast",
    "paederasts": "pederasts",
    "paediatric": "pediatric",
    "paediatrician": "pediatrician",
    "paediatricians": "pediatricians",
    "paediatrics": "pediatrics",
    "paedophile": "pedophile",
    "paedophiles": "pedophiles",
    "paedophilia": "pedophilia",
    "palaeolithic": "paleolithic",
    "palaeontologist": "paleontologist",
    "palaeontologists": "paleontologists",
    "palaeontology": "paleontology",
    "panelled": "paneled",
    "panelling": "paneling",
    "panellist": "panelist",
    "panellists": "panelists",
    "paralyse": "paralyze",
    "paralysed": "paralyzed",
    "paralyses": "paralyzes",
    "paralysing": "paralyzing",
    "parcelled": "parceled",
    "parcelling": "parceling",
    "parlour": "parlor",
    "parlours": "parlors",
    "particularise": "particularize",
    "particularised": "particularized",
    "particularises": "particularizes",
    "particularising": "particularizing",
    "passivisation": "passivization",
    "passivise": "passivize",
    "passivised": "passivized",
    "passivises": "passivizes",
    "passivising": "passivizing",
    "pasteurisation": "pasteurization",
    "pasteurise": "pasteurize",
    "pasteurised": "pasteurized",
    "pasteurises": "pasteurizes",
    "pasteurising": "pasteurizing",
    "patronise": "patronize",
    "patronised": "patronized",
    "patronises": "patronizes",
    "patronising": "patronizing",
    "patronisingly": "patronizingly",
    "pedalled": "pedaled",
    "pedalling": "pedaling",
    "pedestrianisation": "pedestrianization",
    "pedestrianise": "pedestrianize",
    "pedestrianised": "pedestrianized",
    "pedestrianises": "pedestrianizes",
    "pedestrianising": "pedestrianizing",
    "penalise": "penalize",
    "penalised": "penalized",
    "penalises": "penalizes",
    "penalising": "penalizing",
    "pencilled": "penciled",
    "pencilling": "penciling",
    "personalise": "personalize",
    "personalised": "personalized",
    "personalises": "personalizes",
    "personalising": "personalizing",
    "pharmacopoeia": "pharmacopeia",
    "pharmacopoeias": "pharmacopeias",
    "philosophise": "philosophize",
    "philosophised": "philosophized",
    "philosophises": "philosophizes",
    "philosophising": "philosophizing",
    "philtre": "philter",
    "philtres": "philters",
    "phoney": "phony",
    "plagiarise": "plagiarize",
    "plagiarised": "plagiarized",
    "plagiarises": "plagiarizes",
    "plagiarising": "plagiarizing",
    "plough": "plow",
    "ploughed": "plowed",
    "ploughing": "plowing",
    "ploughman": "plowman",
    "ploughmen": "plowmen",
    "ploughs": "plows",
    "ploughshare": "plowshare",
    "ploughshares": "plowshares",
    "polarisation": "polarization",
    "polarise": "polarize",
    "polarised": "polarized",
    "polarises": "polarizes",
    "polarising": "polarizing",
    "politicisation": "politicization",
    "politicise": "politicize",
    "politicised": "politicized",
    "politicises": "politicizes",
    "politicising": "politicizing",
    "popularisation": "popularization",
    "popularise": "popularize",
    "popularised": "popularized",
    "popularises": "popularizes",
    "popularising": "popularizing",
    "pouffe": "pouf",
    "pouffes": "poufs",
    "practise": "practice",
    "practised": "practiced",
    "practises": "practices",
    "practising": "practicing",
    "praesidium": "presidium",
    "praesidiums": "presidiums",
    "pressurisation": "pressurization",
    "pressurise": "pressurize",
    "pressurised": "pressurized",
    "pressurises": "pressurizes",
    "pressurising": "pressurizing",
    "pretence": "pretense",
    "pretences": "pretenses",
    "primaeval": "primeval",
    "prioritisation": "prioritization",
    "prioritise": "prioritize",
    "prioritised": "prioritized",
    "prioritises": "prioritizes",
    "prioritising": "prioritizing",
    "privatisation": "privatization",
    "privatisations": "privatizations",
    "privatise": "privatize",
    "privatised": "privatized",
    "privatises": "privatizes",
    "privatising": "privatizing",
    "professionalisation": "professionalization",
    "professionalise": "professionalize",
    "professionalised": "professionalized",
    "professionalises": "professionalizes",
    "professionalising": "professionalizing",
    "programme": "program",
    "programmes": "programs",
    "prologue": "prolog",
    "prologues": "prologs",
    "propagandise": "propagandize",
    "propagandised": "propagandized",
    "propagandises": "propagandizes",
    "propagandising": "propagandizing",
    "proselytise": "proselytize",
    "proselytised": "proselytized",
    "proselytiser": "proselytizer",
    "proselytisers": "proselytizers",
    "proselytises": "proselytizes",
    "proselytising": "proselytizing",
    "psychoanalyse": "psychoanalyze",
    "psychoanalysed": "psychoanalyzed",
    "psychoanalyses": "psychoanalyzes",
    "psychoanalysing": "psychoanalyzing",
    "publicise": "publicize",
    "publicised": "publicized",
    "publicises": "publicizes",
    "publicising": "publicizing",
    "pulverisation": "pulverization",
    "pulverise": "pulverize",
    "pulverised": "pulverized",
    "pulverises": "pulverizes",
    "pulverising": "pulverizing",
    "pummelled": "pummeled",
    "pummelling": "pummeling",
    "pyjama": "pajama",
    "pyjamas": "pajamas",
    "pzazz": "pizzazz",
    "quarrelled": "quarreled",
    "quarrelling": "quarreling",
    "radicalise": "radicalize",
    "radicalised": "radicalized",
    "radicalises": "radicalizes",
    "radicalising": "radicalizing",
    "rancour": "rancor",
    "randomise": "randomize",
    "randomised": "randomized",
    "randomises": "randomizes",
    "randomising": "randomizing",
    "rationalisation": "rationalization",
    "rationalisations": "rationalizations",
    "rationalise": "rationalize",
    "rationalised": "rationalized",
    "rationalises": "rationalizes",
    "rationalising": "rationalizing",
    "ravelled": "raveled",
    "ravelling": "raveling",
    "realisable": "realizable",
    "realisation": "realization",
    "realisations": "realizations",
    "realise": "realize",
    "realised": "realized",
    "realises": "realizes",
    "realising": "realizing",
    "recognisable": "recognizable",
    "recognisably": "recognizably",
    "recognisance": "recognizance",
    "recognise": "recognize",
    "recognised": "recognized",
    "recognises": "recognizes",
    "recognising": "recognizing",
    "reconnoitre": "reconnoiter",
    "reconnoitred": "reconnoitered",
    "reconnoitres": "reconnoiters",
    "reconnoitring": "reconnoitering",
    "refuelled": "refueled",
    "refuelling": "refueling",
    "regularisation": "regularization",
    "regularise": "regularize",
    "regularised": "regularized",
    "regularises": "regularizes",
    "regularising": "regularizing",
    "remodelled": "remodeled",
    "remodelling": "remodeling",
    "remould": "remold",
    "remoulded": "remolded",
    "remoulding": "remolding",
    "remoulds": "remolds",
    "reorganisation": "reorganization",
    "reorganisations": "reorganizations",
    "reorganise": "reorganize",
    "reorganised": "reorganized",
    "reorganises": "reorganizes",
    "reorganising": "reorganizing",
    "revelled": "reveled",
    "reveller": "reveler",
    "revellers": "revelers",
    "revelling": "reveling",
    "revitalise": "revitalize",
    "revitalised": "revitalized",
    "revitalises": "revitalizes",
    "revitalising": "revitalizing",
    "revolutionise": "revolutionize",
    "revolutionised": "revolutionized",
    "revolutionises": "revolutionizes",
    "revolutionising": "revolutionizing",
    "rhapsodise": "rhapsodize",
    "rhapsodised": "rhapsodized",
    "rhapsodises": "rhapsodizes",
    "rhapsodising": "rhapsodizing",
    "rigour": "rigor",
    "rigours": "rigors",
    "ritualised": "ritualized",
    "rivalled": "rivaled",
    "rivalling": "rivaling",
    "romanticise": "romanticize",
    "romanticised": "romanticized",
    "romanticises": "romanticizes",
    "romanticising": "romanticizing",
    "rumour": "rumor",
    "rumoured": "rumored",
    "rumours": "rumors",
    "sabre": "saber",
    "sabres": "sabers",
    "saltpetre": "saltpeter",
    "sanitise": "sanitize",
    "sanitised": "sanitized",
    "sanitises": "sanitizes",
    "sanitising": "sanitizing",
    "satirise": "satirize",
    "satirised": "satirized",
    "satirises": "satirizes",
    "satirising": "satirizing",
    "saviour": "savior",
    "saviours": "saviors",
    "savour": "savor",
    "savoured": "savored",
    "savouries": "savories",
    "savouring": "savoring",
    "savours": "savors",
    "savoury": "savory",
    "scandalise": "scandalize",
    "scandalised": "scandalized",
    "scandalises": "scandalizes",
    "scandalising": "scandalizing",
    "sceptic": "skeptic",
    "sceptical": "skeptical",
    "sceptically": "skeptically",
    "scepticism": "skepticism",
    "sceptics": "skeptics",
    "sceptre": "scepter",
    "sceptres": "scepters",
    "scrutinise": "scrutinize",
    "scrutinised": "scrutinized",
    "scrutinises": "scrutinizes",
    "scrutinising": "scrutinizing",
    "secularisation": "secularization",
    "secularise": "secularize",
    "secularised": "secularized",
    "secularises": "secularizes",
    "secularising": "secularizing",
    "sensationalise": "sensationalize",
    "sensationalised": "sensationalized",
    "sensationalises": "sensationalizes",
    "sensationalising": "sensationalizing",
    "sensitise": "sensitize",
    "sensitised": "sensitized",
    "sensitises": "sensitizes",
    "sensitising": "sensitizing",
    "sentimentalise": "sentimentalize",
    "sentimentalised": "sentimentalized",
    "sentimentalises": "sentimentalizes",
    "sentimentalising": "sentimentalizing",
    "sepulchre": "sepulcher",
    "sepulchres": "sepulchers",
    "serialisation": "serialization",
    "serialisations": "serializations",
    "serialise": "serialize",
    "serialised": "serialized",
    "serialises": "serializes",
    "serialising": "serializing",
    "sermonise": "sermonize",
    "sermonised": "sermonized",
    "sermonises": "sermonizes",
    "sermonising": "sermonizing",
    "sheikh": "sheik",
    "shovelled": "shoveled",
    "shovelling": "shoveling",
    "shrivelled": "shriveled",
    "shrivelling": "shriveling",
    "signalise": "signalize",
    "signalised": "signalized",
    "signalises": "signalizes",
    "signalising": "signalizing",
    "signalled": "signaled",
    "signalling": "signaling",
    "smoulder": "smolder",
    "smouldered": "smoldered",
    "smouldering": "smoldering",
    "smoulders": "smolders",
    "snivelled": "sniveled",
    "snivelling": "sniveling",
    "snorkelled": "snorkeled",
    "snorkelling": "snorkeling",
    "snowplough": "snowplow",
    "snowploughs": "snowplows",
    "socialisation": "socialization",
    "socialise": "socialize",
    "socialised": "socialized",
    "socialises": "socializes",
    "socialising": "socializing",
    "sodomise": "sodomize",
    "sodomised": "sodomized",
    "sodomises": "sodomizes",
    "sodomising": "sodomizing",
    "solemnise": "solemnize",
    "solemnised": "solemnized",
    "solemnises": "solemnizes",
    "solemnising": "solemnizing",
    "sombre": "somber",
    "specialisation": "specialization",
    "specialisations": "specializations",
    "specialise": "specialize",
    "specialised": "specialized",
    "specialises": "specializes",
    "specialising": "specializing",
    "spectre": "specter",
    "spectres": "specters",
    "spiralled": "spiraled",
    "spiralling": "spiraling",
    "splendour": "splendor",
    "splendours": "splendors",
    "squirrelled": "squirreled",
    "squirrelling": "squirreling",
    "stabilisation": "stabilization",
    "stabilise": "stabilize",
    "stabilised": "stabilized",
    "stabiliser": "stabilizer",
    "stabilisers": "stabilizers",
    "stabilises": "stabilizes",
    "stabilising": "stabilizing",
    "standardisation": "standardization",
    "standardise": "standardize",
    "standardised": "standardized",
    "standardises": "standardizes",
    "standardising": "standardizing",
    "stencilled": "stenciled",
    "stencilling": "stenciling",
    "sterilisation": "sterilization",
    "sterilisations": "sterilizations",
    "sterilise": "sterilize",
    "sterilised": "sterilized",
    "steriliser": "sterilizer",
    "sterilisers": "sterilizers",
    "sterilises": "sterilizes",
    "sterilising": "sterilizing",
    "stigmatisation": "stigmatization",
    "stigmatise": "stigmatize",
    "stigmatised": "stigmatized",
    "stigmatises": "stigmatizes",
    "stigmatising": "stigmatizing",
    "storey": "story",
    "storeys": "stories",
    "subsidisation": "subsidization",
    "subsidise": "subsidize",
    "subsidised": "subsidized",
    "subsidiser": "subsidizer",
    "subsidisers": "subsidizers",
    "subsidises": "subsidizes",
    "subsidising": "subsidizing",
    "succour": "succor",
    "succoured": "succored",
    "succouring": "succoring",
    "succours": "succors",
    "sulphate": "sulfate",
    "sulphates": "sulfates",
    "sulphide": "sulfide",
    "sulphides": "sulfides",
    "sulphur": "sulfur",
    "sulphurous": "sulfurous",
    "summarise": "summarize",
    "summarised": "summarized",
    "summarises": "summarizes",
    "summarising": "summarizing",
    "swivelled": "swiveled",
    "swivelling": "swiveling",
    "symbolise": "symbolize",
    "symbolised": "symbolized",
    "symbolises": "symbolizes",
    "symbolising": "symbolizing",
    "sympathise": "sympathize",
    "sympathised": "sympathized",
    "sympathiser": "sympathizer",
    "sympathisers": "sympathizers",
    "sympathises": "sympathizes",
    "sympathising": "sympathizing",
    "synchronisation": "synchronization",
    "synchronise": "synchronize",
    "synchronised": "synchronized",
    "synchronises": "synchronizes",
    "synchronising": "synchronizing",
    "synthesise": "synthesize",
    "synthesised": "synthesized",
    "synthesiser": "synthesizer",
    "synthesisers": "synthesizers",
    "synthesises": "synthesizes",
    "synthesising": "synthesizing",
    "syphon": "siphon",
    "syphoned": "siphoned",
    "syphoning": "siphoning",
    "syphons": "siphons",
    "systematisation": "systematization",
    "systematise": "systematize",
    "systematised": "systematized",
    "systematises": "systematizes",
    "systematising": "systematizing",
    "tantalise": "tantalize",
    "tantalised": "tantalized",
    "tantalises": "tantalizes",
    "tantalising": "tantalizing",
    "tantalisingly": "tantalizingly",
    "tasselled": "tasseled",
    "technicolour": "technicolor",
    "temporise": "temporize",
    "temporised": "temporized",
    "temporises": "temporizes",
    "temporising": "temporizing",
    "tenderise": "tenderize",
    "tenderised": "tenderized",
    "tenderises": "tenderizes",
    "tenderising": "tenderizing",
    "terrorise": "terrorize",
    "terrorised": "terrorized",
    "terrorises": "terrorizes",
    "terrorising": "terrorizing",
    "theatre": "theater",
    "theatregoer": "theatergoer",
    "theatregoers": "theatergoers",
    "theatres": "theaters",
    "theorise": "theorize",
    "theorised": "theorized",
    "theorises": "theorizes",
    "theorising": "theorizing",
    "tonne": "ton",
    "tonnes": "tons",
    "towelled": "toweled",
    "towelling": "toweling",
    "toxaemia": "toxemia",
    "tranquillise": "tranquilize",
    "tranquillised": "tranquilized",
    "tranquilliser": "tranquilizer",
    "tranquillisers": "tranquilizers",
    "tranquillises": "tranquilizes",
    "tranquillising": "tranquilizing",
    "tranquillity": "tranquility",
    "tranquillize": "tranquilize",
    "tranquillized": "tranquilized",
    "tranquillizer": "tranquilizer",
    "tranquillizers": "tranquilizers",
    "tranquillizes": "tranquilizes",
    "tranquillizing": "tranquilizing",
    "tranquilly": "tranquility",
    "transistorised": "transistorized",
    "traumatise": "traumatize",
    "traumatised": "traumatized",
    "traumatises": "traumatizes",
    "traumatising": "traumatizing",
    "travelled": "traveled",
    "traveller": "traveler",
    "travellers": "travelers",
    "travelling": "traveling",
    "travelog": "travelogue",
    "travelogs": "travelogues",
    "trialled": "trialed",
    "trialling": "trialing",
    "tricolour": "tricolor",
    "tricolours": "tricolors",
    "trivialise": "trivialize",
    "trivialised": "trivialized",
    "trivialises": "trivializes",
    "trivialising": "trivializing",
    "tumour": "tumor",
    "tumours": "tumors",
    "tunnelled": "tunneled",
    "tunnelling": "tunneling",
    "tyrannise": "tyrannize",
    "tyrannised": "tyrannized",
    "tyrannises": "tyrannizes",
    "tyrannising": "tyrannizing",
    "tyre": "tire",
    "tyres": "tires",
    "unauthorised": "unauthorized",
    "uncivilised": "uncivilized",
    "underutilised": "underutilized",
    "unequalled": "unequaled",
    "unfavourable": "unfavorable",
    "unfavourably": "unfavorably",
    "unionisation": "unionization",
    "unionise": "unionize",
    "unionised": "unionized",
    "unionises": "unionizes",
    "unionising": "unionizing",
    "unorganised": "unorganized",
    "unravelled": "unraveled",
    "unravelling": "unraveling",
    "unrecognisable": "unrecognizable",
    "unrecognised": "unrecognized",
    "unrivalled": "unrivaled",
    "unsavoury": "unsavory",
    "untrammelled": "untrammeled",
    "urbanisation": "urbanization",
    "urbanise": "urbanize",
    "urbanised": "urbanized",
    "urbanises": "urbanizes",
    "urbanising": "urbanizing",
    "utilisable": "utilizable",
    "utilisation": "utilization",
    "utilise": "utilize",
    "utilised": "utilized",
    "utilises": "utilizes",
    "utilising": "utilizing",
    "valour": "valor",
    "vandalise": "vandalize",
    "vandalised": "vandalized",
    "vandalises": "vandalizes",
    "vandalising": "vandalizing",
    "vaporisation": "vaporization",
    "vaporise": "vaporize",
    "vaporised": "vaporized",
    "vaporises": "vaporizes",
    "vaporising": "vaporizing",
    "vapour": "vapor",
    "vapours": "vapors",
    "verbalise": "verbalize",
    "verbalised": "verbalized",
    "verbalises": "verbalizes",
    "verbalising": "verbalizing",
    "victimisation": "victimization",
    "victimise": "victimize",
    "victimised": "victimized",
    "victimises": "victimizes",
    "victimising": "victimizing",
    "videodisc": "videodisk",
    "videodiscs": "videodisks",
    "vigour": "vigor",
    "visualisation": "visualization",
    "visualisations": "visualizations",
    "visualise": "visualize",
    "visualised": "visualized",
    "visualises": "visualizes",
    "visualising": "visualizing",
    "vocalisation": "vocalization",
    "vocalisations": "vocalizations",
    "vocalise": "vocalize",
    "vocalised": "vocalized",
    "vocalises": "vocalizes",
    "vocalising": "vocalizing",
    "vulcanised": "vulcanized",
    "vulgarisation": "vulgarization",
    "vulgarise": "vulgarize",
    "vulgarised": "vulgarized",
    "vulgarises": "vulgarizes",
    "vulgarising": "vulgarizing",
    "waggon": "wagon",
    "waggons": "wagons",
    "watercolour": "watercolor",
    "watercolours": "watercolors",
    "weaselled": "weaseled",
    "weaselling": "weaseling",
    "westernisation": "westernization",
    "westernise": "westernize",
    "westernised": "westernized",
    "westernises": "westernizes",
    "westernising": "westernizing",
    "womanise": "womanize",
    "womanised": "womanized",
    "womaniser": "womanizer",
    "womanisers": "womanizers",
    "womanises": "womanizes",
    "womanising": "womanizing",
    "woollen": "woolen",
    "woollens": "woolens",
    "woollies": "woolies",
    "woolly": "wooly",
    "worshipped": "worshiped",
    "worshipping": "worshiping",
    "worshipper": "worshiper",
    "yodelled": "yodeled",
    "yodelling": "yodeling",
    "yoghourt": "yogurt",
    "yoghourts": "yogurts",
    "yoghurt": "yogurt",
    "yoghurts": "yogurts",
    "mhm": "hmm",
    "mmm": "hmm"
}
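
The mapping above is a plain word-to-word lookup table, so applying it comes down to splitting the text on whitespace and replacing each word that has an entry. The following is a minimal sketch of that idea, not the notebook's own code: `SPELLING_EXCERPT` and `americanize` are hypothetical names introduced only for illustration, and the dictionary is a four-entry excerpt of the full table.

```python
# Hypothetical excerpt of the British-to-American mapping above;
# the name SPELLING_EXCERPT is an assumption, not the notebook's variable.
SPELLING_EXCERPT = {
    "organisation": "organization",
    "neighbours": "neighbors",
    "theatre": "theater",
    "mhm": "hmm",
}


def americanize(text: str, mapping: dict) -> str:
    """Replace every whitespace-separated word that has an entry in `mapping`."""
    # Words without an entry are passed through unchanged.
    return " ".join(mapping.get(word, word) for word in text.split())


print(americanize("the organisation met its neighbours at the theatre", SPELLING_EXCERPT))
# -> the organization met its neighbors at the theater
```

Because the lookup is exact and the keys are lowercase, a word such as `Theatre` or `theatre,` would pass through unchanged. The sketch therefore assumes the text has already been lowercased and stripped of punctuation before the mapping is applied.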

In [None]:
#@markdown #### **Text Normalizer Code**
#@markdown * Copied from https://github.com/openai/whisper/tree/main/whisper/normalizers
#@markdown * Main Classes
#@markdown `BasicTextNormalizer` and `EnglishTextNormalizer`
#@markdown ---

# https://github.com/openai/whisper/blob/main/whisper/normalizers/english.py

# import json
# import os
import re  # used by re.sub / re.match in the normalizers below
from fractions import Fraction
from typing import Iterator, List, Match, Optional, Union

from more_itertools import windowed
import unicodedata

import regex

# non-ASCII letters that are not separated by "NFKD" normalization
ADDITIONAL_DIACRITICS = {
    "œ": "oe",
    "Œ": "OE",
    "ø": "o",
    "Ø": "O",
    "æ": "ae",
    "Æ": "AE",
    "ß": "ss",
    "ẞ": "SS",
    "đ": "d",
    "Đ": "D",
    "ð": "d",
    "Ð": "D",
    "þ": "th",
    "Þ": "th",
    "ł": "l",
    "Ł": "L",
}


def remove_symbols_and_diacritics(s: str, keep=""):
    """
    Replace any other markers, symbols, and punctuations with a space,
    and drop any diacritics (category 'Mn' and some manual mappings)
    """
    return "".join(
        c
        if c in keep
        else ADDITIONAL_DIACRITICS[c]
        if c in ADDITIONAL_DIACRITICS
        else ""
        if unicodedata.category(c) == "Mn"
        else " "
        if unicodedata.category(c)[0] in "MSP"
        else c
        for c in unicodedata.normalize("NFKD", s)
    )


def remove_symbols(s: str):
    """
    Replace any other markers, symbols, punctuations with a space, keeping diacritics
    """
    return "".join(
        " " if unicodedata.category(c)[0] in "MSP" else c
        for c in unicodedata.normalize("NFKC", s)
    )


class BasicTextNormalizer:
    def __init__(self, remove_diacritics: bool = False, split_letters: bool = False):
        self.clean = (
            remove_symbols_and_diacritics if remove_diacritics else remove_symbols
        )
        self.split_letters = split_letters

    def __call__(self, s: str):
        s = s.lower()
        s = re.sub(r"[<\[][^>\]]*[>\]]", "", s)  # remove words between brackets
        s = re.sub(r"\(([^)]+?)\)", "", s)  # remove words between parentheses
        s = self.clean(s).lower()

        if self.split_letters:
            s = " ".join(regex.findall(r"\X", s, regex.U))

        s = re.sub(
            r"\s+", " ", s
        )  # replace any successive whitespace characters with a space

        return s


class EnglishNumberNormalizer:
    """
    Convert any spelled-out numbers into arabic numbers, while handling:

    - remove any commas
    - keep the suffixes such as: `1960s`, `274th`, `32nd`, etc.
    - spell out currency symbols after the number. e.g. `$20 million` -> `20000000 dollars`
    - spell out `one` and `ones`
    - interpret successive single-digit numbers as nominal: `one oh one` -> `101`
    """

    def __init__(self):
        super().__init__()

        self.zeros = {"o", "oh", "zero"}
        self.ones = {
            name: i
            for i, name in enumerate(
                [
                    "one",
                    "two",
                    "three",
                    "four",
                    "five",
                    "six",
                    "seven",
                    "eight",
                    "nine",
                    "ten",
                    "eleven",
                    "twelve",
                    "thirteen",
                    "fourteen",
                    "fifteen",
                    "sixteen",
                    "seventeen",
                    "eighteen",
                    "nineteen",
                ],
                start=1,
            )
        }
        self.ones_plural = {
            "sixes" if name == "six" else name + "s": (value, "s")
            for name, value in self.ones.items()
        }
        self.ones_ordinal = {
            "zeroth": (0, "th"),
            "first": (1, "st"),
            "second": (2, "nd"),
            "third": (3, "rd"),
            "fifth": (5, "th"),
            "twelfth": (12, "th"),
            **{
                name + ("h" if name.endswith("t") else "th"): (value, "th")
                for name, value in self.ones.items()
                if value > 3 and value != 5 and value != 12
            },
        }
        self.ones_suffixed = {**self.ones_plural, **self.ones_ordinal}

        self.tens = {
            "twenty": 20,
            "thirty": 30,
            "forty": 40,
            "fifty": 50,
            "sixty": 60,
            "seventy": 70,
            "eighty": 80,
            "ninety": 90,
        }
        self.tens_plural = {
            name.replace("y", "ies"): (value, "s") for name, value in self.tens.items()
        }
        self.tens_ordinal = {
            name.replace("y", "ieth"): (value, "th")
            for name, value in self.tens.items()
        }
        self.tens_suffixed = {**self.tens_plural, **self.tens_ordinal}

        self.multipliers = {
            "hundred": 100,
            "thousand": 1_000,
            "million": 1_000_000,
            "billion": 1_000_000_000,
            "trillion": 1_000_000_000_000,
            "quadrillion": 1_000_000_000_000_000,
            "quintillion": 1_000_000_000_000_000_000,
            "sextillion": 1_000_000_000_000_000_000_000,
            "septillion": 1_000_000_000_000_000_000_000_000,
            "octillion": 1_000_000_000_000_000_000_000_000_000,
            "nonillion": 1_000_000_000_000_000_000_000_000_000_000,
            "decillion": 1_000_000_000_000_000_000_000_000_000_000_000,
        }
        self.multipliers_plural = {
            name + "s": (value, "s") for name, value in self.multipliers.items()
        }
        self.multipliers_ordinal = {
            name + "th": (value, "th") for name, value in self.multipliers.items()
        }
        self.multipliers_suffixed = {
            **self.multipliers_plural,
            **self.multipliers_ordinal,
        }
        self.decimals = {*self.ones, *self.tens, *self.zeros}

        self.preceding_prefixers = {
            "minus": "-",
            "negative": "-",
            "plus": "+",
            "positive": "+",
        }
        self.following_prefixers = {
            "pound": "£",
            "pounds": "£",
            "euro": "€",
            "euros": "€",
            "dollar": "$",
            "dollars": "$",
            "cent": "¢",
            "cents": "¢",
        }
        self.prefixes = set(
            list(self.preceding_prefixers.values())
            + list(self.following_prefixers.values())
        )
        self.suffixers = {
            "per": {"cent": "%"},
            "percent": "%",
        }
        self.specials = {"and", "double", "triple", "point"}

        self.words = set(
            [
                key
                for mapping in [
                    self.zeros,
                    self.ones,
                    self.ones_suffixed,
                    self.tens,
                    self.tens_suffixed,
                    self.multipliers,
                    self.multipliers_suffixed,
                    self.preceding_prefixers,
                    self.following_prefixers,
                    self.suffixers,
                    self.specials,
                ]
                for key in mapping
            ]
        )
        self.literal_words = {"one", "ones"}

    def process_words(self, words: List[str]) -> Iterator[str]:
        prefix: Optional[str] = None
        value: Optional[Union[str, int]] = None
        skip = False

        def to_fraction(s: str):
            try:
                return Fraction(s)
            except ValueError:
                return None

        def output(result: Union[str, int]):
            nonlocal prefix, value
            result = str(result)
            if prefix is not None:
                result = prefix + result
            value = None
            prefix = None
            return result

        if len(words) == 0:
            return

        for prev, current, next in windowed([None] + words + [None], 3):
            if skip:
                skip = False
                continue

            next_is_numeric = next is not None and re.match(r"^\d+(\.\d+)?$", next)
            has_prefix = current[0] in self.prefixes
            current_without_prefix = current[1:] if has_prefix else current
            if re.match(r"^\d+(\.\d+)?$", current_without_prefix):
                # arabic numbers (potentially with signs and fractions)
                f = to_fraction(current_without_prefix)
                assert f is not None
                if value is not None:
                    if isinstance(value, str) and value.endswith("."):
                        # concatenate decimals / ip address components
                        value = str(value) + str(current)
                        continue
                    else:
                        yield output(value)

                prefix = current[0] if has_prefix else prefix
                if f.denominator == 1:
                    value = f.numerator  # store integers as int
                else:
                    value = current_without_prefix
            elif current not in self.words:
                # non-numeric words
                if value is not None:
                    yield output(value)
                yield output(current)
            elif current in self.zeros:
                value = str(value or "") + "0"
            elif current in self.ones:
                ones = self.ones[current]

                if value is None:
                    value = ones
                elif isinstance(value, str) or prev in self.ones:
                    if (
                        prev in self.tens and ones < 10
                    ):  # replace the last zero with the digit
                        assert value[-1] == "0"
                        value = value[:-1] + str(ones)
                    else:
                        value = str(value) + str(ones)
                elif ones < 10:
                    if value % 10 == 0:
                        value += ones
                    else:
                        value = str(value) + str(ones)
                else:  # eleven to nineteen
                    if value % 100 == 0:
                        value += ones
                    else:
                        value = str(value) + str(ones)
            elif current in self.ones_suffixed:
                # ordinal or cardinal; yield the number right away
                ones, suffix = self.ones_suffixed[current]
                if value is None:
                    yield output(str(ones) + suffix)
                elif isinstance(value, str) or prev in self.ones:
                    if prev in self.tens and ones < 10:
                        assert value[-1] == "0"
                        yield output(value[:-1] + str(ones) + suffix)
                    else:
                        yield output(str(value) + str(ones) + suffix)
                elif ones < 10:
                    if value % 10 == 0:
                        yield output(str(value + ones) + suffix)
                    else:
                        yield output(str(value) + str(ones) + suffix)
                else:  # eleven to nineteen
                    if value % 100 == 0:
                        yield output(str(value + ones) + suffix)
                    else:
                        yield output(str(value) + str(ones) + suffix)
                value = None
            elif current in self.tens:
                tens = self.tens[current]
                if value is None:
                    value = tens
                elif isinstance(value, str):
                    value = str(value) + str(tens)
                else:
                    if value % 100 == 0:
                        value += tens
                    else:
                        value = str(value) + str(tens)
            elif current in self.tens_suffixed:
                # ordinal or cardinal; yield the number right away
                tens, suffix = self.tens_suffixed[current]
                if value is None:
                    yield output(str(tens) + suffix)
                elif isinstance(value, str):
                    yield output(str(value) + str(tens) + suffix)
                else:
                    if value % 100 == 0:
                        yield output(str(value + tens) + suffix)
                    else:
                        yield output(str(value) + str(tens) + suffix)
            elif current in self.multipliers:
                multiplier = self.multipliers[current]
                if value is None:
                    value = multiplier
                elif isinstance(value, str) or value == 0:
                    f = to_fraction(value)
                    p = f * multiplier if f is not None else None
                    if f is not None and p.denominator == 1:
                        value = p.numerator
                    else:
                        yield output(value)
                        value = multiplier
                else:
                    before = value // 1000 * 1000
                    residual = value % 1000
                    value = before + residual * multiplier
            elif current in self.multipliers_suffixed:
                multiplier, suffix = self.multipliers_suffixed[current]
                if value is None:
                    yield output(str(multiplier) + suffix)
                elif isinstance(value, str):
                    f = to_fraction(value)
                    p = f * multiplier if f is not None else None
                    if f is not None and p.denominator == 1:
                        yield output(str(p.numerator) + suffix)
                    else:
                        yield output(value)
                        yield output(str(multiplier) + suffix)
                else:  # int
                    before = value // 1000 * 1000
                    residual = value % 1000
                    value = before + residual * multiplier
                    yield output(str(value) + suffix)
                value = None
            elif current in self.preceding_prefixers:
                # apply prefix (positive, minus, etc.) if it precedes a number
                if value is not None:
                    yield output(value)

                if next in self.words or next_is_numeric:
                    prefix = self.preceding_prefixers[current]
                else:
                    yield output(current)
            elif current in self.following_prefixers:
                # apply prefix (dollars, cents, etc.) only after a number
                if value is not None:
                    prefix = self.following_prefixers[current]
                    yield output(value)
                else:
                    yield output(current)
            elif current in self.suffixers:
                # apply suffix symbols (percent -> '%')
                if value is not None:
                    suffix = self.suffixers[current]
                    if isinstance(suffix, dict):
                        if next in suffix:
                            yield output(str(value) + suffix[next])
                            skip = True
                        else:
                            yield output(value)
                            yield output(current)
                    else:
                        yield output(str(value) + suffix)
                else:
                    yield output(current)
            elif current in self.specials:
                if next not in self.words and not next_is_numeric:
                    # apply special handling only if the next word can be numeric
                    if value is not None:
                        yield output(value)
                    yield output(current)
                elif current == "and":
                    # ignore "and" after hundreds, thousands, etc.
                    if prev not in self.multipliers:
                        if value is not None:
                            yield output(value)
                        yield output(current)
                elif current == "double" or current == "triple":
                    if next in self.ones or next in self.zeros:
                        repeats = 2 if current == "double" else 3
                        ones = self.ones.get(next, 0)
                        value = str(value or "") + str(ones) * repeats
                        skip = True
                    else:
                        if value is not None:
                            yield output(value)
                        yield output(current)
                elif current == "point":
                    if next in self.decimals or next_is_numeric:
                        value = str(value or "") + "."
                else:
                    # should all have been covered at this point
                    raise ValueError(f"Unexpected token: {current}")
            else:
                # all should have been covered at this point
                raise ValueError(f"Unexpected token: {current}")

        if value is not None:
            yield output(value)

    def preprocess(self, s: str):
        # replace "<number> and a half" with "<number> point five"
        results = []

        segments = re.split(r"\band\s+a\s+half\b", s)
        for i, segment in enumerate(segments):
            if len(segment.strip()) == 0:
                continue
            if i == len(segments) - 1:
                results.append(segment)
            else:
                results.append(segment)
                last_word = segment.rsplit(maxsplit=2)[-1]
                if last_word in self.decimals or last_word in self.multipliers:
                    results.append("point five")
                else:
                    results.append("and a half")

        s = " ".join(results)

        # put a space at number/letter boundary
        s = re.sub(r"([a-z])([0-9])", r"\1 \2", s)
        s = re.sub(r"([0-9])([a-z])", r"\1 \2", s)

        # but remove spaces which could be a suffix
        s = re.sub(r"([0-9])\s+(st|nd|rd|th|s)\b", r"\1\2", s)

        return s

    def postprocess(self, s: str):
        def combine_cents(m: Match):
            try:
                currency = m.group(1)
                integer = m.group(2)
                cents = int(m.group(3))
                return f"{currency}{integer}.{cents:02d}"
            except ValueError:
                return m.string

        def extract_cents(m: Match):
            try:
                return f"¢{int(m.group(1))}"
            except ValueError:
                return m.string

        # apply currency postprocessing; "$2 and ¢7" -> "$2.07"
        s = re.sub(r"([€£$])([0-9]+) (?:and )?¢([0-9]{1,2})\b", combine_cents, s)
        s = re.sub(r"[€£$]0\.([0-9]{1,2})\b", extract_cents, s)

        # write "one(s)" instead of "1(s)", just for the readability
        s = re.sub(r"\b1(s?)\b", r"one\1", s)

        return s

    def __call__(self, s: str):
        s = self.preprocess(s)
        s = " ".join(word for word in self.process_words(s.split()) if word is not None)
        s = self.postprocess(s)

        return s


class EnglishSpellingNormalizer:
    """
    Applies British-American spelling mappings as listed in [1].

    [1] https://www.tysto.com/uk-us-spelling-list.html
    """

    def __init__(self):
        # mapping_path = os.path.join(os.path.dirname(__file__), "english.json")
        # self.mapping = json.load(open(mapping_path))
        self.mapping = english_word_maps

    def __call__(self, s: str):
        return " ".join(self.mapping.get(word, word) for word in s.split())


class EnglishTextNormalizer:
    def __init__(self):
        self.ignore_patterns = r"\b(hmm|mm|mhm|mmm|uh|um)\b"
        self.replacers = {
            # common contractions
            r"\bwon't\b": "will not",
            r"\bcan't\b": "can not",
            r"\blet's\b": "let us",
            r"\bain't\b": "aint",
            r"\by'all\b": "you all",
            r"\bwanna\b": "want to",
            r"\bgotta\b": "got to",
            r"\bgonna\b": "going to",
            r"\bi'ma\b": "i am going to",
            r"\bimma\b": "i am going to",
            r"\bwoulda\b": "would have",
            r"\bcoulda\b": "could have",
            r"\bshoulda\b": "should have",
            r"\bma'am\b": "madam",
            # contractions in titles/prefixes
            r"\bmr\b": "mister ",
            r"\bmrs\b": "missus ",
            r"\bst\b": "saint ",
            r"\bdr\b": "doctor ",
            r"\bprof\b": "professor ",
            r"\bcapt\b": "captain ",
            r"\bgov\b": "governor ",
            r"\bald\b": "alderman ",
            r"\bgen\b": "general ",
            r"\bsen\b": "senator ",
            r"\brep\b": "representative ",
            r"\bpres\b": "president ",
            r"\brev\b": "reverend ",
            r"\bhon\b": "honorable ",
            r"\basst\b": "assistant ",
            r"\bassoc\b": "associate ",
            r"\blt\b": "lieutenant ",
            r"\bcol\b": "colonel ",
            r"\bjr\b": "junior ",
            r"\bsr\b": "senior ",
            r"\besq\b": "esquire ",
            # perfect tenses, ideally it should be any past participles, but it's harder..
            r"'d been\b": " had been",
            r"'s been\b": " has been",
            r"'d gone\b": " had gone",
            r"'s gone\b": " has gone",
            r"'d done\b": " had done",  # "'s done" is ambiguous
            r"'s got\b": " has got",
            # general contractions
            r"n't\b": " not",
            r"'re\b": " are",
            r"'s\b": " is",
            r"'d\b": " would",
            r"'ll\b": " will",
            r"'t\b": " not",
            r"'ve\b": " have",
            r"'m\b": " am",
        }
        self.standardize_numbers = EnglishNumberNormalizer()
        self.standardize_spellings = EnglishSpellingNormalizer()

    def __call__(self, s: str):
        s = s.lower()

        s = re.sub(r"[<\[][^>\]]*[>\]]", "", s)  # remove words between brackets
        s = re.sub(r"\(([^)]+?)\)", "", s)  # remove words between parentheses
        s = re.sub(self.ignore_patterns, "", s)
        s = re.sub(r"\s+'", "'", s)  # when there's a space before an apostrophe

        for pattern, replacement in self.replacers.items():
            s = re.sub(pattern, replacement, s)

        s = re.sub(r"(\d),(\d)", r"\1\2", s)  # remove commas between digits
        s = re.sub(r"\.([^0-9]|$)", r" \1", s)  # remove periods not followed by numbers
        s = remove_symbols_and_diacritics(s, keep=".%$¢€£")  # keep numeric symbols

        s = self.standardize_numbers(s)
        s = self.standardize_spellings(s)

        # now remove prefix/suffix symbols that are not preceded/followed by numbers
        s = re.sub(r"[.$¢€£]([^0-9])", r" \1", s)
        s = re.sub(r"([^0-9])%", r"\1 ", s)

        s = re.sub(r"\s+", " ", s)  # replace any successive whitespaces with a space

        return s


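
The currency step in `EnglishNumberNormalizer.postprocess` can be tried in isolation. The following is a minimal sketch of those two substitutions only. `currency_postprocess` is a hypothetical name used for illustration, and the sketch escapes the decimal point in the second pattern so that only a literal `.` matches.

```python
import re


def combine_cents(m: re.Match) -> str:
    # "$2 and ¢7" -> "$2.07": pad the cents to two digits
    currency, integer, cents = m.group(1), m.group(2), int(m.group(3))
    return f"{currency}{integer}.{cents:02d}"


def extract_cents(m: re.Match) -> str:
    # "$0.50" -> "¢50": amounts below one unit are written as cents
    return f"¢{int(m.group(1))}"


def currency_postprocess(s: str) -> str:
    s = re.sub(r"([€£$])([0-9]+) (?:and )?¢([0-9]{1,2})\b", combine_cents, s)
    s = re.sub(r"[€£$]0\.([0-9]{1,2})\b", extract_cents, s)
    return s


print(currency_postprocess("$2 and ¢7"))  # $2.07
print(currency_postprocess("£3 ¢25"))     # £3.25
print(currency_postprocess("$0.50"))      # ¢50
```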

#### Evaluation Script

In [None]:
#@markdown ##### ⚙️ Contains the custom script for evaluation
#@markdown
#@markdown ---

%%capture

!pip install jiwer
!pip install sacrebleu

import jiwer
import sacrebleu
import nltk
from nltk.translate.bleu_score import sentence_bleu

from tabulate import tabulate

def compute_wer(hypothesis, reference):
    return jiwer.wer(reference, hypothesis)

def compute_cer(hypothesis, reference):
    return jiwer.cer(reference, hypothesis)

def compute_bleu(hypothesis, reference):
    reference = [reference.split()]  # BLEU expects a list of reference translations
    hypothesis = hypothesis.split()
    return sentence_bleu(reference, hypothesis)

def compute_sacrebleu(hypothesis, reference):
    return sacrebleu.corpus_bleu([hypothesis], [[reference]]).score

english_normalizer = EnglishTextNormalizer()
non_english_normalizer = BasicTextNormalizer()

def evaluate(asr_output, gold_transcript):

    evaluation_scores = {}

    evaluation_scores['wer'] = compute_wer(asr_output, gold_transcript)
    evaluation_scores['cer'] = compute_cer(asr_output, gold_transcript)
    evaluation_scores['bleu'] = compute_bleu(asr_output, gold_transcript)
    evaluation_scores['sacrebleu'] = compute_sacrebleu(asr_output, gold_transcript)

    return evaluation_scores

def custom_progress(num, desc):
  print("Progress:" + str(num * 100) + "\n" + "Description:" + str(desc) )


def evaluation_pipeline(testing_dataset):
  result = []

  for vid, language, gold_transcript in zip(testing_dataset['vids'], testing_dataset['languages'], testing_dataset['gold_transcripts']):

    current_result = {}
    current_result['Path'] = vid
    current_result['language'] = language

    # TODO: add a method to handle a file path as input
    transcript, _, _, _, _, _, _ = transcribe_yt(vid, language, 5, progress=custom_progress)

    if language == 'english':
      transcript = english_normalizer(transcript)
      gold_transcript = english_normalizer(gold_transcript)
    else:
      transcript = non_english_normalizer(transcript)
      gold_transcript = non_english_normalizer(gold_transcript)

    current_result['eval_result'] = evaluate(transcript, gold_transcript)

    result.append(current_result)

  return result

def evaluate_and_tabulate(testing_dataset):

  results = evaluation_pipeline(testing_dataset)

  # The first row of table_data holds the column headers.
  table_data = [["Video URL/File Name", "Language", "WER", "CER", "BLEU", "SACREBLEU"]]
  for result in results:
    scores = result['eval_result']
    table_data.append([result['Path'], result['language'], scores['wer'], scores['cer'], scores['bleu'], scores['sacrebleu']])

  # headers="firstrow" makes tabulate treat table_data[0] as the header row.
  print(tabulate(table_data, headers="firstrow", tablefmt="grid"))

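
For reference, word error rate is conventionally defined as the word-level edit distance (substitutions, deletions and insertions) divided by the number of words in the reference. The sketch below implements that definition in plain Python so the numbers in the table are easier to interpret. `word_error_rate` is a hypothetical helper written for illustration, not the `jiwer` implementation. It takes `(reference, hypothesis)`, whereas `compute_wer` above takes `(hypothesis, reference)`.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One deleted word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
# Two inserted words against a two-word reference.
print(word_error_rate("a b", "a x b y"))
```

Since the denominator is the reference length, a hypothesis with many insertions can score above 1.0. Character error rate follows the same recipe with characters in place of words.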
### Run Evaluation

In [None]:
#@markdown ## **Run Evaluation** 🕵️
#@markdown * The evaluation runs on the two YouTube videos above and their transcriptions
#@markdown ---

# Run Evaluation
evaluate_and_tabulate(testing_dataset)

Progress:1.0
Description:Downloading audio file....
Valid Youtube URL!
No of different Captions available for this video (Can be used for transcription/translation training data) : English (auto-generated)
Audio downloaded successfully!
Converting to .mp3 format......
Converted successfully to .mp3 format!
Progress:10.0
Description:File Setup Complete -> Starting Audio Preprocessing - Normalizing Audio, Dynamic Range Compression, Noise Reduction, Weiner Filter
---------------------------------------------------------------------------------------------------
Starting Audio Preprocessing Techniques - Normalizing Audio, Dynamic Range Compression, Noise Reduction, Weiner Filter
---------------------------------------------------------------------------------------------------
Normalizing audio......
Audio Normalization done!
Applying weiner filter on audio file....
Completed applying weiner filter on audio file to reduce noise!
Audio Preprocessing Done!
Progress:30.0
Description:Audio Pre

preprocessor_config.json:   0%|          | 0.00/340 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.48M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/2.39k [00:00<?, ?B/s]

vocabulary.json:   0%|          | 0.00/1.07M [00:00<?, ?B/s]

model.bin:   0%|          | 0.00/3.09G [00:00<?, ?B/s]

Progress:40.0
Description:Whisper Model Initialised -> Starting Whisper Transcription...
----------Models Initialization Done!


Downloading: "https://dl.fbaipublicfiles.com/demucs/hybrid_transformer/955717e8-8726e21a.th" to /root/.cache/torch/hub/checkpoints/955717e8-8726e21a.th
100%|██████████| 80.2M/80.2M [00:00<00:00, 159MB/s]
Demucs: 100%|██████████| 892.02/892.02 [00:35<00:00, 25.01sec/s]


Detected Language: english
Transcribing with faster-whisper (large-v3)...

Transcribe: 100%|██████████| 892.02/892.02 [05:32<00:00,  2.69sec/s]
Downloading: "https://github.com/snakers4/silero-vad/zipball/master" to /root/.cache/torch/hub/master.zip
VAD: 100%|██████████| 892.02/892.02 [00:45<00:00, 19.54sec/s]
Adjustment: 100%|██████████| 886.4/886.4 [00:00<00:00, 20133.17sec/s]


Saved: /content/generated_data/MRM/transcript.json
---------------------------------------------------------------------------------------------------
Whisper Transcription Complete
---------------------------------------------------------------------------------------------------
Progress:70.0
Description:Whisper Transcription complete -> Initializing speaker diariazation model...
---------------------------------------------------------------------------------------------------
Starting Speaker Diarization
---------------------------------------------------------------------------------------------------


config.yaml:   0%|          | 0.00/469 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/5.91M [00:00<?, ?B/s]

config.yaml:   0%|          | 0.00/399 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/26.6M [00:00<?, ?B/s]

config.yaml:   0%|          | 0.00/221 [00:00<?, ?B/s]

Progress:80.0
Description:Speaker Diariazation Model Initialized -> Starting speaker diariazation...


Output()

Speaker Diarization Done!
Progress:90.0
Description:Speaker Diariazation Complete -> Starting Transcript Preparation...
---------------------------------------------------------------------------------------------------
Starting Transcript Preparation
---------------------------------------------------------------------------------------------------
Starting cleaning
---------------------------------------------------------------------------------------------------
Pipeline Completed!
---------------------------------------------------------------------------------------------------
Progress:99.0
Description:Transcript Complete...
Transcription complete.
Progress:1.0
Description:Downloading audio file....
Valid Youtube URL!
No of different Captions available for this video (Can be used for transcription/translation training data) : Hindi (auto-generated)
Audio downloaded successfully!
Converting to .mp3 format......
Converted successfully to .mp3 format!
Progress:10.0
Description:File 

Demucs: 100%|██████████| 890.01/890.01 [00:34<00:00, 25.93sec/s]


Detected Language: hindi
Transcribing with faster-whisper (large-v3)...

Transcribe: 100%|██████████| 890.01/890.01 [03:20<00:00,  4.44sec/s]
VAD: 100%|██████████| 890.01/890.01 [00:46<00:00, 19.21sec/s]
Adjustment: 100%|██████████| 867.72/867.72 [00:00<00:00, 19036.14sec/s]


Saved: /content/generated_data/OUL/transcript.json
---------------------------------------------------------------------------------------------------
Whisper Transcription Complete
---------------------------------------------------------------------------------------------------
Progress:70.0
Description:Whisper Transcription complete -> Initializing speaker diariazation model...
---------------------------------------------------------------------------------------------------
Starting Speaker Diarization
---------------------------------------------------------------------------------------------------


Output()

Progress:80.0
Description:Speaker Diariazation Model Initialized -> Starting speaker diariazation...


Speaker Diarization Done!
Progress:90.0
Description:Speaker Diariazation Complete -> Starting Transcript Preparation...
---------------------------------------------------------------------------------------------------
Starting Transcript Preparation
---------------------------------------------------------------------------------------------------
Starting cleaning
---------------------------------------------------------------------------------------------------
Pipeline Completed!
---------------------------------------------------------------------------------------------------
Progress:99.0
Description:Transcript Complete...
Transcription complete.
+---------------------------------------------+----------+--------------------+---------------------+--------------------+-------------------+
| f                                           | i        | r                  | s                   | t                  | c                 |
| Video URL/File Name                         | Lan

In [None]:
#@markdown ###### **Current Issues**
#@markdown ---

# Main issue - Memory issue with Demucs - change this if possible
# Noise reduce
# Dynamic Range Compression
# Loads of bugs and wrong version control

# Gemini sometimes does not translate/summarize into the target language (in the case of a low-resource language like Kannada) (the Reload option works)

# -----------------------------------------------------------------------------------------------