<a href="https://colab.research.google.com/github/aastha2003gupta/speech_translator/blob/main/speech_translator_mbart.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Importing dependencies
This section installs the Whisper ASR library, checks for GPU availability, and loads the pre-trained "large" Whisper model. The device variable is set to "cuda" if a GPU is available and to "cpu" otherwise, and the model is loaded onto that device for automatic speech recognition.





In [None]:
!pip -qqq install git+https://github.com/openai/whisper.git

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.0/2.0 MB 9.9 MB/s eta 0:00:00
  Building wheel for openai-whisper (pyproject.toml) ... done
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
llmx 0.0.15a0 requires cohere, which is not installed.
llmx 0.0.15a0 requires openai, which is not installed.

In [None]:
import whisper
import torch

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
whisper_model = whisper.load_model("large", device=device)

100%|██████████████████████████████████████| 2.88G/2.88G [00:19<00:00, 161MiB/s]


# Speech to text
This section transcribes the speech in the specified MP3 file (audio_file) with the loaded Whisper ASR model (whisper_model). The transcribed text is then printed with print(result['text']).



In [None]:
audio_file = '/content/audio.mp3'
result = whisper_model.transcribe(audio_file)

In [None]:
print(result['text'])

 What is the full form of RBI? Reserve Bank of India. Okay. The price of petrol went from Rs. 5 in the 80s to more than Rs. 100 in 2023. What is this phenomenon called? Inflation. Correct. By the way, yeah. This is yours. Which is the largest commercial bank in India? Largest commercial bank? Depends, like are you talking about private sector or public sector? Everything included. SBI. Correct. What is the strongest currency? In the world. Strongest currency. One of the Middle East countries, I'm not sure. I think Qatari something, something. I'm not sure, but yeah. Even I don't know. What's the answer? Correct. I'll give you that. I'll give you that. And the last question is, which country was the first to launch UPI? India. That was a trick question and it's correct. How is that a trick question? It literally is from India, bro.
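
Besides the text field printed above, the dictionary returned by transcribe also carries a language field and a list of segments with start and end times in seconds. The following is a minimal sketch of turning those segments into timestamped lines. The sample_result dictionary is hand-written, hypothetical data in the same shape, not output from the audio file used here, and format_timestamp and format_segments are illustrative helper names.

```python
def format_timestamp(seconds: float) -> str:
    """Render a time offset in seconds as MM:SS.mmm."""
    total_ms = int(round(seconds * 1000))
    minutes, rem_ms = divmod(total_ms, 60_000)
    secs, ms = divmod(rem_ms, 1000)
    return f"{minutes:02d}:{secs:02d}.{ms:03d}"


def format_segments(result: dict) -> list[str]:
    """Turn Whisper-style segments into '[start -> end] text' lines."""
    lines = []
    for segment in result.get("segments", []):
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        lines.append(f"[{start} -> {end}] {segment['text'].strip()}")
    return lines


# Hypothetical data shaped like a Whisper result (not real model output).
sample_result = {
    "text": " What is the full form of RBI? Reserve Bank of India.",
    "language": "en",
    "segments": [
        {"start": 0.0, "end": 2.5, "text": " What is the full form of RBI?"},
        {"start": 2.5, "end": 64.25, "text": " Reserve Bank of India."},
    ],
}

for line in format_segments(sample_result):
    print(line)
```

With a real result, the same helper could be called as format_segments(result) after the transcription cell has run.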


# Detecting language of the transcribed text
This section uses the langdetect library to detect the language of the text transcribed by the Whisper ASR model. The detected language code is then mapped to the corresponding mBART code through the predefined dictionary langdetect_to_mbart_mapping. Finally, the code prints the original text, the language code detected by langdetect, and the corresponding mBART language code. If the detected language is not present in the mapping, the mBART code defaults to English ("en_XX").

In [None]:
!pip install langdetect

Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 981.5/981.5 kB 6.7 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: langdetect
  Building wheel for langdetect (setup.py) ... done
  Created wheel for langdetect: filename=langdetect-1.0.9-py3-none-any.whl size=993224 sha256=a381bb12e505ee7b322ded0dc78584539410fd8cb429c754bcd151e8d6944f0b
  Stored in directory: /root/.cache/pip/wheels/95/03/7d/59ea870c70ce4e5a370638b5462a7711ab78fba2f655d05106
Successfully built langdetect
Installing collected packages: langdetect
Successfully installed langdetect-1.0.9


In [None]:
from langdetect import detect

# Provided mapping from langdetect language code to mBART format
langdetect_to_mbart_mapping = {
    "ar": "ar_AR", "cs": "cs_CZ", "de": "de_DE", "en": "en_XX", "es": "es_XX",
    "et": "et_EE", "fi": "fi_FI", "fr": "fr_XX", "gu": "gu_IN", "hi": "hi_IN",
    "it": "it_IT", "ja": "ja_XX", "kk": "kk_KZ", "ko": "ko_KR", "lt": "lt_LT",
    "lv": "lv_LV", "my": "my_MM", "ne": "ne_NP", "nl": "nl_XX", "ro": "ro_RO",
    "ru": "ru_RU", "si": "si_LK", "tr": "tr_TR", "vi": "vi_VN", "zh-cn": "zh_CN",
    "af": "af_ZA", "az": "az_AZ", "bn": "bn_IN", "fa": "fa_IR", "he": "he_IL",
    "hr": "hr_HR", "id": "id_ID", "ka": "ka_GE", "km": "km_KH", "mk": "mk_MK",
    "ml": "ml_IN", "mn": "mn_MN", "mr": "mr_IN", "pl": "pl_PL", "ps": "ps_AF",
    "pt": "pt_XX", "sv": "sv_SE", "sw": "sw_KE", "ta": "ta_IN", "te": "te_IN",
    "th": "th_TH", "tl": "tl_XX", "uk": "uk_UA", "ur": "ur_PK", "xh": "xh_ZA",
    "gl": "gl_ES", "sl": "sl_SI"
}

# Detect the language using langdetect
detected_lang_code = detect(result['text'])

# Convert langdetect language code to mBART format
mbart_lang_code = langdetect_to_mbart_mapping.get(detected_lang_code, "en_XX")  # Default to English if not found

print(f"Input Text: {result['text']}")
print(f"Detected Language Code (langdetect): {detected_lang_code}")
print(f"mBART Language Code: {mbart_lang_code}")


Input Text:  What is the full form of RBI? Reserve Bank of India. Okay. The price of petrol went from Rs. 5 in the 80s to more than Rs. 100 in 2023. What is this phenomenon called? Inflation. Correct. By the way, yeah. This is yours. Which is the largest commercial bank in India? Largest commercial bank? Depends, like are you talking about private sector or public sector? Everything included. SBI. Correct. What is the strongest currency? In the world. Strongest currency. One of the Middle East countries, I'm not sure. I think Qatari something, something. I'm not sure, but yeah. Even I don't know. What's the answer? Correct. I'll give you that. I'll give you that. And the last question is, which country was the first to launch UPI? India. That was a trick question and it's correct. How is that a trick question? It literally is from India, bro.
Detected Language Code (langdetect): en
mBART Language Code: en_XX
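
The lookup above falls back to "en_XX" silently, so an unsupported language looks the same as detected English. A minimal sketch of making the fallback visible is shown here. The helper name to_mbart_code and the three-entry mapping are illustrative assumptions; the mapping repeats only a small subset of langdetect_to_mbart_mapping so the sketch runs on its own.

```python
# A small subset of the notebook's mapping, repeated so the sketch runs alone.
LANGDETECT_TO_MBART = {"en": "en_XX", "hi": "hi_IN", "zh-cn": "zh_CN"}


def to_mbart_code(detected: str, mapping: dict, default: str = "en_XX"):
    """Return (mbart_code, matched); matched is False when the default was used."""
    key = detected.strip().lower()
    if key in mapping:
        return mapping[key], True
    return default, False


# "da" (Danish) stands in for a detected language that has no mBART entry.
for code in ["en", "ZH-CN", "da"]:
    mbart_code, matched = to_mbart_code(code, LANGDETECT_TO_MBART)
    note = "" if matched else " (fallback)"
    print(f"{code} -> {mbart_code}{note}")
```

Returning the matched flag lets the caller decide whether to warn, skip translation, or continue with the default.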


# Translate to different languages using mbart
This section uses the Hugging Face Transformers library to perform machine translation with mBART-50. It loads the pre-trained facebook/mbart-large-50-one-to-many-mmt model and its tokenizer; "mmt" stands for multilingual machine translation, and the one-to-many checkpoint translates from English into the other supported languages. The input text, previously transcribed by the Whisper ASR model, is translated into the several Indian languages specified in the languages_info dictionary. The translations, along with language codes and names, are stored in the translations_data dictionary, and the translated output is printed for each language. The sentencepiece library is installed first because the mBART tokenizer depends on it.

In [None]:
! pip install sentencepiece

Collecting sentencepiece
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.3/1.3 MB 9.9 MB/s eta 0:00:00
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.99


In [None]:
!pip freeze | grep transformers

transformers==4.35.2


In [None]:
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

In [None]:
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-one-to-many-mmt")

config.json:   0%|          | 0.00/1.43k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/2.44G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/261 [00:00<?, ?B/s]

In [None]:
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-one-to-many-mmt", src_lang=mbart_lang_code)

In [None]:
model_inputs = tokenizer(result['text'], return_tensors="pt")

In [None]:
languages_info = {
    "hi_IN": "Hindi",
    "gu_IN": "Gujarati",
    "bn_IN": "Bengali",
    "ta_IN": "Tamil",
    "te_IN": "Telugu",
    "ml_IN": "Malayalam",
    "mr_IN": "Marathi",
    "ur_PK": "Urdu"
}
translations_data = {}
for lang_code, lang_name in languages_info.items():
    generated_tokens = model.generate(
        **model_inputs,
        forced_bos_token_id=tokenizer.lang_code_to_id[lang_code]
    )
    translation = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
    translations_data[lang_code] = {
        'lang_name': lang_name,
        'translation': translation
    }

    print(f"Language Code: {lang_code}, Language Name: {lang_name}, Translation: {translation}")


Language Code: hi_IN, Language Name: Hindi, Translation: ['आरबीआई का पूरा रूप क्या है? भारतीय रिज़र्व बैंक. ठीक है. पैट्रोल की कीमत 80 के दशक में 5 रु. से बढ़कर 2023 में 100 रु. से अधिक हो गई. इस घटना को क्या कहा जाता है? मुद्रास्फीति. ठीक है. वैसे ही, हाँ. यह आपका है. भारत में सबसे बड़ा वाणिज्यिक बैंक क्या है? सबसे बड़ा वाणिज्यिक बैंक क्या है? निर्भर करता है, जैसे आप निजी क्षेत्र या सार्वजनिक क्षेत्र के बारे में बात कर रहे हैं? सब कुछ शामिल है. एसबीआई. ठीक है. सबसे मजबूत मुद्रा क्या है? विश्व में. सबसे मजबूत मुद्रा. मध्य पूर्व देशों में से एक, मुझे विश्वास नहीं है. मुझे लगता है कि कातर कुछ, कुछ. मुझे विश्वास नहीं है, लेकिन हाँ. मुझे भी पता नहीं है. जवाब क्या है? ठीक है. मैं आपको दे दूँगा कि. मैं आपको दे दूँगा कि. और अंतिम सवाल यह है, कौन']
Language Code: gu_IN, Language Name: Gujarati, Translation: ["RBI-નો পুরো ઢાંચાઓ কি? ભારતીય રિઝર્વ બેંક. ঠিক আছে. પેટ્રોનો מחיר ৮০ 'র দশকে Rs.5 থেকে 2023' তে Rs.100 এর ও বেশি হয়ে গেছে. ამ მოვლენას কি বলা হয়? ફુગાવો. ঠিক আছে. ویسے, ඔව්. এটা మీదే. ভ
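
The Hindi translation above stops mid-sentence, which suggests the output was cut off by a generation length limit (an assumption; the notebook does not set one explicitly). One possible workaround is to translate the transcript in smaller pieces. The sketch below only shows the splitting step: split_into_chunks is a hypothetical helper that groups whole sentences into chunks under a word budget, and the max_words values are arbitrary. The naive regular expression also splits after abbreviations such as "Rs.", so a real transcript may need a better sentence splitter.

```python
import re


def split_into_chunks(text: str, max_words: int = 40) -> list[str]:
    """Group whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words becomes its own chunk.
    """
    # Split after '.', '?' or '!' when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and count + n_words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = (
    "What is the full form of RBI? Reserve Bank of India. Okay. "
    "Which is the largest commercial bank in India? SBI. Correct."
)
for chunk in split_into_chunks(sample, max_words=12):
    print(chunk)
```

Each chunk could then be tokenized and passed to model.generate separately, and the decoded pieces joined per language.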

# Calculating WER
This section calculates a Word Error Rate (WER) between the reference text (transcribed by the Whisper ASR model) and the translated text for each language in the translations_data dictionary. It uses the NLTK library's edit_distance function on the two word lists to count the insertions, deletions, and substitutions needed to transform one into the other. The WER scores, normalized by the number of words in the reference text, are stored in the wer_scores dictionary and printed. WER is normally measured between two texts in the same language; because the reference here is English and each translation is in another language, almost no words match, so the scores are close to 1.0 and do not indicate translation quality.




In [None]:
from nltk.metrics.distance import edit_distance

In [None]:
wer_scores = {}

# Split the English transcript into words once; it is the reference for every language
reference_words = result['text'].split()

for lang_code, lang_data in translations_data.items():
    # batch_decode returns a list with one string per input, so take the first item
    translated_text = lang_data['translation'][0]
    translated_words = translated_text.split()

    # Word-level edit distance, normalized by the number of reference words
    wer = edit_distance(reference_words, translated_words)
    wer_rate = wer / len(reference_words)

    # Store the WER for the translation
    wer_scores[lang_code] = wer_rate

print(wer_scores)

{'hi_IN': 0.9865771812080537, 'gu_IN': 1.0, 'bn_IN': 1.0, 'ta_IN': 0.9932885906040269, 'te_IN': 1.0, 'ml_IN': 0.9865771812080537, 'mr_IN': 1.0, 'ur_PK': 0.9865771812080537}
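
To make the metric concrete, here is a minimal pure-Python sketch of the same calculation: a word-level Levenshtein distance divided by the number of reference words. It is an illustrative re-implementation, not NLTK's code, and the function names are assumptions. The last call shows why an English reference compared with text in another language scores 1.0.

```python
def word_edit_distance(reference: list[str], hypothesis: list[str]) -> int:
    """Levenshtein distance over word sequences (each edit costs 1)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance between the word lists, divided by the reference length."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(ref_words, hypothesis.split()) / len(ref_words)


print(word_error_rate("reserve bank of india", "reserve bank of india"))  # 0.0
print(word_error_rate("reserve bank of india", "reserve bank india"))     # 0.25
print(word_error_rate("reserve bank of india", "भारतीय रिज़र्व बैंक"))      # 1.0
```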


# Returning the translated audio in different languages
This section uses the gTTS (Google Text-to-Speech) library to generate an audio file for each language in the translations_data dictionary. For each language code and its translated text, the code creates a gTTS object, saves the generated audio in the 'translated_audio' directory under a filename based on the language code, and prints a message indicating successful audio generation. If the language is not supported, the resulting ValueError is caught and an error message is printed. The 'translated_audio' directory is created if it doesn't exist. The first attempt passes the mBART codes (such as "hi_IN") directly to gTTS, which rejects them; the second cell maps each mBART code to a gTTS code (such as "hi") first.

In [None]:
!pip install gTTS

Collecting gTTS
  Downloading gTTS-2.4.0-py3-none-any.whl (29 kB)
Installing collected packages: gTTS
Successfully installed gTTS-2.4.0


In [None]:
from gtts import gTTS
import os

output_dir = 'translated_audio'

# Create the output directory
os.makedirs(output_dir, exist_ok=True)
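for lang_code, lang_data in translations_data.items():
    # Each entry is a dict; the translation itself is a one-item list from batch_decode
    translated_text = lang_data['translation'][0]
    try:
        # The mBART code (e.g. "hi_IN") is passed to gTTS unchanged here
        tts = gTTS(text=translated_text, lang=lang_code)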
        audio_file_path = os.path.join(output_dir, f'{lang_code}_translation.mp3')
        tts.save(audio_file_path)
        print(f'Audio generated for {lang_code}')
    except ValueError as e:
        print(f"Language not supported for {lang_code}: {str(e)}")

Language not supported for hi_IN: Language not supported: hi_IN
Language not supported for gu_IN: Language not supported: gu_IN
Language not supported for bn_IN: Language not supported: bn_IN
Language not supported for ta_IN: Language not supported: ta_IN
Language not supported for te_IN: Language not supported: te_IN
Language not supported for ml_IN: Language not supported: ml_IN
Language not supported for mr_IN: Language not supported: mr_IN
Language not supported for ur_PK: Language not supported: ur_PK
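
The errors above show gTTS rejecting mBART codes such as "hi_IN". For the languages used here, the short code is the part before the underscore, so it can also be derived instead of listed by hand. The sketch below assumes a hand-written SUPPORTED set; in practice the set of codes could be taken from gtts.lang.tts_langs(). The helper name mbart_to_gtts is hypothetical.

```python
def mbart_to_gtts(mbart_code: str, supported: set, default: str = "en") -> str:
    """Map an mBART code such as 'hi_IN' to its prefix if that prefix is supported."""
    prefix = mbart_code.split("_", 1)[0]
    return prefix if prefix in supported else default


# Assumed set of short codes for illustration only.
SUPPORTED = {"hi", "gu", "bn", "ta", "te", "ml", "mr", "ur", "en"}

for code in ["hi_IN", "ur_PK", "xh_ZA"]:
    print(f"{code} -> {mbart_to_gtts(code, SUPPORTED)}")
```

A prefix outside the supported set, such as "xh" here, falls back to the default, so the caller may want to log that case rather than synthesize speech with the wrong voice.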


In [None]:
from gtts import gTTS
import os

output_dir = 'translated_audio'

# Mapping between mBART language codes and gTTS language codes
mbart_to_gtts_mapping = {
    "hi_IN": "hi",
    "gu_IN": "gu",
    "bn_IN": "bn",
    "ta_IN": "ta",
    "te_IN": "te",
    "ml_IN": "ml",
    "mr_IN": "mr",
    "ur_PK": "ur",
    # Add more mappings as needed
}

# Create the output directory
os.makedirs(output_dir, exist_ok=True)

for lang_code, lang_data in translations_data.items():
    # batch_decode returns a list with one string per input, so take the first item
    translated_text = lang_data['translation'][0]

    # Look up the gTTS code by the unmodified mBART key (e.g. "hi_IN" -> "hi")
    gtts_lang_code = mbart_to_gtts_mapping.get(lang_code, "en")  # Default to English if not found

    try:
        tts = gTTS(text=translated_text, lang=gtts_lang_code)
        audio_file_path = os.path.join(output_dir, f'{lang_code}_translation.mp3')
        tts.save(audio_file_path)
        print(f'Audio generated for {lang_code} ({gtts_lang_code})')
    except ValueError as e:
        print(f"Language not supported for {lang_code}: {str(e)}")
