  # **LLM for Language Translation**
  By ***Mohamed Shafeek T***

---

This notebook is a simple, beginner-level demonstration of Neural Machine Translation (NMT) using the mBART (Multilingual Bidirectional and Auto-Regressive Transformers) model.

It shows how mBART can be used to translate English into several Indian languages.

**English ->
Tamil, Bengali, Gujarati, Hindi, Malayalam, Marathi and Telugu.**




---



**NMT :**

- NMT uses deep neural networks to translate text from one language to another.

- Unlike traditional statistical machine translation, which relies on phrase tables and separately trained components, NMT systems learn the mapping between languages end to end from large amounts of parallel text.



**mBART :**

- It is a sequence-to-sequence model introduced by Facebook AI that pairs a bidirectional encoder with an autoregressive decoder.

- mBART, being a variant of the BART model, uses the Transformer architecture as its backbone.

- It is pretrained as a denoising auto-encoder on large-scale monolingual corpora in many languages using the BART objective: the input text is corrupted and the model learns to reconstruct the original.

- mBART is one of the first methods for pretraining a complete sequence-to-sequence model by denoising full texts in multiple languages, while previous approaches focused only on the encoder, the decoder, or reconstructing parts of the text.

- The main advantage of mBART is that a single model covers many languages, so separate models per language pair are not needed. The checkpoint used below has already been fine-tuned for English-to-many translation, so this demo needs no further fine-tuning.
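
The denoising objective described above can be illustrated with a toy noising function. This is a minimal sketch and not mBART's actual preprocessing: it works on whitespace-split words rather than SentencePiece tokens, and the names `sample_poisson` and `add_noise`, the `<mask>` string, and the defaults (35% of words masked, span lengths drawn from a Poisson distribution with mean 3.5, sentence order shuffled) are assumptions based on the noise scheme described in the mBART paper. During pretraining, the model would receive text like the output of `add_noise` and be trained to reproduce the original.

```python
import math
import random


def sample_poisson(rng, lam):
    """Draw one Poisson-distributed integer using Knuth's method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def add_noise(sentences, mask_ratio=0.35, mean_span=3.5, seed=0):
    """Shuffle sentence order, then replace random word spans with one <mask> each."""
    rng = random.Random(seed)
    shuffled = list(sentences)
    rng.shuffle(shuffled)  # sentence permutation

    words = " ".join(shuffled).split()
    budget = int(len(words) * mask_ratio)  # total number of words to hide
    while budget > 0:
        # Span length is at least 1 and never larger than the remaining budget
        span = max(1, min(sample_poisson(rng, mean_span), budget))
        # Only pick spans that do not overlap an earlier mask
        candidates = [
            i for i in range(len(words) - span + 1)
            if "<mask>" not in words[i:i + span]
        ]
        if not candidates:
            break
        start = rng.choice(candidates)
        words[start:start + span] = ["<mask>"]  # whole span becomes a single mask
        budget -= span
    return " ".join(words)
```

Calling `add_noise(["I love new technologies .", "They change quickly ."])` returns the two sentences in shuffled order with some word spans replaced by `<mask>`.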

In [None]:
# Installing necessary packages if not already installed
!pip install transformers -U -q
!pip install sentencepiece

In [1]:
# Importing required libraries
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

In [2]:
# Loading the MBart model and tokenizer
model_name = "facebook/mbart-large-50-one-to-many-mmt"
model = MBartForConditionalGeneration.from_pretrained(model_name)
tokenizer = MBart50TokenizerFast.from_pretrained(model_name, src_lang="en_XX")


In [3]:
# Defining the English article for translation
article_en = "I love new technologies"

In [20]:
# Getting the list of some Indian languages supported by MBart
indian_languages = ["ta_IN", "bn_IN", "gu_IN", "hi_IN", "ml_IN", "mr_IN", "te_IN"]
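
For more readable output, the language codes can be mapped to the language names listed at the top of the notebook. The dictionary below is a small convenience sketch; `LANGUAGE_NAMES` and `describe` are hypothetical names and not part of `transformers`.

```python
# Hypothetical helper: human-readable names for the mBART-50 codes used here
LANGUAGE_NAMES = {
    "ta_IN": "Tamil",
    "bn_IN": "Bengali",
    "gu_IN": "Gujarati",
    "hi_IN": "Hindi",
    "ml_IN": "Malayalam",
    "mr_IN": "Marathi",
    "te_IN": "Telugu",
}


def describe(lang_code):
    """Return the language name for a code, or the code itself if it is unknown."""
    return LANGUAGE_NAMES.get(lang_code, lang_code)
```

With this in place, a print statement could use `describe(lang_code)` instead of the raw code.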

In [21]:
# Translating the English article to every Indian language
# Tokenizing once; the source text is the same for every target language
model_inputs = tokenizer(article_en, return_tensors="pt")

# Collecting the outputs so that later cells (e.g. BLEU scoring) can reuse them
translations = {}

for lang_code in indian_languages:
    # Forcing the target-language token as the first generated token
    generated_tokens = model.generate(
        **model_inputs,
        forced_bos_token_id=tokenizer.lang_code_to_id[lang_code]
    )

    # Decoding the generated tokens to get the translated text
    translation = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
    translations[lang_code] = translation

    # Printing the translation for the current language
    print(f"Translation to {lang_code}: {translation}")

Translation to ta_IN: புதிய தொழில்நுட்பங்களை நான் விரும்புகிறேன்.
Translation to bn_IN: আমি নতুন প ্ রযুক ্ তি পছন ্ দ করি
Translation to gu_IN: নতুন প ্ রযুক ্ তি
Translation to hi_IN: मैं नई प्रौद्योगिकियों से प्यार करता हूँ
Translation to ml_IN: എനിക്ക് പുതിയ സാങ്കേതികവിദ്യകള് ഇഷ്ടമാണ്.
Translation to mr_IN: मला नवीन तंत ् रज ् ञान आवडते.
Translation to te_IN: నేను కొత్త ტექნოლოგიები ప్రేమ
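
Some of these outputs are not in the script of the requested language: the `gu_IN` line is written in Bengali script, and the `te_IN` line contains a Georgian-script word. One quick sanity check is to confirm that every letter of a translation belongs to the expected script. The sketch below uses only the standard library; `EXPECTED_SCRIPT` and `unexpected_scripts` are hypothetical names, and the script is inferred from the first word of each character's Unicode name, which is a heuristic rather than a proper script-property lookup.

```python
import unicodedata

# Assumed mapping from language code to the first word of its Unicode character names
EXPECTED_SCRIPT = {
    "ta_IN": "TAMIL",
    "bn_IN": "BENGALI",
    "gu_IN": "GUJARATI",
    "hi_IN": "DEVANAGARI",
    "ml_IN": "MALAYALAM",
    "mr_IN": "DEVANAGARI",
    "te_IN": "TELUGU",
}


def unexpected_scripts(text, lang_code):
    """Return the set of scripts found in `text` other than the expected one."""
    expected = EXPECTED_SCRIPT[lang_code]
    found = set()
    for ch in text:
        # Only letters (L*) and combining marks (M*) carry script information
        if unicodedata.category(ch)[0] not in ("L", "M"):
            continue
        name = unicodedata.name(ch, "")
        script = name.split(" ")[0] if name else "UNKNOWN"
        if script != expected:
            found.add(script)
    return found
```

An empty set means the text uses only the expected script; a non-empty set, as for the `gu_IN` and `te_IN` lines above, is a signal to inspect that translation by hand.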


## 🔍 Evaluation of Translations
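We use the BLEU score to evaluate translation quality.
BLEU compares the model output against reference translations, so the scores are only meaningful once the placeholder references below are replaced with human translations of the source sentence.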
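
BLEU is built on clipped n-gram precision: the fraction of n-grams in the model output that also occur in the reference, with each n-gram credited at most as many times as it appears in the reference. The sketch below shows only that ingredient on whitespace tokens. It leaves out the brevity penalty, the combination of several n-gram orders, and the tokenization that `sacrebleu` applies, so its numbers will not match `sacrebleu` scores. The function name `ngram_precision` is hypothetical.

```python
from collections import Counter


def ngram_precision(hypothesis, reference, n=1):
    """Clipped n-gram precision of `hypothesis` against a single `reference`."""
    hyp_tokens = hypothesis.split()
    ref_tokens = reference.split()

    # Count every n-gram in the hypothesis and in the reference
    hyp_ngrams = Counter(
        tuple(hyp_tokens[i:i + n]) for i in range(len(hyp_tokens) - n + 1)
    )
    ref_ngrams = Counter(
        tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1)
    )

    total = sum(hyp_ngrams.values())
    if total == 0:
        return 0.0

    # Each hypothesis n-gram is credited at most as often as it occurs in the reference
    matched = sum(min(count, ref_ngrams[gram]) for gram, count in hyp_ngrams.items())
    return matched / total
```

The clipping is what stops a degenerate output such as a single repeated word from scoring well: the repeated word only counts as often as the reference contains it.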

In [None]:

!pip install -q sacrebleu
import sacrebleu


In [None]:

# Placeholders only: these sentences are NOT translations of `article_en`.
# Replace them with actual reference translations for a meaningful evaluation.
reference_translations = {
    'hi_IN': 'यह एक उदाहरण वाक्य है।',
    'ta_IN': 'இது ஒரு எடுத்துக்காட்டு வாசகம்.',
    'bn_IN': 'এটি একটি উদাহরণ বাক্য।'
}

# `translations` must map each language code to the model output,
# e.g. a dictionary filled in the translation loop
bleu_scores = {}
for lang, hypothesis in translations.items():
    reference = reference_translations.get(lang)
    if reference:
        # One hypothesis, scored against one reference stream holding one sentence
        bleu = sacrebleu.corpus_bleu([hypothesis], [[reference]])
        bleu_scores[lang] = bleu.score
        print(f"BLEU score for {lang}: {bleu.score:.2f}")
    else:
        print(f"No reference translation for {lang}, skipping BLEU score.")


In [None]:

import sacrebleu

# Example: Reference translations (manually filled or approximated)
# You can modify these with correct translations in each language for proper BLEU scoring
# This is a sample and should ideally be the human-correct translation of the original English text.
references = {
    'hi_IN': ["यह एक परीक्षण लेख है।"],
    'bn_IN': ["এটি একটি পরীক্ষা নিবন্ধ।"],
    'ta_IN': ["இது ஒரு சோதனை கட்டுரை."],
    'te_IN': ["ఇది ఒక పరీక్షా వ్యాసం."],
    'gu_IN': ["આ એક પરીક્ષણ લેખ છે."]
}

# Translations generated by the model (add this if it's not already in the notebook)
# The actual generated translations will be filled during the translation step
# Assuming you collected all outputs in a dictionary: model_translations[lang_code] = translated_text
# Here's a dummy example:
model_translations = {
    'hi_IN': "यह एक परीक्षण लेख है।",
    'bn_IN': "এটি একটি পরীক্ষা নিবন্ধ।",
    'ta_IN': "இது ஒரு சோதனை கட்டுரை.",
    'te_IN': "ఇది ఒక పరీక్షా వ్యాసం.",
    'gu_IN': "આ એક પરીક્ષણ લેખ છે."
}

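# Compute BLEU scores
for lang in references:
    # sacrebleu expects a list of hypotheses and a list of reference streams,
    # where each stream holds one reference per hypothesis
    hypotheses = [model_translations[lang]]
    reference_streams = [references[lang]]
    bleu = sacrebleu.corpus_bleu(hypotheses, reference_streams)
    print(f"BLEU score for {lang}: {bleu.score:.2f}")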
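
The dummy `model_translations` dictionary above can be replaced by real model output. The helper below is a sketch of one way to collect it; `translate_all` is a hypothetical name, and it assumes a model and tokenizer that behave like the `MBartForConditionalGeneration` and `MBart50TokenizerFast` objects loaded earlier, including the `lang_code_to_id` mapping used in the translation loop.

```python
def translate_all(text, lang_codes, model, tokenizer):
    """Translate `text` into each language code and return {lang_code: translation}."""
    # Tokenize once; the same source text is reused for every target language
    model_inputs = tokenizer(text, return_tensors="pt")

    results = {}
    for lang_code in lang_codes:
        # Force the target-language token as the first generated token
        generated = model.generate(
            **model_inputs,
            forced_bos_token_id=tokenizer.lang_code_to_id[lang_code],
        )
        results[lang_code] = tokenizer.batch_decode(
            generated, skip_special_tokens=True
        )[0]
    return results
```

In this notebook it would be called as `model_translations = translate_all(article_en, indian_languages, model, tokenizer)` before the BLEU loop.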



## ✅ Inference Summary
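BLEU scores give a rough, automatic indication of how close the translations are to the references.
Remember, BLEU is just one metric, and a single short sentence is too little data for it to be reliable. Human evaluation is critical for high-quality translation tasks: some of the sample outputs above are incomplete or in the wrong script, which a reader spots immediately.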
