# **Multilingual Translation with M2M100 and Prompt Engineering**

## **Introduction**

**M2M100** is a sequence-to-sequence model developed by Facebook AI for multilingual translation. Unlike traditional models, which are often trained on specific language pairs, M2M100 is trained to translate directly between any pair of its 100 supported languages. This makes it a good fit for a translation system that covers many languages, because it can translate between any two of them without an intermediate translation step.



## **What is Prompt Engineering?**

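**Prompt engineering** is the process of designing and optimizing input prompts to get more accurate, context-aware outputs from AI models, especially large language models (LLMs) such as GPT. This notebook applies the same idea to the input text given to M2M100. Because M2M100 is a sequence-to-sequence translation model (like mBART) rather than an instruction-following LLM, any benefit from prompting should be checked empirically.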


## **Why Use Prompt Engineering in Language Translation?**

🚀 Although M2M100 is a powerful pre-trained multilingual model, it can still produce imperfect translations, particularly for complex or context-heavy sentences. Prompt engineering aims to help by:

✅ **Improving translation accuracy** by structuring the input so that the model handles it better.

✅ **Enhancing fluency** by including example translations (few-shot prompting) to guide the model.

✅ **Giving explicit instructions** that provide clear guidelines for the translation.
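
As a minimal sketch of the few-shot idea in the list above, the helper below joins example pairs and the new sentence into a single prompt string. The function name `build_few_shot_prompt` is hypothetical, and the `source -> target` line layout is an assumption that mirrors the example strings used later in this notebook. Whether M2M100 actually benefits from such a prompt has to be verified on real inputs.

```python
def build_few_shot_prompt(examples, text):
    """Join 'source -> target' example lines and the new sentence into one prompt.

    `examples` is a list of (source, target) tuples; `text` is the sentence
    to translate. Hypothetical helper that only illustrates the prompt layout.
    """
    lines = [f"{src} -> {tgt}" for src, tgt in examples]
    # The sentence to translate goes last, on its own line.
    lines.append(text)
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    [("Good morning!", "शुभ प्रभात!")],
    "What is your name?",
)
print(prompt)
```

With an empty example list the helper returns just the input sentence, so the same code path covers language pairs that have no examples.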



In [84]:
# Install necessary libraries
!pip install indic-nlp-library nltk transformers sentencepiece


## Import Required Libraries

In [94]:
import torch
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer
import nltk

* **PyTorch** (`torch`) - Provides the tensor operations that the model runs on.
* **M2M100Tokenizer** - Tokenizes input text and converts it into token IDs that the model can process.
* **M2M100ForConditionalGeneration** - A multilingual transformer model from Facebook's M2M-100 family, designed for sequence-to-sequence tasks such as translation.
* **NLTK** (`nltk`) - Used later to split paragraphs into sentences.


In [77]:
# Download NLTK Punkt tokenizer models (if not already installed)
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

## Load the Pre-trained M2M100 Model and Tokenizer

In [95]:
# Keep the checkpoint name separate from the loaded model object
model_name = "facebook/m2m100_418M"
tokenizer = M2M100Tokenizer.from_pretrained(model_name)
model = M2M100ForConditionalGeneration.from_pretrained(model_name)


## Defining Few-Shot Examples for Translation

In [96]:

# NOTE: these keys use mBART-style codes ("en_XX", "mr_IN", "hi_IN"), while the
# translate() calls later in this notebook pass M2M100 codes ("en", "mr", "hi").
# As written, the lookup in translate() finds no examples for those calls.
few_shot_examples = {
    ("en_XX", "mr_IN"): [
        "Hello, how are you? -> नमस्कार, आपण कसे आहात?",
        "Good morning! -> शुभ प्रभात!",
        "What is your name? -> तुमचे नाव काय आहे?",
        "My name is John -> माझं नाव जॉन आहे.",
        "My name is John. Do not translate the name. -> माझं नाव जॉन आहे. नावाचे भाषांतर करू नका.",
        "My name is Seema. Do not translate the name. -> माझं नाव सीमा आहे. नावाचे भाषांतर करू नका."
    ],
    ("hi_IN", "en_XX"): [
        "नमस्कार, आप कैसे हैं? -> Hello, how are you?",
        "मुझे पढ़ाई पसंद है। -> I like studying.",
        "आपका नाम क्या है? -> What is your name?",
        "मेरा नाम सीमा है। -> My name is Seema.",
    ],
    ("mr_IN", "en_XX"): [
        "तुमचे नाव काय आहे? -> What is your name?",
        "माझे नाव रोहन आहे। -> My name is Rohan.",
        "आज हवामान छान आहे। -> The weather is nice today.",
        "माझे नाव सीमा आहे। -> My name is Seema.",
    ]
}
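
M2M100 identifies languages by short codes such as `en`, `hi`, and `mr`, while the keys above use mBART-style codes with a region suffix. The sketch below shows one way to reconcile the two. Both helpers, `to_m2m100_code` and `lookup_examples`, are hypothetical (they are not part of `transformers`), and the rule of dropping everything after the underscore is an assumption that holds for the codes used in this notebook only.

```python
def to_m2m100_code(lang_code):
    """Drop a region suffix, e.g. 'en_XX' -> 'en'. Hypothetical helper."""
    return lang_code.split("_", 1)[0]


def lookup_examples(examples_by_pair, source_lang, target_lang):
    """Find few-shot examples regardless of which code style the keys use."""
    wanted = (to_m2m100_code(source_lang), to_m2m100_code(target_lang))
    for (src, tgt), examples in examples_by_pair.items():
        if (to_m2m100_code(src), to_m2m100_code(tgt)) == wanted:
            return examples
    # No matching language pair: fall back to zero-shot.
    return []


demo = {("hi_IN", "en_XX"): ["मेरा नाम सीमा है। -> My name is Seema."]}
print(lookup_examples(demo, "hi", "en"))
```

Normalizing at lookup time leaves the dictionary untouched, so the same examples can be found with either code style.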

## Function for Translation with Refined Prompt

In [109]:
# Function for translation with refined prompt
def translate(text, source_lang, target_lang):
    """Translate text while avoiding unwanted instructions in the output."""
    # Retrieve few-shot examples if available
    prompt_examples = few_shot_examples.get((source_lang, target_lang), [])

    # Refined prompt with examples
    prompt_text = "\n".join(prompt_examples) + f"\n{text} "

    # Set source language
    tokenizer.src_lang = source_lang

    # Tokenize the input text and convert it into tensors
    model_inputs = tokenizer(prompt_text, return_tensors="pt")

    # Generate translated text
    translated_tokens = model.generate(
        **model_inputs,
        forced_bos_token_id=tokenizer.get_lang_id(target_lang)  # Set target language, e.g. "en"
    )

    # Decode the translated output
    translated_text = tokenizer.decode(translated_tokens[0], skip_special_tokens=True)

    # Remove the extra parts of the prompt, only keep the translation
    translated_text = translated_text.split("\n")[-1].strip()

    return translated_text
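
The last step of `translate` keeps only the final line of the decoded output. The sketch below is a slightly more defensive variant: on the assumption that the model may echo the `->` separator from the prompt, it also drops anything before the last `->`. The name `extract_translation` is hypothetical, and the input string in the example is made up for illustration rather than taken from a real model run.

```python
def extract_translation(decoded, separator="->"):
    """Return the last non-empty line of `decoded`, minus any echoed 'source ->' prefix."""
    lines = [line.strip() for line in decoded.splitlines() if line.strip()]
    if not lines:
        return ""
    last = lines[-1]
    # If the separator was echoed, keep only the text after its last occurrence.
    if separator in last:
        last = last.rsplit(separator, 1)[1].strip()
    return last


# Illustrative input only, not real model output.
print(extract_translation("Hello -> नमस्कार\nI like to travel."))
```

Keeping this step in its own function makes it possible to test the clean-up logic without loading the model.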

## Function to Translate a Paragraph


In [104]:
# Function to translate a paragraph
def translate_paragraph(paragraph, source_lang, target_lang):
    """Translate the entire paragraph by first breaking it into sentences."""

    # Break the paragraph into sentences using NLTK's Punkt tokenizer
    sentences = nltk.sent_tokenize(paragraph)

    # Translate each sentence and collect the translations
    translated_sentences = []
    for sentence in sentences:
        translated_sentence = translate(sentence, source_lang, target_lang)
        translated_sentences.append(translated_sentence)

    # Join the translated sentences to form a paragraph
    translated_paragraph = ' '.join(translated_sentences)

    return translated_paragraph
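
`nltk.sent_tokenize` uses its English Punkt model by default, which may not treat the Devanagari danda (`।`) as a sentence boundary. As a hedged alternative for Hindi text, the sketch below splits on `.`, `!`, `?`, and `।` with a regular expression. `split_sentences` and `translate_paragraph_with` are hypothetical names, and the translation function is passed in as an argument so that the splitting logic can be tried without loading the model.

```python
import re

# Split after ., !, ? or the Devanagari danda when followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?।])\s+")


def split_sentences(paragraph):
    """Split a paragraph into sentences on common Latin and Devanagari terminators."""
    return [s for s in _SENTENCE_END.split(paragraph.strip()) if s]


def translate_paragraph_with(translate_fn, paragraph, source_lang, target_lang):
    """Translate sentence by sentence using the supplied `translate_fn`."""
    return " ".join(
        translate_fn(sentence, source_lang, target_lang)
        for sentence in split_sentences(paragraph)
    )


print(split_sentences("मेरा नाम सीमा है। आपका नाम क्या है?"))
```

In this notebook, `translate_fn` would be the `translate` function defined earlier. A plain regular expression is cruder than Punkt (it will split after abbreviations such as "Dr."), so it is best treated as a fallback for scripts that Punkt does not cover.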


## Example

In [110]:
# Example Hindi to English translation
text1 = "मुझे यात्रा करना पसंद है।"
translated_text3 = translate(text1, "hi", "en")
print(f"Hindi to English:\n{text1} -> {translated_text3}")


Hindi to English:
मुझे यात्रा करना पसंद है। -> I like to travel.


In [111]:
# Paragraph translation (Marathi to English)
paragraph_mr = """नाव सोपे, लहान आणि अर्थपूर्ण असावे. उच्चारणासाठी कठीण, लांब किंवा खिल्ली उडवली जाणारी नावे टाळावीत. प्राचीन किंवा दुर्मिळ नावांच्या ऐवजी, काळास अनुसरून असलेले आणि कॉमन नसलेले नाव निवडावे. नावाचा अर्थ सकारात्मक आणि प्रेरणादायी असावा जेणेकरून बाळाचे व्यक्तिमत्त्व चांगले विकसीत होईल."""
translated_paragraph_mr = translate_paragraph(paragraph_mr, "mr", "en")
print(f"Translated Paragraph (Marathi to English):\n{translated_paragraph_mr}")


Translated Paragraph (Marathi to English):
The name is simple, small and meaningful. Difficult, long or flushed for expression. The name of the ancient or ancient names, followed by the black and uncommon names. The name means positive and motivating rather than that the baby's personality will develop well.


In [112]:
# Paragraph translation (English to Marathi)
paragraph_en = """The Sun is the only star in our solar system. It is the center of our solar system, and its gravity holds the solar system together. Everything in our solar system revolves around it – the planets, asteroids, comets, and tiny bits of space debris."""
translated_paragraph_en = translate_paragraph(paragraph_en, "en", "mr")
print(f"Translated Paragraph (English to Marathi):\n{translated_paragraph_en}")


Translated Paragraph (English to Marathi):
सूर्य आपल्या सौर प्रणालीत एकमात्र तारा आहे. हे आपल्या सौर प्रणालीचे केंद्र आहे आणि त्याचे गुरुत्वाकर्षण सौर प्रणाली एकत्र ठेवतो. आपल्या सौर प्रणालीतील सर्व गोष्टी त्यावर घरी जातात – ग्रह, एस्टेरॉयड, कॉमेट, आणि अंतरिक्षचे छोट्या बिट.
