**Neural Machine Translation System Using Hugging Face**

**Natural Language Processing PBL**

**Team Members:**
1. **Vedang Divekar 1032211877**
2. **Samarth More 1032221224**

In [1]:
# Setup cell
!pip install transformers sentencepiece gradio nltk
import nltk
nltk.download('punkt')

Collecting gradio
  Downloading gradio-5.23.3-py3-none-any.whl.metadata (16 kB)
Collecting aiofiles<24.0,>=22.0 (from gradio)
  Downloading aiofiles-23.2.1-py3-none-any.whl.metadata (9.7 kB)
Collecting fastapi<1.0,>=0.115.2 (from gradio)
  Downloading fastapi-0.115.12-py3-none-any.whl.metadata (27 kB)
Collecting ffmpy (from gradio)
  Downloading ffmpy-0.5.0-py3-none-any.whl.metadata (3.0 kB)
Collecting gradio-client==1.8.0 (from gradio)
  Downloading gradio_client-1.8.0-py3-none-any.whl.metadata (7.1 kB)
Collecting groovy~=0.1 (from gradio)
  Downloading groovy-0.1.2-py3-none-any.whl.metadata (6.1 kB)
Collecting pydub (from gradio)
  Downloading pydub-0.25.1-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting python-multipart>=0.0.18 (from gradio)
  Downloading python_multipart-0.0.20-py3-none-any.whl.metadata (1.8 kB)
Collecting ruff>=0.9.3 (from gradio)
  Downloading ruff-0.11.4-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (25 kB)
Collecting safehttpx<0.2.0,>=0.

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [2]:
# Install required libraries (both are already installed by the setup cell above; safe to re-run)
!pip install transformers
!pip install sentencepiece



In [3]:
import torch
from transformers import MarianMTModel, MarianTokenizer
from typing import List

class TranslationModel:
    def __init__(self):
        """
        Initialize the translation model with available language pairs
        """
        self.language_pairs = {
            'en-fr': 'Helsinki-NLP/opus-mt-en-fr',  # English to French
            'en-de': 'Helsinki-NLP/opus-mt-en-de',  # English to German
            'en-es': 'Helsinki-NLP/opus-mt-en-es',  # English to Spanish
            'fr-en': 'Helsinki-NLP/opus-mt-fr-en',  # French to English
            'de-en': 'Helsinki-NLP/opus-mt-de-en',  # German to English
            'es-en': 'Helsinki-NLP/opus-mt-es-en'   # Spanish to English
        }
        self.models = {}
        self.tokenizers = {}

    def load_model(self, source_lang: str, target_lang: str):
        """
        Load the translation model for a specific language pair
        """
        lang_pair = f"{source_lang}-{target_lang}"
        if lang_pair not in self.language_pairs:
            raise ValueError(f"Unsupported language pair: {lang_pair}")

        model_name = self.language_pairs[lang_pair]

        if lang_pair not in self.models:
            print(f"Loading model for {lang_pair}...")
            self.tokenizers[lang_pair] = MarianTokenizer.from_pretrained(model_name)
            self.models[lang_pair] = MarianMTModel.from_pretrained(model_name)

        return self.models[lang_pair], self.tokenizers[lang_pair]

    def translate(self, texts: List[str], source_lang: str, target_lang: str) -> List[str]:
        """
        Translate a list of texts from source language to target language
        """
        try:
            # Load the appropriate model and tokenizer
            model, tokenizer = self.load_model(source_lang, target_lang)

            # Tokenize the input texts
            encoded = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=512)

            # Generate translations
            translated = model.generate(**encoded)

            # Decode the whole batch of translations in one call
            translated_texts = tokenizer.batch_decode(translated, skip_special_tokens=True)

            return translated_texts

        except Exception as e:
            print(f"Translation error: {str(e)}")
            return []

def main():
    # Initialize the translation model
    translator = TranslationModel()

    # Example texts to translate
    texts = [
        "Hello, how are you?",
        "Machine learning is fascinating.",
        "I love programming in Python!"
    ]

    # Demonstrate translations to different languages
    source_lang = "en"
    target_languages = ["fr", "de", "es"]

    for target_lang in target_languages:
        print(f"\nTranslating English to {target_lang.upper()}:")
        translations = translator.translate(texts, source_lang, target_lang)

        for original, translated in zip(texts, translations):
            print(f"Original: {original}")
            print(f"Translated: {translated}\n")

# Add Gradio interface for web-based translation
import gradio as gr

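# Reuse one translator across requests so each model is loaded only once
_translator = TranslationModel()

def translate_text(text, source_lang, target_lang):
    # translate() returns [] on failure (e.g. an unsupported pair such as fr-de),
    # so guard against indexing an empty list
    translations = _translator.translate([text], source_lang, target_lang)
    if not translations:
        return f"Translation unavailable for {source_lang}-{target_lang}."
    return translations[0]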

# Create Gradio interface
def create_interface():
    language_options = ["en", "fr", "de", "es"]

    interface = gr.Interface(
        fn=translate_text,
        inputs=[
            gr.Textbox(label="Enter text to translate"),
            gr.Dropdown(choices=language_options, label="Source Language", value="en"),
            gr.Dropdown(choices=language_options, label="Target Language", value="fr")
        ],
        outputs=gr.Textbox(label="Translation"),
        title="Neural Machine Translation",
        description="Translate text between English and French, German, or Spanish (one side of each pair must be English)"
    )

    return interface

# Add batch translation functionality
def batch_translate(file_path: str, source_lang: str, target_lang: str) -> List[str]:
    """
    Translate multiple texts from a file
    """
    translator = TranslationModel()

    try:
        with open(file_path, 'r', encoding='utf-8') as file:
            texts = [line.strip() for line in file if line.strip()]

        translations = translator.translate(texts, source_lang, target_lang)

        # Save translations to a new file
        output_path = f'translations_{source_lang}_to_{target_lang}.txt'
        with open(output_path, 'w', encoding='utf-8') as file:
            for original, translated in zip(texts, translations):
                file.write(f"Original: {original}\n")
                file.write(f"Translated: {translated}\n\n")

        return translations

    except Exception as e:
        print(f"Batch translation error: {str(e)}")
        return []

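# Add evaluation metrics
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
import nltk
nltk.download('punkt')
# Newer NLTK releases (3.9+) look up 'punkt_tab' for word_tokenize, so fetch it as well
nltk.download('punkt_tab', quiet=True)

def evaluate_translation(original_text: str, translated_text: str, reference_translation: str) -> dict:
    """
    Evaluate translation quality using BLEU score.
    BLEU compares translated_text against reference_translation;
    original_text is currently unused.
    """
    try:
        # Tokenize the translations
        translated_tokens = nltk.word_tokenize(translated_text.lower())
        reference_tokens = nltk.word_tokenize(reference_translation.lower())

        # Calculate BLEU score; smoothing keeps short sentences with no
        # 4-gram overlap from collapsing to a score of zero
        smoothing = SmoothingFunction().method1
        bleu_score = sentence_bleu([reference_tokens], translated_tokens, smoothing_function=smoothing)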

        return {
            "bleu_score": bleu_score,
            "translated_length": len(translated_tokens),
            "reference_length": len(reference_tokens)
        }
    except Exception as e:
        print(f"Evaluation error: {str(e)}")
        return {}

if __name__ == "__main__":
    # Run the main translation demo
    main()

    # Launch the Gradio interface
    interface = create_interface()
    interface.launch()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!



Translating English to FR:
Loading model for en-fr...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Original: Hello, how are you?
Translated: Bonjour, comment allez-vous ?

Original: Machine learning is fascinating.
Translated: L'apprentissage automatique est fascinant.

Original: I love programming in Python!
Translated: J'adore la programmation en Python !


Translating English to DE:
Loading model for en-de...



Original: Hello, how are you?
Translated: Hallo, wie geht's?

Original: Machine learning is fascinating.
Translated: Maschinelles Lernen ist faszinierend.

Original: I love programming in Python!
Translated: Ich liebe Programmieren in Python!


Translating English to ES:
Loading model for en-es...



Original: Hello, how are you?
Translated: Hola, ¿cómo estás?

Original: Machine learning is fascinating.
Translated: El aprendizaje automático es fascinante.

Original: I love programming in Python!
Translated: ¡Me encanta la programación en Python!
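
The translations printed above can be compared against reference translations, which is what `evaluate_translation` does through NLTK's `sentence_bleu`. As a rough illustration of what that score is built from, the sketch below computes clipped (modified) n-gram precision, the core ingredient of BLEU, using only the standard library. It is a simplification under stated assumptions: whitespace tokenization instead of `nltk.word_tokenize`, no brevity penalty, and no smoothing. The function name `ngram_precision` and the reference sentence are hypothetical, chosen for this example.

```python
from collections import Counter
from typing import List


def ngram_precision(candidate: List[str], reference: List[str], n: int) -> float:
    """Clipped n-gram precision of a candidate against a single reference."""
    cand_ngrams = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref_ngrams = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(cand_ngrams.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference
    matched = sum(min(count, ref_ngrams[gram]) for gram, count in cand_ngrams.items())
    return matched / total


# Hypothetical reference written for this example (not taken from a test set)
candidate = "bonjour , comment allez-vous ?".split()
reference = "bonjour , comment vas-tu ?".split()

unigram = ngram_precision(candidate, reference, 1)
bigram = ngram_precision(candidate, reference, 2)
fourgram = ngram_precision(candidate, reference, 4)
print(unigram, bigram, fourgram)
```

Here four of five unigrams and two of four bigrams match, while no 4-gram does. A zero 4-gram precision like this one drives unsmoothed BLEU to zero on short sentences, which is why sentence-level scores are usually computed with a smoothing function.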

Running Gradio in a Colab notebook requires sharing enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://3fccf805979d4960b5.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
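
The dropdowns in the interface allow any combination of the four languages, but the six models listed in `language_pairs` all have English on one side, so a request such as French to German has no direct model. One possible way to handle this, sketched below, is to check the requested pair first and fall back to pivoting through English. The helper name `plan_route` and the pivot strategy are assumptions for illustration and are not part of the notebook; pivoting also compounds the errors of two models.

```python
from typing import List, Tuple

# Same six pairs as TranslationModel.language_pairs in the notebook
SUPPORTED_PAIRS = {"en-fr", "en-de", "en-es", "fr-en", "de-en", "es-en"}


def plan_route(source_lang: str, target_lang: str, pivot: str = "en") -> List[Tuple[str, str]]:
    """Return the (source, target) hops needed, or [] if no route exists."""
    if source_lang == target_lang:
        return []  # nothing to translate
    if f"{source_lang}-{target_lang}" in SUPPORTED_PAIRS:
        return [(source_lang, target_lang)]
    # Hypothetical fallback: go through the pivot language in two hops
    first, second = f"{source_lang}-{pivot}", f"{pivot}-{target_lang}"
    if first in SUPPORTED_PAIRS and second in SUPPORTED_PAIRS:
        return [(source_lang, pivot), (pivot, target_lang)]
    return []


print(plan_route("en", "fr"))  # direct model
print(plan_route("fr", "de"))  # two hops through English
print(plan_route("fr", "it"))  # no route
```

Each hop could then be passed to `translator.translate(texts, src, tgt)` in turn, feeding the output of one hop into the next, while an empty route lets the interface return a clear message instead of failing.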
