# Multilingual Many-to-Many Translation

[mBART-50](https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt) is a multilingual machine translation model.

It was introduced in the paper *Multilingual Translation with Extensible Multilingual Pretraining and Finetuning*. The many-to-many checkpoint used here can translate directly between any pair of its 50 supported languages.


This example does not use the Hugging Face `pipeline` API; it calls the tokenizer and the model directly.

In [None]:
import warnings

warnings.filterwarnings("ignore")

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

In [None]:
# Provide the id of the free GPU that you intend to use.
# The "auto" option will not work here, so specify the GPU id explicitly.
# Setting `gpu_id = None` will not use any GPU.
gpu_id = 0

model = "facebook/mbart-large-50-many-to-many-mmt"
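
The `gpu_id` convention above (an integer index, or `None` for CPU) can be made explicit by converting it to a PyTorch-style device string before it is passed to `.to(...)`. The helper below is a minimal sketch and not part of the original notebook; `device_string` is a hypothetical name introduced only for illustration.

```python
from typing import Optional


def device_string(gpu_id: Optional[int]) -> str:
    """Map a GPU index (or None) to a PyTorch-style device string.

    Hypothetical helper for illustration; not part of the notebook.
    """
    if gpu_id is None:
        return "cpu"
    if gpu_id < 0:
        raise ValueError(f"gpu_id must be non-negative, got {gpu_id}")
    return f"cuda:{gpu_id}"


print(device_string(0))     # cuda:0
print(device_string(None))  # cpu
```

Under this assumption, the string could be used wherever the notebook passes `gpu_id`, for example `.to(device_string(gpu_id))`.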

In [None]:
# Get the pre-trained model and load it to the GPU
translator = AutoModelForSeq2SeqLM.from_pretrained(model).to(gpu_id)

# To get a list of all available language codes
available_language_codes = list(
    AutoTokenizer.from_pretrained(model).lang_code_to_id.keys()
)
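
Language codes are case-sensitive (`zh_CN`, not `zh_cn`), so a small lookup helper can save a failed call. The sketch below uses a hand-picked subset of the mBART-50 codes so that it runs without downloading the tokenizer; in the notebook, `available_language_codes` would take its place. `find_language_code` and `SAMPLE_CODES` are hypothetical names, not part of the original code.

```python
# A small subset of mBART-50 language codes, hard-coded so the example
# runs offline. In the notebook, use `available_language_codes` instead.
SAMPLE_CODES = ["en_XX", "fi_FI", "hi_IN", "zh_CN", "de_DE", "fr_XX"]


def find_language_code(query: str, codes: list[str]) -> str:
    """Return the canonical code matching `query`, ignoring case.

    Accepts either a full code ("zh_cn") or a bare language prefix ("zh").
    Raises ValueError when no code, or more than one code, matches.
    """
    q = query.strip().lower()
    matches = [c for c in codes if c.lower() == q or c.lower().split("_")[0] == q]
    if len(matches) != 1:
        raise ValueError(f"{query!r} matched {matches or 'no codes'}")
    return matches[0]


print(find_language_code("zh_cn", SAMPLE_CODES))  # zh_CN
print(find_language_code("hi", SAMPLE_CODES))     # hi_IN
```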

In [None]:
def translate(text: str, src_lang: str, trgt_lang: str) -> list[str]:
    """Translates given text from source to target language.

    Args:
        text (str): Input text to be translated
        src_lang (str): Source language code, e.g. 'zh_CN'.
        trgt_lang (str): Target language code, e.g. 'en_XX'.
                        See `available_language_codes` for the complete list.
    Returns:
        list[str]: A list containing the translated text in target language.

    """
    # Check if the language codes are valid
    assert (
        src_lang in available_language_codes and trgt_lang in available_language_codes
    ), f"'src_lang' and 'trgt_lang' must be one of {available_language_codes}"

    # Define a tokeniser for tokenising the source language
    tokenizer = AutoTokenizer.from_pretrained(model, src_lang=src_lang)

    # Tokenise
    encoded_text = tokenizer(text, return_tensors="pt").to(gpu_id)

    # Generate translated tokens
    generated_tokens = translator.generate(
        **encoded_text, forced_bos_token_id=tokenizer.lang_code_to_id[trgt_lang]
    )
    # Decode tokens to words and sentences
    translated = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)

    return translated

Note:

Because this is a many-to-many multilingual translation model (50 languages), the target language id must be forced as the first generated token; otherwise the model has no indication of which language to translate into.

This is done through the BOS (beginning-of-sentence) token: pass the token id of the target language code, `tokenizer.lang_code_to_id[trgt_lang]`, as `forced_bos_token_id` to the `generate` method.
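
To make the effect of forcing the first token concrete, the toy sketch below mimics the idea in plain Python: at the first decoding step, every score except the forced token's is set to negative infinity, so the decoder can only emit the target language code. This is a simplified illustration under stated assumptions, not the `transformers` implementation; the function name, the toy vocabulary and the scores are all made up.

```python
import math


def force_first_token(scores: list[float], step: int, forced_id: int) -> list[float]:
    """Mask all scores except `forced_id` at the first decoding step.

    Toy illustration of forcing a beginning-of-sentence token;
    scores for later steps are returned unchanged.
    """
    if step != 0:
        return scores
    return [0.0 if i == forced_id else -math.inf for i in range(len(scores))]


# Made-up vocabulary: ids 0-2 are language codes, id 3 is an ordinary word.
vocab = ["en_XX", "hi_IN", "fi_FI", "hello"]
raw_scores = [0.1, 0.2, 0.3, 2.5]  # without forcing, "hello" would win

forced = force_first_token(raw_scores, step=0, forced_id=vocab.index("hi_IN"))
first_token = vocab[max(range(len(forced)), key=forced.__getitem__)]
print(first_token)  # hi_IN
```

Without the masking step, the highest raw score would pick an ordinary word, which is why the target language code has to be forced rather than left to the model.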

In [None]:
# input text in multiple languages
en_text = (
    "Do not meddle in the affairs of wizards, for they are subtle and quick to anger."
)
chinese_text = "不要插手巫師的事務, 因為他們是微妙的, 很快就會發怒."
finnish_text = (
    "Älä sekaannu velhojen asioihin, sillä ne ovat hienovaraisia ja nopeasti vihaisia."
)
hindi_text = "तंत्रिकाओं के कार्यों को हाथ में न डालें, क्योंकि वे सूक्ष्म होते हैं, वे जल्दी उग्र होते हैं."

In [None]:
# Chinese to English
translate(chinese_text, src_lang="zh_CN", trgt_lang="en_XX")

In [None]:
# Finnish to English
translate(finnish_text, src_lang="fi_FI", trgt_lang="en_XX")

In [None]:
# Chinese to Hindi
translate(chinese_text, src_lang="zh_CN", trgt_lang="hi_IN")

In [None]:
# Hindi to English
translate(hindi_text, src_lang="hi_IN", trgt_lang="en_XX")