<a href="https://colab.research.google.com/github/bandiajay/Generative-AI/blob/main/06_Machine_Translation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <center> Machine Translation </center>

<b> Objective </b> : This worksheet shows how to translate text from one language to another with pre-trained machine translation models. It walks through loading a translation pipeline from the Hugging Face `transformers` library, translating a sentence, and repeating the process for other target languages. By the end of the session, participants should be able to set up a basic translation workflow for a language pair of their choice.

<b> Introduction </b> :  <p> Machine translation is the process of using computer algorithms to translate text from one language to another automatically. It leverages natural language processing (NLP) techniques to understand and generate text in both the source and target languages.  </p>
<p>
Language translation technology has changed how people communicate across cultural and linguistic barriers, making information accessible regardless of language. It uses machine learning models and linguistic resources to convert text or speech from one language to another while trying to preserve meaning and context.</p>

<img src="https://www.aiperspectives.com/wp-content/uploads/2020/11/iStock-811813290.jpg">

<b> Requirements: </b>
<ol>
<li> <i> transformers </i> - A library from Hugging Face that provides pre-trained models for natural language processing tasks. </li>
<li> <i> sentencepiece </i> - A library for unsupervised text tokenization and detokenization, mainly used in neural text generation systems. It works directly on raw text, so it also handles languages without clear word boundaries. The OPUS-MT tokenizers used below depend on it. </li>
</ol>

<b> Steps: </b>
<ol>
    <li> Install transformers, sentencepiece packages.</li>
     <li> Write source code
        <p>
            2.1 Import the necessary modules and, optionally, set a Hugging Face access token <br>
            2.2 Load the machine translation pipeline <br>
            2.3 Enter the input sentence <br>
            2.4 Translate to another language <br>
            2.5 Repeat with different languages <br>
        </p>
     </li>
       
</ol>

<h3>Step 1: Install transformers, sentencepiece packages </h3>

**Note:** If the command below fails, run `python -m pip install transformers` instead.

In [None]:
pip install transformers



**Note:** If the command below fails, run `python -m pip install sentencepiece` instead.

In [None]:
pip install sentencepiece
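
Before moving on, it can help to confirm that both packages are importable in the current session. The snippet below is a minimal sketch using only the standard library; the helper name `missing_packages` is made up for this worksheet and is not part of either package.

```python
import importlib.util

def missing_packages(names):
    """Return the subset of names that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# The two packages installed in Step 1.
required = ["transformers", "sentencepiece"]
missing = missing_packages(required)
print("All required packages are installed" if not missing else f"Missing: {missing}")
```

If a package is reported as missing right after installation, restarting the notebook runtime and re-running the check is usually worth trying.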



<h3>Step 2: Write source code </h3>

<h4>Step 2.1: Import necessary modules </h4>

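**Note** : An access token is optional for the public models used in this worksheet. If you want to use one, create an account at [Hugging Face](https://huggingface.co/) and generate a token at https://huggingface.co/settings/tokens.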

In [None]:
from transformers import pipeline
import os
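# Never hard-code a real token in a shared notebook; the value below is a placeholder.
# Note: the Hugging Face libraries look for the token in `HF_TOKEN`, so the
# `api_key` variable set here is not read by the code in this worksheet.
os.environ["api_key"] = "<your-hugging-face-token>"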
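
A safer pattern than writing a token into the notebook is to read it from the environment (for example from a Colab secret or a shell variable). The following is a minimal sketch under that assumption; `get_hf_token` is a hypothetical helper name, and `HF_TOKEN` is the variable the Hugging Face libraries look for.

```python
import os

def get_hf_token(env_var="HF_TOKEN"):
    """Return the token stored in env_var, or None when it is unset or blank."""
    token = os.environ.get(env_var, "").strip()
    return token or None

token = get_hf_token()
# Only report whether a token exists; never print the token itself.
print("Token found" if token else "No token set; public models can still be downloaded")
```

Because the token never appears in a cell, the notebook can be shared or committed without leaking credentials.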

<h4> Step 2.2 : Load the pipeline with Translator </h4>

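`pipeline` is a factory function that returns a ready-to-use object for a given task.
  * The first argument names the task the pipeline should perform. Here it is `translation`.
  * `model` - the pre-trained deep learning model to load. Here it is `Helsinki-NLP/opus-mt-en-fr`, an OPUS-MT model published by `Helsinki-NLP`, the language technology research group at the University of Helsinki. `opus` refers to the OPUS collection of parallel corpora used for training, `mt` stands for machine translation, and `en-fr` means the model translates from English (`en`) to French (`fr`).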

**Note** : You can find more models at [Hugging Face](https://huggingface.co/)
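
The naming pattern described above can be captured in a small helper. This is a sketch only: `opus_mt_model_name` and `load_translator` are hypothetical names introduced here, and not every language pair is guaranteed to exist on the Hub under this pattern, so check the model page before relying on a generated name.

```python
def opus_mt_model_name(source_lang, target_lang):
    """Build a model ID following the Helsinki-NLP/opus-mt-<source>-<target> pattern."""
    return f"Helsinki-NLP/opus-mt-{source_lang}-{target_lang}"

def load_translator(source_lang, target_lang):
    """Load a translation pipeline for the pair (downloads the model on first use)."""
    # Imported lazily so the naming helper can be used without transformers installed.
    from transformers import pipeline
    return pipeline("translation", model=opus_mt_model_name(source_lang, target_lang))

print(opus_mt_model_name("en", "fr"))  # Helsinki-NLP/opus-mt-en-fr
```

With these helpers, `load_translator("en", "fr")` would be equivalent to the `pipeline(...)` call used in this step.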

In [None]:
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.





<h4> Step 2.3 : Enter the input sentence </h4>

In [None]:
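text = "Hello, how are you?"  # named `text` to avoid shadowing Python's built-in input()

<h4> Step 2.4 : Translate to French </h4>

`translator` is the pipeline object created in Step 2.2; it is called like a function.
 * The first argument is the input text.
 * `max_length` - upper limit on the length of the generated translation, counted in tokens (subword pieces), not words. Here it is `40`.

In [None]:
result = translator(text, max_length=40)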

Display the result, which is a list containing one dictionary per input sentence:

In [None]:
result

[{'translation_text': 'Bonjour, comment allez-vous ?'}]
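
To work with the translated string rather than the whole list, pull out the `translation_text` field. The sketch below uses the output shown above as sample data so it runs without loading a model; `translation_texts` is a hypothetical helper name.

```python
def translation_texts(result):
    """Return the translated strings from a translation pipeline result."""
    return [item["translation_text"] for item in result]

# Sample data copied from the pipeline output above.
sample = [{'translation_text': 'Bonjour, comment allez-vous ?'}]
print(translation_texts(sample)[0])  # Bonjour, comment allez-vous ?
```

In the notebook, `translation_texts(result)[0]` would give the French sentence as a plain string.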

<h4> Step 2.5 : Repeat with different languages</h4>

<h5> Step 2.5.1 : English to Spanish </h5>

In [None]:
# English to Spanish
translator_es = pipeline("translation", model="Helsinki-NLP/opus-mt-en-es")
result_es = translator_es("Hello friend,how are you?", max_length=40)
print(result_es)




[{'translation_text': 'Hola amigo, ¿cómo estás?'}]


<h5>Step 2.5.2 : English to Chinese </h5>

In [None]:
# English to Chinese
translator_zh = pipeline("translation", model="Helsinki-NLP/opus-mt-en-zh")
result_zh = translator_zh("Hello friend,how are you?", max_length=40)
print(result_zh)


[{'translation_text': '你好,朋友,你好吗?'}]
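
The per-language cells above repeat the same three lines, so they can be folded into a loop. The following is a sketch, not the worksheet's own method: `translate_to_languages` and `hub_translator` are hypothetical names, and the loop assumes an English source and that each `Helsinki-NLP/opus-mt-en-<lang>` model exists on the Hub.

```python
def translate_to_languages(text, target_langs, make_translator, max_length=40):
    """Translate English text into each target language.

    make_translator is any callable that takes a model name and returns a
    translator behaving like a transformers translation pipeline.
    """
    results = {}
    for lang in target_langs:
        translator = make_translator(f"Helsinki-NLP/opus-mt-en-{lang}")
        output = translator(text, max_length=max_length)
        results[lang] = output[0]["translation_text"]
    return results

def hub_translator(model_name):
    """Create a real translation pipeline (downloads the model on first use)."""
    from transformers import pipeline  # imported lazily; only needed for real runs
    return pipeline("translation", model=model_name)

# Usage in the notebook (each model is downloaded the first time it is used):
# print(translate_to_languages("Hello friend, how are you?", ["fr", "es", "zh"], hub_translator))
```

Passing the factory in as an argument keeps the loop independent of how models are loaded, which makes it easy to try out with a stand-in translator before downloading anything.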
