<a href="https://colab.research.google.com/github/davidbaines/translate_docx/blob/main/Text_Translation_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### This notebook shows a full pipeline for text language identification and translation using two Facebook models: fastText and No Language Left Behind (NLLB).

First, we take an input text in any language and detect its language code using fastText.

Next, we feed the text and the predicted language label to NLLB, which translates the text from the source language into any target language that NLLB supports.

Source: https://colab.research.google.com/drive/1fsbzykS5ANEMVcn7gtp8Wl7gkmRzbwOW

Webpage: https://medium.com/mlearning-ai/text-translation-using-nllb-and-huggingface-tutorial-7e789e0f7816

# Language Identification

In [None]:
# download the language model pretrained file
!wget https://dl.fbaipublicfiles.com/nllb/lid/lid218e.bin

--2023-01-28 04:48:02--  https://dl.fbaipublicfiles.com/nllb/lid/lid218e.bin
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 172.67.9.4, 104.22.75.142, 104.22.74.142, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|172.67.9.4|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1176355829 (1.1G) [application/octet-stream]
Saving to: ‘lid218e.bin’


2023-01-28 04:48:51 (23.7 MB/s) - ‘lid218e.bin’ saved [1176355829/1176355829]



In [None]:
!pip install fasttext

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting fasttext
  Downloading fasttext-0.9.2.tar.gz (68 kB)
  Preparing metadata (setup.py) ... done
Collecting pybind11>=2.2
  Using cached pybind11-2.10.3-py3-none-any.whl (222 kB)
Building wheels for collected packages: fasttext
  Building wheel for fasttext (setup.py) ... done
  Created wheel for fasttext: filename=fasttext-0.9.2-cp38-cp38-linux_x86_64.whl size=4403549 sha256=28812edc4be591a08ef27ce67437ce9baa6665e2fa11dc1839f51010d55b33e3
  Stored in directory: /root/.cache/pip/wheels/93/61/2a/c54711a91c418ba06ba195b1d78ff24fcaad8592f2a694ac94
Successfully built fasttext
Installing collected packages: pybind11, fasttext
Successfully installed fasttext-0.9.2 pybind11-2.10.3


In [None]:
import fasttext

pretrained_lang_model = "/content/lid218e.bin" # path of pretrained model file
model = fasttext.load_model(pretrained_lang_model)



Now let's enter a test text in the source language. In this example we translate from Arabic to Spanish.

In [None]:
text = "صباح الخير، الجو جميل اليوم والسماء صافية."

In [None]:
predictions = model.predict(text, k=1) 
print(predictions)

(('__label__arb_Arab',), array([0.99960977]))


In [None]:
input_lang = predictions[0][0].replace('__label__', '')
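
The prediction printed above is a pair: a tuple of labels and an array of probabilities. A minimal sketch of unpacking it, with a confidence check before trusting the detected code, might look like the following. The helper name `parse_prediction` and the 0.5 threshold are assumptions for illustration, not part of fasttext, and a plain list stands in for the NumPy array so the snippet runs without the model.

```python
def parse_prediction(prediction, min_confidence=0.5):
    """Return (language_code, confidence) from a fasttext-style top-1 prediction.

    `prediction` is expected to look like (('__label__arb_Arab',), [0.9996]).
    Raises ValueError when the confidence is below `min_confidence`
    (a hypothetical threshold, chosen only for this sketch).
    """
    labels, probabilities = prediction
    language_code = labels[0].replace('__label__', '')
    confidence = float(probabilities[0])
    if confidence < min_confidence:
        raise ValueError(
            f"Low-confidence language prediction: {language_code} ({confidence:.2f})"
        )
    return language_code, confidence


# Hard-coded stand-in for the value printed above, not produced by the model here
example_prediction = (('__label__arb_Arab',), [0.99960977])
code, confidence = parse_prediction(example_prediction)
print(code, round(confidence, 4))  # arb_Arab 0.9996
```

For short or mixed-language inputs the top probability can be much lower, so a check like this may be worth adding before the code is passed on to the translation step.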

# Text Translation

In [None]:
!pip install -U pip transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pip
  Downloading pip-22.3.1-py3-none-any.whl (2.1 MB)
Collecting transformers
  Downloading transformers-4.26.0-py3-none-any.whl (6.3 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.12.0-py3-none-any.whl (190 kB)
Installing co

In [None]:
!pip install sentencepiece

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.97

In [None]:
# Smallest 600M parameter model - distilled
checkpoint = 'facebook/nllb-200-distilled-600M'

# Medium 1.3B parameter model - distilled
# checkpoint = 'facebook/nllb-200-distilled-1.3B'

# Medium 1.3B parameter model
# checkpoint = 'facebook/nllb-200-1.3B'

# Large 3.3B parameter model
# checkpoint = 'facebook/nllb-200-3.3B'



In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

Downloading (…)lve/main/config.json:   0%|          | 0.00/846 [00:00<?, ?B/s]

Downloading (…)"pytorch_model.bin";:   0%|          | 0.00/2.46G [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/189 [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/564 [00:00<?, ?B/s]

Downloading (…)ncepiece.bpe.model";:   0%|          | 0.00/4.85M [00:00<?, ?B/s]

Downloading (…)"tokenizer.json";:   0%|          | 0.00/17.3M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/3.55k [00:00<?, ?B/s]

In [None]:
target_lang = 'spa_Latn'
translation_pipeline = pipeline('translation', 
                                model=model, 
                                tokenizer=tokenizer, 
                                src_lang=input_lang, 
                                tgt_lang=target_lang, 
                                max_length = 400)
output = translation_pipeline(text)
print(output[0]['translation_text'])

Buenos días, el clima es hermoso y el cielo está limpio.
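
The codes passed as `src_lang` and `tgt_lang` in this notebook (`arb_Arab`, `spa_Latn`) share one shape: a three-letter lowercase language code, an underscore, and a four-letter script code. A small shape check can catch typos before the pipeline is built. This is only a sketch: the helper name `looks_like_nllb_code` is hypothetical, and the check covers the format only, not whether the model actually supports the language.

```python
import re

# Three lowercase letters, an underscore, then a capitalized four-letter script tag
_CODE_PATTERN = re.compile(r'[a-z]{3}_[A-Z][a-z]{3}')


def looks_like_nllb_code(code):
    """Return True if `code` has the 'xxx_Xxxx' shape used in this notebook."""
    return _CODE_PATTERN.fullmatch(code) is not None


for candidate in ['arb_Arab', 'spa_Latn', 'es', 'spa-Latn']:
    print(candidate, looks_like_nllb_code(candidate))
```

A two-letter code such as `es` fails the check, which is the kind of mistake that is easy to make when coming from other translation tools.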


In [None]:
input_lang = 'eng_Latn'
output_lang = 'tpi_Latn'
input_texts = [r'You may introduce the book of Mark through story form in as concrete a way as possible. ',
               r'One way to do this is to tell the following story and have your translation team act it out. ',
               r'Ahead of time, choose the following characters: Mark, Peter, Jesus, Paul, and Barnabas. ',
               r'The rest of the team can play the parts of the followers and believers. ',
               r'Choose parts of the room to represent different parts of the world: Jerusalem, Antioch, Cyprus, and Rome. ',
               r'These four places could be the four corners of the room. ',
               r'As you tell the story, the characters in that part of the story can walk to the part of the room that is representing the place in the story.',
               r'Have the characters act out the journeys and the actions as you tell the story.',
               r'You may have the team retell the story after you tell it, again acting it out.',
               r'Help them if they forget parts of the story.',]      

In [None]:

outputs = []
# Longest input in characters (informational only; the pipeline's max_length counts tokens)
max_length = max(len(input_text) for input_text in input_texts)
print(f"Max length is {max_length}")

# Build the pipeline once and reuse it for every sentence
translation_pipeline = pipeline('translation',
                                model=model,
                                tokenizer=tokenizer,
                                src_lang=input_lang,
                                tgt_lang=output_lang,
                                max_length=400)

for input_text in input_texts:
    outputs.append(translation_pipeline(input_text))

for output in outputs:
    print(output[0]['translation_text'])



Max length is 141
Yu ken kamapim buk Mak long rot bilong stori long rot i stret tru. 
Wanpela rot bilong mekim olsem em long storiim stori i kamap bihain na tokim ol lain bilong tanim tok long mekim olsem. 
Taim yu laik kisim sampela hap tok, makim ol dispela man: Mak, Pita, Jisas, Pol, na Barnabas. 
Ol narapela insait long lain inap mekim wok bilong ol disaipel na ol bilipman. 
Pinis long makim ol hap bilong rum bilong makim ol narapela hap bilong graun: Jerusalem, Antiok, Saiprus, na Rom. 
Dispela 4-pela hap inap makim 4-pela kona bilong rum.
Taim yu stori, ol man i stap long dispela hap stori i ken wokabaut i go long hap bilong rum em ples i stap long stori.
Taim yu stori, yu mas tokim ol man long ol samting yu mekim long rot bilong raun na wokabaut.
Ating bai yu tokim lain long stori gen taim yu stori pinis, na bihain bai yu mekim olsem.
Sapos ol i lusim tingting long sampela hap stori, orait helpim ol.


In [None]:
tpi_output_texts = []
for output in outputs:
    tpi_output_texts.append(output[0]['translation_text'])

In [None]:
input_lang = 'tpi_Latn'
output_lang = 'eng_Latn'

tpi_input_texts = tpi_output_texts

# Build the reverse (Tok Pisin to English) pipeline once and reuse it
translation_pipeline = pipeline('translation',
                                model=model,
                                tokenizer=tokenizer,
                                src_lang=input_lang,
                                tgt_lang=output_lang,
                                max_length=400)

eng_outputs = [translation_pipeline(input_text) for input_text in tpi_input_texts]
eng_output_texts = [eng_output[0]['translation_text'] for eng_output in eng_outputs]

In [None]:
for eng_output_text in eng_output_texts:
    print(eng_output_text)


You can make Mark's book sound by telling the truth.
And the way of the translator is to make a report, and the way of the translator is to make a report.
Now if you want to take part in the discussion, choose these men: Mark, Peter, Jesus, Paul and Barnabas.
The rest of the congregation can share in the ministry of the disciples and the faithful.
And the city of Jerusalem, and the city of Antioch, and the city of Cyprus, and the city of Rome, were chosen.
And the four corners of the house were four corners.
When you tell the story, the people in the story can walk to the room where the story is told.
When thou speakest, thou shalt speak thy words, and thy ways.
And thou shalt speak, and thou shalt speak, and thou shalt speak.
If they forget a thing, help them.
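
The round trip above drifts noticeably from the source sentences. One rough way to put a number on that drift is a character-level similarity ratio from Python's standard `difflib` module. This is only a sketch: the function name `round_trip_similarity` is an assumption, the score measures surface overlap rather than meaning, and the two sample pairs are copied from the cells above.

```python
from difflib import SequenceMatcher


def round_trip_similarity(original, back_translated):
    """Return a 0..1 surface-similarity ratio, ignoring case and outer whitespace."""
    a = original.strip().lower()
    b = back_translated.strip().lower()
    return SequenceMatcher(None, a, b).ratio()


# Source sentences paired with their English back-translations from this notebook
pairs = [
    ("Help them if they forget parts of the story.",
     "If they forget a thing, help them."),
    ("These four places could be the four corners of the room. ",
     "And the four corners of the house were four corners."),
]

for original, back_translated in pairs:
    score = round_trip_similarity(original, back_translated)
    print(f"{score:.2f}  {back_translated}")
```

Low scores can flag sentences worth reviewing by hand, but a reordered sentence with the same meaning also scores low, so the ratio is at best a first filter.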
