<a href="https://colab.research.google.com/github/andrePankraz/speech_service/blob/main/notebooks/NLLB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Translation - No Language Left Behind (NLLB)
This notebook translates text between 200 languages. It is based on Meta's [NLLB](https://ai.facebook.com/research/no-language-left-behind/) model.

### Set-up environment
We need the following packages:

*   [transformers](https://github.com/huggingface/transformers) for the NLLB model
*   [sentence_cleaner_splitter](https://github.com/facebookresearch/LASER/tree/main/utils) from the [LASER](https://github.com/facebookresearch/LASER) project for sentence splitting

In [1]:
!pip install -U transformers sentence_cleaner_splitter@git+https://github.com/facebookresearch/LASER.git#subdirectory=utils

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentence_cleaner_splitter@ git+https://github.com/facebookresearch/LASER.git#subdirectory=utils
  Cloning https://github.com/facebookresearch/LASER.git to /tmp/pip-install-g7jtgnd6/sentence-cleaner-splitter_09d3c101a8494764a5a2f745ab51a9a7
  Running command git clone -q https://github.com/facebookresearch/LASER.git /tmp/pip-install-g7jtgnd6/sentence-cleaner-splitter_09d3c101a8494764a5a2f745ab51a9a7
Collecting indic-nlp-library==0.81
  Downloading indic_nlp_library-0.81-py3-none-any.whl (40 kB)
Collecting sentence-splitter==1.4
  Downloading sentence_splitter-1.4-py2.py3-none-any.whl (44 kB)
Collecting botok==0.8.8
  Downloading botok-0.8.8-py3-none-any.whl (70 kB)
Collecting khmer-nltk=

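After the install, it can be useful to confirm which versions ended up in the environment. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical, and the distribution names are taken from the pip command above (how a name is matched may depend on the Python version).

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Distribution names as they appear in the pip command above.
packages = ["transformers", "sentence_cleaner_splitter"]
report = {name: installed_version(name) for name in packages}

for name, found in report.items():
    print(f"{name}: {found if found is not None else 'not installed'}")
```

A `not installed` entry suggests that the pip cell did not finish or that the runtime was restarted without re-running it.
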
In [2]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Load the NLLB model. Available checkpoints:
#   facebook/nllb-200-distilled-600M, facebook/nllb-200-distilled-1.3B, facebook/nllb-200-3.3B
# Minimum VRAM (same order): 4 | 8 | 16 GB
model_name = 'facebook/nllb-200-distilled-600M'

model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Downloading:   0%|          | 0.00/846 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/2.46G [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/564 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/4.85M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/17.3M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/3.55k [00:00<?, ?B/s]
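
The loading cell lists three checkpoints together with a rough minimum VRAM for each. As a sketch of how that choice could be automated, the hypothetical helper `pick_checkpoint` below returns the largest checkpoint that fits a given amount of memory. The thresholds are the ones from the comment in the cell, not measured values.

```python
# Checkpoint names and minimum VRAM (GB) taken from the comment in the loading cell.
CHECKPOINTS = [
    ("facebook/nllb-200-3.3B", 16),
    ("facebook/nllb-200-distilled-1.3B", 8),
    ("facebook/nllb-200-distilled-600M", 4),
]


def pick_checkpoint(vram_gb):
    """Return the largest checkpoint whose minimum VRAM fits, else the smallest one."""
    for name, min_gb in CHECKPOINTS:
        if vram_gb >= min_gb:
            return name
    # Below every threshold (for example on CPU): fall back to the smallest model.
    return CHECKPOINTS[-1][0]


print(pick_checkpoint(12))
```

On a GPU runtime, the available memory could be read with `torch.cuda.get_device_properties(0).total_memory / 1024**3`. A card sold as "16 GB" may report slightly less than 16, so the thresholds might need some slack.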

In [5]:
from sentence_cleaner_splitter.sentence_split import split_lang_code_map
split_lang_code_map

{'ace_Arab': 'ace_Arab',
 'ace_Latn': 'ace_Latn',
 'acm_Arab': 'acm',
 'acq_Arab': 'acq',
 'aeb_Arab': 'aeb',
 'afr_Latn': 'afr',
 'ajp_Arab': 'ajp',
 'aka_Latn': 'aka',
 'amh_Ethi': 'amh',
 'apc_Arab': 'apc',
 'arb_Arab': 'ara_Arab',
 'arb_Latn': 'ara_Latn',
 'ars_Arab': 'ars',
 'ary_Arab': 'ary',
 'arz_Arab': 'arz',
 'asm_Beng': 'asm',
 'ast_Latn': 'ast',
 'awa_Deva': 'awa',
 'ayr_Latn': 'ayr',
 'azb_Arab': 'azb',
 'azj_Latn': 'azj',
 'bak_Cyrl': 'bak',
 'bam_Latn': 'bam',
 'ban_Latn': 'ban',
 'bel_Cyrl': 'bel',
 'bem_Latn': 'bem',
 'ben_Beng': 'ben',
 'bho_Deva': 'bho',
 'bjn_Arab': 'bjn_Arab',
 'bjn_Latn': 'bjn_Latn',
 'bod_Tibt': 'bod',
 'bos_Latn': 'bos',
 'bug_Latn': 'bug',
 'bul_Cyrl': 'bul',
 'cat_Latn': 'cat',
 'ceb_Latn': 'ceb',
 'ces_Latn': 'ces',
 'cjk_Latn': 'cjk',
 'ckb_Arab': 'ckb',
 'crh_Latn': 'crh_Latn',
 'cym_Latn': 'cym',
 'dan_Latn': 'dan',
 'deu_Latn': 'deu',
 'dik_Latn': 'dik',
 'diq_Latn': 'diq',
 'dyu_Latn': 'dyu',
 'dzo_Tibt': 'dzo',
 'ell_Grek': 'ell',
 'eng
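
The output above shows that `split_lang_code_map` maps NLLB language codes (language plus script, such as `deu_Latn`) to the codes the sentence splitter uses. Some keep the script suffix and some drop it. The sketch below illustrates a lookup on a small excerpt copied from that output. The helper `splitter_code` and its fallback rule are assumptions for illustration, not part of the library.

```python
# Excerpt of split_lang_code_map, copied from the output above.
SPLIT_MAP_EXCERPT = {
    "ace_Arab": "ace_Arab",
    "arb_Arab": "ara_Arab",
    "deu_Latn": "deu",
    "ell_Grek": "ell",
}


def splitter_code(nllb_code, mapping):
    """Look up the splitter code for an NLLB code.

    Fallback (an assumption, not library behavior): strip the script suffix.
    """
    if nllb_code in mapping:
        return mapping[nllb_code]
    return nllb_code.split("_", 1)[0]


print(splitter_code("deu_Latn", SPLIT_MAP_EXCERPT))
print(splitter_code("arb_Arab", SPLIT_MAP_EXCERPT))
```

In the notebook itself, the real `split_lang_code_map` can be passed as `mapping` in place of the excerpt.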

In [6]:
translation_pipeline = pipeline('translation',
                                model=model,
                                tokenizer=tokenizer,
                                src_lang='deu_Latn',
                                tgt_lang='rus_Cyrl',
                                device=device)
translation_pipeline('Das ist ein Test!')

[{'translation_text': 'Это тест!'}]
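
The call above translates a single sentence. For longer text, one plausible approach is to split it into sentences first and translate them one by one, which is presumably why the sentence splitter is installed. The sketch below shows that flow under stated assumptions: `translate_text` and `naive_split` are hypothetical helpers, the regex splitter is only a stand-in for the LASER splitter, and `translate_fn` is any callable that returns the same shape as the pipeline output shown above.

```python
import re


def naive_split(text):
    """Stand-in splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def translate_text(text, translate_fn, split_fn=naive_split):
    """Translate text sentence by sentence and join the results with spaces.

    translate_fn is assumed to return [{'translation_text': ...}] for one
    input string, like the pipeline call above.
    """
    sentences = split_fn(text)
    translated = [translate_fn(s)[0]["translation_text"] for s in sentences]
    return " ".join(translated)


def fake_translate(sentence):
    """Placeholder 'translator' that upper-cases its input, to show the flow."""
    return [{"translation_text": sentence.upper()}]


print(translate_text("Das ist ein Test! Noch ein Satz.", fake_translate))
```

In the notebook, `translation_pipeline` could be passed as `translate_fn` in place of `fake_translate`.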