# NLLB Initial Experiments

The purpose of this notebook is to document my experiments with Meta's 'NLLB', or 'No Language Left Behind' model, as a possible replacement for the v1 translation model.

## First Look

Below, I've implemented an adjusted version of the tutorial code provided on [NLLB's HuggingFace page](https://huggingface.co/docs/transformers/model_doc/nllb).

In [1]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

In [2]:
tokenizer = AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M", src_lang="bo"
)

In [3]:
model = AutoModelForSeq2SeqLM.from_pretrained("/home/j/Documents/Projects/MLotsawa/notebooks/nllb/nllb-checkpoint-4")

  torch.utils._pytree._register_pytree_node(
  torch.utils._pytree._register_pytree_node(


In [4]:
def translate(input_text):
    
    inputs = tokenizer(input_text, return_tensors="pt")

    translated_tokens = model.generate(
        **inputs, forced_bos_token_id=tokenizer.lang_code_to_id["eng_Latn"], max_length=300
    )
    
    return tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]

Here I've define variables for two sample texts. These are both the same line from the Longchen Nyingtik Ngondro liturgy entitled "The Excellent Path to Omniscience
The Preliminary Practice of the Heart-Essence of the Vast Expanse (Longchen Nyingtik) from the Great Perfection" arranged by Dodrupchen Jikme Trinle Özer. This text is available [here](https://www.lotsawahouse.org/tibetan-masters/dodrupchen-I/longchen-nyingtik-ngondro). The first is in the original Tibetan script. The second is the transliteration of the line. The source text does not specify which transliteration system was used but it appears to be THL Simplified Phonetic Transcription as described [here](https://en.wikipedia.org/wiki/THL_Simplified_Phonetic_Transcription).

The text is:

**འོད་འཕྲོས་རྒྱལ་བ་སྲས་བཅས་མཆོད་པས་མཉེས། །**

**ö trö gyalwa sé ché chöpé nyé**


Which the source text translates as:

**Light streams out, making offerings to the buddhas and bodhisattvas, and pleasing them.**

In [25]:
tibetan_script_sample = "འོད་འཕྲོས་རྒྱལ་བ་སྲས་བཅས་མཆོད་པས་མཉེས། །"
transliterated_sample = "ö trö gyalwa sé ché chöpé nyé"

We can see below that the translation from the Tibetan script seems to be conceptually near to the given translation, if not quite what we're looking for.

In [36]:
translate(tibetan_script_sample)

'The Son of Man, the Son of the Light, is pleased with sacrifice.'

Meanwhile, the transliterated sample appears essentially unrelated to the given translation.

In [37]:
translate(transliterated_sample)

"I'm very happy with the way it turned out."

It's possible that the translations improve which greater context. To test this I've provided below larger samples for translation. These samples come from the same portion of the same text but I've used the entire section, entitled "The Blessing of the Speech".

In the Tibetan script:

***ཨོཾ་ཨཱཿཧཱུྂ། ལྕེ་དབང་རྂ་ཡིག་ལས་བྱུང་མེས་བསྲེགས་ནས། །***

***འོད་དམར་རྣམ་པའི་རྡོ་རྗེ་རྩེ་གསུམ་སྦུབས། །***

***ཨཱ་ལི་ཀཱ་ལིའི་མཐའ་སྐོར་རྟེན་འབྲེལ་སྙིང་། །***

***མུ་ཏིག་ཕྲེང་བ་ལྟ་བུའི་ཡིག་འབྲུ་ལས། །***

***འོད་འཕྲོས་རྒྱལ་བ་སྲས་བཅས་མཆོད་པས་མཉེས། །***

***སླར་འདུས་ངག་སྒྲིབ་དག་ནས་གསུང་རྡོ་རྗེའི། །***

***བྱིན་རླབས་དངོས་གྲུབ་ཐམས་ཅད་ཐོབ་པར་འགྱུར། །***

Transliterated:

***om ah hung ché wang ram yik lé jung mé sek né***

***ö mar nampé dorjé tsesum bub***

***ali kali takor tendrel nying***

***mutik trengwa tabü yikdru lé***

***ö trö gyalwa sé ché chöpé nyé***

***lar dü ngak drib dak né sung dorjé***

***jinlab ngödrub tamché tobpar gyur***

Translated:

***Oṃ āḥ hūṃ! From the syllable raṃ (in my speech centre) arises fire, consuming my tongue,***

***Which is transformed into a three-spoked vajra of red light.***

***In its centre are the vowels and consonants, and around them the mantra of ‘The Essence of Interdependent Origination’***

***Their syllables are like strings of pearls. From them,***

***Light streams out, making offerings to the buddhas and bodhisattvas, and pleasing them.***

***As it converges back, all the obscurations of my speech are purified, and***

***I obtain all the blessings and siddhis of vajra speech.***

In [6]:
tibetan_script_larger = "ཨོཾ་ཨཱཿཧཱུྂ། ལྕེ་དབང་རྂ་ཡིག་ལས་བྱུང་མེས་བསྲེགས་ནས། ། འོད་དམར་རྣམ་པའི་རྡོ་རྗེ་རྩེ་གསུམ་སྦུབས། ། ཨཱ་ལི་ཀཱ་ལིའི་མཐའ་སྐོར་རྟེན་འབྲེལ་སྙིང་། ། མུ་ཏིག་ཕྲེང་བ་ལྟ་བུའི་ཡིག་འབྲུ་ལས། ། འོད་འཕྲོས་རྒྱལ་བ་སྲས་བཅས་མཆོད་པས་མཉེས། ། སླར་འདུས་ངག་སྒྲིབ་དག་ནས་གསུང་རྡོ་རྗེའི། ། བྱིན་རླབས་དངོས་གྲུབ་ཐམས་ཅད་ཐོབ་པར་འགྱུར། །"
transliterated_larger = "om ah hung ché wang ram yik lé jung mé sek né ö mar nampé dorjé tsesum bub ali kali takor tendrel nying mutik trengwa tabü yikdru lé ö trö gyalwa sé ché chöpé nyé lar dü ngak drib dak né sung dorjé jinlab ngödrub tamché tobpar gyur"

We can see below that neither translation appears to have improved. However, there is a point of interest in the translation from the Tibetan script. Note the reference to the Book of Mormon, which may give us a clue to the training set for this model. It is likely that Meta had access to a variety of training materials that are commonly translated into a huge number of languages, among which is undoubtedly a sizeable portion of materials from the LDS Church, which evangelized globally. This presents a particular concern, because we want to avoid translations that bring with themselves a great deal of conceptual, particularly theological, baggage.

In [38]:
translate(tibetan_script_larger)

"The letter O'Bohemian was burned with fire, burned with red-colored stones, burned with the heart of the Alkali siege, and the letter like the Book of Mormon was read with joy, and the Son of Light and Victory was delighted with the sacrifice."

In [39]:
translate(transliterated_larger)

'The name of the child is called the "trengwa tabü yikdru" by the name of the child is called the "chewable" by the name of the child, the "drub" by the name of the child.'

In [9]:
translate(transliterated_larger)

'o h mahkla who has the power to free herself from the ills of the five poisons of the profound and vast dharma'

Below, I've translated the passage in Tibetan script line by line. Note that the translation changes significantly. The first line is notably messy and the Book of Mormon is no longer mentioned, but there is still a lot of room for improvement.

In [7]:
line_by_line = transliterated_larger.split("")

line_by_line_translation = []

for line in line_by_line:
    line_by_line_translation.append(translate(line))

In [56]:
line_by_line_translation

['Oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, what the word of the word of the word of the word of the word of the word of the word of the word of the word of the word of the word of the word of the word of the word of the word of the word of the word of the word of the word of the word of the word of the word of the word of the word of the word of the word of the word of the word of the word of the word of the word of the word is.',
 'Three corners of a bright red-colored stone were lit.',
 'The siege of Alicante was celebrated in the heart of the city.',
 'Letters like the letter "m" are:',
 'The light of the world is pleased with sacrifices, even the son of the conqueror.',
 'Again, the w

In [11]:
line_by_line = [
    'om ah hung', 
    'ché wang ram yik lé jung mé sek né',
    'ö mar nampé dorjé tsesum bub',
    'ali kali takor tendrel nying', 
    'mutik trengwa tabü yikdru lé',
    'ö trö gyalwa sé ché chöpé nyé',
    'lar dü ngak drib dak né sung dorjé',
    'jinlab ngödrub tamché tobpar gyur'
]

In [12]:
line_by_line_translation = []

for line in line_by_line:
    line_by_line_translation.append(translate(line))

In [13]:
line_by_line_translation

['with ',
 'although the king and ministers had already been seated on the lotus',
 'the sun is transformed into a dazzling vajra and his consort dissolves into',
 'in its centre are the vowels and consonants and around them the mantra of the essence',
 'their syllables are like strings of pearls on which are',
 'hungry hungry victorious ones sweetheart',
 'once again i arise in the form of the vajra speech the obscuring nails and',
 'i obtain all the blessings of accomplishment and the twofold accumulation']