# Fine-Tuning NLLB

The purpose of this notebook is to document the process of fine-tuning an NLLB model to translate from Literary Tibetan to English. Some of the code in this notebook is adapted from the tutorial ['How To Fine Tune a NLLB 200 Model for Translating A New Language'](https://cointegrated.medium.com/how-to-fine-tune-a-nllb-200-model-for-translating-a-new-language-a37fc706b865).

I've also drawn on ['Neural Machine Translation With Keras NLP'](https://keras.io/examples/nlp/neural_machine_translation_with_keras_nlp/) for the preprocessing steps.

In [10]:
from transformers.optimization import Adafactor
from transformers import get_constant_schedule_with_warmup
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from tqdm.auto import trange
import numpy as np
import random
import pathlib

In [2]:
from huggingface_hub import notebook_login
notebook_login()


## Preprocessing Text Pairs

### Loading the Data

The code below loads the text pairs into a list of `[Tibetan, English]` pairs and lowercases the English side. Lines that do not contain a comma are skipped.

In [None]:
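text_file = pathlib.Path('/data/training-batches/training-batch-1.txt')

# Read the file line by line (this assumes the file is UTF-8 encoded)
with open(text_file, encoding='utf-8') as f:
    lines = f.read().split("\n")[:-1]

text_pairs = []
for line in lines:
    try:
        # Keep the first two comma-separated fields
        tib, eng = line.split(",")[:2]
    except ValueError:
        # Skip lines that do not contain a comma
        continue
    text_pairs.append([tib, eng.lower()])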
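
One thing to watch with `line.split(",")[:2]` is that an English sentence containing a comma is cut off at its first comma. A minimal sketch of an alternative is to split on the first comma only. It assumes each line holds one Tibetan field followed by its English translation, and that the Tibetan side never contains an ASCII comma (Tibetan punctuation such as the shad `།` is unaffected). The helper name `parse_pairs` and the sample lines are hypothetical.

```python
def parse_pairs(lines):
    """Parse 'tibetan,english' lines, splitting on the first comma only."""
    pairs = []
    for line in lines:
        # maxsplit=1 keeps any later commas inside the English sentence
        parts = line.split(",", 1)
        if len(parts) != 2:
            continue  # skip lines without a comma
        tib, eng = parts
        pairs.append([tib, eng.lower()])
    return pairs


# Hypothetical sample lines, for illustration only
sample_lines = [
    "བཀྲ་ཤིས་བདེ་ལེགས།,Hello, and welcome",
    "a line without a separator",
]
pairs = parse_pairs(sample_lines)
print(pairs)
```

If the real file has more than two columns, this would fold the extra columns into the English field, so it is worth checking a few lines of the data first.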

### Tokenize the Text Pairs

Below, I've used NLLB's pretrained tokenizer to tokenize the data. Each sentence is tokenized on its own and truncated to at most `max_length` tokens.

In [None]:
max_length = 128  # token sequences longer than this will be truncated
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")

In [None]:
# NLLB expects FLORES-200 language codes; Standard Tibetan is 'bod_Tibt'
tokenizer.src_lang = 'bod_Tibt'
x = [tokenizer(pair[0], return_tensors='pt', padding=True, truncation=True, max_length=max_length) for pair in text_pairs]

In [None]:
tokenizer.src_lang = 'eng_Latn'
y = [tokenizer(pair[1], return_tensors='pt', padding=True, truncation=True, max_length=max_length) for pair in text_pairs]
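
Because each sentence is tokenized in its own call, `padding=True` has nothing to pad against here. Padding only takes effect once several sequences share a batch. As a rough illustration of what truncation and padding do to a batch of token ids, here is a simplified plain-Python sketch. The helper `pad_and_truncate`, the pad id, and the token ids are made up for the example, and this is not the tokenizer's actual implementation (the real tokenizer also keeps its special tokens when truncating).

```python
def pad_and_truncate(sequences, max_length, pad_id):
    """Truncate each sequence to max_length, then right-pad to the longest one."""
    truncated = [seq[:max_length] for seq in sequences]
    width = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in truncated]
    return input_ids, attention_mask


# Made-up token ids; pad_id=1 is chosen only for the example
ids, mask = pad_and_truncate([[5, 6, 7, 8, 9], [5, 6], [5]], max_length=4, pad_id=1)
print(ids)
print(mask)
```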

In [None]:
def batch_data(x, y, batch_size):
    """Split x and y into aligned batches of at most batch_size items."""
    batches = []
    # Step through the data in strides of batch_size; the final batch
    # holds whatever is left over, so no examples are dropped
    for start in range(0, len(x), batch_size):
        end = start + batch_size
        batches.append([x[start:end], y[start:end]])
    return batches
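
As a quick sanity check on batching, a common pattern is to step through a list in strides of `batch_size` and let the final batch hold the leftovers, then confirm that no items were lost. The sketch below uses plain integers in place of tokenized sentences; the helper name `chunked` is hypothetical.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of items, each at most batch_size long."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


data = list(range(10))
demo_batches = list(chunked(data, 4))
sizes = [len(batch) for batch in demo_batches]
print(sizes)

# Flatten the batches again to confirm every item appears exactly once, in order
flattened = [item for batch in demo_batches for item in batch]
print(flattened == data)
```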

## Training the Model

### Pre-Trained Model

Here, I've downloaded the pre-trained NLLB model (the distilled 600M-parameter checkpoint).

In [6]:
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")



### Optimizer

Below, I've selected the Adafactor optimizer for training. The hyperparameter values are taken from the tutorial mentioned above and have not been tuned for this task. The learning rate is warmed up over the first 1,000 steps and then held constant.

In [7]:
optimizer = Adafactor(
    [p for p in model.parameters() if p.requires_grad],
    scale_parameter=False,
    relative_step=False,
    lr=1e-4,
    clip_threshold=1.0,
    weight_decay=1e-3,
)
scheduler = get_constant_schedule_with_warmup(optimizer, num_warmup_steps=1000)
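
To make the schedule concrete: `get_constant_schedule_with_warmup` scales the base learning rate by a factor that rises linearly from 0 to 1 over `num_warmup_steps` and then stays at 1. The plain-Python sketch below reproduces that factor for the values used above. `warmup_factor` is a hypothetical helper name, not part of `transformers`.

```python
def warmup_factor(step, num_warmup_steps=1000):
    """Multiplier applied to the base learning rate at a given step."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    return 1.0


base_lr = 1e-4
for step in (0, 250, 500, 1000, 5000):
    # Effective learning rate = base learning rate * warmup factor
    print(step, base_lr * warmup_factor(step))
```

One consequence is that the very first optimizer steps use a learning rate close to zero, so the loss may barely move at the start of training.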

In [None]:
batch_size = 16
# Set a large number of steps, and then just interrupt the training manually
training_steps = 10000
losses = []  # with this list, I do very simple tracking of average loss
MODEL_SAVE_PATH = 'nllb'

In [12]:
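from torch.nn.utils.rnn import pad_sequence


def train(x, y, batch_size, optimizer):
    losses = []
    pad_id = tokenizer.pad_token_id

    # batch the data
    batches = batch_data(x, y, batch_size)

    model.train()
    tq = trange(training_steps)
    for i in tq:
        # cycle through the batches so that each iteration is one optimizer step
        x_batch, y_batch = batches[i % len(batches)]

        # every sentence was tokenized on its own (shape [1, length]),
        # so pad the sentences of this batch to a common length
        input_ids = pad_sequence(
            [enc.input_ids[0] for enc in x_batch], batch_first=True, padding_value=pad_id
        )
        attention_mask = pad_sequence(
            [enc.attention_mask[0] for enc in x_batch], batch_first=True, padding_value=0
        )
        # -100 is a magic value ignored in the loss function
        # because we don't want the model to learn to predict padding ids
        labels = pad_sequence(
            [enc.input_ids[0] for enc in y_batch], batch_first=True, padding_value=-100
        )

        loss = model(
            input_ids=input_ids.to(model.device),
            attention_mask=attention_mask.to(model.device),
            labels=labels.to(model.device),
        ).loss
        loss.backward()
        losses.append(loss.item())

        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
        scheduler.step()

        if i % 1000 == 0:
            # each 1000 steps, I report average loss at these steps
            print(i, np.mean(losses[-1000:]))

        if i % 1000 == 0 and i > 0:
            model.save_pretrained(MODEL_SAVE_PATH + str(i))

    return losses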
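
The effect of the `-100` label can be shown without a model. With the default mean reduction, the cross-entropy loss averages only over positions whose label is not `-100`, so padded positions contribute nothing. A small plain-Python sketch, where the helper names, pad id, labels, and per-token losses are all made up for illustration:

```python
IGNORE_INDEX = -100


def mask_padding(label_ids, pad_id):
    """Replace padding ids with IGNORE_INDEX so the loss skips them."""
    return [IGNORE_INDEX if tok == pad_id else tok for tok in label_ids]


def masked_mean(token_losses, label_ids):
    """Average per-token losses over the positions that are not ignored."""
    kept = [loss for loss, tok in zip(token_losses, label_ids) if tok != IGNORE_INDEX]
    return sum(kept) / len(kept)


demo_labels = mask_padding([17, 42, 2, 1, 1], pad_id=1)  # made-up ids, 1 = padding
token_losses = [0.5, 1.5, 1.0, 9.0, 9.0]                 # made-up per-token losses
print(demo_labels)
print(masked_mean(token_losses, demo_labels))
```

Without the masking step, the two large losses on the padded positions would dominate the average, which is why padding is excluded from the labels.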
