We want to build the model with the Hugging Face libraries instead of the fairseq scripts, because that makes each step of building the model easier to debug and understand.
We are starting from scratch, so we need a SentencePiece tokenizer trained on the joint corpus. That was mostly done in the previous notebook.

However, I never got a Hugging Face model to train with that tokenizer, so let's try that here.

### Load Pretrained Model from hub

In [17]:
TOKENIZER_BATCH_SIZE = 256  # Batch-size to train the tokenizer on
TOKENIZER_VOCABULARY = 25000  # Total number of unique subwords the tokenizer can have

BLOCK_SIZE = 128  # Maximum number of tokens in an input sample
NSP_PROB = 0.50  # Probability that the next sentence is the actual next sentence in NSP
SHORT_SEQ_PROB = 0.1  # Probability of generating shorter sequences to minimize the mismatch between pretraining and fine-tuning.
MAX_LENGTH = 512  # Maximum number of tokens in an input sample after padding

MLM_PROB = 0.2  # Probability with which tokens are masked in MLM

TRAIN_BATCH_SIZE = 2  # Batch-size for pretraining the model on
MAX_EPOCHS = 1  # Maximum number of epochs to train the model for
LEARNING_RATE = 1e-4  # Learning rate for training the model

MODEL_CHECKPOINT = "facebook/mbart-large-50"  # Name of pretrained model on the 🤗 Model Hub (the organization prefix is part of the id)

In [18]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)

ModuleNotFoundError: No module named 'transformers'
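The error above means the `transformers` package is not installed in the kernel this notebook is running on. A minimal sketch for checking this before the import is shown below; `is_installed` is a helper name made up for this note, and the suggested install command assumes a pip-based environment (the mBART-50 tokenizer is SentencePiece-based, so `sentencepiece` is likely needed as well).

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be imported in this environment."""
    # find_spec returns None when no top-level module with that name is found.
    return importlib.util.find_spec(module_name) is not None


if not is_installed("transformers"):
    # Assumption: pip is the package manager for this kernel.
    print("transformers is missing; try: pip install transformers sentencepiece")
```

In a notebook, running `%pip install transformers sentencepiece` in its own cell and then restarting the kernel is usually enough for the import to succeed.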

### Load Dataset

In [13]:
import os

import pandas as pd


def read_concat_jsonl(files, home_dir):
    """Read each JSON Lines file under home_dir and stack them into one DataFrame."""
    dfs = [pd.read_json(os.path.join(home_dir, file), lines=True) for file in files]
    return pd.concat(dfs)


def concatenate_language_pairs(language_pairs_paths, home_dir):
    """Map each language-pair key to one DataFrame built from all of its files."""
    # read_concat_jsonl already returns a single frame, so no second concat is needed.
    return {
        key: read_concat_jsonl(paths, home_dir)
        for key, paths in language_pairs_paths.items()
    }
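One thing to keep in mind with this loading pattern: `pd.concat` keeps each file's own row index by default, so the combined frame can contain repeated index values. The sketch below illustrates this on two tiny, made-up JSONL files written to a temporary directory (the file names and rows are hypothetical, not taken from the benchmark) and shows `ignore_index=True` as one way to renumber the rows.

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical toy rows that mimic the "input"/"target" columns of the real files.
splits = {
    "toy_train.jsonl": [{"input": "a", "target": "x"}, {"input": "b", "target": "y"}],
    "toy_dev.jsonl": [{"input": "c", "target": "z"}],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write one JSON object per line, which is what lines=True expects.
    for name, rows in splits.items():
        with open(os.path.join(tmp_dir, name), "w", encoding="utf-8") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")

    frames = [pd.read_json(os.path.join(tmp_dir, name), lines=True) for name in splits]

kept = pd.concat(frames)                           # each file keeps its own index
renumbered = pd.concat(frames, ignore_index=True)  # rows get one fresh 0..n-1 index

print(list(kept.index))
print(list(renumbered.index))
```

Whether to renumber depends on what comes next: label-based lookups such as `df.loc[0]` return several rows when the index repeats, while position-based access with `df.iloc` is unaffected.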

In [14]:
# Raw strings keep the Windows backslashes literal and avoid invalid-escape warnings (e.g. "\e", "\d", "\c").
language_pairs_paths = {
    'en_cr': [r'kreol-benchmark\experiments\data\en-cr\en-cr_dev.jsonl', r'kreol-benchmark\experiments\data\en-cr\en-cr_train.jsonl', r'kreol-benchmark\experiments\data\en-cr\en-cr_test.jsonl'],
    'cr': [r'kreol-benchmark\experiments\data\cr\cr_dev.jsonl', r'kreol-benchmark\experiments\data\cr\cr_train.jsonl', r'kreol-benchmark\experiments\data\cr\cr_test.jsonl'],
}
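The hard-coded backslash paths only resolve on Windows. As a possible alternative, the same relative paths can be assembled with `pathlib`, which picks the right separator for the current OS. This is a sketch, not the notebook's original approach: `split_paths` and `DATA_ROOT_PARTS` are names invented here, and the `{pair}_{split}.jsonl` pattern is inferred from the file names listed above.

```python
from pathlib import Path, PurePosixPath, PureWindowsPath

# Directory components taken from the paths used in the notebook.
DATA_ROOT_PARTS = ("kreol-benchmark", "experiments", "data")


def split_paths(pair, path_cls=Path):
    """Return the dev/train/test JSONL paths for one language pair, e.g. 'en-cr'."""
    root = path_cls(*DATA_ROOT_PARTS, pair)
    return [str(root / f"{pair}_{split}.jsonl") for split in ("dev", "train", "test")]


# Same dictionary shape as language_pairs_paths, built for the current OS.
portable_paths = {
    "en_cr": split_paths("en-cr"),
    "cr": split_paths("cr"),
}

# The pure path classes show what each OS would produce, without touching the disk.
print(split_paths("en-cr", PurePosixPath)[0])
print(split_paths("en-cr", PureWindowsPath)[0])
```

The resulting dictionary can be passed to `concatenate_language_pairs` in place of the hand-written one, since `os.path.join` accepts these strings unchanged.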

In [15]:
data_all_dict = concatenate_language_pairs(language_pairs_paths,home_dir=r'C:\Users\yush\OneDrive\Desktop\papers')

In [16]:
data_all_dict

{'en_cr':                                                  input  \
 0    I did not come to do away with them, but to gi...   
 1    The fact is, at the time, you had to pay the t...   
 2    Angina can be described as a discomfort, heavi...   
 3             The boy said he would, but he didn't go.   
 4     Was it God in heaven or merely some human being?   
 ..                                                 ...   
 995  Any kingdom where people fight each other will...   
 996  And I am not good enough even to stoop down an...   
 997  Who among you, if your son asks for bread, you...   
 998  If that person listens, you have won back a fo...   
 999  Then he pointed to his disciples and said, the...   
 
                                                 target  
 0    Mo pa finn vini pou aboli me pou donn zot zot ...  
 1    Anverite sa lepok la pou al lekol ti ena enn f...  
 2    Nou capav dekrir anzinn couma enn sensasion in...  
 3    Garson-la reponn wi papa, li pou ale me li 