<center><h2>ALTeGraD 2023<br>Lab Session 4: NLP Frameworks</h2> 07 / 11 / 2023<br> Dr. G. Shang, H. Abdine<br><br>

PART 1.

<b>Student name:</b> Balthazar Neveu

</center>
<font color='gray'>

In this lab you will learn how to use Fairseq and HuggingFace Transformers, two of the libraries most used by researchers and developers to finetune language models. You will finetune a pretrained French language model ($RoBERTa_{small}^{fr}$) on the sentiment analysis dataset CLS_Books, where each review is labeled as positive or negative, and finetune a variant of BLOOM on a question/answer dataset.
</font>


In [2]:
from pathlib import Path
def abspath(pth): return Path(pth).resolve()
import torch
from prettytable import PrettyTable
from typing import List
import sentencepiece as spm
MODEL_PATH= "../models/RoBERTa_small_fr/model.pt"
assert Path(MODEL_PATH).exists()

<font color="gray">

# <b>Part 1: Fairseq</b>

In the first part of this lab, you will finetune the given model on the CLS_Books dataset using <b>Fairseq</b> by following these steps:<br>

 1- <b>Tokenize the reviews</b> (Train, Valid and Test) using the trained sentencepiece tokenizer provided alongside the pretrained model [using the sentencepiece library and setting the parameter <b>out_type=str</b> in the encode function].<br>
 2- <b>Binarize the tokenized reviews and their labels</b> using the preprocess python script provided in Fairseq.<br>
 3- <b>Finetune the pretrained $RoBERTa_{small}^{fr}$ model</b> using the train python script provided in Fairseq.<br>

 Finally, you will finish the first part by training a randomly initialized $RoBERTa_{small}^{fr}$ model on the CLS_Books dataset and comparing the results against the pretrained model while <b>visualizing the accuracies on tensorboard</b>.

## <b> Number of parameters of the model</b>

In this section you have to compute the number of parameters of $RoBERTa_{small}^{fr}$ using PyTorch (only the base model, without the head). (<b>Hint:</b> you can check the architecture of the model using model['model'])

</font>
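
As a rough illustration of the hint above: the `model['model']` entry of a checkpoint is a dictionary mapping parameter names to tensors, so the count reduces to summing element counts over the entries that do not belong to the head. The sketch below uses plain Python and a hypothetical dictionary of parameter shapes (the names and shapes are illustrative stand-ins, not read from the real checkpoint); with real tensors, `tensor.numel()` would replace `math.prod(shape)`.

```python
import math
from typing import Dict, Tuple


def count_from_shapes(shapes: Dict[str, Tuple[int, ...]], skip_substring: str = "lm_head") -> int:
    """Sum the element counts of every entry whose name does not contain skip_substring."""
    return sum(math.prod(shape) for name, shape in shapes.items() if skip_substring not in name)


# Hypothetical toy shapes, for illustration only.
toy_shapes = {
    "encoder.embed_tokens.weight": (10, 4),  # 40 elements
    "encoder.layer_norm.weight": (4,),       # 4 elements
    "encoder.layer_norm.bias": (4,),         # 4 elements
    "encoder.lm_head.bias": (10,),           # head: ignored
}
print(count_from_shapes(toy_shapes))
```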



<font color="green">


## Question 1 & Task 1: Number of trainable parameters in RoBERTa small

- Instead of checking the parameters dictionary, I can load the model in RAM and inspect each module.
- [Roberta Small definition](https://github.com/hadi-abdine/fairseq/blob/main/fairseq/models/roberta/model.py#L693-L700)


```python
@register_model_architecture("roberta", "roberta_small")
def roberta_small_architecture(args):
    args.encoder_layers = safe_getattr(args, "encoder_layers", 4)
    args.encoder_embed_dim = safe_getattr(args, "encoder_embed_dim", 512)
    args.encoder_ffn_embed_dim = safe_getattr(args, "encoder_ffn_embed_dim", 512)
    args.encoder_attention_heads = safe_getattr(args, "encoder_attention_heads", 8)
    args.max_source_positions = safe_getattr(args, "max_positions", 256)
    base_architecture(args)
```

- I used `sympy` to get the symbolic results and to check that the numerical results match.
- Here are the notations:
```python
    L: int(nlayers), # Number of layers (4)
    V: int(ntokens), # Number of tokens (32k)
    D: int(nhid), # Embedding = feed forward dimensions (512)
    A: int(nhead), # Number of attention heads (8)
    T: int(max_positions) # Max length number of tokens (256) - defines positional embedding dimension
```

</font>

In [4]:
from fairseq.models.roberta import RobertaModel
model = RobertaModel.from_pretrained(Path(MODEL_PATH).parent, checkpoint_file="model.pt")

2023-11-13 10:30:41 | INFO | fairseq.file_utils | loading archive file ../models/RoBERTa_small_fr from cache at ../models/RoBERTa_small_fr
2023-11-13 10:30:41 | INFO | fairseq.tasks.masked_lm | dictionary: 31999 types
2023-11-13 10:30:42 | INFO | fairseq.models.roberta.model | {'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 50, 'log_format': None, 'log_file': None, 'aim_repo': None, 'aim_run_hash': None, 'tensorboard_logdir': 'pretraining_tensorboard', 'wandb_project': None, 'azureml_logging': False, 'seed': 1, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': True, 'memory_efficient_fp16': True, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'on_cpu_convert_precision': False, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'amp': False, 'amp_batch_retries': 2, 'amp_init_scale': 128, 'amp_scale_window': None, 'user_dir': None, 'empty_cache_freq':

In [6]:
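from typing import Optional
def count_parameters(current_model: torch.nn.Module, table_flag: bool = True, details: Optional[List[List[str]]] = None) -> int:
    """Count the trainable parameters of each torch module, skipping the LM head.
    """
    details = details or []  # avoid a shared mutable default argument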
    table = PrettyTable(["Modules", "Parameters", "Analytic validation", "Analytic formula"])
    total_params = 0
    for index, (name, parameter) in enumerate(current_model.named_parameters()):
        if not parameter.requires_grad:
            continue
        params = parameter.numel()
        detail = ["", ""] if index>=len(details) else details[index]
        name = name.replace("model.encoder.", "")
        if "lm_head" in name:
            continue
        table.add_row([name, params, detail[0], detail[1]])
        total_params += params
    if table_flag:
        print(table)
    print(f"Total Trainable Params: {total_params}")
    return total_params

In [15]:
from sympy import symbols, Eq, sympify
import sympy
from IPython.display import display, Markdown

V, D, A, L, T = symbols("V, D, A, L, T")
embedding_size = V*D

ntokens = 32000
encoder_layers = nlayers = 4
encoder_embed_dim = 512
encoder_ffn_embed_dim = 512
nhid = encoder_embed_dim
encoder_attention_heads = nhead = 8
max_positions =256
true_values = {
    L: int(nlayers), # Number of layers (4)
    V: int(ntokens), #Number of tokens (32k)
    D: int(nhid), # Embedding = feed forward dimensions (512) 
    A: int(nhead), # Number of attention heads (8)
    T : int(max_positions) # Max length number of tokens (256) - defines positional embedding dimension
}

list_computations = []
total_express = ""
def get_analytic_expression(name, expr_:str, record=True):
    global total_express
    expr = sympify(expr_)
    res = expr.evalf(subs=true_values)
    equation = Eq(expr, int(res))
    latex_eq = sympy.latex(equation)
    eq_original  = sympy.latex(sympify(expr_, evaluate=False))
    display(Markdown(f"{name}: ${eq_original}$ =  $ {latex_eq} $"))
    if record:
        total_express = total_express + "+" + expr_
        list_computations.append([int(res) , str(expr).replace("**2", "²")]) #[f"{name}: ${eq_original}$ =  $ {latex_eq} $", ])
    return res
get_analytic_expression("Embeddings linear projection weights", "V*D")
get_analytic_expression("Trainable position embedding", "D*(T+2)")
get_analytic_expression(f"Layer norm Embedding coefficient", "D")
get_analytic_expression(f"Layer norm Embedding bias", "D")

for layer_idx in range(1, nlayers+1):
    print(f"LAYER {layer_idx}")
    for proj in ["inputs Key", "inputs Query", "inputs Value"]:
        get_analytic_expression(f"MHA {proj} Weights", "A*(D*(D/A))")
        get_analytic_expression(f"MHA {proj} Bias ", "A*(D/A)")
    get_analytic_expression("MHA output projection Weights", "D*D")
    get_analytic_expression("MHA output projection Bias ", "D")

    get_analytic_expression(f"Layer norm weights", "D")
    get_analytic_expression(f"Layer norm bias", "D")

    for fc_index in [1, 2]:
        get_analytic_expression(f"Fully connected {fc_index} Weights", "D*D")
        get_analytic_expression(f"Fully connected {fc_index} Bias ", "D")


    get_analytic_expression(f"Layer norm weights", "D")
    get_analytic_expression(f"Layer norm bias", "D")

total_express = total_express[1:]
total_trainable_params = count_parameters(model, details=list_computations)


Embeddings linear projection weights: $D V$ =  $ D V = 16384000 $

Trainable position embedding: $D \left(T + 2\right)$ =  $ D \left(T + 2\right) = 132096 $

Layer norm Embedding coefficient: $D$ =  $ D = 512 $

Layer norm Embedding bias: $D$ =  $ D = 512 $

LAYER 1


MHA inputs Key Weights: $\frac{A D D}{A}$ =  $ D^{2} = 262144 $

MHA inputs Key Bias : $\frac{A D}{A}$ =  $ D = 512 $

MHA inputs Query Weights: $\frac{A D D}{A}$ =  $ D^{2} = 262144 $

MHA inputs Query Bias : $\frac{A D}{A}$ =  $ D = 512 $

MHA inputs Value Weights: $\frac{A D D}{A}$ =  $ D^{2} = 262144 $

MHA inputs Value Bias : $\frac{A D}{A}$ =  $ D = 512 $

MHA output projection Weights: $D D$ =  $ D^{2} = 262144 $

MHA output projection Bias : $D$ =  $ D = 512 $

Layer norm weights: $D$ =  $ D = 512 $

Layer norm bias: $D$ =  $ D = 512 $

Fully connected 1 Weights: $D D$ =  $ D^{2} = 262144 $

Fully connected 1 Bias : $D$ =  $ D = 512 $

Fully connected 2 Weights: $D D$ =  $ D^{2} = 262144 $

Fully connected 2 Bias : $D$ =  $ D = 512 $

Layer norm weights: $D$ =  $ D = 512 $

Layer norm bias: $D$ =  $ D = 512 $

LAYER 2


MHA inputs Key Weights: $\frac{A D D}{A}$ =  $ D^{2} = 262144 $

MHA inputs Key Bias : $\frac{A D}{A}$ =  $ D = 512 $

MHA inputs Query Weights: $\frac{A D D}{A}$ =  $ D^{2} = 262144 $

MHA inputs Query Bias : $\frac{A D}{A}$ =  $ D = 512 $

MHA inputs Value Weights: $\frac{A D D}{A}$ =  $ D^{2} = 262144 $

MHA inputs Value Bias : $\frac{A D}{A}$ =  $ D = 512 $

MHA output projection Weights: $D D$ =  $ D^{2} = 262144 $

MHA output projection Bias : $D$ =  $ D = 512 $

Layer norm weights: $D$ =  $ D = 512 $

Layer norm bias: $D$ =  $ D = 512 $

Fully connected 1 Weights: $D D$ =  $ D^{2} = 262144 $

Fully connected 1 Bias : $D$ =  $ D = 512 $

Fully connected 2 Weights: $D D$ =  $ D^{2} = 262144 $

Fully connected 2 Bias : $D$ =  $ D = 512 $

Layer norm weights: $D$ =  $ D = 512 $

Layer norm bias: $D$ =  $ D = 512 $

LAYER 3


MHA inputs Key Weights: $\frac{A D D}{A}$ =  $ D^{2} = 262144 $

MHA inputs Key Bias : $\frac{A D}{A}$ =  $ D = 512 $

MHA inputs Query Weights: $\frac{A D D}{A}$ =  $ D^{2} = 262144 $

MHA inputs Query Bias : $\frac{A D}{A}$ =  $ D = 512 $

MHA inputs Value Weights: $\frac{A D D}{A}$ =  $ D^{2} = 262144 $

MHA inputs Value Bias : $\frac{A D}{A}$ =  $ D = 512 $

MHA output projection Weights: $D D$ =  $ D^{2} = 262144 $

MHA output projection Bias : $D$ =  $ D = 512 $

Layer norm weights: $D$ =  $ D = 512 $

Layer norm bias: $D$ =  $ D = 512 $

Fully connected 1 Weights: $D D$ =  $ D^{2} = 262144 $

Fully connected 1 Bias : $D$ =  $ D = 512 $

Fully connected 2 Weights: $D D$ =  $ D^{2} = 262144 $

Fully connected 2 Bias : $D$ =  $ D = 512 $

Layer norm weights: $D$ =  $ D = 512 $

Layer norm bias: $D$ =  $ D = 512 $

LAYER 4


MHA inputs Key Weights: $\frac{A D D}{A}$ =  $ D^{2} = 262144 $

MHA inputs Key Bias : $\frac{A D}{A}$ =  $ D = 512 $

MHA inputs Query Weights: $\frac{A D D}{A}$ =  $ D^{2} = 262144 $

MHA inputs Query Bias : $\frac{A D}{A}$ =  $ D = 512 $

MHA inputs Value Weights: $\frac{A D D}{A}$ =  $ D^{2} = 262144 $

MHA inputs Value Bias : $\frac{A D}{A}$ =  $ D = 512 $

MHA output projection Weights: $D D$ =  $ D^{2} = 262144 $

MHA output projection Bias : $D$ =  $ D = 512 $

Layer norm weights: $D$ =  $ D = 512 $

Layer norm bias: $D$ =  $ D = 512 $

Fully connected 1 Weights: $D D$ =  $ D^{2} = 262144 $

Fully connected 1 Bias : $D$ =  $ D = 512 $

Fully connected 2 Weights: $D D$ =  $ D^{2} = 262144 $

Fully connected 2 Bias : $D$ =  $ D = 512 $

Layer norm weights: $D$ =  $ D = 512 $

Layer norm bias: $D$ =  $ D = 512 $

+-------------------------------------------------------+------------+---------------------+------------------+
|                        Modules                        | Parameters | Analytic validation | Analytic formula |
+-------------------------------------------------------+------------+---------------------+------------------+
|          sentence_encoder.embed_tokens.weight         |  16384000  |       16384000      |       D*V        |
|        sentence_encoder.embed_positions.weight        |   132096   |        132096       |    D*(T + 2)     |
|      sentence_encoder.layernorm_embedding.weight      |    512     |         512         |        D         |
|       sentence_encoder.layernorm_embedding.bias       |    512     |         512         |        D         |
|   sentence_encoder.layers.0.self_attn.k_proj.weight   |   262144   |        262144       |        D²        |
|    sentence_encoder.layers.0.self_attn.k_proj.bias    |    512     |         512         |        D   

In [18]:
get_analytic_expression("TOTAL AUTO COMPUTATION", total_express, record=False)
total_expression_factorized = "V*D + D*(T+2) + (D + D) + L*( 3*A*D*(D/A) + 3*A*(D/A) + (D*D +D) + (D + D) + 2*(D*D + D) + (D + D))"
manual_total = get_analytic_expression("TOTAL MANUAL COMPUTATION", total_expression_factorized, record=False)
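# Cast the sympy result to int: recent sympy versions do not treat a Float and an int as equal
assert total_trainable_params == int(manual_total), "Wrong trainable model parameters estimation"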

TOTAL AUTO COMPUTATION: $D V + D \left(T + 2\right) + D + D + D D + D + D + D + D D + D + D D + D + D + D + D D + D + D + D + D D + D + D D + D + D + D + D D + D + D + D + D D + D + D D + D + D + D + D D + D + D + D + D D + D + D D + D + D + D + \frac{A D D}{A} + \frac{A D}{A} + \frac{A D D}{A} + \frac{A D}{A} + \frac{A D D}{A} + \frac{A D}{A} + \frac{A D D}{A} + \frac{A D}{A} + \frac{A D D}{A} + \frac{A D}{A} + \frac{A D D}{A} + \frac{A D}{A} + \frac{A D D}{A} + \frac{A D}{A} + \frac{A D D}{A} + \frac{A D}{A} + \frac{A D D}{A} + \frac{A D}{A} + \frac{A D D}{A} + \frac{A D}{A} + \frac{A D D}{A} + \frac{A D}{A} + \frac{A D D}{A} + \frac{A D}{A}$ =  $ 24 D^{2} + D V + D \left(T + 2\right) + 42 D = 22829056 $

TOTAL MANUAL COMPUTATION: $D V + D \left(T + 2\right) + D + D + L \left(D D + D + D + D + D + D + 2 \left(D D + D\right) + \frac{3 A D D}{A} + \frac{3 A D}{A}\right)$ =  $ D V + D \left(T + 2\right) + 2 D + L \left(6 D^{2} + 10 D\right) = 22829056 $
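
The factorized expression can also be checked without `sympy`. The following sketch evaluates $DV + D(T+2) + 2D + L(6D^2 + 10D)$ in plain Python with the values used above. The function name `roberta_encoder_params` is introduced here for illustration, and the formula assumes that the feed-forward dimension equals the embedding dimension, as in this architecture.

```python
def roberta_encoder_params(V: int, D: int, L: int, T: int) -> int:
    """Closed-form parameter count of the encoder (no head), assuming ffn dim == embed dim."""
    embeddings = V * D + D * (T + 2) + 2 * D  # tokens, positions, embedding layer norm
    attention = 4 * (D * D + D)               # K, Q, V and output projections
    feed_forward = 2 * (D * D + D)            # two fully connected layers
    layer_norms = 2 * (2 * D)                 # two layer norms per layer
    return embeddings + L * (attention + feed_forward + layer_norms)


total = roberta_encoder_params(V=32000, D=512, L=4, T=256)
print(total)
```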

<font color='gray'>

## Task 2: Preprocessing the dataset
Prepare the data for finetuning. To do so you have to:
1. Use the provided sentencepiece model to tokenize the text.
2. Binarize the data using `fairseq-preprocess` command

### Tokenizing the reviews

In this section we will tokenize the finetuning dataset using the sentencepiece tokenizer. We have three splits in our dataset: train, valid and test sets.

In this task you have to use the trained sentencepiece tokenizer (RoBERTa_small_fr/sentencepiece.bpe.model) to tokenize 
- the input three files <b>train.review</b>, <b>valid.review</b> and <b>test.review</b> 
- and output the three files <b>train.spm.review</b>, <b>valid.spm.review</b> and <b>test.spm.review</b> containing the tokenized reviews.

Documentation: https://github.com/google/sentencepiece#readme
</font>

## Task 2.1 Tokenization
Here we split the sentences into tokens using the provided byte pair encoding (BPE) model.

When training a model from scratch, we would first train a tokenizer on our own corpus.

But if we want to **re-use a pretrained model**, we must tokenize with the vocabulary it was trained with.
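
To make the piece format concrete: in sentencepiece output, the `▁` character marks the start of a word, so the original text can be recovered by concatenating the pieces and turning `▁` back into spaces. A small sketch in plain Python, using the example pieces quoted in the tokenization cell (the helper name `pieces_to_text` is introduced here for illustration):

```python
def pieces_to_text(pieces):
    """Rebuild text from sentencepiece-style pieces, where '▁' marks a word start."""
    return "".join(pieces).replace("▁", " ").strip()


pieces = "▁An ci enne ▁VS ▁Nouvelle ▁version ▁plus".split(" ")
print(pieces_to_text(pieces))
```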

In [None]:
SPLITS=['train', 'test', 'valid']
TOKENIZER_MODEL_PATH = '../models/RoBERTa_small_fr/sentencepiece.bpe.model'
DATA_BOOKS = Path('../data/cls.books/')

In [None]:
# TASK 2.1
s = spm.SentencePieceProcessor(model_file=TOKENIZER_MODEL_PATH)

SENTS="review"

for split in SPLITS:
    # Read one review per line
    with open(DATA_BOOKS/(split+'.'+SENTS), 'r', encoding='utf-8') as f:
        reviews = f.readlines()

    # Tokenize each review into sentencepiece pieces, joined by spaces.
    # It should look something like this:
    # ▁An ci enne ▁VS ▁Nouvelle ▁version ▁plus
    tokenized = [" ".join(s.encode(review.strip(), out_type=str)) for review in reviews]
    print(f"Original: {reviews[0][:80]}, \t Tokenized: {tokenized[0][:80]}")

    # Write one tokenized review per line
    with open(DATA_BOOKS/(split+'.spm' + "." + SENTS), 'w', encoding='utf-8') as f:
        f.write("\n".join(tokenized)+'\n')

<font color='gray'>

## <b>Binarizing/Preprocessing the finetuning dataset</b>

In this section, you have to binarize the CLS_Books dataset using the <b>fairseq/fairseq_cli/preprocess.py</b> script:

1- Binarize the tokenized reviews and put the output in <b>data/cls-books-bin/input0</b>. 
> Note: Our pretrained model's embedding matrix contains only the embedding of the vocab listed in the dictionary <b>dict.txt</b>.
>
> You need to use the dictionary in the binarization of the text to transform the tokens into indices. Also note that we are using Encoder only architecture, so we only have source data.

2- Binarize the labels (train.label, valid.label and test.label files) and put the output in <b>data/cls-books-bin/label</b>.

Documentation: https://fairseq.readthedocs.io/en/latest/command_line_tools.html
</font>
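
The token-to-index step described above can be pictured with a toy example. The sketch below is not fairseq's implementation: it builds a small lookup table from a hypothetical three-entry dictionary in the `token count` format, reserving the first four indices for the special symbols `<s>`, `<pad>`, `</s>` and `<unk>`, and maps any piece missing from the dictionary to `<unk>`. Appending `</s>` at the end of each sequence is also an assumption of this sketch.

```python
from typing import Dict, List

SPECIALS = ["<s>", "<pad>", "</s>", "<unk>"]  # indices 0, 1, 2, 3


def build_vocab(dict_lines: List[str]) -> Dict[str, int]:
    """Give each token of a 'token count' dictionary an index placed after the special symbols."""
    vocab = {tok: i for i, tok in enumerate(SPECIALS)}
    for line in dict_lines:
        token = line.rsplit(" ", 1)[0]  # drop the trailing count
        vocab.setdefault(token, len(vocab))
    return vocab


def encode(pieces: List[str], vocab: Dict[str, int]) -> List[int]:
    """Map pieces to indices, unknown pieces to <unk>, and append </s>."""
    unk = vocab["<unk>"]
    return [vocab.get(p, unk) for p in pieces] + [vocab["</s>"]]


toy_dict = ["▁version 120", "▁plus 95", "enne 12"]  # hypothetical entries
vocab = build_vocab(toy_dict)
print(encode(["▁version", "▁plus", "▁rare"], vocab))
```

With such a table, the choice of `--init-token 0` and `--separator-token 2` in a training command would correspond to `<s>` and `</s>`.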

## Task 2.2 Binarization

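- `--only-source`: only process the source language (an encoder-only model has no target side).
- `--srcdict`: reuse an existing dictionary instead of building a new one, so that token indices match the pretrained embedding matrix.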
- `-s "spm.review"` for source language , `-t ""` for an undefined target language.

Note on file sizes
> - train.review: `1102kb`
> - train.spm.review: `1741kb` (larger, because tokenization adds `▁` markers and spaces between pieces)
> - train.bin: `569kb` (smaller, because each token is stored as an integer index into the dictionary rather than as text).


In the context of `fairseq`, `.bin` and `.idx` files are integral to the data format used for efficient storage and retrieval of the training data.

Here's what each of these file types represents:

1. **`.bin` Files**: 
   - These are binary files that contain the actual training data.
   - In `fairseq`, data (such as tokenized text) is converted into a numerical format (indices corresponding to tokens in the dictionary) and then stored in a binary format.
   - This binary format is more space-efficient and faster to **read from disk compared to plain text**, which is crucial for large datasets commonly used in machine learning.

2. **`.idx` Files**: 
   - These files are index files that accompany the `.bin` files.
   - The `.idx` file stores the byte offsets of each example (like a sentence or a document) in the corresponding `.bin` file.
   - When the training process requires a specific example from the dataset, it uses the `.idx` file to quickly find where that example starts and ends in the `.bin` file.
   - This allows for efficient random access to examples in the dataset without having to read the entire `.bin` file sequentially, which is especially beneficial for large datasets.

In summary, the combination of `.bin` and `.idx` files in `fairseq` enables efficient and fast access to large-scale datasets, a critical aspect of training modern neural networks, particularly in tasks like natural language processing where datasets can be extremely large.
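
The random-access idea can be illustrated with a toy format. The sketch below is not fairseq's actual on-disk layout: it writes token indices as unsigned 16-bit integers to a `.bin` file and keeps the start offset of each example in an in-memory list standing in for the `.idx` file, then reads a single example back without scanning the others. File and helper names are introduced for illustration.

```python
import os
import tempfile
from array import array
from typing import List


def write_examples(path: str, examples: List[List[int]]) -> List[int]:
    """Write all examples back to back; return cumulative offsets counted in tokens."""
    offsets = [0]
    with open(path, "wb") as f:
        for ex in examples:
            array("H", ex).tofile(f)  # "H" = unsigned short
            offsets.append(offsets[-1] + len(ex))
    return offsets


def read_example(path: str, offsets: List[int], i: int) -> List[int]:
    """Seek directly to example i and read only its tokens."""
    item_size = array("H").itemsize
    with open(path, "rb") as f:
        f.seek(offsets[i] * item_size)
        out = array("H")
        out.fromfile(f, offsets[i + 1] - offsets[i])
    return out.tolist()


examples = [[0, 17, 254, 2], [0, 9, 2], [0, 31, 31, 8, 2]]  # toy token indices
with tempfile.TemporaryDirectory() as tmp:
    bin_path = os.path.join(tmp, "toy.bin")
    offsets = write_examples(bin_path, examples)
    second = read_example(bin_path, offsets, 1)
print(offsets, second)
```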




In [None]:
# TASK 2.2
SRC_DICT = "../models/RoBERTa_small_fr/dict.txt"
DESTINATION_ROOT = Path("../data/cls-books-bin")

# (file suffix, folder output, dictionary to use)
CONFIG = [
    (".spm.review", "input0", f"--srcdict {SRC_DICT}"), # binarize the tokenized reviews
    (".label", "label", ""), # binarize the labels - fairseq preprocess will build the dictionary needed (0 & 1 basically)
]
for suffix, out_dir, src_dict in CONFIG:
    CORPUS_TRAIN, CORPUS_VALID, CORPUS_TEST = [str(DATA_BOOKS/f"{dataset_split}{suffix}") for dataset_split in ['train', 'valid', 'test']]
    DESTINATION_FOLDER = str(DESTINATION_ROOT/out_dir)
    !(python ~/fairseq/fairseq_cli/preprocess.py \
                --only-source \
                --workers 8 \
                $src_dict \
                --destdir "$DESTINATION_FOLDER"\
                --trainpref "$CORPUS_TRAIN" \
                --validpref "$CORPUS_VALID" \
                --testpref "$CORPUS_TEST" \
    )

<font color='gray'>

## <b>Finetuning $RoBERTa_{small}^{fr}$</b>

In this section you will use `fairseq/fairseq_cli/train.py` python script to finetune the pretrained model on the CLS_Books dataset (binarized data) for three different seeds: 0, 1 and 2.

Make sure to use the following hyper-parameters: 
- batch size=8
- max number of epochs = 5
- optimizer: Adam
- max learning rate: 1e-05,
- warm up ratio: 0.06, 
- learning rate scheduler: linear
</font>

# Task 3: Finetune RoBERTa small
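
The requested schedule (6% warmup, then a linear scheduler) can be written down explicitly. The sketch below assumes a linear warmup from 0 to the peak learning rate, followed by a linear decay to 0 at the last update, which is what a polynomial decay of power 1 amounts to. The function name `linear_schedule` and the zero end value are assumptions of this illustration, not read from fairseq.

```python
def linear_schedule(step: int, peak_lr: float, warmup_updates: int, total_updates: int) -> float:
    """Learning rate at a given update: linear warmup, then linear decay to zero."""
    if step < warmup_updates:
        return peak_lr * step / warmup_updates
    if step >= total_updates:
        return 0.0
    remaining = 1 - (step - warmup_updates) / (total_updates - warmup_updates)
    return peak_lr * remaining


# Values from the lab: peak 1e-5, 1200 updates, warmup = 6% of 1200 = 72 updates
for step in (0, 36, 72, 636, 1200):
    print(step, linear_schedule(step, peak_lr=1e-5, warmup_updates=72, total_updates=1200))
```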

In [2]:
DATA_SET='books'
TASK= 'sentence_prediction' # sentence prediction task on fairseq
MODEL='RoBERTa_small_fr'
DATA_PATH= "../data/cls-books-bin"
MAX_EPOCH= 5
MAX_SENTENCES= 8 # batch size
MAX_UPDATE = 1200 # maximum number of optimizer updates (training steps)
LR= 1.E-5
VALID_SUBSET='valid,test' # for simplicity we validate on both the valid and test sets, then report the test score obtained at the best validation score.
METRIC = 'accuracy' # use the accuracy metric
NUM_CLASSES = 2 # number of classes
SEEDS=3
CUDA_VISIBLE_DEVICES=0
WARMUP = int(0.06 * MAX_UPDATE) # warmup ratio=6% of the whole training
ARCHITECTURE = "roberta_small"

In [3]:
assert Path(MODEL_PATH).exists(), f"Error: {MODEL_PATH} does not exist."

In [4]:
for SEED in range(SEEDS):
  # Shared run name so that logs and checkpoints use the same folder layout
  RUN_NAME = f"{TASK}/{DATA_SET}/{MODEL}_ms{MAX_SENTENCES}_mu{MAX_UPDATE}_lr{LR}_me{MAX_EPOCH}/{SEED}"
  TENSORBOARD_LOGS = f"tensorboard_logs/{RUN_NAME}"
  SAVE_DIR = f"checkpoints/{RUN_NAME}"
  !(python ~/fairseq/fairseq_cli/train.py \
                $DATA_PATH \
                --restore-file $MODEL_PATH \
                --batch-size $MAX_SENTENCES \
                --task $TASK \
                --update-freq 1 \
                --seed $SEED \
                --reset-optimizer --reset-dataloader --reset-meters \
                --init-token 0 \
                --separator-token 2 \
                --arch $ARCHITECTURE \
                --criterion sentence_prediction \
                --num-classes $NUM_CLASSES \
                --weight-decay 0.01 \
                --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-08 \
                --maximize-best-checkpoint-metric \
                --best-checkpoint-metric $METRIC \
                --save-dir $SAVE_DIR \
                --lr-scheduler polynomial_decay \
                --lr $LR \
                --max-update $MAX_UPDATE \
                --total-num-update $MAX_UPDATE \
                --no-epoch-checkpoints \
                --no-last-checkpoints \
                --tensorboard-logdir $TENSORBOARD_LOGS \
                --log-interval 5 \
                --warmup-updates $WARMUP \
                --max-epoch $MAX_EPOCH \
                --keep-best-checkpoints 1 \
                --max-positions 256 \
                --valid-subset $VALID_SUBSET \
                --shorten-method 'truncate' \
                --no-save \
                --distributed-world-size 1)


2023-11-11 17:01:13 | INFO | fairseq_cli.train | {'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 5, 'log_format': None, 'log_file': None, 'aim_repo': None, 'aim_run_hash': None, 'tensorboard_logdir': 'tensorboard_logs/sentence_prediction/books/RoBERTa_small_fr_ms8_mu1200_lr1e-05_me5/0', 'wandb_project': None, 'azureml_logging': False, 'seed': 0, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': False, 'memory_efficient_fp16': False, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'on_cpu_convert_precision': False, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'amp': False, 'amp_batch_retries': 2, 'amp_init_scale': 128, 'amp_scale_window': None, 'user_dir': None, 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 1, 'quantization_config_path': None, 'profile': False, 'reset_logging': False, 'suppress_crashes': False, 'us