<center><h2>ALTeGraD 2023<br>Lab Session 4: NLP Frameworks</h2> 07 / 11 / 2023<br> Dr. G. Shang, H. Abdine<br><br>


<b>Student name:</b> Balthazar Neveu

</center>
<font color='gray'>

In this lab you will learn how to use Fairseq and HuggingFace Transformers, the libraries most used by researchers and developers to finetune language models. You will finetune a pretrained French language model ($RoBERTa_{small}^{fr}$) on the sentiment analysis dataset CLS_Books, where each review is labeled as positive or negative, and finetune a variant of BLOOM on a question/answer dataset.
</font>


In [1]:
from pathlib import Path
def abspath(pth): return Path(pth).resolve()

<font color='gray'>

## Task 2: Preprocessing the dataset
Prepare the data for finetuning. To do so you have to:
1. Use the provided sentencepiece model to tokenize the text.
2. Binarize the data using the `fairseq-preprocess` command.

### Tokenizing the reviews

In this section we will tokenize the finetuning dataset using the sentencepiece tokenizer. The dataset has three splits: train, valid and test.

In this task you have to use the trained sentencepiece tokenizer (RoBERTa_small_fr/sentencepiece.bpe.model) to tokenize 
- the input three files <b>train.review</b>, <b>valid.review</b> and <b>test.review</b> 
- and output the three files <b>train.spm.review</b>, <b>valid.spm.review</b> and <b>test.spm.review</b> containing the tokenized reviews.

Documentation: https://github.com/google/sentencepiece#readme
</font>

## Task 2.1 Tokenization
When we split the sentences into tokens here, we use a provided byte pair encoding model.

When training a model from scratch, we would also build the tokenizer from scratch on our own corpus, but if we want to **re-use a pretrained model**, we need to follow its vocabulary definition.

In [15]:
SPLITS=['train', 'test', 'valid']
TOKENIZER_MODEL_PATH = '../models/RoBERTa_small_fr/sentencepiece.bpe.model'
DATA_BOOKS = Path('../data/cls.books/')

In [18]:
# TASK 2.1
import sentencepiece as spm

# Load the tokenizer shipped with the pretrained model so that the pieces match its vocabulary
s = spm.SentencePieceProcessor(model_file=TOKENIZER_MODEL_PATH)

SENTS = "review"

for split in SPLITS:
    # Read the raw reviews, one per line (UTF-8 is assumed for the French text)
    with open(DATA_BOOKS/f"{split}.{SENTS}", 'r', encoding='utf-8') as f:
        reviews = [line.rstrip("\n") for line in f]

    # Encode each review into subword pieces and join the pieces with spaces
    tokenized = [" ".join(s.encode(review, out_type=str)) for review in reviews]
    print(f"Original: {reviews[0][:80]}, \t Tokenized: {tokenized[0][:80]}")
    # It should look something like this:
    # ▁An ci enne ▁VS ▁Nouvelle ▁version ▁plus

    # Write one tokenized review per line to <split>.spm.review
    with open(DATA_BOOKS/f"{split}.spm.{SENTS}", 'w', encoding='utf-8') as f:
        f.write("\n".join(tokenized) + "\n")

Original: Ce livre est tout simplement magique !  il vaut le detour suspense, humour, magi, 	 Tokenized: ▁Ce ▁livre ▁est ▁tout ▁simplement ▁mag ique ▁! ▁il ▁vaut ▁le ▁de tour ▁suspens e
Original: J'ai lu ce livre car dans ma ville, tout le monde s'en sert et le commande. C'es, 	 Tokenized: ▁J ' ai ▁lu ▁ce ▁livre ▁car ▁dans ▁ma ▁ville , ▁tout ▁le ▁monde ▁s ' en ▁sert ▁e
Original: Ce livre explique techniquement et de façon très compréhensible, même pour des n, 	 Tokenized: ▁Ce ▁livre ▁explique ▁technique ment ▁et ▁de ▁façon ▁très ▁com pré hen sible , ▁
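
In the output above, `▁` marks the start of a new word, while pieces without it continue the previous word. As a quick sanity check, the space-joined pieces can be turned back into text. This is a minimal sketch in plain Python, not the sentencepiece API; the helper name `detokenize` is hypothetical and the sample is the beginning of the first review printed above.

```python
# Toy illustration (plain Python, not the sentencepiece API).
SPACE_MARKER = "\u2581"  # the "▁" symbol that marks a word-initial space

def detokenize(tokenized_line: str) -> str:
    """Rebuild the text from a space-separated string of sentencepiece pieces."""
    pieces = tokenized_line.split(" ")
    # Glue the pieces together, then turn every marker back into a space.
    return "".join(pieces).replace(SPACE_MARKER, " ").strip()

sample = "▁Ce ▁livre ▁est ▁tout ▁simplement ▁mag ique ▁!"
restored = detokenize(sample)
print(restored)
```

Note that the round trip is only exact up to whitespace normalization, since the tokenizer collapses repeated spaces.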


<font color='gray'>

## <b>Binarizing/Preprocessing the finetuning dataset</b>

In this section, you have to binarize the CLS_Books dataset using the <b>fairseq/fairseq_cli/preprocess.py</b> script:

1- Binarize the tokenized reviews and put the output in <b>data/cls-books-bin/input0</b>. 
> Note: Our pretrained model's embedding matrix contains only the embedding of the vocab listed in the dictionary <b>dict.txt</b>.
>
> You need to use the dictionary in the binarization of the text to transform the tokens into indices. Also note that we are using an encoder-only architecture, so we only have source data.

2- Binarize the labels (train.label, valid.label and test.label files) and put the output in <b>data/cls-books-bin/label</b>.

Documentation: https://fairseq.readthedocs.io/en/latest/command_line_tools.html
</font>

## Task 2.2 Binarization

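- `--only-source`: only process the source side (there is no target language here).
- `--srcdict`: reuse an existing dictionary (here the pretrained model's `dict.txt`) instead of building a new one from the data.
- `-s "spm.review"` sets the source language and `-t ""` leaves the target language undefined.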
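
To make the role of the dictionary concrete, here is a toy sketch of the token-to-index lookup performed during binarization. It is not fairseq's `Dictionary` class. It assumes the `token count` line layout of `dict.txt`, four special symbols placed at the first indices, and an end-of-sentence index appended to each line; the dictionary content below is made up for illustration.

```python
# Toy sketch of token -> index lookup; NOT fairseq's Dictionary implementation.
# Assumption: four special symbols occupy indices 0 to 3.
SPECIALS = ["<s>", "<pad>", "</s>", "<unk>"]

def load_vocab(dict_lines):
    """Build a token -> index map from 'token count' lines (dict.txt layout)."""
    vocab = {tok: i for i, tok in enumerate(SPECIALS)}
    for line in dict_lines:
        token, _count = line.rsplit(" ", 1)
        vocab.setdefault(token, len(vocab))
    return vocab

def encode_line(tokenized_line, vocab):
    """Map each piece to its index (unknown pieces fall back to <unk>), then append </s>."""
    unk = vocab["<unk>"]
    ids = [vocab.get(piece, unk) for piece in tokenized_line.split()]
    return ids + [vocab["</s>"]]

# Hypothetical dictionary content, for illustration only.
toy_dict = ["▁Ce 120", "▁livre 95", "▁est 300", "▁mag 12", "ique 40"]
vocab = load_vocab(toy_dict)
ids = encode_line("▁Ce ▁livre ▁est ▁mag ique ▁!", vocab)
print(ids)
```

Because the pretrained embedding matrix is indexed by these positions, building a fresh dictionary from the reviews would assign different indices to the same pieces, which is why the pretrained `dict.txt` has to be reused.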

Note on file sizes
> - `train.review`: 1102 kB
> - `train.spm.review`: 1741 kB (tokenization adds the `▁` markers and spaces between pieces)
> - `train.bin`: 569 kB (smaller than the text because each token is stored as an integer index rather than as characters; the indices are mapped back to tokens through the dictionary)


In the context of `fairseq`, `.bin` and `.idx` files are integral to the data format used for efficient storage and retrieval of the training data.

Here's what each of these file types represents:

1. **`.bin` Files**: 
   - These are binary files that contain the actual training data.
   - In `fairseq`, data (such as tokenized text) is converted into a numerical format (indices corresponding to tokens in the dictionary) and then stored in a binary format.
   - This binary format is more space-efficient and faster to **read from disk compared to plain text**, which is crucial for large datasets commonly used in machine learning.

2. **`.idx` Files**: 
   - These files are index files that accompany the `.bin` files.
   - The `.idx` file stores the byte offsets of each example (like a sentence or a document) in the corresponding `.bin` file.
   - When the training process requires a specific example from the dataset, it uses the `.idx` file to quickly find where that example starts and ends in the `.bin` file.
   - This allows for efficient random access to examples in the dataset without having to read the entire `.bin` file sequentially, which is especially beneficial for large datasets.

In summary, the combination of `.bin` and `.idx` files in `fairseq` enables efficient and fast access to large-scale datasets, a critical aspect of training modern neural networks, particularly in tasks like natural language processing where datasets can be extremely large.
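
The idea can be sketched with a toy version of the two files. This is not fairseq's actual on-disk layout: the 16-bit token ids, the 64-bit offsets and the helper names `write_dataset` and `read_sentence` are all assumptions made for illustration.

```python
import os
import tempfile
from array import array

def write_dataset(prefix, sentences):
    """Store token ids back to back in <prefix>.bin and sentence offsets in <prefix>.idx."""
    offsets = [0]
    with open(prefix + ".bin", "wb") as f_bin:
        for ids in sentences:
            array("H", ids).tofile(f_bin)  # "H": unsigned 16-bit integers
            offsets.append(offsets[-1] + len(ids))
    with open(prefix + ".idx", "wb") as f_idx:
        array("Q", offsets).tofile(f_idx)  # "Q": unsigned 64-bit integers

def read_sentence(prefix, i):
    """Fetch sentence i by seeking to its offset instead of scanning the .bin file."""
    offsets = array("Q")
    with open(prefix + ".idx", "rb") as f_idx:
        offsets.frombytes(f_idx.read())
    start, end = offsets[i], offsets[i + 1]
    ids = array("H")
    with open(prefix + ".bin", "rb") as f_bin:
        f_bin.seek(start * ids.itemsize)
        ids.fromfile(f_bin, end - start)
    return ids.tolist()

sentences = [[4, 5, 6, 2], [7, 8, 2], [4, 9, 10, 11, 2]]
with tempfile.TemporaryDirectory() as tmp:
    prefix = os.path.join(tmp, "toy")
    write_dataset(prefix, sentences)
    second = read_sentence(prefix, 1)
    bin_size = os.path.getsize(prefix + ".bin")
print(second, bin_size)
```

With 16-bit ids, each token costs two bytes whatever the length of its text form, which is consistent with the `.bin` file being smaller than the tokenized text file.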




In [17]:
# TASK 2.2
SRC_DICT = "../models/RoBERTa_small_fr/dict.txt"
DESTINATION_ROOT = Path("../data/cls-books-bin")

# (file suffix, folder output, dictionary to use)
CONFIG = [
    (".spm.review", "input0", f"--srcdict {SRC_DICT}"), # binarize the tokenized reviews
    (".label", "label", ""), # binarize the labels - fairseq preprocess will build the dictionary needed (0 & 1 basically)
]
for suffix, out_dir, src_dict in CONFIG:
    CORPUS_TRAIN, CORPUS_VALID, CORPUS_TEST = [str(DATA_BOOKS/f"{dataset_split}{suffix}") for dataset_split in ['train', 'valid', 'test']]
    DESTINATION_FOLDER = str(DESTINATION_ROOT/out_dir)
    !(python ~/fairseq/fairseq_cli/preprocess.py \
                --only-source \
                --workers 8 \
                $src_dict \
                --destdir "$DESTINATION_FOLDER"\
                --trainpref "$CORPUS_TRAIN" \
                --validpref "$CORPUS_VALID" \
                --testpref "$CORPUS_TEST" \
    )

2023-11-11 14:31:19 | INFO | fairseq_cli.preprocess | Namespace(no_progress_bar=False, log_interval=100, log_format=None, log_file=None, aim_repo=None, aim_run_hash=None, tensorboard_logdir=None, wandb_project=None, azureml_logging=False, seed=1, cpu=False, tpu=False, bf16=False, memory_efficient_bf16=False, fp16=False, memory_efficient_fp16=False, fp16_no_flatten_grads=False, fp16_init_scale=128, fp16_scale_window=None, fp16_scale_tolerance=0.0, on_cpu_convert_precision=False, min_loss_scale=0.0001, threshold_loss_scale=None, amp=False, amp_batch_retries=2, amp_init_scale=128, amp_scale_window=None, user_dir=None, empty_cache_freq=0, all_gather_list_size=16384, model_parallel_size=1, quantization_config_path=None, profile=False, reset_logging=False, suppress_crashes=False, use_plasma_view=False, plasma_path='/tmp/plasma', criterion='cross_entropy', tokenizer=None, bpe=None, optimizer=None, lr_scheduler='fixed', scoring='bleu', task='translation', source_lang=None, target_lang=None, tr

<font color='gray'>

## <b>Finetuning $RoBERTa_{small}^{fr}$</b>

In this section you will use `fairseq/fairseq_cli/train.py` python script to finetune the pretrained model on the CLS_Books dataset (binarized data) for three different seeds: 0, 1 and 2.

Make sure to use the following hyper-parameters: 
- batch size=8
- max number of epochs = 5
- optimizer: Adam
- max learning rate: 1e-05,
- warm up ratio: 0.06, 
- learning rate scheduler: linear
</font>

In [1]:
DATA_SET='books'
TASK= 'sentence_prediction' # sentence prediction task on fairseq
MODEL='RoBERTa_small_fr'
DATA_PATH= "../data/cls-books-bin"
MODEL_PATH= "../models/RoBERTa_small_classif"
MAX_EPOCH= 5
MAX_SENTENCES= 8 # batch size
MAX_UPDATE = 1200 # maximum number of optimizer updates (training steps)
LR= 1.E-5
VALID_SUBSET='valid,test' # for simplicity we validate on both the valid and test sets, then pick the test score corresponding to the best validation score.
METRIC = 'accuracy' # use the accuracy metric
NUM_CLASSES = 2 # number of classes
SEEDS=3
CUDA_VISIBLE_DEVICES=0
WARMUP = int(0.06 * MAX_UPDATE) # warmup ratio=6% of the whole training
ARCHITECTURE = "roberta_small"
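
The requested schedule (linear scheduler with a 6% warmup) can be written down explicitly. The function below is a minimal sketch of a linear warmup followed by a linear decay to zero. The assumption is that fairseq's `polynomial_decay` scheduler with its default power behaves this way; this is an illustration, not fairseq's code, and the name `linear_warmup_decay` is hypothetical.

```python
def linear_warmup_decay(step, peak_lr, warmup_steps, total_steps):
    """Learning rate at a given update: linear ramp up to peak_lr, then linear decay to 0."""
    if step <= warmup_steps:
        return peak_lr * (step / warmup_steps)
    if step >= total_steps:
        return 0.0
    # Fraction of the decay phase that is still ahead of us.
    remaining = 1.0 - (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * remaining

# Values taken from the configuration cell above.
PEAK_LR, TOTAL_STEPS = 1e-5, 1200
WARMUP_STEPS = int(0.06 * TOTAL_STEPS)  # warmup covers 6% of the updates

for step in (0, WARMUP_STEPS // 2, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    lr = linear_warmup_decay(step, PEAK_LR, WARMUP_STEPS, TOTAL_STEPS)
    print(f"update {step:4d}: lr = {lr:.2e}")
```

The learning rate only reaches `LR` at the end of the warmup, which is why the lab statement calls it the max learning rate.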

In [None]:
for SEED in range(SEEDS):
  # Shared run name so that logs and checkpoints of a given seed end up in matching folders
  RUN_NAME = f"{TASK}/{DATA_SET}/{MODEL}_ms{MAX_SENTENCES}_mu{MAX_UPDATE}_lr{LR}_me{MAX_EPOCH}/{SEED}"
  TENSORBOARD_LOGS = f"tensorboard_logs/{RUN_NAME}"
  SAVE_DIR = f"checkpoints/{RUN_NAME}"
  !(python ~/fairseq/fairseq_cli/train.py \
                $DATA_PATH \
                --restore-file $MODEL_PATH \
                --batch-size $MAX_SENTENCES \
                --task $TASK \
                --update-freq 1 \
                --seed $SEED \
                --reset-optimizer --reset-dataloader --reset-meters \
                --init-token 0 \
                --separator-token 2 \
                --arch $ARCHITECTURE \
                --criterion sentence_prediction \
                --num-classes $NUM_CLASSES \
                --weight-decay 0.01 \
                --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-08 \
                --maximize-best-checkpoint-metric \
                --best-checkpoint-metric $METRIC \
                --save-dir $SAVE_DIR \
                --lr-scheduler polynomial_decay \
                --lr $LR \
                --max-update $MAX_UPDATE \
                --total-num-update $MAX_UPDATE \
                --no-epoch-checkpoints \
                --no-last-checkpoints \
                --tensorboard-logdir $TENSORBOARD_LOGS \
                --log-interval 5 \
                --warmup-updates $WARMUP \
                --max-epoch $MAX_EPOCH \
                --keep-best-checkpoints 1 \
                --max-positions 256 \
                --valid-subset $VALID_SUBSET \
                --shorten-method 'truncate' \
                --no-save \
                --distributed-world-size 1)
