# Task 2: Preprocessing the dataset
Prepare the data for finetuning. To do so, you have to:
1. Use the provided sentencepiece model to tokenize the text.
2. Binarize the data using the `fairseq-preprocess` command.

## Tokenizing the reviews

In this section we will tokenize the finetuning dataset using the sentencepiece tokenizer. The dataset has three splits: train, valid and test.

In this task you have to use the trained sentencepiece tokenizer (RoBERTa_small_fr/sentencepiece.bpe.model) to tokenize the three files <b>train.review</b>, <b>valid.review</b> and <b>test.review</b> and output the three files <b>train.spm.review</b>, <b>valid.spm.review</b> and <b>test.spm.review</b> containing the tokenized reviews.

Documentation: https://github.com/google/sentencepiece#readme

In [6]:
import sentencepiece as spm

# Load the pretrained sentencepiece model provided with RoBERTa_small_fr
s = spm.SentencePieceProcessor(model_file='../models/RoBERTa_small_fr/sentencepiece.bpe.model')

SPLITS = ['train', 'test', 'valid']
SENTS = "review"
DATA_BOOKS = '../data/cls.books/'
for split in SPLITS:
    # Read the raw reviews, one review per line
    with open(DATA_BOOKS + split + '.' + SENTS, 'r', encoding='utf-8') as f:
        reviews = f.readlines()
    print(reviews[0][:80])

    # Tokenize each review with s.encode and join the pieces with spaces.
    # It should look something like this:
    # ▁An ci enne ▁VS ▁Nouvelle ▁version ▁plus
    tokenized = [" ".join(s.encode(review.strip(), out_type=str)) for review in reviews]

    # Write the tokenized reviews, one review per line
    with open(DATA_BOOKS + split + '.spm.' + SENTS, 'w', encoding='utf-8') as f:
        f.write("\n".join(tokenized) + '\n')

Ce livre est tout simplement magique !  il vaut le detour suspense, humour, magi
J'ai lu ce livre car dans ma ville, tout le monde s'en sert et le commande. C'es
Ce livre explique techniquement et de façon très compréhensible, même pour des n


In [5]:
print(tokenized[0][:80])

▁Ce ▁livre ▁explique ▁technique ment ▁et ▁de ▁façon ▁très ▁com pré hen sible , ▁
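
Before binarizing, it can help to confirm that each tokenized file still has exactly one line per original review, since the reviews and the labels are aligned line by line. The snippet below is a minimal sketch of such a check, assuming the files live in `DATA_BOOKS` as in the tokenization cell; the helper name `count_lines` is hypothetical.

```python
import os

def count_lines(path):
    """Return the number of lines in a UTF-8 text file."""
    with open(path, 'r', encoding='utf-8') as f:
        return sum(1 for _ in f)

DATA_BOOKS = '../data/cls.books/'  # assumed: same directory as in the tokenization cell
for split in ['train', 'valid', 'test']:
    raw = DATA_BOOKS + split + '.review'
    tok = DATA_BOOKS + split + '.spm.review'
    if not (os.path.exists(raw) and os.path.exists(tok)):
        print(f'{split}: files not found, skipping')
        continue
    n_raw, n_tok = count_lines(raw), count_lines(tok)
    status = 'OK' if n_raw == n_tok else 'MISMATCH'
    print(f'{split}: {n_raw} reviews, {n_tok} tokenized lines -> {status}')
```

A mismatch usually means that a review was split over several lines or that an empty line was dropped when writing the output.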


## <b>Binarizing/Preprocessing the finetuning dataset</b>

In this section, you have to binarize the CLS_Books dataset using the <b>fairseq/fairseq_cli/preprocess.py</b> script:

1- Binarize the tokenized reviews and put the output in <b>data/cls-books-bin/input0</b>. Note: our pretrained model's embedding matrix only contains embeddings for the vocabulary listed in the dictionary <b>dict.txt</b>, so you need to pass this dictionary to the binarization step, which uses it to map tokens to indices. Also note that we are using an encoder-only architecture, so we only have source data.

2- Binarize the labels (train.label, valid.label and test.label files) and put the output in <b>data/cls-books-bin/label</b>.

Documentation: https://fairseq.readthedocs.io/en/latest/command_line_tools.html
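
To make the role of the dictionary concrete, here is a simplified sketch of the token-to-index mapping. It is not fairseq's actual code: it assumes the usual `dict.txt` layout of one `token count` pair per line and fairseq's default special symbols at indices 0 to 3, and the names `load_vocab`, `encode_line` and `toy_dict` are hypothetical.

```python
# Special symbols that fairseq's default Dictionary places at indices 0..3 (assumption)
SPECIALS = ['<s>', '<pad>', '</s>', '<unk>']

def load_vocab(lines):
    """Build a token -> index map from 'token count' lines, after the special symbols."""
    vocab = {tok: i for i, tok in enumerate(SPECIALS)}
    for line in lines:
        line = line.rstrip('\n')
        if not line:
            continue
        token = line.rsplit(' ', 1)[0]  # drop the trailing count
        vocab.setdefault(token, len(vocab))
    return vocab

def encode_line(line, vocab):
    """Map space-separated pieces to indices; unknown pieces become <unk>, then </s> is appended."""
    unk, eos = vocab['<unk>'], vocab['</s>']
    return [vocab.get(tok, unk) for tok in line.split()] + [eos]

toy_dict = ['▁livre 120', '▁Ce 95', 'ment 40']  # toy entries, not the real dict.txt
vocab = load_vocab(toy_dict)
print(encode_line('▁Ce ▁livre magique', vocab))  # → [5, 4, 3, 2]
```

Any piece that is missing from the dictionary collapses to the `<unk>` index, which is why the text has to be tokenized with the same sentencepiece model and binarized with the same dictionary that the model was pretrained with.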


## Binarization

`--only-source` : Only process the source language

In [14]:
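# binarize the tokenized reviews
# the input files are passed with --trainpref/--validpref/--testpref
!(python ~/fairseq/fairseq_cli/preprocess.py \
              --only-source \
              --workers 8 \
              --trainpref ../data/cls.books/train.spm.review \
              --validpref ../data/cls.books/valid.spm.review \
              --testpref ../data/cls.books/test.spm.review \
              --srcdict ../data/books/dict.txt \
              --destdir ../data/cls-books-bin/input0 \
)
# --srcdict must point to the pretrained model's dict.txt (adjust the path to where it is stored)

# binarize the labels (label files assumed to sit next to the review files)
# no --srcdict here: the label dictionary is built from the label files themselves
!(python ~/fairseq/fairseq_cli/preprocess.py \
              --only-source \
              --workers 8 \
              --trainpref ../data/cls.books/train.label \
              --validpref ../data/cls.books/valid.label \
              --testpref ../data/cls.books/test.label \
              --destdir ../data/cls-books-bin/label \
)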
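
Once both binarization commands have run, a quick way to check the result is to look for the files that `fairseq-preprocess` is expected to write. The sketch below assumes the layout usually produced with `--only-source` and no language code (a `dict.txt` plus one `.bin`/`.idx` pair per split); the helper name `missing_outputs` is hypothetical and the output directory is the one named in the instructions above.

```python
import os

def missing_outputs(destdir, splits=('train', 'valid', 'test')):
    """Return the expected fairseq-preprocess outputs that are absent from destdir."""
    expected = ['dict.txt']
    for split in splits:
        expected += [split + '.bin', split + '.idx']
    return [name for name in expected
            if not os.path.exists(os.path.join(destdir, name))]

for sub in ['input0', 'label']:
    destdir = os.path.join('../data/cls-books-bin', sub)  # assumed output root
    missing = missing_outputs(destdir)
    if missing:
        print(destdir, '-> missing:', ', '.join(missing))
    else:
        print(destdir, '-> complete')
```

If a split is reported missing, the corresponding `--trainpref`, `--validpref` or `--testpref` argument is the first thing to check.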



In [8]:
# from fairseq_cli.preprocess import cli_main as preprocess_cli
# preprocess_cli()

