<a href="https://colab.research.google.com/github/UL-FRI-NLP-Course-2022-23/nlp-course-mbj/blob/main/NLP_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

If you're opening this notebook on Colab, you will probably need to install 🤗 Transformers, ROUGE score and the other dependencies.

In [None]:
! pip install transformers rouge-score nltk datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.28.1-py3-none-any.whl (7.0 MB)
Collecting rouge-score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Collecting datasets
  Downloading datasets-2.11.0-py3-none-any.whl (468 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.13.4-py3-none-any.whl (200 kB)

In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

First, create an authentication token on the Hugging Face website (sign up there if you haven't already), then execute the following cell and paste the token when prompted:

In [None]:
from huggingface_hub import notebook_login

notebook_login()

Token is valid.
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.cache/huggingface/token
Login successful


Then you need to install Git-LFS.

In [None]:
!apt install git-lfs

Reading package lists... Done
Building dependency tree       
Reading state information... Done
git-lfs is already the newest version (2.9.2-1).
0 upgraded, 0 newly installed, 0 to remove and 24 not upgraded.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Make sure your version of Transformers is at least 4.11.0 since the functionality was introduced in that version:

In [None]:
import transformers

print(transformers.__version__)

4.28.1
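The version requirement can also be checked programmatically. The following is a minimal sketch, assuming plain `major.minor.patch` version strings; `version_tuple` and `is_at_least` are hypothetical helpers written for illustration, not part of Transformers. Comparing integer tuples avoids the pitfall of comparing version strings lexicographically, where `"4.9.2"` would sort after `"4.11.0"`.

```python
from itertools import takewhile


def version_tuple(version: str) -> tuple:
    """Turn '4.28.1' into (4, 28, 1); non-numeric suffixes such as 'dev0' are ignored."""
    parts = []
    for piece in version.split(".")[:3]:
        # Keep only the leading digits of each component.
        digits = "".join(takewhile(str.isdigit, piece))
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def is_at_least(installed: str, required: str) -> bool:
    """Compare versions numerically, component by component."""
    return version_tuple(installed) >= version_tuple(required)


print(is_at_least("4.28.1", "4.11.0"))  # True
print(is_at_least("4.9.2", "4.11.0"))   # False
```

In the notebook, the first argument would be `transformers.__version__`.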


# Fine-tuning a model on a translation/paraphrasing task

Choose which model checkpoint to use

In [None]:
model_checkpoint = "cjvt/t5-sl-small"
# model_checkpoint = "cjvt/t5-sl-large"

## Loading the dataset

In [None]:
import os
from datasets import DatasetDict, Dataset
import pandas as pd
# %%
# path for when you use uploaded files:
# directory_path = os.getcwd()

# path for files from drive:
directory_path = "/content/drive/MyDrive/MAG-1/NLP/IMP-corpus-csv-sentence"


# Read all .csv files in the directory into a list of pandas DataFrames
dfs = []
for file in os.listdir(directory_path):
    if file.endswith(".csv"):
        file_path = os.path.join(directory_path, file)
        df = pd.read_csv(file_path)
        dfs.append(df)

# Concatenate all DataFrames into one combined DataFrame
combined_df = pd.concat(dfs, ignore_index=True)

# Convert the combined DataFrame to a Hugging Face Dataset
dataset = Dataset.from_pandas(combined_df)

# Shuffle once, then split 70/15/15 into train, validate, and test sets
shuffled = dataset.shuffle(seed=42)
train_end = int(len(shuffled) * 0.7)
valid_end = int(len(shuffled) * 0.85)
train_dataset = shuffled.select(range(train_end))
valid_dataset = shuffled.select(range(train_end, valid_end))
test_dataset = shuffled.select(range(valid_end, len(shuffled)))

# Create a DatasetDict object with train, validate, and test sets
dataset_dict = DatasetDict({
    'train': train_dataset,
    'validate': valid_dataset,
    'test': test_dataset
})

# Access the individual datasets using dictionary-like syntax
train_dataset = dataset_dict['train']
valid_dataset = dataset_dict['validate']
test_dataset = dataset_dict['test']

In [None]:
dataset_dict

DatasetDict({
    train: Dataset({
        features: ['orig', 'reg', 'lemma'],
        num_rows: 696299
    })
    validate: Dataset({
        features: ['orig', 'reg', 'lemma'],
        num_rows: 149207
    })
    test: Dataset({
        features: ['orig', 'reg', 'lemma'],
        num_rows: 149207
    })
})
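The split sizes above follow directly from the 70/15/15 boundaries. As a quick sanity check, this sketch reproduces the arithmetic in plain Python; `split_sizes` is a hypothetical helper, and the total of 994713 rows is simply the sum of the three reported `num_rows` values.

```python
def split_sizes(n_rows: int, train_end_frac: float = 0.7, valid_end_frac: float = 0.85):
    """Return (train, validate, test) sizes for index-based splitting."""
    train_end = int(n_rows * train_end_frac)   # rows [0, train_end) go to train
    valid_end = int(n_rows * valid_end_frac)   # rows [train_end, valid_end) go to validate
    return train_end, valid_end - train_end, n_rows - valid_end


n_rows = 696299 + 149207 + 149207  # total rows reported in the DatasetDict
print(split_sizes(n_rows))  # (696299, 149207, 149207)
```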

In [None]:
print(dataset_dict['train'][0])

{'orig': 'Veliko konj je popadalo tudi v neizmérjene prepadi v gnječi, zato ker je bila soteska na obeh straneh stermoglava in pretergana; prekucnilo se je tudi nekoliko vojščakov, pa tudi tovorna živina se je valila s tovori navzdol, kakor razvaline največjega poslopja.', 'reg': 'veliko konj je popadalo tudi v neizmerjene prepadi v gnječi, zato ker je bila soteska na obeh straneh stermoglava in pretrgana; prekucnilo se je tudi nekoliko vojščakov, pa tudi tovorna živina se je valila s tovori navzdol, kakor razvaline največjega poslopja.', 'lemma': 'veliko konj biti popadati tudi v neizmerjen prepad v gnječa  zato ker biti biti soteska na oba stran stermoglav in pretrgan  prekucniti se biti tudi nekoliko vojščak  pa tudi tovoren živina se biti valiti z tovor navzdol  kakor razvalina velik poslopje'}


We evaluate the model with ROUGE, loaded through 🤗 Datasets. The metric is an instance of [`datasets.Metric`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Metric):

In [None]:
from datasets import load_metric
metric = load_metric("rouge")
metric


Downloading builder script:   0%|          | 0.00/2.17k [00:00<?, ?B/s]

Metric(name: "rouge", features: {'predictions': Value(dtype='string', id='sequence'), 'references': Value(dtype='string', id='sequence')}, usage: """
Calculates average rouge scores for a list of hypotheses and references
Args:
    predictions: list of predictions to score. Each prediction
        should be a string with tokens separated by spaces.
    references: list of reference for each prediction. Each
        reference should be a string with tokens separated by spaces.
    rouge_types: A list of rouge types to calculate.
        Valid names:
        `"rouge{n}"` (e.g. `"rouge1"`, `"rouge2"`) where: {n} is the n-gram based scoring,
        `"rougeL"`: Longest common subsequence based scoring.
        `"rougeLSum"`: rougeLsum splits text using `"
"`.
        See details in https://github.com/huggingface/datasets/issues/617
    use_stemmer: Bool indicating whether Porter stemmer should be used to strip word suffixes.
    use_aggregator: Return aggregates if this is set to True
Retu
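To make the score less of a black box, here is a simplified sketch of what ROUGE-1 measures: the F-measure of unigram overlap between a prediction and a reference. It assumes whitespace tokenization only; the `rouge_score` package additionally normalizes the text and can apply stemming, so its numbers may differ. `rouge1_f` is a hypothetical helper, not the library's API.

```python
from collections import Counter


def rouge1_f(prediction: str, reference: str) -> float:
    """Unigram-overlap F-measure on whitespace tokens (simplified ROUGE-1)."""
    pred_counts = Counter(prediction.split())
    ref_counts = Counter(reference.split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Three of four tokens match, so precision = recall = 0.75.
print(rouge1_f("veliko konj je popadalo", "veliko konj je padlo"))  # 0.75
```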

## Preprocessing data

Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers Tokenizer, which will (as the name indicates) tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary) and put them in the format the model expects, as well as generate the other inputs that the model requires.

To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the cell.

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

Downloading (…)okenizer_config.json:   0%|          | 0.00/1.92k [00:00<?, ?B/s]

Downloading spiece.model:   0%|          | 0.00/797k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/2.34M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/1.79k [00:00<?, ?B/s]

You can directly call this tokenizer on one sentence or a list of sentences:

In [None]:
tokenizer("Pozdrav, to je ena poved!")

{'input_ids': [13220, 31354, 130, 34, 584, 280, 31349, 31413, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [None]:
tokenizer(["Pozdrav, to je prva poved.","To pa je druga."])

{'input_ids': [[13220, 31354, 130, 34, 1836, 280, 31349, 31358, 1], [352, 78, 34, 879, 31358, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}

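To prepare the targets for our model, we tokenize them inside the `as_target_tokenizer` context manager, which makes sure the tokenizer uses the special tokens corresponding to the targets. (Recent Transformers releases deprecate this context manager in favor of the `text_target` argument of the tokenizer call.) For this T5 tokenizer, inputs and targets are tokenized the same way, so the output matches the previous cell: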

In [None]:
with tokenizer.as_target_tokenizer():
    print(tokenizer(["Pozdrav, to je prva poved.","To pa je druga."]))

{'input_ids': [[13220, 31354, 130, 34, 1836, 280, 31349, 31358, 1], [352, 78, 34, 879, 31358, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}




If you are using one of the T5 checkpoints listed in the next cell, we have to prefix the inputs with "translate: " (T5 models are trained on several tasks, and the prefix tells the model which one it has to perform).

In [None]:
if model_checkpoint in ["cjvt/t5-sl-small", "cjvt/t5-sl-large", "t5-small", "t5-large"]:
    prefix = "translate: "
else:
    prefix = ""

We can then write the function that will preprocess our samples. We just feed them to the `tokenizer` with the argument `truncation=True`, which ensures that an input longer than the given `max_length` (which should not exceed what the selected model can handle) is truncated to that length. The padding will be dealt with later on (in a data collator), so that we pad examples to the longest length in the batch and not in the whole dataset.

In [None]:
max_input_length = 1024
max_target_length = 128

input_param = "reg"
target_param = "orig"

def preprocess_function(examples):
    inputs = [prefix + original_text for original_text in examples[input_param]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples[target_param], max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

This function works with one or several examples. In the case of several examples, the tokenizer will return a list of lists for each key:

In [None]:
preprocess_function(dataset_dict['train'][:2])

{'input_ids': [[1647, 25, 36, 31388, 510, 14509, 34, 18, 5203, 27, 126, 9, 5238, 28, 6087, 54, 121, 42, 9, 128, 49, 73, 31354, 507, 512, 34, 390, 68, 36, 234, 24, 2380, 6650, 4, 215, 39, 502, 32, 29, 27614, 6, 31401, 54, 102, 31360, 6135, 45, 34, 126, 2058, 13882, 418, 158, 31354, 78, 126, 7547, 1638, 1661, 6, 45, 34, 312, 6528, 4, 130, 764, 9308, 31354, 2191, 101, 371, 14, 14424, 15590, 22, 31358, 1], [1647, 25, 36, 31388, 11649, 29, 37, 439, 751, 153, 374, 618, 31401, 2091, 10613, 34, 10673, 1311, 31354, 34, 192, 8039, 11649, 29, 37, 439, 751, 31358, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'labels': [[3227, 14509, 34, 18, 5203, 27, 126, 9, 5238, 31351, 31439, 6087, 54, 121, 42, 9
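The shape of this output (one list per example under each key, truncated but not yet padded) can be illustrated without the real tokenizer. The sketch below uses a hypothetical `toy_tokenize` stand-in that assigns one id per whitespace token; the real tokenizer works on subword pieces from its SentencePiece vocabulary. The only detail borrowed from the outputs above is that every sequence ends with id 1, the end-of-sequence token.

```python
EOS_ID = 1  # end-of-sequence id, as seen at the end of each input_ids list above


def toy_tokenize(texts, max_length):
    """Hypothetical stand-in for the tokenizer: one id per whitespace token."""
    vocab = {}
    batch = {"input_ids": [], "attention_mask": []}
    for text in texts:
        # Assign ids on first sight, starting at 2 (0 and 1 are reserved here).
        ids = [vocab.setdefault(tok, len(vocab) + 2) for tok in text.split()]
        # Truncate so that the sequence, including EOS, fits in max_length.
        ids = ids[: max_length - 1] + [EOS_ID]
        batch["input_ids"].append(ids)
        batch["attention_mask"].append([1] * len(ids))
    return batch


prefix = "translate: "
out = toy_tokenize([prefix + "a b c d e", prefix + "a b"], max_length=5)
print(out["input_ids"])       # [[2, 3, 4, 5, 1], [2, 3, 4, 1]]
print(out["attention_mask"])  # [[1, 1, 1, 1, 1], [1, 1, 1, 1]]
```

The two sequences keep different lengths at this stage; bringing them to a common length is left to the data collator.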

To apply this function to all the sentence pairs in our dataset, we just use the `map` method of the `dataset_dict` object we created earlier. This will apply the function to all the elements of all the splits in `dataset_dict`, so our training, validation and testing data will be preprocessed in one single command.

In [None]:
tokenized_datasets = dataset_dict.map(preprocess_function, batched=True)

Map:   0%|          | 0/696299 [00:00<?, ? examples/s]

Map:   0%|          | 0/149207 [00:00<?, ? examples/s]

Map:   0%|          | 0/149207 [00:00<?, ? examples/s]

Even better, the results are automatically cached by the 🤗 Datasets library to avoid spending time on this step the next time you run your notebook. The 🤗 Datasets library is normally smart enough to detect when the function you pass to `map` has changed (and the cached data must therefore not be reused). For instance, it will properly detect if you change the task in the first cell and rerun the notebook. 🤗 Datasets warns you when it uses cached files; you can pass `load_from_cache_file=False` in the call to `map` to ignore the cached files and force the preprocessing to be applied again.

Note that we passed `batched=True` to encode the texts by batches together. This is to leverage the full benefit of the fast tokenizer we loaded earlier, which will use multi-threading to treat the texts in a batch concurrently.

## Fine-tuning the model

Now that our data is ready, we can download the pretrained model and fine-tune it. Since our task is of the sequence-to-sequence kind, we use the AutoModelForSeq2SeqLM class. Like with the tokenizer, the from_pretrained method will download and cache the model for us.

In [None]:
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

# model = AutoModelForSeq2SeqLM.from_pretrained("cjvt/t5-sl-small")
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

Downloading (…)lve/main/config.json:   0%|          | 0.00/656 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/307M [00:00<?, ?B/s]

To instantiate a `Seq2SeqTrainer`, we will need to define three more things. The most important is the [`Seq2SeqTrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.Seq2SeqTrainingArguments), which is a class that contains all the attributes to customize the training. It requires one folder name, which will be used to save the checkpoints of the model, and all other arguments are optional:

In [None]:
batch_size = 16
model_name = model_checkpoint.split("/")[-1]
args = Seq2SeqTrainingArguments(
    f"{model_name}-finetuned-stara-slo",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=3,
    predict_with_generate=True,
    fp16=False,
    push_to_hub=True,
)

Here we set the evaluation to be done at the end of each epoch, tweak the learning rate, use the `batch_size` defined at the top of the cell and customize the weight decay. Since the `Seq2SeqTrainer` will save the model regularly and our dataset is quite large, we tell it to keep at most three checkpoints. Lastly, we use the `predict_with_generate` option (to properly generate translations) and leave mixed precision training disabled (`fp16=False`); enabling it can make training a bit faster on GPUs that support it.

The last argument sets everything up so we can push the model to the Hub regularly during training. Remove it if you didn't follow the installation steps at the top of the notebook. If you want to save your model locally under a name different from the repository it will be pushed to, or if you want to push your model under an organization rather than your own namespace, use the `hub_model_id` argument to set the repo name (it needs to be the full name, including your namespace: for instance "sgugger/t5-finetuned-xsum" or "huggingface/t5-finetuned-xsum").

Then, we need a special kind of data collator, which will not only pad the inputs to the maximum length in the batch, but also the labels:

In [None]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
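Conceptually, the collator pads each batch to its own longest sequence and pads the labels with -100, a value the loss function ignores. The following pure-Python sketch illustrates that idea only; `pad_batch` is a hypothetical helper, the pad id of 0 is an assumption (the T5 convention), and the real `DataCollatorForSeq2Seq` additionally returns tensors and can prepare decoder inputs when given the model.

```python
PAD_ID = 0           # assumption: T5-style pad token id
LABEL_PAD_ID = -100  # label positions with this value are ignored by the loss


def pad_batch(features):
    """Pad inputs and labels to the longest sequence in this batch only."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in, n_lab = len(f["input_ids"]), len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [PAD_ID] * (max_in - n_in))
        # The mask is 1 for real tokens and 0 for padding.
        batch["attention_mask"].append([1] * n_in + [0] * (max_in - n_in))
        batch["labels"].append(f["labels"] + [LABEL_PAD_ID] * (max_lab - n_lab))
    return batch


features = [
    {"input_ids": [5, 6, 7, 1], "labels": [8, 9, 1]},
    {"input_ids": [5, 1], "labels": [8, 1]},
]
batch = pad_batch(features)
print(batch["input_ids"])       # [[5, 6, 7, 1], [5, 1, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 0, 0]]
print(batch["labels"])          # [[8, 9, 1], [8, 1, -100]]
```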

The last thing to define for our Seq2SeqTrainer is how to compute the metrics from the predictions. We need to define a function for this, which will just use the metric we loaded earlier, and we have to do a bit of pre-processing to decode the predictions into texts:

In [None]:
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]
    
    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    # Extract a few results
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    
    # Add mean generated length
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)
    
    return {k: round(v, 4) for k, v in result.items()}
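Two steps in this function are easy to misread: swapping -100 back to the pad id before decoding, and computing `gen_len` as the mean number of non-pad tokens per prediction. The sketch below shows both on plain Python lists, with a pad id of 0 assumed for illustration; `restore_padding` and `mean_generated_length` are hypothetical helpers that mirror the `np.where` and `np.count_nonzero` calls above.

```python
PAD_ID = 0  # assumption: pad token id of the tokenizer


def restore_padding(labels):
    """Replace the -100 loss mask with the pad id so the rows can be decoded."""
    return [[tok if tok != -100 else PAD_ID for tok in row] for row in labels]


def mean_generated_length(predictions):
    """Average count of non-pad tokens per generated sequence."""
    lengths = [sum(1 for tok in row if tok != PAD_ID) for row in predictions]
    return sum(lengths) / len(lengths)


print(restore_padding([[8, 9, 1], [8, 1, -100]]))         # [[8, 9, 1], [8, 1, 0]]
print(mean_generated_length([[0, 8, 9, 1], [0, 8, 1, 0]]))  # 2.5
```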

Then we just need to pass all of this along with our datasets to the `Seq2SeqTrainer`:

In [None]:
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validate"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

Cloning https://huggingface.co/martinjurkovic/t5-sl-small-finetuned-stara-slo into local empty directory.


We can now finetune our model by just calling the `train` method:

In [None]:
trainer.train()

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Gen Len |
|---|---|---|---|---|---|---|---|
| 1 | 0.5519 | 0.412658 | 81.4809 | 74.0847 | 81.4683 | 81.462 | 14.8144 |
| 2 | 0.4552 | 0.342751 | 81.6942 | 74.33 | 81.6805 | 81.6769 | 14.8217 |
| 3 | 0.4164 | 0.328883 | 81.7571 | 74.4114 | 81.7438 | 81.7405 | 14.822 |


Several commits (2) will be pushed upstream.


TrainOutput(global_step=130557, training_loss=0.5573156179848864, metrics={'train_runtime': 23397.3442, 'train_samples_per_second': 89.279, 'train_steps_per_second': 5.58, 'total_flos': 5.438211864643584e+16, 'train_loss': 0.5573156179848864, 'epoch': 3.0})
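The numbers in `TrainOutput` can be cross-checked against the configuration. This is a small sketch of the arithmetic, assuming a single device and no gradient accumulation (neither is stated explicitly in the notebook); `total_steps` is a hypothetical helper.

```python
import math


def total_steps(n_train: int, batch_size: int, epochs: int) -> int:
    """Optimizer steps when the last, smaller batch of each epoch is kept."""
    return math.ceil(n_train / batch_size) * epochs


n_train, batch_size, epochs = 696299, 16, 3
runtime_s = 23397.3442  # train_runtime reported above

steps = total_steps(n_train, batch_size, epochs)
print(steps)                                   # 130557, matches global_step
print(round(n_train * epochs / runtime_s, 3))  # 89.279, matches train_samples_per_second
print(round(steps / runtime_s, 2))             # 5.58, matches train_steps_per_second
```

The runtime of roughly 23,400 seconds corresponds to about six and a half hours of training.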

## Testing the model

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("martinjurkovic/t5-sl-small-finetuned-stara-slo")

model = AutoModelForSeq2SeqLM.from_pretrained("martinjurkovic/t5-sl-small-finetuned-stara-slo")