LAST LINKs:
* https://colab.research.google.com/drive/1WIk2bxglElfZewOHboPFNj8H44_VAyKE?usp=sharing#scrollTo=Gw3IZYrfKl4Z
* https://medium.com/analytics-vidhya/fine-tune-a-roberta-encoder-decoder-model-trained-on-mlm-for-text-generation-23da5f3c1858
* https://huggingface.co/course/chapter7/7?fw=tf

LINKs:
* https://github.com/huggingface/notebooks/blob/main/examples/question_answering.ipynb

* https://github.com/Michael-M-Mike/Unibo-NLP-Assignments/blob/main/A2_Seq2Seq_Abstractive_Question_Answering_(QA)_on_CoQA/distilroberta_42.ipynb

# Assignment 2

**Credits**: Andrea Galassi, Federico Ruggeri, Paolo Torroni

**Keywords**: Transformers, Question Answering, CoQA

## Deadlines

* **December 11**, 2022: deadline for having assignments graded by January 11, 2023
* **January 11**, 2023: deadline for half-point speed bonus per assignment
* **After January 11**, 2023: assignments are still accepted, but there will be no speed bonus

## Overview

### Problem

Question Answering (QA) on [CoQA](https://stanfordnlp.github.io/coqa/) dataset: a conversational QA dataset.

### Task

Given a question $Q$, a text passage $P$, the task is to generate the answer $A$.<br>
$\rightarrow A$ can be: (i) a free-form text or (ii) unanswerable;

**Note**: an question $Q$ can refer to previous dialogue turns. <br>
$\rightarrow$ dialogue history $H$ may be a valuable input to provide the correct answer $A$.

### Models

We are going to experiment with transformer-based models to define the following models:

1.  $A = f_\theta(Q, P)$

2. $A = f_\theta(Q, P, H)$

where $f_\theta$ is the transformer-based model we have to define with $\theta$ parameters.

## The CoQA dataset

<center>
    <img src="https://drive.google.com/uc?export=view&id=16vrgyfoV42Z2AQX0QY7LHTfrgektEKKh" width="750"/>
</center>

For detailed information about the dataset, feel free to check the original [paper](https://arxiv.org/pdf/1808.07042.pdf).



## Rationales

Each QA pair is paired with a rationale $R$: it is a text span extracted from the given text passage $P$. <br>
$\rightarrow$ $R$ is not a requested output, but it can be used as an additional information at training time!

## Dataset Statistics

* **127k** QA pairs.
* **8k** conversations.
* **7** diverse domains: Children's Stories, Literature, Mid/High School Exams, News, Wikipedia, Reddit, Science.
* Average conversation length: **15 turns** (i.e., QA pairs).
* Almost **half** of CoQA questions refer back to **conversational history**.
* Only **train** and **validation** sets are available.

## Dataset snippet

The dataset is stored in JSON format. Each dialogue is represented as follows:

```
{
    "source": "mctest",
    "id": "3dr23u6we5exclen4th8uq9rb42tel",
    "filename": "mc160.test.41",
    "story": "Once upon a time, in a barn near a farm house, there lived a little white kitten named Cotton. 
    Cotton lived high up in a nice warm place above the barn where all of the farmer's horses slept. [...]" % <-- $P$
    "questions": [
        {
            "input_text": "What color was Cotton?",   % <-- $Q_1$
            "turn_id": 1
        },
        {
            "input_text": "Where did she live?",
            "turn_id": 2
        },
        [...]
    ],
    "answers": [
        {
            "span_start": 59,   % <-- $R_1$ start index
            "spand_end": 93,    % <-- $R_1$ end index
            "span_text": "a little white kitten named Cotton",   % <-- $R_1$
            "input_text" "white",   % <-- $A_1$      
            "turn_id": 1
        },
        [...]
    ]
}
```

### Simplifications

Each dialogue also contains an additional field ```additional_answers```. For simplicity, we **ignore** this field and only consider one groundtruth answer $A$ and text rationale $R$.

CoQA only contains 1.3% of unanswerable questions. For simplicity, we **ignore** those QA pairs.

# [0] Functions and imports

In [None]:
# %%capture
# !pip install datasets
# !pip install transformers
# !pip install tensorflow_addons
# !pip install allennlp-models


# NOTE:
#     - SEED ED ERRORE
#     - LUNGHEZZA INPUTS E OUTPUTS - https://towardsdatascience.com/to-distil-or-not-to-distil-bert-roberta-and-xlnet-c777ad92f8
#     - GRANDEZZA DATASETS
#     - WARNINGS NELLA CREAZIONE DEL MODELLO E NEL TRAINING
#     - COME UTILIZZARE SPAN DI TESTO

In [1]:
from IPython.display import display_html, clear_output
from itertools import chain,cycle
from copy import deepcopy
import urllib.request
import numpy as np
import json
import time
import os
import torch
import random 
import pandas as pd
from tqdm import tqdm
import tensorflow as tf
import tensorflow_addons as tfa

# Display dataframes
def display(*args,titles=cycle([''])):
    html_str=''
    for df,title in zip(args, chain(titles,cycle(['</br>'])) ):
        html_str+='<th style="text-align:left"><td style="vertical-align:top">'
        html_str+=f'<h4 style="text-align: left;">{title}</h2>'
        html_str+=df.to_html().replace('table','table style="display:inline"')
        html_str+='</td></th>'
    display_html(html_str,raw=True)

def set_reproducibility(seed):
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)
    os.environ['TF_DETERMINISTIC_OPS'] = '1'

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        print(e)

  from .autonotebook import tqdm as notebook_tqdm


## [Task 1] Remove unaswerable QA pairs

Write your own script to remove unaswerable QA pairs from both train and validation sets.

## Dataset Download


In [2]:
class DownloadProgressBar(tqdm):
    def update_to(self, b=1, bsize=1, tsize=None):
        if tsize is not None:
            self.total = tsize
        self.update(b * bsize - self.n)
        
def download_url(url, output_path):
    with DownloadProgressBar(unit='B', unit_scale=True,
                             miniters=1, desc=url.split('/')[-1]) as t:
        urllib.request.urlretrieve(url, filename=output_path, reporthook=t.update_to)

def download_data(data_path, url_path, suffix):    
    if not os.path.exists(data_path):
        os.makedirs(data_path)
        
    data_path = os.path.join(data_path, f'{suffix}.json')

    if not os.path.exists(data_path):
        print(f"Downloading CoQA {suffix} data split... (it may take a while)")
        download_url(url=url_path, output_path=data_path)
        urllib.request.urlretrieve(url_path, filename=data_path)
        print("Download completed!")

In [3]:
# Train data
train_url = "https://nlp.stanford.edu/data/coqa/coqa-train-v1.0.json"
download_data(data_path='coqa', url_path=train_url, suffix='train')

# Test data
test_url = "https://nlp.stanford.edu/data/coqa/coqa-dev-v1.0.json"
download_data(data_path='coqa', url_path=test_url, suffix='test')  # <-- Why test? See next slides for an answer!

#### Data Inspection

Spend some time in checking accurately the dataset format and how to retrieve the tasks' inputs and outputs!

In [4]:
train_data = json.load((open('coqa/train.json')))
test_data = json.load((open('coqa/test.json')))

qas = pd.json_normalize(train_data['data'], ['questions'], ['source', 'id', 'story'])
ans = pd.json_normalize(train_data['data'], ['answers'],['id'])
train_val_df = pd.merge(qas,ans, left_on=['id','turn_id'], right_on=['id','turn_id'])
train_val_df = train_val_df.loc[train_val_df['input_text_y']!='unknown']

qas = pd.json_normalize(test_data['data'], ['questions'], ['source', 'id', 'story'])
ans = pd.json_normalize(test_data['data'], ['answers'],['id'])
test_df = pd.merge(qas,ans, left_on=['id','turn_id'], right_on=['id','turn_id'])
test_df = test_df.loc[test_df['input_text_y']!='unknown']

## [Task 2] Train, Validation and Test splits

CoQA only provides a train and validation set since the test set is hidden for evaluation purposes.

We'll consider the provided validation set as a test set. <br>
$\rightarrow$ Write your own script to:
* Split the train data in train and validation splits (80% train and 20% val)
* Perform splits such that a dialogue appears in one split only! (i.e., split at dialogue level)
* Perform splitting using the following seed for reproducibility: 42

#### Reproducibility Memo

Check back tutorial 2 on how to fix a specific random seed for reproducibility!

In [5]:
from sklearn.model_selection import GroupShuffleSplit
from datasets import *
from transformers import AutoTokenizer, PreTrainedTokenizerFast

import plotly.express as px

In [6]:
# set_reproducibility(42)

train_inds, val_inds = next(GroupShuffleSplit(test_size=.20, n_splits=2, random_state = 42).split(train_val_df, groups=train_val_df['story']))

train_df = train_val_df.iloc[train_inds]
val_df = train_val_df.iloc[val_inds].reset_index()

In [7]:
# train_df = train_df.replace(r'\n',' ', regex=True)
# val_df = val_df.replace(r'\n',' ', regex=True)
# test_df = test_df.replace(r'\n',' ', regex=True)

print(f'Training set [{train_df.shape}]')
print(f'\tFeatures: {list(train_df.columns)}')
display(train_df.loc[11:15,['id', 'input_text_x', 'input_text_y', 'span_text']])

print(f'Validation set [{val_df.shape}]')
print(f'\tFeatures: {list(val_df.columns)}')
display(val_df.loc[11:15,['id', 'input_text_x', 'input_text_y', 'span_text']])

print(f'\nTest set [{test_df.shape}]')
print(f'\tFeatures: {list(test_df.columns)}')
display(test_df.loc[11:15,['id', 'input_text_x', 'input_text_y', 'span_text']])

Training set [(85968, 11)]
	Features: ['input_text_x', 'turn_id', 'bad_turn_x', 'source', 'id', 'story', 'span_start', 'span_end', 'span_text', 'input_text_y', 'bad_turn_y']


Unnamed: 0,id,input_text_x,input_text_y,span_text
11,3zotghdk5ibi9cex97fepx7jetpso7,how many items are in this secret collection?,150000,"Vatican Secret Archives were separated from the library at the beginning of the 17th century; they contain another 150,000 items."
12,3zotghdk5ibi9cex97fepx7jetpso7,Can anyone use this library?,anyone who can document their qualifications and research needs.,The Vatican Library is open to anyone who can document their qualifications and research needs.
14,3zotghdk5ibi9cex97fepx7jetpso7,what must be requested in person or by mail?,Photocopies,Photocopies for private study of pages from books published between 1801 and 1990 can be requested in person or by mail.
15,3zotghdk5ibi9cex97fepx7jetpso7,of what books?,only books published between 1801 and 1990,hotocopies for private study of pages from books published between 1801 and 1990


Validation set [(21308, 12)]
	Features: ['index', 'input_text_x', 'turn_id', 'bad_turn_x', 'source', 'id', 'story', 'span_start', 'span_end', 'span_text', 'input_text_y', 'bad_turn_y']


Unnamed: 0,id,input_text_x,input_text_y,span_text
11,369j354ofdapu1z2ebz3jj2p5ajg6o,what time were they going to the hospital?,about 11 P. M.,about 11 P. M.
12,369j354ofdapu1z2ebz3jj2p5ajg6o,what is the mother's name?,May Hobson,"Her mother, May Hobson"
13,369j354ofdapu1z2ebz3jj2p5ajg6o,how old is she?,40,"May Hobson, 40"
14,33m4ia01qg1t26scv925i0tg4jhrx5,What did Neolithic follow?,Holocene Epipaleolithic period,the Neolithic followed the terminal Holocene Epipaleolithic period
15,33m4ia01qg1t26scv925i0tg4jhrx5,What was the Neolithic considered?,the last part of the Stone Age,considered the last part of the Stone Age



Test set [(7917, 9)]
	Features: ['input_text_x', 'turn_id', 'source', 'id', 'story', 'span_start', 'span_end', 'span_text', 'input_text_y']


Unnamed: 0,id,input_text_x,input_text_y,span_text
11,3dr23u6we5exclen4th8uq9rb42tel,Did they want Cotton to change the color of her fur?,no,We would never want you to be any other way
12,3azhrg4cu4ktme1zh7c2ro3pn2430d,what was the name of the fish,Asta.,Asta.
13,3azhrg4cu4ktme1zh7c2ro3pn2430d,What looked like a birds belly,a bottle,a bottle
14,3azhrg4cu4ktme1zh7c2ro3pn2430d,who said that,Asta.,"""It looks like a bird's belly,"" said Asta."
15,3azhrg4cu4ktme1zh7c2ro3pn2430d,Was Sharkie a friend?,Yes,Asta's friend Sharkie


Now we check if there is any overlapping dialogue between train and validation set.

In [8]:
set_train = set(train_df['id'])
set_val = set(val_df['id'])

overlap = False
for i in set_train:
    if i in set_val:
        overlap = True
        break

print('Overlap' if overlap else 'No overlap')

No overlap


In [9]:
features = ['story', 'input_text_x', 'span_text', 'input_text_y']
# features = ['id', 'story', 'input_text_x', 'span_text', 'input_text_y']

train_df_to_ds = train_df[features]
val_df_to_ds = val_df[features]
test_df_to_ds = test_df[features]

train_df_to_ds = train_df_to_ds.rename(columns={'input_text_x': 'question', 'story': 'context',\
                                               'input_text_y': 'answer', 'span_text': 'text'})
val_df_to_ds = val_df_to_ds.rename(columns={'input_text_x': 'question', 'story': 'context',\
                                               'input_text_y': 'answer', 'span_text': 'text'})
test_df_to_ds = test_df_to_ds.rename(columns={'input_text_x': 'question', 'story': 'context',\
                                               'input_text_y': 'answer', 'span_text': 'text'})

Now, since the dataset is huge and we are more focused on the reasoning behind our choices, we are going to extract a portion of it.
The next step is gonna be the truncation of the inputs lengths. The pre-trained models that are gonna be tested can process lengths up to 512, that is why our truncation will be at least equal to this value; moreover, we are going to sort the datasets according to the sum of the lengths of the 'context' and 'question' fields together, expecting to truncate the least possible number of examples.

In [55]:
# Combine 'context' and 'question' fields for each dataframe
train_df_to_ds['context_question'] = train_df_to_ds['context'] + train_df_to_ds['question']
val_df_to_ds['context_question'] = val_df_to_ds['context'] + val_df_to_ds['question']
test_df_to_ds['context_question'] = test_df_to_ds['context'] + test_df_to_ds['question']

# Define a function to compute the length of 'context_question'
def get_context_question_length(df):
    return (df['context_question'].apply(len))

# Compute the lengths of 'context_question' for each dataframe
train_lengths = get_context_question_length(train_df_to_ds)
val_lengths = get_context_question_length(val_df_to_ds)
test_lengths = get_context_question_length(test_df_to_ds)

# Sort each dataframe by length of 'context + question'
train_df_to_ds = train_df_to_ds.iloc[train_lengths.argsort()]
val_df_to_ds = val_df_to_ds.iloc[val_lengths.argsort()]
test_df_to_ds = test_df_to_ds.iloc[test_lengths.argsort()]

# Drop the 'context_question' column from each dataframe
train_df_to_ds = train_df_to_ds.drop('context_question', axis=1)
val_df_to_ds = val_df_to_ds.drop('context_question', axis=1)
test_df_to_ds = test_df_to_ds.drop('context_question', axis=1)

In [56]:
batch_size = 16
ratio = 1

train_samples = (round(train_df_to_ds.shape[0] * ratio / 100) // batch_size) * batch_size
val_samples = (round(val_df_to_ds.shape[0] * ratio / 100) // batch_size) * batch_size
test_samples = (round(test_df_to_ds.shape[0] * ratio / 100) // batch_size) * batch_size

train_dataset = Dataset.from_dict(train_df_to_ds.iloc[:train_samples])
val_dataset = Dataset.from_dict(val_df_to_ds.iloc[:val_samples])
test_dataset = Dataset.from_dict(test_df_to_ds.iloc[:test_samples])

dataset_COQA = DatasetDict({'train':train_dataset,'validation':val_dataset,'test':test_dataset})

In [57]:
dataset_COQA

DatasetDict({
    train: Dataset({
        features: ['id', 'context', 'question', 'text', 'answer'],
        num_rows: 848
    })
    validation: Dataset({
        features: ['id', 'context', 'question', 'text', 'answer'],
        num_rows: 208
    })
    test: Dataset({
        features: ['id', 'context', 'question', 'text', 'answer'],
        num_rows: 64
    })
})

In [58]:
inputs_lengths = [len(x[0])+len(x[1]) for x in zip(train_val_df['input_text_x'],\
                                               train_val_df['story'])]

max_length_input = min(512,round(np.quantile(list(set(inputs_lengths)), .05))) 
print(f'Max length:{max_length_input}')

# stride = int(max_length_input/3)
# print(f'Stride:{stride}')

fig_inputs = px.box(list(set(inputs_lengths)))
fig_inputs.show()

Max length:512


In [59]:
outputs_lengths = [len(x) for x in train_val_df['input_text_y']]

max_length_answer = round(np.quantile(list(set(outputs_lengths)), .75))
print(f'Max length (3rd quantile):{max_length_answer}')

fig_inputs = px.box(list(set(outputs_lengths)))
fig_inputs.show()

Max length (3rd quantile):160


In [60]:
def prepare_features(batch, tokenizer, max_length_input, max_length_answer):
    # Tokenize the Question, Context and text columns
    encoded_batch_inputs = tokenizer(
        batch['question'],
        batch['context'],
#         data['text'],
        add_special_tokens=False,
        max_length=max_length_input,
        truncation='only_second',
        padding='max_length'
    )

    # Tokenize the Answer column
    encoded_batch_labels = tokenizer(
        batch['answer'],
        max_length=max_length_answer,
        padding='max_length'
    )
    
    encoded_batch_inputs['labels'] = encoded_batch_labels.input_ids.copy()
    encoded_batch_inputs['decoder_input_ids'] = deepcopy(encoded_batch_inputs['labels'])
    encoded_batch_inputs['labels'] = [[-100 if token == tokenizer.pad_token_id else token\
                                    for token in labels]\
                                    for labels in encoded_batch_inputs['labels']]
    
    encoded_batch_inputs['labels_mask'] = encoded_batch_labels.attention_mask.copy()


    return encoded_batch_inputs

* [M1] DistilRoBERTa (distilberta-base)

In [61]:
model_checkpoint_M1 = 'distilroberta-base'
tokenizer_M1 = AutoTokenizer.from_pretrained(model_checkpoint_M1)
assert isinstance(tokenizer_M1, PreTrainedTokenizerFast)

# Setting the BOS and EOS token
tokenizer_M1.bos_token = tokenizer_M1.cls_token
tokenizer_M1.eos_token = tokenizer_M1.sep_token

tokenized_datasets_M1 = DatasetDict()

# Use the `prepare_features` functions
tokenized_datasets_M1 = dataset_COQA.map(
    lambda batch: prepare_features(batch, tokenizer_M1, max_length_input, max_length_answer),
    batched=True,
    batch_size=batch_size,
    remove_columns=dataset_COQA['train'].column_names
)

100%|██████████████████████████████████████████████████████████████████████████████████| 53/53 [00:00<00:00, 72.75ba/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 76.17ba/s]
100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 78.38ba/s]


In [62]:
tokenized_datasets_M1

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels', 'decoder_input_ids', 'labels_mask'],
        num_rows: 848
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels', 'decoder_input_ids', 'labels_mask'],
        num_rows: 208
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels', 'decoder_input_ids', 'labels_mask'],
        num_rows: 64
    })
})

* [M2] BERTTiny (bert-tiny)

In [63]:
model_checkpoint_M2 = 'prajjwal1/bert-tiny'
tokenizer_M2 = AutoTokenizer.from_pretrained(model_checkpoint_M2)
assert isinstance(tokenizer_M2, PreTrainedTokenizerFast)

# Setting the BOS and EOS token
tokenizer_M2.bos_token = tokenizer_M2.cls_token
tokenizer_M2.eos_token = tokenizer_M2.sep_token

tokenized_datasets_M2 = DatasetDict()

# Use the `prepare_features` functions
tokenized_datasets_M2 = dataset_COQA.map(
    lambda datarow: prepare_features(datarow, tokenizer_M2, max_length_input, max_length_answer),
    batched=True,
    batch_size=batch_size,
    remove_columns=dataset_COQA['train'].column_names
)

100%|██████████████████████████████████████████████████████████████████████████████████| 53/53 [00:00<00:00, 71.22ba/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 72.37ba/s]
100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 65.54ba/s]


In [64]:
tokenized_datasets_M2

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels', 'decoder_input_ids', 'labels_mask'],
        num_rows: 848
    })
    validation: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels', 'decoder_input_ids', 'labels_mask'],
        num_rows: 208
    })
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels', 'decoder_input_ids', 'labels_mask'],
        num_rows: 64
    })
})

## [Task 3] Model definition

Write your own script to define the following transformer-based models from [huggingface](https://HuggingFace.co/).

* [M1] DistilRoBERTa (distilberta-base)
* [M2] BERTTiny (bert-tiny)

**Note**: Remember to install the ```transformers``` python package!

**Note**: We consider small transformer models for computational reasons!

In [65]:
from transformers import TFEncoderDecoderModel

class EncoderDecoder(tf.keras.Model):
    """
    Custom keras model that wraps the TFEncoderDecoderModel
    """

    def __init__(self, model_checkpoint, **kwargs):
        super(EncoderDecoder, self).__init__(**kwargs)
        self.model_name = model_checkpoint

        # tie_encoder_decoder to share weights and half the number of parameters
        self.model = TFEncoderDecoderModel.from_encoder_decoder_pretrained(model_checkpoint, model_checkpoint,
                                                                           encoder_from_pt=True,
                                                                           decoder_from_pt=True,
                                                                           tie_encoder_decoder=True)

    def call(self, inputs, **kwargs):
        loss = self.model(input_ids=inputs['input_ids'],
                          attention_mask=inputs['attention_mask'],
                          decoder_input_ids=inputs['decoder_input_ids'],
                          decoder_attention_mask=inputs['labels_mask'],
                          labels=inputs['labels']).loss
        return loss

    def generate(self, **kwargs):
        return self.model.generate(decoder_start_token_id=self.model.config.decoder.pad_token_id,
                                   **kwargs)

When loading a pre-trained model into a target model, any layers present in the pre-trained model but not in the target model will be discarded. Conversely, any layers present in the target model but not in the pre-trained model will be initialized according to the initialization strategy of the target model.

This behavior is expected when using pre-trained models, and is due to the fact that the architecture of the target model may differ from that of the pre-trained model. It is important to note that this discrepancy in architecture does not necessarily imply that the target model will perform poorly out of the box. However, it is generally necessary to fine-tune the target model on a downstream task in order to achieve good performance.

* [M1] DistilRoBERTa (distilberta-base)

In [66]:
model_M1 = EncoderDecoder(model_checkpoint=model_checkpoint_M1)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFRobertaModel: ['lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.decoder.weight', 'lm_head.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing TFRobertaModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFRobertaModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFRobertaModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaModel for predictions without further training.
All model checkpoint layers were used when initializing TFRobertaModel.


In [67]:
model_M1.model.config.decoder_start_token_id = tokenizer_M1.cls_token_id
model_M1.model.config_eos_token_id = tokenizer_M1.sep_token_id
model_M1.model.config.pad_token_id = tokenizer_M1.pad_token_id
model_M1.model.config.vocab_size = model_M1.model.config.encoder.vocab_size

In [None]:
# model_M1.model.config.num_beams = 4
# model_M1.model.config.min_length = 1
# model_M1.model.config.length_penalty = 3.0
# model_M1.model.config.early_stopping = True
# model_M1.model.config.no_repeat_ngram_size = 3
# model_M1.model.config.max_length = max_length_answer

* [M2] BERTTiny (bert-tiny)

In [29]:
model_M2 = EncoderDecoder(model_checkpoint=model_checkpoint_M2)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.decoder.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'bert.embeddings.position_ids', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the 

In [30]:
model_M2.model.config.decoder_start_token_id = tokenizer_M2.cls_token_id
model_M2.model.config_eos_token_id = tokenizer_M2.sep_token_id
model_M2.model.config.pad_token_id = tokenizer_M2.pad_token_id
model_M2.model.config.vocab_size = model_M2.model.config.encoder.vocab_size

In [None]:
# model_M2.model.config.num_beams = 4
# model_M2.model.config.min_length = 1
# model_M2.model.config.length_penalty = 3.0
# model_M2.model.config.early_stopping = True
# model_M2.model.config.no_repeat_ngram_size = 3
# model_M2.model.config.max_length = max_length_answer

## [Task 4] Question generation with text passage $P$ and question $Q$

We want to define $f_\theta(P, Q)$. 

Write your own script to implement $f_\theta$ for each model: M1 and M2.

#### Formulation

Consider a dialogue on text passage $P$. 

For each question $Q_i$ at dialogue turn $i$, your model should take $P$ and $Q_i$ and generate $A_i$.

In [68]:
# from transformers import DefaultDataCollator, TrainingArguments, Trainer
from transformers import DataCollatorForSeq2Seq, TrainingArguments, Trainer

A dataset collator is a function used in data processing for deep learning models, 
especially in training and evaluation. It collates, or collects, several examples 
from a dataset into a batch and performs operations on the batch, such as padding 
or stacking. This is usually done to make the input data compatible with the model's 
batch size, which is the number of samples processed together in one forward/backward pass. 
The dataset collator takes care of the preprocessing required to format the examples in the batch, 
allowing the data to be efficiently processed by the deep learning framework.

In [69]:
data_collator_M1 = DataCollatorForSeq2Seq(tokenizer_M1, padding= 'max_length',\
                                           max_length=max_length_answer,return_tensors="tf")
# data_collator_M1 = DefaultDataCollator(return_tensors="tf")

In [77]:
class EncoderDecoderTrainer(object):
    """
    Simple wrapper class

    train_op -> uses tf.GradientTape to compute the loss
    batch_fit -> receives a batch and performs forward-backward passes (gradient included)
    """

    def __init__(self, keras_model):
        self.keras_model = keras_model
        self.optimizer = tf.keras.optimizers.Adam(learning_rate=5e-05)
        
    @tf.function
    def compute_loss(self, inputs):
        loss = self.keras_model(inputs=inputs)
        return tf.reduce_mean(loss)

    @tf.function
    def train_op(self, inputs):
        with tf.GradientTape() as tape:
            loss = self.compute_loss(inputs=inputs)

        grads = tape.gradient(loss, self.keras_model.trainable_variables)
        return loss, grads

    @tf.function
    def batch_fit(self, inputs):
        loss, grads = self.train_op(inputs=inputs)
        self.optimizer.apply_gradients(zip(grads, self.keras_model.trainable_variables))
        return loss
    
    @tf.function
    def evaluate(self, dataset):
        total_loss = 0.0
        all_predictions = []
        all_targets = []
        for batch in dataset:
            print('1')
            loss = self.compute_loss(inputs=batch)
            print('2')
            total_loss += loss
            print('3')
            generated = self.keras_model.generate(input_ids=batch['input_ids'],
                                                 max_length=max_length_answer,
                                                 repetition_penalty=3.,
                                                 min_length=5,
                                                 no_repeat_ngram_size=3,
                                                 early_stopping=True,
                                                 num_beams=4
                                                 )
            print('4')
            generated = tokenizer_M1.batch_decode(generated, skip_special_tokens=True)
            
            print('5')
            print(f'Generated: {generated}')
            
            print('6')
            all_predictions.extend(generated)
            print('7')
            all_targets.extend(tokenizer_M1.batch_decode(batch['labels'], skip_special_tokens=True))

        avg_loss = total_loss / len(dataset)
        metrics = compute_squad_metrics(all_predictions, all_targets)
        return avg_loss, metrics

* [M1] DistilRoBERTa (distilberta-base)

In [78]:
trainer_M1 = EncoderDecoderTrainer(keras_model=model_M1)

In [79]:
tf_train_M1 = tokenized_datasets_M1['train'].to_tf_dataset(batch_size=batch_size,\
                                                    collate_fn=data_collator_M1,\
                                                    drop_remainder=True)

tf_val_M1 = tokenized_datasets_M1['validation'].to_tf_dataset(batch_size=batch_size,\
                                                    collate_fn=data_collator_M1,\
                                                    drop_remainder=True)

tf_test_M1 = tokenized_datasets_M1['test'].to_tf_dataset(batch_size=batch_size,\
                                                    collate_fn=data_collator_M1,\
                                                    drop_remainder=True)

In [80]:
epochs = 3
history = []
for epoch in range(epochs):
    print(f"Epoch {epoch+1}/{epochs}")
    start_time_epoch = time.time()
    total_loss = 0.0
    num_batches = 0
    
    for batch in tqdm(tf_train_M1):
        start_time_batch = time.time()
        loss = trainer_M1.batch_fit(inputs=batch)
        end_time_batch = time.time()

        elapsed_time_batch = end_time_batch - start_time_batch
        
        clear_output(wait=True)
        print(f'\tBatch Loss: {loss} -- Elapsed Time: {elapsed_time_batch:.2f} seconds')

        total_loss += loss
        num_batches += 1
        
    end_time_epoch = time.time()
    elapsed_time_epoch = end_time_epoch - start_time_epoch
    
    avg_loss = total_loss / num_batches
    print(f"Epoch {epoch+1}/{epochs} -- Average Loss: {avg_loss:.4f} -- Elapsed Time: {elapsed_time_epoch:.2f} seconds")
    
    print("Validating...")
    val_loss, metrics = trainer_M1.evaluate(tf_val_M1)
    print(f"Validation Loss: {val_loss:.4f} -- Metrics: {metrics}")
    
    history.append({'epoch': epoch+1, 'total_loss': total_loss, 'avg_loss': avg_loss,\
                   'elapsed_time': elapsed_time_epoch, 'val_loss': val_loss,\
                   'metrics': metrics})

Epoch 1/3


  0%|                                                                                           | 0/53 [00:00<?, ?it/s]



  2%|█▌                                                                               | 1/53 [01:11<1:01:53, 71.42s/it]

	Batch Loss: 8.864155769348145 -- Elapsed Time: 71.33 seconds


  4%|███▏                                                                               | 2/53 [01:59<49:05, 57.76s/it]

	Batch Loss: 7.789313316345215 -- Elapsed Time: 48.17 seconds


  6%|████▋                                                                              | 3/53 [02:51<46:02, 55.24s/it]

	Batch Loss: 7.692808628082275 -- Elapsed Time: 52.25 seconds


  8%|██████▎                                                                            | 4/53 [03:52<46:43, 57.21s/it]

	Batch Loss: 5.9853386878967285 -- Elapsed Time: 60.22 seconds


  9%|███████▊                                                                           | 5/53 [04:56<47:43, 59.65s/it]

	Batch Loss: 6.584510326385498 -- Elapsed Time: 63.98 seconds


 11%|█████████▍                                                                         | 6/53 [05:54<46:17, 59.09s/it]

	Batch Loss: 6.3280792236328125 -- Elapsed Time: 57.99 seconds


 13%|██████████▉                                                                        | 7/53 [06:57<46:29, 60.65s/it]

	Batch Loss: 6.186530590057373 -- Elapsed Time: 63.85 seconds


 15%|████████████▌                                                                      | 8/53 [08:08<47:50, 63.80s/it]

	Batch Loss: 5.457001209259033 -- Elapsed Time: 70.54 seconds


 17%|██████████████                                                                     | 9/53 [09:13<47:06, 64.24s/it]

	Batch Loss: 6.1941304206848145 -- Elapsed Time: 65.22 seconds


 19%|███████████████▍                                                                  | 10/53 [10:18<46:13, 64.51s/it]

	Batch Loss: 6.519091606140137 -- Elapsed Time: 65.09 seconds


 21%|█████████████████                                                                 | 11/53 [11:24<45:24, 64.87s/it]

	Batch Loss: 5.995038032531738 -- Elapsed Time: 65.69 seconds


 23%|██████████████████▌                                                               | 12/53 [12:29<44:27, 65.05s/it]

	Batch Loss: 6.537737846374512 -- Elapsed Time: 65.46 seconds


 25%|████████████████████                                                              | 13/53 [13:38<44:09, 66.23s/it]

	Batch Loss: 6.2258195877075195 -- Elapsed Time: 68.93 seconds


 26%|█████████████████████▋                                                            | 14/53 [14:44<42:54, 66.02s/it]

	Batch Loss: 5.486377239227295 -- Elapsed Time: 65.53 seconds


 28%|███████████████████████▏                                                          | 15/53 [15:50<41:48, 66.02s/it]

	Batch Loss: 5.649102210998535 -- Elapsed Time: 66.01 seconds


 30%|████████████████████████▊                                                         | 16/53 [16:56<40:42, 66.02s/it]

	Batch Loss: 5.436712265014648 -- Elapsed Time: 66.00 seconds


 32%|██████████████████████████▎                                                       | 17/53 [18:01<39:25, 65.70s/it]

	Batch Loss: 4.884352684020996 -- Elapsed Time: 64.97 seconds


 34%|███████████████████████████▊                                                      | 18/53 [19:05<37:59, 65.13s/it]

	Batch Loss: 5.897516250610352 -- Elapsed Time: 63.78 seconds


 36%|█████████████████████████████▍                                                    | 19/53 [20:14<37:33, 66.28s/it]

	Batch Loss: 6.346921920776367 -- Elapsed Time: 68.97 seconds


 38%|██████████████████████████████▉                                                   | 20/53 [21:20<36:27, 66.30s/it]

	Batch Loss: 5.787376403808594 -- Elapsed Time: 66.33 seconds


 40%|████████████████████████████████▍                                                 | 21/53 [22:24<34:58, 65.59s/it]

	Batch Loss: 5.961485862731934 -- Elapsed Time: 63.92 seconds


 42%|██████████████████████████████████                                                | 22/53 [23:28<33:38, 65.12s/it]

	Batch Loss: 5.888381004333496 -- Elapsed Time: 64.03 seconds


 43%|███████████████████████████████████▌                                              | 23/53 [24:41<33:43, 67.44s/it]

	Batch Loss: 4.77360725402832 -- Elapsed Time: 72.85 seconds


 45%|█████████████████████████████████████▏                                            | 24/53 [25:54<33:25, 69.16s/it]

	Batch Loss: 5.445986270904541 -- Elapsed Time: 73.15 seconds


 47%|██████████████████████████████████████▋                                           | 25/53 [27:01<32:02, 68.65s/it]

	Batch Loss: 4.999186038970947 -- Elapsed Time: 67.46 seconds


 49%|████████████████████████████████████████▏                                         | 26/53 [28:06<30:21, 67.47s/it]

	Batch Loss: 5.641362190246582 -- Elapsed Time: 64.71 seconds


 51%|█████████████████████████████████████████▊                                        | 27/53 [29:11<28:50, 66.54s/it]

	Batch Loss: 6.148046970367432 -- Elapsed Time: 64.37 seconds


 53%|███████████████████████████████████████████▎                                      | 28/53 [30:15<27:25, 65.81s/it]

	Batch Loss: 5.988811016082764 -- Elapsed Time: 64.11 seconds


 55%|████████████████████████████████████████████▊                                     | 29/53 [31:19<26:06, 65.25s/it]

	Batch Loss: 5.748340129852295 -- Elapsed Time: 63.93 seconds


 57%|██████████████████████████████████████████████▍                                   | 30/53 [32:23<24:53, 64.95s/it]

	Batch Loss: 5.11627197265625 -- Elapsed Time: 64.25 seconds


 58%|███████████████████████████████████████████████▉                                  | 31/53 [33:27<23:42, 64.66s/it]

	Batch Loss: 5.746488571166992 -- Elapsed Time: 63.96 seconds


 60%|█████████████████████████████████████████████████▌                                | 32/53 [34:31<22:34, 64.50s/it]

	Batch Loss: 5.739840984344482 -- Elapsed Time: 64.13 seconds


 62%|███████████████████████████████████████████████████                               | 33/53 [35:35<21:29, 64.50s/it]

	Batch Loss: 5.478469371795654 -- Elapsed Time: 64.50 seconds


 64%|████████████████████████████████████████████████████▌                             | 34/53 [36:40<20:23, 64.39s/it]

	Batch Loss: 5.540102958679199 -- Elapsed Time: 64.14 seconds


 66%|██████████████████████████████████████████████████████▏                           | 35/53 [37:44<19:16, 64.25s/it]

	Batch Loss: 5.356662750244141 -- Elapsed Time: 63.91 seconds


 68%|███████████████████████████████████████████████████████▋                          | 36/53 [38:47<18:10, 64.13s/it]

	Batch Loss: 5.224983215332031 -- Elapsed Time: 63.85 seconds


 70%|█████████████████████████████████████████████████████████▏                        | 37/53 [39:51<17:05, 64.10s/it]

	Batch Loss: 5.711575031280518 -- Elapsed Time: 64.02 seconds


 72%|██████████████████████████████████████████████████████████▊                       | 38/53 [40:57<16:09, 64.65s/it]

	Batch Loss: 4.314576625823975 -- Elapsed Time: 65.93 seconds


 74%|████████████████████████████████████████████████████████████▎                     | 39/53 [42:02<15:03, 64.54s/it]

	Batch Loss: 5.217354774475098 -- Elapsed Time: 64.27 seconds


 75%|█████████████████████████████████████████████████████████████▉                    | 40/53 [43:06<13:58, 64.51s/it]

	Batch Loss: 4.623166561126709 -- Elapsed Time: 64.42 seconds


 77%|███████████████████████████████████████████████████████████████▍                  | 41/53 [44:05<12:33, 62.81s/it]

	Batch Loss: 4.035751819610596 -- Elapsed Time: 58.84 seconds


 79%|████████████████████████████████████████████████████████████████▉                 | 42/53 [45:01<11:09, 60.86s/it]

	Batch Loss: 4.308506011962891 -- Elapsed Time: 56.32 seconds


 81%|██████████████████████████████████████████████████████████████████▌               | 43/53 [45:57<09:53, 59.36s/it]

	Batch Loss: 4.1707048416137695 -- Elapsed Time: 55.86 seconds


 83%|████████████████████████████████████████████████████████████████████              | 44/53 [46:53<08:43, 58.21s/it]

	Batch Loss: 2.4980835914611816 -- Elapsed Time: 55.51 seconds


 85%|█████████████████████████████████████████████████████████████████████▌            | 45/53 [47:48<07:39, 57.44s/it]

	Batch Loss: 3.151451349258423 -- Elapsed Time: 55.63 seconds


 87%|███████████████████████████████████████████████████████████████████████▏          | 46/53 [48:44<06:38, 56.99s/it]

	Batch Loss: 2.9992587566375732 -- Elapsed Time: 55.95 seconds


 89%|████████████████████████████████████████████████████████████████████████▋         | 47/53 [49:41<05:40, 56.80s/it]

	Batch Loss: 2.476213216781616 -- Elapsed Time: 56.35 seconds


 91%|██████████████████████████████████████████████████████████████████████████▎       | 48/53 [50:37<04:42, 56.60s/it]

	Batch Loss: 2.3527889251708984 -- Elapsed Time: 56.12 seconds


 92%|███████████████████████████████████████████████████████████████████████████▊      | 49/53 [51:33<03:46, 56.52s/it]

	Batch Loss: 1.5260435342788696 -- Elapsed Time: 56.34 seconds


 94%|█████████████████████████████████████████████████████████████████████████████▎    | 50/53 [52:29<02:49, 56.46s/it]

	Batch Loss: 1.0197683572769165 -- Elapsed Time: 56.32 seconds


 96%|██████████████████████████████████████████████████████████████████████████████▉   | 51/53 [53:26<01:52, 56.42s/it]

	Batch Loss: 0.875504195690155 -- Elapsed Time: 56.33 seconds


 98%|████████████████████████████████████████████████████████████████████████████████▍ | 52/53 [54:22<00:56, 56.40s/it]

	Batch Loss: 0.7910298705101013 -- Elapsed Time: 56.34 seconds


100%|██████████████████████████████████████████████████████████████████████████████████| 53/53 [55:18<00:00, 62.62s/it]

	Batch Loss: 0.5292432308197021 -- Elapsed Time: 56.14 seconds
Epoch 1/3 -- Average Loss: 5.0047 -- Elapsed Time: 3318.76 seconds
Validating...





1
2
3
Instructions for updating:
Lambda fuctions will be no more assumed to be used in the statement where they are used, or at least in the same block. https://github.com/tensorflow/tensorflow/issues/56089


ValueError: in user code:

    File "C:\Users\Antonio\AppData\Local\Temp\ipykernel_13832\1059585463.py", line 43, in evaluate  *
        generated = self.keras_model.generate(input_ids=batch['input_ids'],
    File "C:\Users\Antonio\AppData\Local\Temp\ipykernel_13832\2328410840.py", line 28, in generate  *
        **kwargs)
    File "C:\Users\Antonio\anaconda3\envs\NLP\lib\site-packages\transformers\generation_tf_utils.py", line 590, in generate  *
        seed=model_kwargs.pop("seed", None),
    File "C:\Users\Antonio\anaconda3\envs\NLP\lib\site-packages\transformers\generation_tf_utils.py", line 1641, in _generate  *
        input_ids,
    File "C:\Users\Antonio\anaconda3\envs\NLP\lib\site-packages\transformers\generation_tf_utils.py", line 2708, in beam_search_body_fn  *
        model_inputs = self.prepare_inputs_for_generation(flatten_beam_dim(input_token), **model_kwargs)
    File "C:\Users\Antonio\anaconda3\envs\NLP\lib\site-packages\transformers\models\encoder_decoder\modeling_tf_encoder_decoder.py", line 684, in prepare_inputs_for_generation  *
        decoder_inputs = self.decoder.prepare_inputs_for_generation(input_ids, past=past)
    File "C:\Users\Antonio\anaconda3\envs\NLP\lib\site-packages\transformers\models\roberta\modeling_tf_roberta.py", line 1163, in prepare_inputs_for_generation  *
        attention_mask = tf.ones(input_shape)

    ValueError: Cannot convert a partially known TensorShape <unknown> to a Tensor.


In [None]:
for e in history:
    print(f"Epoch {e['epoch']}/{epochs} -- Average Loss: {e['avg_loss']:.4f} -- Elapsed Time: {e['elapsed_time']:.2f} seconds")
    print(f"\tValidation Loss: {e['val_loss']:.4f} -- Metrics: {e['metrics']}")

* [M2] BERTTiny (bert-tiny)

## [Task 6] Train and evaluate $f_\theta(P, Q)$

Write your own script to train and evaluate your $f_\theta(P, Q)$.

### Instructions

* Perform multiple train/evaluation seed runs: [42, 2022, 1337].$^1$
* Evaluate your models with the following metrics: SQUAD F1-score.$^2$
* Fine-tune each transformer-based models for **3 epochs**.
* Report evaluation SQUAD F1-score computed on the validation and test sets.

$^1$ Remember what we said about code reproducibility in Tutorial 2!

$^2$ You can use ```allennlp``` python package for a quick implementation of SQUAD F1-score: ```from allennlp_models.rc.tools import squad```. 

* [M1] DistilRoBERTa (distilberta-base)

In [None]:
# model_M1.save_model(f'{model_checkpoint_M1}-finetuned-coqa')

* [M2] BERTTiny (bert-tiny)

## [Task 6] Error Analysis

Perform a simple and short error analysis as follows:
* Group dialogues by ```source``` and report the worst 5 model errors for each source (w.r.t. SQUAD F1-score).
* Inspect observed results and try to provide some comments (e.g., do the models make errors when faced with a particular question type?)$^1$

$^1$ Check the [paper](https://arxiv.org/pdf/1808.07042.pdf) for some valuable information about question/answer types (e.g., Table 6, Table 8) 

# Assignment Evaluation

The following assignment points will be awarded for each task as follows:

* Task 1, Pre-processing $\rightarrow$ 0.5 points.
* Task 2, Dataset Splitting $\rightarrow$ 0.5 points.
* Task 3 and 4, Models Definition $\rightarrow$ 1.0 points.
* Task 5 and 6, Models Training and Evaluation $\rightarrow$ 2.0 points.
* Task 7, Analysis $\rightarrow$ 1.0 points.
* Report $\rightarrow$ 1.0 points.

**Total** = 6 points <br>

We may award an additional 0.5 points for outstanding submissions. 
 
**Speed Bonus** = 0.5 extra points <br>

# Report

We apply the rules described in Assignment 1 regarding the report.
* Write a clear and concise report following the given overleaf template (**max 2 pages**).
* Report validation and test results in a table.$^1$
* **Avoid reporting** code snippets or copy-paste terminal outputs $\rightarrow$ **Provide a clean schema** of what you want to show

# Comments and Organization

Remember to properly comment your code (it is not necessary to comment each single line) and don't forget to describe your work!

Structure your code for readability and maintenance. If you work with Colab, use sections. 

This allows you to build clean and modular code, as well as easy to read and to debug (notebooks can be quite tricky time to time).

# FAQ (READ THIS!)

---

**Question**: Does Task 3 also include data tokenization and conversion step?

**Answer:** Yes! These steps are usually straightforward since ```transformers``` also offers a specific tokenizer for each model.

**Example**: 

```
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
encoded_text = tokenizer(text)
%% Alternatively
inputs = tokenizer.tokenize(text, add_special_tokens=True, max_length=min(max_length, 512))
input_ids, attention_mask = inputs['input_ids'], inputs['attention_mask']
```

**Suggestion**: Hugginface's documentation is full of tutorials and user-friendly APIs.

---
---

**Question**: I'm hitting **out of memory error** when training my models, do you have any suggestions?

**Answer**: Here are some common workarounds:

1. Try decreasing the mini-batch size
2. Try applying a different padding strategy (if you are applying padding): e.g. use quantiles instead of maximum sequence length

---
---

# Contact

For any doubt, question, issue or help, you can always contact us at the following email addresses:

Teaching Assistants:

* Andrea Galassi -> a.galassi@unibo.it
* Federico Ruggeri -> federico.ruggeri6@unibo.it

Professor:

* Paolo Torroni -> p.torroni@unibo.it

# The End!

Questions?