LAST LINK:
* https://huggingface.co/course/chapter7/7?fw=tf

LINK:
* https://github.com/huggingface/notebooks/blob/main/examples/question_answering.ipynb

* https://github.com/Michael-M-Mike/Unibo-NLP-Assignments/blob/main/A2_Seq2Seq_Abstractive_Question_Answering_(QA)_on_CoQA/distilroberta_42.ipynb

# Assignment 2

**Credits**: Andrea Galassi, Federico Ruggeri, Paolo Torroni

**Keywords**: Transformers, Question Answering, CoQA

## Deadlines

* **December 11**, 2022: deadline for having assignments graded by January 11, 2023
* **January 11**, 2023: deadline for half-point speed bonus per assignment
* **After January 11**, 2023: assignments are still accepted, but there will be no speed bonus

## Overview

### Problem

Question Answering (QA) on the [CoQA](https://stanfordnlp.github.io/coqa/) dataset, a conversational QA dataset.

### Task

Given a question $Q$ and a text passage $P$, the task is to generate the answer $A$.<br>
$\rightarrow A$ can be: (i) a free-form text or (ii) unanswerable;

**Note**: a question $Q$ can refer to previous dialogue turns. <br>
$\rightarrow$ dialogue history $H$ may be a valuable input to provide the correct answer $A$.

### Models

We are going to experiment with transformer-based models to define the following models:

1.  $A = f_\theta(Q, P)$

2. $A = f_\theta(Q, P, H)$

where $f_\theta$ is the transformer-based model we have to define with $\theta$ parameters.
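
One possible way to feed the dialogue history $H$ to the second model is to prepend the most recent turns to the current question. The sketch below only illustrates that idea: the helper name `build_question_with_history`, the `max_turns` limit, and the plain-space separator are assumptions, not part of the assignment.

```python
def build_question_with_history(question, history, max_turns=2, sep=" "):
    """Prepend the last `max_turns` (question, answer) pairs to the current question.

    `history` is a list of (question, answer) tuples from earlier dialogue turns.
    `max_turns` and `sep` are illustrative choices, not prescribed by the assignment.
    """
    # history[-0:] would return the whole list, so handle max_turns == 0 explicitly
    recent = history[-max_turns:] if max_turns > 0 else []
    parts = [f"{q} {a}" for q, a in recent]
    parts.append(question)
    return sep.join(parts)


history = [("What color was Cotton?", "white")]
print(build_question_with_history("Where did she live?", history))
# What color was Cotton? white Where did she live?
```

With `history=[]` the function returns the question unchanged, which corresponds to the first model $f_\theta(Q, P)$.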

## The CoQA dataset

<center>
    <img src="https://drive.google.com/uc?export=view&id=16vrgyfoV42Z2AQX0QY7LHTfrgektEKKh" width="750"/>
</center>

For detailed information about the dataset, feel free to check the original [paper](https://arxiv.org/pdf/1808.07042.pdf).



## Rationales

Each QA pair comes with a rationale $R$: a text span extracted from the given text passage $P$. <br>
$\rightarrow$ $R$ is not a required output, but it can be used as additional information at training time!

## Dataset Statistics

* **127k** QA pairs.
* **8k** conversations.
* **7** diverse domains: Children's Stories, Literature, Mid/High School Exams, News, Wikipedia, Reddit, Science.
* Average conversation length: **15 turns** (i.e., QA pairs).
* Almost **half** of CoQA questions refer back to **conversational history**.
* Only **train** and **validation** sets are available.

## Dataset snippet

The dataset is stored in JSON format. Each dialogue is represented as follows:

```
{
    "source": "mctest",
    "id": "3dr23u6we5exclen4th8uq9rb42tel",
    "filename": "mc160.test.41",
    "story": "Once upon a time, in a barn near a farm house, there lived a little white kitten named Cotton. 
    Cotton lived high up in a nice warm place above the barn where all of the farmer's horses slept. [...]",   % <-- $P$
    "questions": [
        {
            "input_text": "What color was Cotton?",   % <-- $Q_1$
            "turn_id": 1
        },
        {
            "input_text": "Where did she live?",
            "turn_id": 2
        },
        [...]
    ],
    "answers": [
        {
            "span_start": 59,   % <-- $R_1$ start index
            "span_end": 93,    % <-- $R_1$ end index
            "span_text": "a little white kitten named Cotton",   % <-- $R_1$
            "input_text": "white",   % <-- $A_1$
            "turn_id": 1
        },
        [...]
    ]
}
```
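
As a small illustration of how these fields fit together, the sketch below pairs each question with its answer and slices the rationale out of the story. It assumes the end-index key is `span_end` (the name used by the code later in this notebook) and uses a shortened copy of the story; the helper `iter_turns` is hypothetical.

```python
dialogue = {
    "story": "Once upon a time, in a barn near a farm house, "
             "there lived a little white kitten named Cotton.",
    "questions": [{"input_text": "What color was Cotton?", "turn_id": 1}],
    "answers": [{"span_start": 59, "span_end": 93,
                 "span_text": "a little white kitten named Cotton",
                 "input_text": "white", "turn_id": 1}],
}


def iter_turns(dialogue):
    """Yield (Q_i, A_i, R_i) triples, matching questions and answers by turn_id."""
    answers = {a["turn_id"]: a for a in dialogue["answers"]}
    for q in dialogue["questions"]:
        a = answers[q["turn_id"]]
        # The rationale is a character-level slice of the story
        rationale = dialogue["story"][a["span_start"]:a["span_end"]]
        yield q["input_text"], a["input_text"], rationale


turns = list(iter_turns(dialogue))
print(turns)
# [('What color was Cotton?', 'white', 'a little white kitten named Cotton')]
```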

### Simplifications

Each dialogue also contains an additional field ```additional_answers```. For simplicity, we **ignore** this field and only consider one ground-truth answer $A$ and text rationale $R$.

Only 1.3% of CoQA questions are unanswerable. For simplicity, we **ignore** those QA pairs.

# [0] Functions and imports

In [1]:
%%capture
!pip install datasets
!pip install transformers

In [4]:
from IPython.display import display_html
from itertools import chain,cycle
# import matplotlib.pyplot as plt 
from tqdm import tqdm
import urllib.request
import numpy as np
import json
import torch
import os
import random 
import pandas as pd
import tensorflow as tf

# Display dataframes
def display(*args,titles=cycle([''])):
    html_str=''
    for df,title in zip(args, chain(titles,cycle(['</br>'])) ):
        html_str+='<th style="text-align:left"><td style="vertical-align:top">'
        html_str+=f'<h4 style="text-align: left;">{title}</h4>'
        html_str+=df.to_html().replace('table','table style="display:inline"')
        html_str+='</td></th>'
    display_html(html_str,raw=True)

def set_reproducibility(seed):
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)
    os.environ['TF_DETERMINISTIC_OPS'] = '1'

## [Task 1] Remove unanswerable QA pairs

Write your own script to remove unanswerable QA pairs from both train and validation sets.
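
A minimal sketch of this filtering step on the raw JSON structure, assuming that unanswerable turns are the ones whose answer `input_text` is exactly `unknown` (the same test the dataframe code in this notebook applies). The helper name `remove_unanswerable` and the toy dialogue are illustrative only.

```python
def remove_unanswerable(dialogue, marker="unknown"):
    """Return a copy of `dialogue` without the turns whose answer text equals `marker`."""
    bad_turns = {a["turn_id"] for a in dialogue["answers"] if a["input_text"] == marker}
    cleaned = dict(dialogue)  # shallow copy: the original dialogue is left untouched
    cleaned["questions"] = [q for q in dialogue["questions"] if q["turn_id"] not in bad_turns]
    cleaned["answers"] = [a for a in dialogue["answers"] if a["turn_id"] not in bad_turns]
    return cleaned


toy_dialogue = {
    "id": "toy",
    "questions": [{"input_text": "What color was Cotton?", "turn_id": 1},
                  {"input_text": "Who is the president?", "turn_id": 2}],
    "answers": [{"input_text": "white", "turn_id": 1},
                {"input_text": "unknown", "turn_id": 2}],
}
cleaned_dialogue = remove_unanswerable(toy_dialogue)
print([q["turn_id"] for q in cleaned_dialogue["questions"]])
# [1]
```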

## Dataset Download


In [None]:
class DownloadProgressBar(tqdm):
    def update_to(self, b=1, bsize=1, tsize=None):
        if tsize is not None:
            self.total = tsize
        self.update(b * bsize - self.n)
        
def download_url(url, output_path):
    with DownloadProgressBar(unit='B', unit_scale=True,
                             miniters=1, desc=url.split('/')[-1]) as t:
        urllib.request.urlretrieve(url, filename=output_path, reporthook=t.update_to)

def download_data(data_path, url_path, suffix):    
    if not os.path.exists(data_path):
        os.makedirs(data_path)
        
    data_path = os.path.join(data_path, f'{suffix}.json')

    if not os.path.exists(data_path):
        print(f"Downloading CoQA {suffix} data split... (it may take a while)")
        download_url(url=url_path, output_path=data_path)
        print("Download completed!")

In [None]:
# Train data
train_url = "https://nlp.stanford.edu/data/coqa/coqa-train-v1.0.json"
download_data(data_path='coqa', url_path=train_url, suffix='train')

# Test data
test_url = "https://nlp.stanford.edu/data/coqa/coqa-dev-v1.0.json"
download_data(data_path='coqa', url_path=test_url, suffix='test')  # <-- Why test? See next slides for an answer!

#### Data Inspection

Spend some time carefully checking the dataset format and how to retrieve the tasks' inputs and outputs!

In [None]:
def load_coqa_df(path):
    """Flatten one CoQA split into a dataframe with one row per QA pair."""
    with open(path) as f:
        data = json.load(f)['data']
    # One row per question, carrying the dialogue-level fields along
    qas = pd.json_normalize(data, ['questions'], ['source', 'id', 'story'])
    # One row per answer, keyed by dialogue id
    ans = pd.json_normalize(data, ['answers'], ['id'])
    # Join on (dialogue id, turn id): the shared `input_text` column becomes
    # `input_text_x` (question) and `input_text_y` (answer)
    df = pd.merge(qas, ans, on=['id', 'turn_id'])
    # [Task 1] Drop unanswerable QA pairs
    return df.loc[df['input_text_y'] != 'unknown']

train_val_df = load_coqa_df('coqa/train.json')
test_df = load_coqa_df('coqa/test.json')

## [Task 2] Train, Validation and Test splits

CoQA only provides a train and validation set since the test set is hidden for evaluation purposes.

We'll consider the provided validation set as a test set. <br>
$\rightarrow$ Write your own script to:
* Split the train data into train and validation splits (80% train and 20% val)
* Perform splits such that a dialogue appears in one split only! (i.e., split at dialogue level)
* Perform splitting using the following seed for reproducibility: 42
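
The requirements above can be sketched with the standard library alone. This is only an illustration of a dialogue-level split under stated assumptions: the helper `split_dialogues` and the toy ids are hypothetical, and the 80/20 ratio is applied to dialogues, so the ratio of QA pairs may differ slightly.

```python
import random


def split_dialogues(dialogue_ids, val_fraction=0.2, seed=42):
    """Split dialogue ids into disjoint train/validation sets with a fixed seed."""
    unique_ids = sorted(set(dialogue_ids))  # sort first so the shuffle is reproducible
    rng = random.Random(seed)
    rng.shuffle(unique_ids)
    n_val = round(len(unique_ids) * val_fraction)
    return set(unique_ids[n_val:]), set(unique_ids[:n_val])


# Toy data: 10 dialogues with 3 QA pairs each
qa_dialogue_ids = [f"dialogue_{i}" for i in range(10) for _ in range(3)]
train_ids, val_ids = split_dialogues(qa_dialogue_ids)
print(len(train_ids), len(val_ids))
# 8 2
```

Every QA pair is then assigned to the split that contains its dialogue id, so no dialogue can appear in both splits.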

#### Reproducibility Memo

Check back on Tutorial 2 to see how to fix a specific random seed for reproducibility!

In [None]:
from sklearn.model_selection import GroupShuffleSplit
# Import only the names that are used, instead of a wildcard import
from datasets import Dataset, DatasetDict
from transformers import AutoTokenizer, PreTrainedTokenizerFast

import plotly.express as px

In [None]:
set_reproducibility(42)

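# n_splits=1 is enough because only the first split is used
splitter = GroupShuffleSplit(test_size=.20, n_splits=1, random_state=42)
train_inds, val_inds = next(splitter.split(train_val_df, groups=train_val_df['story']))

# Reset both indices so that label-based lookups behave the same on each split
train_df = train_val_df.iloc[train_inds].reset_index(drop=True)
val_df = train_val_df.iloc[val_inds].reset_index(drop=True)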

In [None]:
print(f'Training set [{train_df.shape}]')
print(f'\tFeatures: {list(train_df.columns)}')
display(train_df.loc[10:10,['input_text_x', 'input_text_y', 'span_text', 'story']])

print(f'Validation set [{val_df.shape}]')
print(f'\tFeatures: {list(val_df.columns)}')
display(val_df.loc[10:10,['input_text_x', 'input_text_y', 'span_text', 'story']])

print(f'\nTest set [{test_df.shape}]')
print(f'\tFeatures: {list(test_df.columns)}')
display(test_df.loc[10:10,['input_text_x', 'input_text_y', 'span_text', 'story']])

Now we check if there is any overlapping dialogue between train and validation set.

In [None]:
set_train = set(train_df['story'])
set_val = set(val_df['story'])

# The splits overlap if they share at least one story
overlap = not set_train.isdisjoint(set_val)

print('Overlap' if overlap else 'No overlap')

In [None]:
features = ['id', 'story','input_text_x', 'input_text_y', 'span_start', 'span_end']

train_df_to_ds = train_df[features]
val_df_to_ds = val_df[features]
test_df_to_ds = test_df[features]

In [None]:
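# Percentage of each split to keep (small subsample to speed up experiments)
ratio = 2

train_samples = round(train_df_to_ds.shape[0] * ratio / 100)
val_samples = round(val_df_to_ds.shape[0] * ratio / 100)
test_samples = round(test_df_to_ds.shape[0] * ratio / 100) 

# `from_pandas` is the dedicated constructor for dataframes; drop the pandas index
train_dataset = Dataset.from_pandas(train_df_to_ds.iloc[:train_samples], preserve_index=False)
val_dataset = Dataset.from_pandas(val_df_to_ds.iloc[:val_samples], preserve_index=False)
test_dataset = Dataset.from_pandas(test_df_to_ds.iloc[:test_samples], preserve_index=False)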

dataset_COQA = DatasetDict({'train':train_dataset,'validation':val_dataset,'test':test_dataset})

In [None]:
dataset_COQA

In [None]:
batch_size = 16

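inputs_lengths = [len(q) + len(s) for q, s in zip(train_val_df['input_text_x'],
                                                  train_val_df['story'])]

# NOTE: these are character counts, whereas the tokenizer's `max_length` is counted in
# tokens. Both checkpoints used in this notebook accept at most 512 tokens, so the
# median is capped to avoid overflowing the position embeddings.
max_length_input = min(round(np.quantile(list(set(inputs_lengths)), .50)), 512)
stride = int(max_length_input/3)
# print(f'Max length (Third quartile):{max_length_input}')
print(f'Max length (Median, capped at 512):{max_length_input}')
print(f'Stride:{stride}')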

fig_inputs = px.box(list(set(inputs_lengths)))
fig_inputs.show()
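
To make the role of the stride concrete, here is a simplified sketch of how a long context is cut into overlapping windows. It works on a plain list and ignores the question and special tokens that a real tokenizer must also fit into `max_length`; the function name `sliding_windows` is hypothetical.

```python
def sliding_windows(tokens, window, overlap):
    """Cut `tokens` into chunks of at most `window` items sharing `overlap` items."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    step = window - overlap
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the last window already reaches the end of the sequence
        start += step
    return windows


print(sliding_windows(list(range(10)), window=4, overlap=2))
# [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

A larger overlap makes it more likely that an answer span falls entirely inside at least one window, at the cost of producing more windows per example.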

In [None]:
# outputs_lengths = [len(x) for x in train_val_df['input_text_y']]

# max_length_answer = round(np.quantile(list(set(outputs_lengths)), .50))
# # print(f'Max length (Third quartile):{max_length_answer}')
# print(f'Max length (Median):{max_length_answer}')

# fig_inputs = px.box(list(set(outputs_lengths)))
# fig_inputs.show()

In [None]:
model_checkpoint_M1 = 'distilroberta-base'
tokenizer_M1 = AutoTokenizer.from_pretrained(model_checkpoint_M1)
assert isinstance(tokenizer_M1, PreTrainedTokenizerFast)

model_checkpoint_M2 = 'prajjwal1/bert-tiny'
tokenizer_M2 = AutoTokenizer.from_pretrained(model_checkpoint_M2)
assert isinstance(tokenizer_M2, PreTrainedTokenizerFast)

In [None]:
def prepare_train_features(data, tokenizer, max_length_input, stride):
    questions = [q.strip() for q in data['input_text_x']]

    # Tokenize the Question and Context columns
    encoded_inputs = tokenizer(
        questions,
        data['story'],
        max_length=max_length_input,
        stride = stride,
        truncation='only_second',
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding='max_length'
    )
    
    offset_mapping = encoded_inputs.pop('offset_mapping')
    sample_map = encoded_inputs.pop('overflow_to_sample_mapping')
    
    answers = data['input_text_y']
    start_positions = []
    end_positions = []
    for i, offset in enumerate(offset_mapping):
        # Each overflowing chunk maps back to the example it was cut from,
        # so the span indices must be looked up with `sample_idx` (not 0)
        sample_idx = sample_map[i]
        start_char = data['span_start'][sample_idx]
        end_char = data['span_end'][sample_idx]
        sequence_ids = encoded_inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label is (0, 0)
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    encoded_inputs['start_positions'] = start_positions
    encoded_inputs['end_positions'] = end_positions

    return encoded_inputs

def prepare_val_features(data, tokenizer, max_length_input, stride):
    questions = [q.strip() for q in data['input_text_x']]

    # Tokenize the Question and Context columns
    encoded_inputs = tokenizer(
        questions,
        data['story'],
        max_length=max_length_input,
        stride = stride,
        truncation='only_second',
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding='max_length'
    )
    
    sample_map = encoded_inputs.pop('overflow_to_sample_mapping')
    example_ids = []

    for i in range(len(encoded_inputs['input_ids'])):
        sample_idx = sample_map[i]
        example_ids.append(data['id'][sample_idx])

        sequence_ids = encoded_inputs.sequence_ids(i)
        offset = encoded_inputs['offset_mapping'][i]
        encoded_inputs['offset_mapping'][i] = [
            o if sequence_ids[k] == 1 else None for k, o in enumerate(offset)
        ]

    encoded_inputs["example_id"] = example_ids

    return encoded_inputs
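
The labelling logic in `prepare_train_features` boils down to converting a character span into a token span using per-token character offsets. The standalone sketch below shows that conversion on whitespace tokens; the helper names are hypothetical and a real tokenizer would produce subword offsets instead.

```python
import re


def whitespace_offsets(text):
    """Return (start, end) character offsets for each whitespace-separated token."""
    return [m.span() for m in re.finditer(r"\S+", text)]


def char_span_to_token_span(offsets, start_char, end_char):
    """Map a character span to (first_token, last_token), or None if it is not covered."""
    if not offsets or offsets[0][0] > start_char or offsets[-1][1] < end_char:
        return None
    # Last token whose start is at or before the span start
    start_tok = 0
    while start_tok < len(offsets) and offsets[start_tok][0] <= start_char:
        start_tok += 1
    start_tok -= 1
    # First token whose end is at or after the span end
    end_tok = len(offsets) - 1
    while end_tok >= 0 and offsets[end_tok][1] >= end_char:
        end_tok -= 1
    end_tok += 1
    return start_tok, end_tok


context = "a little white kitten"
offsets = whitespace_offsets(context)
print(char_span_to_token_span(offsets, 9, 14))   # characters 9..14 are "white"
# (2, 2)
```

Returning `None` here plays the role of the `(0, 0)` label that the notebook assigns when the span is not fully inside the current window.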

* [M1] DistilRoBERTa (distilroberta-base)

In [None]:
tokenized_datasets_M1 = DatasetDict()

# Use the `prepare_features` functions
tokenized_datasets_M1['train'] = dataset_COQA['train'].map(
    lambda datarow: prepare_train_features(datarow, tokenizer_M1, max_length_input, stride),
    batched=True,
    batch_size=batch_size,
    remove_columns=dataset_COQA['train'].column_names
)

tokenized_datasets_M1['validation'] = dataset_COQA['validation'].map(
    lambda datarow: prepare_val_features(datarow, tokenizer_M1, max_length_input, stride),
    batched=True,
    batch_size=batch_size,
    remove_columns=dataset_COQA['validation'].column_names
)

In [None]:
tokenized_datasets_M1

* [M2] BERTTiny (bert-tiny)

In [None]:
tokenized_datasets_M2 = DatasetDict()

# Use the `prepare_features` functions
tokenized_datasets_M2['train'] = dataset_COQA['train'].map(
    lambda datarow: prepare_train_features(datarow, tokenizer_M2, max_length_input, stride),
    batched=True,
    batch_size=batch_size,
    remove_columns=dataset_COQA['train'].column_names
)

tokenized_datasets_M2['validation'] = dataset_COQA['validation'].map(
    lambda datarow: prepare_val_features(datarow, tokenizer_M2, max_length_input, stride),
    batched=True,
    batch_size=batch_size,
    remove_columns=dataset_COQA['validation'].column_names
)

In [None]:
tokenized_datasets_M2

## [Task 3] Model definition

Write your own script to define the following transformer-based models from [Hugging Face](https://huggingface.co/).

* [M1] DistilRoBERTa (distilroberta-base)
* [M2] BERTTiny (bert-tiny)

**Note**: Remember to install the ```transformers``` python package!

**Note**: We consider small transformer models for computational reasons!

In [None]:
# Only the TensorFlow model class is needed: training uses Keras `fit`, not `Trainer`
from transformers import TFAutoModelForQuestionAnswering

* [M1] DistilRoBERTa (distilroberta-base)

In [None]:
model_M1 = TFAutoModelForQuestionAnswering.from_pretrained(model_checkpoint_M1)

* [M2] BERTTiny (bert-tiny)

In [None]:
model_M2 = TFAutoModelForQuestionAnswering.from_pretrained(model_checkpoint_M2, from_pt=True)

## [Task 4] Answer generation with text passage $P$ and question $Q$

We want to define $f_\theta(P, Q)$. 

Write your own script to implement $f_\theta$ for each model: M1 and M2.

#### Formulation

Consider a dialogue on text passage $P$. 

For each question $Q_i$ at dialogue turn $i$, your model should take $P$ and $Q_i$ and generate $A_i$.
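
With an extractive question-answering head such as the ones loaded in this notebook, the model outputs one start score and one end score per token, and the answer is read off as the best-scoring valid span. The sketch below illustrates that selection step on plain lists; `best_span`, `max_answer_len`, and `n_best` are illustrative names and values, not settings taken from the assignment.

```python
def best_span(start_logits, end_logits, max_answer_len=30, n_best=20):
    """Return (score, start, end) of the best valid span, or None if there is none."""
    def top(scores):
        # Indices of the n_best highest scores
        return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n_best]

    best = None
    for s in top(start_logits):
        for e in top(end_logits):
            # Skip spans that end before they start or that are too long
            if e < s or e - s + 1 > max_answer_len:
                continue
            score = start_logits[s] + end_logits[e]
            if best is None or score > best[0]:
                best = (score, s, e)
    return best


start_logits = [0.1, 2.0, 0.3, 0.0]
end_logits = [0.0, 0.5, 3.0, 0.2]
print(best_span(start_logits, end_logits))
# (5.0, 1, 2)
```

The selected token indices are then mapped back to characters of $P$ through the offset mapping to obtain the answer text.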

In [None]:
from transformers import DefaultDataCollator

A data collator is a function that gathers several examples from a dataset into a batch, 
applying operations such as stacking or padding so that all examples in the batch share 
the same shape. The batch size is the number of samples processed together in one 
forward/backward pass. By handling this formatting step, the collator lets the deep 
learning framework process the data efficiently.
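
As a rough illustration of the idea, the sketch below collates a list of per-example dictionaries into a single dictionary of lists. It is a simplified stand-in, not the library's implementation: real collators also convert the result to tensors, and the name `simple_collate` is hypothetical.

```python
def simple_collate(examples):
    """Turn a list of per-example dicts into one dict of lists (a batch)."""
    keys = examples[0].keys()
    if any(ex.keys() != keys for ex in examples):
        raise ValueError("all examples must have the same fields")
    return {k: [ex[k] for ex in examples] for k in keys}


examples = [
    {"input_ids": [101, 7, 102], "start_positions": 1},
    {"input_ids": [101, 9, 102], "start_positions": 0},
]
batch = simple_collate(examples)
print(batch)
# {'input_ids': [[101, 7, 102], [101, 9, 102]], 'start_positions': [1, 0]}
```

No padding is applied in this sketch, which is why the examples must already have the same length, as they do here after tokenizing with `padding='max_length'`.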

In [None]:
data_collator = DefaultDataCollator(return_tensors="tf")

* [M1] DistilRoBERTa (distilroberta-base)

In [None]:
tf_train_dataset_M1 = model_M1.prepare_tf_dataset(
    tokenized_datasets_M1['train'],
    collate_fn=data_collator,
    shuffle=True,
    batch_size=batch_size,
)
tf_eval_dataset_M1 = model_M1.prepare_tf_dataset(
    tokenized_datasets_M1['validation'],
    collate_fn=data_collator,
    shuffle=False,
    batch_size=batch_size,
)

In [None]:
from transformers import create_optimizer
import tensorflow as tf

# The number of training steps is the number of samples in the dataset, divided by the batch size then multiplied
# by the total number of epochs. Note that the tf_train_dataset here is a batched tf.data.Dataset,
# not the original Hugging Face Dataset, so its len() is already num_samples // batch_size.
num_train_epochs = 3
num_train_steps = len(tf_train_dataset_M1) * num_train_epochs

optimizer, schedule = create_optimizer(
    init_lr=2e-5,
    num_warmup_steps=0,
    num_train_steps=num_train_steps,
    weight_decay_rate=0.01,
)
model_M1.compile(optimizer=optimizer)

# Train in mixed-precision float16
# tf.keras.mixed_precision.set_global_policy("mixed_float16")
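
The step count above drives the learning-rate schedule. The sketch below shows a linear warmup followed by a linear decay to zero, which is the behaviour I would expect from the optimizer configured above; treat that correspondence as an assumption and check the library documentation. The function `linear_decay_lr` is hypothetical.

```python
def linear_decay_lr(step, init_lr, num_train_steps, num_warmup_steps=0):
    """Learning rate at `step`: linear warmup, then linear decay to zero."""
    if num_warmup_steps and step < num_warmup_steps:
        return init_lr * step / num_warmup_steps
    decay_steps = max(1, num_train_steps - num_warmup_steps)
    progress = min(step - num_warmup_steps, decay_steps) / decay_steps
    return init_lr * (1.0 - progress)


# Example: 1600 samples, batch size 16, 3 epochs -> 100 steps per epoch, 300 in total
total_steps = (1600 // 16) * 3
for step in (0, 150, 300):
    print(step, linear_decay_lr(step, init_lr=2e-5, num_train_steps=total_steps))
```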

* [M2] BERTTiny (bert-tiny)

## [Task 6] Train and evaluate $f_\theta(P, Q)$

Write your own script to train and evaluate your $f_\theta(P, Q)$.

### Instructions

* Perform multiple train/evaluation seed runs: [42, 2022, 1337].$^1$
* Evaluate your models with the following metric: SQuAD F1-score.$^2$
* Fine-tune each transformer-based model for **3 epochs**.
* Report the evaluation SQuAD F1-score computed on the validation and test sets.

$^1$ Remember what we said about code reproducibility in Tutorial 2!

$^2$ You can use ```allennlp``` python package for a quick implementation of SQUAD F1-score: ```from allennlp_models.rc.tools import squad```. 
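
For reference, the token-level F1 used for SQuAD-style evaluation can also be written in a few lines of plain Python. The sketch below follows the usual recipe (lowercase, strip punctuation and English articles, then compute token-overlap F1); it is an illustrative reimplementation, so prefer an established implementation for the reported numbers.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, remove punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def squad_f1(prediction, ground_truth):
    """Token-overlap F1 between a predicted and a ground-truth answer."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(ground_truth).split()
    num_same = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(squad_f1("white", "white"))            # 1.0
print(squad_f1("black", "white"))            # 0.0
print(round(squad_f1("a white kitten", "White."), 3))
# 0.667
```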

* [M1] DistilRoBERTa (distilroberta-base)

In [None]:
# We're going to do validation afterwards, so no validation mid-training
model_M1.fit(tf_train_dataset_M1, epochs=num_train_epochs)

In [None]:
# model_M1.save_pretrained(f'{model_checkpoint_M1}-finetuned-coqa')

* [M2] BERTTiny (bert-tiny)

## [Task 7] Error Analysis

Perform a simple and short error analysis as follows:
* Group dialogues by ```source``` and report the worst 5 model errors for each source (w.r.t. SQuAD F1-score).
* Inspect observed results and try to provide some comments (e.g., do the models make errors when faced with a particular question type?)$^1$

$^1$ Check the [paper](https://arxiv.org/pdf/1808.07042.pdf) for some valuable information about question/answer types (e.g., Table 6, Table 8) 
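
The grouping step can be sketched as follows, assuming each evaluated QA pair has been stored as a dictionary with a `source` field and a per-example `f1` score. The record layout and the helper `worst_errors_by_source` are hypothetical, and the scores in the toy data are made up for illustration.

```python
from collections import defaultdict


def worst_errors_by_source(records, k=5):
    """Return, for each source, the k records with the lowest F1 score."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["source"]].append(rec)
    return {src: sorted(recs, key=lambda r: r["f1"])[:k] for src, recs in grouped.items()}


# Toy records with made-up scores
records = [
    {"source": "mctest", "question": "q1", "f1": 0.9},
    {"source": "mctest", "question": "q2", "f1": 0.1},
    {"source": "wikipedia", "question": "q3", "f1": 0.4},
]
worst = worst_errors_by_source(records, k=1)
print({src: [r["question"] for r in recs] for src, recs in worst.items()})
# {'mctest': ['q2'], 'wikipedia': ['q3']}
```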

# Assignment Evaluation

Assignment points will be awarded for each task as follows:

* Task 1, Pre-processing $\rightarrow$ 0.5 points.
* Task 2, Dataset Splitting $\rightarrow$ 0.5 points.
* Task 3 and 4, Models Definition $\rightarrow$ 1.0 points.
* Task 5 and 6, Models Training and Evaluation $\rightarrow$ 2.0 points.
* Task 7, Analysis $\rightarrow$ 1.0 points.
* Report $\rightarrow$ 1.0 points.

**Total** = 6 points <br>

We may award an additional 0.5 points for outstanding submissions. 
 
**Speed Bonus** = 0.5 extra points <br>

# Report

We apply the rules described in Assignment 1 regarding the report.
* Write a clear and concise report following the given overleaf template (**max 2 pages**).
* Report validation and test results in a table.$^1$
* **Avoid reporting** code snippets or copy-paste terminal outputs $\rightarrow$ **Provide a clean schema** of what you want to show

# Comments and Organization

Remember to properly comment your code (it is not necessary to comment each single line) and don't forget to describe your work!

Structure your code for readability and maintenance. If you work with Colab, use sections. 

This allows you to build clean and modular code that is easy to read and debug (notebooks can be quite tricky from time to time).

# FAQ (READ THIS!)

---

**Question**: Does Task 3 also include the data tokenization and conversion steps?

**Answer:** Yes! These steps are usually straightforward since ```transformers``` also offers a specific tokenizer for each model.

**Example**: 

```
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
encoded_text = tokenizer(text)
# Alternatively, truncate to a chosen maximum length (at most 512 tokens)
inputs = tokenizer(text, add_special_tokens=True, truncation=True, max_length=min(max_length, 512))
input_ids, attention_mask = inputs['input_ids'], inputs['attention_mask']
```

**Suggestion**: Hugging Face's documentation is full of tutorials and user-friendly APIs.

---
---

**Question**: I'm hitting **out of memory error** when training my models, do you have any suggestions?

**Answer**: Here are some common workarounds:

1. Try decreasing the mini-batch size
2. Try applying a different padding strategy (if you are applying padding): e.g. use quantiles instead of maximum sequence length
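
The second workaround can be sketched as follows: pick a high quantile of the observed sequence lengths as the padding length, then pad or truncate every sequence to it. The helper names, the 0.8 quantile, and the padding id `0` are illustrative assumptions.

```python
import math


def quantile_length(lengths, q=0.95):
    """Smallest observed length that covers at least a fraction q of the sequences."""
    ordered = sorted(lengths)
    index = min(len(ordered) - 1, max(0, math.ceil(q * len(ordered)) - 1))
    return ordered[index]


def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Cut `token_ids` to `max_len` items, or right-pad it with `pad_id`."""
    return token_ids[:max_len] + [pad_id] * max(0, max_len - len(token_ids))


lengths = [5, 7, 8, 9, 120]          # one outlier would dominate the maximum
max_len = quantile_length(lengths, q=0.8)
print(max_len)
# 9
print(pad_or_truncate([1, 2, 3], 5))
# [1, 2, 3, 0, 0]
```

Padding to the quantile instead of the maximum keeps a single very long example from inflating the memory used by every batch.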

---
---

# Contact

For any doubts, questions, issues, or requests for help, you can always contact us at the following email addresses:

Teaching Assistants:

* Andrea Galassi -> a.galassi@unibo.it
* Federico Ruggeri -> federico.ruggeri6@unibo.it

Professor:

* Paolo Torroni -> p.torroni@unibo.it

# The End!

Questions?