LINK:
* https://github.com/huggingface/notebooks/blob/main/examples/question_answering.ipynb

* https://github.com/Michael-M-Mike/Unibo-NLP-Assignments/blob/main/A2_Seq2Seq_Abstractive_Question_Answering_(QA)_on_CoQA/distilroberta_42.ipynb

# Assignment 2

**Credits**: Andrea Galassi, Federico Ruggeri, Paolo Torroni

**Keywords**: Transformers, Question Answering, CoQA

## Deadlines

* **December 11**, 2022: deadline for having assignments graded by January 11, 2023
* **January 11**, 2023: deadline for half-point speed bonus per assignment
* **After January 11**, 2023: assignments are still accepted, but there will be no speed bonus

## Overview

### Problem

Question Answering (QA) on [CoQA](https://stanfordnlp.github.io/coqa/) dataset: a conversational QA dataset.

### Task

Given a question $Q$, a text passage $P$, the task is to generate the answer $A$.<br>
$\rightarrow A$ can be: (i) a free-form text or (ii) unanswerable;

**Note**: an question $Q$ can refer to previous dialogue turns. <br>
$\rightarrow$ dialogue history $H$ may be a valuable input to provide the correct answer $A$.

### Models

We are going to experiment with transformer-based models to define the following models:

1.  $A = f_\theta(Q, P)$

2. $A = f_\theta(Q, P, H)$

where $f_\theta$ is the transformer-based model we have to define with $\theta$ parameters.

## The CoQA dataset

<center>
    <img src="https://drive.google.com/uc?export=view&id=16vrgyfoV42Z2AQX0QY7LHTfrgektEKKh" width="750"/>
</center>

For detailed information about the dataset, feel free to check the original [paper](https://arxiv.org/pdf/1808.07042.pdf).



## Rationales

Each QA pair is paired with a rationale $R$: it is a text span extracted from the given text passage $P$. <br>
$\rightarrow$ $R$ is not a requested output, but it can be used as an additional information at training time!

## Dataset Statistics

* **127k** QA pairs.
* **8k** conversations.
* **7** diverse domains: Children's Stories, Literature, Mid/High School Exams, News, Wikipedia, Reddit, Science.
* Average conversation length: **15 turns** (i.e., QA pairs).
* Almost **half** of CoQA questions refer back to **conversational history**.
* Only **train** and **validation** sets are available.

## Dataset snippet

The dataset is stored in JSON format. Each dialogue is represented as follows:

```
{
    "source": "mctest",
    "id": "3dr23u6we5exclen4th8uq9rb42tel",
    "filename": "mc160.test.41",
    "story": "Once upon a time, in a barn near a farm house, there lived a little white kitten named Cotton. 
    Cotton lived high up in a nice warm place above the barn where all of the farmer's horses slept. [...]" % <-- $P$
    "questions": [
        {
            "input_text": "What color was Cotton?",   % <-- $Q_1$
            "turn_id": 1
        },
        {
            "input_text": "Where did she live?",
            "turn_id": 2
        },
        [...]
    ],
    "answers": [
        {
            "span_start": 59,   % <-- $R_1$ start index
            "spand_end": 93,    % <-- $R_1$ end index
            "span_text": "a little white kitten named Cotton",   % <-- $R_1$
            "input_text" "white",   % <-- $A_1$      
            "turn_id": 1
        },
        [...]
    ]
}
```

### Simplifications

Each dialogue also contains an additional field ```additional_answers```. For simplicity, we **ignore** this field and only consider one groundtruth answer $A$ and text rationale $R$.

CoQA only contains 1.3% of unanswerable questions. For simplicity, we **ignore** those QA pairs.

# [0] Functions and imports

In [1]:
%%capture
!pip install datasets
!pip install transformers

In [2]:
from IPython.display import display_html
from itertools import chain,cycle
import matplotlib.pyplot as plt 
from tqdm import tqdm
import urllib.request
import numpy as np
import json
import torch
import os
import random 
import pandas as pd
import tensorflow as tf

# Display dataframes
def display(*args,titles=cycle([''])):
    html_str=''
    for df,title in zip(args, chain(titles,cycle(['</br>'])) ):
        html_str+='<th style="text-align:left"><td style="vertical-align:top">'
        html_str+=f'<h4 style="text-align: left;">{title}</h2>'
        html_str+=df.to_html().replace('table','table style="display:inline"')
        html_str+='</td></th>'
    display_html(html_str,raw=True)

def set_reproducibility(seed):
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)
    os.environ['TF_DETERMINISTIC_OPS'] = '1'

## [Task 1] Remove unaswerable QA pairs

Write your own script to remove unaswerable QA pairs from both train and validation sets.

## Dataset Download


In [3]:
class DownloadProgressBar(tqdm):
    def update_to(self, b=1, bsize=1, tsize=None):
        if tsize is not None:
            self.total = tsize
        self.update(b * bsize - self.n)
        
def download_url(url, output_path):
    with DownloadProgressBar(unit='B', unit_scale=True,
                             miniters=1, desc=url.split('/')[-1]) as t:
        urllib.request.urlretrieve(url, filename=output_path, reporthook=t.update_to)

def download_data(data_path, url_path, suffix):    
    if not os.path.exists(data_path):
        os.makedirs(data_path)
        
    data_path = os.path.join(data_path, f'{suffix}.json')

    if not os.path.exists(data_path):
        print(f"Downloading CoQA {suffix} data split... (it may take a while)")
        download_url(url=url_path, output_path=data_path)
        print("Download completed!")

In [4]:
# Train data
train_url = "https://nlp.stanford.edu/data/coqa/coqa-train-v1.0.json"
download_data(data_path='coqa', url_path=train_url, suffix='train')

# Test data
test_url = "https://nlp.stanford.edu/data/coqa/coqa-dev-v1.0.json"
download_data(data_path='coqa', url_path=test_url, suffix='test')  # <-- Why test? See next slides for an answer!

#### Data Inspection

Spend some time in checking accurately the dataset format and how to retrieve the tasks' inputs and outputs!

In [5]:
train_data = json.load((open('coqa/train.json')))
test_data = json.load((open('coqa/test.json')))

qas = pd.json_normalize(train_data['data'], ['questions'], ['source', 'id', 'story'])
ans = pd.json_normalize(train_data['data'], ['answers'],['id'])
train_df = pd.merge(qas,ans, left_on=['id','turn_id'], right_on=['id','turn_id'])
train_df = train_df.loc[train_df['input_text_y']!='unknown']

qas = pd.json_normalize(test_data['data'], ['questions'], ['source', 'id', 'story'])
ans = pd.json_normalize(test_data['data'], ['answers'],['id'])
test_df = pd.merge(qas,ans, left_on=['id','turn_id'], right_on=['id','turn_id'])
test_df = test_df.loc[test_df['input_text_y']!='unknown']

print(f'Training set [{train_df.shape}]')
print(f'\tFeatures: {list(train_df.columns)}')
display(train_df.loc[10:15,['story','input_text_x', 'input_text_y', 'span_text']])

print(f'\nTest set [{test_df.shape}]')
print(f'\tFeatures: {list(test_df.columns)}')
display(test_df.loc[10:15,['story','input_text_x', 'input_text_y', 'span_text']])

Training set [(107276, 11)]
	Features: ['input_text_x', 'turn_id', 'bad_turn_x', 'source', 'id', 'story', 'span_start', 'span_end', 'span_text', 'input_text_y', 'bad_turn_y']


Unnamed: 0,story,input_text_x,input_text_y,span_text
10,"The Vatican Apostolic Library (), more commonly called the Vatican Library or simply the Vat, is the library of the Holy See, located in Vatican City. Formally established in 1475, although it is much older, it is one of the oldest libraries in the world and contains one of the most significant collections of historical texts. It has 75,000 codices from throughout history, as well as 1.1 million printed books, which include some 8,500 incunabula. \n\nThe Vatican Library is a research library for history, law, philosophy, science and theology. The Vatican Library is open to anyone who can document their qualifications and research needs. Photocopies for private study of pages from books published between 1801 and 1990 can be requested in person or by mail. \n\nIn March 2014, the Vatican Library began an initial four-year project of digitising its collection of manuscripts, to be made available online. \n\nThe Vatican Secret Archives were separated from the library at the beginning of the 17th century; they contain another 150,000 items. \n\nScholars have traditionally divided the history of the library into five periods, Pre-Lateran, Lateran, Avignon, Pre-Vatican and Vatican. \n\nThe Pre-Lateran period, comprising the initial days of the library, dated from the earliest days of the Church. Only a handful of volumes survive from this period, though some are very significant.",when were the Secret Archives moved from the rest of the library?,at the beginning of the 17th century;,atican Secret Archives were separated from the library at the beginning of the 17th century;
11,"The Vatican Apostolic Library (), more commonly called the Vatican Library or simply the Vat, is the library of the Holy See, located in Vatican City. Formally established in 1475, although it is much older, it is one of the oldest libraries in the world and contains one of the most significant collections of historical texts. It has 75,000 codices from throughout history, as well as 1.1 million printed books, which include some 8,500 incunabula. \n\nThe Vatican Library is a research library for history, law, philosophy, science and theology. The Vatican Library is open to anyone who can document their qualifications and research needs. Photocopies for private study of pages from books published between 1801 and 1990 can be requested in person or by mail. \n\nIn March 2014, the Vatican Library began an initial four-year project of digitising its collection of manuscripts, to be made available online. \n\nThe Vatican Secret Archives were separated from the library at the beginning of the 17th century; they contain another 150,000 items. \n\nScholars have traditionally divided the history of the library into five periods, Pre-Lateran, Lateran, Avignon, Pre-Vatican and Vatican. \n\nThe Pre-Lateran period, comprising the initial days of the library, dated from the earliest days of the Church. Only a handful of volumes survive from this period, though some are very significant.",how many items are in this secret collection?,150000,"Vatican Secret Archives were separated from the library at the beginning of the 17th century; they contain another 150,000 items."
12,"The Vatican Apostolic Library (), more commonly called the Vatican Library or simply the Vat, is the library of the Holy See, located in Vatican City. Formally established in 1475, although it is much older, it is one of the oldest libraries in the world and contains one of the most significant collections of historical texts. It has 75,000 codices from throughout history, as well as 1.1 million printed books, which include some 8,500 incunabula. \n\nThe Vatican Library is a research library for history, law, philosophy, science and theology. The Vatican Library is open to anyone who can document their qualifications and research needs. Photocopies for private study of pages from books published between 1801 and 1990 can be requested in person or by mail. \n\nIn March 2014, the Vatican Library began an initial four-year project of digitising its collection of manuscripts, to be made available online. \n\nThe Vatican Secret Archives were separated from the library at the beginning of the 17th century; they contain another 150,000 items. \n\nScholars have traditionally divided the history of the library into five periods, Pre-Lateran, Lateran, Avignon, Pre-Vatican and Vatican. \n\nThe Pre-Lateran period, comprising the initial days of the library, dated from the earliest days of the Church. Only a handful of volumes survive from this period, though some are very significant.",Can anyone use this library?,anyone who can document their qualifications and research needs.,The Vatican Library is open to anyone who can document their qualifications and research needs.
14,"The Vatican Apostolic Library (), more commonly called the Vatican Library or simply the Vat, is the library of the Holy See, located in Vatican City. Formally established in 1475, although it is much older, it is one of the oldest libraries in the world and contains one of the most significant collections of historical texts. It has 75,000 codices from throughout history, as well as 1.1 million printed books, which include some 8,500 incunabula. \n\nThe Vatican Library is a research library for history, law, philosophy, science and theology. The Vatican Library is open to anyone who can document their qualifications and research needs. Photocopies for private study of pages from books published between 1801 and 1990 can be requested in person or by mail. \n\nIn March 2014, the Vatican Library began an initial four-year project of digitising its collection of manuscripts, to be made available online. \n\nThe Vatican Secret Archives were separated from the library at the beginning of the 17th century; they contain another 150,000 items. \n\nScholars have traditionally divided the history of the library into five periods, Pre-Lateran, Lateran, Avignon, Pre-Vatican and Vatican. \n\nThe Pre-Lateran period, comprising the initial days of the library, dated from the earliest days of the Church. Only a handful of volumes survive from this period, though some are very significant.",what must be requested in person or by mail?,Photocopies,Photocopies for private study of pages from books published between 1801 and 1990 can be requested in person or by mail.
15,"The Vatican Apostolic Library (), more commonly called the Vatican Library or simply the Vat, is the library of the Holy See, located in Vatican City. Formally established in 1475, although it is much older, it is one of the oldest libraries in the world and contains one of the most significant collections of historical texts. It has 75,000 codices from throughout history, as well as 1.1 million printed books, which include some 8,500 incunabula. \n\nThe Vatican Library is a research library for history, law, philosophy, science and theology. The Vatican Library is open to anyone who can document their qualifications and research needs. Photocopies for private study of pages from books published between 1801 and 1990 can be requested in person or by mail. \n\nIn March 2014, the Vatican Library began an initial four-year project of digitising its collection of manuscripts, to be made available online. \n\nThe Vatican Secret Archives were separated from the library at the beginning of the 17th century; they contain another 150,000 items. \n\nScholars have traditionally divided the history of the library into five periods, Pre-Lateran, Lateran, Avignon, Pre-Vatican and Vatican. \n\nThe Pre-Lateran period, comprising the initial days of the library, dated from the earliest days of the Church. Only a handful of volumes survive from this period, though some are very significant.",of what books?,only books published between 1801 and 1990,hotocopies for private study of pages from books published between 1801 and 1990



Test set [(7917, 9)]
	Features: ['input_text_x', 'turn_id', 'source', 'id', 'story', 'span_start', 'span_end', 'span_text', 'input_text_y']


Unnamed: 0,story,input_text_x,input_text_y,span_text
10,"Once upon a time, in a barn near a farm house, there lived a little white kitten named Cotton. Cotton lived high up in a nice warm place above the barn where all of the farmer's horses slept. But Cotton wasn't alone in her little home above the barn, oh no. She shared her hay bed with her mommy and 5 other sisters. All of her sisters were cute and fluffy, like Cotton. But she was the only white one in the bunch. The rest of her sisters were all orange with beautiful white tiger stripes like Cotton's mommy. Being different made Cotton quite sad. She often wished she looked like the rest of her family. So one day, when Cotton found a can of the old farmer's orange paint, she used it to paint herself like them. When her mommy and sisters found her they started laughing. \n\n""What are you doing, Cotton?!"" \n\n""I only wanted to be more like you"". \n\nCotton's mommy rubbed her face on Cotton's and said ""Oh Cotton, but your fur is so pretty and special, like you. We would never want you to be any other way"". And with that, Cotton's mommy picked her up and dropped her into a big bucket of water. When Cotton came out she was herself again. Her sisters licked her face until Cotton's fur was all all dry. \n\n""Don't ever do that again, Cotton!"" they all cried. ""Next time you might mess up that pretty white fur of yours and we wouldn't want that!"" \n\nThen Cotton thought, ""I change my mind. I like being special"".",What did the other cats do when Cotton emerged from the bucket of water?,licked her face,Her sisters licked her face
11,"Once upon a time, in a barn near a farm house, there lived a little white kitten named Cotton. Cotton lived high up in a nice warm place above the barn where all of the farmer's horses slept. But Cotton wasn't alone in her little home above the barn, oh no. She shared her hay bed with her mommy and 5 other sisters. All of her sisters were cute and fluffy, like Cotton. But she was the only white one in the bunch. The rest of her sisters were all orange with beautiful white tiger stripes like Cotton's mommy. Being different made Cotton quite sad. She often wished she looked like the rest of her family. So one day, when Cotton found a can of the old farmer's orange paint, she used it to paint herself like them. When her mommy and sisters found her they started laughing. \n\n""What are you doing, Cotton?!"" \n\n""I only wanted to be more like you"". \n\nCotton's mommy rubbed her face on Cotton's and said ""Oh Cotton, but your fur is so pretty and special, like you. We would never want you to be any other way"". And with that, Cotton's mommy picked her up and dropped her into a big bucket of water. When Cotton came out she was herself again. Her sisters licked her face until Cotton's fur was all all dry. \n\n""Don't ever do that again, Cotton!"" they all cried. ""Next time you might mess up that pretty white fur of yours and we wouldn't want that!"" \n\nThen Cotton thought, ""I change my mind. I like being special"".",Did they want Cotton to change the color of her fur?,no,We would never want you to be any other way
12,"Once there was a beautiful fish named Asta. Asta lived in the ocean. There were lots of other fish in the ocean where Asta lived. They played all day long. \n\nOne day, a bottle floated by over the heads of Asta and his friends. They looked up and saw the bottle. ""What is it?"" said Asta's friend Sharkie. ""It looks like a bird's belly,"" said Asta. But when they swam closer, it was not a bird's belly. It was hard and clear, and there was something inside it. \n\nThe bottle floated above them. They wanted to open it. They wanted to see what was inside. So they caught the bottle and carried it down to the bottom of the ocean. They cracked it open on a rock. When they got it open, they found what was inside. It was a note. The note was written in orange crayon on white paper. Asta could not read the note. Sharkie could not read the note. They took the note to Asta's papa. ""What does it say?"" they asked. \n\nAsta's papa read the note. He told Asta and Sharkie, ""This note is from a little girl. She wants to be your friend. If you want to be her friend, we can write a note to her. But you have to find another bottle so we can send it to her."" And that is what they did.",what was the name of the fish,Asta.,Asta.
13,"Once there was a beautiful fish named Asta. Asta lived in the ocean. There were lots of other fish in the ocean where Asta lived. They played all day long. \n\nOne day, a bottle floated by over the heads of Asta and his friends. They looked up and saw the bottle. ""What is it?"" said Asta's friend Sharkie. ""It looks like a bird's belly,"" said Asta. But when they swam closer, it was not a bird's belly. It was hard and clear, and there was something inside it. \n\nThe bottle floated above them. They wanted to open it. They wanted to see what was inside. So they caught the bottle and carried it down to the bottom of the ocean. They cracked it open on a rock. When they got it open, they found what was inside. It was a note. The note was written in orange crayon on white paper. Asta could not read the note. Sharkie could not read the note. They took the note to Asta's papa. ""What does it say?"" they asked. \n\nAsta's papa read the note. He told Asta and Sharkie, ""This note is from a little girl. She wants to be your friend. If you want to be her friend, we can write a note to her. But you have to find another bottle so we can send it to her."" And that is what they did.",What looked like a birds belly,a bottle,a bottle
14,"Once there was a beautiful fish named Asta. Asta lived in the ocean. There were lots of other fish in the ocean where Asta lived. They played all day long. \n\nOne day, a bottle floated by over the heads of Asta and his friends. They looked up and saw the bottle. ""What is it?"" said Asta's friend Sharkie. ""It looks like a bird's belly,"" said Asta. But when they swam closer, it was not a bird's belly. It was hard and clear, and there was something inside it. \n\nThe bottle floated above them. They wanted to open it. They wanted to see what was inside. So they caught the bottle and carried it down to the bottom of the ocean. They cracked it open on a rock. When they got it open, they found what was inside. It was a note. The note was written in orange crayon on white paper. Asta could not read the note. Sharkie could not read the note. They took the note to Asta's papa. ""What does it say?"" they asked. \n\nAsta's papa read the note. He told Asta and Sharkie, ""This note is from a little girl. She wants to be your friend. If you want to be her friend, we can write a note to her. But you have to find another bottle so we can send it to her."" And that is what they did.",who said that,Asta.,"""It looks like a bird's belly,"" said Asta."
15,"Once there was a beautiful fish named Asta. Asta lived in the ocean. There were lots of other fish in the ocean where Asta lived. They played all day long. \n\nOne day, a bottle floated by over the heads of Asta and his friends. They looked up and saw the bottle. ""What is it?"" said Asta's friend Sharkie. ""It looks like a bird's belly,"" said Asta. But when they swam closer, it was not a bird's belly. It was hard and clear, and there was something inside it. \n\nThe bottle floated above them. They wanted to open it. They wanted to see what was inside. So they caught the bottle and carried it down to the bottom of the ocean. They cracked it open on a rock. When they got it open, they found what was inside. It was a note. The note was written in orange crayon on white paper. Asta could not read the note. Sharkie could not read the note. They took the note to Asta's papa. ""What does it say?"" they asked. \n\nAsta's papa read the note. He told Asta and Sharkie, ""This note is from a little girl. She wants to be your friend. If you want to be her friend, we can write a note to her. But you have to find another bottle so we can send it to her."" And that is what they did.",Was Sharkie a friend?,Yes,Asta's friend Sharkie


## [Task 2] Train, Validation and Test splits

CoQA only provides a train and validation set since the test set is hidden for evaluation purposes.

We'll consider the provided validation set as a test set. <br>
$\rightarrow$ Write your own script to:
* Split the train data in train and validation splits (80% train and 20% val)
* Perform splits such that a dialogue appears in one split only! (i.e., split at dialogue level)
* Perform splitting using the following seed for reproducibility: 42

#### Reproducibility Memo

Check back tutorial 2 on how to fix a specific random seed for reproducibility!

In [6]:
from sklearn.model_selection import train_test_split
from datasets import *

In [7]:
set_reproducibility(42)

train_df, val_df = train_test_split(train_df, test_size=0.2, random_state=42)
val_df = val_df.reset_index()

print(f'Validation set [{val_df.shape}]')
print(f'\tFeatures: {list(val_df.columns)}')
display(val_df.loc[10:15,['story','input_text_x', 'input_text_y', 'span_text']])

Validation set [(21456, 12)]
	Features: ['index', 'input_text_x', 'turn_id', 'bad_turn_x', 'source', 'id', 'story', 'span_start', 'span_end', 'span_text', 'input_text_y', 'bad_turn_y']


Unnamed: 0,story,input_text_x,input_text_y,span_text
10,"On paper, the race in Kentucky between Sen. Mitch McConnell and his Democratic challenger, Alison Lundergan Grimes, should be pretty clear-cut: The experienced veteran easily beats a political novice. But like most things, it's not. \n\nMcConnell must cross the first hurdle by beating his primary challenger, Matt Bevin, before he engages in what is expected to be one of the most expensive and bitterly fought Senate campaigns this midterm season. \n\nA lot is at stake overall in November: control of the Senate and the political fate of one of the most powerful Republicans in Washington. \n\nGrimes' advantage \n\nGrimes, 35, was just 7 when McConnell was first elected to the Senate. \n\nMitch McConnell would face biggest challenge yet in Alison Grimes \n\nWhile he rose up the ranks in Washington and became Senate Republican leader, Grimes practiced law and won statewide office as secretary of state in 2011. \n\nDespite her short political career, like McConnell, her name carries weight -- for better or worse. \n\nGrimes' family has a long history in state Democratic politics. Her father, Jerry, was the former chairman of the Kentucky Democratic Party and a state legislator. But he was forced out of those roles over legal problems facing his catering company. \n\nWhile the family name has been battered, its connections survive: She'll have access to the deep pockets and support of her father's allies, including Bill and Hillary Clinton. \n\nThe former President has already hit the trail for Grimes, raising more than $600,000 at one Louisville event in February.",When is the election?,November,A lot is at stake overall in November: control of the Senate and the political fate of one of the most powerful Republicans in Washington. \n
11,"CHAPTER IX \n\nDOCTOR PATSY \n\nNext morning Uncle John and the Weldons--including the precious baby--went for a ride into the mountains, while Beth and Patsy took their embroidery into a sunny corner of the hotel lobby. \n\nIt was nearly ten o'clock when A. Jones discovered the two girls and came tottering toward them. Tottering is the right word; he fairly swayed as he made his way to the secluded corner. \n\n""I wish he'd use a cane,"" muttered Beth in an undertone. ""I have the feeling that he's liable to bump his nose any minute."" \n\nPatsy drew up a chair for him, although he endeavored to prevent her. \n\n""Are you feeling better this morning?"" she inquired. \n\n""I--I think so,"" he answered doubtfully. ""I don't seem to get back my strength, you see."" \n\n""Were you stronger before your accident?"" asked Beth. \n\n""Yes, indeed. I went swimming, you remember. But perhaps I was not strong enough to do that. I--I'm very careful of myself, yet I seem to grow weaker all the time."" \n\nThere was a brief silence, during which the girls plied their needles. \n\n""Are you going to stay in this hotel?"" demanded Patsy, in her blunt way. \n\n""For a time, I think. It is very pleasant here,"" he said. \n\n""Have you had breakfast?"" \n\n""I took a food-table style=""display:inline""t at daybreak."" \n\n""Huh!"" A scornful exclamation. Then she glanced at the open door of the dining-hall and laying aside her work she rose with a determined air and said: \n\n""Come with me!"" \n\n""Where?"" \n\nFor answer she assisted him to rise. Then she took his hand and marched him across the lobby to the dining room.",Who did the girls meet that wasn't walking well?,A. Jones,It was nearly ten o'clock when A. Jones discovered the two girls and came tottering toward them.
12,"Bertrand Arthur William Russell, 3rd Earl Russell, (; 18 May 1872 – 2 February 1970) was a British philosopher, logician, mathematician, historian, writer, social critic, political activist and Nobel laureate. At various points in his life he considered himself a liberal, a socialist, and a pacifist, but he also admitted that he had ""never been any of these things, in any profound sense"". He was born in Monmouthshire into one of the most prominent aristocratic families in the United Kingdom. \n\nIn the early 20th century, Russell led the British ""revolt against idealism"". He is considered one of the founders of analytic philosophy along with his predecessor Gottlob Frege, colleague G. E. Moore, and protégé Ludwig Wittgenstein. He is widely held to be one of the 20th century's premier logicians. With A. N. Whitehead he wrote ""Principia Mathematica"", an attempt to create a logical basis for mathematics. His philosophical essay ""On Denoting"" has been considered a ""paradigm of philosophy"". His work has had a considerable influence on mathematics, logic, set theory, linguistics, artificial intelligence, cognitive science, computer science (see type theory and type system), and philosophy, especially the philosophy of language, epistemology, and metaphysics. \n\nRussell was a prominent anti-war activist; he championed anti-imperialism. Occasionally, he advocated preventive nuclear war, before the opportunity provided by the atomic monopoly had passed, and ""welcomed with enthusiasm"" world government. He went to prison for his pacifism during World War I. Later, he concluded war against Adolf Hitler was a necessary ""lesser of two evils"". He criticized Stalinist totalitarianism, attacked the involvement of the United States in the Vietnam War, and was an outspoken proponent of nuclear disarmament. In 1950 Russell was awarded the Nobel Prize in Literature ""in recognition of his varied and significant writings in which he champions humanitarian ideals and freedom of thought"".",Where was he born?,Great Britain,"Bertrand Arthur William Russell, 3rd Earl Russell, (; 18 May 1872 – 2 February 1970) was a British philosopher"
13,"Federalism refers to the mixed or compound mode of government, combining a general government (the central or 'federal' government) with regional governments (provincial, state, Land, cantonal, territorial or other sub-unit governments) in a single political system. Its distinctive feature, exemplified in the founding example of modern federalism of the United States of America under the Constitution of 1789, is a relationship of parity between the two levels of government established. It can thus be defined as a form of government in which there is a division of powers between two levels of government of equal status. \n\nUntil recently, in the absence of prior agreement on a clear and precise definition, the concept was thought to mean (as a shorthand) 'a division of sovereignty between two levels of government'. New research, however, argues that this cannot be correct, as dividing sovereignty - when this concept is properly understood in its core meaning of the final and absolute source of political authority in a political community - is not possible. The descent of the United States into Civil War in the mid-nineteenth century, over disputes about unallocated competences concerning slavery and ultimately the right of secession, showed this. One or other level of government could be sovereign to decide such matters, but not both simultaneously. Therefore, it is now suggested that federalism is more appropriately conceived as 'a division of the powers flowing from sovereignty between two levels of government'. What differentiates the concept from other multi-level political forms is the characteristic of equality of standing between the two levels of government established. This clarified definition opens the way to identifying two distinct federal forms, where before only one was known, based upon whether sovereignty resides in the whole (in one people) or in the parts (in many peoples): the federal state (or federation) and the federal union of states (or federal union), respectively. Leading examples of the federal state include the United States, Germany, Canada, Switzerland, Australia and India. The leading example of the federal union of states is the European Union.",What was the civil war ultimately about?,the right of secession,"The descent of the United States into Civil War in the mid-nineteenth century, over disputes about unallocated competences concerning slavery and ultimately the right of secession, showed this"
14,"CHAPTER XV. \n\n""DROP IT."" \n\nFor ten or twelve days after the little dinner in Berkeley Square Guss Mildmay bore her misfortunes without further spoken complaint. During all that time, though they were both in London, she never saw Jack De Baron, and she knew that in not seeing her he was neglecting her. But for so long she bore it. It is generally supposed that young ladies have to bear such sorrow without loud complaint; but Guss was more thoroughly emancipated than are some young ladies, and when moved was wont to speak her mind. At last, when she herself was only on foot with her father, she saw Jack De Baron riding with Lady George. It is quite true that she also saw, riding behind them, her perfidious friend, Mrs. Houghton, and a gentleman whom at that time she did not know to be Lady George's father. This was early in March, when equestrians in the park are not numerous. Guss stood for a moment looking at them, and Jack De Baron took off his hat. But Jack did not stop, and went on talking with that pleasant vivacity which she, poor girl, knew so well and valued so highly. Lady George liked it too, though she could hardly have given any reason for liking it, for, to tell the truth, there was not often much pith in Jack's conversation. \n\nOn the following morning Captain De Baron, who had lodgings in Charles Street close to the Guards' Club, had a letter brought to him before he was out of bed. The letter was from Guss Mildmay, and he knew the handwriting well. He had received many notes from her, though none so interesting on the whole as was this letter. Miss Mildmay's letter to Jack was as follows. It was written, certainly, with a swift pen, and, but that he knew her writing well, would in parts have been hardly legible.",did she feel neglected?,yes,he knew that in not seeing her he was neglecting her.
15,"Josie started planning her new garden in the winter. She chose flowers and vegetable style=""display:inline""s that could grow in her area. She looked through the seed magazines. She ordered the tastiest kind of each vegetable style=""display:inline"" and the prettiest kind of each flower. She talked to a friend about her plans. It seemed like the snow would never melt. \n\nBut Josie didn't have to wait for spring to get started. Six weeks before the last frost, Josie planted seeds indoors. The tiny seedlings pushed up through the soil and began to grow. \n\nFinally spring arrived. Each day, Josie moved the seedlings outside for a few hours so they could get used to the cooler temperatures. Josie worked in her garden, digging the soil. She added a special growing mix from the garden store to make the soil better. When everything was ready, she removed the seedlings from their trays and planted them in her garden. The warm sun and rich soil helped her vegetable style=""display:inline""s and flowers grow.",was she planning on growing ugly flowers,no,"She ordered the tastiest kind of each vegetable style=""display:inline"" and the prettiest kind of each flower."


In [8]:
train_df = train_df[['story','input_text_x', 'input_text_y', 'span_start', 'span_end']]
val_df = val_df[['story','input_text_x', 'input_text_y', 'span_start', 'span_end']]
test_df = test_df[['story','input_text_x', 'input_text_y', 'span_start', 'span_end']]

In [9]:
train_dataset = Dataset.from_dict(train_df.iloc[:40000])
val_dataset = Dataset.from_dict(val_df.iloc[:2000])
test_dataset = Dataset.from_dict(test_df.iloc[:2000])

dataset_COQA = DatasetDict({'train':train_dataset,'validation':val_dataset,'test':test_dataset})

In [10]:
dataset_COQA

DatasetDict({
    train: Dataset({
        features: ['story', 'input_text_x', 'input_text_y', 'span_start', 'span_end'],
        num_rows: 40000
    })
    validation: Dataset({
        features: ['story', 'input_text_x', 'input_text_y', 'span_start', 'span_end'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['story', 'input_text_x', 'input_text_y', 'span_start', 'span_end'],
        num_rows: 2000
    })
})

## [Task 3] Model definition

Write your own script to define the following transformer-based models from [huggingface](https://HuggingFace.co/).

* [M1] DistilRoBERTa (distilberta-base)
* [M2] BERTTiny (bert-tiny)

**Note**: Remember to install the ```transformers``` python package!

**Note**: We consider small transformer models for computational reasons!

In [11]:
from transformers import AutoTokenizer, PreTrainedTokenizerFast
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

* [M1] DistilRoBERTa (distilberta-base)

In [12]:
model_checkpoint_M1 = 'distilroberta-base'
    
tokenizer_M1 = AutoTokenizer.from_pretrained(model_checkpoint_M1)

model_M1 = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint_M1)

Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForQuestionAnswering: ['lm_head.decoder.weight', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing RobertaForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForQuestionAnswering were not initialized from the model checkpoint at distilroberta-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be 

In [13]:
assert isinstance(tokenizer_M1, PreTrainedTokenizerFast)

* [M2] BERTTiny (bert-tiny)

In [14]:
model_checkpoint_M2 = 'prajjwal1/bert-tiny'
    
tokenizer_M2 = AutoTokenizer.from_pretrained(model_checkpoint_M2)

model_M2 = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint_M2)

Some weights of the model checkpoint at prajjwal1/bert-tiny were not used when initializing BertForQuestionAnswering: ['cls.predictions.transform.dense.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForQuestionAnswering were not initialized from the model

In [15]:
assert isinstance(tokenizer_M2, PreTrainedTokenizerFast)

## [Task 4] Question generation with text passage $P$ and question $Q$

We want to define $f_\theta(P, Q)$. 

Write your own script to implement $f_\theta$ for each model: M1 and M2.

#### Formulation

Consider a dialogue on text passage $P$. 

For each question $Q_i$ at dialogue turn $i$, your model should take $P$ and $Q_i$ and generate $A_i$.

* [M1] DistilRoBERTa (distilberta-base)

In [16]:
max_length_question = 384 
max_length_answer = 6 
doc_stride = 128 

In [17]:
# The function `prepare_features` is used to tokenize each row of the `Dataset` object. 
# The function takes as input a single row of the `Dataset` object, which contains the questions, 
# the stories and answers.

# The function first tokenizes the questions and stories using the tokenizer object and stores the 
# resulting `input_ids` and `attention_mask` in the same row of the `Dataset`. Then it tokenizes the 
# answers and stores the resulting `input_ids` in the labels column of the `Dataset`.

# The original fields are then removed from the `Dataset` row to save memory. Finally, the `PAD` token 
# is removed from the labels column by checking if the `PAD` token appears in each list of labels, and 
# if so, the list is truncated at that point.

# The updated function is then used in the `map` method of the `Dataset` object to tokenize all rows in 
# the `Dataset` in a batched manner, with a specified batch size. The resulting Dataset object will 
# contain the `input_ids`, `attention_mask`, and `labels` columns, which are ready to be used as 
# inputs for a transformer-based question answering model.

def prepare_features(data_row, tokenizer, max_length_question, max_length_answer, doc_stride=None):
    questions = data_row['input_text_x']
    stories = data_row['story']
    answers = data_row['input_text_y']
    
    # Tokenize the Question and Context columns
    try:
      encoded_inputs = tokenizer(questions, 
                                stories, 
                                padding='max_length', 
                                truncation='only_second', 
                                max_length=max_length_question,
                                doc_stride=doc_stride)
    except:
      encoded_inputs = tokenizer(questions, 
                                stories, 
                                padding='max_length', 
                                truncation='only_second', 
                                max_length=max_length_question)
                                 
    # Tokenize the Answer column
    encoded_labels = tokenizer(answers, 
                               padding='max_length', 
                               truncation=True, 
                               max_length=max_length_answer)

    # Store the resulting input_ids and attention_mask in the same row of the Dataset
    data_row['input_ids'] = encoded_inputs.input_ids
    data_row['attention_mask'] = encoded_inputs.attention_mask

    # Store the resulting input_ids in the labels column
    data_row['labels'] = encoded_labels.input_ids.copy()

    # Remove the original columns
    data_row.pop('story')
    data_row.pop('input_text_x')
    data_row.pop('input_text_y')

    # # shift the labels and ignore the PAD token
    data_row['labels'] = [[-100 if token == tokenizer.pad_token_id else token 
                            for token in labels] for labels in data_row['labels']]

    # remove the PAD token
    # data_row["labels"] = [labels[:labels.index(tokenizer.pad_token_id)] if tokenizer.pad_token_id in labels 
    #                      else labels for labels in data_row["labels"]]

    return data_row

In [18]:
# Use the `prepare_features` function in the map method of the Dataset object
tokenized_datasets_M1 = dataset_COQA.map(
    lambda datarow: prepare_features(datarow, tokenizer_M1, max_length_question, max_length_answer),
    batched=True,
    # batch_size=64
)

  0%|          | 0/40 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

In [19]:
tokenized_datasets_M1

DatasetDict({
    train: Dataset({
        features: ['span_start', 'span_end', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 40000
    })
    validation: Dataset({
        features: ['span_start', 'span_end', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['span_start', 'span_end', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 2000
    })
})

* [M2] BERTTiny (bert-tiny)

In [20]:
max_length_question = 384 
max_length_answer = 6 
doc_stride = 128 

In [21]:
# Use the `prepare_features` function in the map method of the Dataset object
tokenized_datasets_M2 = dataset_COQA.map(
    lambda datarow: prepare_features(datarow, tokenizer_M2, max_length_question, max_length_answer),
    batched=True,
    # batch_size=64
)

  0%|          | 0/40 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

In [22]:
tokenized_datasets_M2

DatasetDict({
    train: Dataset({
        features: ['span_start', 'span_end', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 40000
    })
    validation: Dataset({
        features: ['span_start', 'span_end', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['span_start', 'span_end', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 2000
    })
})

## [Task 5] Train and evaluate $f_\theta(P, Q)$ and $f_\theta(P, Q, H)$

Write your own script to train and evaluate your $f_\theta(P, Q)$ and $f_\theta(P, Q, H)$ models.

### Instructions

* Perform multiple train/evaluation seed runs: [42, 2022, 1337].$^1$
* Evaluate your models with the following metrics: SQUAD F1-score.$^2$
* Fine-tune each transformer-based models for **3 epochs**.
* Report evaluation SQUAD F1-score computed on the validation and test sets.

$^1$ Remember what we said about code reproducibility in Tutorial 2!

$^2$ You can use ```allennlp``` python package for a quick implementation of SQUAD F1-score: ```from allennlp_models.rc.tools import squad```. 

In [23]:
from transformers import default_data_collator

* [M1] DistilRoBERTa (distilberta-base)

In [24]:
model_checkpoint_M1 = 'distilroberta-base'

# Define the TrainingArguments
training_args_M1 = TrainingArguments(
    output_dir=f'{model_checkpoint_M1}-finetuned-coqa',  # directory to save the fine-tuned model
    evaluation_strategy='epoch',  # evaluate after a fixed number of updates
    per_device_train_batch_size=16,  # batch size for each GPU/TPU core/CPU
    per_device_eval_batch_size=16,  # batch size for each GPU/TPU core/CPU
    weight_decay=0.01,  # L2 regularization
    learning_rate=2e-5,  # initial learning rate
    num_train_epochs=3,  # number of training epochs
    warmup_steps=100,  # number of warmup steps for the learning rate scheduler
    logging_steps=100,  # log the training loss every 500 steps
    save_steps=200,  # save the model every 500 steps
    seed=42,  # random seed for reproducibility
)



A dataset collator is a function used in data processing for deep learning models, 
especially in training and evaluation. It collates, or collects, several examples 
from a dataset into a batch and performs operations on the batch, such as padding 
or stacking. This is usually done to make the input data compatible with the model's 
batch size, which is the number of samples processed together in one forward/backward pass. 
The dataset collator takes care of the preprocessing required to format the examples in the batch, 
allowing the data to be efficiently processed by the deep learning framework.

In [25]:
trainer_M1 = Trainer(
    model_M1,
    training_args_M1,
    train_dataset=tokenized_datasets_M1['train'],
    eval_dataset=tokenized_datasets_M1['validation'],
    data_collator=default_data_collator,
    tokenizer=tokenizer_M1
)

In [26]:
trainer_M1.train()

The following columns in the training set don't have a corresponding argument in `RobertaForQuestionAnswering.forward` and have been ignored: labels. If labels are not expected by `RobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 40000
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 7500
  Number of trainable parameters = 81529346


TypeError: ignored

In [None]:
trainer_M1.save_model(f'{model_checkpoint_M1}-finetuned-coqa')

* [M2] BERTTiny (bert-tiny)

In [None]:
model_name_M2 = 'prajjwal1/bert-tiny'
args_M2 = TrainingArguments(
    f"{model_name_M2}-finetuned-coqa",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01
)

In [None]:
trainer_M2 = Trainer(
    model_M2,
    args_M2,
    train_dataset=tokenized_datasets_M2["train"],
    eval_dataset=tokenized_datasets_M2["validation"],
    data_collator=default_data_collator,
    tokenizer=tokenizer_M2,
)

In [None]:
trainer_M2.train()

## [Task 6] Error Analysis

Perform a simple and short error analysis as follows:
* Group dialogues by ```source``` and report the worst 5 model errors for each source (w.r.t. SQUAD F1-score).
* Inspect observed results and try to provide some comments (e.g., do the models make errors when faced with a particular question type?)$^1$

$^1$ Check the [paper](https://arxiv.org/pdf/1808.07042.pdf) for some valuable information about question/answer types (e.g., Table 6, Table 8) 

# Assignment Evaluation

The following assignment points will be awarded for each task as follows:

* Task 1, Pre-processing $\rightarrow$ 0.5 points.
* Task 2, Dataset Splitting $\rightarrow$ 0.5 points.
* Task 3 and 4, Models Definition $\rightarrow$ 1.0 points.
* Task 5 and 6, Models Training and Evaluation $\rightarrow$ 2.0 points.
* Task 7, Analysis $\rightarrow$ 1.0 points.
* Report $\rightarrow$ 1.0 points.

**Total** = 6 points <br>

We may award an additional 0.5 points for outstanding submissions. 
 
**Speed Bonus** = 0.5 extra points <br>

# Report

We apply the rules described in Assignment 1 regarding the report.
* Write a clear and concise report following the given overleaf template (**max 2 pages**).
* Report validation and test results in a table.$^1$
* **Avoid reporting** code snippets or copy-paste terminal outputs $\rightarrow$ **Provide a clean schema** of what you want to show

# Comments and Organization

Remember to properly comment your code (it is not necessary to comment each single line) and don't forget to describe your work!

Structure your code for readability and maintenance. If you work with Colab, use sections. 

This allows you to build clean and modular code, as well as easy to read and to debug (notebooks can be quite tricky time to time).

# FAQ (READ THIS!)

---

**Question**: Does Task 3 also include data tokenization and conversion step?

**Answer:** Yes! These steps are usually straightforward since ```transformers``` also offers a specific tokenizer for each model.

**Example**: 

```
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
encoded_text = tokenizer(text)
%% Alternatively
inputs = tokenizer.tokenize(text, add_special_tokens=True, max_length=min(max_length, 512))
input_ids, attention_mask = inputs['input_ids'], inputs['attention_mask']
```

**Suggestion**: Hugginface's documentation is full of tutorials and user-friendly APIs.

---
---

**Question**: I'm hitting **out of memory error** when training my models, do you have any suggestions?

**Answer**: Here are some common workarounds:

1. Try decreasing the mini-batch size
2. Try applying a different padding strategy (if you are applying padding): e.g. use quantiles instead of maximum sequence length

---
---

# Contact

For any doubt, question, issue or help, you can always contact us at the following email addresses:

Teaching Assistants:

* Andrea Galassi -> a.galassi@unibo.it
* Federico Ruggeri -> federico.ruggeri6@unibo.it

Professor:

* Paolo Torroni -> p.torroni@unibo.it

# The End!

Questions?