## [Task 1] Remove unanswerable QA pairs

Write your own script to remove unanswerable QA pairs from both train and validation sets.

In [29]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
import json
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated
import warnings
warnings.filterwarnings('ignore')

!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## Dataset Download


In [5]:
import os
import urllib.request
from tqdm import tqdm

class DownloadProgressBar(tqdm):
    def update_to(self, b=1, bsize=1, tsize=None):
        if tsize is not None:
            self.total = tsize
        self.update(b * bsize - self.n)
        
def download_url(url, output_path):
    with DownloadProgressBar(unit='B', unit_scale=True,
                             miniters=1, desc=url.split('/')[-1]) as t:
        urllib.request.urlretrieve(url, filename=output_path, reporthook=t.update_to)

def download_data(data_path, url_path, suffix):    
    if not os.path.exists(data_path):
        os.makedirs(data_path)
        
    data_path = os.path.join(data_path, f'{suffix}.json')

    if not os.path.exists(data_path):
        print(f"Downloading CoQA {suffix} data split... (it may take a while)")
        download_url(url=url_path, output_path=data_path)
        print("Download completed!")

In [3]:
# Train data
train_url = "https://nlp.stanford.edu/data/coqa/coqa-train-v1.0.json"
download_data(data_path='coqa', url_path=train_url, suffix='train')

# Test data
test_url = "https://nlp.stanford.edu/data/coqa/coqa-dev-v1.0.json"
download_data(data_path='coqa', url_path=test_url, suffix='test')  

Downloading CoQA train data split... (it may take a while)


coqa-train-v1.0.json: 49.0MB [00:05, 8.26MB/s]                            


Download completed!
Downloading CoQA test data split... (it may take a while)


coqa-dev-v1.0.json: 9.09MB [00:01, 7.06MB/s]                            

Download completed!





## Data Inspection

In [51]:
# Load the raw CoQA train split; the context manager closes the file handle
with open('/content/coqa/train.json') as f:
    data = json.load(f)
data

{'version': '1.0',
 'data': [{'source': 'wikipedia',
   'id': '3zotghdk5ibi9cex97fepx7jetpso7',
   'filename': 'Vatican_Library.txt',
   'story': 'The Vatican Apostolic Library (), more commonly called the Vatican Library or simply the Vat, is the library of the Holy See, located in Vatican City. Formally established in 1475, although it is much older, it is one of the oldest libraries in the world and contains one of the most significant collections of historical texts. It has 75,000 codices from throughout history, as well as 1.1 million printed books, which include some 8,500 incunabula. \n\nThe Vatican Library is a research library for history, law, philosophy, science and theology. The Vatican Library is open to anyone who can document their qualifications and research needs. Photocopies for private study of pages from books published between 1801 and 1990 can be requested in person or by mail. \n\nIn March 2014, the Vatican Library began an initial four-year project of digitising

In [56]:
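# Same approach as the cell below, built with json_normalize instead of a loop
with open('/content/coqa/train.json') as f:
    data = json.load(f)
# One row per question, carrying the dialogue-level metadata along
qas = json_normalize(data['data'], ['questions'], ['source', 'id', 'story'])
# One row per answer, keyed by dialogue id
ans = json_normalize(data['data'], ['answers'], ['id'])
# Align each question with its answer; _x columns come from qas, _y from ans
train_df = pd.merge(qas, ans, on=['id', 'turn_id'])
train_df.loc[10000:108647, ['turn_id', 'input_text_x', 'input_text_y', 'span_text']]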

| | turn_id | input_text_x | input_text_y | span_text |
|---|---|---|---|---|
| 10000 | 6 | Where did he grow up? | unknown | unknown |
| 10001 | 7 | Did he have any siblings? | yes | seventh child |
| 10002 | 8 | How many? | unknown | unknown |
| 10003 | 9 | What was his dad known for? | being abusive | physically abused by his father, |
| 10004 | 10 | Did his dad help him in any way? | yes | his father's strict discipline |
| ... | ... | ... | ... | ... |
| 108642 | 10 | Who was a sub? | Xabi Alonso | substitute Xabi Alonso |
| 108643 | 11 | Was it his first game this year? | Yes | Xabi Alonso made his first appearance of the ... |
| 108644 | 12 | What position did the team reach? | third | Real moved up to third in the table |
| 108645 | 13 | Who was ahead of them? | Barca. | six points behind Barca. |


In [85]:
cols = ["text", "question", "answer", "span_text"]

coqa = pd.read_json('/content/coqa/train.json')

# Collect one row per QA pair, skipping unanswerable pairs (answer == 'unknown')
comp_list = []
for index, row in coqa.iterrows():
    dialogue = row["data"]
    for question, answer in zip(dialogue["questions"], dialogue["answers"]):
        if answer["input_text"] != 'unknown':
            comp_list.append([dialogue["story"],
                              question["input_text"],
                              answer["input_text"],
                              answer["span_text"]])

# Create the pandas DataFrame
new_df = pd.DataFrame(comp_list, columns=cols)
# Save it in CSV format
new_df.to_csv("CoQA_data.csv", index=False)
# Read it back and use it as CSV
data = pd.read_csv("CoQA_data.csv")
data.iloc[0:100]

| | text | question | answer | span_text |
|---|---|---|---|---|
| 0 | The Vatican Apostolic Library (), more commonl... | When was the Vat formally opened? | It was formally established in 1475 | Formally established in 1475 |
| 1 | The Vatican Apostolic Library (), more commonl... | what is the library for? | research | he Vatican Library is a research library |
| 2 | The Vatican Apostolic Library (), more commonl... | for what subjects? | history, and law | Vatican Library is a research library for hist... |
| 3 | The Vatican Apostolic Library (), more commonl... | and? | philosophy, science and theology | Vatican Library is a research library for hist... |
| 4 | The Vatican Apostolic Library (), more commonl... | what was started in 2014? | a project | March 2014, the Vatican Library began an initi... |
| ... | ... | ... | ... | ... |
| 95 | (CNN) -- A lawsuit filed by the family of Robe... | AGAINST WHOM? | Fabulous Coach Lines | Fabulous Coach Lines |
| 96 | (CNN) -- A lawsuit filed by the family of Robe... | WHAT DOES THE FAMILY ACCUSE THE COMPANY OF? | The company consented to the illegal acts of h... | consented to the illegal acts of hazing by stu... |
| 97 | (CNN) -- A lawsuit filed by the family of Robe... | WHAT HAPPENED TO ROBERT? | Beaten to death | beaten to death |
| 98 | (CNN) -- A lawsuit filed by the family of Robe... | WHERE WAS HE KILLED? | In a bus. | Bus C |


In [83]:
print("Number of question-answer pairs: ", len(data))

Number of question-answer pairs:  107276
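
The task asks for the filter to be applied to both splits, while the cells above only process the train file. A minimal sketch of a reusable filter follows, assuming (as in the cells above) that an unanswerable pair is one whose answer `input_text` is exactly `'unknown'`, and that questions and answers share a `turn_id`. The function name `remove_unanswerable` and the toy dialogue are hypothetical, introduced only for illustration.

```python
def remove_unanswerable(coqa, unknown_token="unknown"):
    """Return a copy of a CoQA-style dict without unanswerable QA pairs.

    Assumption: a pair is unanswerable when answer["input_text"] equals
    `unknown_token` exactly, mirroring the check used in the notebook.
    """
    filtered = {**coqa, "data": []}
    for dialogue in coqa["data"]:
        # Turn ids whose answer is not the 'unknown' marker
        kept_turns = {a["turn_id"] for a in dialogue["answers"]
                      if a["input_text"] != unknown_token}
        new_dialogue = dict(dialogue)
        new_dialogue["questions"] = [q for q in dialogue["questions"]
                                     if q["turn_id"] in kept_turns]
        new_dialogue["answers"] = [a for a in dialogue["answers"]
                                   if a["turn_id"] in kept_turns]
        filtered["data"].append(new_dialogue)
    return filtered


# Toy dialogue (hypothetical data) with one unanswerable turn
toy = {"version": "1.0", "data": [{
    "id": "d0", "story": "A short story.",
    "questions": [{"turn_id": 1, "input_text": "Who?"},
                  {"turn_id": 2, "input_text": "Where?"}],
    "answers": [{"turn_id": 1, "input_text": "Anna", "span_text": "Anna"},
                {"turn_id": 2, "input_text": "unknown", "span_text": "unknown"}],
}]}

clean = remove_unanswerable(toy)
n_pairs = sum(len(d["questions"]) for d in clean["data"])
print(n_pairs)  # 1
```

The same function could then be called once on the dict loaded from `coqa/train.json` and once on the dict loaded from `coqa/test.json`. Dialogues left with no QA pairs are kept here; dropping them is a separate design choice.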


## [Task 2] Train, Validation and Test splits

CoQA only provides a train and validation set since the test set is hidden for evaluation purposes.

We'll consider the provided validation set as a test set. <br>
$\rightarrow$ Write your own script to:
* Split the train data into train and validation splits (80% train and 20% val)
* Perform splits such that a dialogue appears in one split only! (i.e., split at dialogue level)
* Perform splitting using the following seed for reproducibility: 42

#### Reproducibility Memo

Refer back to Tutorial 2 for how to fix a specific random seed for reproducibility!
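
As a reminder of what seed fixing typically looks like, here is a minimal sketch. It is not the tutorial's code: the helper name `set_seed` is hypothetical, and the NumPy and TensorFlow calls are only made when those packages are installed.

```python
import random


def set_seed(seed=42):
    """Seed the random number generators this notebook is likely to use."""
    random.seed(seed)  # Python's built-in RNG
    try:
        import numpy as np
        np.random.seed(seed)  # NumPy's global RNG
    except ImportError:
        pass
    try:
        import tensorflow as tf
        tf.random.set_seed(seed)  # TensorFlow's global seed
    except ImportError:
        pass


# Re-seeding reproduces the same draw
set_seed(42)
first_draw = random.random()
set_seed(42)
second_draw = random.random()
print(first_draw == second_draw)  # True
```

Functions that accept their own `random_state` argument, such as scikit-learn's `train_test_split`, should still be given the seed explicitly.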

In [None]:
from sklearn.model_selection import train_test_split

# Assumes train_df from the Data Inspection cell (one row per QA pair, with the
# dialogue 'id'); keep only answerable pairs, as required by Task 1
answerable_df = train_df[train_df['input_text_y'] != 'unknown']

# Split at dialogue level: shuffle the unique dialogue ids, not the QA pairs,
# so that every dialogue ends up in exactly one split
dialogue_ids = answerable_df['id'].unique()
train_ids, val_ids = train_test_split(dialogue_ids,
                                      train_size=0.80,
                                      test_size=0.20,
                                      random_state=42)

train_data = answerable_df[answerable_df['id'].isin(train_ids)]
val_data = answerable_df[answerable_df['id'].isin(val_ids)]

print('Dataset splits statistics: ')
print(f'Train data: {train_data.shape}')
print(f'Validation data: {val_data.shape}')


## [Task 3] Model definition

Write your own script to define the following transformer-based models from [Hugging Face](https://huggingface.co/).

* [M1] DistilRoBERTa (distilroberta-base)
* [M2] BERTTiny (bert-tiny)

**Note**: Remember to install the ```transformers``` python package!

**Note**: We consider small transformer models for computational reasons!

In [None]:
from transformers import TFBertForQuestionAnswering
from transformers import BertTokenizer
from transformers import DistilBertForQuestionAnswering
from transformers import BertForQuestionAnswering
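
One way the two models could be defined is sketched below. It swaps the model-specific classes imported above for the generic `AutoTokenizer` and `AutoModelForQuestionAnswering` classes (PyTorch), so that the same code path handles both checkpoints. The Hub identifiers are assumptions: `distilroberta-base` for M1 and `prajjwal1/bert-tiny` for M2; the helper name `load_qa_model` is hypothetical. Both checkpoints are pretrained encoders, so the question-answering head is expected to be newly initialized and to need fine-tuning.

```python
# Assumed Hugging Face Hub identifiers for the two models (not confirmed here)
MODEL_CHECKPOINTS = {
    "M1": "distilroberta-base",
    "M2": "prajjwal1/bert-tiny",
}


def load_qa_model(key):
    """Load the tokenizer and an extractive-QA model for 'M1' or 'M2'.

    transformers is imported lazily so that defining this helper does not
    require the package; calling it downloads the checkpoint from the Hub.
    """
    from transformers import AutoModelForQuestionAnswering, AutoTokenizer

    checkpoint = MODEL_CHECKPOINTS[key]
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForQuestionAnswering.from_pretrained(checkpoint)
    return tokenizer, model


# Example usage (downloads weights, so it is left commented out):
# tokenizer, model = load_qa_model("M1")
```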