LAST LINKs:


* https://colab.research.google.com/drive/1WIk2bxglElfZewOHboPFNj8H44_VAyKE?usp=sharing#scrollTo=Gw3IZYrfKl4Z
* https://medium.com/analytics-vidhya/fine-tune-a-roberta-encoder-decoder-model-trained-on-mlm-for-text-generation-23da5f3c1858
* https://huggingface.co/course/chapter7/7?fw=tf

LINKs:
* https://github.com/huggingface/notebooks/blob/main/examples/question_answering.ipynb

* https://github.com/Michael-M-Mike/Unibo-NLP-Assignments/blob/main/A2_Seq2Seq_Abstractive_Question_Answering_(QA)_on_CoQA/distilroberta_42.ipynb

# Assignment 2

**Credits**: Andrea Galassi, Federico Ruggeri, Paolo Torroni

**Keywords**: Transformers, Question Answering, CoQA

## Deadlines

* **December 11**, 2022: deadline for having assignments graded by January 11, 2023
* **January 11**, 2023: deadline for half-point speed bonus per assignment
* **After January 11**, 2023: assignments are still accepted, but there will be no speed bonus

## Overview

### Problem

Question Answering (QA) on [CoQA](https://stanfordnlp.github.io/coqa/) dataset: a conversational QA dataset.

### Task

Given a question $Q$ and a text passage $P$, the task is to generate the answer $A$.<br>
$\rightarrow A$ can be: (i) free-form text or (ii) unanswerable.

**Note**: a question $Q$ can refer to previous dialogue turns. <br>
$\rightarrow$ The dialogue history $H$ may be a valuable input for producing the correct answer $A$.

### Models

We are going to experiment with transformer-based models to define the following models:

1.  $A = f_\theta(Q, P)$

2. $A = f_\theta(Q, P, H)$

where $f_\theta$ is the transformer-based model we have to define with $\theta$ parameters.
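One possible way to realize the two input configurations is to flatten everything into a single source string before tokenization. The sketch below is only an illustration under assumptions: the function name `build_input`, the `</s>` separator and the `max_turns` parameter are hypothetical and not part of the assignment.

```python
from typing import List, Optional, Tuple

SEP = " </s> "  # assumed separator; the real one depends on the tokenizer in use


def build_input(question: str, passage: str,
                history: Optional[List[Tuple[str, str]]] = None,
                max_turns: int = 2) -> str:
    """Flatten (Q, P) or (Q, P, H) into one source string."""
    parts = []
    if history:
        # Keep only the most recent turns so the passage is not crowded out
        for prev_q, prev_a in history[-max_turns:]:
            parts.append(f"{prev_q} {prev_a}")
    parts.append(question)
    parts.append(passage)
    return SEP.join(parts)


passage = "There lived a little white kitten named Cotton."
# Model 1: A = f(Q, P)
without_history = build_input("Where did she live?", passage)
# Model 2: A = f(Q, P, H)
with_history = build_input("Where did she live?", passage,
                           history=[("What color was Cotton?", "white")])
print(without_history)
print(with_history)
```

With the history prepended, the pronoun "she" in the second question can be resolved from the previous turn, which is the motivation for the second model.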

## The CoQA dataset

<center>
    <img src="https://drive.google.com/uc?export=view&id=16vrgyfoV42Z2AQX0QY7LHTfrgektEKKh" width="750"/>
</center>

For detailed information about the dataset, feel free to check the original [paper](https://arxiv.org/pdf/1808.07042.pdf).






## Dataset Statistics

* **127k** QA pairs.
* **8k** conversations.
* **7** diverse domains: Children's Stories, Literature, Mid/High School Exams, News, Wikipedia, Reddit, Science.
* Average conversation length: **15 turns** (i.e., QA pairs).
* Almost **half** of CoQA questions refer back to **conversational history**.
* Only **train** and **validation** sets are available.

## Dataset snippet

The dataset is stored in JSON format. Each dialogue is represented as follows:

```
{
    "source": "mctest",
    "id": "3dr23u6we5exclen4th8uq9rb42tel",
    "filename": "mc160.test.41",
    "story": "Once upon a time, in a barn near a farm house, there lived a little white kitten named Cotton. 
    Cotton lived high up in a nice warm place above the barn where all of the farmer's horses slept. [...]" % <-- $P$
    "questions": [
        {
            "input_text": "What color was Cotton?",   % <-- $Q_1$
            "turn_id": 1
        },
        {
            "input_text": "Where did she live?",
            "turn_id": 2
        },
        [...]
    ],
    "answers": [
        {
            "span_start": 59,   % <-- $R_1$ start index
            "span_end": 93,    % <-- $R_1$ end index
            "span_text": "a little white kitten named Cotton",   % <-- $R_1$
            "input_text": "white",   % <-- $A_1$
            "turn_id": 1
        },
        [...]
    ]
}
```
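As a minimal sketch of how the fields above relate to each other, the snippet below pairs each question with its answer through `turn_id` and recovers the rationale by slicing the story with `span_start` and `span_end`. The helper name `iter_turns` is hypothetical, and the story is shortened to its first sentence.

```python
dialogue = {
    "story": "Once upon a time, in a barn near a farm house, "
             "there lived a little white kitten named Cotton.",
    "questions": [{"input_text": "What color was Cotton?", "turn_id": 1}],
    "answers": [{"span_start": 59, "span_end": 93,
                 "span_text": "a little white kitten named Cotton",
                 "input_text": "white", "turn_id": 1}],
}


def iter_turns(dialogue):
    """Yield (question, answer, rationale) triples matched on turn_id."""
    answers = {a["turn_id"]: a for a in dialogue["answers"]}
    for q in dialogue["questions"]:
        a = answers[q["turn_id"]]
        # span_end is exclusive, so a plain slice returns the rationale
        rationale = dialogue["story"][a["span_start"]:a["span_end"]]
        yield q["input_text"], a["input_text"], rationale


turns = list(iter_turns(dialogue))
print(turns)
```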

### Simplifications

Each dialogue also contains an additional field ```additional_answers```. For simplicity, we **ignore** this field and only consider one ground-truth answer $A$ and text rationale $R$.

Only 1.3% of the questions in CoQA are unanswerable. For simplicity, we **ignore** those QA pairs.

# [0] Functions and imports

In [1]:
# %%capture
# %pip install datasets
# %pip install transformers
# %pip install tensorflow_addons
# %pip install allennlp-models
# %pip install plotly==5.13.1


# NOTE:
#     - SEED AND ERROR
#     - INPUT AND OUTPUT LENGTHS - https://towardsdatascience.com/to-distil-or-not-to-distil-bert-roberta-and-xlnet-c777ad92f8
#     - DATASET SIZES
#     - WARNINGS DURING MODEL CREATION AND TRAINING
#     - HOW TO USE TEXT SPANS

In [2]:
from IPython.display import display_html, clear_output
from itertools import chain,cycle
from copy import deepcopy
import urllib.request
import transformers
import numpy as np
import json
import time
import os
import torch
import random 
import pandas as pd
from tqdm import tqdm

from sklearn.model_selection import GroupShuffleSplit
from datasets import *
from transformers import AutoTokenizer, PreTrainedTokenizerFast, EncoderDecoderModel
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer, AdamW, DataCollatorForSeq2Seq

In [3]:
if torch.cuda.is_available():
    device = torch.device("cuda")  # use the GPU
else:
    device = torch.device("cpu")  # use the CPU

print("Using device:", device)

Using device: cuda


In [4]:
import plotly.express as px

# Display dataframes
def display(*args,titles=cycle([''])):
    html_str=''
    for df,title in zip(args, chain(titles,cycle(['<br>'])) ):
        html_str+='<th style="text-align:left"><td style="vertical-align:top">'
        html_str+=f'<h4 style="text-align: left;">{title}</h4>'
        html_str+=df.to_html().replace('table','table style="display:inline"')
        html_str+='</td></th>'
    display_html(html_str,raw=True)
    
# Setting seeds for reproducibility
def set_reproducibility(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    transformers.set_seed(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

    os.environ['TF_DETERMINISTIC_OPS'] = '1'

# Check tokenizer special tokens
def check_tokens(tokenizer):
    # Get the special tokens and their corresponding IDs
    special_tokens = tokenizer.special_tokens_map
    special_ids = tokenizer.convert_tokens_to_ids(list(special_tokens.values()))
    print("Special tokens:")
    for token_type, token_list in special_tokens.items():
        print(f"{token_type}: {token_list}")
    # Print the special tokens and their corresponding IDs
    for token, id in zip(special_tokens.keys(), special_ids):
        print(f"{token}: {id}")
        
# Compute metrics in the trainer
def compute_metrics(pred,tokenizer):
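    labels = pred.label_ids
    preds = pred.predictions

    # Label positions ignored by the loss are marked with -100, which cannot be decoded:
    # map them back to the padding token before decoding
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)

    labels_text = tokenizer.batch_decode(labels, skip_special_tokens=True)
    preds_text = tokenizer.batch_decode(preds, skip_special_tokens=True)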
    
    squad_scores=[]
    for i in range(len(preds_text)):
        squad_scores.append(compute_f1(str(preds_text[i]), str(labels_text[i])))
    mean_squad_f1 = sum(squad_scores)/len(squad_scores)

    return {"squad_f1_score": mean_squad_f1}

# # Generate Answers (on test set)
# def generate_answers(test_loader,model,tokenizer):
#     i=0
#     for batch in tqdm(test_loader):

#         example = batch['input_ids'].to(device)
#         att_mask = batch['attention_mask'].to(device)
#         generated_ids = model.generate(input_ids=example, 
#                                           attention_mask=att_mask,
#                                           max_length=tokenizer.model_max_length
#                                          )
#         ex = tokenizer.batch_decode(example, skip_special_tokens=True)
#         print(ex[0])
#         print(test_df['question'][i:i+8])

#         generated_answers = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
#         print(f'Generated ans: {generated_answers}')
#         true = batch["labels"]
#         ground_truth = tokenizer.batch_decode(true, skip_special_tokens=True)
#         print(f'True ans: {ground_truth}')
#         i+=8

### SQuAD metric

In [5]:
"""
Functions taken from [the official evaluation script]
(https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/)
for SQuAD version 2.0.
"""
import collections
import re
import string
from typing import Callable, Sequence, TypeVar, Tuple


def make_qid_to_has_ans(dataset):
    qid_to_has_ans = {}
    for article in dataset:
        for p in article["paragraphs"]:
            for qa in p["qas"]:
                qid_to_has_ans[qa["id"]] = bool(qa["answers"])
    return qid_to_has_ans


def normalize_answer(s):
    """Lower text and remove punctuation, articles and extra whitespace."""

    def remove_articles(text):
        regex = re.compile(r"\b(a|an|the)\b", re.UNICODE)
        return re.sub(regex, " ", text)

    def white_space_fix(text):
        return " ".join(text.split())

    def remove_punc(text):
        exclude = set(string.punctuation)
        return "".join(ch for ch in text if ch not in exclude)

    def lower(text):
        return text.lower()

    return white_space_fix(remove_articles(remove_punc(lower(s))))


def get_tokens(s):
    if not s:
        return []
    return normalize_answer(s).split()


def compute_exact(a_pred: str, a_gold: str) -> int:
    return int(normalize_answer(a_pred) == normalize_answer(a_gold))


def compute_f1(a_pred: str, a_gold: str) -> float:
    pred_toks = get_tokens(a_pred)
    gold_toks = get_tokens(a_gold)
    common = collections.Counter(pred_toks) & collections.Counter(gold_toks)  # type: ignore[var-annotated]
    num_same = sum(common.values())
    if len(pred_toks) == 0 or len(gold_toks) == 0:
        # If either is no-answer, then F1 is 1 if they agree, 0 otherwise
        return float(pred_toks == gold_toks)
    if num_same == 0:
        return 0.0
    precision = 1.0 * num_same / len(pred_toks)
    recall = 1.0 * num_same / len(gold_toks)
    f1 = (2 * precision * recall) / (precision + recall)
    return f1


_P = TypeVar("_P")
_G = TypeVar("_G")
_T = TypeVar("_T", int, float, Tuple[int, ...], Tuple[float, ...])


def metric_max_over_ground_truths(
    metric_fn: Callable[[_P, _G], _T], prediction: _P, ground_truths: Sequence[_G]
) -> _T:
    scores_for_ground_truths = []
    for ground_truth in ground_truths:
        score = metric_fn(prediction, ground_truth)
        scores_for_ground_truths.append(score)
    return max(scores_for_ground_truths)


def get_metric_score(prediction: str, gold_answers: Sequence[str]) -> Tuple[int, float]:
    exact_scores = metric_max_over_ground_truths(compute_exact, prediction, gold_answers)
    f1_scores = metric_max_over_ground_truths(compute_f1, prediction, gold_answers)
    return exact_scores, f1_scores
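To make the metric concrete, here is a small worked example of the token-level F1 computed above. It is a condensed, self-contained re-implementation written for illustration (the names `tokens` and `token_f1` are hypothetical), not a replacement for the official functions.

```python
import collections
import re
import string

PUNCTUATION = set(string.punctuation)


def tokens(text: str) -> list:
    """Lowercase, drop punctuation and articles, then split on whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, gold: str) -> float:
    pred_toks, gold_toks = tokens(prediction), tokens(gold)
    if not pred_toks or not gold_toks:
        # If either side is empty, the score is 1 only when both are empty
        return float(pred_toks == gold_toks)
    common = collections.Counter(pred_toks) & collections.Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)


score = token_f1("A white kitten.", "white")
print(round(score, 3))
```

After normalization the prediction becomes `["white", "kitten"]` and the gold answer `["white"]`: one shared token gives precision 1/2 and recall 1, hence F1 = 2/3.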

## [Task 1] Remove unanswerable QA pairs

Write your own script to remove unanswerable QA pairs from both train and validation sets.
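As a rough sketch of this filtering step on the raw JSON structure, the function below drops every turn whose answer text is `unknown`, which is how this notebook identifies unanswerable questions. The helper name `drop_unanswerable` and the toy dialogue are made up for illustration.

```python
def drop_unanswerable(dialogue: dict, marker: str = "unknown") -> dict:
    """Return a copy of the dialogue without turns whose answer equals `marker`."""
    keep = {a["turn_id"] for a in dialogue["answers"]
            if a["input_text"].strip().lower() != marker}
    cleaned = dict(dialogue)  # shallow copy: the original dialogue is left untouched
    cleaned["questions"] = [q for q in dialogue["questions"] if q["turn_id"] in keep]
    cleaned["answers"] = [a for a in dialogue["answers"] if a["turn_id"] in keep]
    return cleaned


toy = {
    "id": "toy-dialogue",  # hypothetical id, for illustration only
    "questions": [{"input_text": "What color was Cotton?", "turn_id": 1},
                  {"input_text": "How old was she?", "turn_id": 2}],
    "answers": [{"input_text": "white", "turn_id": 1},
                {"input_text": "unknown", "turn_id": 2}],
}
cleaned = drop_unanswerable(toy)
print([q["input_text"] for q in cleaned["questions"]])
```

Filtering questions and answers through the same set of `turn_id` values keeps the two lists aligned.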

## Dataset Download


In [6]:
class DownloadProgressBar(tqdm):
    def update_to(self, b=1, bsize=1, tsize=None):
        if tsize is not None:
            self.total = tsize
        self.update(b * bsize - self.n)
        
def download_url(url, output_path):
    with DownloadProgressBar(unit='B', unit_scale=True,
                             miniters=1, desc=url.split('/')[-1]) as t:
        urllib.request.urlretrieve(url, filename=output_path, reporthook=t.update_to)

def download_data(data_path, url_path, suffix):    
    if not os.path.exists(data_path):
        os.makedirs(data_path)
        
    data_path = os.path.join(data_path, f'{suffix}.json')

    if not os.path.exists(data_path):
        print(f"Downloading CoQA {suffix} data split... (it may take a while)")
        download_url(url=url_path, output_path=data_path)
        print("Download completed!")

In [7]:
# Train data
train_url = "https://nlp.stanford.edu/data/coqa/coqa-train-v1.0.json"
download_data(data_path='coqa', url_path=train_url, suffix='train')

# Test data
test_url = "https://nlp.stanford.edu/data/coqa/coqa-dev-v1.0.json"
download_data(data_path='coqa', url_path=test_url, suffix='test')  # <-- Why test? See next slides for an answer!

#### Data Inspection

Spend some time carefully checking the dataset format and how to retrieve the task's inputs and outputs!

In [8]:
# Creating Dataframes and removing unanswerable questions
with open('coqa/train.json') as f:
    train_data = json.load(f)
with open('coqa/test.json') as f:
    test_data = json.load(f)

qas = pd.json_normalize(train_data['data'], ['questions'], ['source', 'id', 'story'])
ans = pd.json_normalize(train_data['data'], ['answers'],['id'])
train_val_df = pd.merge(qas,ans, left_on=['id','turn_id'], right_on=['id','turn_id'])
train_val_df = train_val_df.loc[train_val_df['input_text_y']!='unknown']

qas = pd.json_normalize(test_data['data'], ['questions'], ['source', 'id', 'story'])
ans = pd.json_normalize(test_data['data'], ['answers'],['id'])
test_df = pd.merge(qas,ans, left_on=['id','turn_id'], right_on=['id','turn_id'])
test_df = test_df.loc[test_df['input_text_y']!='unknown']

In [9]:
# Removing bad turns
train_val_df = train_val_df.loc[(train_val_df['bad_turn_x'] != 'True') & (train_val_df['bad_turn_y'] != 'True')]

# Removing equal text/answer entries
train_val_df = train_val_df[train_val_df.story != train_val_df.input_text_y]
test_df = test_df[test_df.story != test_df.input_text_y]

# Removing entries with empty answers
train_val_df = train_val_df[train_val_df['input_text_y'].str.len()>0]
test_df = test_df[test_df['input_text_y'].str.len()>0]

In [10]:
# Text preprocess
def preprocess(ds,columns):
#    ds = ds.replace(r'\n',' ', regex=True)
#    ds = ds.replace(r'[^\w\s]+', ' ', regex=True)
#     for feature in columns:
#         ds[feature] = ds[feature].str.lower().str.strip()
        
    return ds

columns = ['story', 'input_text_x', 'span_text', 'input_text_y']

train_val_df = preprocess(train_val_df,columns)
test_df = preprocess(test_df,columns)

## [Task 2] Train, Validation and Test splits

CoQA only provides a train and validation set since the test set is hidden for evaluation purposes.

We'll consider the provided validation set as a test set. <br>
$\rightarrow$ Write your own script to:
* Split the train data in train and validation splits (80% train and 20% val)
* Perform splits such that a dialogue appears in one split only! (i.e., split at dialogue level)
* Perform splitting using the following seed for reproducibility: 42
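The requirements above can be sketched with the standard library alone: shuffle the unique dialogue ids with the fixed seed, then assign whole dialogues to one side or the other. This is an illustrative alternative, not the `GroupShuffleSplit` call used in this notebook; the function name and the toy ids are assumptions.

```python
import random


def split_dialogue_ids(dialogue_ids, val_fraction=0.2, seed=42):
    """Shuffle unique dialogue ids with a fixed seed and split them 80/20."""
    unique_ids = sorted(set(dialogue_ids))  # sort first so the result does not depend on input order
    random.Random(seed).shuffle(unique_ids)
    n_val = int(round(len(unique_ids) * val_fraction))
    return set(unique_ids[n_val:]), set(unique_ids[:n_val])


# Toy rows: (dialogue id, turn id); the ids are made up for illustration
rows = [(f"dlg{i}", t) for i in range(10) for t in range(1, 4)]
train_ids, val_ids = split_dialogue_ids(d for d, _ in rows)
train_rows = [r for r in rows if r[0] in train_ids]
val_rows = [r for r in rows if r[0] in val_ids]
print(len(train_rows), len(val_rows))
```

Note that the 80/20 ratio applies to dialogues, so the ratio of QA pairs is only approximately 80/20 when dialogues have different lengths.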

#### Reproducibility Memo

Check back tutorial 2 on how to fix a specific random seed for reproducibility!

In [11]:
# Train/Validation Split
set_reproducibility(42)

train_inds, val_inds = next(GroupShuffleSplit(test_size=.20, n_splits=2, random_state = 42).split(train_val_df, groups=train_val_df['id']))

train_df = train_val_df.iloc[train_inds]
val_df = train_val_df.iloc[val_inds].reset_index()

In [12]:
# Checking the Dataframes
print(f'Training set [{train_df.shape}]')
print(f'\tFeatures: {list(train_df.columns)}')
display(train_df.head())
#display(train_df.loc[11:15,['id', 'input_text_x', 'input_text_y', 'span_text']])

print(f'Validation set [{val_df.shape}]')
print(f'\tFeatures: {list(val_df.columns)}')
#display(val_df.head())
#display(val_df.loc[11:15,['id', 'input_text_x', 'input_text_y', 'span_text']])

print(f'Test set [{test_df.shape}]')
print(f'\tFeatures: {list(test_df.columns)}')
#display(test_df.head())
#display(test_df.loc[11:15,['id', 'input_text_x', 'input_text_y', 'span_text']])

Training set [(85823, 11)]
	Features: ['input_text_x', 'turn_id', 'bad_turn_x', 'source', 'id', 'story', 'span_start', 'span_end', 'span_text', 'input_text_y', 'bad_turn_y']


| | input_text_x | turn_id | bad_turn_x | source | id | story | span_start | span_end | span_text | input_text_y | bad_turn_y |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | When was the Vat formally opened? | 1 | | wikipedia | 3zotghdk5ibi9cex97fepx7jetpso7 | The Vatican Apostolic Library (), more commonly called the Vatican Library or simply the Vat, is the library of the Holy See, located in Vatican City. Formally established in 1475, although it is much older, it is one of the oldest libraries in the world and contains one of the most significant collections of historical texts. It has 75,000 codices from throughout history, as well as 1.1 million printed books, which include some 8,500 incunabula. \n\nThe Vatican Library is a research library for history, law, philosophy, science and theology. The Vatican Library is open to anyone who can document their qualifications and research needs. Photocopies for private study of pages from books published between 1801 and 1990 can be requested in person or by mail. \n\nIn March 2014, the Vatican Library began an initial four-year project of digitising its collection of manuscripts, to be made available online. \n\nThe Vatican Secret Archives were separated from the library at the beginning of the 17th century; they contain another 150,000 items. \n\nScholars have traditionally divided the history of the library into five periods, Pre-Lateran, Lateran, Avignon, Pre-Vatican and Vatican. \n\nThe Pre-Lateran period, comprising the initial days of the library, dated from the earliest days of the Church. Only a handful of volumes survive from this period, though some are very significant. | 151 | 179 | Formally established in 1475 | It was formally established in 1475 | |
| 1 | what is the library for? | 2 | | wikipedia | 3zotghdk5ibi9cex97fepx7jetpso7 | (same story as row 0) | 454 | 494 | he Vatican Library is a research library | research | |
| 2 | for what subjects? | 3 | | wikipedia | 3zotghdk5ibi9cex97fepx7jetpso7 | (same story as row 0) | 457 | 511 | Vatican Library is a research library for history, law | history, and law | |
| 3 | and? | 4 | | wikipedia | 3zotghdk5ibi9cex97fepx7jetpso7 | (same story as row 0) | 457 | 545 | Vatican Library is a research library for history, law, philosophy, science and theology | philosophy, science and theology | |
| 4 | what was started in 2014? | 5 | | wikipedia | 3zotghdk5ibi9cex97fepx7jetpso7 | (same story as row 0) | 769 | 879 | March 2014, the Vatican Library began an initial four-year project of digitising its collection of manuscripts | a project | |


Validation set [(21452, 12)]
	Features: ['index', 'input_text_x', 'turn_id', 'bad_turn_x', 'source', 'id', 'story', 'span_start', 'span_end', 'span_text', 'input_text_y', 'bad_turn_y']
Test set [(7917, 9)]
	Features: ['input_text_x', 'turn_id', 'source', 'id', 'story', 'span_start', 'span_end', 'span_text', 'input_text_y']


Now we check whether any dialogue appears in both the train and validation sets.

In [13]:
set_train = set(train_df['id'])
set_val = set(val_df['id'])

# The splits overlap if they share at least one dialogue id
overlap = not set_train.isdisjoint(set_val)

print('Overlap' if overlap else 'No overlap')

No overlap


In [14]:
# Dataframes to Datasets
train_df_to_ds = train_df
val_df_to_ds = val_df
test_df_to_ds = test_df

train_df_to_ds = train_df_to_ds.rename(columns={'input_text_x': 'question', 'story': 'context',\
                                               'input_text_y': 'answer', 'span_text': 'text'})
val_df_to_ds = val_df_to_ds.rename(columns={'input_text_x': 'question', 'story': 'context',\
                                               'input_text_y': 'answer', 'span_text': 'text'})
test_df_to_ds = test_df_to_ds.rename(columns={'input_text_x': 'question', 'story': 'context',\
                                               'input_text_y': 'answer', 'span_text': 'text'})

Since the dataset is large and we care more about motivating our choices than about obtaining the best results, we work on a portion of it.
The next step is truncating the inputs. The pre-trained models we test accept at most 512 tokens, so inputs are truncated to this length. Moreover, we sort each dataset by the combined length (in characters) of its 'context' and 'question' fields, so that the subset we select contains the shortest examples and as few of them as possible need truncation.
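The selection idea can be sketched on plain dictionaries as follows. The helper name `shortest_examples` and the toy data are hypothetical; note also that character length is only a proxy for the number of tokens the tokenizer will produce.

```python
def shortest_examples(examples, k):
    """Return the k examples with the shortest context + question (in characters)."""
    by_length = sorted(examples, key=lambda ex: len(ex["context"]) + len(ex["question"]))
    return by_length[:k]


toy = [
    {"context": "a" * 50, "question": "q" * 10},
    {"context": "a" * 5, "question": "q" * 3},
    {"context": "a" * 20, "question": "q" * 2},
]
subset = shortest_examples(toy, k=2)
print([len(ex["context"]) + len(ex["question"]) for ex in subset])
```

A side effect of this choice is that the subset is biased toward short passages, so results on it may not transfer to the full dataset.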

In [15]:
# Combine 'context' and 'question' fields for each dataframe
train_df_to_ds['context_question'] = train_df_to_ds['context'] + train_df_to_ds['question']
val_df_to_ds['context_question'] = val_df_to_ds['context'] + val_df_to_ds['question']
test_df_to_ds['context_question'] = test_df_to_ds['context'] + test_df_to_ds['question']

# Define a function to compute the length of 'context_question'
def get_context_question_length(df):
    return (df['context_question'].apply(len))

# Compute the lengths of 'context_question' for each dataframe
train_lengths = get_context_question_length(train_df_to_ds)
val_lengths = get_context_question_length(val_df_to_ds)
test_lengths = get_context_question_length(test_df_to_ds)

# Sort each dataframe by length of 'context + question'
train_df_to_ds = train_df_to_ds.iloc[train_lengths.argsort()]
val_df_to_ds = val_df_to_ds.iloc[val_lengths.argsort()]
test_df_to_ds = test_df_to_ds.iloc[test_lengths.argsort()]

# Drop the 'context_question' column from each dataframe
train_df_to_ds = train_df_to_ds.drop('context_question', axis=1)
val_df_to_ds = val_df_to_ds.drop('context_question', axis=1)
test_df_to_ds = test_df_to_ds.drop('context_question', axis=1)

In [16]:
tokenizer=AutoTokenizer.from_pretrained("distilroberta-base")
assert isinstance(tokenizer, PreTrainedTokenizerFast)

In [17]:
def process_samples(sample):
    tokenized_data = tokenizer(sample["context"],sample["question"], truncation="only_first", max_length=512, padding="max_length")
    input_ids = tokenized_data["input_ids"]
    cls_index = input_ids.index(tokenizer.cls_token_id)

    if sample['context'][0]==-1:
        start_position = cls_index
        end_position = cls_index
    else:
        # 'text' already holds the rationale, i.e. context[span_start:span_end]
        gold_text = sample['text']
        start_char = sample['span_start']
        end_char = sample['span_end']

        # Realign offsets that are shifted by one or two characters
        if sample['context'][start_char-1:end_char-1] == gold_text:
            start_char -= 1
            end_char -= 1
        elif sample['context'][start_char-2:end_char-2] == gold_text:
            start_char -= 2
            end_char -= 2

        # span_end is exclusive, so the last character of the span is at end_char - 1
        start_token = tokenized_data.char_to_token(start_char)
        end_token = tokenized_data.char_to_token(end_char - 1)

        if start_token is None:
            start_token = 512
        if end_token is None:
            end_token = 512

        start_position = start_token
        end_position = end_token
    return {'input_ids': tokenized_data['input_ids'],
            'attention_mask': tokenized_data['attention_mask'],
            'start_positions':start_position,
            'end_positions':end_position}
    

In [18]:
from datasets import Dataset
train_ds = Dataset.from_pandas(train_df_to_ds)
eval_ds = Dataset.from_pandas(val_df_to_ds)
test_ds = Dataset.from_pandas(test_df_to_ds)
flat_train = train_ds.flatten()
flat_val = eval_ds.flatten()
flat_test = test_ds.flatten()
#print(flat_train[0])
processed_train_data =flat_train.select(range(3000)).map(process_samples)
processed_val_data = flat_val.select(range(3000)).map(process_samples)
processed_test_data = flat_test.select(range(1000)).map(process_samples)
print(processed_train_data)


Map:   0%|          | 0/3000 [00:00<?, ? examples/s]

Map:   0%|          | 0/3000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Dataset({
    features: ['question', 'turn_id', 'bad_turn_x', 'source', 'id', 'context', 'span_start', 'span_end', 'text', 'answer', 'bad_turn_y', '__index_level_0__', 'input_ids', 'attention_mask', 'start_positions', 'end_positions'],
    num_rows: 3000
})


In [19]:
import torch.nn as nn
import torch.optim as optim

In [20]:
class Encoder(nn.Module):
    def __init__(self, input_dim, hidden_dim, n_layers, dropout):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.n_layers = n_layers
        self.embedding = nn.Embedding(input_dim, hidden_dim)
        self.rnn = nn.LSTM(hidden_dim, hidden_dim, n_layers, dropout=dropout)

    def forward(self, src):
        embedded = self.embedding(src)
        outputs, (hidden, cell) = self.rnn(embedded)
        return hidden, cell


In [21]:
# Define the decoder class
class Decoder(nn.Module):
    def __init__(self, output_dim, hidden_dim, n_layers, dropout):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.n_layers = n_layers
        self.embedding = nn.Embedding(output_dim, hidden_dim)
        self.rnn = nn.LSTM(hidden_dim, hidden_dim, n_layers, dropout=dropout)
        self.fc_out = nn.Linear(hidden_dim, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, input, hidden, cell):
        input = input.unsqueeze(0)
        embedded = self.dropout(self.embedding(input))
        output, (hidden, cell) = self.rnn(embedded, (hidden, cell))
        prediction = self.fc_out(output.squeeze(0))
        return prediction, hidden, cell

In [22]:
# Define the seq2seq model class
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        batch_size = trg.shape[1]
        max_len = trg.shape[0]
        trg_vocab_size = self.decoder.fc_out.out_features
        outputs = torch.zeros(max_len, batch_size, trg_vocab_size).to(self.device)
        hidden, cell = self.encoder(src)
        input = trg[0, :]
        for t in range(1, max_len):
            output, hidden, cell = self.decoder(input, hidden, cell)
            outputs[t] = output
            teacher_force = torch.rand(1).item() < teacher_forcing_ratio
            top1 = output.argmax(1)
            input = trg[t] if teacher_force and t < max_len else top1
        return outputs

In [37]:
from transformers import DataCollatorForSeq2Seq

input_dim = tokenizer.vocab_size
output_dim = tokenizer.vocab_size

hidden_dim = 256

n_layers = 2
dropout = 0.5
learning_rate=1e-3
batch_size = 8

encoder = Encoder(input_dim,hidden_dim,n_layers,dropout).to(device)
decoder = Decoder(output_dim, hidden_dim,n_layers,dropout).to(device)
model = Seq2Seq(encoder,decoder,device).to(device)

criterion = nn.CrossEntropyLoss(ignore_index=tokenizer.pad_token_id)
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

# Pass the model instance (not the Seq2Seq class) to the collator
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
train_iterator = torch.utils.data.DataLoader(processed_train_data, 
                                          batch_size=batch_size, 
                                          collate_fn=data_collator)
val_iterator = torch.utils.data.DataLoader(processed_val_data,batch_size=batch_size,collate_fn=data_collator)


# train_iterator, valid_iterator = torchtext.data.BucketIterator.splits((processed_train_data,processed_val_data),
#                                                        batch_size=batch_size,
#                                                        sort_within_batch=True,
#                                                        sort_key=lambda x: len(x['context']),
#                                                        device=device
#                                                        )



In [None]:
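# Training function

def train(model, iterator, optimizer, criterion, clip):
    model.train()
    epoch_loss = 0
    for batch in iterator:
        # DataLoader batches are dicts of [batch, seq_len] tensors, while the LSTMs
        # expect [seq_len, batch]. This assumes each batch carries tokenized
        # 'input_ids' and 'labels' (answer token ids padded with pad_token_id);
        # 'labels' is not produced by process_samples and must be added separately.
        src = batch["input_ids"].t().to(model.device)
        trg = batch["labels"].t().to(model.device)
        optimizer.zero_grad()
        output = model(src, trg)
        output_dim = output.shape[-1]
        # Skip the first position, which only holds the start token
        output = output[1:].reshape(-1, output_dim)
        trg = trg[1:].reshape(-1)
        loss = criterion(output, trg)
        loss.backward()
        # Clip gradients to limit exploding gradients in the LSTMs
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        epoch_loss += loss.item()
    return epoch_loss / len(iterator)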
    
