## Notebook for the Encoder-Decoder style architecture

- The main idea is to use a single encoder and two decoders.
- The encoder and the first decoder are trained on the reconstruction task.
- The second decoder, in conjunction with the same encoder, is then used for the task of QA generation.
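
The two-stage idea in the list above can be sketched in plain PyTorch. This is only an illustration under stated assumptions: `ToyEncoder` and `ToyDecoder` are hypothetical stand-ins for the pretrained models loaded later in the notebook, the reconstruction target is assumed to be the input itself, and freezing the encoder during the second stage is an assumption the notes above do not confirm.

```python
import torch
from torch import nn

class ToyEncoder(nn.Module):
    """Hypothetical stand-in for the pretrained encoder: token ids -> one vector per sequence."""
    def __init__(self, vocab_size: int, hidden: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        return self.embed(input_ids).mean(dim=1)  # shape: (batch, hidden)

class ToyDecoder(nn.Module):
    """Hypothetical stand-in for a decoder: sequence vector -> logits per output position."""
    def __init__(self, vocab_size: int, hidden: int, out_len: int):
        super().__init__()
        self.out_len, self.vocab_size = out_len, vocab_size
        self.proj = nn.Linear(hidden, out_len * vocab_size)

    def forward(self, encoded: torch.Tensor) -> torch.Tensor:
        return self.proj(encoded).view(-1, self.out_len, self.vocab_size)

torch.manual_seed(0)
VOCAB, HIDDEN, SEQ_LEN = 50, 16, 8  # toy sizes, chosen arbitrarily

encoder = ToyEncoder(VOCAB, HIDDEN)                       # the single shared encoder
reconstruction_decoder = ToyDecoder(VOCAB, HIDDEN, SEQ_LEN)
qa_decoder = ToyDecoder(VOCAB, HIDDEN, SEQ_LEN)
loss_fn = nn.CrossEntropyLoss()

source_ids = torch.randint(0, VOCAB, (4, SEQ_LEN))  # fake input token ids
qa_ids = torch.randint(0, VOCAB, (4, SEQ_LEN))      # fake QA target token ids

# Stage 1: train encoder + first decoder to reconstruct the input tokens.
stage1_params = list(encoder.parameters()) + list(reconstruction_decoder.parameters())
stage1_opt = torch.optim.Adam(stage1_params, lr=1e-2)
logits = reconstruction_decoder(encoder(source_ids))
loss = loss_fn(logits.reshape(-1, VOCAB), source_ids.reshape(-1))
stage1_opt.zero_grad()
loss.backward()
stage1_opt.step()

# Stage 2: reuse the same encoder with the second decoder for QA generation.
# Freezing the encoder here is an assumption, not something the notebook states.
for param in encoder.parameters():
    param.requires_grad_(False)
stage2_opt = torch.optim.Adam(qa_decoder.parameters(), lr=1e-2)
qa_logits = qa_decoder(encoder(source_ids))
qa_loss = loss_fn(qa_logits.reshape(-1, VOCAB), qa_ids.reshape(-1))
stage2_opt.zero_grad()
qa_loss.backward()
stage2_opt.step()
```

The point of the sketch is that both decoders receive the output of one `encoder` object, so whatever the encoder learns during reconstruction is available to the QA decoder.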

In [3]:
from transformers import EncoderDecoderConfig, EncoderDecoderModel, AutoTokenizer
from torch.utils.data import Dataset, DataLoader

In [4]:
from config import EncoderConfig, ReconstructionDecoderConfig, QAGenerationDecoderConfig

reconstructionModel = EncoderDecoderModel.from_encoder_decoder_pretrained(EncoderConfig.Name, ReconstructionDecoderConfig.Name)


Some weights of GPT2LMHeadModel were not initialized from the model checkpoint at gpt2 and are newly initialized: ['h.0.crossattention.c_attn.bias', 'h.0.crossattention.c_attn.weight', 'h.0.crossattention.c_proj.bias', 'h.0.crossattention.c_proj.weight', 'h.0.crossattention.q_attn.bias', 'h.0.crossattention.q_attn.weight', 'h.0.ln_cross_attn.bias', 'h.0.ln_cross_attn.weight', 'h.1.crossattention.c_attn.bias', 'h.1.crossattention.c_attn.weight', 'h.1.crossattention.c_proj.bias', 'h.1.crossattention.c_proj.weight', 'h.1.crossattention.q_attn.bias', 'h.1.crossattention.q_attn.weight', 'h.1.ln_cross_attn.bias', 'h.1.ln_cross_attn.weight', 'h.10.crossattention.c_attn.bias', 'h.10.crossattention.c_attn.weight', 'h.10.crossattention.c_proj.bias', 'h.10.crossattention.c_proj.weight', 'h.10.crossattention.q_attn.bias', 'h.10.crossattention.q_attn.weight', 'h.10.ln_cross_attn.bias', 'h.10.ln_cross_attn.weight', 'h.11.crossattention.c_attn.bias', 'h.11.crossattention.c_attn.weight', 'h.11.crossat

In [5]:
tokenizer = AutoTokenizer.from_pretrained(EncoderConfig.Name)

In [6]:
import pickle

In [7]:
with open ('./data/paper_1.pkl', 'rb') as f:
    paper = pickle.load(f)

In [8]:
paper.relevantContent

'To guarantee a fair comparison with MultiQA, we have trained all the agents for extractive datasets using the same architecture as MultiQA, span-BERT, a BERT model pretrained for span extraction tasks that clearly outperforms BERT on the MRQA 2019 shared task (Joshi et al., 2020). More details on the implementation are provided in Appendix A.3. For the remaining datasets, we use agents that are publicly available on HuggingFace or Github with a performance close to the current state of the art. A summary of them is provided in Appendix A.2. We compare our approach with three types of models: i) multi-agent systems, ii) multi-dataset models, and iii) expert agents. The first family is represented by our main baseline, TWEAC, a model that maps questions to topics (or types of questions) to identify agents trained on that type of data (Geigle et al., 2021) and the simple max-voting ensemble. The second family of models is composed of Mul-tiQA (Talmor and Berant, 2019) and UnifiedQA (Khas

In [9]:
from dataHandler import getTrainData

In [10]:
dataset = getTrainData()

Data Prepared for 1001 papers...


In [11]:
sections, abstracts, targetQA = dataset

In [12]:
import pandas

In [13]:
## Creating a dataframe of the data
df = pandas.DataFrame(
    {
        "sections": sections,
        "abstracts": abstracts,
        "targetQA": targetQA
    }
)

In [14]:
print(df['abstracts'])

0       Despite serving as the foundation models for a...
2       Large Language Models (LLMs) have shown impres...
3       State-of-the-art techniques common to low reso...
4       Previous works show that Pre-trained Language ...
                              ...                        
996     This work explores the problem of generating t...
997     Aligning large language models (LLMs) to human...
998     This paper analyses two hitherto unstudied sit...
999     Subjective bias is ubiquitous on news sites, s...
1000    Mixed-initiative dialogue tasks involve repeat...
Name: abstracts, Length: 1001, dtype: object


In [None]:
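def tokenizeData( dataFrame : pandas.DataFrame, tokenizer : AutoTokenizer, maxLength : int) :
    ## Tokenize each text column into fixed-length tensors:
    ## every sequence is padded or truncated to exactly maxLength tokens.

    ## Paper sections (encoder input).
    inputs = tokenizer.batch_encode_plus(
        dataFrame['sections'].tolist(),
        max_length = maxLength,
        padding = "max_length",
        truncation = True,
        return_tensors = "pt"
    )

    ## Paper abstracts.
    summaries = tokenizer.batch_encode_plus(
        dataFrame['abstracts'].tolist(),
        max_length = maxLength,
        padding = "max_length",
        truncation = True,
        return_tensors = "pt"
    )

    ## Target question-answer text.
    expectedQnA = tokenizer.batch_encode_plus(
        dataFrame['targetQA'].tolist(),
        max_length = maxLength,
        padding = "max_length",
        truncation = True,
        return_tensors = "pt"
    )

    ## Return the encodings so the caller can use them.
    return inputs, summaries, expectedQnA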
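
The effect of `padding = "max_length"` together with `truncation = True` can be illustrated with a small pure-Python sketch. This is a toy model of the behaviour, not the tokenizer's implementation: `pad_or_truncate` and `pad_id` are hypothetical names, right-side padding is assumed, and real tokenizers additionally take care to keep special tokens when truncating.

```python
from typing import Dict, List

def pad_or_truncate(token_ids: List[int], max_length: int, pad_id: int = 0) -> Dict[str, List[int]]:
    """Force a token id list to exactly max_length entries and build its attention mask."""
    kept = token_ids[:max_length]        # truncation: drop everything past max_length
    n_pad = max_length - len(kept)       # padding: number of filler tokens still needed
    return {
        "input_ids": kept + [pad_id] * n_pad,
        "attention_mask": [1] * len(kept) + [0] * n_pad,  # 1 = real token, 0 = padding
    }

short = pad_or_truncate([101, 7, 8, 102], max_length=6)    # shorter than max_length: padded
long = pad_or_truncate(list(range(1, 11)), max_length=6)   # longer than max_length: truncated
print(short)
print(long)
```

Because every row ends up with the same length, the encodings can be stacked into a single tensor, which is what `return_tensors = "pt"` relies on.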

In [15]:
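## Creating a dataset from the dataframe.

class DatasetForReconstruction(Dataset):
    def __init__(self, df):
        self.df = df

    def __len__(self):
        return len(self.df)

    def __getitem__(self, index):
        ## Return a plain dict rather than a pandas Series so that
        ## DataLoader's default collate function can batch the rows.
        return self.df.iloc[index].to_dict()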
    
    

SyntaxError: expected ':' (616831681.py, line 3)