## Notebook for the Encoder-Decoder style architecture

- The main idea is to use a single encoder and two decoders.
- The encoder and first decoder is trained on the task of reconstruction.
- Then the second encoder, in conjunction with the first encoder is used for the task of QA Generation.

In [1]:
from transformers import AutoModel, AutoTokenizer
import torch
from torch.utils.data import Dataset, DataLoader
from torch.optim import Adam
from alive_progress import alive_bar

In [2]:
from config import EncoderConfig, ReconstructionDecoderConfig, QAGenerationDecoderConfig

device = "cuda" if torch.cuda.is_available() else "cpu"

encoder = AutoModel.from_pretrained(EncoderConfig.Name).to(device)
reconstructionDecoder = AutoModel.from_pretrained(ReconstructionDecoderConfig.Name).to(device)
qAGenerationDecoder = AutoModel.from_pretrained(QAGenerationDecoderConfig.Name).to(device)



In [3]:
tokenizer = AutoTokenizer.from_pretrained(EncoderConfig.Name)

In [4]:
import pickle

In [5]:
with open ('./data/paper_1.pkl', 'rb') as f:
    paper = pickle.load(f)

In [6]:
paper.relevantContent

'To guarantee a fair comparison with MultiQA, we have trained all the agents for extractive datasets using the same architecture as MultiQA, span-BERT, a BERT model pretrained for span extraction tasks that clearly outperforms BERT on the MRQA 2019 shared task (Joshi et al., 2020). More details on the implementation are provided in Appendix A.3. For the remaining datasets, we use agents that are publicly available on HuggingFace or Github with a performance close to the current state of the art. A summary of them is provided in Appendix A.2. We compare our approach with three types of models: i) multi-agent systems, ii) multi-dataset models, and iii) expert agents. The first family is represented by our main baseline, TWEAC, a model that maps questions to topics (or types of questions) to identify agents trained on that type of data (Geigle et al., 2021) and the simple max-voting ensemble. The second family of models is composed of Mul-tiQA (Talmor and Berant, 2019) and UnifiedQA (Khas

In [7]:
from dataHandler import getTrainData

In [8]:
dataset = getTrainData()

Data Prepared for 1001 papers...


In [9]:
sections, abstracts, targetQA = dataset

targetQA = [ ', '.join(x) for x in targetQA ]

In [10]:
import pandas

In [11]:
## Creating a dataframe of the data
df = pandas.DataFrame(
    {
        "sections": sections,
        "abstracts": abstracts,
        "targetQA": targetQA
    }
)

In [12]:
print(df['abstracts'])

0       Despite serving as the foundation models for a...
2       Large Language Models (LLMs) have shown impres...
3       State-of-the-art techniques common to low reso...
4       Previous works show that Pre-trained Language ...
                              ...                        
996     This work explores the problem of generating t...
997     Aligning large language models (LLMs) to human...
998     This paper analyses two hitherto unstudied sit...
999     Subjective bias is ubiquitous on news sites, s...
1000    Mixed-initiative dialogue tasks involve repeat...
Name: abstracts, Length: 1001, dtype: object


In [13]:
def tokenizeData( dataFrame : pandas.DataFrame, tokenizer : AutoTokenizer, maxLength : int) :
    inputs = tokenizer.batch_encode_plus(
        dataFrame['sections'].tolist(),
        max_length = maxLength,
        padding = "max_length",
        truncation = True,
        return_tensors = "pt"
    )

    summaries = tokenizer.batch_encode_plus(
        dataFrame['abstracts'].tolist(),
        max_length = maxLength,
        padding = "max_length",
        truncation = True,
        return_tensors = "pt"
    )

    expectedQnA = tokenizer.batch_encode_plus(
        dataFrame['targetQA'].tolist(),
        max_length = maxLength,
        padding = "max_length",
        truncation = True,
        return_tensors = "pt"
    )

    tokenizedDf = {
        "input_ids": inputs["input_ids"],
        "attention_mask": inputs["attention_mask"],
        "decoder_input_ids": inputs["input_ids"],
        "decoder_attention_mask": inputs["attention_mask"],
        "labels": expectedQnA["input_ids"]
    }

    return tokenizedDf

In [14]:
df = tokenizeData(df, tokenizer, 512)

In [15]:
print(df)

{'input_ids': tensor([[ 101, 2256, 3818,  ..., 2057, 2224,  102],
        [ 101, 2241, 2006,  ..., 2024, 2625,  102],
        [ 101, 3160, 1024,  ..., 2915, 2742,  102],
        ...,
        [ 101, 1999, 2233,  ..., 8906, 1012,  102],
        [ 101, 2057, 5136,  ...,    0,    0,    0],
        [ 101, 1999, 3816,  ..., 1996, 1052,  102]]), 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        ...,
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 1, 1, 1]]), 'decoder_input_ids': tensor([[ 101, 2256, 3818,  ..., 2057, 2224,  102],
        [ 101, 2241, 2006,  ..., 2024, 2625,  102],
        [ 101, 3160, 1024,  ..., 2915, 2742,  102],
        ...,
        [ 101, 1999, 2233,  ..., 8906, 1012,  102],
        [ 101, 2057, 5136,  ...,    0,    0,    0],
        [ 101, 1999, 3816,  ..., 1996, 1052,  102]]), 'decoder_attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1],
        [1, 1,

In [16]:
## Creating a dataset from the dataframe.

class DatasetForReconstruction(Dataset):
    def __init__(self, df):
        self.df = df

    def __len__(self):
        return len(self.df)
    
    def __getitem__(self, index):
        return {
            'input_ids' : torch.tensor(self.input_ids[index], dtype=torch.long),
            # 'attention_mask' : torch.tensor(self.attention_mask[index], dtype=torch.long),
            'decoder_input_ids' : torch.tensor(self.decoder_input_ids[index], dtype=torch.long),
            # 'decoder_attention_mask' : torch.tensor(self.decoder_attention_mask[index], dtype=torch.long),
            'labels' : torch.tensor(self.labels[index], dtype=torch.long)
        }

In [17]:
class Model :
    def __init__(self, encoder, reconstructionDecoder, qAGenerationDecoder) :
        self.encoder = encoder
        self.reconstructionDecoder = reconstructionDecoder
        self.qAGenerationDecoder = qAGenerationDecoder

In [22]:
def trainModel(model, dataLoader, epochs = 10, learningRate = 1e-4 ):

    optimizer = Adam(model.encoder.parameters(), lr = learningRate)


    for epoch in range(epochs):
        totalLoss = 0
        print(f"Epoch {epoch+1}/{epochs}")

        for batch in alive_bar(dataLoader, force_tty = True):
            input_ids = batch['input_ids'].to(device)
            output_ids = batch['decoder_input_ids'].to(device)
            labels = batch['labels'].to(device)

            model.zero_grad()

            encoder_outputs, encoder_hidden = model.encoder(input_ids = input_ids)

            reconstructionLoss = model.reconstructionDecoder(
                input_ids = output_ids,
                encoder_hidden = encoder_hidden,
                encoder_outputs = encoder_outputs
            )

            qnALoss = model.qaGenerationDecoder(
                input_ids = labels,
                encoder_hidden = encoder_hidden,
                encoder_outputs = encoder_outputs
            )

            loss = reconstructionLoss + qnALoss
            totalLoss += loss

            loss.backward()
            optimizer.step()

            alive_bar.text(f"Loss : {loss}")

In [23]:
dataset = DatasetForReconstruction(df)

In [24]:
## Training.
dataLoader = DataLoader(dataset, batch_size = 8, shuffle = True)
model = Model(encoder = encoder, reconstructionDecoder = reconstructionDecoder, qAGenerationDecoder = qAGenerationDecoder)

In [25]:
trainModel(model, dataLoader, epochs = 10, learningRate = 1e-4)

AttributeError: 'Model' object has no attribute 'parameters'