Question-answering (QA) systems are computer programs that are designed to understand natural language questions and provide accurate and relevant answers. They are a type of natural language processing (NLP) system that combine techniques from information retrieval, natural language understanding, and knowledge representation to understand a user’s question and provide a response.

There are several different types of QA systems, but they can generally be grouped into two categories: rule-based systems and machine learning-based systems.

- Rule-based QA systems use a set of predefined rules and heuristics to understand the user’s question and generate a response. These systems are typically based on a knowledge base that contains a collection of facts and rules.
- Machine learning-based QA systems, on the other hand, use machine learning algorithms to learn from a large dataset of questions and answers. These systems are trained to understand natural language questions and generate relevant responses. They typically rely on deep learning algorithms such as recurrent neural networks (RNNs) and transformer architectures.

One of the key innovations of T5 is its “prefix” approach to transfer learning, where the model is fine-tuned for a specific task by training it with a prefix added to the input text. For example, to fine-tune T5 for a text classification task, the input text would be prefixed with the task name and a separator, such as “classify: This is the input text.”.

T5 has been shown to achieve state-of-the-art results on a wide range of NLP tasks, and it’s considered a highly sophisticated and powerful NLP model, showing a high level of versatility, fine-tuning capability, and an efficient way to transfer knowledge.
- T5ForConditionalGeneration is a variant of T5 model that is specifically designed for conditional generation tasks such as text summarization, question-answering, and language translation.
- T5forSeq2SeqLM is a standard variant of T5 model that is trained on sequence-to-sequence tasks such as machine translation, text summarization, and dialogue generation.

In [3]:
!pip install transformers datasets evaluate accelerate rouge_score -q

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.9/7.9 MB[0m [31m22.3 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m493.7/493.7 kB[0m [31m29.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.1/84.1 kB[0m [31m10.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m261.4/261.4 kB[0m [31m29.7 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m302.0/302.0 kB[0m [31m34.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.8/3.8 MB[0m [31m58.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m28.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m115.3/115.3 kB[0m [31m14.0 M

## Import Library

In [28]:
import torch
import json
from tqdm import tqdm
import torch.nn as nn
from torch.optim import Adam
import evaluate  # Bleu
from torch.utils.data import Dataset, DataLoader, RandomSampler
import pandas as pd
import numpy as np
import transformers
from sklearn.model_selection import train_test_split

from transformers import T5ForConditionalGeneration, T5TokenizerFast

import warnings
warnings.filterwarnings("ignore")

In [5]:
# Download data
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json

--2023-11-05 20:58:49--  https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.108.153, 185.199.111.153, 185.199.109.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 30288272 (29M) [application/json]
Saving to: ‘train-v1.1.json’


2023-11-05 20:58:50 (253 MB/s) - ‘train-v1.1.json’ saved [30288272/30288272]

--2023-11-05 20:58:50--  https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.108.153, 185.199.111.153, 185.199.109.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4854279 (4.6M) [application/json]
Saving to: ‘dev-v1.1.json’


2023-11-05 20:58:50 (82.0 MB/s) - ‘dev-v1.1.json’ saved [4854279/4854279]



In [6]:
#set variable & params
MODEL_CHECKPOINT = "t5-base"
TOKENIZER = T5TokenizerFast.from_pretrained(MODEL_CHECKPOINT)
MODEL = T5ForConditionalGeneration.from_pretrained(MODEL_CHECKPOINT, return_dict=True)

TRAIN_PATH = "/content/train-v1.1.json"
TEST_PATH = "/content/dev-v1.1.json"

OPTIMIZER = Adam(MODEL.parameters(), lr=1e-5)
Q_LEN = 256
T_LEN = 32
BATCH_SIZE = 8
PREFIX = "question: "
END_PREFIX = " </s>"
DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

Downloading (…)ve/main/spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/892M [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

## Load Data

In [7]:
with open(TRAIN_PATH) as f:
    train = json.load(f)

In [8]:
def prepare_data(data):
    articles = []

    for article in data["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                question = qa["question"]
                answer = qa["answers"][0]["text"]

                inputs = {"context": paragraph["context"], "question": question, "answer": answer}


                articles.append(inputs)

    return articles

In [9]:
train = prepare_data(train)

In [10]:
train[0]

{'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'answer': 'Saint Bernadette Soubirous'}

In [11]:
# Create a Dataframe
data = pd.DataFrame(train)
data = data.loc[:5000]   # untuk mempercepat training
data.head()

Unnamed: 0,context,question,answer
0,"Architecturally, the school has a Catholic cha...",To whom did the Virgin Mary allegedly appear i...,Saint Bernadette Soubirous
1,"Architecturally, the school has a Catholic cha...",What is in front of the Notre Dame Main Building?,a copper statue of Christ
2,"Architecturally, the school has a Catholic cha...",The Basilica of the Sacred heart at Notre Dame...,the Main Building
3,"Architecturally, the school has a Catholic cha...",What is the Grotto at Notre Dame?,a Marian place of prayer and reflection
4,"Architecturally, the school has a Catholic cha...",What sits on top of the Main Building at Notre...,a golden statue of the Virgin Mary


## Build Dataset

In [12]:
class QADataset(Dataset):
    def __init__(self, tokenizer, dataframe, q_len, t_len):
        self.tokenizer = tokenizer
        self.q_len = q_len
        self.t_len = t_len
        self.data = dataframe
        self.questions = self.data["question"]
        self.context = self.data["context"]
        self.answer = self.data['answer']

    def __len__(self):
        return len(self.questions)

    def __getitem__(self, idx):
        question = self.questions[idx]
        context = self.context[idx]
        answer = self.answer[idx]

        inputs = PREFIX + question + END_PREFIX
        target = answer + END_PREFIX

        question_tokenized = self.tokenizer(inputs, context, max_length=self.q_len, padding="max_length",
                                                    truncation=True, pad_to_max_length=True, add_special_tokens=True)
        answer_tokenized = self.tokenizer(target, max_length=self.t_len, padding="max_length",
                                          truncation=True, pad_to_max_length=True, add_special_tokens=True)

        labels = torch.tensor(answer_tokenized["input_ids"], dtype=torch.long)
        labels[labels == 0] = -100

        return {
            "input_ids": torch.tensor(question_tokenized["input_ids"], dtype=torch.long),
            "attention_mask": torch.tensor(question_tokenized["attention_mask"], dtype=torch.long),
            "labels": labels,
            "decoder_attention_mask": torch.tensor(answer_tokenized["attention_mask"], dtype=torch.long)
        }

In [13]:
# Dataloader

train_data, val_data = train_test_split(data, test_size=0.2, random_state=42)

train_sampler = RandomSampler(train_data.index)
val_sampler = RandomSampler(val_data.index)

qa_dataset = QADataset(TOKENIZER, data, Q_LEN, T_LEN)

train_loader = DataLoader(qa_dataset, batch_size=BATCH_SIZE, sampler=train_sampler)
val_loader = DataLoader(qa_dataset, batch_size=BATCH_SIZE, sampler=val_sampler)

In [14]:
MODEL.to(DEVICE)

T5ForConditionalGeneration(
  (shared): Embedding(32128, 768)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 768)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=768, out_features=768, bias=False)
              (k): Linear(in_features=768, out_features=768, bias=False)
              (v): Linear(in_features=768, out_features=768, bias=False)
              (o): Linear(in_features=768, out_features=768, bias=False)
              (relative_attention_bias): Embedding(32, 12)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseActDense(
              (wi): Linear(in_features=768, out_features=3072, bias=False)
              (wo): Linear(in_features=3072, out_features=768, bias=False)
              (dropout): Dro

## Training Model

In [15]:
train_loss = 0
val_loss = 0
train_batch_count = 0
val_batch_count = 0

for epoch in range(2):
    MODEL.train()
    for batch in tqdm(train_loader, desc="Training batches"):
        input_ids = batch["input_ids"].to(DEVICE)
        attention_mask = batch["attention_mask"].to(DEVICE)
        labels = batch["labels"].to(DEVICE)
        decoder_attention_mask = batch["decoder_attention_mask"].to(DEVICE)

        outputs = MODEL(
                          input_ids=input_ids,
                          attention_mask=attention_mask,
                          labels=labels,
                          decoder_attention_mask=decoder_attention_mask
                        )

        OPTIMIZER.zero_grad()
        outputs.loss.backward()
        OPTIMIZER.step()
        train_loss += outputs.loss.item()
        train_batch_count += 1

    #Evaluation
    MODEL.eval()
    for batch in tqdm(val_loader, desc="Validation batches"):
        input_ids = batch["input_ids"].to(DEVICE)
        attention_mask = batch["attention_mask"].to(DEVICE)
        labels = batch["labels"].to(DEVICE)
        decoder_attention_mask = batch["decoder_attention_mask"].to(DEVICE)

        outputs = MODEL(
                          input_ids=input_ids,
                          attention_mask=attention_mask,
                          labels=labels,
                          decoder_attention_mask=decoder_attention_mask
                        )

        OPTIMIZER.zero_grad()
        outputs.loss.backward()
        OPTIMIZER.step()
        val_loss += outputs.loss.item()
        val_batch_count += 1

    print(f"{epoch+1}/{2} -> Train loss: {train_loss / train_batch_count}\tValidation loss: {val_loss/val_batch_count}")


Training batches: 100%|██████████| 500/500 [04:37<00:00,  1.80it/s]
Validation batches: 100%|██████████| 126/126 [01:06<00:00,  1.89it/s]


1/2 -> Train loss: 0.36532203894853593	Validation loss: 0.25177121266633984


Training batches: 100%|██████████| 500/500 [04:31<00:00,  1.84it/s]
Validation batches: 100%|██████████| 126/126 [01:06<00:00,  1.89it/s]

2/2 -> Train loss: 0.31659629949275403	Validation loss: 0.214991165064497





In [16]:
MODEL.save_pretrained("qa_model")
TOKENIZER.save_pretrained("qa_tokenizer")


('qa_tokenizer/tokenizer_config.json',
 'qa_tokenizer/special_tokens_map.json',
 'qa_tokenizer/tokenizer.json')

## Predict Answer

In [18]:
def predict_answer(context, question, ref_answer=None):
    inputs = TOKENIZER(PREFIX + question, context, max_length=Q_LEN, padding="max_length", truncation=True, add_special_tokens=True)

    input_ids = torch.tensor(inputs["input_ids"], dtype=torch.long).to(DEVICE).unsqueeze(0)
    attention_mask = torch.tensor(inputs["attention_mask"], dtype=torch.long).to(DEVICE).unsqueeze(0)

    outputs = MODEL.generate(input_ids=input_ids, attention_mask=attention_mask)

    predicted_answer = TOKENIZER.decode(outputs.flatten(), skip_special_tokens=True)

    if ref_answer:
        # Load the Bleu metric
        bleu = evaluate.load("google_bleu")
        score = bleu.compute(predictions=[predicted_answer],
                            references=[ref_answer])

        print("Context: \n", context)
        print("\n")
        print("Question: \n", question)
        return {
            "Reference Answer: ": ref_answer,
            "Predicted Answer: ": predicted_answer,
            "BLEU Score: ": score
        }
    else:
        return predicted_answer

In [22]:
idx = 1000

context = data.iloc[idx]["context"]
question = data.iloc[idx]["question"]
answer = data.iloc[idx]["answer"]

predict_answer(context, question, answer)

Context: 
 After Hurricane Katrina in 2005, Beyoncé and Rowland founded the Survivor Foundation to provide transitional housing for victims in the Houston area, to which Beyoncé contributed an initial $250,000. The foundation has since expanded to work with other charities in the city, and also provided relief following Hurricane Ike three years later.


Question: 
 How much did Beyonce initially contribute to the foundation?


{'Reference Answer: ': '$250,000',
 'Predicted Answer: ': '$250,000',
 'BLEU Score: ': {'google_bleu': 1.0}}

In [23]:
context = """
virat kohli on tuesday slammed his 73rd international hundred by reaching the three-figure mark
off 80 delivers against sri lanka in first ODI in guwahati. The 43-year-old became faster batter
to slam 20 ODI hundreds on indian soil, taking 99 inings. He also overtook sachin tendulkar
to became fastest batter to smash 45 Odi hundreds, taking 257 innings.
"""

In [24]:
q1 = PREFIX + "How many innings does virat kohli took to reach 45 ODI Hundreds?"

predict_answer(context, q1)

'257'

In [26]:
q2 = PREFIX + "Which team did virat kohli score his 45th ODI Century Against?"
predict_answer(context, q2)

'sri lanka'