<a href="https://colab.research.google.com/github/ahmedovich19/Machine-Learning-Projects/blob/master/bioasq_Question_answering_with_bert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
# Transformers installation
! pip install transformers


In [3]:
import json
import pandas as pd 
import collections
import numpy as np
import torch
from pathlib import Path
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
from termcolor import colored
import textwrap
from transformers import (
    XLNetTokenizerFast,
    XLNetForQuestionAnswering,
    DistilBertTokenizerFast,
    DistilBertForQuestionAnswering,
    BertTokenizerFast,
    BertForQuestionAnswering,
    TrainingArguments,
    Trainer,
    default_data_collator
)


In [4]:
!gdown --id 19ft5q44W4SuptJgTwR84xZjsHg1jvjSZ


Downloading...
From: https://drive.google.com/uc?id=19ft5q44W4SuptJgTwR84xZjsHg1jvjSZ
To: /content/QA.zip
5.48MB [00:00, 48.2MB/s]


In [5]:
!unzip -q QA.zip

In [6]:
def extract_questions_and_answers(factoid_path: Path) -> pd.DataFrame:
  """Flatten a SQuAD-style BioASQ factoid file into one row per answer."""
  with factoid_path.open() as json_file:
    data = json.load(json_file)

  paragraphs = data['data'][0]['paragraphs']

  data_rows = []

  for paragraph in paragraphs:
    context = paragraph['context']
    for question_and_answers in paragraph['qas']:
      # Use distinct names so the loop variable and the built-in `id` are not shadowed
      question = question_and_answers['question']
      question_id = question_and_answers['id']
      answers = question_and_answers['answers']
      for answer in answers:
        answer_text = answer['text']
        answer_start = answer['answer_start']
        # Exclusive end: one past the last character of the answer
        answer_end = answer_start + len(answer_text)

        data_rows.append({
            'id': question_id,
            'question': question,
            'context': context,
            'answer_text': answer_text,
            'answer_start': answer_start,
            'answer_end': answer_end
        })
  return pd.DataFrame(data_rows)
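
The function above assumes each file follows a SQuAD-style layout (`data` → `paragraphs` → `qas` → `answers`). As a minimal sketch, here is the same flattening applied to a hand-made record. The sample text and id are invented for illustration and are not taken from BioASQ:

```python
# Hypothetical SQuAD-style record; the text and id are made up for illustration.
sample = {
    "data": [{
        "paragraphs": [{
            "context": "Aspirin inhibits cyclooxygenase.",
            "qas": [{
                "id": "toy_001",
                "question": "Which enzyme does aspirin inhibit?",
                "answers": [{"text": "cyclooxygenase", "answer_start": 17}],
            }],
        }]
    }]
}

rows = []
for paragraph in sample["data"][0]["paragraphs"]:
    for qa in paragraph["qas"]:
        for answer in qa["answers"]:
            start = answer["answer_start"]
            rows.append({
                "id": qa["id"],
                "question": qa["question"],
                "context": paragraph["context"],
                "answer_text": answer["text"],
                "answer_start": start,
                "answer_end": start + len(answer["text"]),  # exclusive end
            })

row = rows[0]
# Slicing the context with the two offsets should give back the answer text.
recovered = row["context"][row["answer_start"]:row["answer_end"]]
print(recovered)
```

Slicing the context with `answer_start` and `answer_end` is a cheap way to spot rows whose offsets do not line up with the answer text.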

In [7]:
factoid_paths = sorted(list(Path("BioASQ/").glob("BioASQ-train-*")))
factoid_paths

[PosixPath('BioASQ/BioASQ-train-factoid-4b.json'),
 PosixPath('BioASQ/BioASQ-train-factoid-5b.json'),
 PosixPath('BioASQ/BioASQ-train-factoid-6b.json')]

In [8]:
dfs = []
for factoid_path in factoid_paths:
  dfs.append(extract_questions_and_answers(factoid_path))
df = pd.concat(dfs)

In [9]:
df

| | id | question | context | answer_text | answer_start | answer_end |
|---|---|---|---|---|---|---|
| 0 | 52bf208003868f1b06000019_002 | What is the inheritance pattern of Li–Fraumeni... | Balanced t(11;15)(q23;q15) in a TP53+/+ breast... | autosomal dominant | 213 | 231 |
| 1 | 52bf208003868f1b06000019_003 | What is the inheritance pattern of Li–Fraumeni... | Genetic modeling of Li-Fraumeni syndrome in ze... | autosomal dominant | 105 | 123 |
| 2 | 530cf4fe960c95ad0c00000b_001 | Which type of lung cancer is afatinib used for? | Clinical perspective of afatinib in non-small ... | EGFR-mutant NSCLC | 1203 | 1220 |
| 3 | 53148a07dae131f847000002_001 | Which hormone abnormalities are characteristic... | DOCA sensitive pendrin expression in kidney, h... | thyroid | 419 | 426 |
| 4 | 53148a07dae131f847000002_002 | Which hormone abnormalities are characteristic... | Clinical and molecular characteristics of Pend... | thyroid | 705 | 712 |
| ... | ... | ... | ... | ... | ... | ... |
| 4767 | 58dcb47c8acda34529000020_002 | What is the role of TAD protein domain? | Sequestration of p53 in the cytoplasm by adeno... | transactivation domain | 765 | 787 |
| 4768 | 58dcb47c8acda34529000020_003 | What is the role of TAD protein domain? | Leu628 of the KIX domain of CBP is a key resid... | transactivation domain | 139 | 161 |
| 4769 | 58dcb47c8acda34529000020_004 | What is the role of TAD protein domain? | Sequestration of p53 in the cytoplasm by adeno... | transactivation domain | 765 | 787 |
| 4770 | 58dcb47c8acda34529000020_005 | What is the role of TAD protein domain? | Essential roles of Da transactivation domains ... | transcription activation domain | 401 | 432 |


In [24]:
df = df[df['answer_start']>=0]

In [25]:
df.shape

(12984, 6)

In [26]:
train_df,val_df = train_test_split(df,test_size=0.05)

In [27]:
val_df.iloc[0]

id                                   56c1f00cef6e39474100003e_003
question                Aleglitazar is agonist of which receptor?
context         Effects of the dual peroxisome proliferator-ac...
answer_text        peroxisome proliferator-activated receptor-α/γ
answer_start                                                   20
answer_end                                                     66
Name: 3586, dtype: object

In [28]:
train_df.shape, val_df.shape

((12334, 6), (650, 6))
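
These sizes follow from the row count. As far as we can tell, for a fractional `test_size` scikit-learn rounds the held-out share up and gives the remainder to training. A quick check of the arithmetic under that assumption:

```python
import math

n_rows = 12984      # from df.shape above
test_size = 0.05

# Assumption: the held-out share is rounded up, the rest goes to training.
n_val = math.ceil(test_size * n_rows)   # 649.2 rounds up to 650
n_train = n_rows - n_val

print(n_train, n_val)
```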

In [29]:
train_df_new = train_df.to_dict('list')
val_df_new = val_df.to_dict('list')

In [30]:
train_df_new.keys()

dict_keys(['id', 'question', 'context', 'answer_text', 'answer_start', 'answer_end'])

# Fine-tuning with custom datasets

In [31]:
model_checkpoint = 'bert-base-uncased'

In [32]:
tokenizer = BertTokenizerFast.from_pretrained(model_checkpoint)

Now `train_df_new` and `val_df_new` hold the character start and end positions of each answer. Next,
let's tokenize our context/question pairs. 🤗 Tokenizers can accept parallel lists of sequences and encode them together
as sequence pairs.

In [33]:
max_length = 400 # The maximum length of a feature (question and context)

In [34]:
train_encodings = tokenizer(train_df_new['context'], train_df_new['question'], truncation=True, padding='max_length', max_length=max_length)
val_encodings = tokenizer(val_df_new['context'], val_df_new['question'], truncation=True, padding='max_length', max_length=max_length)

Next, we need to convert our character start/end positions to token start/end positions. When using 🤗 Fast Tokenizers,
we can use the built-in `BatchEncoding.char_to_token` method.

In [None]:
def add_token_positions(encodings, df):
    start_positions = []
    end_positions = []
    for i in range(len(df['answer_text'])):
        start_positions.append(encodings.char_to_token(i, df['answer_start'][i]))
        end_positions.append(encodings.char_to_token(i, df['answer_end'][i] - 1))
        # if None, the answer passage has been truncated
        if start_positions[-1] is None:
            start_positions[-1] = tokenizer.model_max_length
        if end_positions[-1] is None:
            end_positions[-1] = tokenizer.model_max_length
    encodings.update({'start_positions': start_positions, 'end_positions': end_positions})

add_token_positions(train_encodings, train_df_new)
add_token_positions(val_encodings, val_df_new)
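
To make the character-to-token mapping concrete, here is a library-free sketch. It assumes a toy whitespace tokenizer rather than BERT's WordPiece, and `toy_offsets` and `toy_char_to_token` are hypothetical helpers, not 🤗 APIs. The idea it illustrates is that each token covers a character span, and a character index resolves to the token whose span contains it, or to `None` when no token covers it (for example after truncation). The context sentence is made up for the example.

```python
import re

def toy_offsets(text, max_tokens=None):
    """Return (start, end) character spans of whitespace-separated tokens."""
    spans = [m.span() for m in re.finditer(r"\S+", text)]
    return spans[:max_tokens] if max_tokens is not None else spans

def toy_char_to_token(spans, char_index):
    """Return the index of the token covering char_index, or None."""
    for token_index, (start, end) in enumerate(spans):
        if start <= char_index < end:
            return token_index
    return None

# Made-up sentence, used only to exercise the helpers.
context = "Afatinib is used for EGFR-mutant NSCLC"
answer_text = "EGFR-mutant NSCLC"
answer_start = context.index(answer_text)
answer_end = answer_start + len(answer_text)   # exclusive end

spans = toy_offsets(context)
start_token = toy_char_to_token(spans, answer_start)
# Look up the last answer character, hence the -1 on the exclusive end.
end_token = toy_char_to_token(spans, answer_end - 1)

# If the context is truncated to 4 tokens, the answer is no longer covered.
truncated_spans = toy_offsets(context, max_tokens=4)
lost = toy_char_to_token(truncated_spans, answer_start)

print(start_token, end_token, lost)
```

The `- 1` mirrors the call in `add_token_positions`: `answer_end` points one past the answer, so the lookup has to use the last character that is actually part of it.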

Our data is ready. Let's just put it in a PyTorch/TensorFlow dataset so that we can easily use it for training. In
PyTorch, we define a custom `Dataset` class. In TensorFlow, we pass a tuple of `(inputs_dict, labels_dict)` to the
`from_tensor_slices` method.

In [None]:
import torch

class SquadDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}

    def __len__(self):
        return len(self.encodings.input_ids)

train_dataset = SquadDataset(train_encodings)
val_dataset = SquadDataset(val_encodings)
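
The `__getitem__` above picks the same index out of every field of the encodings. A torch-free sketch of that indexing, using plain lists and made-up values (`toy_encodings` and `get_example` are hypothetical names):

```python
# Toy encodings: one list per field, one entry per example (values are made up).
toy_encodings = {
    "input_ids": [[101, 7592, 102], [101, 2088, 102]],
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
    "start_positions": [1, 1],
    "end_positions": [1, 1],
}

def get_example(encodings, idx):
    """Select the idx-th entry of every field (tensor conversion omitted)."""
    return {key: values[idx] for key, values in encodings.items()}

example = get_example(toy_encodings, 1)
n_examples = len(toy_encodings["input_ids"])

print(example["input_ids"], n_examples)
```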

Now we can load `BertForQuestionAnswering` for training:

In [52]:
model = BertForQuestionAnswering.from_pretrained(model_checkpoint)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForQuestionAnswering: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased a

In [None]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)

In [None]:
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
# Shuffling is only needed for training; keep validation order fixed
val_loader = DataLoader(val_dataset, batch_size=16, shuffle=False)

In [None]:
from torch.utils.data import DataLoader
from transformers import AdamW

optim = AdamW(model.parameters(), lr=5e-5)

for epoch in range(3):
  train_losses = []
  model.train()
  for batch in train_loader:
    # Move the batch to the same device as the model
    input_ids = batch['input_ids'].to(device)
    attention_mask = batch['attention_mask'].to(device)
    start_positions = batch['start_positions'].to(device)
    end_positions = batch['end_positions'].to(device)
    # Clear out the gradients (by default they accumulate)
    optim.zero_grad()
    # Forward pass: with the positions supplied, the first output is the loss
    outputs = model(input_ids, attention_mask=attention_mask, start_positions=start_positions, end_positions=end_positions)
    loss = outputs[0]
    train_losses.append(loss.item())
    # Backward pass
    loss.backward()
    # Update parameters and take a step using the computed gradient
    optim.step()
  print(f"Train loss: {np.mean(train_losses)}")
  # Validation
  model.eval()
  val_losses = []
  for batch in val_loader:
    input_ids = batch['input_ids'].to(device)
    attention_mask = batch['attention_mask'].to(device)
    start_positions = batch['start_positions'].to(device)
    end_positions = batch['end_positions'].to(device)
    with torch.no_grad():
      outputs = model(input_ids, attention_mask=attention_mask, start_positions=start_positions, end_positions=end_positions)
      loss = outputs[0]
    val_losses.append(loss.item())
  print(f"Validation loss: {np.mean(val_losses)}")

Train loss: 1.1934588807552198
Validation loss: 1.0120271309846784
Train loss: 0.8928633790498882
Validation loss: 0.8718096106881048
Train loss: 0.799929228041795
Validation loss: 0.8515546736193866



In [None]:
torch.save(model.state_dict(), 'bio_bert_model.bin')

In [None]:
model.load_state_dict(torch.load('bio_bert_model.bin'), strict=False)
model = model.to(device)

# Alternatively, train with the Trainer API

In [53]:
batch_size = 16

In [54]:
args = TrainingArguments(
    "test-squad",
    evaluation_strategy = "epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)
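
As a rough sanity check on these settings, we can estimate how many optimizer steps the run will take. This sketch assumes a single device, no gradient accumulation, and that the last partial batch is kept; the row count comes from `train_df.shape` above.

```python
import math

n_train = 12334          # training rows, from train_df.shape
batch_size = 16
num_train_epochs = 3

# Assumption: one optimizer step per batch, last partial batch included.
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * num_train_epochs

print(steps_per_epoch, total_steps)
```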

In [55]:
data_collator = default_data_collator

In [56]:
trainer = Trainer(
    model,
    args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
)

In [None]:
trainer.train()

In [None]:
trainer.save_model("bioasq_with_bert_model")

In [None]:
model.load_state_dict(torch.load('bio_bert_model.bin'), strict=False)
model = model.to(device)

Evaluating our model will require a bit more work, as we will need to map the predictions of our model back to parts of the context. The model itself predicts logits for the start and end positions of our answers: if we take a batch from our validation dataloader, here is the output our model gives us: