#### T5  
https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html (https://youtu.be/r6XY80Z9eSA?t=348)  
https://huggingface.co/transformers/model_doc/t5.html  
https://arxiv.org/abs/1910.10683  


https://www.youtube.com/watch?v=_l2wJb3QPdk  
https://www.youtube.com/watch?v=r6XY80Z9eSA  

pip install --quiet transformers==4.1.1  
pip install --quiet pytorch-lightning==1.1.1  
pip install --quiet tokenizers==0.9.4  
pip install --quiet sentencepiece==0.1.94  # marked as not needed in the original notes; install it if T5Tokenizer reports that SentencePiece is missing  
pip install --quiet pandas  
pip install --quiet scikit-learn  
pip install --quiet keras  
pip install --quiet tensorflow  
pip install --quiet termcolor  

## Tutorial, Part 1
https://www.youtube.com/watch?v=_l2wJb3QPdk

In Google Colab, change the runtime type to GPU.

In [None]:
!nvidia-smi

In [None]:
!pip install --quiet transformers==4.1.1
!pip install --quiet pytorch-lightning==1.1.1
!pip install --quiet tokenizers==0.9.4
!pip install --quiet scikit-learn
!pip install --quiet keras
!pip install --quiet tensorflow
!pip install --quiet termcolor

In [1]:
import json
import pandas as pd
import numpy as np
import torch
from pathlib import Path
from torch.utils.data import Dataset, DataLoader
import pytorch_lightning as pl
from sklearn.model_selection import train_test_split
from termcolor import colored
import textwrap

from transformers import (
    AdamW,
    T5ForConditionalGeneration,
    T5Tokenizer,
    get_linear_schedule_with_warmup
)


# model files are downloaded from https://huggingface.co/valhalla/t5-base-qa-qg-hl/tree/main
# if Internet access is available just use
# MODEL_FILES = "t5-base"
# instead of path to model files

#from sys import platform
#if "linux" in platform.lower():
#    MODEL_FILES = "/home/myuser/TransformerModels/t5-base-qa-qg-hl"
#    CHECKPOINT_PATH="/home/myuser/TransformerModels/_CheckPoints"
#else:
#    MODEL_FILES = "C:/TransformerModels/t5-base-qa-qg-hl"
#    CHECKPOINT_PATH="C:/TransformerModels/_CheckPoints"

MODEL_FILES = "t5-base"
CHECKPOINT_PATH="./CheckPoints"


N_GPUS = 1 # number of GPUs to train on; set to 0 to train on CPU
N_WORKERS = 4 # 4 in the tutorial. 0 if running on windows without GPU...


In [2]:
pl.seed_everything(42)

42

In [3]:
#model = AutoModelWithLMHead.from_pretrained("deep-learning-analytics/triviaqa-t5-base")
#device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
#model = model.to(device)

In [None]:
# https://github.com/dmis-lab/biobert#datasets
!gdown --id 19ft5q44W4SuptJgTwR84xZjsHg1jvjSZ

In [None]:
!unzip -q QA.zip

In [4]:
with Path("BioASQ/BioASQ-train-factoid-4b.json").open() as json_file:
    data = json.load(json_file)

In [5]:
questions = data["data"][0]["paragraphs"]
questions[0]

{'qas': [{'id': '52bf208003868f1b06000019_002',
   'question': 'What is the inheritance pattern of Li–Fraumeni syndrome?',
   'answers': [{'text': 'autosomal dominant', 'answer_start': 213}]}],
 'context': 'Balanced t(11;15)(q23;q15) in a TP53+/+ breast cancer patient from a Li-Fraumeni syndrome family. Li-Fraumeni Syndrome (LFS) is characterized by early-onset carcinogenesis involving multiple tumor types and shows autosomal dominant inheritance. Approximately 70% of LFS cases are due to germline mutations in the TP53 gene on chromosome 17p13.1. Mutations have also been found in the CHEK2 gene on chromosome 22q11, and others have been mapped to chromosome 11q23. While characterizing an LFS family with a documented defect in TP53, we found one family member who developed bilateral breast cancer at age 37 yet was homozygous for wild-type TP53. Her mother also developed early-onset primary bilateral breast cancer, and a sister had unilateral breast cancer and a soft tissue sarcoma. Cytog

In [6]:
def extract_questions_and_answers(factoid_path: Path):
    with factoid_path.open() as json_file:
        data = json.load(json_file)
        
    questions = data["data"][0]["paragraphs"]
    data_rows = []
    for paragraph in questions:
        context = paragraph["context"]
        for question_and_answers in paragraph["qas"]:
            question = question_and_answers["question"]
            answers = question_and_answers["answers"]

            # one row per answer; this loop must sit inside the per-question loop
            for answer in answers:
                answer_text = answer["text"]
                answer_start = answer["answer_start"]
                answer_end = answer_start + len(answer_text)

                data_rows.append({
                    "question": question,
                    "context": context,
                    "answer_text": answer_text,
                    "answer_start": answer_start,
                    "answer_end": answer_end
                })
            
    return pd.DataFrame(data_rows)
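
To make the flattening concrete, here is a minimal sketch on a hand-made record in the same SQuAD-style layout. The record, its text, and the helper name `flatten_paragraphs` are made up for illustration, and plain dictionaries stand in for the DataFrame rows.

```python
# Hypothetical miniature of the BioASQ/SQuAD-style layout; the text is made up.
CONTEXT = "Aspirin inhibits the enzyme cyclooxygenase."
ANSWER = "cyclooxygenase"

sample = {
    "data": [{
        "paragraphs": [{
            "context": CONTEXT,
            "qas": [{
                "id": "example_001",
                "question": "Which enzyme does aspirin inhibit?",
                "answers": [{"text": ANSWER, "answer_start": CONTEXT.index(ANSWER)}],
            }],
        }]
    }]
}


def flatten_paragraphs(data: dict) -> list:
    """Return one row per (question, context, answer) triple."""
    rows = []
    for paragraph in data["data"][0]["paragraphs"]:
        context = paragraph["context"]
        for qa in paragraph["qas"]:
            for answer in qa["answers"]:
                start = answer["answer_start"]
                end = start + len(answer["text"])  # exclusive end offset
                rows.append({
                    "question": qa["question"],
                    "context": context,
                    "answer_text": answer["text"],
                    "answer_start": start,
                    "answer_end": end,
                })
    return rows


rows = flatten_paragraphs(sample)
row = rows[0]
# The offsets are character positions, so slicing the context recovers the answer.
print(row["context"][row["answer_start"]:row["answer_end"]])
```

The same slice is what `color_answer` relies on later to highlight the answer inside its context.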

In [7]:
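# Note: `.head` without parentheses returns the bound method, so the repr below prints the whole frame; call `.head()` to get only the first five rows
extract_questions_and_answers(Path("BioASQ/BioASQ-train-factoid-4b.json")).head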

<bound method NDFrame.head of                                                question  \
0     What is the inheritance pattern of Li–Fraumeni...   
1     What is the inheritance pattern of Li–Fraumeni...   
2       Which type of lung cancer is afatinib used for?   
3     Which hormone abnormalities are characteristic...   
4     Which hormone abnormalities are characteristic...   
...                                                 ...   
3261  Which is the receptor for substrates of Chaper...   
3262  Which is the receptor for substrates of Chaper...   
3263  Which is the receptor for substrates of Chaper...   
3264  Which is the receptor for substrates of Chaper...   
3265  How many selenoproteins are encoded in the hum...   

                                                context         answer_text  \
0     Balanced t(11;15)(q23;q15) in a TP53+/+ breast...  autosomal dominant   
1     Genetic modeling of Li-Fraumeni syndrome in ze...  autosomal dominant   
2     Clinical perspecti

In [8]:
factoid_paths = sorted(list(Path("BioASQ/").glob("BioASQ-train-*")))
factoid_paths

[WindowsPath('BioASQ/BioASQ-train-factoid-4b.json'),
 WindowsPath('BioASQ/BioASQ-train-factoid-5b.json'),
 WindowsPath('BioASQ/BioASQ-train-factoid-6b.json')]

In [9]:
dfs = []

for factoid_path in factoid_paths:
    dfs.append(extract_questions_and_answers(factoid_path))
    
df = pd.concat(dfs)


In [10]:
print(len(df.question.unique()))
print(len(df.answer_text.unique()))
print(len(df.context.unique()))

443
661
2582


In [11]:
df.head()

| | question | context | answer_text | answer_start | answer_end |
|---|---|---|---|---|---|
| 0 | What is the inheritance pattern of Li–Fraumeni... | Balanced t(11;15)(q23;q15) in a TP53+/+ breast... | autosomal dominant | 213 | 231 |
| 1 | What is the inheritance pattern of Li–Fraumeni... | Genetic modeling of Li-Fraumeni syndrome in ze... | autosomal dominant | 105 | 123 |
| 2 | Which type of lung cancer is afatinib used for? | Clinical perspective of afatinib in non-small ... | EGFR-mutant NSCLC | 1203 | 1220 |
| 3 | Which hormone abnormalities are characteristic... | DOCA sensitive pendrin expression in kidney, h... | thyroid | 419 | 426 |
| 4 | Which hormone abnormalities are characteristic... | Clinical and molecular characteristics of Pend... | thyroid | 705 | 712 |


## Removing duplicates (the following cells are explained in the second video)

In [12]:
print(df.shape)
print(len(df.question.unique()))

(12988, 5)
443


In [13]:
df = df.drop_duplicates(subset=["context"]).reset_index(drop=True)

In [14]:
print(df.shape)
print(len(df.question.unique()))

(2582, 5)
441
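
Dropping duplicate contexts keeps only the first row for each context. A likely explanation for the question count falling from 443 to 441 is that a question disappears when every context it was paired with already appeared earlier under another question. The sketch below reproduces that effect in plain Python; the rows and the helper name `drop_duplicate_contexts` are made up for illustration.

```python
def drop_duplicate_contexts(rows):
    """Keep the first row for each distinct context (keep-first behaviour)."""
    seen = set()
    kept = []
    for row in rows:
        if row["context"] not in seen:
            seen.add(row["context"])
            kept.append(row)
    return kept


# Made-up rows: question B only ever appears with a context already used by question A.
rows = [
    {"question": "A", "context": "ctx-1"},
    {"question": "A", "context": "ctx-2"},
    {"question": "B", "context": "ctx-1"},
    {"question": "C", "context": "ctx-3"},
]

kept = drop_duplicate_contexts(rows)
remaining_questions = sorted({r["question"] for r in kept})
print(len(kept))             # rows left after deduplication
print(remaining_questions)   # question B is gone
```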


In [15]:
sample_question = df.iloc[240]
sample_question

question        What is the characteristic feature of the Dyke...
context         Left hemisphere and male sex dominance of cere...
answer_text                                  cerebral hemiatrophy
answer_start                                                  130
answer_end                                                    150
Name: 240, dtype: object

In [16]:
def color_answer(question):
    answer_start , answer_end = question["answer_start"], question["answer_end"]
    context = question["context"]

    return colored(context[:answer_start], "white") + \
        colored(context[answer_start:answer_end], "green") + \
        colored(context[answer_end:], "white")

In [17]:
print(sample_question["question"])
print()
for wrap in textwrap.wrap(color_answer(sample_question), width = 130):
    print(wrap)

What is the characteristic feature of the Dyke-Davidoff-Masson syndrome.

[37mLeft hemisphere and male sex dominance of cerebral hemiatrophy (Dyke-Davidoff-Masson Syndrome). Although radiological
findings of [0m[32mcerebral hemiatrophy[0m[37m (Dyke-Davidoff-Masson Syndrome) are well known, there is no systematic study
about the gender and the affected side in this syndrome. Brain images in 26 patients (mean aged 11) with cerebral hemiatrophy were
retrospectively reviewed. Nineteen patients (73.5%) were male and seven patients (26.5%) were female. Left hemisphere involvement
was seen in 18 patients (69.2%) and right hemisphere involvement was seen in eight patients (30.8%). We conclude that male gender
and left side involvement are frequent in cerebral hemiatrophy disease.[0m


### Tokenization

In [18]:
tokenizer = T5Tokenizer.from_pretrained(MODEL_FILES)

In [19]:
sample_encoding = tokenizer(
    "Would I rather be feared or loved?",
    "Easy. Both, I want both."
    )

In [20]:
sample_encoding.keys()

dict_keys(['input_ids', 'attention_mask'])

In [21]:
print(sample_encoding["input_ids"])

[5328, 27, 1066, 36, 3, 27625, 42, 1858, 58, 1, 6844, 5, 2867, 6, 27, 241, 321, 5, 1]


In [22]:
print(sample_encoding["attention_mask"])

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


In [23]:
preds = [
    tokenizer.decode(input_id, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    for input_id in sample_encoding["input_ids"]
]

In [24]:
" ".join(preds)

'Would I rather be  feared or loved ? </s> Easy . Both , I want both . </s>'

In [25]:
encoding = tokenizer(
    sample_question["question"],
    sample_question["context"],
    max_length=396,
    padding="max_length",
    truncation="only_second",
    return_attention_mask=True,
    add_special_tokens=True,
    return_tensors="pt"
    )
# truncation="only_second" because we do not want to truncate the question
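
As a rough illustration of what `truncation="only_second"` combined with `padding="max_length"` does, the toy function below works on whitespace-separated words instead of SentencePiece tokens. It is a sketch of the idea under that assumption, not the tokenizer's actual algorithm, and `encode_pair` is a hypothetical helper.

```python
def encode_pair(question, context, max_length, eos="</s>", pad="<pad>"):
    """Toy pair encoding: keep the question whole, cut only the context, pad to max_length."""
    q_tokens = question.split()
    c_tokens = context.split()

    # Two slots are reserved for the EOS tokens that close each segment.
    budget = max_length - len(q_tokens) - 2
    if budget < 0:
        raise ValueError("question alone does not fit into max_length")

    c_tokens = c_tokens[:budget]          # only the second segment is cut
    tokens = q_tokens + [eos] + c_tokens + [eos]

    attention_mask = [1] * len(tokens)
    padding = max_length - len(tokens)
    tokens += [pad] * padding             # pad up to the fixed length
    attention_mask += [0] * padding       # padded positions are masked out
    return tokens, attention_mask


QUESTION = "What is T5 ?"
CONTEXT = "T5 is a text-to-text transformer trained on many tasks"

tokens, mask = encode_pair(QUESTION, CONTEXT, max_length=10)
print(tokens)   # the question is intact, the context is cut to fit
print(mask)

padded_tokens, padded_mask = encode_pair(QUESTION, CONTEXT, max_length=16)
print(padded_tokens)  # everything fits, the tail is padding
print(padded_mask)
```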

In [26]:
encoding.keys()

dict_keys(['input_ids', 'attention_mask'])

In [27]:
tokenizer.special_tokens_map

{'eos_token': '</s>',
 'unk_token': '<unk>',
 'pad_token': '<pad>',
 'additional_special_tokens': "['<extra_id_0>', '<extra_id_1>', '<extra_id_2>', '<extra_id_3>', '<extra_id_4>', '<extra_id_5>', '<extra_id_6>', '<extra_id_7>', '<extra_id_8>', '<extra_id_9>', '<extra_id_10>', '<extra_id_11>', '<extra_id_12>', '<extra_id_13>', '<extra_id_14>', '<extra_id_15>', '<extra_id_16>', '<extra_id_17>', '<extra_id_18>', '<extra_id_19>', '<extra_id_20>', '<extra_id_21>', '<extra_id_22>', '<extra_id_23>', '<extra_id_24>', '<extra_id_25>', '<extra_id_26>', '<extra_id_27>', '<extra_id_28>', '<extra_id_29>', '<extra_id_30>', '<extra_id_31>', '<extra_id_32>', '<extra_id_33>', '<extra_id_34>', '<extra_id_35>', '<extra_id_36>', '<extra_id_37>', '<extra_id_38>', '<extra_id_39>', '<extra_id_40>', '<extra_id_41>', '<extra_id_42>', '<extra_id_43>', '<extra_id_44>', '<extra_id_45>', '<extra_id_46>', '<extra_id_47>', '<extra_id_48>', '<extra_id_49>', '<extra_id_50>', '<extra_id_51>', '<extra_id_52>', '<extra_i

In [28]:
tokenizer.eos_token, tokenizer.eos_token_id

('</s>', 1)

In [29]:
tokenizer.decode(encoding["input_ids"].squeeze())

'What is the characteristic feature of the Dyke-Davidoff-Masson syndrome.</s> Left hemisphere and male sex dominance of cerebral hemiatrophy (Dyke-Davidoff-Masson Syndrome). Although radiological findings of cerebral hemiatrophy (Dyke-Davidoff-Masson Syndrome) are well known, there is no systematic study about the gender and the affected side in this syndrome. Brain images in 26 patients (mean aged 11) with cerebral hemiatrophy were retrospectively reviewed. Nineteen patients (73.5%) were male and seven patients (26.5%) were female. Left hemisphere involvement was seen in 18 patients (69.2%) and right hemisphere involvement was seen in eight patients (30.8%). We conclude that male gender and left side involvement are frequent in cerebral hemiatrophy disease.</s> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>

In [30]:
answer_encoding = tokenizer(
    sample_question["answer_text"],
    max_length=32,
    padding="max_length",
    truncation=True,
    return_attention_mask=True,
    add_special_tokens=True,
    return_tensors="pt"
    )

In [31]:
tokenizer.decode(answer_encoding["input_ids"].squeeze())

'cerebral hemiatrophy</s> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>'

In [32]:
labels = answer_encoding["input_ids"]
labels

tensor([[24387,     3,   107, 11658,    17, 29006,     1,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0]])

In [33]:
# Padding positions (token id 0) in the labels are set to -100 so that the loss ignores them
labels[labels == 0] = -100
labels

tensor([[24387,     3,   107, 11658,    17, 29006,     1,  -100,  -100,  -100,
          -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
          -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
          -100,  -100]])
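
The value -100 is the default `ignore_index` of PyTorch's cross-entropy loss, so positions carrying it do not contribute to the loss. The plain-Python sketch below shows the effect on toy numbers; `masked_cross_entropy` is a hypothetical helper written for this illustration and not the library implementation.

```python
import math

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def masked_cross_entropy(logits, labels):
    """Mean negative log-likelihood over positions whose label is not IGNORE_INDEX."""
    total, count = 0.0, 0
    for scores, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # padded target position: contributes nothing
        # log of the softmax normaliser, computed in a numerically stable way
        top = max(scores)
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_norm - scores[label]
        count += 1
    return total / count


# Toy example: 3 target positions, vocabulary of 4 tokens; the last position is padding.
logits = [
    [2.0, 0.5, 0.1, 0.1],
    [0.1, 0.2, 3.0, 0.3],
    [5.0, 0.0, 0.0, 0.0],
]
labels = [0, 2, IGNORE_INDEX]

loss_with_padding = masked_cross_entropy(logits, labels)
loss_without_padding = masked_cross_entropy(logits[:2], labels[:2])
print(loss_with_padding, loss_without_padding)  # identical: the padded position is skipped
```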

In [34]:
class BioQADataset(Dataset):

    def __init__(
        self,
        data: pd.DataFrame,
        tokenizer: T5Tokenizer,
        source_max_token_len: int = 396,
        target_max_token_len: int = 32
    ):

        self.tokenizer = tokenizer
        self.data = data
        self.source_max_token_len = source_max_token_len
        self.target_max_token_len = target_max_token_len

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index: int):
        data_row = self.data.iloc[index]

        source_encoding = self.tokenizer(
            data_row["question"],
            data_row["context"],
            max_length=self.source_max_token_len,
            padding="max_length",
            truncation="only_second",
            return_attention_mask=True,
            add_special_tokens=True,
            return_tensors="pt"
        )

        target_encoding = self.tokenizer(
            data_row["answer_text"],
            max_length=self.target_max_token_len,
            padding="max_length",
            truncation=True,
            return_attention_mask=True,
            add_special_tokens=True,
            return_tensors="pt"
        )

        labels = target_encoding["input_ids"]
        labels[labels == 0] = -100

        return dict(
            question=data_row["question"],
            context=data_row["context"],
            answer_text=data_row["answer_text"],
            input_ids=source_encoding["input_ids"].flatten(),
            attention_mask=source_encoding["attention_mask"].flatten(),
            labels=labels.flatten()
        )

In [35]:
sample_dataset = BioQADataset(df, tokenizer)

In [36]:
for data in sample_dataset:
    print(data["question"])
    print(data["answer_text"])
    print(data["input_ids"][:20])
    print(data["labels"][:20])
    break

What is the inheritance pattern of Li–Fraumeni syndrome?
autosomal dominant
tensor([  363,    19,     8, 28915,  3275,    13,  1414,   104,   371,  6340,
           35,    23, 12398,    58,     1, 17904,    26,     3,    17,   599])
tensor([ 1510, 10348,   138, 12613,     1,  -100,  -100,  -100,  -100,  -100,
         -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100])


In [37]:
train_df, val_df = train_test_split(df, test_size=0.05)
train_df.shape, val_df.shape

((2452, 5), (130, 5))
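
The split sizes can be checked by hand. The sketch assumes that the validation share is rounded up, which is consistent with the shapes printed above; it only redoes the arithmetic and does not shuffle any data.

```python
import math

n_samples = 2582      # rows left after dropping duplicate contexts
test_size = 0.05

n_val = math.ceil(test_size * n_samples)   # 129.1 rounds up to 130
n_train = n_samples - n_val                # the remaining rows go to training

print(n_train, n_val)
```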

In [38]:
class BioQADataModule(pl.LightningDataModule):

    def __init__(
        self,
        train_df: pd.DataFrame,
        test_df: pd.DataFrame,
        tokenizer: T5Tokenizer,
        batch_size: int = 8,
        source_max_token_len: int = 396,
        target_max_token_len: int = 32
    ):
        super().__init__()
        self.batch_size = batch_size
        self.train_df = train_df
        self.test_df = test_df
        self.tokenizer = tokenizer
        self.source_max_token_len = source_max_token_len
        self.target_max_token_len = target_max_token_len

    def setup(self, stage=None):
        self.train_dataset = BioQADataset(
            self.train_df,
            self.tokenizer,
            self.source_max_token_len,
            self.target_max_token_len
        )

        self.test_dataset = BioQADataset(
            self.test_df,
            self.tokenizer,
            self.source_max_token_len,
            self.target_max_token_len
        )

    def train_dataloader(self):
        return DataLoader(
            self.train_dataset,
            batch_size=self.batch_size,
            shuffle=True,
            num_workers=N_WORKERS
        )

    def val_dataloader(self):
        return DataLoader(
            self.test_dataset,
            batch_size=1,
            num_workers=N_WORKERS
        )

    def test_dataloader(self):
        return DataLoader(
            self.test_dataset,
            batch_size=1,
            num_workers=N_WORKERS
        )

In [39]:
BATCH_SIZE = 8
N_EPOCHS = 6

data_module = BioQADataModule(train_df, val_df, tokenizer, batch_size=BATCH_SIZE)
data_module.setup()

# Second video  
https://www.youtube.com/watch?t=348&v=r6XY80Z9eSA

In [40]:
model = T5ForConditionalGeneration.from_pretrained(MODEL_FILES, return_dict=True)

Some weights of the model checkpoint at C:/TransformerModels/t5-base-qa-qg-hl were not used when initializing T5ForConditionalGeneration: ['decoder.block.0.layer.1.EncDecAttention.relative_attention_bias.weight']
- This IS expected if you are initializing T5ForConditionalGeneration from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing T5ForConditionalGeneration from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


# Translation  

In [41]:
input_ids = tokenizer(
    "translate English to German: I talk a lot, so I've learned to tune myself out",
    return_tensors="pt"
).input_ids

generated_ids = model.generate(input_ids=input_ids)
generated_ids

tensor([[    0,  1674,  1131,    15,  2221,     6,    92,  2010,     3,   362,
         29484,     6,  2278, 12426, 27017,     1]])

In [42]:
preds = [
    tokenizer.decode(gen_id, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    for gen_id in generated_ids
]

preds

['Ich rede viel, also habe ich gelernt, mich auszuschalten']

In [43]:
" ".join(preds)

'Ich rede viel, also habe ich gelernt, mich auszuschalten'

### Back to English with Google Translate  
https://translate.google.com/?sl=auto&tl=en&text=Ich%20rede%20viel%2C%20also%20habe%20ich%20gelernt%2C%20mich%20auszuschalten&op=translate


# Summarization


How to generate text: using different decoding methods for language generation with Transformers  
https://huggingface.co/blog/how-to-generate

In [45]:
text = """
summarize: The FDA, an agency within the U.S. Department of Health and Human Services, protects the public health by assuring the safety, effectiveness, and security of human and veterinary drugs, vaccines and other biological products for human use, and medical devices.
The agency also is responsible for the safety and security of our nation’s food supply, cosmetics, dietary supplements, products that give off electronic radiation, and for regulating tobacco products.
The agency has updated its FDA COVID-19 Response At-A-Glance Summary, which provides a quick look at facts, figures, and highlights on the FDA's response efforts.
"""

In [46]:
input_ids = tokenizer(
    text,
    return_tensors="pt"
).input_ids

generated_ids = model.generate(input_ids=input_ids)

preds = [
    tokenizer.decode(gen_id, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    for gen_id in generated_ids
]

" ".join(preds)

'The FDA is responsible for the safety and security of human and veterinary drugs, vaccines and'

# Question answering

In [47]:
output = model(
 input_ids = encoding["input_ids"],
 attention_mask=encoding["attention_mask"],
 labels=labels
)

#### encoding was defined previously:
<pre>
encoding = tokenizer(
    sample_question["question"],
    sample_question["context"],
    max_length=396,
    padding="max_length",
    truncation="only_second",
    return_attention_mask=True,
    add_special_tokens=True,
    return_tensors="pt"
    )
# truncation="only_second" because we do not want to truncate the question
</pre>

In [48]:
model.config

T5Config {
  "_name_or_path": "C:/TransformerModels/t5-base-qa-qg-hl",
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "d_ff": 3072,
  "d_kv": 64,
  "d_model": 768,
  "decoder_start_token_id": 0,
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "relu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "n_positions": 512,
  "num_decoder_layers": 12,
  "num_heads": 12,
  "num_layers": 12,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_num_buckets": 32,
  "task_specific_params": {
    "translation_en_to_fr": {
      "early_stopping": true,
      "max_length": 32,
      "num_beams": 4,
      "prefix": ""
    }
  },
  "use_cache": true,
  "vocab_size": 32102
}

In [49]:
output.logits.shape # [batch_size, target_length, vocab_size]: 1 is the batch size (a single example), 32 is the padded label length (max_length=32 in answer_encoding), 32102 is vocab_size from model.config
# for each of the 32 target positions we get one score per vocabulary entry (32102 scores)

torch.Size([1, 32, 32102])
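
To read a logits tensor of shape `[batch, target_length, vocab]`, take the highest-scoring vocabulary index at each target position. The sketch below does this on a tiny nested list standing in for the tensor; the numbers and the helper name `greedy_tokens` are made up for illustration.

```python
def greedy_tokens(logits):
    """For each batch item and target position, pick the vocabulary index with the highest score."""
    return [
        [max(range(len(scores)), key=scores.__getitem__) for scores in sequence]
        for sequence in logits
    ]


# Toy logits with shape [batch=1, target_length=3, vocab=4]
logits = [[
    [0.1, 2.0, 0.3, 0.0],
    [1.5, 0.2, 0.1, 0.0],
    [0.0, 0.1, 0.2, 3.0],
]]

shape = (len(logits), len(logits[0]), len(logits[0][0]))
predicted = greedy_tokens(logits)
print(shape)
print(predicted)  # one token id per target position
```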

In [50]:
output.loss

tensor(2.3435, grad_fn=<NllLossBackward>)

### Modeling

In [51]:
class BioQAModel(pl.LightningModule):

    def __init__(self):
        super().__init__()
        self.model = T5ForConditionalGeneration.from_pretrained(MODEL_FILES, return_dict=True)

    def forward(self, input_ids, attention_mask, labels=None): # labels are optional because they are not supplied when testing
        output = self.model(
            input_ids = input_ids,
            attention_mask=attention_mask,
            labels=labels
            )

        return output.loss, output.logits

    def training_step(self, batch, batch_idx):
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        labels = batch["labels"]
        loss, outputs = self(input_ids, attention_mask, labels)
        self.log("train_loss", loss, prog_bar=True, logger=True)
        return loss
    
    def validation_step(self, batch, batch_idx):
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        labels = batch["labels"]
        loss, outputs = self(input_ids, attention_mask, labels)
        self.log("val_loss", loss, prog_bar=True, logger=True)
        return loss
    
    def test_step(self, batch, batch_idx):
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        labels = batch["labels"]
        loss, outputs = self(input_ids, attention_mask, labels)
        self.log("test_loss", loss, prog_bar=True, logger=True)
        return loss
    
    def configure_optimizers(self):
        return AdamW(self.parameters(), lr=0.0001)

In [52]:
model = BioQAModel()

Some weights of the model checkpoint at C:/TransformerModels/t5-base-qa-qg-hl were not used when initializing T5ForConditionalGeneration: ['decoder.block.0.layer.1.EncDecAttention.relative_attention_bias.weight']
- This IS expected if you are initializing T5ForConditionalGeneration from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing T5ForConditionalGeneration from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


# Model Training using our dataset

In [53]:
from pytorch_lightning.callbacks import ModelCheckpoint  # Lightning's callback, not the Keras one

# Checkpoint callback to save the best model found during training
checkpoint_callback = ModelCheckpoint(
    dirpath=CHECKPOINT_PATH,  # must match the directory used by load_from_checkpoint
    filename="best-checkpoint",
    save_top_k=1, #just keep the best one
    verbose=True,
    monitor="val_loss",
    mode="min" # save the one with minimum validation loss
)
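
What `save_top_k=1` with `monitor="val_loss"` and `mode="min"` amounts to can be sketched without any framework: remember the lowest validation loss seen so far and overwrite the saved checkpoint only when it improves. `BestCheckpointTracker` and the loss values are hypothetical and only illustrate the idea.

```python
class BestCheckpointTracker:
    """Toy stand-in for a top-1 checkpoint callback that monitors a metric in "min" mode."""

    def __init__(self):
        self.best_score = float("inf")
        self.best_epoch = None

    def update(self, epoch, val_loss):
        """Return True when this epoch should overwrite the saved checkpoint."""
        if val_loss < self.best_score:
            self.best_score = val_loss
            self.best_epoch = epoch
            return True
        return False


tracker = BestCheckpointTracker()
val_losses = [3.11, 1.42, 1.57, 1.39]   # made-up values for illustration
saved = [tracker.update(epoch, loss) for epoch, loss in enumerate(val_losses)]
print(saved)                                   # epochs at which a checkpoint would be written
print(tracker.best_epoch, tracker.best_score)
```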

In [54]:
trainer = pl.Trainer(
    callbacks=[checkpoint_callback],  # register the checkpoint callback
    max_epochs = N_EPOCHS,
    gpus=N_GPUS,
    progress_bar_refresh_rate=30
)

GPU available: False, used: False
TPU available: None, using: 0 TPU cores


In [44]:
%load_ext tensorboard

In [56]:
%tensorboard --logdir ./lightning_logs

Reusing TensorBoard on port 6006 (pid 16668), started 17:47:50 ago. (Use '!kill 16668' to kill it.)

In [56]:
trainer.fit(model, data_module)


  | Name  | Type                       | Params
-----------------------------------------------------
0 | model | T5ForConditionalGeneration | 222 M 
-----------------------------------------------------
222 M     Trainable params
0         Non-trainable params
222 M     Total params
Epoch 0:  14%|█▎        | 60/437 [2:24:33<15:08:16, 144.55s/it, loss=0.47, v_num=0, val_loss=3.11, train_loss=0.547] 
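
The total of 437 steps in the progress bar appears to be the number of training batches plus the number of validation batches in one epoch. The arithmetic below checks this, assuming the progress bar counts both loops together.

```python
import math

n_train, n_val = 2452, 130        # sizes printed after train_test_split
train_batch_size = 8              # BATCH_SIZE
val_batch_size = 1                # val_dataloader uses batch_size=1

train_batches = math.ceil(n_train / train_batch_size)   # the last partial batch is kept
val_batches = math.ceil(n_val / val_batch_size)
total_steps = train_batches + val_batches

print(train_batches, val_batches, total_steps)
```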

# Predictions

In [None]:
trained_model = BioQAModel.load_from_checkpoint(CHECKPOINT_PATH + "/best-checkpoint.ckpt")
trained_model.freeze()

In [55]:
def generate_answer(question):
    source_encoding = tokenizer(
        question["question"],
        question["context"],
        max_length=396,
        padding="max_length",
        truncation="only_second", # do not truncate question
        return_attention_mask=True,
        add_special_tokens=True,
        return_tensors="pt"
    )

    generated_ids = trained_model.model.generate(
        input_ids=source_encoding["input_ids"],
        attention_mask=source_encoding["attention_mask"],
        num_beams=1,
        max_length=80,
        repetition_penalty=2.5,
        length_penalty=1.0,
        early_stopping=True,
        use_cache=True
    )

    preds = [
        tokenizer.decode(generated_id, skip_special_tokens=True, clean_up_tokenization_spaces=True)
        for generated_id in generated_ids
    ]

    return " ".join(preds)
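
With `num_beams=1` and no sampling, generation picks the highest-scoring token at each step, and `repetition_penalty` lowers the scores of tokens that were already generated. The sketch below shows one such step on toy scores. It assumes the rule described in the CTRL paper (positive scores are divided by the penalty, negative ones multiplied); the helper names are hypothetical.

```python
def apply_repetition_penalty(scores, generated_ids, penalty):
    """Down-weight tokens that were already generated."""
    adjusted = list(scores)
    for token_id in set(generated_ids):
        if adjusted[token_id] > 0:
            adjusted[token_id] /= penalty   # shrink positive scores
        else:
            adjusted[token_id] *= penalty   # push negative scores further down
    return adjusted


def greedy_step(scores, generated_ids, penalty=2.5):
    """Pick the best token id after applying the repetition penalty."""
    adjusted = apply_repetition_penalty(scores, generated_ids, penalty)
    return max(range(len(adjusted)), key=adjusted.__getitem__)


scores = [1.0, 4.0, 3.0, -1.0]
first = greedy_step(scores, generated_ids=[])     # nothing generated yet
second = greedy_step(scores, generated_ids=[1])   # token 1 was already produced
print(first, second)
```

Here token 1 wins the first step, but once it has been generated its score drops from 4.0 to 1.6 and token 2 is chosen instead.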

In [None]:
sample_question = val_df.iloc[0]
print(sample_question["question"])
print(sample_question["answer_text"])

In [None]:
generate_answer(sample_question)