In [1]:
from transformers import BertTokenizer, TFBertForQuestionAnswering
import tensorflow as tf



In [2]:
tokenizer = BertTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1-squad")

In [3]:
model = TFBertForQuestionAnswering.from_pretrained("dmis-lab/biobert-base-cased-v1.1-squad", from_pt=True)

All PyTorch model weights were used when initializing TFBertForQuestionAnswering.

All the weights of TFBertForQuestionAnswering were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForQuestionAnswering for predictions without further training.


In [26]:
question = "What reduces risk of Covid-19?"
text = """
Coronavirus (COVID-19) can make anyone seriously ill. But for some people, the risk is higher.
At some point during the COVID-19 pandemic you may have been told you were at high risk of getting seriously ill from COVID-19 (sometimes called clinically vulnerable or clinically extremely vulnerable). You may also have been advised to stay at home (shield).
For most people at high risk from COVID-19, vaccination has significantly reduced this risk. You can follow the same advice as everyone else on how to avoid catching and spreading COVID-19.
Some people continue to be at high risk from COVID-19, despite vaccination.
"""

In [27]:
inputs = tokenizer(question, text, return_tensors="tf")
outputs = model(**inputs)

In [7]:
answer_start_index = int(tf.math.argmax(outputs.start_logits, axis=-1)[0])
answer_end_index = int(tf.math.argmax(outputs.end_logits, axis=-1)[0])

In [8]:
predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]

In [9]:
tokenizer.decode(predict_answer_tokens)

'vaccination'

Potential approach for title search with cosine similarity (the steps below use BERT embeddings rather than TF-IDF vectors):
1. Extract the titles from all JSON documents.
2. Compute a BERT embedding for each title.
3. Compute a BERT embedding for the input query.
4. Use cosine similarity to find the title embeddings closest to the query embedding. This gives the list of titles most similar to the input query.

This could later be extended to search the body text. Recommend starting with a small amount of self-generated text to test the approach.
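
Step 4 can be sketched with NumPy alone. This is a minimal illustration, not the notebook's implementation: `rank_by_cosine` is a hypothetical helper, and the three-dimensional vectors and titles are toy stand-ins for real BERT embeddings (which would have to be computed separately, for example by pooling the model's hidden states).

```python
import numpy as np

def rank_by_cosine(query_vec, title_vecs, top_k=3):
    """Return [(title_index, cosine_similarity), ...] for the top_k closest titles."""
    q = np.asarray(query_vec, dtype=float)
    m = np.asarray(title_vecs, dtype=float)
    # Normalise to unit length; the small epsilon guards against zero vectors.
    q = q / (np.linalg.norm(q) + 1e-12)
    m = m / (np.linalg.norm(m, axis=1, keepdims=True) + 1e-12)
    sims = m @ q                        # cosine similarity of every title to the query
    order = np.argsort(-sims)[:top_k]   # indices sorted by descending similarity
    return [(int(i), float(sims[i])) for i in order]

# Toy data (assumed for illustration only).
titles = ["title A", "title B", "title C"]
title_vecs = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9], [0.7, 0.3, 0.1]]
query_vec = [1.0, 0.0, 0.0]

for idx, score in rank_by_cosine(query_vec, title_vecs, top_k=2):
    print(titles[idx], round(score, 3))
```

Normalising once and using a single matrix product keeps the ranking cheap even for many titles, since the title matrix can be normalised ahead of time and reused for every query.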

In [4]:
import pandas as pd
import numpy as np

In [48]:
data = pd.read_csv("Data/clean_pmc.csv", nrows=1000)

In [49]:
not_null_data = data[data['abstract'].notnull()]

In [50]:
not_null_data.shape

(662, 9)

In [54]:
def get_answer(question, text):
    """Return the answer span the QA model predicts for `question` within `text`."""
    inputs = tokenizer(question, text, return_tensors="tf")
    outputs = model(**inputs)
    answer_start_index = int(tf.math.argmax(outputs.start_logits, axis=-1)[0])
    answer_end_index = int(tf.math.argmax(outputs.end_logits, axis=-1)[0])
    predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
    predicted_answer = tokenizer.decode(predict_answer_tokens)

    return predicted_answer

In [53]:
question = "What are the risk factors of Covid-19?"

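qa_dict = {}
# Run the QA model over the first 100 abstracts and collect the predicted answers.
for index, row in not_null_data[:100].iterrows():

    # Skip long abstracts. Note: this counts abstract tokens only; the question
    # and the [CLS]/[SEP] tokens also count towards BERT's 512-token limit.
    if len(tokenizer.tokenize(row['abstract'])) < 512:

        text = row['abstract']
        inputs = tokenizer(question, text, return_tensors="tf")
        outputs = model(**inputs)
        # Most likely start and end positions, chosen independently of each other.
        answer_start_index = int(tf.math.argmax(outputs.start_logits, axis=-1)[0])
        answer_end_index = int(tf.math.argmax(outputs.end_logits, axis=-1)[0])
        predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]

        predicted_answer = tokenizer.decode(predict_answer_tokens)
        # Drop predictions that decode to exactly '[CLS]'.
        if predicted_answer != '[CLS]':
            qa_dict[index] = predicted_answer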

In [55]:
qa_dict

{4: 'the flow pattern and temperature distribution',
 19: '',
 38: '[CLS] what are the risk factors of covid - 19? [SEP] abstract we recently observed six cases of generalized dermatitis associated with malassezia overgrowth',
 42: 'what are the risk factors',
 47: 'what are the risk factors',
 58: '',
 59: '[CLS] what are the risk factors of covid - 19? [SEP] abstract human respiratory syncytial virus is an important cause of severe respiratory disease in young children, the elderly, and in immunocompromised adults. similarly, bovine respiratory syncytial virus ( brsv ) is causing severe, sometimes fatal, respiratory disease in calves. both viruses are pneumovirus',
 62: 'the ratio of the epidemic infection rate of the year to the average infected density of the former year',
 74: '[CLS] what are the risk factors of covid - 19? [SEP] abstract understanding of aerosol dispersion characteristics has many scientific and engineering applications. it is recognized that eulerian or lagrangi
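
The empty strings and the answers that begin with `[CLS]` or repeat the question are consistent with taking the start and end argmax independently: the end index can fall before the start (an empty slice), and either index can land outside the abstract. One possible fix, sketched below under assumptions, is to score only spans that lie inside the context and have start <= end. `best_valid_span` is a hypothetical helper; `context_start` is assumed to be the index of the first context token (for a BERT question/context pair, the first position where `token_type_ids` is 1), and the logits are toy values.

```python
import numpy as np

def best_valid_span(start_logits, end_logits, context_start, max_answer_len=30):
    """Return (start, end, score) of the best span with
    context_start <= start <= end and at most max_answer_len tokens,
    or None if the context is empty."""
    start_logits = np.asarray(start_logits, dtype=float)
    end_logits = np.asarray(end_logits, dtype=float)
    best = None
    for s in range(context_start, len(start_logits)):
        # Only consider end positions at or after s, within the length cap.
        e_max = min(len(end_logits), s + max_answer_len)
        e = s + int(np.argmax(end_logits[s:e_max]))
        score = float(start_logits[s] + end_logits[e])
        if best is None or score > best[2]:
            best = (s, e, score)
    return best

# Toy logits: the independent argmax would pick start=0 and end=1,
# both outside the context, which begins at index 3.
start_logits = [5.0, 0.1, 0.2, 3.0, 0.5, 0.1]
end_logits = [0.1, 4.0, 0.2, 0.3, 2.5, 0.1]
print(best_valid_span(start_logits, end_logits, context_start=3))
```

The returned score (sum of the two logits) could also be compared against a threshold to discard low-confidence answers, though a suitable threshold would need to be found empirically.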

**Next steps**:

- Take a subsample of the data (first 5 rows) and try extracting answers. Use abstracts rather than body text to stay within the 512-token limit.
- Write code to truncate inputs to 512 tokens for abstracts that exceed the limit.
- Rework the plan to extract the most commonly mentioned risk factors (the list of risk factors may need adjusting to match what the sample data returns).
- Plan how to return candidate papers once the BERT methodology (above steps) is settled.
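
For the truncation step, a minimal sketch of the token budget is shown here. It assumes the standard BERT pair layout `[CLS] question [SEP] context [SEP]`, so three special tokens plus the question count towards the limit. `truncate_context` is a hypothetical helper operating on plain token lists. In practice the Hugging Face tokenizer can do the same thing directly when called with `truncation="only_second"` and `max_length=512`.

```python
def truncate_context(question_tokens, context_tokens, max_len=512, num_special=3):
    """Trim context_tokens so that question + context + special tokens fit in max_len."""
    budget = max_len - len(question_tokens) - num_special
    if budget <= 0:
        raise ValueError("The question alone leaves no room for context tokens.")
    # Slicing leaves short contexts unchanged and cuts long ones at the budget.
    return context_tokens[:budget]

# Toy tokens (assumed for illustration only).
question_tokens = ["what", "are", "the", "risk", "factors", "?"]
long_abstract = ["tok"] * 600
trimmed = truncate_context(question_tokens, long_abstract)
print(len(question_tokens) + len(trimmed) + 3)  # total sequence length after truncation
```

Cutting from the end is the simplest choice; it drops whatever appears late in the abstract, so a sliding window over the context may be worth considering if answers tend to sit near the end.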