In [1]:
from transformers import BertTokenizer, TFBertForQuestionAnswering
import tensorflow as tf



In [2]:
tokenizer = BertTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1-squad")

In [3]:
model = TFBertForQuestionAnswering.from_pretrained("dmis-lab/biobert-base-cased-v1.1-squad", from_pt=True)

All PyTorch model weights were used when initializing TFBertForQuestionAnswering.

All the weights of TFBertForQuestionAnswering were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForQuestionAnswering for predictions without further training.


In [26]:
question = "What reduces risk of Covid-19?"
text = """
Coronavirus (COVID-19) can make anyone seriously ill. But for some people, the risk is higher.
At some point during the COVID-19 pandemic you may have been told you were at high risk of getting seriously ill from COVID-19 (sometimes called clinically vulnerable or clinically extremely vulnerable). You may also have been advised to stay at home (shield).
For most people at high risk from COVID-19, vaccination has significantly reduced this risk. You can follow the same advice as everyone else on how to avoid catching and spreading COVID-19.
Some people continue to be at high risk from COVID-19, despite vaccination.
"""

In [27]:
inputs = tokenizer(question, text, return_tensors="tf")
outputs = model(**inputs)

In [7]:
answer_start_index = int(tf.math.argmax(outputs.start_logits, axis=-1)[0])
answer_end_index = int(tf.math.argmax(outputs.end_logits, axis=-1)[0])

In [8]:
predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]

In [9]:
tokenizer.decode(predict_answer_tokens)

'vaccination'
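Taking the two argmaxes independently can return an end index that precedes the start index, which would decode to an empty string. A minimal sketch of one common alternative, assuming NumPy and a hypothetical helper `best_span` (not part of transformers), picks the highest-scoring pair with start <= end and a capped answer length. The cap of 30 tokens is an assumed value.

```python
import numpy as np

def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end) maximising start_logit + end_logit with start <= end.

    Hypothetical helper for illustration; max_answer_len is an assumed cap.
    """
    start_logits = np.asarray(start_logits, dtype=float)
    end_logits = np.asarray(end_logits, dtype=float)
    # scores[i, j] = score of a span starting at token i and ending at token j
    scores = start_logits[:, None] + end_logits[None, :]
    n = len(start_logits)
    i, j = np.indices((n, n))
    # Keep only spans where the end is not before the start and the length is capped
    valid = (j >= i) & (j - i < max_answer_len)
    scores = np.where(valid, scores, -np.inf)
    start, end = np.unravel_index(np.argmax(scores), scores.shape)
    return int(start), int(end)

# Toy logits: independent argmaxes would give start=3, end=1 (an invalid span)
toy_start = [0.1, 2.0, 0.3, 5.0]
toy_end = [0.2, 4.0, 3.0, 0.5]
span = best_span(toy_start, toy_end)
```

With the notebook's variables this would be called as `best_span(outputs.start_logits[0].numpy(), outputs.end_logits[0].numpy())`. The sketch does not exclude the question tokens from the candidate spans, which a fuller version would probably want to do.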

Potential approach for cosine-similarity search over titles (BERT embeddings in place of TF-IDF vectors):
1. Extract the titles from all JSON documents.
2. Compute a BERT embedding for each title.
3. Compute a BERT embedding for the input query.
4. Use cosine similarity to find the title embeddings closest to the query embedding. This yields the list of titles most similar to the input query.

This could later be extended to search the body text. Recommend starting with a small amount of self-generated text to test the approach.
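The ranking step can be sketched with plain NumPy. The vectors below are made-up placeholders; in practice they would come from BERT, and how a title is reduced to a single vector (for example, pooling the hidden states) is an assumption the notes leave open. `rank_by_cosine` is a hypothetical helper name.

```python
import numpy as np

def rank_by_cosine(query_vec, title_vecs, titles, top_k=3):
    """Return the top_k (title, similarity) pairs, most similar first.

    Assumes every vector is non-zero; a zero vector would divide by zero.
    """
    query_vec = np.asarray(query_vec, dtype=float)
    title_vecs = np.asarray(title_vecs, dtype=float)
    # Normalise to unit length so the dot product equals cosine similarity
    q = query_vec / np.linalg.norm(query_vec)
    t = title_vecs / np.linalg.norm(title_vecs, axis=1, keepdims=True)
    sims = t @ q
    # Sort by descending similarity and keep the first top_k indices
    order = np.argsort(-sims)[:top_k]
    return [(titles[i], float(sims[i])) for i in order]

# Placeholder data standing in for real title and query embeddings
titles = ["title A", "title B", "title C"]
title_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ranked = rank_by_cosine([2.0, 0.1], title_vecs, titles, top_k=2)
```

Testing on a handful of hand-written titles first, as suggested above, makes it easy to check by eye that the ranking is sensible before running it on the full dataset.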

In [11]:
import pandas as pd

In [12]:
data = pd.read_csv("Data/clean_pmc.csv", nrows=5)

In [13]:
data.head()

| Unnamed: 0 | paper_id | title | authors | affiliations | abstract | text | bibliography | raw_authors | raw_bibliography |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 14572a7a9b3e92b960d92d9755979eb94c448bb5 | Immune Parameters of Dry Cows Fed Mannan Oligo... | S T Franklin, M C Newman, K E Newman, K I Meek | S T Franklin (University of Kentucky, 40546-02... | Abstract\n\nThe objective of this study was to... | INTRODUCTION\n\nThe periparturient period is a... | Immune response of pregnant heifers and cows t... | [{'first': 'S', 'middle': ['T'], 'last': 'Fran... | {'BIBREF0': {'ref_id': 'b0', 'title': 'Immune ... |
| 1 | bb790e8366da63c4f5e2d64fa7bbd5673b93063c | Discontinuous Transcription or RNA Processing ... | Beate Schwer, Paolo Vista, Jan C Vos, Hendrik ... | Beate Schwer, Paolo Vista, Jan C Vos, Hendrik ... | | Discontinuous\n\nTranscription or RNA Processi... | Poly (riboadenylic acid) preferentially inhibi... | [{'first': 'Beate', 'middle': [], 'last': 'Sch... | {'BIBREF0': {'ref_id': 'b0', 'title': 'Poly (r... |
| 2 | 24f204ce5a1a4d752dc9ea7525082d225caed8b3 | | | | | Letter to the Editor\n\nThe non-contact handhe... | Novel coronavirus is putting the whole world o... | [] | {'BIBREF0': {'ref_id': 'b0', 'title': 'Novel c... |
| 3 | f5bc62a289ef384131f592ec3a8852545304513a | Pediatric Natural Deaths 30 | Elizabeth C Burton, Nicole A Singer | Elizabeth C Burton (Johns Hopkins University S... | | Introduction\n\nWorldwide, the leading causes ... | In athletes who experienced sudden death or in... | [{'first': 'Elizabeth', 'middle': ['C'], 'last... | {'BIBREF0': {'ref_id': 'b0', 'title': 'In athl... |
| 4 | ab78a42c688ac199a2d5669e42ee4c39ff0df2b8 | A real-time convective PCR machine in a capill... | Yi-Fan Hsieh, Da-Sheng Lee, Ping-Hei Chen, Sha... | Yi-Fan Hsieh (National Taiwan University, 106,... | Abstract\n\nThis research reports the design, ... | Introduction\n\nMullis et al. developed the po... | The Polymerase Chain Reaction, K B Mullis, F F... | [{'first': 'Yi-Fan', 'middle': [], 'last': 'Hs... | {'BIBREF0': {'ref_id': 'b0', 'title': 'The Pol... |


In [62]:
text = data['abstract'][0]
question = "What is the periparturient period important for?"

In [66]:
len(tokenizer.tokenize(text))

407
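This abstract fits within BERT's 512-token input limit, but the question tokens and the three special tokens ([CLS] and two [SEP]) count against the same limit. A minimal sketch of the budget arithmetic on plain token lists follows; `truncate_context` is a hypothetical helper, and the token lists are placeholders rather than real tokenizer output.

```python
MAX_LEN = 512      # BERT input limit, including special tokens
NUM_SPECIAL = 3    # [CLS] question [SEP] context [SEP]

def truncate_context(question_tokens, context_tokens, max_len=MAX_LEN):
    """Trim the context so question + context + special tokens fit in max_len."""
    budget = max_len - NUM_SPECIAL - len(question_tokens)
    if budget <= 0:
        raise ValueError("question alone exceeds the token limit")
    # Slicing leaves short contexts unchanged and cuts long ones at the budget
    return context_tokens[:budget]

# Placeholder token lists: a 10-token question with a long and a short context
question_tokens = ["what"] * 10
long_context = ["tok"] * 600
short_context = ["tok"] * 407
kept_long = truncate_context(question_tokens, long_context)
kept_short = truncate_context(question_tokens, short_context)
```

The Hugging Face tokenizer can apply the same cut itself: `tokenizer(question, text, truncation="only_second", max_length=512, return_tensors="tf")` truncates only the context and leaves the question intact. Either way, an answer that sits past the cut is lost, so a sliding window over the text may be worth considering later.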

In [65]:
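# Tokenize the question/context pair and run the QA model
inputs = tokenizer(question, text, return_tensors="tf")
outputs = model(**inputs)
# Most likely start and end token positions of the answer
answer_start_index = int(tf.math.argmax(outputs.start_logits, axis=-1)[0])
answer_end_index = int(tf.math.argmax(outputs.end_logits, axis=-1)[0])
# Slice out the answer tokens (end index inclusive) and decode them to text
predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
tokenizer.decode(predict_answer_tokens)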

'immune function of the cows and subsequent transfer of passive immunity to their calves'

**Next steps**:

- Take a subsample of the data (first 5 rows) and try extracting answers. Use abstracts rather than body text to stay within the 512-token limit.
- Write code to truncate inputs to 512 tokens in case an abstract is longer.
- Redraft the plan to extract the most commonly mentioned risk factors (the list of risk factors may need adjusting to match what the sample data returns).
- Plan how to return candidate papers once the BERT methodology (the steps above) is settled.
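As a possible starting point for the risk-factor step, one option is to count how often each candidate term appears in the extracted answers. The answers and the term list below are made-up placeholders, not results from the data, and `count_risk_factors` is a hypothetical helper.

```python
from collections import Counter

def count_risk_factors(answers, risk_factors):
    """Count the answers in which each risk-factor term appears (case-insensitive)."""
    counts = Counter()
    for answer in answers:
        lowered = answer.lower()
        for factor in risk_factors:
            # Crude substring match: "age" would also match "passage"
            if factor.lower() in lowered:
                counts[factor] += 1
    return counts

# Placeholder inputs standing in for model answers and a curated term list
answers = ["Diabetes and obesity", "older age", "obesity", ""]
risk_factors = ["diabetes", "obesity", "age", "smoking"]
counts = count_risk_factors(answers, risk_factors)
most_common = counts.most_common(1)
```

Substring matching is only a first pass. Matching on word boundaries or on a normalised vocabulary would cut false positives once the real answers have been inspected.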