In [1]:
import pandas as pd
pd.set_option('display.width',None)

## 1. Install 

This example uses the `transformers` [library](https://github.com/huggingface/transformers/) by Hugging Face. We'll start by installing the package.

In [2]:
!pip install transformers



In [3]:
import torch

## 2. Load Fine-Tuned BERT-large

For Question Answering we use the `BertForQuestionAnswering` class from the `transformers` library.

This class supports fine-tuning, but for this example we will keep things simpler and load a BERT model that has already been fine-tuned for the SQuAD benchmark.

The `transformers` library has a large collection of pre-trained models which you can reference by name and load easily. The full list is in their documentation [here](https://huggingface.co/transformers/pretrained_models.html).

For Question Answering, they have a version of BERT-large that has already been fine-tuned for the SQuAD benchmark. 

BERT-large is really big: it has 24 layers and an embedding size of 1,024, for a total of about 340M parameters. Altogether it is 1.34GB, so expect it to take a couple of minutes to download to your Colab instance.
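
As a rough sanity check on those numbers, the parameter count can be estimated from the architecture. This is only a sketch: the vocabulary size (30,522), the feed-forward size (4,096) and the maximum sequence length (512) are assumed values for `bert-large-uncased` and are not stated in this notebook. The small span-prediction head used for question answering is ignored.

```python
# Rough parameter count for BERT-large (assumed hyperparameters, see the text above).
vocab_size = 30522       # assumed: WordPiece vocabulary of bert-large-uncased
max_positions = 512      # assumed: maximum sequence length
type_vocab = 2           # segment A / segment B
hidden = 1024            # embedding size (from the text)
intermediate = 4096      # assumed: feed-forward size, 4 * hidden
num_layers = 24          # from the text

layer_norm = 2 * hidden  # scale + bias


def linear(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


# Token, position and segment embeddings, followed by a layer norm.
embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
# Query, key, value and output projections, followed by a layer norm.
attention = 4 * linear(hidden, hidden) + layer_norm
# Two-layer feed-forward block, followed by a layer norm.
feed_forward = linear(hidden, intermediate) + linear(intermediate, hidden) + layer_norm
pooler = linear(hidden, hidden)

total_params = embeddings + num_layers * (attention + feed_forward) + pooler
size_gb = total_params * 4 / 1e9  # 4 bytes per float32 parameter

print(f"{total_params:,} parameters")
print(f"{size_gb:.2f} GB as float32")
```

Under these assumptions the estimate comes out at roughly 335M parameters and about 1.34 GB in 32-bit floats, which is consistent with the figures quoted above.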

(Note that this download is not using your own network bandwidth--it's between the Google instance and wherever the model is stored on the web).

Note: I believe this model was trained on version 1 of SQuAD, since it's not outputting whether the question is "impossible" to answer from the text (which is part of the task in v2 of SQuAD).


In [4]:
from transformers import BertForQuestionAnswering

model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')


Load the tokenizer as well. 

Side note: The vocabulary of this model appears to be identical to the one in `bert-base-uncased`, so loading the tokenizer from `bert-base-uncased` works just as well.

In [5]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')





# Data Load

In [6]:
import pandas as pd

In [7]:
clean_pdf_df = pd.read_csv("/home/jupyter/covid19-challenge/data/clean_doc_pdf.csv")

# Data Exploration

In [8]:
clean_pdf_df.shape

(59561, 9)

All rows share the same structure.

In [9]:
clean_pdf_df.head(2)

Unnamed: 0,paper_id,title,authors,affiliations,abstract,text,bibliography,raw_authors,raw_bibliography
0,1d7234df1a19a0cf0b4038ceb31e7a97881cdbd7,Virtual application of in situ simulation duri...,"Erich Hanel, E M , Monika Bilic, Kelly Hassall...","Erich Hanel, E M , Monika Bilic, Kelly Hassall...",Abstract\n\nThe coronavirus disease 2019 (COVI...,\n\nRÉSUMÉ L'arrivée de la pandémie causée par...,In situ simulation and its effects on patient ...,"[{'first': 'Erich', 'middle': [], 'last': 'Han...","{'BIBREF0': {'ref_id': 'b0', 'title': 'In situ..."
1,0ced1f946cce007aa319a0ba38aef2c4b14dab0e,BMC Immunology Identification of a novel conse...,"Yanbo Lv, Zhihua Ruan, Li Wang, Bing Ni, Yuzha...","Yanbo Lv (Third Military Medical University, 3...",Abstract\n\nThe spike (S) protein is a major s...,Background\n\nSevere acute respiratory syndrom...,Identification of a novel coronavirus in patie...,"[{'first': 'Yanbo', 'middle': [], 'last': 'Lv'...","{'BIBREF0': {'ref_id': 'b0', 'title': 'Identif..."


# Transform text:
Text column of a dataframe --> dictionary (keys: section name, values: content)


In [10]:
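def df_text_to_dict(df, col_name, output_col, print_it=False):
    '''
    Split each row's text on blank lines into alternating heading/body chunks
    and store them as a dict in `output_col`. Keys look like '<index>_<heading>'.
    Note: the new column is added to `df` in place.
    '''
    text_list = []
    for i in df.index:
        line = df.loc[i]
        text = str(line[col_name])
        text_dict = dict()
        comp_list = text.split("\n\n")
        # Chunks are assumed to alternate heading, body. A trailing
        # unpaired chunk is dropped.
        for num in range(len(comp_list) // 2):
            key_str = (str(num) + '_' + str(comp_list[num*2])).strip()
            text_dict[key_str] = str(comp_list[num*2 + 1])
        text_list.append(text_dict)
        if print_it:
            print(i)
    df[output_col] = text_list
    return df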
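
To see what this pairing does without loading the dataset, here is a minimal sketch of the same split applied to a single string. The helper name `text_to_sections` and the sample strings are made up for illustration.

```python
def text_to_sections(text):
    """Pair up chunks separated by blank lines as heading -> body."""
    chunks = text.split("\n\n")
    sections = {}
    for num in range(len(chunks) // 2):
        key = (str(num) + "_" + chunks[2 * num]).strip()
        sections[key] = chunks[2 * num + 1]
    return sections


# Hypothetical sample text, not taken from the dataset.
sample = "Background\n\nFirst body.\n\nMethods\n\nSecond body.\n\nLeftover"
sections = text_to_sections(sample)
print(sections)

# A text that starts with a blank line has an empty first heading.
no_heading = text_to_sections("\n\nNo heading here")
print(no_heading)
```

Two things are worth noticing: an odd chunk at the end (`Leftover`) is silently dropped, and a text that begins with a blank line produces the key `0_`, which is where the `0_` keys in the real data come from.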

## Add text_dict

In [11]:
ap_dataframe=df_text_to_dict(clean_pdf_df,"text","text_dict",False)

In [12]:
ap_dataframe.head(2)

Unnamed: 0,paper_id,title,authors,affiliations,abstract,text,bibliography,raw_authors,raw_bibliography,text_dict
0,1d7234df1a19a0cf0b4038ceb31e7a97881cdbd7,Virtual application of in situ simulation duri...,"Erich Hanel, E M , Monika Bilic, Kelly Hassall...","Erich Hanel, E M , Monika Bilic, Kelly Hassall...",Abstract\n\nThe coronavirus disease 2019 (COVI...,\n\nRÉSUMÉ L'arrivée de la pandémie causée par...,In situ simulation and its effects on patient ...,"[{'first': 'Erich', 'middle': [], 'last': 'Han...","{'BIBREF0': {'ref_id': 'b0', 'title': 'In situ...",{'0_': 'RÉSUMÉ L'arrivée de la pandémie causée...
1,0ced1f946cce007aa319a0ba38aef2c4b14dab0e,BMC Immunology Identification of a novel conse...,"Yanbo Lv, Zhihua Ruan, Li Wang, Bing Ni, Yuzha...","Yanbo Lv (Third Military Medical University, 3...",Abstract\n\nThe spike (S) protein is a major s...,Background\n\nSevere acute respiratory syndrom...,Identification of a novel coronavirus in patie...,"[{'first': 'Yanbo', 'middle': [], 'last': 'Lv'...","{'BIBREF0': {'ref_id': 'b0', 'title': 'Identif...",{'0_Background': 'Severe acute respiratory syn...


In [13]:
sm_ap_dataframe=ap_dataframe[["paper_id","text_dict"]]

In [14]:
sm_ap_dataframe.loc[1]["text_dict"].keys()

dict_keys(['0_Background', '1_Selection of potential HLA-A*0201 binding peptides derived from SARS-CoV/S protein', '2_Ex vivo generation of peptide-specific CD8 + CTLs from healthy human donor PBLs', '3_In vivo induction of peptide-specific CD8 + CTLs in HLA-A2.1/K b transgenic mice', '4_Discussion', '5_Conclusion', '6_Peptides', '7_Cells and Cell Culture', '8_Animals', '9_Generation of CTLs in healthy donors', '10_Generation of CTLs in HLA-A2.1/K b transgenic mice', '11_Cytotoxicity assays', '12_Enzyme-linked immunosorbent spot (ELISPOT) assay'])

In [15]:
sm_ap_dataframe.shape

(59561, 2)

# Create `appended dataframe` with `text_dict` and `paper_id` only

In [16]:
sm_df=sm_ap_dataframe[["paper_id","text_dict"]].copy()

In [17]:
sm_df

Unnamed: 0,paper_id,text_dict
0,1d7234df1a19a0cf0b4038ceb31e7a97881cdbd7,{'0_': 'RÉSUMÉ L'arrivée de la pandémie causée...
1,0ced1f946cce007aa319a0ba38aef2c4b14dab0e,{'0_Background': 'Severe acute respiratory syn...
2,2a667552f5924857d6eaf9a8104e5f92e236dde9,{'0_': 'guide for their recognition and preven...
3,2f57fc005f74c28402fe57d88144a855ee09efb2,{'0_Introduction': 'The Internet of Things (Io...
4,238a7184d9d9876b23b92b957489b91c98365c39,"{'0_': 'pathogenesis of CIA, we have estabbshe..."
...,...,...
59556,27f21a68a6f074f434e506aee9492fa2c772847c,{'0_Background': 'Infectious bronchitis (IB) i...
59557,84af97d38b48f51f0ff1ed443262f5fa8d25376f,{'0_IntroductIon': 'First they came for the so...
59558,53d4c2f0d9ceb548cd3ccdd72490b0c7a5da9d63,{'0_': 'Summary Objectives: To investigate 2 p...
59559,f6312e7233ad4c0780492db3c05a61abe4fb6b22,{'0_Introduction': 'The majority of human coro...


## Test on a small subset (`sm_sm_df`)

In [18]:
# .loc slicing is label-based and inclusive, so this keeps 11 rows (labels 0 to 10).
sm_sm_df=sm_df.loc[0:10].copy()

### Track time:

In [19]:
'''
import time    

start = start = time.time()
with open("test.txt", 'w') as f:
    for i in range(10000000):
        # print('This is a speed test', file=f)
        # f.write('This is a speed test\n')
end = time.time()
print(end - start)
'''

'\nimport time    \n\nstart = start = time.time()\nwith open("test.txt", \'w\') as f:\n    for i in range(10000000):\n        # print(\'This is a speed test\', file=f)\n        # f.write(\'This is a speed test\n\')\nend = time.time()\nprint(end - start)\n'

In [20]:
import json

In [22]:
#json.loads(json.dumps(sm_sm_df.loc[5].to_dict()))

### JSON Write

In [23]:
with open("test_sm_sm_json_line.txt", 'w') as f:
    for i in sm_sm_df.index:
        json_content=json.dumps(sm_sm_df.loc[i].to_dict())
        f.write(json_content)
        f.write("\n")

### JSON Load

In [24]:
with open("test_sm_sm_json_line.txt", 'r') as f:
    for line in f:
        try:
            json.loads(line)
            print("yes")
        except json.JSONDecodeError:
            # The line is not valid JSON.
            print("no")

yes
yes
yes
yes
yes
yes
yes
yes
yes
yes
yes


## End of test

# Actual Run: Save and Load

Save the full data to a file with one JSON object per line:

In [30]:
with open("sm_df.txt", 'w') as f:
    for i in sm_df.index:
        json_content=json.dumps(sm_df.loc[i].to_dict())
        f.write(json_content)
        f.write("\n")
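
The one-object-per-line layout can be checked end to end on a couple of records. The sketch below uses made-up records shaped like the rows of `sm_df` and a temporary file, so it does not touch `sm_df.txt`.

```python
import json
import os
import tempfile

# Hypothetical records shaped like the rows of `sm_df`.
records = [
    {"paper_id": "a1", "text_dict": {"0_Background": "First body.\nSecond line."}},
    {"paper_id": "b2", "text_dict": {"0_": "Résumé en français."}},
]

path = os.path.join(tempfile.mkdtemp(), "records.txt")

# Write one JSON object per line.
with open(path, "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read the file back line by line.
loaded = []
with open(path, "r") as f:
    for line in f:
        loaded.append(json.loads(line))

print(loaded == records)
```

This works because `json.dumps` escapes newlines inside strings, so each record stays on a single line even when the paragraph text itself contains line breaks.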

Test Loading:

In [32]:
'''
with open("sm_df.txt", 'r') as f:
    for line in f:
        try: #print(line)
            json.loads(line)
            print("yes")
        except:
            #print(line)
            print("no")
'''

'\nwith open("sm_df.txt", \'r\') as f:\n    for line in f:\n        try: #print(line)\n            json.loads(line)\n            print("yes")\n        except:\n            #print(line)\n            print("no")\n'

# Test Q_A Function

In [33]:
def answer_question(question, answer_text):
    '''
    Takes a `question` string and an `answer_text` string (which contains the
    answer), and identifies the words within the `answer_text` that are the
    answer. Returns a list: [answer, highest start score (as a string),
    number of input tokens].
    '''
    # ======== Tokenize ========
    # Apply the tokenizer to the input text, treating them as a text-pair.
    # Inputs longer than 500 tokens are cut off at 500 (BERT accepts at most 512).
    input_ids = tokenizer.encode(question, answer_text, max_length=500)

    # Report how long the input sequence is.
    #print('Query has {:,} tokens.\n'.format(len(input_ids)))

    # ======== Set Segment IDs ========
    # Search the input_ids for the first instance of the `[SEP]` token.
    sep_index = input_ids.index(tokenizer.sep_token_id)

    # The number of segment A tokens includes the [SEP] token itself.
    num_seg_a = sep_index + 1

    # The remainder are segment B.
    num_seg_b = len(input_ids) - num_seg_a

    # Construct the list of 0s and 1s.
    segment_ids = [0]*num_seg_a + [1]*num_seg_b

    # There should be a segment_id for every input token.
    assert len(segment_ids) == len(input_ids)

    # ======== Evaluate ========
    # Run our example question through the model.
    start_scores, end_scores = model(torch.tensor([input_ids]), # The tokens representing our input text.
                                    token_type_ids=torch.tensor([segment_ids])) # The segment IDs to differentiate question from answer_text

    # ======== Reconstruct Answer ========
    # Find the tokens with the highest `start` and `end` scores.
    answer_start = torch.argmax(start_scores)
    answer_end = torch.argmax(end_scores)
    
    
    # Get the string versions of the input tokens.
    tokens = tokenizer.convert_ids_to_tokens(input_ids)

    # Start with the first token.
    answer = tokens[answer_start]

    # Select the remaining answer tokens and join them with whitespace.
    for i in range(answer_start + 1, answer_end + 1):
        
        # If it's a subword token, then recombine it with the previous token.
        if tokens[i][0:2] == '##':
            answer += tokens[i][2:]
        
        # Otherwise, add a space then the token.
        else:
            answer += ' ' + tokens[i]
            
    s_scores = start_scores.detach().numpy().flatten()
    e_scores = end_scores.detach().numpy().flatten()
    #print('score:'+(start_scores)+"; "+str(end_scores))
    #print('score:'+str(max(s_scores))+"; "+str(min(e_scores)))
    #print(str(tensor[torch.argmax(start_scores)]))
    #print('Answer: "' + answer + '"')
    return [answer,str(max(s_scores)),len(input_ids)]
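
The segment-ID and answer-reconstruction steps of `answer_question` can be tried in isolation, without the model. The sketch below uses a hand-written token list and hard-coded start/end positions; both are made up for illustration, and in a real run they would come from the tokenizer and from the argmax of the model's scores.

```python
# Hand-written tokens for illustration; a real run would get these from the tokenizer.
tokens = ["[CLS]", "what", "is", "hyper", "##tension", "?", "[SEP]",
          "hyper", "##tension", "is", "high", "blood", "pressure", ".", "[SEP]"]

# Segment IDs: 0 up to and including the first [SEP], 1 for everything after it.
sep_index = tokens.index("[SEP]")
num_seg_a = sep_index + 1
num_seg_b = len(tokens) - num_seg_a
segment_ids = [0] * num_seg_a + [1] * num_seg_b


def join_wordpieces(tokens, start, end):
    """Join tokens[start..end] (inclusive), gluing '##' pieces to the previous token."""
    answer = tokens[start]
    for tok in tokens[start + 1:end + 1]:
        if tok.startswith("##"):
            answer += tok[2:]
        else:
            answer += " " + tok
    return answer


# Hypothetical start/end positions standing in for the argmax of the model scores.
answer = join_wordpieces(tokens, 10, 12)
subword_answer = join_wordpieces(tokens, 7, 8)

print(segment_ids)
print(answer)
print(subword_answer)
```

With these inputs the question occupies the first seven positions (segment 0), the span 10 to 12 reads "high blood pressure", and the span 7 to 8 is glued back into the single word "hypertension".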

In [34]:
'''
for i in clean_pdf_df.text:
    answer_question("What do we know about Hypertension?",i)
'''

'\nfor i in clean_pdf_df.text:\n    answer_question("What do we know about Hypertension?",i)\n'

In [35]:
Q_example="What do we know about Hypertension?"

In [36]:
Q_example_a="[CLS] what do we know about hypertension ? [SEP]"

In [38]:
#sm_ap_dataframe[["paper_id","text_dict"]].to_csv("sm_appended_df.csv",sep='`', index=False, header=False)

In [None]:
#pd.read_csv("sm_appended_df.csv",sep='`',header=None)

In [39]:
for i in ap_dataframe.index:
    line=ap_dataframe.loc[i]
    dct=line["text_dict"]
    for key in dct.keys():
        # Run the model once per section and unpack all three results.
        answer, s_score, token_len = answer_question(Q_example, dct[key])
        # Skip empty answers ([CLS]), low-confidence answers, and echoes of the question.
        if answer != "[CLS]" and float(s_score) > 0 and answer != Q_example_a:
            print(answer)
            print(s_score)
            print(token_len)

.
1.3515848
500


KeyboardInterrupt: 

Just to see exactly what the tokenizer is doing, let's print out the tokens with their IDs.

In [335]:
#sm_ap_dataframe[["paper_id","text_dict"]].to_json("sm_appended_df.json")