In [1]:
import pandas as pd
pd.set_option('display.width',None)

## 1. Install 

This example uses the `transformers` [library](https://github.com/huggingface/transformers/) by Hugging Face. We'll start by installing the package.

In [4]:
!pip install transformers

Collecting transformers
  Using cached transformers-2.10.0-py3-none-any.whl (660 kB)
Collecting tokenizers==0.7.0
  Using cached tokenizers-0.7.0-cp36-cp36m-manylinux1_x86_64.whl (3.8 MB)
Processing /home/ubuntu/.cache/pip/wheels/49/25/98/cdea9c79b2d9a22ccc59540b1784b67f06b633378e97f58da2/sacremoses-0.0.43-py3-none-any.whl
Collecting sentencepiece
  Using cached sentencepiece-0.1.91-cp36-cp36m-manylinux1_x86_64.whl (1.1 MB)
Collecting regex!=2019.12.17
  Using cached regex-2020.5.14-cp36-cp36m-manylinux2010_x86_64.whl (675 kB)
Installing collected packages: tokenizers, regex, sacremoses, sentencepiece, transformers
Successfully installed regex-2020.5.14 sacremoses-0.0.43 sentencepiece-0.1.91 tokenizers-0.7.0 transformers-2.10.0


In [2]:
import torch

## 2. Load Fine-Tuned BERT-large

For Question Answering we use the `BertForQuestionAnswering` class from the `transformers` library.

This class supports fine-tuning, but for this example we will keep things simpler and load a BERT model that has already been fine-tuned for the SQuAD benchmark.

The `transformers` library has a large collection of pre-trained models which you can reference by name and load easily. The full list is in their documentation [here](https://huggingface.co/transformers/pretrained_models.html).

For Question Answering, they have a version of BERT-large that has already been fine-tuned for the SQuAD benchmark. 

BERT-large is really big: it has 24 layers and an embedding size of 1,024, for a total of 340M parameters. Altogether it is 1.34GB, so expect it to take a couple of minutes to download to your Colab instance.

(Note that this download does not use your own network bandwidth; it runs between the Google instance and wherever the model is stored on the web.)

Note: I believe this model was trained on version 1 of SQuAD, since it's not outputting whether the question is "impossible" to answer from the text (which is part of the task in v2 of SQuAD).


In [3]:
from transformers import BertForQuestionAnswering

model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')


Load the tokenizer as well. 

Side note: Apparently the vocabulary of this model is identical to the one in `bert-base-uncased`. You can load the tokenizer from `bert-base-uncased` and that works just as well.

In [4]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

# Data Load

In [5]:
import pandas as pd

In [6]:
clean_pdf_df = pd.read_csv("../data/clean_doc_pdf.csv")

# Data Exploration

In [7]:
clean_pdf_df.shape

(59561, 9)

All rows share the same column structure:

In [8]:
clean_pdf_df.head(2)

Unnamed: 0,paper_id,title,authors,affiliations,abstract,text,bibliography,raw_authors,raw_bibliography
0,0001418189999fea7f7cbe3e82703d71c85a6fe5,Absence of surface expression of feline infect...,"E Cornelissen, H L Dewerchin, E Van Hamme, H J...","E Cornelissen (Ghent University, Salisburylaan...",Abstract\n\nFeline infectious peritonitis viru...,Introduction\n\nFeline infectious peritonitis ...,Using direct immunofluorescence to detect coro...,"[{'first': 'E', 'middle': [], 'last': 'Corneli...","{'BIBREF0': {'ref_id': 'b0', 'title': 'Using d..."
1,0003793cf9e709bc2b9d0c8111186f78fb73fc04,Title: Rethinking high-risk groups in COVID-19,"Anastasia Vishnevetsky, Michael Levy",Anastasia Vishnevetsky (Massachusetts General ...,,\n\nHow do we protect our 'high-risk' patient ...,"COVID-19), , Centers for Disease Control and P...","[{'first': 'Anastasia', 'middle': [], 'last': ...","{'BIBREF0': {'ref_id': 'b0', 'title': 'COVID-1..."


# Transform text:
Text column of a dataframe --> dictionary (keys: section index and heading, values: section content)


In [9]:
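def df_text_to_dict(df, col_name, output_col, print_it=False):
    """Add `output_col` to `df`: one dict per row mapping '<n>_<heading>' to section text."""
    text_list = []
    for i in df.index:
        text = str(df.loc[i, col_name])
        # Headings and bodies alternate in the text, separated by blank lines.
        comp_list = text.split("\n\n")
        text_dict = {}
        # Pair each heading with the body that follows it; a trailing unpaired chunk is dropped.
        for num in range(len(comp_list) // 2):
            # Prefix the heading with its position so repeated headings stay unique.
            key_str = (str(num) + '_' + comp_list[num * 2]).strip()
            text_dict[key_str] = comp_list[num * 2 + 1]
        text_list.append(text_dict)
        if print_it:
            print(i)
    df[output_col] = text_list
    return df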
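
To make the heading/body pairing concrete, here is a minimal standalone sketch of the same splitting logic applied to a single string. The helper name `text_to_dict` and the sample strings are made up for illustration; it assumes, as the function above does, that section headings and bodies alternate and are separated by blank lines.

```python
def text_to_dict(text):
    """Hypothetical single-string version of the pairing done in df_text_to_dict."""
    chunks = text.split("\n\n")
    result = {}
    # Even positions are treated as headings, odd positions as bodies.
    for num in range(len(chunks) // 2):
        key = (str(num) + "_" + chunks[num * 2]).strip()
        result[key] = chunks[num * 2 + 1]
    return result

# Made-up sample texts, not taken from the dataset.
with_headings = text_to_dict("Introduction\n\nFirst body.\n\nMethods\n\nSecond body.")
no_heading = text_to_dict("\n\nBody without a heading.")
print(with_headings)
print(no_heading)
```

A text that begins with a blank line has an empty first heading, which is where keys such as `'0_'` come from.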

## Add text_dict

In [10]:
ap_dataframe=df_text_to_dict(clean_pdf_df,"text","text_dict",False)

In [11]:
ap_dataframe.head(2)

Unnamed: 0,paper_id,title,authors,affiliations,abstract,text,bibliography,raw_authors,raw_bibliography,text_dict
0,0001418189999fea7f7cbe3e82703d71c85a6fe5,Absence of surface expression of feline infect...,"E Cornelissen, H L Dewerchin, E Van Hamme, H J...","E Cornelissen (Ghent University, Salisburylaan...",Abstract\n\nFeline infectious peritonitis viru...,Introduction\n\nFeline infectious peritonitis ...,Using direct immunofluorescence to detect coro...,"[{'first': 'E', 'middle': [], 'last': 'Corneli...","{'BIBREF0': {'ref_id': 'b0', 'title': 'Using d...",{'0_Introduction': 'Feline infectious peritoni...
1,0003793cf9e709bc2b9d0c8111186f78fb73fc04,Title: Rethinking high-risk groups in COVID-19,"Anastasia Vishnevetsky, Michael Levy",Anastasia Vishnevetsky (Massachusetts General ...,,\n\nHow do we protect our 'high-risk' patient ...,"COVID-19), , Centers for Disease Control and P...","[{'first': 'Anastasia', 'middle': [], 'last': ...","{'BIBREF0': {'ref_id': 'b0', 'title': 'COVID-1...",{'0_': 'How do we protect our 'high-risk' pati...


In [12]:
sm_ap_dataframe=ap_dataframe[["paper_id","text_dict"]]

In [13]:
sm_ap_dataframe.loc[1]["text_dict"].keys()

dict_keys(['0_'])

In [14]:
sm_ap_dataframe.shape

(59561, 2)

# Create `appended dataframe` with `text_dict` and `paper_id` only

In [15]:
sm_df=sm_ap_dataframe[["paper_id","text_dict"]].copy()

In [16]:
sm_df

Unnamed: 0,paper_id,text_dict
0,0001418189999fea7f7cbe3e82703d71c85a6fe5,{'0_Introduction': 'Feline infectious peritoni...
1,0003793cf9e709bc2b9d0c8111186f78fb73fc04,{'0_': 'How do we protect our 'high-risk' pati...
2,00039b94e6cb7609ecbddee1755314bcfeb77faa,{'0_INTRODUCTION': 'Severe acute respiratory s...
3,0004456994f6c1d5db7327990386d33c01cff32a,{'0_Background': 'The German standing commissi...
4,0004774b55eb0dad880aba9b572efe362660c5e0,"{'0_': '. So, if there is no singular definiti..."
...,...,...
59556,fff69e4894df7b4134bb2ddc830764459ac3edbe,{'0_INTRODUCTION': 'Coronavirus (CoV) infectio...
59557,fff6fe12beb51ee2641ddb5381378ff3560d8103,"{'0_Introduction': 'Worldwide, the adoption of..."
59558,fff6febdd287d474d7950b14faa899c4095557b3,"{'0_': 'pneumonia had not been the ""friend of ..."
59559,fff8b9e88db122ffcbaf1daf6b697e44eaaffd93,{'0_': 'Sir: Septic shock due to Mycobacterium...


## Test on a small subset (`sm_sm_df`)

In [17]:
sm_sm_df=sm_df.loc[0:10].copy()  # .loc slicing is label-based and inclusive, so this keeps 11 rows

### Track time:

In [23]:
'''
import time    

start = start = time.time()
with open("test.txt", 'w') as f:
    for i in range(10000000):
        # print('This is a speed test', file=f)
        # f.write('This is a speed test\n')
end = time.time()
print(end - start)
'''

'\nimport time    \n\nstart = start = time.time()\nwith open("test.txt", \'w\') as f:\n    for i in range(10000000):\n        # print(\'This is a speed test\', file=f)\n        # f.write(\'This is a speed test\n\')\nend = time.time()\nprint(end - start)\n'

In [18]:
import json

In [19]:
json.loads(json.dumps(sm_sm_df.loc[5].to_dict()))

{'paper_id': '00060fb61742ff60e4e3ba4648c74a34cfe9560d',
 'text_dict': {'0_Introduction': 'The outbreak of a novel coronavirus disease has led to the current global health crisis. As of late March, there have been over 315,000 cases worldwide and over 25,000 cases in the United States [1] . COVID-19 has been of concern to many health care workers (HCW\'s) in the emergency department (ED), due to potential exposure and the possibility of transmitting the virus to other patients and family members. Currently, personal protection equipment (PPE) recommendations are in flux and it is unclear if patients outside of "flu-like symptoms" require PPE [2] . Furthermore, as increasing asymptomatic and atypical presentations develop, clear guidance is needed for HCW\'s. In China, asymptomatic and atypical presentations with gastrointestinal symptoms have been reported [3] , and it has been suggested that up to 10% of COVID-19 patients present with gastrointestinal symptoms. However, due to limited

### Json Write

In [20]:
with open("test_sm_sm_json_line.txt", 'w') as f:
    for i in sm_sm_df.index:
        json_content=json.dumps(sm_sm_df.loc[i].to_dict())
        f.write(json_content)
        f.write("\n")

### Json load

In [21]:
with open("test_sm_sm_json_line.txt", 'r') as f:
    for line in f:
        try:
            json.loads(line)
            print("yes")  # the line parsed as valid JSON
        except json.JSONDecodeError:
            print("no")   # the line is not valid JSON

yes
yes
yes
yes
yes
yes
yes
yes
yes
yes
yes
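
The write and load cells above follow the JSON Lines pattern: one JSON object per line. As a rough self-contained sketch of that round trip, the snippet below uses an in-memory buffer instead of a file; the two records are made-up placeholders, not rows from the dataset.

```python
import io
import json

# Made-up records shaped like the rows of sm_sm_df (paper_id + text_dict).
records = [
    {"paper_id": "paper-a", "text_dict": {"0_Introduction": "Some text."}},
    {"paper_id": "paper-b", "text_dict": {"0_": "Text with a \n newline."}},
]

# Write: one json.dumps() result per line.
buffer = io.StringIO()
for record in records:
    buffer.write(json.dumps(record))
    buffer.write("\n")

# Read: parse each line independently.
buffer.seek(0)
loaded = [json.loads(line) for line in buffer]
print(loaded == records)
```

Because `json.dumps` escapes newlines inside strings, each record stays on a single line even when the paragraph text itself contains line breaks.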


## End of test

# Actual Code-Load

In [31]:
with open("sm_ap_df_json.txt", 'w') as f:
    # Iterate over the full dataframe rather than the 11-row test subset.
    for i in sm_df.index:
        json_content=json.dumps(sm_df.loc[i].to_dict())
        f.write(json_content)
        f.write("\n")

In [32]:
with open("test_sm_sm_json_line.txt", 'r') as f:
    for line in f:
        try:
            json.loads(line)
            print("yes")  # the line parsed as valid JSON
        except json.JSONDecodeError:
            print("no")   # the line is not valid JSON

yes
yes
yes
yes
yes
yes
yes
yes
yes
yes
yes


# Test Q_A Function

In [22]:
def answer_question(question, answer_text):
    '''
    Takes a `question` string and an `answer_text` string (which contains the
    answer), and identifies the words within the `answer_text` that are the
    answer. Returns `[answer, highest_start_score, number_of_input_tokens]`.
    '''
    # ======== Tokenize ========
    # Apply the tokenizer to the input text, treating them as a text-pair.
    input_ids = tokenizer.encode(question, answer_text,max_length=500
                                )

    # Report how long the input sequence is.
    #print('Query has {:,} tokens.\n'.format(len(input_ids)))

    # ======== Set Segment IDs ========
    # Search the input_ids for the first instance of the `[SEP]` token.
    sep_index = input_ids.index(tokenizer.sep_token_id)

    # The number of segment A tokens includes the [SEP] token itself.
    num_seg_a = sep_index + 1

    # The remainder are segment B.
    num_seg_b = len(input_ids) - num_seg_a

    # Construct the list of 0s and 1s.
    segment_ids = [0]*num_seg_a + [1]*num_seg_b

    # There should be a segment_id for every input token.
    assert len(segment_ids) == len(input_ids)

    # ======== Evaluate ========
    # Run our example question through the model.
    start_scores, end_scores = model(torch.tensor([input_ids]), # The tokens representing our input text.
                                    token_type_ids=torch.tensor([segment_ids])) # The segment IDs to differentiate question from answer_text

    # ======== Reconstruct Answer ========
    # Find the tokens with the highest `start` and `end` scores.
    answer_start = torch.argmax(start_scores)
    answer_end = torch.argmax(end_scores)
    
    
    # Get the string versions of the input tokens.
    tokens = tokenizer.convert_ids_to_tokens(input_ids)

    # Start with the first token.
    answer = tokens[answer_start]

    # Select the remaining answer tokens and join them with whitespace.
    for i in range(answer_start + 1, answer_end + 1):
        
        # If it's a subword token, then recombine it with the previous token.
        if tokens[i][0:2] == '##':
            answer += tokens[i][2:]
        
        # Otherwise, add a space then the token.
        else:
            answer += ' ' + tokens[i]
            
    s_scores = start_scores.detach().numpy().flatten()
    e_scores = end_scores.detach().numpy().flatten()
    #print('score:'+(start_scores)+"; "+str(end_scores))
    #print('score:'+str(max(s_scores))+"; "+str(min(e_scores)))
    #print(str(tensor[torch.argmax(start_scores)]))
    #print('Answer: "' + answer + '"')
    return [answer,str(max(s_scores)),len(input_ids)]
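
The answer-reconstruction step can be tried without loading the model. The sketch below repeats that logic on plain Python lists; the function name `extract_answer`, the tokens, and the scores are all invented for illustration and do not come from BERT.

```python
def extract_answer(tokens, start_scores, end_scores):
    """Join the tokens between the highest-scoring start and end positions."""
    # Pure-Python argmax over each score list.
    answer_start = max(range(len(start_scores)), key=start_scores.__getitem__)
    answer_end = max(range(len(end_scores)), key=end_scores.__getitem__)

    answer = tokens[answer_start]
    for i in range(answer_start + 1, answer_end + 1):
        if tokens[i].startswith("##"):
            # WordPiece continuation: glue it onto the previous token.
            answer += tokens[i][2:]
        else:
            answer += " " + tokens[i]
    return answer

tokens = ["[CLS]", "what", "is", "it", "?", "[SEP]",
          "hyper", "##tension", "is", "high", "blood", "pressure", "[SEP]"]

# Invented scores: the start peaks at "high" (index 9), the end at "pressure" (index 11).
start_scores = [0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.3, 0.0, 0.0, 2.5, 0.2, 0.1, 0.0]
end_scores = [0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.4, 0.0, 0.1, 0.3, 2.8, 0.0]
span_answer = extract_answer(tokens, start_scores, end_scores)

# Second case: the span covers a word split into two subword pieces (indices 6 and 7).
subword_answer = extract_answer(tokens, [0.0] * 6 + [1.0] + [0.0] * 6,
                                [0.0] * 7 + [1.0] + [0.0] * 5)
print(span_answer)
print(subword_answer)
```

As in the function above, an end position that falls before the start position yields only the start token, because the loop range is then empty.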

In [23]:
'''
for i in clean_pdf_df.text:
    answer_question("What do we know about Hypertension?",i)
'''

'\nfor i in clean_pdf_df.text:\n    answer_question("What do we know about Hypertension?",i)\n'

In [24]:
Q_example="What do we know about Hypertension?"

In [25]:
Q_example_a="[CLS] what do we know about hypertension ? [SEP]"

In [32]:
#sm_ap_dataframe[["paper_id","text_dict"]].to_csv("sm_appended_df.csv",sep='`', index=False, header=False)

In [28]:
# pd.read_csv("sm_appended_df.csv",sep='`',header=None)

In [29]:
for i in ap_dataframe.index:
    dct = ap_dataframe.loc[i, "text_dict"]
    for key in dct:
        # Run the model once per paragraph and unpack all three results.
        answer, s_score, token_len = answer_question(Q_example, dct[key])
        # Skip "no answer" ([CLS]), non-positive scores, and echoes of the question.
        if answer != "[CLS]" and float(s_score) > 0 and answer != Q_example_a:
            print(answer)
            print(s_score)
            print(token_len)

the percentage of dead fipv positive monocytes in pyogranulomas was significantly higher than the control cells
0.03644225
101
plasma cytokine and chemokine concentrations were not in a gaussian distribution
0.46187294
107
plasma pmn cc chemokine il - 8 , monocyte cc chemokine mcp - 1 and th1 cxc chemokine ip - 10 concentrations were elevated
0.12026076
119
our study results suggest that hcws in an acute hospital care setting are at no higher risk of influenza than the general public
1.0827159
202
how first responders and the authorities perceived the event ( s )
0.49143612
74
the frequency is unknown
2.6545522
72
the authors report no conflict of interest
1.3563815
19
.
2.3439858
450


KeyboardInterrupt: 

Just to see exactly what the tokenizer is doing, let's print out the tokens with their IDs.
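
As a rough illustration that does not need the model download, the sketch below prints a hand-written token list next to its IDs and segment IDs. Only `101` (`[CLS]`) and `102` (`[SEP]`) are real IDs from the uncased BERT vocabulary; the other IDs are placeholders, and `build_segment_ids` is a hypothetical helper that mirrors the segment logic in `answer_question`.

```python
SEP_ID = 102  # ID of [SEP] in the uncased BERT vocabulary

def build_segment_ids(input_ids, sep_id=SEP_ID):
    """Return 0 for tokens up to and including the first [SEP], and 1 afterwards."""
    sep_index = input_ids.index(sep_id)
    num_seg_a = sep_index + 1
    num_seg_b = len(input_ids) - num_seg_a
    return [0] * num_seg_a + [1] * num_seg_b

# Hand-written tokens; IDs other than 101 and 102 are placeholders, not real vocabulary IDs.
tokens = ["[CLS]", "what", "is", "bert", "?", "[SEP]",
          "a", "language", "model", "[SEP]"]
input_ids = [101, 1, 2, 3, 4, 102, 5, 6, 7, 102]

segment_ids = build_segment_ids(input_ids)
for token, token_id, segment in zip(tokens, input_ids, segment_ids):
    print(f"{token:<10} {token_id:>5}  segment {segment}")
```

With the real tokenizer, the token strings would come from `tokenizer.convert_ids_to_tokens(input_ids)`, as in `answer_question`.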

In [30]:
import json

import pandas as pd
import torch
from transformers import BertForQuestionAnswering, BertTokenizer

# Reload the model and tokenizer so the cells below can run on their own.
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

question="What is the best method to combat the hypercoagulable state seen in COVID-19 ?"

In [33]:
data_output=[]
with open("test_sm_sm_json_line.txt", 'r') as f:
    for line in f:
        con=dict(json.loads(line))
        print(type(con))

<class 'dict'>
<class 'dict'>
<class 'dict'>
<class 'dict'>
<class 'dict'>
<class 'dict'>
<class 'dict'>
<class 'dict'>
<class 'dict'>
<class 'dict'>
<class 'dict'>


In [34]:
data_output=[]
with open("test_sm_sm_json_line.txt", 'r') as f:
    for line in f:
        if len(data_output)>3:
            break
        con=json.loads(line)
        paper_id=con["paper_id"]
        text_dict=con["text_dict"]
        for key in text_dict.keys():
            content = text_dict[key]
            com_id=str(paper_id)+str(key)

            answer_text=str(content)        

            input_ids = tokenizer.encode(question, answer_text,max_length=500)

            # Report how long the input sequence is.
            #print('Query has {:,} tokens.\n'.format(len(input_ids)))

            # ======== Set Segment IDs ========
            # Search the input_ids for the first instance of the `[SEP]` token.
            sep_index = input_ids.index(tokenizer.sep_token_id)

            # The number of segment A tokens includes the [SEP] token itself.
            num_seg_a = sep_index + 1

            # The remainder are segment B.
            num_seg_b = len(input_ids) - num_seg_a

            # Construct the list of 0s and 1s.
            segment_ids = [0]*num_seg_a + [1]*num_seg_b

            # There should be a segment_id for every input token.
            assert len(segment_ids) == len(input_ids)

            # ======== Evaluate ========
            # Run our example question through the model.
            start_scores, end_scores = model(torch.tensor([input_ids]), # The tokens representing our input text.
                                            token_type_ids=torch.tensor([segment_ids])) # The segment IDs to differentiate question from answer_text

            # ======== Reconstruct Answer ========
            # Find the tokens with the highest `start` and `end` scores.
            answer_start = torch.argmax(start_scores)
            answer_end = torch.argmax(end_scores)


            # Get the string versions of the input tokens.
            tokens = tokenizer.convert_ids_to_tokens(input_ids)

            # Start with the first token.
            answer = tokens[answer_start]

            # Select the remaining answer tokens and join them with whitespace.
            for i in range(answer_start + 1, answer_end + 1):

                # If it's a subword token, then recombine it with the previous token.
                if tokens[i][0:2] == '##':
                    answer += tokens[i][2:]

                # Otherwise, add a space then the token.
                else:
                    answer += ' ' + tokens[i]

            s_scores = start_scores.detach().numpy().flatten()
            e_scores = end_scores.detach().numpy().flatten()  
            print(com_id)
            if answer != "[CLS]" and float(max(s_scores)) > 0 and len(answer) > 0:
                data_output.append([paper_id,key,answer,max(s_scores)])

0001418189999fea7f7cbe3e82703d71c85a6fe50_Introduction
0001418189999fea7f7cbe3e82703d71c85a6fe51_Cats with naturally occurring FIP
0001418189999fea7f7cbe3e82703d71c85a6fe52_Antibodies
0001418189999fea7f7cbe3e82703d71c85a6fe53_Isolation of FIPV positive cells
0001418189999fea7f7cbe3e82703d71c85a6fe54_Characterization of FIPV positive cells
0001418189999fea7f7cbe3e82703d71c85a6fe55_Cultivation of FIPV positive cells
0001418189999fea7f7cbe3e82703d71c85a6fe56_Statistics
0001418189999fea7f7cbe3e82703d71c85a6fe57_Results
0001418189999fea7f7cbe3e82703d71c85a6fe58_lane C). No difference in amount
0001418189999fea7f7cbe3e82703d71c85a6fe59_Discussion


In [35]:
data_output

[['0001418189999fea7f7cbe3e82703d71c85a6fe5',
  '1_Cats with naturally occurring FIP',
  'fcov antibody',
  2.663851],
 ['0001418189999fea7f7cbe3e82703d71c85a6fe5',
  '5_Cultivation of FIPV positive cells',
  'immunofluorescence staining',
  3.4416032],
 ['0001418189999fea7f7cbe3e82703d71c85a6fe5',
  '6_Statistics',
  'wilcoxon signed ranks test',
  1.8611692],
 ['0001418189999fea7f7cbe3e82703d71c85a6fe5',
  '8_lane C). No difference in amount',
  'viability staining',
  2.6267781]]
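
Since each entry carries its start score, one simple follow-up is to rank the collected answers. The sketch below copies the four rows shown above into a list and sorts them by score; treating the start score as a ranking signal is an assumption made here for illustration, not something this notebook establishes.

```python
# The four rows printed above, copied verbatim.
data_output = [
    ['0001418189999fea7f7cbe3e82703d71c85a6fe5',
     '1_Cats with naturally occurring FIP', 'fcov antibody', 2.663851],
    ['0001418189999fea7f7cbe3e82703d71c85a6fe5',
     '5_Cultivation of FIPV positive cells', 'immunofluorescence staining', 3.4416032],
    ['0001418189999fea7f7cbe3e82703d71c85a6fe5',
     '6_Statistics', 'wilcoxon signed ranks test', 1.8611692],
    ['0001418189999fea7f7cbe3e82703d71c85a6fe5',
     '8_lane C). No difference in amount', 'viability staining', 2.6267781],
]

# Highest start score first.
ranked = sorted(data_output, key=lambda row: row[3], reverse=True)
for paper_id, section, answer, score in ranked:
    print(f"{score:.3f}  {section}: {answer}")
```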

In [335]:
#sm_ap_dataframe[["paper_id","text_dict"]].to_json("sm_appended_df.json")