# Content

### This file: explore how to clean data, run QA, and load data for MRJob
### Official notebook: `BERT_QA_Master`

In [2]:
import pandas as pd
pd.set_option('display.width',None)

## 1. Install 

This example uses the `transformers` [library](https://github.com/huggingface/transformers/) from Hugging Face. We'll start by installing the package.

In [3]:
!pip install transformers



In [4]:
import torch

## 2. Load Fine-Tuned BERT-large

For Question Answering we use the `BertForQuestionAnswering` class from the `transformers` library.

This class supports fine-tuning, but for this example we will keep things simpler and load a BERT model that has already been fine-tuned for the SQuAD benchmark.

The `transformers` library has a large collection of pre-trained models which you can reference by name and load easily. The full list is in their documentation [here](https://huggingface.co/transformers/pretrained_models.html).

For Question Answering, they have a version of BERT-large that has already been fine-tuned for the SQuAD benchmark. 

BERT-large is really big: it has 24 layers and an embedding size of 1,024, for a total of 340M parameters! Altogether it is 1.34GB, so expect it to take a couple of minutes to download to your Colab instance.

(Note that this download is not using your own network bandwidth--it's between the Google instance and wherever the model is stored on the web).

Note: I believe this model was trained on version 1 of SQuAD, since it's not outputting whether the question is "impossible" to answer from the text (which is part of the task in v2 of SQuAD).


In [5]:
from transformers import BertForQuestionAnswering

model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')


Load the tokenizer as well. 

Side note: Apparently the vocabulary of this model is identical to the one in `bert-base-uncased`. You can load the tokenizer from `bert-base-uncased` and that works just as well.

In [6]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

# Data Load

In [7]:
import pandas as pd

In [8]:
clean_pdf_df = pd.read_csv("/home/ubuntu/covid19-challenge/data/clean_doc_pdf.csv")

# Data Exploration

In [9]:
clean_pdf_df.shape

(51868, 9)

All rows share the same structure: nine columns per paper.

In [10]:
clean_pdf_df.head(2)

Unnamed: 0,paper_id,title,authors,affiliations,abstract,text,bibliography,raw_authors,raw_bibliography
0,0ae02f293c03e3e1a2d4582e62c22f2c0c291f48,Development of animal models against emerging ...,"Troy C Sutton, Kanta Subbarao","Troy C Sutton (NIAID, NIH, United States), Kan...",Abstract\n\nTwo novel coronaviruses have emerg...,"Introduction\n\nWithin the last two decades, t...",Replication and shedding of MERS-CoV in upper ...,"[{'first': 'Troy', 'middle': ['C'], 'last': 'S...","{'BIBREF0': {'ref_id': 'b0', 'title': 'Replica..."
1,640de65e9f09545c463bc419bffb7084fc40fae5,X-RAY CRYSTALLOGRAPHIC STUDIES OF THE IDIOTYPI...,"Nenad Ban, Alexander Mcpherson","Nenad Ban (University of California, 92521, Ri...",,\n\n1. viral: type B viral hepatitis (Kennedy ...,"Three-dimensional structure of antibodies, P M...","[{'first': 'Nenad', 'middle': [], 'last': 'Ban...","{'BIBREF0': {'ref_id': 'b0', 'title': 'Three-d..."


# Transform text:
Text column of a dataframe --> dictionary (keys: paragraph name, values: content)


In [12]:
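def df_text_to_dict(df,col_name,output_col,print_it=False):
    # Build one {"<n>_<heading>": paragraph} dict per row of `col_name`.
    text_list=[]
    for i in df.index:
        text=str(df.loc[i, col_name])
        text_dict=dict()
        # Chunks alternate heading, body, heading, body, ...
        comp_list=text.split("\n\n")
        # A trailing chunk without a partner is dropped by the integer division.
        for num in range(len(comp_list)//2):
            key_str=(str(num)+'_'+str(comp_list[num*2])).strip()
            text_dict[key_str]=str(comp_list[num*2+1])
        text_list.append(text_dict)
        if print_it:
            print(i)
    # Note: this adds the new column to the dataframe that was passed in.
    df[output_col]=text_list
    return df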
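
To make the pairing rule concrete, here is a minimal standalone sketch of the same idea applied to a single string. It assumes the text alternates heading, body, heading, body, with blank lines in between. The name `split_sections` and the sample text are hypothetical and not part of the notebook.

```python
def split_sections(text):
    """Pair up alternating heading/body chunks separated by blank lines."""
    chunks = text.split("\n\n")
    sections = {}
    # Walk the chunks two at a time; a trailing unpaired chunk is dropped,
    # which mirrors the integer division in df_text_to_dict.
    for num in range(len(chunks) // 2):
        key = f"{num}_{chunks[2 * num]}".strip()
        sections[key] = chunks[2 * num + 1]
    return sections


example = "Introduction\n\nFirst paragraph.\n\nMethods\n\nSecond paragraph."
print(split_sections(example))
```

A text that starts with a blank line produces an empty first heading, which is where keys such as `'0_'` come from.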

## Add text_dict

In [13]:
ap_dataframe=df_text_to_dict(clean_pdf_df,"text","text_dict",False)

In [14]:
ap_dataframe.head(2)

Unnamed: 0,paper_id,title,authors,affiliations,abstract,text,bibliography,raw_authors,raw_bibliography,text_dict
0,0ae02f293c03e3e1a2d4582e62c22f2c0c291f48,Development of animal models against emerging ...,"Troy C Sutton, Kanta Subbarao","Troy C Sutton (NIAID, NIH, United States), Kan...",Abstract\n\nTwo novel coronaviruses have emerg...,"Introduction\n\nWithin the last two decades, t...",Replication and shedding of MERS-CoV in upper ...,"[{'first': 'Troy', 'middle': ['C'], 'last': 'S...","{'BIBREF0': {'ref_id': 'b0', 'title': 'Replica...",{'0_Introduction': 'Within the last two decade...
1,640de65e9f09545c463bc419bffb7084fc40fae5,X-RAY CRYSTALLOGRAPHIC STUDIES OF THE IDIOTYPI...,"Nenad Ban, Alexander Mcpherson","Nenad Ban (University of California, 92521, Ri...",,\n\n1. viral: type B viral hepatitis (Kennedy ...,"Three-dimensional structure of antibodies, P M...","[{'first': 'Nenad', 'middle': [], 'last': 'Ban...","{'BIBREF0': {'ref_id': 'b0', 'title': 'Three-d...",{'0_': '1. viral: type B viral hepatitis (Kenn...


In [15]:
sm_ap_dataframe=ap_dataframe[["paper_id","text_dict"]]

In [16]:
sm_ap_dataframe.loc[1]["text_dict"].keys()

dict_keys(['0_', '1_FabD1.3(Abl>-FabE225(Ab2) and FabD1.3-Lysozyme Complex', '2_FvD1.3(Abl)-FvE5.2(Ab2) and FabD1.3-Lysozyme complex', '3_B', '4_THE ANTI-FELINE INFECTIOUS PERITONITIS VIRUS SYSTEM', '5_Description of the Structure', '6_Idiotope-Anti-Idiotope Interface', '7_ANTI-ANGIOTENSIN II SYSTEM', '8_Number of Residues in a Loop', '9_V8 (IH)', '10_ANTI-LIPOPOLYSACCHARIDE A ANTIGEN OF BRUCELLA ABORTUS SYSTEM', '11_SELF COMPLEMENTARITY OF A MONO-CLONAL ANTIBODY GENERATED IN AN IDIOTYPIC CASCADE', '12_Structure of the Fab Fragment', '13_The Packing of the Fab Fragment and Self-Complementary Interactions', '14_Ag', '15_The Structure of the Complex', '16_The Interface between Self Complementary Antianti-idiotopes', '17_Possible Implications of the Self-Complementary Interaction'])

# To csv

In [42]:
sm_df=sm_ap_dataframe[["paper_id","text_dict"]].copy()

In [43]:
sm_df

Unnamed: 0,paper_id,text_dict
0,0ae02f293c03e3e1a2d4582e62c22f2c0c291f48,{'0_Introduction': 'Within the last two decade...
1,640de65e9f09545c463bc419bffb7084fc40fae5,{'0_': '1. viral: type B viral hepatitis (Kenn...
2,5da136317f5b97ed8371d5121d8828f1c9a5372d,{'0_Introduction': 'Malaria is a mosquito-born...
3,f9ae3db6ac88670b3f47b815bb7422a75f6d47c8,{'0_Introduction': 'Nearly 3 million confirmed...
4,a8676c57d7e3a52378b9e554cc0886ad91999e13,"{'0_': 'ziektegeschiedenis Patiënt A, een 29-j..."
...,...,...
51863,7ed6060dd9d540cbf92b794305429695efb775ce,{'0_': 'The outbreak and spread of coronavirus...
51864,2a35742783198a179b36b29a45fb3a7a28663026,{'0_': 'Cats have the ability to control postu...
51865,d119b09f850ebbfd1fef16f22c8eec38adcc684c,"{'0_| INTRODUCTION': 'In recent years, with th..."
51866,8624ce91f316d2aae5c09273f9308cc08ffcc25c,"{'0_Introduction, methods, and results': 'Porc..."


## Test `dict()` to `str()`: Fail

In [72]:
str(sm_ap_dataframe.text_dict[1])

'{\'0_\': \'1. viral: type B viral hepatitis (Kennedy et al, 1986) , the rabies virus glycoprotein (Reagan et al., 1983) , polio virus type 2 (Fons et al., 1985) , influenza hemagglutinin (Anders et al., 1989) , and bluetonge virus (Grieder et al., 1990 ); 2. bacterial: Streptococcus pneumoniae (McNamara et al., \\\\9S4) ,Pseudomonas aeruginosae (Schrieber et al., 1991) ; 3. parasitic: Trypanosoma rhodesiense (Sacks et al., 1982) , Schistosomias masoni (Kresina and Olds, 1989; Velge-Roussel et al., 1989) ; 4. fungal metabolites (which represent major agricultural contamination problems): trichothecene mycotoxin T-2 (Chanh et al., 1990) ; and 5. tumor antigens -with potential use in cancer therapy (reviewed in Langone, 1989) . Furthermore, this phenomenon has been utilized to identify putative receptors for the import of proteins into mitochondria (Pain et al., 1990) , and anti-anti-IgE idiotypic antibodies have been shown to mimic IgE in their binding to FCe receptor on mast cells invo

In [73]:
dict(str(sm_ap_dataframe.text_dict[1]))

ValueError: dictionary update sequence element #0 has length 1; 2 is required
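
`dict()` expects a mapping or an iterable of key/value pairs, so passing it a string fails as shown. One possible way to turn the `str()` form back into a dictionary is `ast.literal_eval` from the standard library, which parses Python literals without executing arbitrary code. A minimal sketch on a made-up dictionary (the sample data is an assumption, not taken from the dataset):

```python
import ast

# Sample dictionary with both quote styles inside a value.
original = {"0_Introduction": "It's a \"quoted\" paragraph.", "1_Methods": "Second."}

as_text = str(original)                # the same conversion as in the cell above
restored = ast.literal_eval(as_text)   # parse the literal back into a dict

print(restored == original)
```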

## Test sm sm df

In [74]:
sm_sm_df=sm_df.loc[0:10].copy()

In [None]:
import time

start = time.time()
with open("test.txt", 'w') as f:
    for i in range(10000000):
        # print('This is a speed test', file=f)
        f.write('This is a speed test\n')
end = time.time()
print(end - start)

In [85]:
import json

In [117]:
json.loads(json.dumps(sm_sm_df.loc[5].to_dict()))

{'paper_id': 'c2ab03046662fc55e0162afc133b4f73ea9ed866',

### Json Write

In [115]:
with open("test_sm_sm_json_line.txt", 'w') as f:
    for i in sm_sm_df.index:
        json_content=json.dumps(sm_sm_df.loc[i].to_dict())
        f.write(json_content)
        f.write("\n")

### Json load

In [118]:
with open("test_sm_sm_json_line.txt", 'r') as f:
    for line in f:
        try:
            json.loads(line)
            print("yes")
        except json.JSONDecodeError:
            # Only catch parse failures so that other errors stay visible.
            print("no")

yes
yes
yes
yes
yes
yes
yes
yes
yes
yes
yes


## Try different separator: Failed

In [120]:
sm_df.text_dict.head()

0    {'0_Introduction': 'Within the last two decade...
1    {'0_': '1. viral: type B viral hepatitis (Kenn...
2    {'0_Introduction': "Malaria is a mosquito-born...
3    {'0_Introduction': 'Nearly 3 million confirmed...
4    {'0_': 'ziektegeschiedenis Patiënt A, een 29-j...
Name: text_dict, dtype: object

In [121]:
sm_df.head()

Unnamed: 0,paper_id,text_dict
0,0ae02f293c03e3e1a2d4582e62c22f2c0c291f48,{'0_Introduction': 'Within the last two decade...
1,640de65e9f09545c463bc419bffb7084fc40fae5,{'0_': '1. viral: type B viral hepatitis (Kenn...
2,5da136317f5b97ed8371d5121d8828f1c9a5372d,"{'0_Introduction': ""Malaria is a mosquito-born..."
3,f9ae3db6ac88670b3f47b815bb7422a75f6d47c8,{'0_Introduction': 'Nearly 3 million confirmed...
4,a8676c57d7e3a52378b9e554cc0886ad91999e13,"{'0_': 'ziektegeschiedenis Patiënt A, een 29-j..."


In [122]:
sm_sm_df=sm_df.loc[0:10].copy()

In [123]:
sm_sm_df.to_csv("sm_sm_df.csv",sep='`', index=False, header=False)

In [125]:
with open("sm_sm_df.csv", 'r') as file:
    for line in file:
        paper_id,text_dict=line.split('`',2)
        #sm_dict=json.loads(text_dict)
        #print(text_dict[1:-2])
        eval(str(text_dict)
        #dict(text_dict)
        #print(text_dict.keys())

SyntaxError: unexpected EOF while parsing (<ipython-input-125-8f0ab546acd7>, line 8)
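
The `SyntaxError` above comes from the unbalanced parenthesis in `eval(str(text_dict)`, not from the separator itself. Splitting raw lines by hand can still go wrong if the writer quoted a field (for example because the text contains the separator or a double quote). A hedged sketch of one alternative: let the `csv` module undo the quoting, then parse the dictionary text with `ast.literal_eval`. The helper name `parse_rows` and the sample row are hypothetical, and the sketch assumes the file was written with the default minimal quoting.

```python
import ast
import csv
import io


def parse_rows(handle, sep="`"):
    """Yield (paper_id, text_dict) pairs from a two-column, sep-delimited file."""
    for paper_id, dict_text in csv.reader(handle, delimiter=sep):
        yield paper_id, ast.literal_eval(dict_text)


# Build a one-row sample in memory instead of reading sm_sm_df.csv.
sample = io.StringIO()
csv.writer(sample, delimiter="`").writerow(
    ["abc123", str({"0_Intro": 'He said "hi"'})]
)
sample.seek(0)

for paper_id, text_dict in parse_rows(sample):
    print(paper_id, list(text_dict))
```

For full-length papers a single field may exceed the `csv` module's default field size limit; `csv.field_size_limit` can raise it.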

## Test sm

In [None]:
sm_ap_dataframe[["paper_id","text_dict"]].to_csv("sm_appended_df.csv",sep='`', index=False, header=False)

In [32]:
#sm_ap_dataframe[["paper_id","text_dict"]].to_csv("sm_appended_df.csv",sep='`', index=False, header=False)

In [26]:
pd.read_csv("sm_appended_df.csv",sep='`',header=None)

Unnamed: 0,0,1
0,0ae02f293c03e3e1a2d4582e62c22f2c0c291f48,{'0_Introduction': 'Within the last two decade...
1,640de65e9f09545c463bc419bffb7084fc40fae5,{'0_': '1. viral: type B viral hepatitis (Kenn...
2,5da136317f5b97ed8371d5121d8828f1c9a5372d,"{'0_Introduction': ""Malaria is a mosquito-born..."
3,f9ae3db6ac88670b3f47b815bb7422a75f6d47c8,{'0_Introduction': 'Nearly 3 million confirmed...
4,a8676c57d7e3a52378b9e554cc0886ad91999e13,"{'0_': 'ziektegeschiedenis Patiënt A, een 29-j..."
...,...,...
51863,7ed6060dd9d540cbf92b794305429695efb775ce,"{'0_': ""The outbreak and spread of coronavirus..."
51864,2a35742783198a179b36b29a45fb3a7a28663026,{'0_': 'Cats have the ability to control postu...
51865,d119b09f850ebbfd1fef16f22c8eec38adcc684c,"{'0_| INTRODUCTION': 'In recent years, with th..."
51866,8624ce91f316d2aae5c09273f9308cc08ffcc25c,"{'0_Introduction, methods, and results': 'Porc..."


In [30]:
dict(sm_ap_dataframe["text_dict"][0])

{'0_Introduction': 'Within the last two decades, there have been several introductions of zoonotic pathogens into the human population. Specifically, two novel coronaviruses (CoV), Severe Acute Respiratory Syndrome-CoV (SARS-CoV) and Middle East Respiratory Syndrome-CoV (MERS-CoV) caused significant concern because they crossed the species barrier and caused severe disease. While SARS-CoV originated in Asia and spread rapidly to several countries throughout the world, MERS-CoV has largely been restricted to infections acquired in the Middle East. Both viruses are associated with spread from person to person and a high casefatality rate, thus the development of animal models for evaluation of anti-viral therapies and vaccines has been a high priority.SARS-CoV emerged in the Guangdong province of southern China in November, 2002 (Severe acute respiratory syndrome (SARS), 2003) . Retrospective analysis identified 11 cases between November 2002 and March 2003. Of these, 7 had documented co

In [332]:
sm_ap_dataframe[["paper_id","text_dict"]].to_csv("sm_appended_df.csv",)

In [344]:
# ap_dataframe.to_csv("appended_df.csv")

# To json

In [350]:
sm=sm_ap_dataframe[["paper_id","text_dict"]]

In [353]:
sm_dict=dict()
for i in range(sm.shape[0]):
    sm_dict[sm.paper_id[i]]=sm.text_dict[i]

In [351]:
#sm.head()

Unnamed: 0,paper_id,text_dict
0,0ae02f293c03e3e1a2d4582e62c22f2c0c291f48,{'0_Introduction': 'Within the last two decade...
1,640de65e9f09545c463bc419bffb7084fc40fae5,{'0_': '1. viral: type B viral hepatitis (Kenn...
2,5da136317f5b97ed8371d5121d8828f1c9a5372d,{'0_Introduction': 'Malaria is a mosquito-born...
3,f9ae3db6ac88670b3f47b815bb7422a75f6d47c8,{'0_Introduction': 'Nearly 3 million confirmed...
4,a8676c57d7e3a52378b9e554cc0886ad91999e13,"{'0_': 'ziektegeschiedenis Patiënt A, een 29-j..."


In [364]:
import json

In [370]:
# Open with 'w' rather than 'a': appending a second dump would leave invalid JSON.
with open("sm_appended_df.json",'w') as outfile:
    json.dump(sm_dict,outfile,ensure_ascii=False)

In [367]:
#sm_dict.to_json("sm_appended_df.json")

In [335]:
#sm_ap_dataframe[["paper_id","text_dict"]].to_json("sm_appended_df.json")

# Test Q_A Function

In [None]:
def answer_question(question, answer_text):
    '''
    Takes a `question` string and an `answer_text` string (which contains the
    answer), and identifies the words within the `answer_text` that are the
    answer. Returns a list: [answer, highest start score (as a string),
    number of input tokens].
    '''
    # ======== Tokenize ========
    # Apply the tokenizer to the input text, treating them as a text-pair.
    input_ids = tokenizer.encode(question, answer_text,max_length=500
                                )

    # Report how long the input sequence is.
    #print('Query has {:,} tokens.\n'.format(len(input_ids)))

    # ======== Set Segment IDs ========
    # Search the input_ids for the first instance of the `[SEP]` token.
    sep_index = input_ids.index(tokenizer.sep_token_id)

    # The number of segment A tokens includes the [SEP] token itself.
    num_seg_a = sep_index + 1

    # The remainder are segment B.
    num_seg_b = len(input_ids) - num_seg_a

    # Construct the list of 0s and 1s.
    segment_ids = [0]*num_seg_a + [1]*num_seg_b

    # There should be a segment_id for every input token.
    assert len(segment_ids) == len(input_ids)

    # ======== Evaluate ========
    # Run our example question through the model.
    start_scores, end_scores = model(torch.tensor([input_ids]), # The tokens representing our input text.
                                    token_type_ids=torch.tensor([segment_ids])) # The segment IDs to differentiate question from answer_text

    # ======== Reconstruct Answer ========
    # Find the tokens with the highest `start` and `end` scores.
    answer_start = torch.argmax(start_scores)
    answer_end = torch.argmax(end_scores)
    
    
    # Get the string versions of the input tokens.
    tokens = tokenizer.convert_ids_to_tokens(input_ids)

    # Start with the first token.
    answer = tokens[answer_start]

    # Select the remaining answer tokens and join them with whitespace.
    for i in range(answer_start + 1, answer_end + 1):
        
        # If it's a subword token, then recombine it with the previous token.
        if tokens[i][0:2] == '##':
            answer += tokens[i][2:]
        
        # Otherwise, add a space then the token.
        else:
            answer += ' ' + tokens[i]
            
    s_scores = start_scores.detach().numpy().flatten()
    e_scores = end_scores.detach().numpy().flatten()
    #print('score:'+(start_scores)+"; "+str(end_scores))
    #print('score:'+str(max(s_scores))+"; "+str(min(e_scores)))
    #print(str(tensor[torch.argmax(start_scores)]))
    #print('Answer: "' + answer + '"')
    return [answer,str(max(s_scores)),len(input_ids)]
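
The segment-ID and subword-merging steps inside `answer_question` do not depend on the model, so they can be checked in isolation. The sketch below uses hand-written tokens instead of real tokenizer output; the helper names `segment_ids_for` and `merge_wordpieces` and the token list are hypothetical.

```python
def segment_ids_for(tokens, sep_token="[SEP]"):
    """Return 0 for every token up to and including the first [SEP], then 1."""
    num_seg_a = tokens.index(sep_token) + 1
    return [0] * num_seg_a + [1] * (len(tokens) - num_seg_a)


def merge_wordpieces(tokens):
    """Join WordPiece tokens, gluing '##' continuations onto the previous token."""
    answer = ""
    for token in tokens:
        if token.startswith("##") and answer:
            answer += token[2:]          # continuation of the previous word
        elif answer:
            answer += " " + token        # a new word
        else:
            answer = token               # first token is taken as-is
    return answer


tokens = ["[CLS]", "what", "is", "it", "?", "[SEP]",
          "hyper", "##tension", "risk", "[SEP]"]
print(segment_ids_for(tokens))
print(merge_wordpieces(tokens[6:9]))
```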

In [284]:
'''
for i in clean_pdf_df.text:
    answer_question("What do we know about Hypertension?",i)
'''

'\nfor i in clean_pdf_df.text:\n    answer_question("What do we know about Hypertension?",i)\n'

In [285]:
Q_example="What do we know about Hypertension?"

In [286]:
Q_example_a="[CLS] what do we know about hypertension ? [SEP]"

In [288]:
for i in ap_dataframe.index:
    dct=ap_dataframe.loc[i]["text_dict"]
    for key in dct.keys():
        # Run the model once per paragraph and unpack all three results.
        answer,s_score,token_len=answer_question(Q_example,dct[key])
        if answer!="[CLS]" and float(s_score)>0 and answer!=Q_example_a:
            print(answer)
            print(s_score)
            print(token_len)

blood pressure
0.500568
330
excess variability
0.032215476
36
analogs of known chemical compounds are sometimes more effective than the parent molecule
0.5907004
77
no key lessons learned have been eliminated by merging these results .
0.7678749
100
[SEP]
0.06287956
227


KeyboardInterrupt: 
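
The output above includes `[SEP]` as an answer, because the start and end positions are picked independently over the whole input. A possible refinement, which is not what the notebook does, is to search only spans that lie inside the context segment and have start <= end. The sketch below works on plain lists of scores; `best_span`, its parameters, and the toy scores are all assumptions for illustration.

```python
def best_span(start_scores, end_scores, context_start, context_end, max_len=30):
    """Return (start, end, score) maximising start + end score inside the context.

    Only spans with context_start <= start <= end < context_end and at most
    max_len tokens are considered. Returns None if the context is empty.
    """
    best = None
    for start in range(context_start, context_end):
        last = min(context_end, start + max_len)
        for end in range(start, last):
            score = start_scores[start] + end_scores[end]
            if best is None or score > best[2]:
                best = (start, end, score)
    return best


# Toy scores for 8 tokens: index 0 is [CLS], index 3 is the first [SEP],
# indices 4-6 are context tokens and index 7 is the final [SEP].
start_scores = [5.0, 0.1, 0.2, 0.0, 1.0, 3.0, 0.5, 4.0]
end_scores = [5.0, 0.0, 0.1, 0.2, 0.5, 1.0, 2.5, 4.0]
print(best_span(start_scores, end_scores, context_start=4, context_end=7))
```

With these toy scores the independent argmax would pick `[CLS]` (index 0), while the restricted search returns a span inside the context.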

Just to see exactly what the tokenizer is doing, let's print out the tokens with their IDs.