## Assignment 3 (Question-Answering System)
### Li Fu Chuen a1865844
### Ka Hing Chan a1867200
### Manish Adhikari a1876371

This is the main notebook that runs the question-answering system. Some sections only load saved files that were created in separate notebooks. For example, BERT was fine-tuned in a separate notebook named "BERT_Squad_2_0.ipynb". All our notebooks are available in our GitHub repository: https://github.com/fuchuenli/NLPAssignment3.git

### 1. Reading dataset and pre-processing

#### 1.1 Import required libraries

In [1]:
# Import libraries 
import re
import copy
import time
import nltk
import spacy
import heapq
import pickle
import string
import threading
import numpy as np
import pandas as pd
from tqdm import tqdm
from gtts import gTTS
import ipywidgets as widgets
from collections import deque
from transformers import pipeline
from gensim.models import FastText
from defaultlist import defaultlist
from collections import defaultdict
from spacy_entity_linker import EntityLinker
from sklearn.metrics.pairwise import cosine_similarity
from IPython.display import display, Audio,clear_output, HTML
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoModelForQuestionAnswering


#### 1.2 Import dataset

This section loads the CORD-19 dataset extracted from Kaggle. The extraction was done in a separate notebook named "ckaggle_extrate_dataset_notebook.ipynb". Please refer to that notebook for more information.

In [2]:
# Load the sampled non-preprocessed covid-19 dataset
dataset = pd.read_csv("Corpus_10k.csv")
dataset

Unnamed: 0,cord_uid,title,abstract,authors,publish_time,corpus
0,bjvyx8u1,Knowledge and Attitude towards COVID-19 Vaccin...,BACKGROUND: It is imperative to ensure optimal...,"Aklil, Mastewal Belayneh; Temesgan, Wubedle Ze...",2022-05-03,Coronavirus disease 2019 (COVID-19) is a new s...
1,vfo59saj,Improving Quality and Efficiency in Pediatric ...,INTRODUCTION: Many children with behavioral he...,"Emerson, Beth L.; Setzer, Erika; Blake, Eileen...",2022-01-21,"Before initiating this work, our pediatric eme..."
2,vvr5x5cn,A cross-country comparison of user experience ...,Autonomous solutions for transportation are em...,"Bellone, Mauro; Ismailogullari, Azat; Kantala,...",2021-03-08,This paper analyses Mobility as a Service (Maa...
3,be43zmo6,ddPCR Reveals SARS-CoV-2 Variants in Florida W...,Wastewater was screened for the presence of fu...,"Gering, E.; Colbert, J.; Schmedes, S.; Duncan,...",2021-04-13,Several SARS-CoV-2 variants of concern have be...
4,drejl99k,A Serological Survey of Infectious Disease in ...,BACKGROUND: Gray wolves (Canis lupus) were rei...,"Almberg, Emily S.; Mech, L. David; Smith, Doug...",2009-09-16,Several high-mortality disease outbreaks among...
...,...,...,...,...,...,...
9995,2tgzky7g,Characterization and Filtration Efficiency of ...,The enormous world demand for personal protect...,"Pierpaoli, Mattia; Giosuè, Chiara; Czerwińska,...",2021-11-10,"The current SARS-CoV-2 virus, which causes the..."
9996,7gxslltu,Association between prehospital fluid resuscit...,Prehospital fluid resuscitation with crystallo...,"Sung, Chih-Wei; Sun, Jen-Tang; Huang, Edward P...",2022-03-08,Traumatic injury is a universal problem that c...
9997,f5r4hcgv,"PLANETAMOS, A Physics Show Musical (Phyusical)",We present a physics show musical with live ph...,"Becker, Lara; Busley, Erik; Dietl, Jakob; Drei...",2022-01-26,Abstract We present a physics show musical wit...
9998,0qdqb2em,Severity of COVID-19 is inversely correlated w...,Background: SARS-CoV-2 genome accumulates poin...,"Abe, K.; Kabe, Y.; Uchiyama, S.; Iwasaki W., Y...",2020-11-24,Severe acute respiratory syndrome coronavirus ...


In [3]:
# Keep only articles whose text is a string of between 100 and 100,000 characters
excluded = [index for index, text in enumerate(dataset["corpus"]) if not isinstance(text, str) or len(text) < 100 or len(text) > 100000]
dataset = dataset.loc[~dataset.index.isin(excluded)].reset_index(drop=True)

#### 1.3 Data Pre-Processing

This section loads the coreference-resolved corpus. Coreference resolution was done in a separate notebook named "coreference_notebook.ipynb". Please refer to that notebook for more information.

In [5]:
# Load the coreference-resolved corpus into a new dataframe
dataset_coreference = dataset.copy(deep=True)
for index in range(len(dataset)):
    with open(f"coreference_10k/corpus_coref_{index}.txt", "r") as f:
        # .loc avoids pandas chained assignment, which may write to a temporary copy
        dataset_coreference.loc[index, "corpus"] = f.read()
dataset_coreference

Unnamed: 0,cord_uid,title,abstract,authors,publish_time,corpus
0,bjvyx8u1,Knowledge and Attitude towards COVID-19 Vaccin...,BACKGROUND: It is imperative to ensure optimal...,"Aklil, Mastewal Belayneh; Temesgan, Wubedle Ze...",2022-05-03,Coronavirus disease 2019 (COVID-19)is a new st...
1,vfo59saj,Improving Quality and Efficiency in Pediatric ...,INTRODUCTION: Many children with behavioral he...,"Emerson, Beth L.; Setzer, Erika; Blake, Eileen...",2022-01-21,"Before initiating this work, our pediatric eme..."
2,vvr5x5cn,A cross-country comparison of user experience ...,Autonomous solutions for transportation are em...,"Bellone, Mauro; Ismailogullari, Azat; Kantala,...",2021-03-08,This paper analyses Mobility as a Service (Maa...
3,be43zmo6,ddPCR Reveals SARS-CoV-2 Variants in Florida W...,Wastewater was screened for the presence of fu...,"Gering, E.; Colbert, J.; Schmedes, S.; Duncan,...",2021-04-13,Several SARS-CoV-2 variants of concern have be...
4,drejl99k,A Serological Survey of Infectious Disease in ...,BACKGROUND: Gray wolves (Canis lupus) were rei...,"Almberg, Emily S.; Mech, L. David; Smith, Doug...",2009-09-16,Several high-mortality disease outbreaks among...
...,...,...,...,...,...,...
9781,2tgzky7g,Characterization and Filtration Efficiency of ...,The enormous world demand for personal protect...,"Pierpaoli, Mattia; Giosuè, Chiara; Czerwińska,...",2021-11-10,"The current SARS-CoV-2 virus, which causes the..."
9782,7gxslltu,Association between prehospital fluid resuscit...,Prehospital fluid resuscitation with crystallo...,"Sung, Chih-Wei; Sun, Jen-Tang; Huang, Edward P...",2022-03-08,Traumatic injury is a universal problem that c...
9783,f5r4hcgv,"PLANETAMOS, A Physics Show Musical (Phyusical)",We present a physics show musical with live ph...,"Becker, Lara; Busley, Erik; Dietl, Jakob; Drei...",2022-01-26,Abstract We present a physics show musical wit...
9784,0qdqb2em,Severity of COVID-19 is inversely correlated w...,Background: SARS-CoV-2 genome accumulates poin...,"Abe, K.; Kabe, Y.; Uchiyama, S.; Iwasaki W., Y...",2020-11-24,Severe acute respiratory syndrome coronavirus ...


In [4]:
# A function to remove IEEE and APA in-text references, URLs and non-ASCII characters (punctuation is kept)
def data_cleaning(corpus):
    pattern = r"\[\d+(,\s*\d+)*\]|\(([A-Za-z]+( [A-Za-z]+)+)\., [0-9]+\)|https?://\S+|www\.\S+|[^\x00-\x7F]+"
    clean_corpus = re.sub(pattern, "", corpus)
    return clean_corpus

dataset["corpus"] = dataset["corpus"].apply(data_cleaning)
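The cleaning pattern above combines several alternatives in a single regular expression. The sketch below splits it into named parts and applies it to a short sample, which may make each alternative easier to check. The helper name `clean_text`, the constant names and the sample sentence are illustrative assumptions, not part of the original notebook; matching `https` links as well as `http` is also an assumption about the intended behaviour.

```python
import re

# Each alternative of the cleaning pattern, named for readability.
# The names and the sample text below are illustrative assumptions.
IEEE_CITATION = r"\[\d+(?:,\s*\d+)*\]"                    # [3] or [1, 2, 5]
APA_CITATION = r"\([A-Za-z]+(?: [A-Za-z]+)+\., [0-9]+\)"   # (Shen et al., 2020)
URL = r"https?://\S+|www\.\S+"                            # http(s) links and bare www hosts
NON_ASCII = r"[^\x00-\x7F]+"                              # anything outside 7-bit ASCII

CLEANING_PATTERN = re.compile("|".join([IEEE_CITATION, APA_CITATION, URL, NON_ASCII]))


def clean_text(text):
    """Remove in-text citations, URLs and non-ASCII characters from text."""
    return CLEANING_PATTERN.sub("", text)


sample = "Masks help [1, 2] (Shen et al., 2020) see https://example.org now"
cleaned = clean_text(sample)
print(cleaned.split())
```

Note that removing a match leaves the surrounding spaces in place, so the cleaned text can contain runs of spaces; the example prints the whitespace-split words for that reason.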

In [6]:
# Split each corpus into paragraphs (blocks separated by blank lines) for later sections
def paragraph_tokenize(text):
    return nltk.tokenize.blankline_tokenize(text)

dataset["corpus"] = dataset["corpus"].apply(lambda x: paragraph_tokenize(x))
dataset_coreference["corpus"] = dataset_coreference["corpus"].apply(lambda x: paragraph_tokenize(x))

In [51]:
dataset_coreference

Unnamed: 0,cord_uid,title,abstract,authors,publish_time,corpus
0,bjvyx8u1,Knowledge and Attitude towards COVID-19 Vaccin...,BACKGROUND: It is imperative to ensure optimal...,"Aklil, Mastewal Belayneh; Temesgan, Wubedle Ze...",2022-05-03,[Coronavirus disease 2019 (COVID-19)is a new s...
1,vfo59saj,Improving Quality and Efficiency in Pediatric ...,INTRODUCTION: Many children with behavioral he...,"Emerson, Beth L.; Setzer, Erika; Blake, Eileen...",2022-01-21,"[Before initiating this work, our pediatric em..."
2,vvr5x5cn,A cross-country comparison of user experience ...,Autonomous solutions for transportation are em...,"Bellone, Mauro; Ismailogullari, Azat; Kantala,...",2021-03-08,[This paper analyses Mobility as a Service (Ma...
3,be43zmo6,ddPCR Reveals SARS-CoV-2 Variants in Florida W...,Wastewater was screened for the presence of fu...,"Gering, E.; Colbert, J.; Schmedes, S.; Duncan,...",2021-04-13,[Several SARS-CoV-2 variants of concern have b...
4,drejl99k,A Serological Survey of Infectious Disease in ...,BACKGROUND: Gray wolves (Canis lupus) were rei...,"Almberg, Emily S.; Mech, L. David; Smith, Doug...",2009-09-16,[Several high-mortality disease outbreaks amon...
...,...,...,...,...,...,...
9781,2tgzky7g,Characterization and Filtration Efficiency of ...,The enormous world demand for personal protect...,"Pierpaoli, Mattia; Giosuè, Chiara; Czerwińska,...",2021-11-10,"[The current SARS-CoV-2 virus, which causes th..."
9782,7gxslltu,Association between prehospital fluid resuscit...,Prehospital fluid resuscitation with crystallo...,"Sung, Chih-Wei; Sun, Jen-Tang; Huang, Edward P...",2022-03-08,[Traumatic injury is a universal problem that ...
9783,f5r4hcgv,"PLANETAMOS, A Physics Show Musical (Phyusical)",We present a physics show musical with live ph...,"Becker, Lara; Busley, Erik; Dietl, Jakob; Drei...",2022-01-26,[Abstract We present a physics show musical wi...
9784,0qdqb2em,Severity of COVID-19 is inversely correlated w...,Background: SARS-CoV-2 genome accumulates poin...,"Abe, K.; Kabe, Y.; Uchiyama, S.; Iwasaki W., Y...",2020-11-24,[Severe acute respiratory syndrome coronavirus...


#### 1.4 spaCy Pipeline

This section loads the corpus already processed by the spaCy pipeline. The processing was done in a separate notebook named "spacy_pipeline_notebook.ipynb". Please refer to that notebook for more information.

In [7]:
# Load Spacy NLP pipeline
nlp = spacy.load("en_core_sci_sm")
nlp.add_pipe("merge_noun_chunks")
nlp.add_pipe("merge_entities")
print('Pipeline components included: ', nlp.pipe_names)



Pipeline components included:  ['tok2vec', 'tagger', 'attribute_ruler', 'lemmatizer', 'parser', 'ner', 'merge_noun_chunks', 'merge_entities']


In [8]:
# Load the saved processed corpus
with open('corpus_pickle_10k_paragraph', 'rb') as f:
    corpus = pickle.load(f)

### 2. Named Entity Recognition and Knowledge Base

#### 2.1 Data Arrangement

In [9]:
# A function for tokenization (and lemmatisation)
def spacy_tokenizer(text):
    tokens = [token.lemma_ for token in text if not token.is_punct]
    return tokens

In [10]:
# Create three datasets from the processed corpus: sentences per paragraph, lemmas per sentence, and lemmas per paragraph
corpus_sent = defaultlist(lambda: defaultlist(list))
corpus_tokenized = []
corpus_frag_tokenized = defaultlist(lambda: defaultlist(list))

for article_index in tqdm(range(len(corpus))):
    for frag_index in range(len(corpus[article_index])):
        corpus_sent[article_index][frag_index] = list(corpus[article_index][frag_index].sents)
        corpus_frag_tokenized[article_index][frag_index] = spacy_tokenizer(corpus[article_index][frag_index])
        for sentence in corpus_sent[article_index][frag_index]:
            corpus_tokenized.append(spacy_tokenizer(sentence))

100%|██████████| 9786/9786 [01:42<00:00, 95.58it/s] 


In [11]:
# A sample showing entities in a corpus
spacy.displacy.render(corpus[100][2], style="ent")

#### 2.2 Entity Extraction

In [12]:
# Create a set of all entities across the whole corpus
# Create a dictionary with key=article index and value=entities in the article
all_entity = set()
article_entity = defaultdict(set)

for article_index in tqdm(range(len(corpus))):
    for frag_index in range(len(corpus[article_index])):
        for entity in corpus[article_index][frag_index].ents:
            article_entity[article_index].add(entity.lemma_)
            all_entity.add(entity.lemma_)

100%|██████████| 9786/9786 [00:45<00:00, 212.92it/s]


In [13]:
# Content of all_entity
all_entity

{'h5n1infection',
 'alcam',
 'pure mathematical formulation',
 'the rac1-nadph pathway',
 'mrna deliveryefficiency',
 'however, inclusion',
 'Heavily Indebted Poor Countries',
 'the recall period',
 'infeccione',
 'good plating efficiency',
 'arch model',
 'such multi-component robotic complex',
 'the ecats scenario',
 'site fidelity',
 'trypanosoma specie',
 '56 sample datum',
 'the typical morphologic correlate',
 'significantly more datum',
 'widespread Corona Virus Disease2019',
 'the same ordering',
 'task distribution',
 'artd15(ref',
 'low back pain',
 'a glycoapoprotein',
 'the custodian',
 'also, the studywas',
 'immunocompromised/active cancer',
 'notamment pour',
 '(e.g., weight loss',
 'the denv non-structural protein 1',
 'considerable life change',
 'male)i',
 'patientsobservational',
 'mk-specific receptor',
 'share informationthought',
 'sick cattle',
 'theytheir frontlist ebooksin mid-2014',
 'theystudy',
 'xp_001351047.1',
 'the optimized result',
 'tf prediction',
 '

In [14]:
# Content of article entity
article_entity[452]

{'(ButterflyIQButterfly Network',
 '(ED',
 '(lus',
 '(nv',
 '(picc',
 '(rt-pcr',
 '2-week history',
 'CT',
 'ED',
 'Guilford',
 'Internal Medicine ward',
 'Lung',
 'Lung ultrasoundis',
 'Lung ultrasounduse',
 'USA',
 '[nv',
 'a 60-year-old man',
 'a chest computed tomography',
 'a history',
 'a nasopharyngeal swab reverse transcription-polymerase chain reaction',
 'a satisfactory evolution',
 'a wbc',
 'abnormality',
 'absence',
 'acquisition',
 'admit',
 'aeration',
 'an early detection',
 'an establish technique',
 'anterior chest',
 'b-line',
 'bedside',
 'bilateral',
 'blood culture',
 'blood gas analysis',
 'blood transfusion',
 'bronchial wall',
 'case',
 'chronology',
 'clinical skill',
 'coagulopathy',
 'complicated',
 'current',
 'd-dimer',
 'deep breathing',
 'degree',
 'density',
 'devastating disease',
 'dexamethasone',
 'diagnose',
 'disc space',
 'discharge',
 'disease',
 'dorsal pain',
 'dry bibasal crackle',
 'dyspnea',
 'emergency department',
 'epidermidispresente',
 

#### 2.3 Knowledge Base(Entity Linking)

#### 2.3.1 Knowledge Base(Manual)

In [15]:
# Manually create a small knowledge base that links entities to their alternative names
knowledge_base = {}
knowledge_base["covid"] = {'COVID','Covid','COVID-','Covid-','COVID-19', 'covid19','cOVID-19','covid-19','Covid-19','CoViD-19','COvid-19', "SARS-CoV-2", 'sars-cov-2'}
knowledge_base["coronavirus"] = {'Coronavirus', 'the coronavirus'}
knowledge_base["sars-cov2 infected people"] = {"covid-19 patient"}
knowledge_base["cross protection"] = {"cross immunity", "heterologous protection", "heterovariant immunity"}
knowledge_base["SARS-CoV"] = {"SARS coronavirus", "SARS-CoV-1", "SARS-CoV-1 virus"}
knowledge_base["animal study"] = {"preclinical study", "animal experiment", "animal model", "in vivo study", "animal research"}
knowledge_base["rapid testing"] = {"point-of-care testing", "rapid antigen test"}
knowledge_base["serological test"] = {'antibody test', 'Antibody test','Immunoassay', 'immunoassay', 'blood test', 'Serology', 'serology', 'ELISA', 'antibody titer'}
knowledge_base["social distance"] = {'social distancing', 'physical distancing', 'Physical distancing', 'social isolation', 'quarantine', 'Quarantine', 'stay-at-home', 'Separation', 'separation', 'self-quarantine'}
knowledge_base["clinical trial"] = {'clinical study', 'Clinical Study', 'drug trial', 'therapeutic trial', 'interventional study', 'experimental study', 'Treatment Trial', 'treatment trial', 'human subject research','clinical research', 'Clinical Research'}
knowledge_base["ace inhibitor"] = {'ACE inhibitor','angiotensin-converting enzyme inhibitor'}


In [16]:
# Display the manually created knowledge base
knowledge_base

{'covid': {'COVID',
  'COVID-',
  'COVID-19',
  'COvid-19',
  'CoViD-19',
  'Covid',
  'Covid-',
  'Covid-19',
  'SARS-CoV-2',
  'cOVID-19',
  'covid-19',
  'covid19',
  'sars-cov-2'},
 'coronavirus': {'Coronavirus', 'the coronavirus'},
 'sars-cov2 infected people': {'covid-19 patient'},
 'cross protection': {'cross immunity',
  'heterologous protection',
  'heterovariant immunity'},
 'SARS-CoV': {'SARS coronavirus', 'SARS-CoV-1', 'SARS-CoV-1 virus'},
 'animal study': {'animal experiment',
  'animal model',
  'animal research',
  'in vivo study',
  'preclinical study'},
 'rapid testing': {'point-of-care testing', 'rapid antigen test'},
 'serological test': {'Antibody test',
  'ELISA',
  'Immunoassay',
  'Serology',
  'antibody test',
  'antibody titer',
  'blood test',
  'immunoassay',
  'serology'},
 'social distance': {'Physical distancing',
  'Quarantine',
  'Separation',
  'physical distancing',
  'quarantine',
  'self-quarantine',
  'separation',
  'social distancing',
  'social i

#### 2.3.2 Knowledge Base(Wikidata)

In [17]:
# Link entities in corpus to canonical form in Wikidata
nlp_el = spacy.blank("en")
nlp_el.add_pipe("entityLinker", last=True)
print('Pipeline components included: ', nlp_el.pipe_names)
corpus_el = []
for article in tqdm(corpus):
    for frag in article:
        corpus_el.append(nlp_el(frag))

Pipeline components included:  ['entityLinker']


100%|██████████| 9786/9786 [10:47<00:00, 15.11it/s]  


In [18]:
# Store the linked entities into a dictionary
entity_linker = defaultdict(set)
for frag in tqdm(corpus_el):
    el = frag._.linkedEntities
    for ent in el:
        entity_linker[ent.get_label()].add(ent.get_span().lemma_)

100%|██████████| 300297/300297 [00:27<00:00, 11097.75it/s]


In [19]:
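# Merge the manually created knowledge base with the linked entities
# Note: dict.update() replaces the whole alias set when a label exists in both,
# so the Wikidata aliases take over for shared labels such as "social distance"
knowledge_base.update(entity_linker)
knowledge_base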

{'covid': {'COVID',
  'COVID-',
  'COVID-19',
  'COvid-19',
  'CoViD-19',
  'Covid',
  'Covid-',
  'Covid-19',
  'SARS-CoV-2',
  'cOVID-19',
  'covid-19',
  'covid19',
  'sars-cov-2'},
 'coronavirus': {'Coronavirus', 'the coronavirus'},
 'sars-cov2 infected people': {'covid-19 patient'},
 'cross protection': {'cross immunity',
  'heterologous protection',
  'heterovariant immunity'},
 'SARS-CoV': {'SARS coronavirus', 'SARS-CoV-1', 'SARS-CoV-1 virus'},
 'animal study': {'animal experiment',
  'animal model',
  'animal research',
  'in vivo study',
  'preclinical study'},
 'rapid testing': {'point-of-care testing', 'rapid antigen test'},
 'serological test': {'Antibody test',
  'ELISA',
  'Immunoassay',
  'Serology',
  'antibody test',
  'antibody titer',
  'blood test',
  'immunoassay',
  'serology'},
 'social distance': {'social distance'},
 'clinical trial': {'Clinical Study',
  'Clinical Trial',
  'clinical study',
  'clinical trial',
  'human subject research'},
 'ace inhibitor': {'

In [45]:
# Display one of the newly linked entities
knowledge_base["World Health Organization"]

{'WHO',
 'World Health Organisation',
 'World Health Organization',
 'World Health organisation',
 'World Health organization',
 'who',
 'world Health Organisation',
 'world Health Organization',
 'world Health organization',
 'world health Organization',
 'world health organization'}
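In the merge step of section 2.3.2, a label that exists in both the manual knowledge base and the Wikidata links ends up with only the Wikidata aliases (for example, "social distance" keeps a single alias in the merged output). If the intent is to keep both sets of aliases, a union-based merge is one option. The following is a minimal sketch on toy data; `merge_knowledge_bases` and the toy dictionaries are hypothetical names used only for illustration.

```python
from collections import defaultdict


def merge_knowledge_bases(manual, linked):
    """Return a new dict whose alias sets are the union of both inputs."""
    merged = defaultdict(set)
    for source in (manual, linked):
        for label, aliases in source.items():
            merged[label].update(aliases)
    return dict(merged)


# Toy data standing in for the manual knowledge base and the linker output
manual_kb = {"social distance": {"social distancing", "quarantine"}}
linked_kb = {"social distance": {"social distance"}, "World Health Organization": {"WHO"}}

merged_kb = merge_knowledge_bases(manual_kb, linked_kb)
print(sorted(merged_kb["social distance"]))
```

Because the function builds a new dictionary, neither input is modified, which keeps the manual knowledge base available for inspection after the merge.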

### 3. Indexing method and Word Embedding

#### 3.1 Inverted Index

In [20]:
# Create an inverted index as a dictionary
# key=entity, value=list of indices of the articles that mention the entity
inverted_index = defaultdict(list)
for target_entity in tqdm(all_entity):
    for index, entity in article_entity.items():
        if target_entity in entity:
            inverted_index[target_entity].append(index)

100%|██████████| 2136950/2136950 [33:48<00:00, 1053.67it/s] 
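The loop above checks every entity against every article, which is why it takes over half an hour. An inverted index can usually be built in a single pass over the per-article entity sets, visiting each (article, entity) pair once. The sketch below shows this on toy data; `build_inverted_index` and `toy_article_entity` are hypothetical names, and the toy dictionary only mimics the shape of `article_entity`.

```python
from collections import defaultdict


def build_inverted_index(article_entities):
    """Map each entity to the sorted list of articles that mention it."""
    index = defaultdict(list)
    for article_id in sorted(article_entities):
        for entity in article_entities[article_id]:
            index[entity].append(article_id)
    return index


# Toy stand-in for article_entity: article index -> set of entity lemmas
toy_article_entity = {0: {"covid", "vaccine"}, 1: {"covid"}, 2: {"wolf"}}
toy_index = build_inverted_index(toy_article_entity)
print(toy_index["covid"])
```

Iterating over the article indices in sorted order keeps each posting list in ascending order, matching the ordering shown in the example output for "covid".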


In [21]:
# An example of "covid" in the inverted index
inverted_index["covid"]

[28,
 31,
 48,
 55,
 104,
 106,
 169,
 186,
 285,
 291,
 319,
 434,
 506,
 579,
 671,
 679,
 685,
 687,
 700,
 766,
 990,
 1094,
 1100,
 1110,
 1118,
 1121,
 1184,
 1198,
 1272,
 1284,
 1300,
 1331,
 1346,
 1347,
 1388,
 1476,
 1499,
 1508,
 1535,
 1544,
 1564,
 1604,
 1683,
 1729,
 1758,
 1795,
 1796,
 1843,
 1880,
 1914,
 1983,
 1990,
 2003,
 2179,
 2405,
 2419,
 2443,
 2481,
 2612,
 2635,
 2689,
 2739,
 2787,
 2788,
 2792,
 2901,
 3027,
 3055,
 3123,
 3156,
 3175,
 3202,
 3219,
 3338,
 3346,
 3516,
 3625,
 3637,
 3667,
 3682,
 3753,
 3781,
 3786,
 3919,
 3971,
 4005,
 4047,
 4121,
 4267,
 4283,
 4299,
 4338,
 4355,
 4398,
 4448,
 4458,
 4564,
 4583,
 4721,
 4750,
 4763,
 4827,
 4857,
 4859,
 4861,
 5039,
 5044,
 5059,
 5202,
 5246,
 5247,
 5376,
 5383,
 5462,
 5473,
 5479,
 5489,
 5544,
 5568,
 5608,
 5646,
 5697,
 5709,
 5737,
 5754,
 5755,
 5905,
 5935,
 6016,
 6105,
 6156,
 6288,
 6344,
 6379,
 6383,
 6423,
 6495,
 6510,
 6606,
 6636,
 6649,
 6686,
 6691,
 6829,
 6878,
 6895,
 69

#### 3.2 Word Embedding(Fasttext)

The FastText model was trained beforehand because training takes a long time (>6 hours). The training code is commented out for reference; only loading the saved model is demonstrated here.

In [22]:
# Train a FastText model
# The trained model is saved after training so that it does not need to be retrained

# fasttext_model = FastText(corpus_tokenized, vector_size=300, window=10, min_count=1, workers=4)
# fasttext_model.save("fasttext_model/fasttext_model")

In [23]:
# Load the trained fasttext model
fasttext_model = FastText.load("fasttext_model/fasttext_model")

In [24]:
# A function to turn a list of words into a single 300-dimension vector by averaging the word embeddings
def average_fasttext(sentence):
    length_of_sentence = len(sentence)
    if length_of_sentence == 0:
        return np.zeros((1,fasttext_model.vector_size))
    sentence_fasttext = np.zeros((1,fasttext_model.vector_size))
    for word in sentence:
        sentence_fasttext += fasttext_model.wv[word]
    avg_fasttext_sentence = sentence_fasttext/length_of_sentence
    return avg_fasttext_sentence

In [25]:
# A function to turn a query (a list of alias clusters) into a single 300-dimension vector:
# each cluster is averaged first, then the cluster averages are averaged
def average_fasttext_query(query):
    sentence_fasttext = np.zeros((1,fasttext_model.vector_size))
    if len(query) == 0:
        return sentence_fasttext
    for words in query:
        related_word_fasttext = np.zeros((1,fasttext_model.vector_size))
        for word in words:
            try:
                related_word_fasttext += fasttext_model.wv[word]
            except KeyError:
                # Skip words for which no vector is available
                continue
        sentence_fasttext += related_word_fasttext/len(words)
    avg_fasttext_sentence = sentence_fasttext/len(query)
    return avg_fasttext_sentence
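The two-level averaging used for queries can be illustrated without the trained model. The sketch below uses plain Python lists and a toy embedding table; `mean_vector`, `query_vector` and `toy_embeddings` are hypothetical names. One deliberate difference from the function above, stated as an assumption: the sketch divides each cluster by the number of words that actually have a vector, rather than by the full cluster size.

```python
def mean_vector(vectors, size):
    """Element-wise mean of equal-length vectors; zeros if there are none."""
    if not vectors:
        return [0.0] * size
    return [sum(values) / len(vectors) for values in zip(*vectors)]


def query_vector(clusters, embeddings, size):
    """Average each alias cluster first, then average the cluster means."""
    cluster_means = []
    for cluster in clusters:
        known = [embeddings[word] for word in cluster if word in embeddings]
        cluster_means.append(mean_vector(known, size))
    return mean_vector(cluster_means, size)


# Toy 2-dimensional embeddings (made-up values)
toy_embeddings = {"covid": [1.0, 0.0], "sars-cov-2": [0.0, 1.0], "infect": [1.0, 1.0]}
toy_query = [["covid", "sars-cov-2"], ["infect"]]
print(query_vector(toy_query, toy_embeddings, size=2))
```

Averaging per cluster first means that a term with many aliases does not outweigh a term with a single form.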


In [26]:
# Pre-calculate the averaged word vector of each paragraph
corpus_fragment_wv = deque()
for article in tqdm(corpus_frag_tokenized):
    article_wv = deque()
    for fragment in article:
        article_wv.append(average_fasttext(fragment))
    corpus_fragment_wv.append(article_wv)

100%|██████████| 9786/9786 [01:30<00:00, 108.61it/s]


### 4. Text matching utility

#### 4.1 Query Preprocessing

In [27]:
# Clean the query: drop punctuation and stop words, then expand each remaining term with its knowledge-base aliases
def preprocessed_query(query):
    stop_words = set(nlp.Defaults.stop_words)

    # Lemmatise the query and drop punctuation and stop-word tokens
    query_tokenized_unclean = [word.lemma_ for word in nlp(query) if not word.is_punct and word.text not in stop_words]

    # Merged noun chunks can still contain stop words (e.g. "the virus"); remove them word by word
    query_tokenized = []
    for phrase in query_tokenized_unclean:
        kept = " ".join(word for word in phrase.split() if word not in stop_words)
        if kept:
            query_tokenized.append(kept)

    # Replace each term with the cluster of its aliases when it appears in the knowledge base
    query_clean = []
    for word in query_tokenized:
        cluster = [word]
        for key, value in knowledge_base.items():
            if word == key or word in value:
                cluster = list(set(value) | {key})
                break
        query_clean.append(cluster)

    return query_clean

In [28]:
# An example showing the preprocessed query
preprocessed_query("How many people have been infected by covid in 2022?")

[['Humankind',
  'human',
  'humankind',
  'People',
  'homosapien',
  'person',
  'people',
  'human being'],
 ['infect'],
 ['Covid-',
  'covid',
  'COVID-19',
  'cOVID-19',
  'COVID',
  'Covid',
  'sars-cov-2',
  'Covid-19',
  'COVID-',
  'SARS-CoV-2',
  'COvid-19',
  'CoViD-19',
  'covid19',
  'covid-19'],
 ['2022']]
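Query expansion scans the whole knowledge base for every query term. With a knowledge base extended by Wikidata links, it may be worth inverting it once into an alias lookup table so that each term is resolved with a single dictionary access. The following is a sketch on toy data; `build_alias_lookup`, `expand_terms` and `toy_kb` are hypothetical names.

```python
def build_alias_lookup(knowledge_base):
    """Map every label and alias to its full cluster (label plus aliases)."""
    lookup = {}
    for label, aliases in knowledge_base.items():
        cluster = frozenset(aliases) | {label}
        for name in cluster:
            # Keep the first cluster seen, so the earliest label wins
            lookup.setdefault(name, cluster)
    return lookup


def expand_terms(terms, lookup):
    """Replace each term by its alias cluster, or keep it as a singleton."""
    return [sorted(lookup.get(term, {term})) for term in terms]


# Toy knowledge base (a small subset of made-up entries)
toy_kb = {"covid": {"COVID-19", "SARS-CoV-2"}, "serological test": {"ELISA"}}
toy_lookup = build_alias_lookup(toy_kb)
print(expand_terms(["infect", "COVID-19", "2022"], toy_lookup))
```

Terms that are not in the lookup, such as "infect" and "2022" here, pass through unchanged as single-element clusters.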

#### 4.2 Search potential relevant articles

In [29]:
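# A function to find the articles that mention the entities in the query
def matched_article(query):
    # One set of candidate articles per query term (union over the term's aliases)
    sets = []
    for cluster in query:
        articles = set()
        for word in cluster:
            # .get() avoids adding empty entries to the defaultdict for unknown words
            articles.update(inverted_index.get(word, []))
        sets.append(articles)

    if not sets:
        return set()

    # Prefer articles that mention every term; fall back to articles that mention any term
    filtered_set = set.intersection(*sets)
    if not filtered_set:
        filtered_set = set.union(*sets)
    return filtered_set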

In [52]:
# An example demonstrating the matched_article function
len(matched_article(preprocessed_query("How many people have been infected by covid in 2022?")))

8666
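The matching logic is an intersection with a union fallback: articles that mention every query term are preferred, and only when no such article exists are articles mentioning any term returned. A small sketch with hand-made sets shows both branches; `candidate_articles` is a hypothetical name.

```python
def candidate_articles(term_sets):
    """Articles matching every term, or any term when no article matches all."""
    if not term_sets:
        return set()
    common = set.intersection(*term_sets)
    return common if common else set.union(*term_sets)


strict = candidate_articles([{1, 2, 3}, {2, 3}, {3, 9}])   # article 3 mentions every term
fallback = candidate_articles([{1, 2}, {7}])               # no overlap, so the union is used
print(strict, fallback)
```

The fallback can return a large candidate set, so the later ranking step does most of the filtering in that case.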

#### 4.3 Rank relevant article snippets

In [31]:
# A function to score the similarity between the query and each paragraph of the matched articles (returns the top 5)
def rank(query):
    query = preprocessed_query(query)
    match_doc = matched_article(query)
    score = {}
    query_wv = average_fasttext_query(query)
    for index in match_doc:
        for frag_index, fragment_wv in enumerate(corpus_fragment_wv[index]):
            score[(index, frag_index)] = cosine_similarity(fragment_wv, query_wv)
    
    # Take the 20 highest-scoring paragraphs, then keep at most 5 that are longer than 200 characters
    top_keys = heapq.nlargest(20, score, key=score.get)
    filter_keys = []
    for key in top_keys:
        if len(dataset["corpus"][key[0]][key[1]]) > 200 and len(filter_keys) < 5:
            filter_keys.append(key)
    sorted_score = [(key, score[key]) for key in filter_keys]

    return sorted_score

In [46]:
# An example query in using rank()
# output: ((article_index, snippet_index), cosine_similarity_score)
rank("When was covid first discovered")

[((1395, 36), array([[0.82893701]])),
 ((3745, 79), array([[0.82649166]])),
 ((1033, 0), array([[0.81627276]])),
 ((599, 0), array([[0.81011801]])),
 ((1844, 0), array([[0.80652202]]))]
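The ranking step combines three ideas: cosine similarity between the query vector and each paragraph vector, a top-k selection with `heapq.nlargest`, and a minimum-length filter on the paragraph text. The sketch below reproduces that flow on toy data with plain Python lists instead of the FastText vectors; `cosine`, `top_snippets` and the toy dictionaries are hypothetical names, and the scores are plain floats rather than 1x1 arrays.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def top_snippets(query_vec, fragment_vecs, texts, pool=20, keep=5, min_chars=200):
    """Return up to `keep` (key, score) pairs among the `pool` best long-enough fragments."""
    scores = {key: cosine(vec, query_vec) for key, vec in fragment_vecs.items()}
    best = heapq.nlargest(pool, scores, key=scores.get)
    long_enough = [key for key in best if len(texts[key]) > min_chars]
    return [(key, scores[key]) for key in long_enough[:keep]]


# Toy data: keys are (article_index, fragment_index) pairs as in rank()
toy_vecs = {(0, 0): [1.0, 0.0], (0, 1): [0.6, 0.8], (1, 0): [0.0, 1.0]}
toy_texts = {(0, 0): "x" * 300, (0, 1): "short", (1, 0): "y" * 300}
ranked = top_snippets([1.0, 0.0], toy_vecs, toy_texts)
print(ranked)
```

Because the length filter is applied after the pool of 20 is chosen, fewer than 5 snippets can be returned when many of the best-scoring paragraphs are short.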

In [49]:
# A demo of a relevant snippet
dataset["corpus"][1033][0]

'In December 2019, a new coronavirus disease named COVID-19 by the World Health Organization broke out in Wuhan, China. COVID-19 is caused by an unknown coronavirus called severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2). As of May 13, 2021, COVID-19 had a cumulative total of 159,949,065 confirmed cases with over 3,322,439 deaths . With the continuing threat of SARS-CoV-2 to global health, it is urgent to develop effective prevention and treatment strategies for SARS-CoV-2 transmission [, , , , , ].'

### 5. Question-Answering System

#### 5.1 Summarisation

In [35]:
# A function to return a dataframe of the top 5 relevant snippets and their information
def passage(query):
    search_result = rank(query)
    title = []
    cord_uid = []
    snippet = []
    for (article_index, snippet_index), score in search_result:
        title.append(dataset["title"][article_index])
        cord_uid.append(dataset["cord_uid"][article_index])
        snippet.append(dataset["corpus"][article_index][snippet_index])
        
    search_dataframe = pd.DataFrame({"Title": title, "Cord_uid": cord_uid, "snippet": snippet})

    return search_dataframe


In [36]:
# Load BART fine-tuned for summarisation (facebook/bart-large-cnn)
summariser_tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")

summariser_model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")

In [37]:
# Summarise the top 5 snippets
def summarise_passage(passages):
    merged_snippet =  " ".join(passages)
    summarizer = pipeline("summarization", model=summariser_model, tokenizer=summariser_tokenizer)
    # Only the first 1024 characters are summarised; slicing leaves shorter text unchanged
    summarised_snippet = summarizer(merged_snippet[:1024], max_length=250, min_length=30, do_sample=False)[0]["summary_text"]
    return summarised_snippet
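Cutting the merged snippets at a fixed character count can end the input in the middle of a sentence or word. One possible refinement, shown here only as a sketch, is to cut at the last sentence boundary that fits within the limit. The helper name `truncate_at_sentence` is hypothetical, and treating ". " as the sentence boundary is a simplifying assumption.

```python
def truncate_at_sentence(text, limit=1024):
    """Cut text to at most `limit` characters, preferring to end on a full stop."""
    if len(text) <= limit:
        return text
    clipped = text[:limit]
    last_stop = clipped.rfind(". ")
    # Fall back to the hard cut when no sentence boundary is found
    return clipped[: last_stop + 1] if last_stop != -1 else clipped


toy_text = "First sentence. Second sentence. Third sentence that is cut"
print(truncate_at_sentence(toy_text, limit=40))
```

The result could be passed to the summariser in place of the fixed-length slice; whether this improves the summaries would need to be checked on real queries.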


In [38]:
# A demo of summarisation
question = "When was covid first discovered?"
search_df = passage(question)
print("Top 5 Snippet:")
print(search_df["snippet"].tolist())
print("")
print("Summarised Snippet:")
print(summarise_passage(search_df["snippet"].tolist()))

Your max_length is set to 250, but you input_length is only 245. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=122)


Top 5 Snippet:
['COVID-19 Dataset: Proteomic data of the COVID-19 patients sera for diagnosis. The outbreak of the COVID-19 pandemic has brought a global crisis. Recently, the sera proteomic data from some COVID-19 cases have been released (Shen et al. 2020) . We also apply MLA-GNN to the classification of COVID-19 patients, contributing to the understanding and auxiliary diagnosis of COVID-19. The dataset contains 34 COVID-19 patients and 36 non-COVID-19 patients, with 791 proteins identified in the sera samples.', 'Predict drug target reactions. There is currently no evidence to support that these drugs may be effective in discouraging Covid-19 . In addition, atazanavir seems effective in covid-19 by demonstrating comprehensive, high-binding antiretroviral approaches to six Covid-19 proteins, including 3C-like proteins and complex replication components.', 'In December 2019, a new coronavirus disease named COVID-19 by the World Health Organization broke out in Wuhan, China. COVID-19 

#### 5.2 BERT models comparison

This section summarises the comparison of four models: BERT fine-tuned on SQuAD 1.0, BERT fine-tuned on SQuAD 2.0, DistilBERT fine-tuned on SQuAD 1.0 and DistilBERT fine-tuned on SQuAD 2.0. The fine-tuning is presented in separate notebooks named BERT_Squad_1_0.ipynb, BERT_Squad_2_0.ipynb, DistilBERT_Squad_1_0.ipynb and DistilBERT_Squad_2_0.ipynb respectively.

BERT fine-tuned on SQuAD 2.0 has the highest accuracy of the four. For the full comparison, please refer to the notebook named "BERT_models_comparison.ipynb".

#### 5.3 Question-Answering

This section loads the fine-tuned BERT model. The fine-tuning was done in a separate notebook named "BERT_Squad_2_0.ipynb". Please refer to that notebook for more information.

In [39]:
# Load the fine-tuned BERT (SQuAD 2.0) tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("bert_squad_2.0")

model = AutoModelForQuestionAnswering.from_pretrained("bert_squad_2.0")

In [40]:
# A function to extract the answer from the summarised snippet with BERT
def question_answering(query, summerised_passage):
    question_answerer = pipeline("question-answering", tokenizer=tokenizer, model=model)
    answer_dict = question_answerer(question=query, context=summerised_passage)

    return answer_dict["answer"]
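The question-answering pipeline returns a dictionary that also contains a confidence `score` next to the `answer`, but the function above discards it. A possible extension, sketched here under assumptions, is to return a fallback message when the score is low. The helper name `answer_or_fallback`, the threshold of 0.2 and the fallback text are all hypothetical, and the toy dictionaries only imitate the shape of the pipeline output.

```python
def answer_or_fallback(answer_dict, min_score=0.2, fallback="No confident answer found."):
    """Return the answer text, or a fallback message when the score is too low."""
    answer = answer_dict.get("answer", "").strip()
    if not answer or answer_dict.get("score", 0.0) < min_score:
        return fallback
    return answer


# Toy dictionaries shaped like the pipeline output (values are made up)
confident = {"score": 0.91, "start": 3, "end": 16, "answer": "December 2019"}
uncertain = {"score": 0.04, "start": 0, "end": 5, "answer": "Wuhan"}
print(answer_or_fallback(confident))
print(answer_or_fallback(uncertain))
```

A suitable threshold would have to be chosen empirically, for example from the scores observed on questions with known answers.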

In [50]:
# A demo of the question-answering system
question = "When was covid first discovered?"
result_dataframe = passage(question)
summerised_passage = summarise_passage(result_dataframe["snippet"].tolist())
answer = question_answering(question, summerised_passage)
answer

Your max_length is set to 250, but you input_length is only 245. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=122)


'December 2019'

### 6. Simple user interface

In [42]:
# Convert a text snippet into speech with the gTTS library, save the audio as a file and play it in an audio widget
# (intended to run in a separate thread so that Jupyter stays responsive while the audio plays in the background)

def play_snippet(text_snippet, language='en'):
    """
    Converts the text_snippet into an audio file using the gTTS library

    Parameters:
    text_snippet: string to convert into speech
    language: language code passed to gTTS (default 'en')
    
    Displays the audio file in a widget
    """
    # Pass the text and language to gTTS; slow=False makes it read at normal speed
    myobj = gTTS(text=text_snippet, lang=language, lang_check=False, slow=False)
    
    # Save the converted audio as an mp3 file named audio_text.mp3
    myobj.save("audio_text.mp3")
    
    # Path of the saved audio file
    audio_file = 'audio_text.mp3'
    
    # creating the audio widget with autoplay=True
    audio_widget = Audio(filename=audio_file, autoplay=True)
    
    # display the audio-widget
    display(audio_widget)

In [43]:
# Function to answer a query and return the source paper details together with the answer
def find_snippet(query):
    """
    Retrieves the relevant passages for a query and extracts a short answer from them

    Parameters:
    query: the question string to answer
    
    Returns:
    paper_details_df : a dataframe with the columns "Title" to "Cord_uid" of the retrieved papers
    answer : the answer string extracted by BERT
    """
    # retrieve the relevant passages for the query
    result_dataframe = passage(query)
    # summarise the retrieved snippets (search_df is not defined in this function and is
    # assumed to be a global dataframe populated during retrieval)
    summerised_passage = summarise_passage(search_df["snippet"].tolist())
    
    # extract a short answer from the summarised passage using BERT
    answer = question_answering(query, summerised_passage)
    
    return result_dataframe.loc[:, "Title": "Cord_uid"], answer


In [44]:
# creating query text widget
query_text = widgets.Text(
    value='',
    description='Query Text:',
    continuous_update=False
)


# creating sound checkbox widget (Checkbox has no continuous_update option)
sound_checkbox = widgets.Checkbox(
    value=False,
    description='Sound:'
)

# creating paper details checkbox widget
paper_checkbox = widgets.Checkbox(
    value=False,
    description='Paper details:'
)


output_text = widgets.Output()
output_table = widgets.Output()


# creating submit button widget
submit_button = widgets.Button(
    description='Submit'
)

#-----------------------------------------------------------------------------------------------------
# Function to handle the button click event and trigger the search for an answer
def on_button_clicked(button):
    """
    Searches all documents for an answer to the query after the submit button is pressed

    Parameters:
    button: the button that triggered the event
    
    Displays:
    the query and the extracted answer in output_text
    an autoplaying audio widget if sound_checkbox is ticked
    the source paper details in output_table if paper_checkbox is ticked
    """
    
    # grabbing the values from the forms displayed
    query = query_text.value
    sound_option = sound_checkbox.value
    paper_option = paper_checkbox.value
    
    # Displaying the system status: 0 = processing, 1 = processed
    system_status(0)
    
    # calling find_snippet function with the query from the user
    paper_details_df, answer = find_snippet(query)
    
    # short pause so that the status change stays visible to the user
    time.sleep(5)
    query_answer_list = [query,answer]
    
    # Displaying the system status: 0 = processing, 1 = processed
    system_status(1,query_answer_list)
    
    # Display results in table
    with output_table:   
        
        # clearing previous results
        clear_output(wait=True)
        
        # Sound_option status check
        if sound_option:
            processed_text_for_speech = "Query Entered:"+query+"\n Answer Extracted :"+answer
            
            # generate the audio and display the autoplaying widget inside output_table;
            # playback itself happens in the browser, so no separate thread is needed
            play_snippet(processed_text_for_speech)
        # give source details for the answer if requested
        if paper_option:
            
            print("The answer is extracted from paper:")
            # Applying the style to the dataframe: 
            # Defining the table style to make the column headers bold
            table_style = [{'selector': 'th','props': [('font-weight', 'bold')]}]

            # Applying the table style to the DataFrame
            styled_df = paper_details_df.style.set_table_styles(table_style).hide(axis='index')

            # Displaying the styled dataframe to make it easier to visualize
            display(styled_df)        
    
    
#----------------------------------------------------------------------------------

# Function to display the system status
def system_status(stage, query_answer=None):
    from html import escape  # escape user text before embedding it in HTML
    
    status = ["Sound: Enabled!","Sound: Disabled!"]
    message = ["\n\t\t\tProcessing .....","\n\t\t\tProcessed!"]
    
    # index 0 (enabled) if the sound checkbox is ticked, otherwise 1 (disabled)
    choice = 0 if sound_checkbox.value else 1
    
    # inside the output_text widget
    with output_text:
        # clearing previous output_text
        clear_output(wait=True)
        
        # displaying system status
        print(status[choice]+message[stage])
        if query_answer:
            # query underlined, answer in bold on the next line, both in italics
            formatted_message = ("<div style='text-align:left'><i><u>"+escape(query_answer[0])+"</u><br>"
                                 "<b>&nbsp;&emsp;&emsp;&emsp;"+escape(query_answer[1])+"</b></i></div>")
            display(HTML(formatted_message))

#-------------------------------------------------------------------------------------------------------------
# function to handle submit button
submit_button.on_click(on_button_clicked)

#------------------------------------------------------------------------------------------------------------
# displaying widgets for the utility
display(query_text)
display(sound_checkbox)
display(paper_checkbox)
display(submit_button)
display(output_text)
display(output_table)


Text(value='', continuous_update=False, description='Query Text:')

Checkbox(value=False, description='Sound:')

Checkbox(value=False, description='Paper details:')

Button(description='Submit', style=ButtonStyle())

Output()

Output()

Your max_length is set to 250, but you input_length is only 245. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=122)
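
A note on background work in a widget callback: should the text-to-speech step ever be moved to a background thread (the notebook imports `threading` for this purpose), `threading.Thread` expects the callable itself as `target`, with its arguments supplied through `args`. Writing `target=func(x)` runs `func` immediately in the calling thread and hands its return value to `Thread`. The sketch below demonstrates the difference with a stand-in function. `slow_task` and `calls` are hypothetical names, and no audio is generated.

```python
import threading

calls = []  # records which thread executed each call

def slow_task(label):
    """Stand-in for a long-running job such as text-to-speech conversion."""
    calls.append((label, threading.current_thread().name))

# Eager call: slow_task runs here, in the calling thread, and Thread receives None
eager = threading.Thread(target=slow_task("eager"), name="worker-eager")
eager.start()
eager.join()

# Deferred call: the new thread itself invokes slow_task("deferred")
deferred = threading.Thread(target=slow_task, args=("deferred",), name="worker-deferred")
deferred.start()
deferred.join()

for label, thread_name in calls:
    print(label, "ran in", thread_name)
```

Output displayed from a background thread may not be captured by the `Output` widget that was active when the thread was started, so this would need to be checked in the notebook before relying on it.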


### 7. References

[1] https://github.com/egerber/spacy-entity-linker

[2] https://github.com/facebookresearch/fastText/#references

[3] https://huggingface.co/facebook/bart-large-cnn

[4] https://huggingface.co/bert-base-uncased