# Introduction

**Submitted by Merck & Co., Inc.** 

The CORD-19 Research Database is a growing collection of 50k+ scientific papers relating to information on variants of coronavirus. This enormous dataset poses challenges in terms of finding the right kind of information quickly for the purpose of finding useful and actionable information on today's COVID-19 pandemic. 

Here we focus efforts on the "Vaccines" task due to our specific interests in helping to find the best treatment possible and due to this task being particularly difficult compared to most others because of the need for domain expertise and knowledge. However, we note that our approach is applicable to all tasks.

The data science community has gathered together in order to build tools and technologies for the sole purpose of searching, ranking, extracting and aggregating results from this database. However, we found that the extraction + aggregation step of this challenge poses the greatest difficulty. 

We noticed that even using state of the art NLP technology can only bring us partially to the solution. Therefore, our approach focuses on leveraging existing knowledge + NLP to get us 80% of the way there, and then utilizing human review to help with the final step of aggregating and summarizing the most useful information. Our philosophy is that using human review can get us to the last 20% of the process.

**NOTE: It is highly recommended to use a GPU for this notebook to speed up some of the BERT related processes**

## Specific Technical Challenges
We summarize below the specific challenges we personally encountered and noticed among other submissions.
1. Search and rank of information is often limited to titles and information extraction is limited to abstract due to scalability issues
2. The BM25 ranking algorithm is a popular way to quickly search full text, but is heavily dependent on using the correct keywords, terms and the full body text being of good quality (when often it is not)
3. Deeper context from the full text is usually not available, often lacking the necessary supplementary information to understand the extracted information fully
4. Significant human review often not leveraged, resulting in systematic biases that may not be corrected or noticed. Additionally, question Answering or BERT based summarization is not feasible due to lack of corpus specific training and/or very slow inference times. Older non-deep learning methods do not output great results either.
5. Relationships between articles are often not considered, which could be an indicator of the quality and strength of the article results (especially if the article was heavily cited).

## Our Contribution Toward Resolving the Technical Challenges
1. Since titles can be misleading, we focus our search and rank on exclusively the abstract. We accomplish this by embedding each sentence of the abstract and doing a semantic similarity search via a pretrained BioBERT deep learning model on these sentences.
2. Like many others, we also use the BM25 algorithm to search the full text. However, we use the results of the previous step (similarity search on abstract sentences) to guide our BM25 query. We take the non-stopword tokens from the similarity results and feed them into the BM25 algorithm to search for the relevant pieces of full body text in order to provide deeper context. We also remove sentences from the full body that are less than 6 words, as they are likely to contain labelings that were accidentally extracted from the PDFs. We believe that using the same language from the abstract leads to more relevant BM25 results. 
3. We provide deeper context to our results by returning multiple results from the full body instead of just outputting the top result. This is achieved by taking the top 1.5 standard deviations of BM25 scores relative to the average score. Additionally, we identify mentioned antiviral agents or related terms and associate them with each literature to provide context on which antiviral agents are being studied. Finally, we create a **knowledge graph**, the first version of which identifies quickly papers that make references to others. In this way, we can see how many times a specific paper was cited in others.
4. We then do a human manual review of our query results to carefully aggregate and summarize the most useful information pertaining to the question. We believe this is the most necessary step that current approaches cannot do automatically. 
5. We create a knowledge graph that calculates the relationships between articles. Here we contribute to a first implementation on looking at which articles cite each other.

## Approach
Here, we propose a searching and ranking framework using a combination of BioBERT sentence embeddings and BM25 ranking algorithm to guide reviewers quickly to the relevant pieces of information relative to a particular query. We specifically use a **two-stage process** in order to do 1) an initial search on sentence embeddings on abstract content, and 2) subsequently perform BM25 search of the full text based on the copied language from the first stage results.

Specifically:
0. An initial exhaustive keyword search is done to filter on the 50k+ articles pertaining to vaccines and vaccine development based on prior knowledge.
1. User provides a query related to the question of interest (i.e. "Antiviral effects of Covid").
2. Sentence embeddings are generated for the user query via pre-trained BioBERT for sentence similarity.
3. All abstracts are then split into individual sentences and embedded via the same pre-trained BioBERT for sentence similarity.
4. Using cosine distances, the embedded query is then compared against the corpus of abstract sentences to find the most similar sentence.
5. All results above a pre-specified similarity threshold (0.65 is the default) are returned (**These are the stage-1 results**). If a known keyword is included in the query, an additional filtering step is done to gaurantee that these results contain that keyword.
6. The **stage 1** results are then word tokenized, processed, and then fed into the fast Okapi BM25 ranking algorithm to search for the relevant sections of the full text that may give deeper context to the initial results (**These are the stage-2 results**). We also include metadata on the number of times the found article was cited to give context on the importance of the article
7. The output is either printed or saved, and human review is performed on the results to pick and extract out the most relevant pieces of text for final aggregation and summarization.


## Future Direction
### Knowledge Graphs
*Why a knowledge graph* - One of the initial tools in our kit we wanted to explore was instantiating a knowledge graph to represent the myriad information encoded in the corpus of text. In our early review of previous submissions we noted that many approaches were treating articles as independent objects of information. Due to the fact that science is an iterative and progressive process that builds upon historical and recent works - we wanted to capture that context and apply it in our analysis of the corpus. An initial knowledge graph was explored and built representing a variety of different entities in the corpus (authors, articles, citations, text). However, due to technical and time constraints we opted for a 100% in-memory option using networkx on a subset on entities and articles to accomplish some representation of citations.

*Technical challenges* - Due to a lack of memory efficient graph storage methods in kaggle, a lot of time was spent exploring options that would work around this constraint. Almost a full library of code was written to encode and store a knowledge graph in sqlite. While progress was made and certain entities robustly represented and encoded, it was evident that any robust graph algorithm would not run acceptably on such a data structure so the approach was shelved. Any team considering such an option must think hard about recursive querying and efficiency in such a structure to achieve something workable.

*Future work* - Here are some of the things we're hoping a knowledge graph can help us explore in the future alongside other experiments to better answer the task at hand:

- Does this work belong to a series of works?
- Can we find the articles that represent the "supporting knowledge" for a given article?
- Can we determine unique work given the context of all other works?
- Is there a network of authors contributing to the same domain?
- Can we attribute scientific rigor to certain articles, authors and apply that "trust" in the final aggregation step of the solution?
- Can we represent articles with their body text and identify similar content in other articles to help find parallel / adjacent / orthogonal work?
- Can we represent articles by the language used to cite them?

### Question Answering on Full Body Text
Question Answering is often not scalable due to the slow inference terms and the sheer number of text available in the full body. However, given if the BM25 searching algorithm can roughly find related content relative to the query, we can in the future go a step further and generate questions based on the query to extract the exact piece of information.

### Summarization for Auto Report Generation
Often times multiple pieces of text in the results are outputted based on competing similarity scores. The manual review will need to compare the multiple outputs and choose which ones that actually pertain to answering the question well. In the future, we could use a series of deep learning summarization or sentence embedding models to summarize all the results and cluster them based on their similarity. In this way, we can lesson the reading time for reviewers, and even work toward auto-report generation capabilities if paired with question answering capabilities.

# Highlighted Results
Here, we present our highlighted findings. These highlights were found by using our approach to query on multiple questions pertaining to many aspects of the task. This gets us 80% to the answer. Then, we finish the last 20% by doing a careful human review of the results and compiling them in a summary.

## TASK: Effectiveness of drugs being developed and tried to treat COVID-19 patients
**Clinical/Observational Studies**
* Lopinavir and Ritonavir seem to be common antiviral agents against COVID-19. Studies often compare new drug effectiveness against Lopinavir and Ritonavir
* Favipriavir was shown to be more effective than Lopinavir/Ritonavir control arms. In a separate study, it was also shown to be more effective than arbidol control arm
* Hydroxychloroquine was shown to be effective for recoverying of pneumonia effects of COVID-19
* Azithromycin added to Hydroxychloroquine was shown to be significantly more efficient for virus elimination, possibly because azithromycin was shown to have similar effects as hydroxychloroquine 
* Danoprevir boosted by ritonavir was shown to be safe and well tolerated in all patients
* Early and short doses of a corticosteroid called methylprednisolone was shown to be effective in treatment COVID-19

**Notes based on Review Articles**
* There have been a number of reports stating that non-steroidal anti-inflammatory drugs (NSAIDs) and corticosteroids may exacerbate symptoms in COVID-19 patients. Proper use of low-dose corticosteroids may bring survival advantages for critically ill patients, but this treatment should be strictly performed.
* Although SARS-CoV-2 replication is not entirely suppressed by interferons, viral titers are decreased by several orders of magnitude. It may be useful in the early stages of infection

## TASK: Capabilities to discover a therapeutic (not vaccine) for the disease, and clinical effectiveness studies to discover therapeutics, to include antiviral agents.
**In Vitro Studies**
* Nelfinavir acts as an HIV Protease Inibitor
* Azithromycin and Ciprofloxacin have chloroquine effects and may act as alternatives to hydroxychloroquine/chloroquine
* Sofosbuvir, Tenofovir, and Alovudine are polymerases that block Sars-Cov-2 incorporation via RdRp
* Tenofovir and Emtricitabine terminates SARS-CoV-2 RdRp catalyzed reaction and can act as preventative treatments (PreP)
* Terfiflunomide and Leflunomide were shown to have solid antiviral reduction compared to favipiravir, a drug that is already undegoing clinical trials
* Darunavir was shown to have no activity against SARS-COV-2 during In Vitro studies

**Simulations and Modeling**
* Atazanavir, Efavirenz, Dolutegravir, and Saquinavir were shown to be potential candidates of treating COVID-19 based on simulations and modeling

**Notes based on Review Articles**
* Niclosamide was able to inhibit SARS-CoV replication and totally abolished viral antigen synthesis at a concentration of 1.56 μM
* Tocilizumab is a blocker of IL-6R, which can effectively block IL-6 signal transduction pathway

## TASK: Assays to evaluate vaccine immune response and process development for vaccines, alongside suitable animal models
In querying the CORD-19 corpus for diagnostic assays to evaluate immune responses to COVID-19, it is apparent that there are several assays that are currently being developed or ameliorated. Among these are the gold standards such as ELISA and PCR which have been leveraged both alone, and as per the results below, in combination with other antibody testing methodologies such as IgG/IgM to gain more confidence in CoV-2 infection diagnosis. Among the results that were returned based on the query, there were few assays that seem to be novel in their nature, such as sereological colirometric assays and fluorescence immunochromatographic tests. 

The results summarized below were manually curated from over 100 hits for the search query used. The curation mainly took into consideration what virus the assay was addressing, as several hits were referencing viruses not in the CoV family. Additional refinement of these results will include queries to better assess the suitable animal models as well as expand to gain insight into assays from a process development standpoint as opposed to simply a diagnostic one. 
### ELISA
* A newly-developed ELISA assay for IgM and IgG antibodies against N protein of SARS-CoV-2 were used to screen the serums of admitted hospital patients with confirmed or suspected SARS-CoV-2 infection. Of the 238 patients, 194 (81.5%) were detected to be antibody (IgM and/or IgG) positive, which was significantly higher than the positive rate of viral RNA (64.3%). There was no difference in the positive rate of antibody between the confirmed patients (83.0%, 127/153) and the suspected patients (78.8%, 67/85) whose nucleic acid tests were negative.

### IgG/IgM combined test
* The sensitivity and specificity of this ease-of-use IgG/IgM combined test kit were adequate, plus short turnaround time, no specific requirements for additional equipment or skilled technicians, all of these collectively contributed to its competence for mass testing. At the current stage, it cannot take the place of SARA-CoV-2 nucleic acid RT-PCR, but can be served as a complementary option for RT-PCR. The combination of RT-PCR and IgG-IgM combined test kit could provide further insight into SARS-CoV-2 infection diagnosis.

### Serological assay
* Because most patients have rising antibody titres 10 days after symptom onset, collection of serial serum samples in the convalescent phase would be more useful. Serum IgG amounts can rise at the same time or earlier than those of IgM against SARS-CoV-2. Posterior oropharyngeal saliva samples are a non-invasive specimen more acceptable to patients and health-care workers. Unlike severe acute respiratory syndrome, patients with COVID-19 had the highest viral load near presentation, which could account for the fast-spreading nature of this epidemic. This finding emphasises the importance of stringent infection control and early use of potent antiviral agents, alone or in combination, for high-risk individuals. Serological assay can complement RT-qPCR for diagnosis.

### Antibodies assays
* Combined use of antibodies assay and qRT-PCR at the same time was able to improve the sensitivities of pathogenic-diagnosis, especially for the throat swabs group at the later stage of illness. Moreover, most of these cases with undetectable viral RNA in throat swabs specimens at the early stage of illness were able to be IgM/IgG seropositive after 7 days.

### Gold immunochromatography assay
* The colloidal gold immunochromatography assay (GICA) is a rapid diagnostic tool for novel coronavirus disease 2019 (COVID-19) infections. However, with significant numbers of false negatives, improvements to GICA are needed.

### Reverse transcription loop-mediated isothermal amplification (RT-LAMP) assay
* This assay detected SARS-CoV-2 in the mean (±SD) time of 26.28 ± 4.48 min and the results can be identified with visual observation. 

### dPCR assays
* dPCR could be a confirmatory method for suspected patients diagnosed by RT-qPCR. Furthermore, dPCR is more sensitive and suitable for low virus load specimens from the both patients under isolation and those under observation who may not be exhibiting clinical symptoms. 
* Another study showed the overall accuracy of dPCR for clinical detection was 96.3%. dPCR was shown to be powerful in detecting asymptomatic patients and suspected patients. Digital PCR is capable of checking the negative results caused by insufficient sample loading by quantifying internal reference gene from human RNA in the PCR reactions. Multi-channel fluorescence dPCR system (FAM/HEX/CY5/ROX) is able to detect more target genes in a single multiplex assay, providing quantitative count of viral load in specimens, which is a powerful tool for monitoring COVID-19 treatment.

### Novel luciferase immunosorbent assays (LISA)
* The S1-, RBD-, and NP-LISAs were more sensitive than the NTD- and S2-LISAs for the detection of anti-MERS-CoV IgG. These LISAs proved their applicability and reliability for detecting anti-MERS-CoV IgG in samples from camels, monkeys, and mice, among which the RBD-LISA exhibited excellent performance."


### Rapid serological colorimetric test
* Rapid serological test showed a sensitivity of 30% and a specificity of 89% with respect to the standard assay but, interestingly, these performances improve after 8 days of symptoms appearance. After 10 days of symptoms the predictive value of rapid serological test is higher than that of standard assay. It may detect previous exposure to the virus in currently healthy persons.

### Fluorescence immunochromatographic assay
* Fluorescence immunochromatographic assay experiments were done for detecting nucleocapsid protein of SARS-CoV-2 in nasopharyngeal swab samples and urine within 10 minutes, and evaluated its significance in diagnosis of COVID-19. We measured nucleocapsid protein in nasopharyngeal swab samples in parallel with the nucleic acid test. 100% of nucleocapsid protein positive and negative participants accord with nucleic acid test for same samples.

### RT-qPCR
* Using flu and RSV clinical specimens,researchers have collected evidence that the RT-qPCR assay can be performed directly on patient sample material from a nasal swab immersed in virus transport medium (VTM) without an RNA extraction step. This approach was used to test for the direct detection of SARS-CoV-2 reference materials spiked in VTM. The data, while preliminary, suggest that using a few microliters of these untreated samples still can lead to sensitive test results. If RNA extraction steps can be omitted without significantly affecting clinical sensitivity, the turn-around time of COVID-19 tests and the backlog we currently experience can be reduced drastically.

### Novel in vivo cell-based assay
* Reseachers developed a novel in vivo cell-based assay for examining this interaction between the N-protein and packaging signal RNA for SARS-CoV, as well as other viruses within the coronaviridae family. The N-protein specifically recognizes the SARS-CoV packaging signal with greater affinity compared to signals from other coronaviruses or non-coronavirus species. These results describe, for the first time, in vivo evidence for an interaction between the SARS-CoV N-protein and its packaging signal RNA, and demonstrate the feasibility of using this cell-based assay to further probe viral RNA-protein interactions in future studies.

## TASK: Efforts to develop prophylaxis clinical studies and prioritize in healthcare workers

Prophylaxis describes the efforts and measures to prevent infections and diseases. We summarize some of our findings on these efforts to develop prophylaxis clinical studies by categorizing them into general findings, findings related to α1-AR antagonists, and findings related to risk varied by prognostic factors.

### General findings

* Risk-adapted treatment strategy may be a useful tool for the treatment of COVID-19 patients. This strategy is associated with significant clinical manifestations alleviation and clinical imaging recovery.
* Harmonization of clinically heterogeneous endpoints within and between trials can lead to faster decision making and better management of COVID-19. 
* Early detection of elevations in serum CRP, combined with a clinical COVID-19 symptom presentation may be used as a surrogate marker for presence and severity of disease.
* There are multiple parameters of the clinical course and management of the COVID-19 that need optimization. A hindrance to this development is the vast amount of misinformation present due to scarcely sourced manuscript preprints and social media.
* Emphasize evidence-based medicine to evaluate the frequency of presentation of various symptoms to create a stratification system of the most important epidemiological risk factors for COVID-19.
* Vitamin C (L-ascorbic acid) has a pleiotropic physiological role, but there is evidence supporting the protective effect of high dose intravenous vitamin C (HDIVC) during sepsis induced ARDS.
* Epigenetic control of the ACE2 gene might be a target for prevention and therapy in COVID-19. 

### α1-AR antagonists
Preliminary findings offer a rationale for studying α1-AR antagonists in the prophylaxis of patients with COVID-19 cytokine storm syndrome (CSS) and acute respiratory distress syndrome (ARDS).
* Mortality of COVID-19 seems driven by acute respiratory distress syndrome (ARDS)
* Emerging evidence suggests that a subset of COVID-19 is characterized by the development of a CSS. 
* Pre-clinical mouse data suggests that α1-AR antagonists may be a candidate for the treatment of COVID-19.
* Using the Truven Health MarketScan Research DataBase, male men who were prescribed α1-AR antagonists in the previous year had lower odds of the composite of need for invasive mechanism ventilation and mortality compared to non-users (AOR 0.80, 95% CI 0.69-0.94, p=0.008) 

### Relative Risk of COVID-19 for Patients Varies by Prognostic Factors
COVID-19 patient outcomes vary by patient characteristics and are important considerations for COVID-19 prophylaxis. Potential important factors include interleukin-6, B lymphocyte proportion, lactate, and CD8+ T cells. 
* Compared with patients without pneumonia, those with pneumonia were 15 years older and had a higher rate of hypertension, higher frequencies of having a fever and cough, and higher levels of interleukin-6, B lymphocyte proportion, and low account of CD8+ T cells. 
* Multivariate Cox regression analysis indicated that circulating interleukin-6 and lactate independently predicted COVID-19 progression, with a hazard ratio (95%CI) of 1.052 (1.000-1.107) and 1.082 (1.013-1.155), respectively. 




# Prerequisites

We use the following main libraries:
* **Transformers** - The wildly popular Transformers library from HuggingFace provides use with the ability to download pretrained model and use them out the box for NLP tasks such as sentence similarity
* **Sentence Transformers** - An extension of the Transformer library with the specific goal of training and providing models for sentence similarity tasks
* **NTLK** - A popular NLP and text processing library, which we use to process, clean, and tokenize the text data (For Stage 2 results)
* **rank_bm25** - A library to perform the BM25 ranking algorithm
* **langdetect** - A library to help filter out non-english articles
* **networkx** - NetworkX is a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks.

In [1]:
%%capture
!pip install sentence-transformers
!pip install transformers --upgrade
!pip install langdetect
!pip install rank_bm25
!pip install networkx

In [2]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from pathlib import Path, PurePath #Easy path directory
import json # Reads json
import os
import multiprocessing as mp

import transformers # NLP task pipeline
from transformers import pipeline, AutoModelWithLMHead, AutoModelForQuestionAnswering, AutoTokenizer, AutoModel, AutoTokenizer, AlbertConfig, AlbertForQuestionAnswering, AlbertTokenizer # For downloading pretrained models
from sentence_transformers import SentenceTransformer, models # For sentence embeddings trained on semantic similarity

from langdetect import detect
import re

import scipy
import statistics 
from rank_bm25 import BM25Okapi
import nltk
from nltk.corpus import stopwords

# For network analysis on citations
import networkx as nx
from itertools import chain
from tqdm.auto import tqdm
tqdm.pandas()

nltk.download("punkt")
nltk.download('stopwords')
pd.set_option('display.max_columns', None)  

[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# Creating a List of Keywords

We leverage prior and continuous knowledge about vaccines to do an initial keyword filter of the documents. These keywords containg words pertaining to treatments and antiviral agents that may be vaccine candidates. Additionally, other vaccine related keywords such as "assay", "phrophylaxis", and "antibody dependent enhancement" were included. During our review, if we noticed new keywords that may be helpful in doing an initial filter, then we update the keyword list.

Initially, most of these keywords were obtained via a continuously updated report from the National University of Singapore (https://sph.nus.edu.sg/covid-19/research/).

**NOTE: We may update this step to instead leverage recent release of specter embeddings in the official CORD-19 database to do our initial filtering**

In [3]:
keywords = [
    'remdesivir',
    'azithromycin',
    'ciprofloxacin',
    'lopinavir',
    'ritonavir',
    'interferon',
    'chloroquine',
    'hydroxychloroquine',
    'darunavir',
    'cobicistat',
    'emtricitabine',
    'nelfinavir',
    'tenofovir',
    'saquinavir',
    'azuvudine',
    'favipiravir',
    'umifenovir',
    'oseltamivir',
    'baloxavir',
    'methylprednisolone',
    'ribvarin',
    'sofosbuvir',
    'beclabuvir',
    'galidesivir',
    'simeprevir',
    'nitazoxanide',
    'niclosamide',
    'naproxen',
    'clarithromycin',
    'minocyclinethat',
    'human monoclonal antibody',
    'tocilizumab',
    'sarilumab',
    'leronlimab',
    'foralumab',
    'camrelizumab',
    'ifx-1',
    'ifx',
    'arbidol',
    'fingolimod',
    'brilacidin',
    'sirolimus',
    'danoprevir',
    'rintatolimod',
    'cynk-001',
    'cynk',
    'tmprss2',
    'jak',
    'zinc',
    'quercetin',
    'convalescent plasma',
    'nanoviricide',
    'corticosteroids',
    'bevacizumab',
    'bxt-25',
    'bxt',
    'angiotension',
    'rhace2',
    'pirfenidone',
    'thalidomide',
    'brohexine hydrochloride',
    'dehydroandrographolide succinate',
    'antibody dependent enhancement',
    'antibody-dependent enhancement',
    ' ade ',
    'prophylaxis',
    'prophylactic',
    'vaccine',
    'assay',
    'elisa',
    'th1',
    'th2',
    'elispot',
    'cytometry',
    'ctc'
]


# Create Helper Functions

We define our functions that aid in data processing, text cleaning, and BM25 search 

* **get_deeper_context()** - This function performs the BM25 algorithm on the full text against a query of tokenized words and returns the relevant sections of text that meets a threshold criteria
* **get_sorted_indices()** - This function takes the BM25 scores, finds all the BM25 scores that meet a threshold crtieria, and returns the indices that map to the section of the full text
* **strip_characters(), clean(), tokenize(), preprocess()** - These functions all roll up to preprocess() in order to remove unnecessary punctuation, lower case the text, tokenize the text, and remove stopwords.
* **get_keywords()** - This function finds all the keywords in the title and the full text and returns them.


In [4]:
# Functions

# This is a list of English stopwords that do not contribute to useful information and helps to generate useful tokens for searching
english_stopwords = list(set(stopwords.words('english')))

# This function performs the BM25 algorithm on the full text against a query of tokenized words and returns the relevant sections of text that meets a threshold criteria
def get_deeper_context(row):
    # Grab variables from the input data row
    sha = row['sha']
    results = row['result']
    
    # Sometimes outputs will contain multiple results, that is separated by the "\n\n" delimiter. So we make sure to split the results and process them separately.
    results = results.split("\n\n")
    
    # We take each results and preprocess them so that useful words are tokenized. These will help perform our BM25 search
    tokenized_results = [preprocess(result) for result in results]
    
    # Create variables to be stored later
    paragraphs = []
    tokenized_paragraphs = []
    candidate_paragraphs = []
    candidates = ''
    
    # Scanning through data to read the relevant file based on the Sha and acquire the full text
    for path in Path(directory).rglob('*.json'):
        if sha in path.name:
            data = json.loads(open(path.absolute()).read())
            # Grabs the full body text (which is a list of paragraphs)
            body_text = data['body_text']

            # Loops through each paragraph and appends them into a list
            for i, paragraph in enumerate(body_text):
                text = " ".join(paragraph['text'].split()).replace(" ,",",")
                paragraphs.append(text)
                tokenized_paragraphs.append(preprocess(text))
            break

    try: 
        # Feed the paragraphs into the BM25 API
        bm25 = BM25Okapi(tokenized_paragraphs)
        # Loop through the tokenized results and get BM25 scores to see which paragraphs were most relevant to the query.
        for tokenized_result in tokenized_results:
            doc_scores = bm25.get_scores(tokenized_result) # BM25 scores
            candidate_paragraphs.append(get_sorted_indices(doc_scores)) # Saving the indices of relevant paragraphs into a list

        # Deduping and sorting the list by index number
        candidate_paragraphs = [item for sublist in candidate_paragraphs for item in sublist]
        candidate_paragraphs = list(set(candidate_paragraphs))
        candidate_paragraphs.sort()

        # Combine the relevant paragraphs into a single string.
        for index in candidate_paragraphs:
            candidates = candidates+"Paragraph: "+str(index)+"\n"
            candidates = candidates+paragraphs[index]+"\n \n"
    except:
        candidates = "NA"
        
    # Saving the results into their own column
    row['context'] = candidates

    return row

# This function takes the BM25 scores, produces the mean and standard deviation of the score, and outputs the relevant scores only if it is 1.5 standard deviations way from the mean
def get_sorted_indices(l):
    std = statistics.stdev(l) # Standard deviation
    mean = statistics.mean(l) # Mean
    threshold = mean+(std*1.5) # 1.5 standard deviation treshold
    max_score = max(l) # Max scire
    
    indices = []

    # Looping through the scores and applying the threshold
    for index, score in enumerate(l):
        if score >= threshold:
            indices.append(index)
    
    indices.sort()
    
    return indices

# This function strips characters such as apostrophes, etc.
def strip_characters(text):
    t = re.sub('\(|\)|:|,|;|\.|’|”|“|\?|%|>|<', '', text)
    t = re.sub('/', ' ', t)
    t = t.replace("'",'')
    return t

# This function calls the strip_characters() function and also lowercases all text
def clean(text):
    t = text.lower()
    t = strip_characters(t)
    return t

# This function takes a text and cleans and tokenizes it, while removing the stopwords.
def tokenize(text):
    words = nltk.word_tokenize(text)
    return list(set([word for word in words 
                     if len(word) > 1
                     and not word in english_stopwords
                     and not (word.isnumeric() and len(word) is not 4)
                     and (not word.isnumeric() or word.isalpha())] ))

# This is the wrapper function that incorporates the previous functions to clean and tokenize text
def preprocess(text):
    t = clean(text)
    tokens = tokenize(t)
    return tokens


# This function takes in a piece of text and extracts out the pre-specified keywords found in the text.
def get_keywords(row):
    found_keywords = []
        
    # Looping through each row and column of the data
    for col in row.iteritems():
        # Checking if the column is the title or abstract
        if ("title" in col[0]) | ("abstract" in col[0]):
            text = col[1].lower() # lowercass
            text = " ".join(text.split()) # removes useless whitespace
            
            # Loops through the known keywords and detects if it is found in the text
            for keyword in keywords:
                if keyword in text:
                    found_keywords.append(keyword)
            
    # De-duplicates the keywords found
    found_keywords = set(found_keywords)
    
    # If no keywords found, return NA
    if len(found_keywords) == 0:
        row['keywords'] = 'NA'
    else:
        row['keywords'] = "; ".join(found_keywords)
        
    return row


# Initial Filtering of Articles via Keyword Search

We perform the following filtering, cleaning and processing steps:
1. Read the metadata CSV
2. Filter for articles newer than 2019-11-01 (COVID-19 date)
3. Filter articles that don't contain full text
4. Filter articles that don't contain titles or abstracts
5. We found two articles that broke many formattings and encodings of our report, and filtered them out
6. Clean the text data by removing unnecessary white space
7. Filter for titles or abstracts that contain the pre-specified keyword terms and for each article, save the relevant keywords found
8. Drop non-english articles

In [5]:
# Read the metadata
all_data = []
directory = '/kaggle/input/CORD-19-research-challenge/'
metadata = pd.read_csv(directory+"metadata.csv")
print("Total number of articles")
print(metadata.shape)

# Filter for articles newer than 2019-11-01
date = '2019-11-01'
metadata = metadata[metadata['publish_time'] >= date]
print("Filter for articles after date: "+date)
print(metadata.shape)

# Filter articles that don't have full text
metadata = metadata[metadata['has_pdf_parse']]
print("Filter for full text")
print(metadata.shape)

# Filter articles that don't have title or abstract
metadata = metadata[metadata['title'].str.len() > 0]
metadata = metadata[metadata['abstract'].str.len() > 0]
print("Filter for non-empty title and abstract")
print(metadata.shape)

# Filter certain articles that seem to be of bad quality and messes up export formating
metadata = metadata[metadata['sha'] != 'a5293bb4f17ad25a72133cdd9eee8748dd6a4b8d']
metadata = metadata[metadata['sha'] != 'b30770ae30b35cdfaf0a173863e74e93edbb0329']

# Clean text data
metadata['title'] = metadata['title'].apply(lambda x: " ".join(x.split()))
metadata['abstract'] = metadata['abstract'].apply(lambda x: " ".join(x.split()))

# Filter for titles and abstracts that have mention of one of the keywords
keyword_query = "|".join(keywords)
metadata = metadata[metadata['title'].str.contains(keyword_query, flags=re.IGNORECASE, regex=True) | 
                    metadata['abstract'].str.contains(keyword_query, flags=re.IGNORECASE, regex=True)]

# Finds the keywords found in each article and makes a column out of it
metadata = metadata.apply(get_keywords, axis=1)
print("Filter for terms relating to treatments")
print(metadata.shape)

# Drop non-english articles
for index, row in metadata.iterrows():
    title = row['title']
    lang = detect(title)
    if lang != 'en':
        metadata.drop(index, inplace=True)

print("Filter for English articles")
print(metadata.shape)

# Resets the index
metadata = metadata.reset_index(drop=True)

Total number of articles
(51078, 18)
Filter for articles after date: 2019-11-01
(6029, 18)
Filter for full text
(4312, 18)
Filter for non-empty title and abstract
(3457, 18)
Filter for terms relating to treatments
(820, 19)
Filter for English articles
(796, 19)


# Getting popularity of article via Citations
Leveraging the amount of times an article was cited is a useful way of getting a rough ranking on the articles for manual review.

## Approach
To get these numbers and maintain a result that doesn't blow up our RAM allocation we built a simple strategy to get citations for our relevant articles.

1. To maintain memory efficiency, process the json articles in chunks. At any given time we won't have more than :chunk_size: articles read into memory as dictionaries.
2. Using as many processes as we can, search each articles json dict for citations.
3. With the relevant articles in hand from the above process, extract only those citations pointing to an article in our list. We could speed up this check by including the dates that our relevant articles were published and eliminating those from the checking set that were published after the article we're extracting citations from. If publishing dates are distributed normally this could speed up the process significantly.
4. Finally, combine the sharded results and build a directed graph using networkx.
5. With the relevant citation graph in hand, produce a function to lookup the number of times an article was cited using its title.

You'll notice that even with this strategy, the graph takes up a significant amount of memory. If we want to build a more robust knowledge graph in the future we need a way around this. Our team explored using sqlite but abandoned the approach due to time. It may be something we explore more for submission 2.

In [6]:
def open_json(file_path):
    """
    Helper function to open json file
    """
    with open(file_path, 'r') as f:
        json_dict = json.load(f)
    return json_dict


def json_path_generator(data_path=os.path.abspath("/kaggle/input"), limit=None):
    """
    Helper function to get all the paths of json files in the input directory.
    """
    return_files = []
    for dirname, _, filenames in os.walk(data_path):
        for filename in filenames:
            if filename[-5:] == ".json":
                return_files.append(os.path.join(dirname, filename))
                if limit is not None and type(limit) == int:
                    if len(return_files) >= limit:
                        return return_files
    return return_files


def get_json_dicts(paths, progress_bar=None):
    """
    Helper function to open a list of paths as json dicts. Careful about memory usage here.
    Optionally takes a tqdm bar as input to show progress of loading.
    """
    json_dicts = []
    # (I) Max of 2 or number of cpus minus 1, then min of (I) or the number of paths. If limit is used and its small,
    # avoid excessive pool sizes.
    process_num = min(max(2, os.cpu_count()), len(paths))
    with mp.Pool(process_num) as pool:
        for result in pool.imap_unordered(open_json, paths):
            json_dicts.append(result)
            if progress_bar is not None:
                progress_bar.update(1)

    return json_dicts


# Get articles cited by our articles
def get_article_citations(article_dict, select_articles=None):
    """
    Function to extract set of (from, to) citation edges from articles.
    :select_articles: A set of articles to check so that only citations recorded are those
                      where the 'to' node is in the set. Used in this block to limit our citation graph
                      to only those citations of the articles we've deemed relevant.
    """
    article_title = article_dict.get("metadata", {}).get("title", "")
    article_citations = set()
    # Get citations and their ids
    bib_entries = article_dict.get("bib_entries", {})
    for entry in bib_entries:
        entry_dict = bib_entries.get(entry, {})
        entry_title = entry_dict.get("title", "")
        if select_articles is not None and type(select_articles) == set:
            # Check that article being cited is in our list of articles to look for
            if entry_title in select_articles:
                article_citations.add((article_title, entry_title))
        else:
            article_citations.add((article_title, entry_title))
    return list(article_citations)


def get_article_citations_meta(arg_list):
    """
    Meta function of get_article_citations so that we can parallelize with multiple arguments.
    First argument is a single arguments dictionary.
    Second argument is the optional set of article titles to limit citations.
    """
    return get_article_citations(arg_list[0], arg_list[1])


def divide_chunks(l, n): 
    """
    Function shameless taken from stackoverflow to split a list (l) into sublists of size n.
    """
    # looping till length l 
    for i in range(0, len(l), n):  
        yield l[i:i + n]


def process_all_articles(tasks: list, function, progress_bar=None):
    """
    A wrapper function to call a single function over a set of tasks with a multiprocessing pool.
    """
    results = []
    process_num = min(max(2, os.cpu_count()), len(tasks))
    with mp.Pool(process_num) as pool:
        for result in pool.imap_unordered(function, tasks):
            results.append(result)
            if progress_bar is not None:
                progress_bar.update(1)
    return results


def add_citation_edges(graph: nx.DiGraph, edges: list):
    """
    Function to record edges in the networkx DiGraph object.
    """
    all_citation_edges = list(set(chain.from_iterable(edges)))
    graph.add_edges_from(all_citation_edges)


def build_citation_graph(filter_articles=None, paths=None, limit=None, chunk_size=5000):
    """
    Function to build citation graph from beginning to end.
    :filter_articles: Set of article titles to limit citations to. 
                      Citations will only be recorded if the 'to' article's title is in the set.
    :paths: Subset of paths to operate over. If None operate over all.
    :limit: Can limit number of paths to this given number.
    :chunk_size: Number of article dicts to hold in memory at once to process. 5000 seems like a decent choice.
    """
    # Get paths of articles
    if paths is None:
        paths = json_path_generator(limit=limit)
    # Split the paths into chunks for memory efficiency
    chunked_paths = list(divide_chunks(paths, chunk_size))
    # Build the citation graph
    graph = nx.DiGraph()
    functions = [
        get_article_citations_meta
    ]
    function_progress_bar = tqdm(total=len(functions), leave=False, position=1, desc="Function progress on chunk")
    if limit is None:
        task_num = chunk_size
    else:
        task_num = min(chunk_size, limit)
    task_progress_bar = tqdm(total=task_num, leave=False, position=2, desc="Task progress bar")
    all_results = list()
    for paths in tqdm(chunked_paths, leave=False, position=0, desc="Chunk progress"):
        task_progress_bar.reset(total=len(paths))
        path_dicts = get_json_dicts(paths)
        tasks = [[x, filter_articles] for x in path_dicts]
        function_progress_bar.reset()
        for func in functions:
            func_name = func.__name__
            function_progress_bar.set_description("Calling " + func_name)
            results = process_all_articles(tasks, func, task_progress_bar)
            # Combine list of sets
            result_edges = list(set(chain.from_iterable(results)))
            all_results.append(result_edges)
            function_progress_bar.update(1)
    # Update the graph object
    add_citation_edges(graph, all_results)
    function_progress_bar.close()
    task_progress_bar.close()
    print("Done")
    return graph


relevant_article_citation_graph = build_citation_graph(set(metadata.title.tolist()), chunk_size=5000)

HBox(children=(FloatProgress(value=0.0, description='Function progress on chunk', max=1.0, style=ProgressStyle…

HBox(children=(FloatProgress(value=0.0, description='Task progress bar', max=5000.0, style=ProgressStyle(descr…

HBox(children=(FloatProgress(value=0.0, description='Chunk progress', max=12.0, style=ProgressStyle(descriptio…

Done


In [7]:
def get_number_citations(article_title: str, citation_graph: nx.DiGraph) -> int:
    """
    Function to get the number of citations received by an article.
    :article_title: Title of article to check for numbre of citations received.
    :citation_graph: nx.DiGraph instance with edges denoting the number of citations
                     built with the orientation (citing article, cited article).
    """
    num_citations = citation_graph.in_degree(article_title)
    if type(num_citations) is not int:
        return 0
    return num_citations

# Calculating the number of times an article was cited in each of the articles we filtered for
metadata['number_citations'] = metadata.title.apply(lambda title: get_number_citations(title, relevant_article_citation_graph))

# Saving the initial filtering in a file
metadata.to_pickle('/kaggle/working/metadata.pkl')
metadata.to_csv('/kaggle/working/metadata.csv', index=False)

# Printing out most popular citations in our filter search
metadata.sort_values('number_citations', ascending=False).head(10)[['title', 'number_citations']]

Unnamed: 0,title,number_citations
634,Middle East respiratory syndrome,273
660,Detection of 2019 novel coronavirus (2019-nCoV...,64
66,Comparative therapeutic efficacy of remdesivir...,34
697,SARS-CoV-2 Cell Entry Depends on ACE2 and TMPR...,31
376,The novel coronavirus 2019 (2019-nCoV) uses th...,27
311,Hydroxychloroquine and azithromycin as a treat...,23
728,Potent binding of 2019 novel coronavirus spike...,20
713,Prophylactic and therapeutic remdesivir (GS-57...,19
641,Severe acute respiratory syndrome coronavirus ...,15
784,"Structure, Function, and Antigenicity of the S...",13


## Embed Sentences within Abstracts

Here we take all the abstracts we found and tokenize them into individual sentences. We also filter out short sentences (sentences less than 6 words) as a cleaning step to filter out and labels.

We download a pre-trained BioBERT model from the Transformers library and load the model into the SentenceTransformer class in order to be able to output sentence embeddings of text optimized for semantic similarity comparison.

Finally, we take all our sentences and produce sentence embeddings using the pretrained BioBERT model optimized for semantic similarity.

**NOTE: Recommend to use GPU to significantly speed up this process.**

In [8]:
# Create folder to store out BioBERT model
if not os.path.exists('/kaggle/working/model'):
    os.makedirs('/kaggle/working/model')
    
# Make dataframe of sentences from abstract
sent_dict = {'sha':[],'sentence':[]}

# Loop through our filtered list from the metadata
for index, row in metadata.iterrows():
    sha = row['sha']
    abstract = row['abstract']
    
    # Take the abstract and tokenize them on the sentence level
    sentences = nltk.tokenize.sent_tokenize(abstract)
    
    # Loop through the abstract sentences
    for sentence in sentences:
        # Make sure sentence is at least 6 words (to filter out useless labelings or headings)
        sentence_split = sentence.split()
        if len(sentence_split) > 5:
            sent_dict['sha'].append(sha)
            sent_dict['sentence'].append(sentence)

# Convert our list of abstract sentences to a dataframe
df_sentences = pd.DataFrame(sent_dict)
df_sentences.head()

# Download and setup the model
tokenizer = AutoTokenizer.from_pretrained("gsarti/biobert-nli")
model = AutoModelWithLMHead.from_pretrained("gsarti/biobert-nli")

# Initialize and save the model
model.save_pretrained("/kaggle/working/model")
tokenizer.save_pretrained("/kaggle/working/model")
embedding = models.BERT("/kaggle/working/model",max_seq_length=128,do_lower_case=True)
pooling_model = models.Pooling(embedding.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=True,
                               pooling_mode_cls_token=False,
                               pooling_mode_max_tokens=False)

model = SentenceTransformer(modules=[embedding, pooling_model])
model.save("/kaggle/working/model")
encoder = SentenceTransformer("/kaggle/working/model")

# Perform the sentence embedding conversion
sentences = df_sentences['sentence'].tolist()
sentence_embeddings = encoder.encode(sentences)
df_sentences['embeddings'] = sentence_embeddings

# Save the sentence embeddings dataframe.
df_sentences.to_pickle('/kaggle/working/sentence_embeddings.pkl')


HBox(children=(FloatProgress(value=0.0, description='Downloading', max=1017.0, style=ProgressStyle(description…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=213450.0, style=ProgressStyle(descripti…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=112.0, style=ProgressStyle(description_…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=136.0, style=ProgressStyle(description_…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=433288887.0, style=ProgressStyle(descri…




Batches: 100%|██████████| 912/912 [00:26<00:00, 34.64it/s]


# Quick Check on Keywords Found

We perform a quick check to see which keywords have been successfully identified and extracted. We found that among our list, 50+ of the keywords were found.

In [9]:
# Some analysis
found_keywords = []
for keyword in metadata['keywords']:
    found_keywords.append(keyword.split("; "))

found_keywords = set([item for sublist in found_keywords for item in sublist])

print(found_keywords)
print("Number of keywords found: "+str(len(found_keywords)))

{'convalescent plasma', 'saquinavir', 'arbidol', 'elispot', 'antibody dependent enhancement', 'ciprofloxacin', 'cobicistat', 'tocilizumab', 'vaccine', 'sofosbuvir', 'remdesivir', 'assay', 'nitazoxanide', 'elisa', 'corticosteroids', 'methylprednisolone', 'quercetin', 'oseltamivir', 'ctc', 'prophylaxis', 'bevacizumab', 'azithromycin', 'tenofovir', 'umifenovir', 'niclosamide', 'emtricitabine', 'prophylactic', 'baloxavir', 'nelfinavir', ' ade ', 'fingolimod', 'sirolimus', 'hydroxychloroquine', 'cytometry', 'ritonavir', 'lopinavir', 'favipiravir', 'th2', 'darunavir', 'chloroquine', 'zinc', 'antibody-dependent enhancement', 'human monoclonal antibody', 'jak', 'th1', 'danoprevir', 'galidesivir', 'interferon', 'tmprss2'}
Number of keywords found: 49


# Query and Results

Finally, we are ready to query and find information. We achieve this by wrapping a our approach in a function called **execute_query()**.

This function takes in several parameters: 
1. The query
2. The metadata from our intial filtering
3. The sha number of a particular article for the option of filtering on a single literature
4. Similarity score threshold, which is defaulted to 0.65.

Within the **execute_query()** function, the steps are:
1. Take the query and produce a sentence embeddings from it
2. If there is no query and a sha is provided, then set the similarity threshold to 0, as it means we want to output the full text of a single article
3. Calculate the cosine distance between the query and the abstract sentences to produce simlarity scores
4. Grab all the similarity results above the specified similarity threshold (**Stage 1**)
5. Use the keywords from the results, process them, and feed them to the BM25 algorithm in order to extract the relevant pieces of text from the full body to provide a richer and deeper context of the results (**Stage 2**)
6. Print out or save the results

In [10]:
# Make querys against the abstract sentence embeddings to identify candidates

def execute_query(query, metadata, sha = '', similarity_threshold = 0.65, print_output=True):
    
    # 1. Take the query and produce a sentence embeddings from it
    query = [query]
    query_embedding = encoder.encode(query)
    
    # See if the query contains any of our keywords
    query_keywords = list(set(query[0].split()) & set(keywords))
    
    # 2. If there is no query and a sha is provided, then set the similarity threshold to 0, as it means we want to output the full text of a single article
    similarity_threshold = 0 if ((query[0] == "") and len(sha)>0) else similarity_threshold

    # 3. Calculate the cosine distance between the query and the abstract sentences to produce simlarity scores
    distances = scipy.spatial.distance.cdist(query_embedding, sentence_embeddings, "cosine")[0]
    results = zip(range(len(distances)), distances) # Pair the indices with the cosine distance
    results = sorted(results, key=lambda x: x[0]) # Sort them by index (this is needed to match the cosine scores with the results)

    # 4. Grab all the similarity results above the specified similarity threshold  
    result_dict = {'sha':[],'result':[]}
    
    # Loop through the results of the cosine distance calculations
    for idx, distance in results:
        # The similarity score is 1-distance (so that higher score = better)
        similarity_score = 1-distance
        
        # Get the Sha and the sentence from our sentence dataframe
        sentence = df_sentences['sentence'].iloc[idx].strip()
        sha_id = df_sentences['sha'].iloc[idx].strip()

        # If the similarity score of the sentence is below the threshold, then ignore it
        if similarity_score < similarity_threshold:
            continue
            
        # If a single sha id was provided, make sure to skip all the other articles that don't match that id
        if len(sha) > 0 and sha_id !=sha:
            continue

        # If known keywords were found in the query, then make sure that the abstract contains that keyword
        if len(query_keywords) > 0:
                # Get the abstract from the sha id
                abstract = metadata[metadata['sha'] == sha_id]['abstract'].item().lower()
                # Determine if the keyword is in the abstract, if so then add that result
                if any(keyword in abstract for keyword in query_keywords):
                    result_dict['sha'].append(sha_id)
                    result_dict['result'].append(sentence)
        # If instead a single Sha id was provided, then then if a query was provivded or not and return the results
        elif len(sha) > 0 and sha_id == sha:
            result_dict['sha'].append(sha_id)
            # If a query was not provided, then the result is blank (and later on the full body text will be returned)
            if query[0] == "":
                result_dict['result'].append("")
            # Otherwise, return the relevant result
            else:
                result_dict['result'].append(sentence)
        # If no known keywords or single sha was identified, just return all available matches
        else:
            result_dict['sha'].append(sha_id)
            result_dict['result'].append(sentence)
    
    # Convert the stage 1 results to a dataframe
    temp_result = pd.DataFrame(result_dict)

    # 5. Use the keywords from the results, process them, and feed them to the BM25 algorithm in order to extract the relevant pieces of text from the full body to provide a richer and deeper context of the results (**Stage 2**)
    
    # If there are multiple results found, then merge them together in the same string using the "\n\n" delimeter (So that context from all results are obtained)
    temp_result = temp_result.groupby('sha')['result'].apply("\n\n".join).reset_index()
    # Apply the get_deeper_context() method to all the results in order to grab the deeper context
    temp_result = temp_result.apply(get_deeper_context, axis=1)
    # Save all the rsults in a dataframe
    report = pd.merge(temp_result, metadata[['sha','title','publish_time','abstract','keywords','journal','number_citations']], how='inner')
    report['query'] = query[0]

    # If the print_output paramter is specified (default = True), then print the results in the console
    if print_output:
        # Print results
        print("======= REPORT =========")

        for index, row in report.iterrows():
            title = row['title']
            sha = row['sha']
            result = row['result']
            num_citations = row['number_citations']
            abstract = row['abstract']
            context = row['context']
            keywords_found = row['keywords']
            publish_date = row['publish_time']

            print("Query: "+query[0])
            print("Sha: "+sha+"\n")
            print("Title: "+title+"\n")
            print("Published Date: "+publish_date+"\n")
            print("Number of times cited: "+str(num_citations)+"\n")
            print("Antiviral Related Terms: "+keywords_found+"\n")
            if (len(result.replace("\n","")) > 0):
                print("Results: \n"+result+"\n")
            print("Abstract: ")
            print(abstract+"\n")
            print("Context: ")
            print(context+"\n")
            print("----------------")
    
    return report
                

## Executing on a broad query

Here we demonstrate an example of a query. 

We execute on the query "antiviral treatments for COVID-19" to answer one of the task. The query will default to printing the output so that the user can scroll through the results and do a quick evaluate. The results can also be saved into a dataframe and outputted via csv.


In [11]:
execute_query("antiviral treatments for COVID-19", metadata, similarity_threshold=0.70)

Batches: 100%|██████████| 1/1 [00:00<00:00, 58.67it/s]


Query: antiviral treatments for COVID-19
Sha: 0370abacf3b212bd6bded8f23c8c904000f6c2e6

Title: Risk-adapted Treatment Strategy For COVID-19 Patients

Published Date: 2020-03-27

Number of times cited: 0

Antiviral Related Terms: methylprednisolone

Results: 
The objective of this study is to investigate the short-term effect of risk-adapted treatment strategy on patients with COVID-19.

Abstract: 
Abstract Background There are no clear expert consensus or guidelines on how to treat 2019 coronavirus disease (COVID-19). The objective of this study is to investigate the short-term effect of risk-adapted treatment strategy on patients with COVID-19. Methods We collected the medical records of 55 COVID-19 patients for analysis. We divided these patients into mild, moderate and severe groups, and risk-adapted treatment approaches were given according to the illness severity. Results Twelve patients were in mild group and 22 were in moderate group (non-severe group, n=34), and 21 patients wer

Unnamed: 0,sha,result,context,title,publish_time,abstract,keywords,journal,number_citations,query
0,0370abacf3b212bd6bded8f23c8c904000f6c2e6,The objective of this study is to investigate ...,"Paragraph: 1\nHowever, until now, there are no...",Risk-adapted Treatment Strategy For COVID-19 P...,2020-03-27,Abstract Background There are no clear expert ...,methylprednisolone,International Journal of Infectious Diseases,0,antiviral treatments for COVID-19
1,06a1002f9fbea7179ac3572843f66b14568af6e4,2019-nCov Mpro is a potential drug target to c...,Paragraph: 16\n2019-nCov caused more than 80 d...,Nelfinavir was predicted to be a potential inh...,2020-01-28,Abstract2019-nCov has caused more than 80 deat...,nelfinavir,,4,antiviral treatments for COVID-19
2,0c9d951acb01cb541671b3065b882bbcb61f9523,"In addition, registered trials investigating t...",Paragraph: 6\nThere is no current evidence fro...,"The epidemiology, diagnosis and treatment of C...",2020-03-28,"Abstract In December 2019, the outbreak of the...",convalescent plasma; hydroxychloroquine; vacci...,International Journal of Antimicrobial Agents,0,antiviral treatments for COVID-19
3,188e7ff1e260864c89f266b5597de26d69a84660,WHO has named the novel coronavirus disease as...,Paragraph: 0\nThe World Health Organization (W...,Clinical trials on drug repositioning for COVI...,2020-03-20,The World Health Organization (WHO) was inform...,oseltamivir; hydroxychloroquine; arbidol; rito...,Rev Panam Salud Publica,0,antiviral treatments for COVID-19
4,188f3e97042155ac1709aa8b74c0755760c3b50d,This review focuses on the effects of these dr...,Paragraph: 1\nThe current review focuses on th...,Associations between immune-suppressive and st...,2020-03-27,BACKGROUND: Cancer and transplant patients wit...,jak,Ecancermedicalscience,0,antiviral treatments for COVID-19
5,19a08cc1423d6006b2251bdcfd142f88db7002e7,Many drugs showed potential for COVID-19 therapy.,Paragraph: 1\nThis promotes the importance of ...,Genetic Profiles in Pharmacogenes Indicate Per...,2020-03-30,Background: The coronavirus disease 2019 (COVI...,lopinavir; chloroquine; interferon,,0,antiviral treatments for COVID-19
6,1f5c1597a84ed1d4f84c488cd19098a091a3d513; 29d1...,"Indeed, age and disease severity may be correl...",,"Asymptomatic carrier state, acute respiratory ...",2020,Since the emergence of coronavirus disease 201...,chloroquine; remdesivir,"Journal of Microbiology, Immunology and Infection",0,antiviral treatments for COVID-19
7,23e7355b5e4e0209f64c9d8d5772092a53b72686,Aims: Studies have indicated that chloroquine ...,Paragraph: 2\nThe recent publication of result...,Efficacy of hydroxychloroquine in patients wit...,2020-03-30,Aims: Studies have indicated that chloroquine ...,hydroxychloroquine; chloroquine,,1,antiviral treatments for COVID-19
8,2f547947bf87380c7fab13ba2c663bbbe9e643ec,"Here, we report the epidemiological and virolo...",Paragraph: 0\nThe recent outbreak of COVID-19 ...,"COVID-19: Epidemiology, Evolution, and Cross-D...",2020-03-21,The recent outbreak of COVID-19 in Wuhan turne...,vaccine,Trends in Molecular Medicine,0,antiviral treatments for COVID-19
9,5998e7fb99b5c58f7563017ad679fbe2f9974a0c,OBJECTIVE: To determine the relative impact of...,"Paragraph: 3\nNotwithstanding, despite its in ...",Chloroquine and hydroxychloroquine for the tre...,2020-04-08,OBJECTIVE: To determine the relative impact of...,hydroxychloroquine; chloroquine,,0,antiviral treatments for COVID-19


## Executing on a paper specific query

Sometimes, we want to be able to zoom in on a specific detail and a more detailed exploration. 

In this sample, we find a paper about a hydroxychloroqine study and want to view the full paper details. To accomplish this, we enter the sha number into the parameters and leave the query blank. Note, however, that you can still enter keywords in the query section to get a targeted search

In [12]:
execute_query("", metadata, sha='23e7355b5e4e0209f64c9d8d5772092a53b72686')

Batches: 100%|██████████| 1/1 [00:00<00:00, 19.09it/s]


Query: 
Sha: 23e7355b5e4e0209f64c9d8d5772092a53b72686

Title: Efficacy of hydroxychloroquine in patients with COVID-19: results of a randomized clinical trial

Published Date: 2020-03-30

Number of times cited: 1

Antiviral Related Terms: hydroxychloroquine; chloroquine

Abstract: 
Aims: Studies have indicated that chloroquine (CQ) shows antagonism against COVID-19 in vitro. However, evidence regarding its effects in patients is limited. This study aims to evaluate the efficacy of hydroxychloroquine (HCQ) in the treatment of patients with COVID-19. Main methods: From February 4 to February 28, 2020, 62 patients suffering from COVID-19 were diagnosed and admitted to Renmin Hospital of Wuhan University. All participants were randomized in a parallel-group trial, 31 patients were assigned to receive an additional 5-day HCQ (400 mg/d) treatment, Time to clinical recovery (TTCR), clinical characteristics, and radiological results were assessed at baseline and 5 days after treatment to evalu

Unnamed: 0,sha,result,context,title,publish_time,abstract,keywords,journal,number_citations,query
0,23e7355b5e4e0209f64c9d8d5772092a53b72686,\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n,Paragraph: 0\nCoronaviruses are enveloped posi...,Efficacy of hydroxychloroquine in patients wit...,2020-03-30,Aims: Studies have indicated that chloroquine ...,hydroxychloroquine; chloroquine,,1,


## Executing on many queries

We execute on several queries and save the output. We leave it up to the user to explore which queries are most appropriate.

Some of our queries include:
* antiviral treatments for COVID-19
* {insert_drug_name} for treatment of COVID-19 (i.e. hydroxychloroquine for treatment of COVID-19)
* preventative clinical studies for COVID-19
* anti-viral prophylaxis studies for COVID-19
* prophylaxis studies for COVID-19
* diagnostic assay for COV response
* immunoassay for antibody or cell response
* ELISA or flow cytometry assay for cov
* mouse or ferret model for assay evaluation

**Note: the code below only executes subset of queries that we evaluated.**

In [13]:
queries = [
    'antiviral treatments for COVID-19',
    'hydroxychloroquine for treatment of COVID-19',
    'preventative clinical studies for COVID-19',
    'anti-viral prophylaxis studies for COVID-19',
    'prophylaxis studies for COVID-19',
    'diagnostic assay for COV response',
    'immunoassay for antibody or cell response',
    'ELISA or flow cytometry assay for cov',
    'mouse or ferret model for assay evaluation'
]

for query in queries:
    result = execute_query(query, metadata, print_output = False)
    result.to_csv("/kaggle/working/query - "+query+".csv", index=False)

Batches: 100%|██████████| 1/1 [00:00<00:00, 67.50it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 74.33it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 72.68it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 64.57it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 70.38it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 63.31it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 72.27it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 75.97it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 75.80it/s]


# Manual Review and Summary
We go through the query results manually to make sure relevant information is found and compiled. Overall, we spent half a day reviewing these queries and generating a compiled output. 

We save this and it can be found in our Github Repository https://github.com/Weilin37/CORD-19-Kaggle-Challenge/tree/master/Report. Each report is formatted differently due to the work being split up amongst different people. Additionally, not all tasks are suitable for a standardized output.

We present our summaries in the beginning of our notebook, sectioned "Highlighted Results". Please take a look!

Thank you for your time and please let us know your comments and questions!