<a href="https://colab.research.google.com/github/castorini/anserini-notebooks/blob/master/covid19_related_article_search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Related Article Search on COVID-19 Dataset

This notebook demonstrates related article search on the [COVID-19 Open Research Dataset](https://pages.semanticscholar.org/coronavirus-research) (release of 2020/03/27) from AI2.
That is, the "query" is an article of interest, and the system's task is to retrieve articles that are related or similar.

Search is performed on contextual representations produced by BioBERT, which encodes each article as a dense vector instead of a sparse bag-of-words representation.
Here, we use techniques described in the following paper to perform approximate nearest-neighbor search _directly using Lucene_.

> Tommaso Teofili and Jimmy Lin. [Lucene for Approximate Nearest-Neighbors Search on Arbitrary Dense Vectors.](https://arxiv.org/abs/1910.10208) arXiv:1910.10208, October 2019.

Our rationale for doing this, as opposed to using a dedicated search library like [hnswlib](https://github.com/nmslib/hnswlib), is to provide a uniform stack for eventual deployment in an end-to-end search application (which is the argument made in the above paper).
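To give a feel for how dense vectors can live in an inverted index, here is a rough sketch of the "fake words" idea from the paper, which we take to be what the `-encoding fw -fw.q 80` flags in the command further down refer to. Each vector dimension becomes a made-up term, and the quantized feature value becomes that term's frequency. The function names, the token format `f<i>`, and the handling of negative values are assumptions made for illustration; this is not Anserini's actual analyzer.

```python
import math
from collections import Counter


def fake_words_encode(vector, q=80):
    """Encode a dense vector as a list of 'fake word' tokens.

    Dimension i with value v contributes floor(v * q) copies of the
    token 'f<i>'. Negative or zero values contribute nothing in this
    sketch (an assumption; a real implementation may shift or split them).
    """
    tokens = []
    for i, value in enumerate(vector):
        count = max(int(math.floor(value * q)), 0)
        tokens.extend([f"f{i}"] * count)
    return tokens


def term_frequency_dot(tokens_a, tokens_b):
    """Dot product of the term-frequency vectors of two token lists."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    return sum(counts_a[t] * counts_b[t] for t in counts_a)


# Toy vectors, chosen so that the quantization is exact with q=4.
a = [0.5, 0.0, 0.25]
b = [0.25, 0.5, 0.25]
tokens_a = fake_words_encode(a, q=4)
tokens_b = fake_words_encode(b, q=4)
print(tokens_a)
print(term_frequency_dot(tokens_a, tokens_b))
```

With these toy inputs the term-frequency dot product equals the original dot product scaled by `q * q`, which is why term matching over fake words can stand in for vector similarity. A larger `q` preserves more precision at the cost of longer "documents". Lucene's actual ranking function adds its own weighting on top, so real scores will not be a plain dot product.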

We're separately working on releasing the code for building the index, but here we provide some initial results.

Setup...

In [0]:
%%capture
%cd
!apt-get install maven -qq >& /dev/null
!git clone https://github.com/castorini/anserini.git >& /dev/null
%cd anserini
!mvn clean package appassembler:assemble -q -Dmaven.javadoc.skip=true >& /dev/null

More setup...

In [0]:
%%capture
!pip install pyserini==0.8.1.0
!pip install transformers

import json
import os
import numpy
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"

Let's fetch the pre-built index for related article search, along with the index over titles and abstracts:

In [0]:
%%capture
!wget https://www.dropbox.com/s/daivdi7ui7f2bdn/lucene-covid-scibert-fw-2020-03-27.tgz
!tar xvfz lucene-covid-scibert-fw-2020-03-27.tgz
!wget https://www.dropbox.com/s/j1epbu4ufunbbzv/lucene-index-covid-2020-03-27.tar.gz
!tar xvfz lucene-index-covid-2020-03-27.tar.gz

Let's first search for some interesting articles:

In [0]:
from pyserini.search import pysearch

searcher = pysearch.SimpleSearcher('lucene-index-covid-2020-03-27/')
hits = searcher.search('effectiveness of chloroquine for covid-19')

# Prints the first 5 hits
for i in range(0, 5):
    print(f'{i+1:2} {hits[i].docid:6} {hits[i].lucene_document.get("doi")} {hits[i].score:.5f} {hits[i].lucene_document.get("title")} ')

 1 45689  10.1016/j.jcrc.2020.03.005 11.67410 A systematic review on the efficacy and safety of chloroquine for the treatment of COVID-19 
 2 44547  10.1016/j.ijantimicag.2020.105938 10.94430 New insights on the antiviral effects of chloroquine against coronavirus: what to expect for COVID-19? 
 3 2059   10.5582/bst.2020.01056 9.79870 COVID-19: Real-time dissemination of scientific information to fight a public health emergency of international concern 
 4 2020   10.3760/cma.j.issn.1001-0939.2020.0019 9.52330 Expert consensus on chloroquine phosphate for the treatment of novel coronavirus pneumonia 
 5 42555  10.1016/j.dsx.2020.03.012 9.41580 Contentious Issues and Evolving Concepts in the Clinical Presentation and Management of Patients with COVID-19 Infectionwith Reference to Use of Therapeutic and Other Drugs used in Co-morbid Diseases (Hypertension, Diabetes etc) 


The first hit looks interesting. Let's search for related articles now.

Currently, related article search is only available as a command-line tool (direct Python integration in Pyserini is coming soon...):

In [0]:
!target/appassembler/bin/ApproximateNearestNeighborSearch -input sb.txt -path lucene-index-covid-scibert-fw -encoding fw -fw.q 80 -word 10.1016\/j.jcrc.2020.03.005 -depth 5

Loading model sb.txt
Reading index at lucene-index-covid-scibert-fw
5 nearest neighbors of '10.1016/j.jcrc.2020.03.005':
1. 10.1016/j.jcrc.2020.03.005 (33.373)
2. 10.1101/2020.02.20.20025593 (27.445)
3. 10.1371/journal.pmed.0030343 (27.387)
4. 10.1101/2020.03.10.20033522 (27.378)
5. 10.1371/journal.pmed.0040119 (27.213)
Search time: 550ms
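
Until the Python integration lands, the tool's output can be parsed rather than copied by hand. The following is a minimal sketch that assumes the output keeps the `rank. DOI (score)` layout shown above; `parse_neighbors` and `NEIGHBOR_LINE` are hypothetical names, not part of Anserini or Pyserini.

```python
import re

# Matches result lines such as "2. 10.1101/2020.02.20.20025593 (27.445)".
NEIGHBOR_LINE = re.compile(r"^\s*\d+\.\s+(\S+)\s+\(([-+]?\d+(?:\.\d+)?)\)\s*$")


def parse_neighbors(output_lines):
    """Return (doi, score) pairs from the tool's output lines, in rank order."""
    neighbors = []
    for line in output_lines:
        match = NEIGHBOR_LINE.match(line)
        if match:
            neighbors.append((match.group(1), float(match.group(2))))
    return neighbors


# The output shown above, pasted in as a string for illustration.
sample_output = """Loading model sb.txt
Reading index at lucene-index-covid-scibert-fw
5 nearest neighbors of '10.1016/j.jcrc.2020.03.005':
1. 10.1016/j.jcrc.2020.03.005 (33.373)
2. 10.1101/2020.02.20.20025593 (27.445)
3. 10.1371/journal.pmed.0030343 (27.387)
4. 10.1101/2020.03.10.20033522 (27.378)
5. 10.1371/journal.pmed.0040119 (27.213)
Search time: 550ms""".splitlines()

neighbors = parse_neighbors(sample_output)
dois = [doi for doi, _ in neighbors]
print(dois)
```

In a notebook, IPython's `lines = !command` syntax captures a shell command's output as a list of lines, which could be passed to `parse_neighbors` directly. Note that the top-ranked neighbor is the query article itself, so the related articles start at `dois[1:]`.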


Let's see what those related articles are... (currently, we're using Crossref to fetch the article metadata; we're working on more seamless integration with Pyserini, coming soon).

In [0]:
%%capture
!pip install crossrefapi

In [0]:
dois = ['10.1016/j.jcrc.2020.03.005', '10.1101/2020.02.20.20025593', '10.1371/journal.pmed.0030343', 
        '10.1101/2020.03.10.20033522', '10.1371/journal.pmed.0040119']

from crossref.restful import Works
works = Works()

# The first DOI is the query article itself; the remaining DOIs are its neighbors.
print('related papers to "' + works.doi(dois[0])['title'][0] + '"\n')
for i, d in enumerate(dois[1:], start=1):
  print(f'{i}. ' + works.doi(d)['title'][0])

related papers to "A systematic review on the efficacy and safety of chloroquine for the treatment of COVID-19"

1. The efficacy of convalescent plasma for the treatment of severe influenza
2. SARS: Systematic Review of Treatment Effects
3. Protocol for a randomized controlled trial testing inhaled nitric oxide therapy in spontaneously breathing patients with COVID-19
4. Transparent Development of the WHO Rapid Advice Guidelines
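
The loop above assumes that every DOI resolves and that every record carries a title. A small defensive helper avoids a crash when that is not the case. This is a sketch under the assumption that a failed lookup yields `None` or a record without a usable `title` list; `title_of` is a hypothetical helper, not part of crossrefapi.

```python
def title_of(record, fallback="(title unavailable)"):
    """Return the first title in a Crossref-style record, or a fallback string."""
    if not record:
        # Lookup failed or returned nothing.
        return fallback
    titles = record.get("title") or []
    return titles[0] if titles else fallback


# Stub records standing in for lookup results, so this cell runs offline.
print(title_of({"title": ["SARS: Systematic Review of Treatment Effects"]}))
print(title_of(None))
print(title_of({"title": []}))
```

In the loop, `works.doi(d)['title'][0]` would then become `title_of(works.doi(d))`.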


Note that these results differ from the initial search results. This is _by design_, since related article search is a different task: the results may be related to the article in ways that the original search query does not capture.

Are these results any good? They look reasonable, based on initial feedback we've gotten. However, here's where you can help... we're looking for domain experts to give us more feedback!
If you've got any ideas, please contact us at jimmylin@uwaterloo.ca or tommaso.teofili@gmail.com!