<a href="https://colab.research.google.com/github/castorini/anserini-notebooks/blob/master/pyserini_covid19_default.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pyserini Demo on COVID-19 Dataset (Title + Abstract Index)


This notebook demonstrates how to get started with searching the [COVID-19 Open Research Dataset](https://pages.semanticscholar.org/coronavirus-research) (release of 2020/04/10) from AI2.
Here, we'll be working with the title + abstract index.
We have a [separate notebook](https://github.com/castorini/anserini-notebooks/blob/master/pyserini_covid19_paragraph.ipynb) that demonstrates working with the paragraph index, which will likely yield better search results.
However, the title + abstract index is smaller and easier to manipulate, so it's a good starting point.

First, install Python dependencies:

In [0]:
%%capture
!pip install pyserini==0.9.0.0

import json
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
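
If the Java setup is in doubt, a quick check along the following lines can confirm that `JAVA_HOME` points at a directory containing a `java` binary. This is only a minimal sketch: the helper name `java_home_status` is hypothetical, and the `bin/java` layout is an assumption that holds for typical Linux JDK installs such as the one on Colab.

```python
import os


def java_home_status(path):
    """Return a short message saying whether `path` looks like a JDK directory.

    Assumes the usual Linux layout, where the launcher lives at <path>/bin/java.
    """
    java_binary = os.path.join(path, "bin", "java")
    if os.path.isfile(java_binary):
        return f"ok: found {java_binary}"
    return f"missing: no java binary under {path}"


# Fall back to the path used in this notebook if JAVA_HOME is not set.
print(java_home_status(os.environ.get("JAVA_HOME", "/usr/lib/jvm/java-11-openjdk-amd64")))
```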

Let's grab the pre-built index:

In [0]:
%%capture
!wget https://www.dropbox.com/s/j55t617yhvmegy8/lucene-index-covid-2020-04-10.tar.gz
!tar xvfz lucene-index-covid-2020-04-10.tar.gz

Sanity check of the index size (should be about 1.6G):

In [3]:
!du -h lucene-index-covid-2020-04-10

1.6G	lucene-index-covid-2020-04-10
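
The same check can be done from Python, which is handy outside of a shell-enabled notebook. The sketch below sums file sizes under the index directory; the helper name `directory_size_bytes` is hypothetical. Note that `du` reports disk usage in blocks, so the two numbers may differ slightly.

```python
import os


def directory_size_bytes(root):
    """Sum the sizes of all regular files under `root`, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            file_path = os.path.join(dirpath, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


index_dir = "lucene-index-covid-2020-04-10"
if os.path.isdir(index_dir):
    # Report in GiB, the unit that `du -h` abbreviates as "G".
    print(f"{directory_size_bytes(index_dir) / 1024**3:.2f} GiB")
```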


You can use `pysearch` to search over an index. Here's the basic usage:

In [4]:
from pyserini.search import pysearch

searcher = pysearch.SimpleSearcher('lucene-index-covid-2020-04-10/')
hits = searcher.search('incubation period of COVID-19', 10)

# Print the hits (at most 10, since we asked for 10 above)
for i, hit in enumerate(hits):
    print(f'{i+1:2} {hit.docid} {hit.score:.5f} {hit.lucene_document.get("title")} {hit.lucene_document.get("doi")}')

 1 vkgnwxzc 11.50990 Estimate the incubation period of coronavirus 2019 (COVID-19) 10.1101/2020.02.24.20027474
 2 slapc5xt 11.13590 A Chinese Case of COVID-19 Did Not Show Infectivity During the Incubation Period: Based on an Epidemiological Survey 10.3961/jpmph.20.048
 3 djq0lvr2 11.07600 Is a 14-day quarantine period optimal for effectively controlling coronavirus disease 2019 (COVID-19)? 10.1101/2020.03.15.20036533
 4 8anqfkmo 10.93740 The Incubation Period of Coronavirus Disease 2019 (COVID-19) From Publicly Reported Confirmed Cases: Estimation and Application 10.7326/m20-0504
 5 it4ka7v0 10.91810 Estimation of incubation period distribution of COVID-19 using disease onset forward time: a novel cross-sectional and forward follow-up study 10.1101/2020.03.06.20032417
 6 ibx89gqw 10.78020 Estimating the distribution of the incubation period of 2019 novel coronavirus (COVID-19) infection between travelers to Hubei, China and non-travelers 10.1101/2020.02.13.20022822
 7 tne83uu0 10.5464

For each entry in the `hits` array, use `.lucene_document` to access the underlying indexed Lucene `Document`, and from there, call `.get(field)` to fetch specific fields, like "title", "doi", etc.
The complete list of available fields is [here](https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/index/generator/CovidGenerator.java#L46).
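
To pull several fields at once, a small helper can wrap the `.lucene_document.get(field)` pattern described above. This is a sketch rather than part of Pyserini: `hit_fields` is a hypothetical name, and the example uses a stand-in object built with `SimpleNamespace` so that it runs without an index. Lucene's `Document.get` returns null for a field that is not stored, which the stand-in mimics with `None`.

```python
from types import SimpleNamespace


def hit_fields(hit, names=("title", "doi")):
    """Collect selected stored fields from a hit into a plain dict."""
    doc = hit.lucene_document
    return {name: doc.get(name) for name in names}


# Stand-in for a real hit (hypothetical object, values copied from hit #1 above).
stored = {
    "title": "Estimate the incubation period of coronavirus 2019 (COVID-19)",
    "doi": "10.1101/2020.02.24.20027474",
}
example_hit = SimpleNamespace(
    docid="vkgnwxzc",
    lucene_document=SimpleNamespace(get=stored.get),
)

# "no_such_field" is a made-up name, used to show the missing-field case.
print(hit_fields(example_hit, names=("title", "doi", "no_such_field")))
```

With a real searcher, `hit_fields(hits[0])` should behave the same way, since it only relies on `.lucene_document.get(...)`.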

For hit #2, "A Chinese Case of COVID-19 Did Not Show Infectivity During the Incubation Period", we don't have the full text, but we can access available information via `.raw`.

In [5]:
hit2_json = json.loads(hits[1].raw)
print(json.dumps(hit2_json, indent=4))

{
    "abstract": "Controversy remains over whether the novel coronavirus 2019 (COVID-19) virus may have infectivity during the incubation period before the onset of symptoms. The author had the opportunity to examine the infectivity of COVID-19 during the incubation period by conducting an epidemiological survey on a confirmed patient who had visited Jeju Island during the incubation period. The epidemiological findings support the claim that the COVID-19 virus does not have infectivity during the incubation period.",
    "authors": "Bae, Jong-Myon",
    "cord_uid": "slapc5xt",
    "doi": "10.3961/jpmph.20.048",
    "full_text_file": "",
    "has_pdf_parse": "False",
    "has_pmc_xml_parse": "False",
    "journal": "Journal of Preventive Medicine and Public Health",
    "license": "unk",
    "Microsoft Academic Paper ID": "2029228287",
    "pmcid": "",
    "publish_time": "2020-03-02",
    "pubmed_id": "32114755",
    "sha": "",
    "source_x": "WHO",
    "title": "A Chinese Case of C

For hit #4, "The Incubation Period of Coronavirus Disease 2019 (COVID-19) From Publicly Reported Confirmed Cases", we have the full text, which we can also fetch via `.raw`:


In [0]:
hit4_json = json.loads(hits[3].raw)

# Uncomment to print... warning, it's long! :)
# print(json.dumps(hit4_json, indent=4))
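
Because `.raw` holds two differently shaped JSON records, it can help to check which kind you have before digging into it. The sketch below is an assumption based on the outputs in this notebook: records with a full-text parse appear to start with a `paper_id` key, while metadata-only records start with fields such as `abstract` and `cord_uid`. The helper name `describe_raw` and the sample strings are hypothetical.

```python
import json


def describe_raw(raw):
    """Classify a raw record and list its top-level keys.

    Assumption: a top-level "paper_id" key marks a full-text parse.
    """
    record = json.loads(raw)
    kind = "full-text parse" if "paper_id" in record else "metadata only"
    return kind, sorted(record.keys())


# Tiny stand-in records; real ones come from hits[i].raw.
metadata_raw = json.dumps({"cord_uid": "slapc5xt", "doi": "10.3961/jpmph.20.048"})
full_text_raw = json.dumps({"paper_id": "hypothetical-id"})

print(describe_raw(metadata_raw))
print(describe_raw(full_text_raw))
```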

Finally, if you want to create a DataFrame comprising all the results, here's a snippet of code to do so:

In [7]:
import pandas as pd

ranks = list(range(1, len(hits) + 1))
docids = [hit.docid for hit in hits]
scores = [hit.score for hit in hits]
titles = [hit.lucene_document.get('title') for hit in hits]
dois = [hit.lucene_document.get('doi') for hit in hits]
raw = [hit.raw for hit in hits]
data = {'rank': ranks, 'docid': docids, 'score': scores, 'title': titles, 'doi': dois, 'raw': raw}

df = pd.DataFrame(data)
df

|  | rank | docid | score | title | doi | raw |
|---|---|---|---|---|---|---|
| 0 | 1 | vkgnwxzc | 11.5099 | Estimate the incubation period of coronavirus ... | 10.1101/2020.02.24.20027474 | `{\n "paper_id": "c1ae608c7ffb926a0f50a6a34c...` |
| 1 | 2 | slapc5xt | 11.1359 | A Chinese Case of COVID-19 Did Not Show Infect... | 10.3961/jpmph.20.048 | `{"abstract":"Controversy remains over whether ...` |
| 2 | 3 | djq0lvr2 | 11.076 | Is a 14-day quarantine period optimal for effe... | 10.1101/2020.03.15.20036533 | `{\n "paper_id": "d09c0f71b1a404a592d0dcad2c...` |
| 3 | 4 | 8anqfkmo | 10.9374 | The Incubation Period of Coronavirus Disease 2... | 10.7326/m20-0504 | `{\n "paper_id": "ce8609a60724d457d5b5916d57...` |
| 4 | 5 | it4ka7v0 | 10.9181 | Estimation of incubation period distribution o... | 10.1101/2020.03.06.20032417 | `{\n "paper_id": "6d3b3f4ab80a61c45f82c61c6c...` |
| 5 | 6 | ibx89gqw | 10.7802 | Estimating the distribution of the incubation ... | 10.1101/2020.02.13.20022822 | `{"abstract":"Objectives: Amid the continuing s...` |
| 6 | 7 | tne83uu0 | 10.5464 | Epidemiologic Characteristics of COVID-19 in G... | 10.1101/2020.03.01.20028944 | `{\n "paper_id": "7e9cd4bbf0fba1cc0bcded4004...` |
| 7 | 8 | aul6ahww | 10.527 | Transmission of COVID-19 in the terminal stage... | 10.1016/j.ijid.2020.03.027 | `{\n "paper_id": "b9ffacdcfda1d28c4e66e502c2...` |
| 8 | 9 | ls2z145c | 10.4981 | Study on assessing early epidemiological param... | 10.3760/cma.j.cn112338-20200205-00069 | `{"abstract":"Objective: To study the early dyn...` |
| 9 | 10 | 9jatvium | 10.4587 | COVID-19 in a patient with long-term use of gl... | 10.1016/j.clim.2020.108413 | `{\n "paper_id": "aecea2084e32367c875a9266e8...` |
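
An alternative way to assemble the same table is to build one dictionary per hit and pass the list to `pd.DataFrame`, which accepts a list of dicts. The sketch below is illustrative only: `hits_to_rows` and `fake_hit` are hypothetical helpers, and the stand-in hits exist so the example runs without an index or pandas.

```python
from types import SimpleNamespace


def hits_to_rows(hits):
    """Turn search hits into a list of row dicts, one per hit, ranked from 1."""
    rows = []
    for rank, hit in enumerate(hits, start=1):
        doc = hit.lucene_document
        rows.append({
            "rank": rank,
            "docid": hit.docid,
            "score": hit.score,
            "title": doc.get("title"),
            "doi": doc.get("doi"),
            "raw": hit.raw,
        })
    return rows


def fake_hit(docid, score, title, doi, raw="{}"):
    """Build a stand-in hit with the attributes used above (hypothetical helper)."""
    fields = {"title": title, "doi": doi}
    return SimpleNamespace(
        docid=docid,
        score=score,
        raw=raw,
        lucene_document=SimpleNamespace(get=fields.get),
    )


# Values copied from the first two hits shown earlier; raw is a placeholder.
example_hits = [
    fake_hit("vkgnwxzc", 11.5099,
             "Estimate the incubation period of coronavirus 2019 (COVID-19)",
             "10.1101/2020.02.24.20027474"),
    fake_hit("slapc5xt", 11.1359,
             "A Chinese Case of COVID-19 Did Not Show Infectivity During the "
             "Incubation Period: Based on an Epidemiological Survey",
             "10.3961/jpmph.20.048"),
]
rows = hits_to_rows(example_hits)
print(rows[0]["rank"], rows[0]["docid"], rows[1]["rank"], rows[1]["docid"])

# In the notebook, the equivalent DataFrame would be: pd.DataFrame(hits_to_rows(hits))
```

Keeping each row together in one dictionary makes it harder for the columns to drift out of alignment than with six parallel lists.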
