<a href="https://colab.research.google.com/github/castorini/anserini-notebooks/blob/master/pyserini_covid19_default.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pyserini Demo on COVID-19 Dataset (Title + Abstract Index)


This notebook shows how to get started searching the [COVID-19 Open Research Dataset](https://pages.semanticscholar.org/coronavirus-research) (release of 2020/04/03) from AI2.
Here, we'll be working with the title + abstract index.
We have a [separate notebook](https://github.com/castorini/anserini-notebooks/blob/master/pyserini_covid19_paragraph.ipynb) that demonstrates working with the paragraph index, which will likely yield better search results.
However, the title + abstract index is smaller and easier to manipulate, so it's a good starting point.

First, install Pyserini and set `JAVA_HOME` so that it can find the Java 11 runtime:

In [0]:
%%capture
!pip install pyserini==0.8.1.0

import json
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
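
Pyserini runs Lucene on the JVM, so a wrong `JAVA_HOME` tends to surface later as a confusing import error. The following is a minimal sketch of an optional check; the helper `java_home_looks_valid` is a hypothetical name, not part of Pyserini, and it assumes the usual Linux/macOS JDK layout with a `bin/java` executable.

```python
import os

def java_home_looks_valid(java_home):
    """Return True if `java_home` contains a bin/java file (assumed JDK layout)."""
    if not java_home:
        return False
    return os.path.isfile(os.path.join(java_home, "bin", "java"))

java_home = os.environ.get("JAVA_HOME", "")
if java_home_looks_valid(java_home):
    print(f"JAVA_HOME OK: {java_home}")
else:
    print(f"JAVA_HOME does not point at a JDK: {java_home!r}")
```

If the check fails on Colab, listing `/usr/lib/jvm` should show which JDKs are actually installed.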

Let's grab the pre-built index:

In [0]:
%%capture
!wget https://www.dropbox.com/s/d6v9fensyi7q3gb/lucene-index-covid-2020-04-03.tar.gz
!tar xvfz lucene-index-covid-2020-04-03.tar.gz

Sanity check of index size (should be 1.5G):

In [3]:
!du -h lucene-index-covid-2020-04-03

1.5G	lucene-index-covid-2020-04-03
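
The same sanity check can be done from Python, which is handy outside a shell-enabled notebook. This is a rough sketch: the helper names are our own, and the result may differ slightly from `du`, which reports disk blocks rather than the sum of file sizes.

```python
import os

def directory_size_bytes(path):
    """Sum the sizes of all regular files under `path` (symlinks are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total

def human_readable(num_bytes):
    """Format a byte count with binary units, capped at gigabytes."""
    size = float(num_bytes)
    for unit in ("B", "K", "M", "G"):
        if size < 1024 or unit == "G":
            return f"{size:.1f}{unit}"
        size /= 1024

index_dir = "lucene-index-covid-2020-04-03"
if os.path.isdir(index_dir):
    print(human_readable(directory_size_bytes(index_dir)), index_dir)
else:
    print(f"{index_dir} not found; run the download cell first")
```

A size far below 1.5G usually means the download or the `tar` extraction was interrupted.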


Use `SimpleSearcher` from the `pysearch` module to search over an index. Here's the basic usage:

In [4]:
from pyserini.search import pysearch

searcher = pysearch.SimpleSearcher('lucene-index-covid-2020-04-03/')
hits = searcher.search('incubation period of COVID-19')

# Print the top 10 hits (fewer if the search returns fewer)
for i in range(min(10, len(hits))):
    hit = hits[i]
    doc = hit.lucene_document
    print(f'{i+1:2} {hit.docid} {hit.score:.5f} {doc.get("title")} {doc.get("doi")}')

 1 vkgnwxzc 11.80000 Estimate the incubation period of coronavirus 2019 (COVID-19) 10.1101/2020.02.24.20027474
 2 slapc5xt 11.41820 A Chinese Case of COVID-19 Did Not Show Infectivity During the Incubation Period: Based on an Epidemiological Survey 10.3961/jpmph.20.048
 3 djq0lvr2 11.35980 Is a 14-day quarantine period optimal for effectively controlling coronavirus disease 2019 (COVID-19)? 10.1101/2020.03.15.20036533
 4 8anqfkmo 11.21980 The Incubation Period of Coronavirus Disease 2019 (COVID-19) From Publicly Reported Confirmed Cases: Estimation and Application 10.7326/m20-0504
 5 it4ka7v0 11.18060 Estimation of incubation period distribution of COVID-19 using disease onset forward time: a novel cross-sectional and forward follow-up study 10.1101/2020.03.06.20032417
 6 ibx89gqw 11.04560 Estimating the distribution of the incubation period of 2019 novel coronavirus (COVID-19) infection between travelers to Hubei, China and non-travelers 10.1101/2020.02.13.20022822
 7 tne83uu0 10.8369

From the hits array, use `.lucene_document` to access the underlying indexed Lucene `Document`, and from there, call `.get(field)` to fetch specific fields, like "title", "doi", etc.
The complete list of available fields is [here](https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/index/generator/CovidGenerator.java#L46).
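
To make the access pattern concrete, here is a small sketch of a formatting helper that tolerates missing fields. The helper `format_hit` and the stand-in class `FakeDocument` are hypothetical and exist only so the snippet runs without the index; the sketch assumes that `.get(field)` yields `None` when a field is absent, as Lucene's `Document.get` does.

```python
from types import SimpleNamespace

def format_hit(rank, hit, fields=("title", "doi")):
    """Render one hit on a single line, printing '-' for absent fields."""
    doc = hit.lucene_document
    values = [doc.get(name) or "-" for name in fields]
    return f"{rank:2} {hit.docid} {hit.score:.5f} " + " ".join(values)

class FakeDocument:
    """Stand-in for a Lucene Document: get() returns None for unknown fields."""
    def __init__(self, stored):
        self._stored = stored

    def get(self, name):
        return self._stored.get(name)

# Values copied from the first hit printed above.
fake_hit = SimpleNamespace(
    docid="vkgnwxzc",
    score=11.8,
    lucene_document=FakeDocument({
        "title": "Estimate the incubation period of coronavirus 2019 (COVID-19)",
        "doi": "10.1101/2020.02.24.20027474",
    }),
)
print(format_hit(1, fake_hit))
```

With a real searcher, the same helper could be applied as `format_hit(i + 1, hits[i])` inside the loop.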

For hit #2, "A Chinese Case of COVID-19 Did Not Show Infectivity During the Incubation Period", we don't have the full text, but we can access available information via `.raw`.

In [5]:
hit2_json = json.loads(hits[1].raw)
print(json.dumps(hit2_json, indent=4))

{
    "abstract": "Controversy remains over whether the novel coronavirus 2019 (COVID-19) virus may have infectivity during the incubation period before the onset of symptoms. The author had the opportunity to examine the infectivity of COVID-19 during the incubation period by conducting an epidemiological survey on a confirmed patient who had visited Jeju Island during the incubation period. The epidemiological findings support the claim that the COVID-19 virus does not have infectivity during the incubation period.",
    "authors": "Bae, Jong-Myon",
    "cord_uid": "slapc5xt",
    "doi": "10.3961/jpmph.20.048",
    "full_text_file": "",
    "has_pdf_parse": "False",
    "has_pmc_xml_parse": "False",
    "journal": "Journal of Preventive Medicine and Public Health",
    "license": "unk",
    "Microsoft Academic Paper ID": "2029228287",
    "pmcid": "",
    "publish_time": "2020-03-02",
    "pubmed_id": "32114755",
    "sha": "",
    "source_x": "WHO",
    "title": "A Chinese Case of C

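Note that the flags in this record, such as `has_pdf_parse`, are the strings `"True"` and `"False"` rather than JSON booleans, so testing them directly for truthiness is misleading. The sketch below uses an abridged copy of the record above; the helpers `flag` and `has_full_text_parse` are our own names, and reading the two flags as "a parsed full text is available" is an assumption based on the field names.

```python
import json

# Abridged copy of the metadata record printed above (hit #2).
raw = '''{
    "cord_uid": "slapc5xt",
    "doi": "10.3961/jpmph.20.048",
    "full_text_file": "",
    "has_pdf_parse": "False",
    "has_pmc_xml_parse": "False",
    "journal": "Journal of Preventive Medicine and Public Health",
    "publish_time": "2020-03-02",
    "pubmed_id": "32114755"
}'''
record = json.loads(raw)

def flag(record, key):
    """Interpret a 'True'/'False' string field as a bool (missing means False)."""
    return str(record.get(key, "")).strip().lower() == "true"

def has_full_text_parse(record):
    """Assumption: either flag being 'True' means a full-text parse exists."""
    return flag(record, "has_pdf_parse") or flag(record, "has_pmc_xml_parse")

print(bool(record["has_pdf_parse"]))  # True, because any non-empty string is truthy
print(has_full_text_parse(record))    # False, matching the missing full text
```
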
For hit #4, "The Incubation Period of Coronavirus Disease 2019 (COVID-19) From Publicly Reported Confirmed Cases", we have the full text, which we can also fetch via `.raw`:


In [6]:
hit4_json = json.loads(hits[3].raw)
print(json.dumps(hit4_json, indent=4))

{
    "paper_id": "ce8609a60724d457d5b5916d57a31dea0ffb831b",
    "metadata": {
        "title": "The Incubation Period of Coronavirus Disease 2019 (COVID-19) From Publicly Reported Confirmed Cases: Estimation and Application",
        "authors": [
            {
                "first": "Stephen",
                "middle": [
                    "A"
                ],
                "last": "Lauer",
                "suffix": "",
                "affiliation": {},
                "email": ""
            },
            {
                "first": "Kyra",
                "middle": [
                    "H"
                ],
                "last": "Grantz",
                "suffix": "",
                "affiliation": {},
                "email": ""
            },
            {
                "first": "Qifang",
                "middle": [],
                "last": "Bi",
                "suffix": "",
                "affiliation": {},
                "email": ""
            },
            