<a href="https://colab.research.google.com/github/castorini/anserini-notebooks/blob/master/pyserini_covid19_paragraph.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pyserini Demo on COVID-19 Dataset (Paragraph Index)


This notebook demonstrates how to get started searching the [COVID-19 Open Research Dataset](https://pages.semanticscholar.org/coronavirus-research) (release of 2020/03/27) from AI2.
Here, we'll be working with the paragraph index.
We have [another notebook](https://colab.research.google.com/drive/1mrapJp6-RIB-3u6FaJVa4WEwFdEBOcTe) for working with the simpler title + abstract index.

First, install the Python dependencies:

In [0]:
%%capture
!pip install pyserini==0.8.1.0

import json
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"

Let's grab the pre-built index:

In [0]:
%%capture
!wget https://www.dropbox.com/s/o95pehyzem0yalp/lucene-index-covid-paragraph-2020-03-27.tar.gz
!tar xvfz lucene-index-covid-paragraph-2020-03-27.tar.gz

Sanity check of index size (should be 5.9G):

In [3]:
!du -h lucene-index-covid-paragraph-2020-03-27

5.9G	lucene-index-covid-paragraph-2020-03-27


Now, a bit of explanation of how the index is organized.
For each source article, we create paragraph-level index entries. For a hypothetical article with id `docid`, the index contains:

+ `docid`: title + abstract
+ `docid.00001`: title + abstract + 1st paragraph
+ `docid.00002`: title + abstract + 2nd paragraph
+ `docid.00003`: title + abstract + 3rd paragraph
+ ...

That is, each article is chopped up into individual paragraphs.
Each paragraph is indexed as a "document" (with the title and abstract). 
The `.XXXXX` suffix of the `docid` identifies which paragraph is being indexed (numbered sequentially).
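The naming scheme above can be taken apart with a small helper. This is a minimal sketch, not part of Pyserini: `split_docid` is a hypothetical name, and it assumes that base article ids never contain a `.` themselves.

```python
from typing import Optional, Tuple


def split_docid(docid: str) -> Tuple[str, Optional[int]]:
    """Split an index docid into (base article id, paragraph number).

    Hypothetical helper. Assumes the convention described above: a bare
    `docid` is the title + abstract entry, and `docid.XXXXX` is the
    XXXXX-th paragraph. Also assumes base ids contain no '.' character.
    """
    base, sep, suffix = docid.partition('.')
    if not sep:
        # No suffix: this is the title + abstract entry for the article.
        return base, None
    return base, int(suffix)


print(split_docid('32461'))        # ('32461', None)
print(split_docid('32461.00026'))  # ('32461', 26)
```

The base id returned here is the one that identifies the article as a whole, independent of which paragraph matched.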

You can use `pysearch` to search over an index. Here's the basic usage:

In [4]:
from pyserini.search import pysearch

searcher = pysearch.SimpleSearcher('lucene-index-covid-paragraph-2020-03-27/')
hits = searcher.search('integration degradation synthesis host immune')

# Print up to the first 10 hits (a query may return fewer than 10)
for i in range(min(10, len(hits))):
    print(f'{i+1:2} {hits[i].docid:11} {hits[i].score:.5f} {hits[i].lucene_document.get("title")} {hits[i].lucene_document.get("doi")}')

 1 32461.00026 9.37640 Antigen Presentation and the Ubiquitin‐Proteasome System in Host–Pathogen Interactions 10.1016/s0065-2776(06)92006-9
 2 8874.00030  9.27450 Shutoff of Host Gene Expression in Influenza A Virus and Herpesviruses: Similar Mechanisms and Common Themes 10.3390/v8040102
 3 32461.00010 9.18380 Antigen Presentation and the Ubiquitin‐Proteasome System in Host–Pathogen Interactions 10.1016/s0065-2776(06)92006-9
 4 7291.00047  8.79220 Cullin E3 Ligases and Their Rewiring by Viral Factors 10.3390/biom4040897
 5 42268.00031 8.78100 The human gut microbiota and virome: Potential therapeutic implications 10.1016/j.dld.2015.07.008
 6 10737.00026 8.73240 The TRIMendous Role of TRIMs in Virus–Host Interactions 10.3390/vaccines5030023
 7 32461.00025 8.71720 Antigen Presentation and the Ubiquitin‐Proteasome System in Host–Pathogen Interactions 10.1016/s0065-2776(06)92006-9
 8 12715.00025 8.70150 Viral-Mediated mRNA Degradation for Pathogenesis 10.3390/biomedicines6040111
 9 10737.0

For each entry in the `hits` array, use `.lucene_document` to access the underlying indexed Lucene `Document`, and from there, call `.get(field)` to fetch specific fields, like "title", "doi", etc.
The complete list of available fields is [here](https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/index/generator/CovidGenerator.java#L46).

Note that hit #1 (`32461.00026`) and hit #3 (`32461.00010`) are from the same article.
Use `.contents` of the hit to see exactly what was indexed.

For hit #1:

In [5]:
hits[0].contents.split('\n')

['Antigen Presentation and the Ubiquitin‐Proteasome System in Host–Pathogen Interactions',
 "Abstract Relatively small genomes and high replication rates allow viruses and bacteria to accumulate mutations. This continuously presents the host immune system with new challenges. On the other side of the trenches, an increasingly well‐adjusted host immune response, shaped by coevolutionary history, makes a pathogen's life a rather complicated endeavor. It is, therefore, no surprise that pathogens either escape detection or modulate the host immune response, often by redirecting normal cellular pathways to their advantage. For the purpose of this chapter, we focus mainly on the manipulation of the class I and class II major histocompatibility complex (MHC) antigen presentation pathways and the ubiquitin (Ub)‐proteasome system by both viral and bacterial pathogens. First, we describe the general features of antigen presentation pathways and the Ub‐proteasome system and then address how they 

For hit #3:

In [6]:
hits[2].contents.split('\n')

['Antigen Presentation and the Ubiquitin‐Proteasome System in Host–Pathogen Interactions',
 "Abstract Relatively small genomes and high replication rates allow viruses and bacteria to accumulate mutations. This continuously presents the host immune system with new challenges. On the other side of the trenches, an increasingly well‐adjusted host immune response, shaped by coevolutionary history, makes a pathogen's life a rather complicated endeavor. It is, therefore, no surprise that pathogens either escape detection or modulate the host immune response, often by redirecting normal cellular pathways to their advantage. For the purpose of this chapter, we focus mainly on the manipulation of the class I and class II major histocompatibility complex (MHC) antigen presentation pathways and the ubiquitin (Ub)‐proteasome system by both viral and bacterial pathogens. First, we describe the general features of antigen presentation pathways and the Ub‐proteasome system and then address how they 

The first two elements contain the title and abstract, respectively; they are identical for both hits because both come from the same article.
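Because several paragraphs of one article can appear in the same ranked list, it is sometimes useful to keep only the best-scoring paragraph per article. The sketch below is one way to do that in plain Python; `best_hit_per_article` is a hypothetical helper, not a Pyserini function, and the sample data is copied from the first four hits printed earlier.

```python
from typing import Dict, Iterable, List, Tuple


def best_hit_per_article(
    scored_docids: Iterable[Tuple[str, float]]
) -> List[Tuple[str, float]]:
    """Keep the highest-scoring (docid, score) pair for each base article.

    Hypothetical helper. The base article id is taken to be everything
    before the first '.', following the docid convention of this index.
    Articles are returned in the order they first appear in the input.
    """
    best: Dict[str, Tuple[str, float]] = {}
    for docid, score in scored_docids:
        base = docid.split('.', 1)[0]
        if base not in best or score > best[base][1]:
            best[base] = (docid, score)
    return list(best.values())


# (docid, score) pairs taken from the first four hits shown above.
sample = [
    ('32461.00026', 9.37640),
    ('8874.00030', 9.27450),
    ('32461.00010', 9.18380),
    ('7291.00047', 8.79220),
]

for docid, score in best_hit_per_article(sample):
    print(f'{docid:11} {score:.5f}')
```

With the search results from this notebook, the same helper could be called as `best_hit_per_article((h.docid, h.score) for h in hits)`, since each hit exposes `.docid` and `.score`.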

To access the full text, we need to fetch the "base" document, which is `32461` (without the `.XXXXX` suffix).
The full text is stored only with the base document, to avoid wasting space by repeating it for every paragraph.

We can use the `searcher` to fetch the document, and then fetch the underlying raw article JSON, as follows:

In [7]:
article = json.loads(searcher.doc('32461').lucene_document().get('raw'))
print(json.dumps(article, indent=4))

{
    "paper_id": "276cfd64618fb1aebf7050880805c550149b310f",
    "metadata": {
        "title": "Antigen Presentation and the Ubiquitin-Proteasome System in Host-Pathogen Interactions",
        "authors": [
            {
                "first": "Joana",
                "middle": [],
                "last": "Loureiro",
                "suffix": "",
                "affiliation": {
                    "laboratory": "",
                    "institution": "Whitehead Institute",
                    "location": {
                        "addrLine": "9 Cambridge Center",
                        "settlement": "Cambridge",
                        "region": "Massachusetts"
                    }
                },
                "email": ""
            },
            {
                "first": "Hidde",
                "middle": [
                    "L"
                ],
                "last": "Ploegh",
                "suffix": "",
                "affiliation": {
                    "labor