# ir_datasets - Tutorial - CLI

**NOTE: This tutorial is for the command-line interface. See the other tutorial for Python.**

## Getting Started

We'll start by installing the package. It is available on PyPI,
so you can install it with your favorite package manager.

In [1]:
!pip install ir_datasets



## export

The `ir_datasets export` command writes data to stdout in TSV,
JSON Lines, and other formats.

The command format is:

```
ir_datasets export <dataset-id> <entity-type>
```

Optional arguments may follow `<entity-type>`.

`<dataset-id>` is the dataset's identifier, found [in the catalog](https://ir-datasets.com/). `<entity-type>` is one of: `docs`, `queries`, `qrels`, `scoreddocs`.

Let's start by getting the first 10 documents from the `cord19/trec-covid` collection. The first time you run the command, it will automatically download the dataset.


In [2]:
!ir_datasets export cord19/trec-covid docs | head -n 10

[INFO] No fields supplied. Using all fields: ('doc_id', 'title', 'doi', 'date', 'abstract')
ug7v899j	Clinical features of culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia	10.1186/1471-2334-1-6	2001-07-04	OBJECTIVE: This retrospective chart review describes the epidemiology and clinical features of 40 patients with culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia. METHODS: Patients with positive M. pneumoniae cultures from respiratory specimens from January 1997 through December 1998 were identified through the Microbiology records. Charts of patients were reviewed. RESULTS: 40 patients were identified, 33 (82.5%) of whom required admission. Most infections (92.5%) were community-acquired. The infection affected all age groups but was most common in infants (32.5%) and pre-school children (22.5%). It occurred year-round but was most common in the fall (35%) and spring (3

You can export in other formats too. Here's an export in JSON Lines format.

In [3]:
!ir_datasets export cord19/trec-covid docs --format jsonl | head -n 10

{"doc_id": "ug7v899j", "title": "Clinical features of culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia", "doi": "10.1186/1471-2334-1-6", "date": "2001-07-04", "abstract": "OBJECTIVE: This retrospective chart review describes the epidemiology and clinical features of 40 patients with culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia. METHODS: Patients with positive M. pneumoniae cultures from respiratory specimens from January 1997 through December 1998 were identified through the Microbiology records. Charts of patients were reviewed. RESULTS: 40 patients were identified, 33 (82.5%) of whom required admission. Most infections (92.5%) were community-acquired. The infection affected all age groups but was most common in infants (32.5%) and pre-school children (22.5%). It occurred year-round but was most common in the fall (35%) and spring (30%). More than three-quarters of

If you do not want all the fields, you can specify which ones with `--fields`:

In [4]:
!ir_datasets export cord19/trec-covid docs --format jsonl --fields doc_id date | head -n 10

{"doc_id": "ug7v899j", "date": "2001-07-04"}
{"doc_id": "02tnwd4m", "date": "2000-08-15"}
{"doc_id": "ejv2xln0", "date": "2000-08-25"}
{"doc_id": "2b73a28n", "date": "2001-02-22"}
{"doc_id": "9785vg6d", "date": "2001-05-11"}
{"doc_id": "zjufx4fo", "date": "2001-12-17"}
{"doc_id": "5yhe786e", "date": "2001-03-08"}
{"doc_id": "8zchiykl", "date": "2001-05-02"}
{"doc_id": "8qnrcgnk", "date": "2003-08-07"}
{"doc_id": "jg13scgo", "date": "2003-09-01"}
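
Restricting the output to a few fields also makes it easy to post-process. The following is a minimal sketch, assuming the default TSV output with `--fields doc_id date`: it counts documents per year. The helper names `sample_docs` and `docs_per_year` are made up for illustration, and the ten sample rows are the IDs and dates shown above, standing in for the real command output so the snippet runs without downloading anything.

```shell
# Stand-in for: ir_datasets export cord19/trec-covid docs --fields doc_id date
# (assumed to print tab-separated doc_id and date, one document per line).
sample_docs() {
  printf '%s\t%s\n' \
    ug7v899j 2001-07-04 \
    02tnwd4m 2000-08-15 \
    ejv2xln0 2000-08-25 \
    2b73a28n 2001-02-22 \
    9785vg6d 2001-05-11 \
    zjufx4fo 2001-12-17 \
    5yhe786e 2001-03-08 \
    8zchiykl 2001-05-02 \
    8qnrcgnk 2003-08-07 \
    jg13scgo 2003-09-01
}

# Count documents per year (first four characters of column 2).
docs_per_year() {
  awk -F'\t' '{n[substr($2, 1, 4)] += 1} END {for (y in n) print y, n[y]}' | sort
}

sample_docs | docs_per_year
```

To run this on the real data, replace `sample_docs` with the export command itself.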


The export command works the same way for `queries`, `qrels`, and `scoreddocs` (where available). By default, `qrels` and `scoreddocs` are output in the TREC format, but you can also export them as `tsv` or `jsonl`.

In [5]:
!ir_datasets export cord19/trec-covid queries --fields query_id title | head -n 10

1	coronavirus origin
2	coronavirus response to weather changes
3	coronavirus immunity
4	how do people die from the coronavirus
5	animal models of COVID-19
6	coronavirus test rapid testing
7	serological tests for coronavirus
8	coronavirus under reporting
9	coronavirus in Canada
10	coronavirus social distancing impact


In [6]:
!ir_datasets export cord19/trec-covid qrels | head -n 10

1 4.5 005b2j4b 2
1 4 00fmeepz 1
1 0.5 010vptx3 2
1 2.5 0194oljo 1
1 4 021q9884 1
1 1 02f0opkr 1
1 3.5 047xpt2c 0
1 1 04ftw7k9 0
1 1 05qglt1f 0
1 3 05vx82oo 0


If you're savvy at the command line, piping can let you capture some dataset statistics pretty easily. Here's an example giving the label proportions using `awk`:

In [7]:
!ir_datasets export cord19/trec-covid qrels | awk '{a[$4]+=1; s+=1}END{for (x in a){print x, a[x], a[x]/s}}'

-1 2 2.88525e-05
0 42652 0.615309
1 11055 0.159482
2 15609 0.22518
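
To see how the `awk` program works without downloading anything, here is a small sketch that runs it on the ten qrels lines printed earlier. The function names `sample_qrels` and `label_proportions` are illustrative only. The `sort -n` at the end is an addition, since `awk` does not guarantee the order of `for (x in a)`.

```shell
# Ten qrels lines in TREC format (query_id iteration doc_id relevance),
# copied from the export output shown earlier.
sample_qrels() {
  cat <<'EOF'
1 4.5 005b2j4b 2
1 4 00fmeepz 1
1 0.5 010vptx3 2
1 2.5 0194oljo 1
1 4 021q9884 1
1 1 02f0opkr 1
1 3.5 047xpt2c 0
1 1 04ftw7k9 0
1 1 05qglt1f 0
1 3 05vx82oo 0
EOF
}

# For each relevance label (column 4), print: label, count, share of all lines.
label_proportions() {
  awk '{a[$4]+=1; s+=1} END {for (x in a) print x, a[x], a[x]/s}' | sort -n
}

sample_qrels | label_proportions
```

On this sample, labels 0 and 1 each appear four times and label 2 appears twice.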


## lookup

You can look up documents by their ID with the `ir_datasets lookup` command. The command format is:

```
ir_datasets lookup <dataset-id> <doc-ids> ...
```

These lookups are generally O(1) and memory-efficient.

In [8]:
!ir_datasets lookup cord19/trec-covid 005b2j4b 00fmeepz 010vptx3

[INFO] No fields supplied. Using all fields: ('doc_id', 'title', 'doi', 'date', 'abstract')
005b2j4b	Monophyletic Relationship between Severe Acute Respiratory Syndrome Coronavirus and Group 2 Coronaviruses	10.1086/382892	2004-05-01	Although primary genomic analysis has revealed that severe acute respiratory syndrome coronavirus (SARS CoV) is a new type of coronavirus, the different protein trees published in previous reports have provided no conclusive evidence indicating the phylogenetic position of SARS CoV. To clarify the phylogenetic relationship between SARS CoV and other coronaviruses, we compiled a large data set composed of 7 concatenated protein sequences and performed comprehensive analyses, using the maximum-likelihood, Bayesian-inference, and maximum-parsimony methods. All resulting phylogenetic trees displayed an identical topology and supported the hypothesis that the relationship between SARS CoV and group 2 CoVs is monophyletic. Relationships among all major groups wer

You can also specify the fields to return.

In [9]:
!ir_datasets lookup cord19/trec-covid 005b2j4b 00fmeepz 010vptx3 --fields doc_id title

005b2j4b	Monophyletic Relationship between Severe Acute Respiratory Syndrome Coronavirus and Group 2 Coronaviruses
00fmeepz	Comprehensive overview of COVID-19 based on current evidence
010vptx3	The SARS, MERS and novel coronavirus (COVID-19) epidemics, the newest and biggest global health threats: what lessons have we learned?


And of course, you can do all sorts of fancy piping here as well. Let's find all highly relevant documents (label 2) for query 50:

In [10]:
!ir_datasets lookup cord19/trec-covid $(ir_datasets export cord19/trec-covid qrels | awk '$1==50&&$4==2{printf "%s ", $3}') --fields doc_id title

1v0f2dtx	SARS-CoV-2 mRNA Vaccine Development Enabled by Prototype Pathogen Preparedness
3a6l4ktt	mRNA Vaccines: Possible Tools to Combat SARS-CoV-2
6emy92i5	mRNA Vaccines: Possible Tools to Combat SARS-CoV-2
7q6xi2xx	An Evidence Based Perspective on mRNA-SARS-CoV-2 Vaccine Development
akbq0ogs	Phase 1/2 Study to Describe the Safety and Immunogenicity of a COVID-19 RNA Vaccine Candidate (BNT162b1) in Adults 18 to 55 Years of Age: Interim Report
dcg6ui9d	An Evidence Based Perspective on mRNA-SARS-CoV-2 Vaccine Development
g1j8wk11	Immune-mediated approaches against COVID-19
gidlrnu8	Deconvoluting Lipid Nanoparticle Structure for Messenger RNA Delivery
ino9srb6	An overview on COVID-19: reality and expectation
kf7yz3oz	Vaccines and Therapies in Development for SARS-CoV-2 Infections.
oiu80002	Self-amplifying RNA SARS-CoV-2 lipid nanoparticle vaccine candidate induces high neutralizing antibody titers in mice
ozf05l65	Preparing for Pandemics: RNA Vaccines at the Forefront
q77da2y3	Designing 
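
The command substitution in the cell above can be hard to read at first. The sketch below is a dry run under stated assumptions: it applies the same kind of `awk` filter to five qrels lines copied from the earlier export output (query 1 rather than query 50), then prints the resulting `lookup` command instead of executing it. The helper names `sample_qrels` and `highly_relevant_ids` are hypothetical.

```shell
# Five qrels lines (query_id iteration doc_id relevance),
# copied from the export output shown earlier.
sample_qrels() {
  cat <<'EOF'
1 4.5 005b2j4b 2
1 4 00fmeepz 1
1 0.5 010vptx3 2
1 2.5 0194oljo 1
1 4 021q9884 1
EOF
}

# Print the doc_ids (column 3) judged with label 2 for the query given as $1,
# separated by spaces so they can be passed as command-line arguments.
highly_relevant_ids() {
  awk -v q="$1" '$1==q && $4==2 {printf "%s ", $3}'
}

ids=$(sample_qrels | highly_relevant_ids 1)

# Dry run: show the lookup command rather than running it.
# $ids is deliberately unquoted so each ID becomes a separate argument.
echo ir_datasets lookup cord19/trec-covid $ids --fields doc_id title
```

Removing the `echo` and replacing `sample_qrels` with the real export command gives the one-liner used above.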

## doc_fifos

For indexing with some tools (e.g., Anserini), it is helpful to have multiple concurrent document streams. You can do this with the `ir_datasets doc_fifos` command. Note that this command only works on POSIX systems (e.g., Unix, macOS).

This command runs until all the documents are exhausted, so you need to run it in the background or in a separate terminal. That makes it awkward to demonstrate in a Colab setting.

In [None]:
!ir_datasets doc_fifos cord19/trec-covid
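
To illustrate why the writer has to run in the background, here is a generic named-pipe sketch using plain `mkfifo`. It does not call `ir_datasets`, and the file name and the three sample lines are made up. A process writing to a FIFO blocks until a reader opens it, which is the same reason `doc_fifos` keeps running until a consumer has read everything.

```shell
# Generic named-pipe demo; this does not call ir_datasets.
tmpdir=$(mktemp -d)
fifo="$tmpdir/docs.fifo"
mkfifo "$fifo"

# Producer: runs in the background and blocks until a reader opens the pipe.
printf 'doc1\ndoc2\ndoc3\n' > "$fifo" &

# Consumer: reads the stream; here it simply counts the lines.
# tr strips the padding that some wc implementations add.
count=$(wc -l < "$fifo" | tr -d ' ')

# Wait for the background producer to finish, then clean up.
wait
rm -r "$tmpdir"

echo "read $count lines"
```

In a real indexing setup, the consumer would be the indexer reading from the FIFOs that `doc_fifos` creates.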