# ir_datasets - Tutorial

## Getting Started

We'll start by installing the package. It is available on PyPI,
so you can install it with `pip` or any other Python package manager.

In [1]:
!pip install ir_datasets



You can now load up your favorite dataset. You can find the full listing of datasets [here](https://ir-datasets.com/all.html). Here's an example for `cord19/trec-covid`:

In [2]:
import ir_datasets
dataset = ir_datasets.load('cord19/trec-covid')

## Documents

`doc` entities map a `doc_id` to one or more text fields.

Let's see how many documents are in this collection. The first time you run this command, it will need to download and process the collection, which may take some time:

In [3]:
dataset.docs_count()

192509

Now let's look at some documents. You can iterate through the documents in the collection using `docs_iter()`. Since there are so many, we'll just look at the first 10:

In [4]:
for doc in dataset.docs_iter()[:10]:
  print(doc)

Cord19Doc(doc_id='ug7v899j', title='Clinical features of culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia', doi='10.1186/1471-2334-1-6', date='2001-07-04', abstract='OBJECTIVE: This retrospective chart review describes the epidemiology and clinical features of 40 patients with culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia. METHODS: Patients with positive M. pneumoniae cultures from respiratory specimens from January 1997 through December 1998 were identified through the Microbiology records. Charts of patients were reviewed. RESULTS: 40 patients were identified, 33 (82.5%) of whom required admission. Most infections (92.5%) were community-acquired. The infection affected all age groups but was most common in infants (32.5%) and pre-school children (22.5%). It occurred year-round but was most common in the fall (35%) and spring (30%). More than three-quarters of patie

You can see that each document is represented as a `Cord19Doc`, which is a `namedtuple`. Named tuples are a lightweight data structure consisting of a predefined sequence of named fields.
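
As a quick illustration of what working with a named tuple looks like, here is a minimal sketch. It uses a stand-in `ExampleDoc` type rather than the real `Cord19Doc`; the type name and field values are made up for the example, so no dataset download is needed:

```python
from collections import namedtuple

# ExampleDoc is a hypothetical stand-in for Cord19Doc, with fewer fields.
ExampleDoc = namedtuple('ExampleDoc', ['doc_id', 'title', 'abstract'])

doc = ExampleDoc(doc_id='d1', title='An example title', abstract='')

# Fields can be read by name or by position.
title_by_name = doc.title
title_by_index = doc[1]

# Named tuples unpack like ordinary tuples.
doc_id, title, abstract = doc

# They are immutable; _replace() returns a modified copy instead.
updated = doc._replace(abstract='A short abstract.')

# _asdict() converts a record to a dict, e.g. for JSON serialization.
as_dict = doc._asdict()
```

The same operations should apply to any record type returned by the iterators in this tutorial, since they are all named tuples.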

If you want more information about which document fields are available in this collection, you can
[check the documentation](https://ir-datasets.com/cord19.html#cord19) or inspect the dataset's `docs_cls()`:

In [5]:
dataset.docs_cls()

ir_datasets.datasets.cord19.Cord19Doc

In [6]:
dataset.docs_cls()._fields

('doc_id', 'title', 'doi', 'date', 'abstract')

In [7]:
dataset.docs_cls().__annotations__

OrderedDict([('doc_id', str),
             ('title', str),
             ('doi', str),
             ('date', str),
             ('abstract', str)])

Did you notice the `[:10]` above? We can do all sorts of fancy slicing on document iterators. Here, we select every other document from the first 10:

In [8]:
for doc in dataset.docs_iter()[:10:2]:
  print(doc)

Cord19Doc(doc_id='ug7v899j', title='Clinical features of culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia', doi='10.1186/1471-2334-1-6', date='2001-07-04', abstract='OBJECTIVE: This retrospective chart review describes the epidemiology and clinical features of 40 patients with culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia. METHODS: Patients with positive M. pneumoniae cultures from respiratory specimens from January 1997 through December 1998 were identified through the Microbiology records. Charts of patients were reviewed. RESULTS: 40 patients were identified, 33 (82.5%) of whom required admission. Most infections (92.5%) were community-acquired. The infection affected all age groups but was most common in infants (32.5%) and pre-school children (22.5%). It occurred year-round but was most common in the fall (35%) and spring (30%). More than three-quarters of patie

Or the last 10 documents:

In [9]:
for doc in dataset.docs_iter()[-10:]:
  print(doc)

Cord19Doc(doc_id='7e8r61e7', title='Can Pediatric COVID-19 Testing Sensitivity Be Improved With Sequential Tests?', doi='10.1213/ane.0000000000004982', date='2020-05-26', abstract='')
Cord19Doc(doc_id='r3ud8t8w', title='rAre graphene and graphene-derived products capable of preventing COVID-19 infection?', doi='10.1016/j.mehy.2020.110031', date='2020-06-24', abstract="The Severe Acute Respiratory Syndrome CoronaVirus 2 (SARS-CoV-2) causes the new coronavirus disease 2019 (COVID-19). This disease is a severe respiratory tract infection that spread rapidly around the world. In this pandemic situation, the researchers' effort is to understand the targets of the virus, mechanism of their cause, and transmission from animal to human and vice-versa. Therefore, to support COVID-19 research and development, we have proposed approaches based on graphene and graphene-derived nanomaterials against COVID-19.")
Cord19Doc(doc_id='6jittbis', title='Heterogeneity and plasticity of porcine alveolar mac

You can also select by fraction, e.g., `[:1/3]` selects the first third, `[1/3:2/3]` selects the second third, and `[2/3:]` selects the final third. This is handy when splitting document processing across processes, machines, or GPUs.

These slices are smart: rather than processing every document in the collection, they jump directly to the right position in the source files.
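
If you prefer explicit integer bounds over fractional slices, one option is to compute them from `docs_count()`. The sketch below is only an illustration: `shard_bounds` is a hypothetical helper, not part of ir_datasets, and it uses the document count shown earlier so that it runs without downloading anything:

```python
def shard_bounds(total, num_shards):
    """Split range(total) into num_shards contiguous (start, stop) pairs."""
    return [
        (total * i // num_shards, total * (i + 1) // num_shards)
        for i in range(num_shards)
    ]

# 192509 is the docs_count() reported above for cord19/trec-covid.
bounds = shard_bounds(192509, 3)

# Worker i would then process dataset.docs_iter()[start:stop]
# using the i-th (start, stop) pair.
for worker, (start, stop) in enumerate(bounds):
    print(worker, start, stop)
```

Integer floor division keeps the shards contiguous and non-overlapping, so every document lands in exactly one shard even when the count is not evenly divisible.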

Now let's say you know a document's ID and want to find its text. You can use `docs_store()` to accomplish this:

In [10]:
docstore = dataset.docs_store()
docstore.get('3wuh6k6g')

Cord19Doc(doc_id='3wuh6k6g', title='Understand Research Hotspots Surrounding COVID-19 and Other Coronavirus Infections Using Topic Modeling', doi='10.1101/2020.03.26.20044164', date='2020-03-30', abstract='Background: Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a virus that causes severe respiratory illness in humans, which eventually results in the current outbreak of novel coronavirus disease (COVID-19) around the world. The research community is interested to know what are the hotspots in coronavirus (CoV) research and how much is known about COVID-19. This study aimed to evaluate the characteristics of publications involving coronaviruses as well as COVID-19 by using a topic modeling analysis. Methods: We extracted all abstracts and retained the most informative words from the COVID-19 Open Research Dataset, which contains all the 35,092 pieces of coronavirus related literature published up to March 20, 2020. Using Latent Dirichlet Allocation modeling, we traine

Or you can look up a whole list of IDs with `get_many()`, for instance when you're re-ranking these documents:

In [11]:
docstore.get_many(['ax6v6ham', '44l5q07k', '8xm0kacj', '3wuh6k6g', 'fiievwy7'])

{'3wuh6k6g': Cord19Doc(doc_id='3wuh6k6g', title='Understand Research Hotspots Surrounding COVID-19 and Other Coronavirus Infections Using Topic Modeling', doi='10.1101/2020.03.26.20044164', date='2020-03-30', abstract='Background: Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a virus that causes severe respiratory illness in humans, which eventually results in the current outbreak of novel coronavirus disease (COVID-19) around the world. The research community is interested to know what are the hotspots in coronavirus (CoV) research and how much is known about COVID-19. This study aimed to evaluate the characteristics of publications involving coronaviruses as well as COVID-19 by using a topic modeling analysis. Methods: We extracted all abstracts and retained the most informative words from the COVID-19 Open Research Dataset, which contains all the 35,092 pieces of coronavirus related literature published up to March 20, 2020. Using Latent Dirichlet Allocation modeli

If you don't care about the order they are returned in, you can use `get_many_iter()`. This avoids keeping all the results in memory, and reads them in the order in which they appear on disk.

In [12]:
for doc in docstore.get_many_iter(['ax6v6ham', '44l5q07k', '8xm0kacj', '3wuh6k6g', 'fiievwy7']):
  print(doc)

Cord19Doc(doc_id='3wuh6k6g', title='Understand Research Hotspots Surrounding COVID-19 and Other Coronavirus Infections Using Topic Modeling', doi='10.1101/2020.03.26.20044164', date='2020-03-30', abstract='Background: Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a virus that causes severe respiratory illness in humans, which eventually results in the current outbreak of novel coronavirus disease (COVID-19) around the world. The research community is interested to know what are the hotspots in coronavirus (CoV) research and how much is known about COVID-19. This study aimed to evaluate the characteristics of publications involving coronaviruses as well as COVID-19 by using a topic modeling analysis. Methods: We extracted all abstracts and retained the most informative words from the COVID-19 Open Research Dataset, which contains all the 35,092 pieces of coronavirus related literature published up to March 20, 2020. Using Latent Dirichlet Allocation modeling, we traine
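
When re-ranking, you often need the documents back in the order of your candidate list. Since `get_many()` returns a dict keyed by `doc_id` and `get_many_iter()` yields documents in on-disk order, a small helper can restore that order. This is a sketch with stand-in data: `ExampleDoc`, `in_requested_order`, and the IDs are hypothetical, and the plain list stands in for the output of `get_many_iter()`:

```python
from collections import namedtuple

# Hypothetical stand-in for a document type such as Cord19Doc.
ExampleDoc = namedtuple('ExampleDoc', ['doc_id', 'title'])

def in_requested_order(doc_ids, docs):
    """Return docs arranged to match doc_ids, skipping IDs that were not found."""
    by_id = {doc.doc_id: doc for doc in docs}
    return [by_id[doc_id] for doc_id in doc_ids if doc_id in by_id]

# Documents as they might come back, in an arbitrary order.
fetched = [
    ExampleDoc('b', 'second'),
    ExampleDoc('c', 'third'),
    ExampleDoc('a', 'first'),
]

ordered = in_requested_order(['a', 'b', 'missing', 'c'], fetched)
```

Note that the helper holds all fetched documents in memory at once, so it trades away the memory savings of `get_many_iter()` for a predictable order.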

## Queries

`queries` (topics) map a `query_id` to one or more text fields. Akin to `docs`, you can iterate over queries for a collection using `queries_iter()`:

In [13]:
for query in dataset.queries_iter():
  print(query)

TrecQuery(query_id='1', title='coronavirus origin', description='what is the origin of COVID-19', narrative="seeking range of information about the SARS-CoV-2 virus's origin, including its evolution, animal source, and first transmission into humans")
TrecQuery(query_id='2', title='coronavirus response to weather changes', description='how does the coronavirus respond to changes in the weather', narrative='seeking range of information about the SARS-CoV-2 virus viability in different weather/climate conditions as well as information related to transmission of the virus in different climate conditions')
TrecQuery(query_id='3', title='coronavirus immunity', description='will SARS-CoV2 infected people develop immunity? Is cross protection possible?', narrative='seeking studies of immunity developed due to infection with SARS-CoV2 or cross protection gained due to infection with other coronavirus types')
TrecQuery(query_id='4', title='how do people die from the coronavirus', description='w

Iterables of namedtuples are handy structures because they are lightweight and do not load all the content into memory. But in case you need that, you can easily convert them into other data structures. Here's an example building a Pandas DataFrame of the queries:

In [14]:
import pandas as pd
pd.DataFrame(dataset.queries_iter())

| | query_id | title | description | narrative |
|---|---|---|---|---|
| 0 | 1 | coronavirus origin | what is the origin of COVID-19 | seeking range of information about the SARS-Co... |
| 1 | 2 | coronavirus response to weather changes | how does the coronavirus respond to changes in... | seeking range of information about the SARS-Co... |
| 2 | 3 | coronavirus immunity | will SARS-CoV2 infected people develop immunit... | seeking studies of immunity developed due to i... |
| 3 | 4 | how do people die from the coronavirus | what causes death from Covid-19? | Studies looking at mechanisms of death from Co... |
| 4 | 5 | animal models of COVID-19 | what drugs have been active against SARS-CoV o... | Papers that describe the results of testing d... |
| 5 | 6 | coronavirus test rapid testing | what types of rapid testing for Covid-19 have ... | Looking for studies identifying ways to diagno... |
| 6 | 7 | serological tests for coronavirus | are there serological tests that detect antibo... | Looking for assays that measure immune respons... |
| 7 | 8 | coronavirus under reporting | how has lack of testing availability led to un... | Looking for studies answering questions of imp... |
| 8 | 9 | coronavirus in Canada | how has COVID-19 affected Canada | seeking data related to infections (confirm, s... |
| 9 | 10 | coronavirus social distancing impact | has social distancing had an impact on slowing... | seeking specific information on studies that h... |
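
A plain dict keyed by `query_id` is another common target structure, useful when you need to look up a query's text while processing qrels or run files. The sketch below uses a hypothetical `ExampleQuery` stand-in and a hand-written list in place of `dataset.queries_iter()`:

```python
from collections import namedtuple

# Hypothetical stand-in for a query type such as TrecQuery.
ExampleQuery = namedtuple('ExampleQuery', ['query_id', 'title'])

# Stand-in for dataset.queries_iter(); titles copied from the output above.
queries = [
    ExampleQuery('1', 'coronavirus origin'),
    ExampleQuery('2', 'coronavirus response to weather changes'),
]

# Map each query_id to the full record...
queries_by_id = {query.query_id: query for query in queries}

# ...or directly to a single text field.
titles_by_id = {query.query_id: query.title for query in queries}
```

Keep in mind that the IDs in this collection are strings, so a lookup such as `queries_by_id[1]` with an integer key would fail.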


Again, we can [check the documentation](https://ir-datasets.com/cord19.html#cord19/trec-covid) for information about what fields are available. Or we can use `queries_cls()`:

In [15]:
dataset.queries_cls()

ir_datasets.formats.trec.TrecQuery

In [16]:
dataset.queries_cls()._fields

('query_id', 'title', 'description', 'narrative')

In [17]:
dataset.queries_cls().__annotations__

OrderedDict([('query_id', str),
             ('title', str),
             ('description', str),
             ('narrative', str)])

## Query Relevance Assessments

`qrels` (query relevance assessments/judgments) map a `query_id` and `doc_id` to a relevance score.

As you probably guessed, we can fetch qrels for a dataset with `qrels_iter()`. There are a lot of them, so we'll just show them in a DataFrame to start with:

In [18]:
pd.DataFrame(dataset.qrels_iter())

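| | query_id | doc_id | relevance | iteration |
|---|---|---|---|---|
| 0 | 1 | 005b2j4b | 2 | 4.5 |
| 1 | 1 | 00fmeepz | 1 | 4 |
| 2 | 1 | 010vptx3 | 2 | 0.5 |
| 3 | 1 | 0194oljo | 1 | 2.5 |
| 4 | 1 | 021q9884 | 1 | 4 |
| ... | ... | ... | ... | ... |
| 69313 | 50 | zvop8bxh | 2 | 5 |
| 69314 | 50 | zwf26o63 | 1 | 5 |
| 69315 | 50 | zwsvlnwe | 0 | 5 |
| 69316 | 50 | zxr01yln | 1 | 5 |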
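
Evaluation code often wants qrels as a nested mapping from `query_id` to `doc_id` to relevance. Here is a minimal sketch of that conversion; `ExampleQrel` and `qrels_to_dict` are hypothetical names, and the three hand-written records (copied from rows shown above) stand in for `dataset.qrels_iter()`:

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in with the same fields as the qrels shown above.
ExampleQrel = namedtuple(
    'ExampleQrel', ['query_id', 'doc_id', 'relevance', 'iteration']
)

def qrels_to_dict(qrels):
    """Build {query_id: {doc_id: relevance}} from an iterable of qrels."""
    nested = defaultdict(dict)
    for qrel in qrels:
        nested[qrel.query_id][qrel.doc_id] = qrel.relevance
    return dict(nested)

# Stand-in for dataset.qrels_iter().
sample_qrels = [
    ExampleQrel('1', '005b2j4b', 2, '4.5'),
    ExampleQrel('1', '00fmeepz', 1, '4'),
    ExampleQrel('50', 'zvop8bxh', 2, '5'),
]

qrels_dict = qrels_to_dict(sample_qrels)
```

With this structure, checking the judgment for a query and document pair becomes a pair of dict lookups, e.g. `qrels_dict['1'].get('00fmeepz', 0)`.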


What do the relevance values 0, 1, and 2 mean? You can find out with `qrels_defs()`:

In [19]:
dataset.qrels_defs()

{0: 'Not Relevant: everything else.',
 1: 'Partially Relevant: the article answers part of the question but would need to be combined with other information to get a complete answer.',
 2: 'Relevant: the article is fully responsive to the information need as expressed by the topic, i.e. answers the Question in the topic. The article need not contain all information on the topic, but must, on its own, provide an answer to the question.'}
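
These definitions can be combined with the qrels to summarize how the judgments are distributed. The following is a rough sketch rather than a prescribed workflow: the `definitions` dict is an abbreviated, hand-written copy of the output above, and `relevances` is a small hand-picked sample standing in for the `relevance` field of every record from `dataset.qrels_iter()`:

```python
from collections import Counter

# Abbreviated copy of the qrels_defs() output above (descriptions shortened).
definitions = {
    0: 'Not Relevant: everything else.',
    1: 'Partially Relevant: answers part of the question.',
    2: 'Relevant: fully responsive to the information need.',
}

# Stand-in for (qrel.relevance for qrel in dataset.qrels_iter()).
relevances = [2, 1, 2, 1, 1, 2, 1, 0, 1]

counts = Counter(relevances)

# Use the text before the first colon as a short label for each level.
summary = {
    definitions[level].split(':', 1)[0]: counts[level]
    for level in sorted(definitions)
}
```

Because `Counter` returns 0 for missing keys, levels that never occur in the sample still appear in the summary with a count of zero.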

Of course we can also get information about the `TrecQrel` type using `qrels_cls()`:

In [20]:
dataset.qrels_cls()

ir_datasets.formats.trec.TrecQrel

In [21]:
dataset.qrels_cls()._fields

('query_id', 'doc_id', 'relevance', 'iteration')

In [22]:
dataset.qrels_cls().__annotations__

OrderedDict([('query_id', str),
             ('doc_id', str),
             ('relevance', int),
             ('iteration', str)])

## Wrapping Up

So that's the core functionality. You can find more information in the [documentation](https://ir-datasets.com/).