# ir_datasets - Adding Datasets

This tutorial covers the process for adding a new dataset to the `ir_datasets` package.

This tutorial is for datasets that are intended to be added to the main package. To package a dataset as an extension instead, see [this example extension](https://github.com/seanmacavaney/dummy-irds-ext).

Before starting, we recommend [opening an issue](https://github.com/allenai/ir_datasets/issues/new/choose) so various decisions about how to support the dataset can be discussed.

There are four files involved in adding a dataset to the `ir_datasets` package:
 - `ir_datasets/datasets/[dataset-id].py` - Contains the definition of the dataset and any specialized code for handling it.
 - `ir_datasets/etc/downloads.json` - Contains information about how to download and verify dataset source files.
 - `ir_datasets/docs/[dataset-id].yaml` - Contains documentation of the dataset.
 - `test/integration/[dataset-id].py` - Contains automated tests to ensure the dataset is processed as expected.
 
We will now show examples of each of these files for a toy dataset called `dummy`, with files hosted here: https://github.com/seanmacavaney/dummy-irds-ext/tree/master/data
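
As a quick reference, the four paths for a given dataset ID can be sketched as follows. The helper `dataset_files` is hypothetical (it is not part of `ir_datasets`), and replacing `-` with `_` in the Python file names is an assumption based on the comment on `NAME` in `dummy.py` below.

```python
from pathlib import PurePosixPath


def dataset_files(dataset_id: str) -> dict:
    """Return the four files touched when adding `dataset_id` (illustrative only)."""
    # Assumption: Python module names use "_" where the dataset ID uses "-".
    module = dataset_id.replace('-', '_')
    return {
        'definition': PurePosixPath('ir_datasets/datasets') / f'{module}.py',
        'downloads': PurePosixPath('ir_datasets/etc/downloads.json'),
        # Assumption: the YAML file keeps the ID as written ([dataset-id].yaml).
        'documentation': PurePosixPath('ir_datasets/docs') / f'{dataset_id}.yaml',
        'integration_test': PurePosixPath('test/integration') / f'{module}.py',
    }


for role, path in dataset_files('dummy').items():
    print(f'{role}: {path}')
```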

File: `ir_datasets/datasets/dummy.py`

```python
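import ir_datasets
from ir_datasets.datasets.base import Dataset, YamlDocumentation
from ir_datasets.formats import TsvDocs, TsvQueries, TrecQrels
from ir_datasets.util import DownloadConfig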

# A unique identifier for this dataset. This should match the file name (with "-" instead of "_")
NAME = 'dummy'

# What do the relevance levels in qrels mean?
QREL_DEFS = {
    1: 'relevant',
    0: 'not relevant',
}

# This message is shown to the user before downloads are started
DUA = 'Please confirm that you agree to the data usage agreement at <https://some-url/>'

# An initialization function is used to keep the namespace clean
def _init():
    # The directory where this dataset's data files will be stored
    base_path = ir_datasets.util.home_path() / NAME
    
    # Load an object that is used for providing the documentation
    documentation = YamlDocumentation(f'docs/{NAME}.yaml')
    
    # A reference to the downloads file, under the key "dummy". (DLC stands for DownLoadable Content)
    dlc = DownloadConfig.context(NAME, base_path, dua=DUA)
    
    # How to process the documents. Since they are in a typical TSV format, we'll use TsvDocs.
    # Note that other dataset formats may require you to write a custom docs handler (BaseDocs).
    # Note that this doesn't process the documents now; it just defines how they are processed.
    docs = TsvDocs(dlc['docs'], namespace=NAME, lang='en')
    
    # How to process the queries. Similar to the documents, you may need to write a custom
    # queries handler (BaseQueries).
    queries = TsvQueries(dlc['queries'], namespace=NAME, lang='en')
    
    # Qrels: The qrels file is in the TREC format, so we'll use TrecQrels to process them
    qrels = TrecQrels(dlc['qrels'], QREL_DEFS)
    
    # Package the docs, queries, qrels, and documentation into a Dataset object
    dataset = Dataset(docs, queries, qrels, documentation('_'))
    
    # Register the dataset in ir_datasets
    ir_datasets.registry.register(NAME, dataset)
    
    return dataset  # used for exposing dataset to the namespace

dataset = _init()
```

Note that you also need to import this module in `ir_datasets/datasets/__init__.py`, so that `_init()` runs and the dataset is registered when the package is loaded:

```python
from . import dummy
```
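
The comments in `dummy.py` point out that constructing a handler only defines how the documents are processed, without processing them yet. The following standard-library sketch illustrates that idea. It is not the implementation of `TsvDocs`; the names `LazyTsvDocs` and `SketchDoc` are hypothetical and exist only for this example.

```python
import io
from typing import Callable, Iterator, NamedTuple, TextIO


class SketchDoc(NamedTuple):
    doc_id: str
    text: str


class LazyTsvDocs:
    """Illustrative handler: stores how to open the source, reads only on iteration."""

    def __init__(self, open_stream: Callable[[], TextIO]):
        self._open_stream = open_stream  # nothing is opened or parsed here
        self.times_opened = 0

    def docs_iter(self) -> Iterator[SketchDoc]:
        # The body of a generator runs only once iteration starts.
        self.times_opened += 1
        with self._open_stream() as stream:
            for line in stream:
                doc_id, text = line.rstrip('\n').split('\t', 1)
                yield SketchDoc(doc_id, text)


docs = LazyTsvDocs(lambda: io.StringIO('T1\tfirst doc\nT2\tsecond doc\n'))
print(docs.times_opened)  # the source has not been touched yet
for doc in docs.docs_iter():
    print(doc.doc_id, doc.text)
```

A custom handler for a non-TSV format would follow the same pattern: keep construction cheap and do the parsing inside the iterator.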

File: `ir_datasets/etc/downloads.json`

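Add an entry like the following to the file. The top-level key (`dummy`) is the name passed to `DownloadConfig.context`, and the inner keys (`docs`, `queries`, `qrels`) are the ones used to index `dlc` in `dummy.py`: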

```json
"dummy": {
  "docs": {
    "url": "https://raw.githubusercontent.com/seanmacavaney/dummy-irds-ext/master/data/docs.tsv",
    "expected_md5": "c7bb5a1a3a07d51de50e8414245c2be4",
    "cache_path": "docs.tsv"
  },
  "queries": {
    "url": "https://raw.githubusercontent.com/seanmacavaney/dummy-irds-ext/master/data/queries.tsv",
    "expected_md5": "08ba86d990cbe6890f727946346964db",
    "cache_path": "queries.tsv"
  },
  "qrels": {
    "url": "https://raw.githubusercontent.com/seanmacavaney/dummy-irds-ext/master/data/qrels",
    "expected_md5": "79ed359fe0afa0f67eb39f468d162920",
    "cache_path": "qrels"
  }
}
```
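
The `expected_md5` values are MD5 digests of the source files. One way to compute them for a file you have already downloaded is sketched below using Python's standard `hashlib` module. The helper name `md5_of_file` is illustrative and is not part of `ir_datasets`.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        # Read in fixed-size chunks so large collections need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()
```

For example, calling `md5_of_file('docs.tsv')` on a local copy of the documents file gives the string to use as `expected_md5` for the `docs` entry.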

File: `ir_datasets/docs/dummy.yaml`

```yaml
_: # matches documentation key above
  pretty_name: 'Dummy' # a more human-readable way to present this dataset than the dataset-id
  desc: '
<p>
HTML-encoded and human-readable information about this dataset.
Include a brief description of the dataset.
Be sure to include important decisions made when processing it.
Also, link to more information, e.g. websites, papers, etc.
</p>
<ul>
  <li><a href="https://github.com/seanmacavaney/dummy-irds-ext">Link to the source</a></li>
</ul>' 
  bibtex: |
    @misc{dummy,
      title={Dummy: a made-up dataset},
      year={2021}
    }
```

To generate the HTML documentation files, run `python -m ir_datasets documentation`.

File: `test/integration/dummy.py`

```python
from ir_datasets.formats import GenericQuery, GenericDoc, TrecQrel
from .base import DatasetIntegrationTest

class TestDummy(DatasetIntegrationTest):
    def test_docs(self):
        # Test that the dataset 'dummy' has 15 documents, and test the specific docs at indices 0, 9, and 14
        self._test_docs('dummy', count=15, items={
            0: GenericDoc('T1', 'CUT, CAP AND BALANCE. TAXED ENOUGH ALREADY!'),
            9: GenericDoc('T10', 'Perhaps this is the kind of thinking we need in Washington ...'),
            14: GenericDoc('T15', "I've been visiting Trump Int'l Golf Links Scotland and the course will be unmatched anywhere in the world. Spectacular!"),
        })

    def test_queries(self):
        # Test that the dataset 'dummy' has 4 queries, and test the specific queries at indices 0 and 3
        self._test_queries('dummy', count=4, items={
            0: GenericQuery('1', 'republican party'),
            3: GenericQuery('4', 'media'),
        })

    def test_qrels(self):
        # Test that the dataset 'dummy' has 60 qrels, and test the specific qrels at indices 0, 9, and 59
        self._test_qrels('dummy', count=60, items={
            0: TrecQrel('1', 'T1', 0, '0'),
            9: TrecQrel('1', 'T10', 0, '0'),
            59: TrecQrel('4', 'T15', 0, '0'),
        })
```

Within a `DatasetIntegrationTest`, you can call `self._build_test_docs('dummy')`, `self._build_test_queries('dummy')`, and `self._build_test_qrels('dummy')` to generate sample test cases. Check that the records in the generated tests were processed correctly before relying on them, and add further test cases where useful, especially to exercise dataset-specific handlers.
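
To make the shape of these tests concrete, here is a minimal standard-library sketch of the kind of check described in the test comments above: a total count plus exact records at chosen indices. It is an assumption about the general idea, not the actual implementation of `_test_docs`; `check_records` and `SketchDoc` are hypothetical names.

```python
from typing import Iterable, Mapping, NamedTuple


class SketchDoc(NamedTuple):
    doc_id: str
    text: str


def check_records(records: Iterable, count: int, items: Mapping[int, object]) -> None:
    """Assert that `records` yields `count` items and matches `items` at the given indices."""
    seen = 0
    remaining = dict(items)
    for index, record in enumerate(records):
        seen += 1
        if index in remaining:
            expected = remaining.pop(index)
            assert record == expected, f'index {index}: {record!r} != {expected!r}'
    assert seen == count, f'expected {count} records, found {seen}'
    assert not remaining, f'indices never reached: {sorted(remaining)}'


# Stand-in data; a real test would iterate over the dataset's documents instead.
docs = [SketchDoc(f'T{i}', f'text {i}') for i in range(1, 16)]
check_records(docs, count=15, items={
    0: SketchDoc('T1', 'text 1'),
    14: SketchDoc('T15', 'text 15'),
})
```

Checking the first and last records along with the count catches truncated downloads and off-by-one parsing errors, which is why the example tests above pick indices at both ends.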