In [1]:
from pprint import pprint
from bigbio.dataloader import BigBioConfigHelpers
from bigbio.utils.constants import Tasks

# BigBioConfigHelpers

Start by creating an instance of BigBioConfigHelpers. This will help locate and filter datasets available in the BigBIO package. 

In [2]:
conhelps = BigBioConfigHelpers()
print("found {} dataset configs from {} datasets".format(
    len(conhelps),
    len(conhelps.available_dataset_names)
))

found 419 dataset configs from 123 datasets


Each dataset has at least one source config and at least one bigbio config. Source configs attempt to preserve the original structure of the dataset, while bigbio configs are normalized into one of several bigbio [task schemas](https://github.com/bigscience-workshop/biomedical/blob/master/task_schemas.md). Some datasets have several source configs and/or bigbio configs (e.g. multilingual datasets or datasets supporting multiple cross-validation folds). This is why the number of configs (419) is greater than twice the number of datasets (2 × 123 = 246).
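
As a rough illustration of the config-to-dataset relationship, the sketch below groups a handful of config names by dataset using only the standard library. The config names are copied from outputs that appear in this notebook; the pairing list itself is hand-written for illustration and is not produced by the BigBIO package. For `bioasq_2021_mesinesp` only the bigbio configs that show up in this notebook are listed.

```python
from collections import Counter

# Hand-written (dataset_name, config_name) pairs for illustration only.
# The names are taken from this notebook's outputs, not queried from bigbio.
configs = [
    ("an_em", "an_em_source"),
    ("an_em", "an_em_bigbio_kb"),
    ("bc5cdr", "bc5cdr_source"),
    ("bc5cdr", "bc5cdr_bigbio_kb"),
    ("bioasq_2021_mesinesp", "bioasq_2021_mesinesp_subtrack1_all_bigbio_text"),
    ("bioasq_2021_mesinesp", "bioasq_2021_mesinesp_subtrack1_only_articles_bigbio_text"),
    ("bioasq_2021_mesinesp", "bioasq_2021_mesinesp_subtrack2_bigbio_text"),
]

# Total configs per dataset, and how many of those use a bigbio schema.
configs_per_dataset = Counter(dataset for dataset, _ in configs)
bigbio_per_dataset = Counter(
    dataset for dataset, name in configs if "_bigbio_" in name
)

for dataset, count in sorted(configs_per_dataset.items()):
    print(dataset, count, bigbio_per_dataset[dataset])
```

A dataset with several subsets, such as `bioasq_2021_mesinesp` here, contributes more than one bigbio config, which is what pushes the total above two per dataset.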

### Examine One Helper

conhelps is list-like, and its elements can be accessed with integer indices. Let's examine one.

In [3]:
pprint(conhelps[0])

BigBioConfigHelper(script='/home/galtay/repos/biomedical/bigbio/hub/hub_repos/an_em/an_em.py',
                   dataset_name='an_em',
                   tasks=[<Tasks.NAMED_ENTITY_RECOGNITION: 'NER'>,
                          <Tasks.COREFERENCE_RESOLUTION: 'COREF'>,
                          <Tasks.RELATION_EXTRACTION: 'RE'>],
                   languages=['English'],
                   config=BigBioConfig(name='an_em_source',
                                       version=1.0.4,
                                       data_dir=None,
                                       data_files=None,
                                       description='AnEM source schema',
                                       schema='source',
                                       subset_id='an_em'),
                   is_local=False,
                   is_pubmed=True,
                   is_bigbio_schema=False,
                   bigbio_schema_caps=None,
                   is_large=False,
                   is_

### Show All Available Datasets

In [4]:
print(conhelps.available_dataset_names)

['an_em', 'anat_em', 'ask_a_patient', 'bc5cdr', 'bc7_litcovid', 'bio_sim_verb', 'bio_simlex', 'bioasq_2021_mesinesp', 'bioasq_task_b', 'bioasq_task_c_2017', 'bioinfer', 'biology_how_why_corpus', 'biomrc', 'bionlp_shared_task_2009', 'bionlp_st_2011_epi', 'bionlp_st_2011_ge', 'bionlp_st_2011_id', 'bionlp_st_2011_rel', 'bionlp_st_2013_cg', 'bionlp_st_2013_ge', 'bionlp_st_2013_gro', 'bionlp_st_2013_pc', 'bionlp_st_2019_bb', 'biored', 'biorelex', 'bioscope', 'biosses', 'blurb', 'cantemist', 'cas', 'cellfinder', 'chebi_nactem', 'chemdner', 'chemprot', 'chia', 'citation_gia_test_collection', 'codiesp', 'cpi', 'ctebmsp', 'ddi_corpus', 'distemist', 'drugprot', 'ebm_pico', 'ehr_rel', 'essai', 'euadr', 'evidence_inference', 'gad', 'genetag', 'genia_ptm_event_corpus', 'genia_relation_corpus', 'genia_term_corpus', 'geokhoj_v1', 'gnormplus', 'hallmarks_of_cancer', 'hprd50', 'iepa', 'jnlpba', 'linnaeus', 'lll', 'mayosrs', 'med_qa', 'medal', 'meddialog', 'meddocan', 'medhop', 'medical_data', 'mediqa_n

### Show Helpers for Specific Dataset

We can also get the helpers for a specific dataset using the dataset name. 

In [5]:
bc5cdr_helpers = conhelps.for_dataset("bc5cdr")
print(len(bc5cdr_helpers))
pprint(bc5cdr_helpers[0].config)
pprint(bc5cdr_helpers[1].config)

2
BigBioConfig(name='bc5cdr_source',
             version=1.5.16,
             data_dir=None,
             data_files=None,
             description='BC5CDR source schema',
             schema='source',
             subset_id='bc5cdr')
BigBioConfig(name='bc5cdr_bigbio_kb',
             version=1.0.0,
             data_dir=None,
             data_files=None,
             description='BC5CDR simplified BigBio schema',
             schema='bigbio_kb',
             subset_id='bc5cdr')


# Loading Datasets

Each config helper provides a wrapper around the [load_dataset](https://huggingface.co/docs/datasets/v2.2.1/en/package_reference/loading_methods#datasets.load_dataset) function from Hugging Face's [datasets](https://huggingface.co/docs/datasets/) package. This wrapper automatically populates the first two arguments of load_dataset:

* path: path to the dataloader script
* name: name of the dataset configuration

If you have a specific dataset and config in mind, you can:

* fetch the helper from conhelps with the for_config_name method
* load the dataset using the load_dataset wrapper

In [6]:
bc5cdr_source = conhelps.for_config_name("bc5cdr_source").load_dataset()
bc5cdr_bigbio = conhelps.for_config_name("bc5cdr_bigbio_kb").load_dataset()

Found cached dataset bc5cdr (/home/galtay/.cache/huggingface/datasets/bigbio___bc5cdr/bc5cdr_source/1.5.16/68f03988d9e501c974d9f9987183bf06474858d1318ed0d4e51cfc4584f0f51f)


  0%|          | 0/3 [00:00<?, ?it/s]

Found cached dataset bc5cdr (/home/galtay/.cache/huggingface/datasets/bigbio___bc5cdr/bc5cdr_bigbio_kb/1.0.0/68f03988d9e501c974d9f9987183bf06474858d1318ed0d4e51cfc4584f0f51f)


  0%|          | 0/3 [00:00<?, ?it/s]

This wrapper function passes through any other kwargs you may need. For example, you can supply data_dir for datasets that are not public:

In [7]:
# note this will not work unless you have the n2c2 dataset locally

#n2c2_2011_source = (
#    conhelps.
#    for_config_name("n2c2_2011_source").
#    load_dataset(data_dir="/path/to/n2c2_2011/data")
#)
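
The pass-through behaviour described above can be sketched without downloading anything. In the example below, `HelperSketch` and `fake_loader` are hypothetical stand-ins written for this illustration; they are not part of the BigBIO package, and the real helper may be implemented differently. The only thing borrowed from the real API is the shape of the call: a script path first, a `name` second, and any extra keyword arguments forwarded unchanged.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class HelperSketch:
    """Hypothetical stand-in for a config helper (not the bigbio class)."""

    script: str       # path to the dataloader script
    config_name: str  # name of the dataset configuration

    def load_dataset(self, loader: Callable[..., Any], **kwargs: Any) -> Any:
        # Fill in the first two arguments and forward everything else.
        return loader(self.script, name=self.config_name, **kwargs)


def fake_loader(path: str, name: str = None, **kwargs: Any) -> Dict[str, Any]:
    """Stand-in for datasets.load_dataset that just echoes its arguments."""
    return {"path": path, "name": name, "kwargs": kwargs}


helper = HelperSketch(script="bc5cdr.py", config_name="bc5cdr_source")
result = helper.load_dataset(fake_loader, data_dir="/path/to/data")
print(result)
```

The loader is injected as an argument here only so the sketch stays runnable offline; with the real helper you would call `load_dataset(data_dir=...)` directly, as in the commented cell above.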

# Filter and Load Multiple Datasets

You can use any attribute of a BigBioConfigHelper to filter the collection. Here are some examples:

### Source schema for n2c2 datasets

In [8]:
n2c2_source_helpers = conhelps.filtered(
    lambda x:
        x.dataset_name.startswith("n2c2")
        and not x.is_bigbio_schema
)

### BigBIO schema datasets that are public and support textual entailment

In [9]:
entailment_helpers = conhelps.filtered(
    lambda x:
        x.is_bigbio_schema
        and not x.is_local
        and Tasks.TEXTUAL_ENTAILMENT in x.tasks
)

### BigBIO schema datasets that are public and not "large"

In [10]:
bb_public_helpers = conhelps.filtered(
    lambda x:
        x.is_bigbio_schema
        and not x.is_local
        and not x.is_large
)
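
To make the filtering pattern concrete, here is a minimal sketch of a list-like container with a predicate-based `filtered` method. `HelperStub` and `HelpersStub` are hypothetical classes invented for this example, with made-up attribute values; they only mimic the attribute names used in the cells above and do not reflect how BigBIO implements its helpers.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List


@dataclass(frozen=True)
class HelperStub:
    """Hypothetical record carrying a few of the attributes used for filtering."""

    dataset_name: str
    is_bigbio_schema: bool
    is_local: bool = False
    is_large: bool = False


class HelpersStub:
    """Hypothetical list-like collection whose filter returns its own type."""

    def __init__(self, helpers: Iterable[HelperStub]) -> None:
        self._helpers: List[HelperStub] = list(helpers)

    def __len__(self) -> int:
        return len(self._helpers)

    def __iter__(self) -> Iterator[HelperStub]:
        return iter(self._helpers)

    def __getitem__(self, key):
        return self._helpers[key]

    def filtered(self, is_keeper: Callable[[HelperStub], bool]) -> "HelpersStub":
        # Keep only the helpers for which the predicate is true.
        return HelpersStub(h for h in self._helpers if is_keeper(h))


# Stub values chosen for illustration only.
stubs = HelpersStub([
    HelperStub("bc5cdr", is_bigbio_schema=False),
    HelperStub("bc5cdr", is_bigbio_schema=True),
    HelperStub("n2c2_2011", is_bigbio_schema=False, is_local=True),
    HelperStub("n2c2_2011", is_bigbio_schema=True, is_local=True),
])

public_bigbio = stubs.filtered(lambda x: x.is_bigbio_schema and not x.is_local)

# Because filtered returns the same container type, calls can be chained.
n2c2_source = (
    stubs
    .filtered(lambda x: x.dataset_name.startswith("n2c2"))
    .filtered(lambda x: not x.is_bigbio_schema)
)
print(len(public_bigbio), len(n2c2_source))
```

Returning the container type rather than a plain list is a design choice that keeps the result filterable again, which is convenient when conditions are built up step by step.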

### Loading filtered datasets

Note that the `filtered` method returns another instance of `BigBioConfigHelpers`. This means you can iterate over any of the helpers defined above and load all of the datasets. 

In [11]:
print(len(bb_public_helpers))

163


In [12]:
# NOTE the first time you run this cell, the public datasets will be downloaded and cached.
# As an example we just load the first 10 datasets 

bb_public_datasets = {}
for helper in bb_public_helpers[:10]:
    bb_public_datasets[helper.config.name] = helper.load_dataset()

Found cached dataset an_em (/home/galtay/.cache/huggingface/datasets/bigbio___an_em/an_em_bigbio_kb/1.0.0/6531a2dc4f5c90c7ee6ebe9ac031d53c179f04090e41ab5259420ea5b6abb8c6)


  0%|          | 0/3 [00:00<?, ?it/s]

Found cached dataset anat_em (/home/galtay/.cache/huggingface/datasets/bigbio___anat_em/anat_em_bigbio_kb/1.0.0/26e5c978c82396e267dec0e126b2c1b1e8a63a5dc5ef51225225870a2a3e3d88)


  0%|          | 0/3 [00:00<?, ?it/s]

Found cached dataset ask_a_patient (/home/galtay/.cache/huggingface/datasets/bigbio___ask_a_patient/ask_a_patient_bigbio_kb/1.0.0/ec2512406bc8163402097ad96a2851e0525fe0dd080650077188bd53bdb15745)


  0%|          | 0/30 [00:00<?, ?it/s]

Found cached dataset bc5cdr (/home/galtay/.cache/huggingface/datasets/bigbio___bc5cdr/bc5cdr_bigbio_kb/1.0.0/68f03988d9e501c974d9f9987183bf06474858d1318ed0d4e51cfc4584f0f51f)


  0%|          | 0/3 [00:00<?, ?it/s]

Found cached dataset bc7_litcovid (/home/galtay/.cache/huggingface/datasets/bigbio___bc7_litcovid/bc7_litcovid_bigbio_text/1.0.0/772ea31933b422bf949a43dd6d13f6155d5cc2058e180f2616cede06040a8c11)


  0%|          | 0/3 [00:00<?, ?it/s]

Found cached dataset bio_sim_verb (/home/galtay/.cache/huggingface/datasets/bigbio___bio_sim_verb/bio_sim_verb_bigbio_pairs/1.0.0/8ab449c032843c956c6ffdbdecfc16971d825d7c6d167765343ac197f1e8ccff)


  0%|          | 0/1 [00:00<?, ?it/s]

Found cached dataset bio_simlex (/home/galtay/.cache/huggingface/datasets/bigbio___bio_simlex/bio_simlex_bigbio_pairs/1.0.0/43ae58c304e1b1283aeb72fb60f2ab506fe04bc7c98b6ea86e0a0e3ea799d6f5)


  0%|          | 0/1 [00:00<?, ?it/s]

Found cached dataset bioasq_2021_mesinesp (/home/galtay/.cache/huggingface/datasets/bigbio___bioasq_2021_mesinesp/bioasq_2021_mesinesp_subtrack1_all_bigbio_text/1.0.0/2efb6ae440bb40e4b5821e1ca05e07f40cd0c5a3bcb48579569d1b40c9c6f724)


  0%|          | 0/3 [00:00<?, ?it/s]

Found cached dataset bioasq_2021_mesinesp (/home/galtay/.cache/huggingface/datasets/bigbio___bioasq_2021_mesinesp/bioasq_2021_mesinesp_subtrack1_only_articles_bigbio_text/1.0.0/2efb6ae440bb40e4b5821e1ca05e07f40cd0c5a3bcb48579569d1b40c9c6f724)


  0%|          | 0/3 [00:00<?, ?it/s]

Found cached dataset bioasq_2021_mesinesp (/home/galtay/.cache/huggingface/datasets/bigbio___bioasq_2021_mesinesp/bioasq_2021_mesinesp_subtrack2_bigbio_text/1.0.0/2efb6ae440bb40e4b5821e1ca05e07f40cd0c5a3bcb48579569d1b40c9c6f724)


  0%|          | 0/3 [00:00<?, ?it/s]

# Dataset Metadata

Each BigBioConfigHelper provides a get_metadata method that calculates schema-specific metadata for configs implementing a BigBIO schema. For example,

In [13]:
conhelps.for_config_name('bc5cdr_bigbio_kb').get_metadata()

Found cached dataset bc5cdr (/home/galtay/.cache/huggingface/datasets/bigbio___bc5cdr/bc5cdr_bigbio_kb/1.0.0/68f03988d9e501c974d9f9987183bf06474858d1318ed0d4e51cfc4584f0f51f)


  0%|          | 0/3 [00:00<?, ?it/s]

{'train': BigBioKbMetadata(samples_count=500, passages_count=1000, passages_char_count=652177, passages_type_counter={'title': 500, 'abstract': 500}, entities_count=9570, entities_normalized_count=9599, entities_type_counter={'Chemical': 5207, 'Disease': 4363}, entities_db_name_counter={'MESH': 9599}, entities_unique_db_ids_count=1328, events_count=0, events_type_counter={}, events_arguments_count=0, events_arguments_role_counter={}, coreferences_count=0, relations_count=15072, relations_type_counter={'CID': 15072}, relations_db_name_counter={}, relations_unique_db_ids_count=0),
 'test': BigBioKbMetadata(samples_count=500, passages_count=1000, passages_char_count=676751, passages_type_counter={'title': 500, 'abstract': 500}, entities_count=9928, entities_normalized_count=9919, entities_type_counter={'Chemical': 5394, 'Disease': 4534}, entities_db_name_counter={'MESH': 9919}, entities_unique_db_ids_count=1315, events_count=0, events_type_counter={}, events_arguments_count=0, events_argu
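
As a rough guide to reading these numbers, the sketch below computes a few similarly named counts over two toy samples using `collections.Counter`. The sample records, the placeholder IDs, and the counting logic are assumptions made for illustration; this is not the package's `get_metadata` implementation. One plausible reading it illustrates is why `entities_normalized_count` can differ from `entities_count`: an entity may carry more than one normalization, or none.

```python
from collections import Counter

# Toy samples loosely shaped like kb-schema records (assumed layout).
# The db_id values are placeholders, not real MeSH identifiers.
toy_samples = [
    {
        "passages": [{"type": "title"}, {"type": "abstract"}],
        "entities": [
            {
                "type": "Chemical",
                "normalized": [
                    {"db_name": "MESH", "db_id": "ID_1"},
                    {"db_name": "MESH", "db_id": "ID_2"},
                ],
            },
            {
                "type": "Disease",
                "normalized": [{"db_name": "MESH", "db_id": "ID_3"}],
            },
        ],
    },
    {
        "passages": [{"type": "title"}, {"type": "abstract"}],
        "entities": [
            {
                "type": "Chemical",
                "normalized": [{"db_name": "MESH", "db_id": "ID_1"}],
            },
        ],
    },
]

passages_type_counter = Counter(
    p["type"] for s in toy_samples for p in s["passages"]
)
entities = [e for s in toy_samples for e in s["entities"]]
entities_type_counter = Counter(e["type"] for e in entities)

# Count every normalization, so an entity with two IDs contributes two.
normalized = [n for e in entities for n in e["normalized"]]
entities_db_name_counter = Counter(n["db_name"] for n in normalized)
unique_db_ids = {(n["db_name"], n["db_id"]) for n in normalized}

print("entities_count:", len(entities))
print("entities_normalized_count:", len(normalized))
print("entities_unique_db_ids_count:", len(unique_db_ids))
```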