Skip to content

chanzuckerberg/ChemDisGene

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

24 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ChemDisGene: A Biomedical Relation Extraction Dataset

This is a public release of the ChemDisGene dataset, a collection of Biomedical research abstracts annotated with mentions of Chemical, Disease and Gene/Gene-product entities, and pairwise relationships between those entities. CZI Science is releasing this data to promote NLP research on Relation Extraction from Biomedical text.

Project Status

This project is stable and maintained, but not actively under development.

Introduction

The ChemDisGene dataset contains two corpora:

  • CTD-derived: A corpus of ~80k abstracts, with entity mentions from PubTator Central, and automatically aligned noisy relationship labels derived from CTD.
  • Curated: A corpus of 523 abstracts, with entity mentions from PubTator Central, and relationship labels manually curated by a team of biologists.

Details on how this corpus was derived from CTD and further curated are in the accompanying paper (see below).

Annotation guidelines developed for the curation task are included here.

Directory Structure

Cloning the Repository

The data files are stored using git lfs. Follow these instructions to install git lfs on your machine. Then issue the following command:

$ git lfs clone https://github.com/chanzuckerberg/ChemDisGene.git

This will create a new directory called ChemDisGene and clone the repository into that directory.

Entities and Relationships

Entity mentions are linked to the following public ontologies:

The dataset uses a total of 18 relation types, based on the classes described below (definitions are from CTD Glossary). Some of these classes are further qualified by a degree.

  • Chemical-Disease:

    • marker/mechanism: A chemical that correlates with a disease.
    • therapeutic: A chemical that has a known or potential therapeutic role in a disease.
  • Chemical-Gene: Each qualified by a degree.

    • activity: An elemental function of a molecule. Degrees: increases, decreases, or affects when the direction is not indicated.
    • binding: A molecular interaction. Degrees: affects.
    • expression: Expression of a gene product. Degrees: increases, decreases, affects.
    • localization: Part of the cell where a molecule resides. Degrees: affects.
    • metabolic processing: The biochemical alteration of a molecule’s structure (not including changes in expression, stability, folding, localization, splicing, or transport). Degrees: increases, decreases, affects.
    • transport: The movement of a molecule into or out of a cell. Degrees: increases, decreases, affects.
  • Gene-Disease:

    • marker/mechanism: A gene that may be a biomarker of a disease or play a role in the etiology of a disease.
    • therapeutic: A gene that is or may be a ther- apeutic target in the treatment a disease.

ChemDisGene Data Format

The CTD-derived corpus is derived from the February 2021 dump of CTD, and is partitioned into train, dev and test subsets based on paper publication year: 2018 for dev and the years 2019, 2020 for test. Each split consists of an abstracts file containing the text of the abstracts and all the entity mentions, and a relationships file containing all the identified relationships.

The Curated corpus in ChemDisGene contains 271 documents from the test split of the CTD-derived corpus, and another set of older 252 abstracts taken from DrugProt that had also been annotated by CTD. These 252 documents are not in the CTD-derived corpus.

The Curated corpus consists of the following files:

  • abstracts with entity mentions
  • approved relationships derived from CTD
  • additional approved relationships
  • file containing PubMed® IDs of the DrugProt abstracts

All 'abstract' files are in PubTator format. All relationship files also follow the PubTator format for relationships. Both are described below.

The PubTator format for Abstracts and Entity Mentions

Abstracts are annotated in PubTator format. Each document ends with a blank line, and is represented as (without the spaces):

PMID | t | Title text
PMID | a | Abstract text
PMID TAB StartIndex TAB EndIndex TAB MentionTextSegment TAB EntityType TAB EntityID
...

The first two lines present the Title and Abstract texts (no line-breaks or tabs in the text). Subsequent lines present the mentions, one per line. The StartIndex and EndIndex are 0-based character indices into the document text, constructed by concatenating the Title and Abstract, separated by a SPACE character. The MentionTextSegment is the actual mention between those character positions. The EntityID is the unique identifier for that entity in the corresponding ontology, and the EntityType is one of "Chemical", "Disease" or "Gene". If the entity is linked to more than one ID, then that field contains a |-separated list of all the IDs. Occasionally, PubTator Central identifies the type of a mention but is unable to link it. This is indicated by the special ID "-".

Here is an example:

10226872|t|Disodium cromoglycate does not prevent terbutaline-induced desensitization of beta 2-adrenoceptor-mediated cardiovascular in vivo functions in human volunteers.
10226872|a|In humans, prolonged administration of the beta 2-adrenoceptor agonist terbutaline leads to a desensitization of beta 2-adrenoceptor-mediated cardiovascular responses, which can be blunted by concomitant administration of the antianaphylactic drug ketotifen. This study investigated the effect of disodium cromoglycate, another antiallergic drug, on terbutaline-induced desensitization of beta-adrenoceptor-mediated cardiovascular and noncardiovascular responses. In a double-blind, placebo-controlled, randomized design, nine healthy male volunteers received disodium cromoglycate (4 x 200 mg/day, p.o.) or placebo for 3 weeks with terbutaline (3 x 5 mg/day, p.o.) administered concomitantly during the last 2 weeks. beta 2-Adrenoceptor cardiovascular function was assessed by the increase in heart rate and reduction of diastolic blood pressure induced by an incremental intravenous infusion of the unselective beta-adrenoceptor agonist isoprenaline; beta 1-adrenoceptor cardiovascular function was assessed by exercise-induced tachycardia. Tremulousness was monitored as a beta 2-adrenoceptor-mediated noncardiovascular effect. After 2 weeks' administration of terbutaline, there was a marked and significant (p < 0.001) attenuation of isoprenaline-induced tachycardia (mean percentage attenuation, 53.3%) and of the isoprenaline-induced decrease in diastolic blood pressure (mean percentage attenuation, 55.6%). Exercise-induced tachycardia also was significantly (p < 0.001) blunted, but the magnitude of this attenuation was only very small (mean attenuation, 5.6%). Disodium cromoglycate affected neither the rightward shift of beta 2-adrenoceptor-mediated responses nor the small rightward shift in beta 1-adrenoceptor-mediated exercise tachycardia after 2 weeks' administration of terbutaline. Tremulousness observed during the first few days of terbutaline administration disappeared after 4 to 8 days, indicating development of desensitization of beta 2-adrenoceptor-mediated noncardiovascular responses. This was not prevented by disodium cromoglycate. These results confirm that long-term beta 2-adrenoceptor agonist therapy leads to a desensitization of beta 2-adrenoceptor-mediated cardiovascular and noncardiovascular effects in humans in vivo. However, unlike ketotifen, cromolyn sodium is not able to attenuate this desensitization.
10226872        0       21      Disodium cromoglycate   Chemical        MESH:D004205
10226872        39      50      terbutaline     Chemical        MESH:D013726
10226872        78      97      beta 2-adrenoceptor     Gene    154
10226872        204     223     beta 2-adrenoceptor     Gene    154
...

In this example, the Title is 160 characters long. The first mention occurs in the title, and is for the Chemical "Disodium cromoglycate" whose MeSH® id is MESH:D004205. The fourth mention in the list is from the body of the abstract, and is for the Gene "beta 2-adrenoceptor" linked to the NCBI Gene id "154". There is a previous mention of the same gene, in the title.

Format for Relationships

Each line in a relationships file indicates one relationship in one of the corresponding documents. This is a TAB-separated file, with the following columns:

  • PubMedID: PubMed® ID for the document in which this relationship occurs.
  • Relation-Type: One of 18 relation type labels
  • Subject-Entity-ID: Unique ontology identifier for the subject entity
  • Object-Entity-ID: Unique ontology identifier for the subject entity

Relationships are associated with the document, but not specific entity mentions (distant labeling). Here is an example showing three relationships associated with the abstract with PubMed® ID "10226872":

10226872        chem_disease:marker/mechanism   MESH:D007545    MESH:D013610
10226872        chem_gene:affects^binding       MESH:D013726    154
10226872        chem_gene:increases^activity    MESH:D013726    154

The first relationship in the example is for the relation type "Chemical-Disease: marker/mechanism", with argument entities "MESH:D007545" (a Chemical) and "MESH:D013610" (a Disease). The relation type in the second relationship is "Chemical-Gene: affects-binding", where "affects" is the degree of "binding".

Looking up Entities in their Ontologies

The following URL formats can be used to look up details about each entity ({entid} is the identifier without the ontology name):

  • MeSH®: https://meshb.nlm.nih.gov/record/ui?ui={entid}
  • OMIM®: https://www.omim.org/entry/{entid}
  • NCBI Gene: https://www.ncbi.nlm.nih.gov/gene/{entid}

For example, here is the link for the Chemical "Isoproterenol": MESH:D007545.

How to Cite

If you use ChemDisGene, please cite the following paper:

Dongxu Zhang, Sunil Mohan, Michaela Torkar and Andrew McCallum. A Distant Supervision Corpus for Extracting Biomedical Relationships Between Chemicals, Diseases and Genes. In LREC 2022. [arXiv]

@InProceedings{zhang-etal:2022:LREC,
  author = {Dongxu Zhang and Sunil Mohan and Michaela Torkar and Andrew McCallum},
  title = {A Distant Supervision Corpus for Extracting Biomedical Relationships Between Chemicals, Diseases and Genes},
  booktitle = {Proceedings of The 13th Language Resources and Evaluation Conference},
  month = {June},
  year = {2022},
  address = {Marseille, France},
  publisher = {European Language Resources Association},
}

License

This data is being released under the CC0 license.

The abstracts in the dataset were selected from those available from PubMed® / Medline® during February 2021. Users are referred to that source for the most current and accurate version of the text for the corresponding papers.

Entity mentions were obtained using PubTator Central, and the relationships are based on the Comparative Toxicogenomics Database.

Curated Chemical–Gene, Chemical–Disease and Gene–Disease interactions data were retrieved from the Comparative Toxicogenomics Database (CTD), MDI Biological Laboratory, Salisbury Cove, Maine, and NC State University, Raleigh, North Carolina. [February, 2021].

Feedback, Questions

If you have any comments, questions or issues, please post a note in GitHub issues.

Security Issues

Reporting security issues

About

Bio relation extraction labeled dataset

Resources

Code of conduct

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages