Revise implementation of BioRED #853

mariosaenger · 2023-01-05T10:24:59Z

This PR improves the implementation of the BioRED corpus:

The previous implementation created a separate entity for every combination of entity mention and database identifier. This was fixed so that a single entity mention can have multiple database ids.
Furthermore, the name of the database an entity is linked to was added.
BioRED only provides abstract-level annotations for entity-linked relation pairs rather than materializing links between all surface-form mentions of a relation. Analogous to BC5CDR, we enumerate all mention pairs concerning the entities in the triple.
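
The mention-pair enumeration described above could look roughly like the following. This is a minimal sketch and not the code from this PR: the function name `enumerate_mention_pairs`, the dictionary keys `id` and `db_ids`, and the placeholder identifiers are all assumptions made for illustration.

```python
from itertools import product


def enumerate_mention_pairs(mentions, relation):
    """Expand one abstract-level relation into mention-level pairs.

    `mentions` is a list of dicts with the hypothetical keys "id" (mention id)
    and "db_ids" (all database ids linked to that mention).
    `relation` is a (head_db_id, tail_db_id, relation_type) triple.
    """
    head_db_id, tail_db_id, rel_type = relation
    heads = [m["id"] for m in mentions if head_db_id in m["db_ids"]]
    tails = [m["id"] for m in mentions if tail_db_id in m["db_ids"]]
    # Cartesian product of matching mentions; never pair a mention with itself.
    return [(h, t, rel_type) for h, t in product(heads, tails) if h != t]


# Placeholder data: "ID_A", "ID_B", "ID_C" are made-up identifiers.
mentions = [
    {"id": "m1", "db_ids": ["ID_A"]},
    {"id": "m2", "db_ids": ["ID_A", "ID_C"]},  # one mention, two database ids
    {"id": "m3", "db_ids": ["ID_B"]},
]
pairs = enumerate_mention_pairs(mentions, ("ID_A", "ID_B", "Association"))
```

Here both mentions linked to `ID_A` are paired with the single mention linked to `ID_B`, so one abstract-level triple yields two mention-level relations.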

galtay · 2023-01-06T03:56:42Z

@davidkartchner can you take a look here and comment on the database names? is there a way we can make these database names consistent with the ones you are using?

@mariosaenger is this PR sensitive to the exact db name? i.e. would it break anything if we use "ChemicalEntity": "MESH" instead of "ChemicalEntity": "Medical Subject Headings (MESH)"?

davidkartchner · 2023-01-06T04:25:25Z

@galtay @mariosaenger In order to be consistent with other datasets, I would use the following mapping from type to database:

TYPE_TO_DATABASE = {
    "CellLine": "Cellosaurus",
    "ChemicalEntity": "MESH",
    "DiseaseOrPhenotypicFeature": "MESH",  # or "OMIM", depending on the identifier
    "GeneOrGeneProduct": "NCBIGene",
    "OrganismTaxon": "NCBITaxon",
    "SequenceVariant": "dbSNP",  # or "custom", depending on the identifier
}

For cases where an entity type can be linked to multiple databases, it is especially important to correctly specify which database the identifier comes from, so that an entity normalization model can be trained effectively later on. For DiseaseOrPhenotypicFeature or SequenceVariant, you can probably determine which database an identifier links to with some basic checks on the string format (e.g. all OMIM identifiers have "OMIM" prepended). An example can be found at https://huggingface.co/datasets/bigbio/ncbi_disease/blob/main/ncbi_disease.py#L233
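
A string-format check of the kind suggested above might be sketched as follows. This is only an illustration under stated assumptions, not the implementation merged in this PR: the helper name `guess_database` is hypothetical, and the rules (OMIM ids start with "OMIM", dbSNP ids look like "rs" followed by digits, everything else falls back to MESH or "custom") are assumptions that should be checked against the actual corpus.

```python
import re

# Entity types that link to exactly one database (names from the mapping above).
SINGLE_DATABASE_TYPES = {
    "CellLine": "Cellosaurus",
    "ChemicalEntity": "MESH",
    "GeneOrGeneProduct": "NCBIGene",
    "OrganismTaxon": "NCBITaxon",
}


def guess_database(entity_type: str, identifier: str) -> str:
    """Guess the database name from the entity type and identifier format."""
    ident = identifier.strip()
    if entity_type == "DiseaseOrPhenotypicFeature":
        # Assumption: OMIM ids carry an "OMIM" prefix; anything else is MeSH.
        return "OMIM" if ident.upper().startswith("OMIM") else "MESH"
    if entity_type == "SequenceVariant":
        # Assumption: dbSNP ids look like "rs" followed by digits.
        is_rs_id = re.fullmatch(r"rs\d+", ident, flags=re.IGNORECASE)
        return "dbSNP" if is_rs_id else "custom"
    # Raises KeyError for unknown entity types rather than guessing.
    return SINGLE_DATABASE_TYPES[entity_type]
```

Failing loudly on an unknown entity type is a deliberate choice in this sketch, since a silently wrong database name would be hard to spot when training a normalization model later.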

mariosaenger · 2023-01-09T15:11:39Z

@galtay @davidkartchner thanks for the feedback. I revised the implementation to adhere to the database naming scheme.

In a future PR, we could possibly also consider standardising the naming scheme of all NEN data sets via constants.

galtay · 2023-01-12T05:46:08Z

thanks @mariosaenger. yes I agree, it would be nice to have a check for standardized database names in the unit tests.
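
Such a check could be sketched along these lines. Nothing here exists in the repository as far as this thread shows: the constant `KNOWN_DATABASE_NAMES`, the helper `find_unknown_database_names`, and the assumed entity layout (a `normalized` list of dicts with a `db_name` key) are hypothetical and only illustrate the idea.

```python
# Hypothetical shared constant; the names follow the mapping proposed in this thread.
KNOWN_DATABASE_NAMES = frozenset(
    {"Cellosaurus", "MESH", "OMIM", "NCBIGene", "NCBITaxon", "dbSNP", "custom"}
)


def find_unknown_database_names(entities):
    """Return the sorted db_name values that are not in KNOWN_DATABASE_NAMES.

    Assumes each entity is a dict whose optional "normalized" entry is a
    list of dicts with a "db_name" key.
    """
    unknown = set()
    for entity in entities:
        for link in entity.get("normalized", []):
            if link["db_name"] not in KNOWN_DATABASE_NAMES:
                unknown.add(link["db_name"])
    return sorted(unknown)


# Made-up example entities for illustration.
example_entities = [
    {"normalized": [{"db_name": "MESH", "db_id": "ID_A"}]},
    {"normalized": [{"db_name": "Medical Subject Headings (MESH)", "db_id": "ID_B"}]},
]
unknown_names = find_unknown_database_names(example_entities)
```

A unit test could then assert that the returned list is empty for every example a dataset loader yields.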

Revise implementation of BioRED: fix entity modeling and improve entity normalization

b84993c

mariosaenger requested review from hakunanatasha, jason-fries, sunnnymskang, ruisi-su, galtay, leonweber, sg-wbi and debajyotidatta as code owners January 5, 2023 10:24

Harmonize database names to existing naming scheme

c57b48a

galtay approved these changes Jan 12, 2023

galtay merged commit 827fbda into bigscience-workshop:main Jan 12, 2023
