Revise implementation of BioRED #853

Merged
merged 2 commits into from Jan 12, 2023

mariosaenger
Collaborator

This PR improves the implementation of the BioRed corpus:

  • Previously, a separate entity was created for each combination of entity mention and database identifier. Now each entity mention is a single entity that can carry multiple database ids.
  • Furthermore, the name of the database an entity is linked to was added.
  • BioRed only provides abstract-level annotations for entity-linked relation pairs rather than materializing links between all surface form mentions of a relation. Analogous to BC5CDR, we enumerate all mention pairs concerning the entities in the triple.
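For readers unfamiliar with the mention-pair expansion described in the last bullet, here is a minimal sketch of the idea. It is not the code from this PR: the function name `enumerate_mention_pairs`, the dictionary keys, and the placeholder ids (`CHEM_1`, `DIS_1`) are all hypothetical and chosen only for illustration.

```python
from itertools import product


def enumerate_mention_pairs(entities, relation):
    """Expand one abstract-level relation into mention-level relations.

    `entities` is a list of mentions, each with a list of normalizations
    (one mention may carry several database ids). `relation` links two
    database ids. All field names here are illustrative assumptions.
    """

    def mentions_of(db_id):
        # Collect every mention that is linked to the given database id.
        return [
            e["id"]
            for e in entities
            if any(n["db_id"] == db_id for n in e["normalized"])
        ]

    heads = mentions_of(relation["head_db_id"])
    tails = mentions_of(relation["tail_db_id"])
    return [
        {"type": relation["type"], "arg1_id": h, "arg2_id": t}
        for h, t in product(heads, tails)
        if h != t  # assumption: never pair a mention with itself
    ]


# Made-up example: two mentions of one chemical, one mention of a disease.
entities = [
    {"id": "T1", "normalized": [{"db_name": "MESH", "db_id": "CHEM_1"}]},
    {"id": "T2", "normalized": [{"db_name": "MESH", "db_id": "CHEM_1"}]},
    {"id": "T3", "normalized": [{"db_name": "MESH", "db_id": "DIS_1"}]},
]
relation = {"type": "Association", "head_db_id": "CHEM_1", "tail_db_id": "DIS_1"}

pairs = enumerate_mention_pairs(entities, relation)
```

In this toy input the single abstract-level relation becomes two mention-level relations, one per chemical mention.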

@galtay
Collaborator

galtay commented Jan 6, 2023

@davidkartchner can you take a look here and comment on the database names? is there a way we can make these database names consistent with the ones you are using?

@mariosaenger is this PR sensitive to the exact db name? i.e. would it break anything if we use "ChemicalEntity": "MESH" instead of "ChemicalEntity": "Medical Subject Headings (MESH)"?

@davidkartchner
Contributor

davidkartchner commented Jan 6, 2023

@galtay @mariosaenger In order to be consistent with other datasets, I would use the following mapping from type to database:

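TYPE_TO_DATABASE = {
    "CellLine": "Cellosaurus",
    "ChemicalEntity": "MESH",
    # Either database is possible; decide per identifier.
    "DiseaseOrPhenotypicFeature": ("MESH", "OMIM"),
    "GeneOrGeneProduct": "NCBIGene",
    "OrganismTaxon": "NCBITaxon",
    # Either dbSNP or a custom identifier; decide per identifier.
    "SequenceVariant": ("dbSNP", "custom"),
}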

For cases where an entity type can be linked to multiple databases, it is especially important to correctly specify which database each identifier comes from, so that an entity normalization model can be trained effectively later on. For DiseaseOrPhenotypicFeature or SequenceVariant, you can probably determine which database an identifier links to with some basic checks on the string format (e.g. all OMIM normalizations have "OMIM" prepended to their identifier). An example can be found at https://huggingface.co/datasets/bigbio/ncbi_disease/blob/main/ncbi_disease.py#L233
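As a rough illustration of the string-format checks suggested above, the following sketch picks a database name per identifier. It is an assumption-laden example rather than the code that was merged: `resolve_database` and `SINGLE_DATABASE_TYPES` are hypothetical names, the OMIM check relies only on the "OMIM" prefix mentioned in this comment, and the dbSNP check assumes identifiers of the form `rs` followed by digits.

```python
import re

# Entity types that map to exactly one database (taken from the mapping above).
SINGLE_DATABASE_TYPES = {
    "CellLine": "Cellosaurus",
    "ChemicalEntity": "MESH",
    "GeneOrGeneProduct": "NCBIGene",
    "OrganismTaxon": "NCBITaxon",
}

# Assumption: dbSNP reference SNP ids look like "rs" followed by digits.
DBSNP_PATTERN = re.compile(r"^rs\d+$", re.IGNORECASE)


def resolve_database(entity_type, identifier):
    """Return the database name for one identifier of the given entity type."""
    if entity_type == "DiseaseOrPhenotypicFeature":
        # OMIM identifiers carry an "OMIM" prefix; everything else is treated as MESH.
        return "OMIM" if identifier.upper().startswith("OMIM") else "MESH"
    if entity_type == "SequenceVariant":
        # Anything that is not an rs id is treated as a custom identifier.
        return "dbSNP" if DBSNP_PATTERN.match(identifier) else "custom"
    # Raises KeyError for unknown entity types, which surfaces mapping gaps early.
    return SINGLE_DATABASE_TYPES[entity_type]
```

For example, `resolve_database("DiseaseOrPhenotypicFeature", "OMIM:123456")` yields `"OMIM"` under these assumptions, while an identifier without that prefix falls back to `"MESH"`.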

@mariosaenger
Collaborator Author

@galtay @davidkartchner thanks for the feedback. I revised the implementation to adhere to the database naming scheme.

In a future PR, we could possibly also consider standardising the naming scheme of all NEN data sets via constants.

@galtay
Collaborator

galtay commented Jan 12, 2023

thanks @mariosaenger. yes I agree, it would be nice to have a check for standardized database names in the unit tests.
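One way such a check could look is sketched below. This is purely illustrative and not an existing test in the repository: `STANDARD_DB_NAMES` and `find_nonstandard_db_names` are hypothetical names, the allowed set simply reuses the database names proposed earlier in this thread, and the entity layout (a `normalized` list with `db_name` keys) is an assumption.

```python
# Hypothetical shared constant; names taken from the mapping proposed in this thread.
STANDARD_DB_NAMES = frozenset(
    {"Cellosaurus", "MESH", "OMIM", "NCBIGene", "NCBITaxon", "dbSNP", "custom"}
)


def find_nonstandard_db_names(entities):
    """Return the sorted list of database names that are not in the standard set."""
    seen = {
        normalization["db_name"]
        for entity in entities
        for normalization in entity.get("normalized", [])
    }
    return sorted(seen - STANDARD_DB_NAMES)


# Made-up entities: the second one uses a long-form name that should be flagged.
sample_entities = [
    {"id": "T1", "normalized": [{"db_name": "MESH", "db_id": "X1"}]},
    {"id": "T2", "normalized": [{"db_name": "Medical Subject Headings (MESH)", "db_id": "X2"}]},
    {"id": "T3", "normalized": []},
]

offending = find_nonstandard_db_names(sample_entities)
```

A unit test could then assert that the returned list is empty for every example a data set loader produces.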

@galtay galtay merged commit 827fbda into bigscience-workshop:main Jan 12, 2023