Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New NER drugs pipeline #58

Merged
merged 8 commits into from
May 31, 2022
Merged

New NER drugs pipeline #58

merged 8 commits into from
May 31, 2022

Conversation

scossin
Copy link
Contributor

@scossin scossin commented Apr 23, 2022

A new NER pipeline for drug detection

Description

Hello folks,

My first motivation was to understand how this package works and try to add a pipeline.
Congratulations on the work done, the documentation is very good and it was easy to add a pipeline.
At CHU de Bordeaux, we developed Romedi for drug detection (https://www.romedi.fr)
From Romedi we extract a dictionary of brand names and actives ingredients in order to detect drug in textual content.
I added the dictionary (resources/drugs.json) and created a simple pipeline that utilizes the phrase matcher so you can test it on your data. You may have another way of detecting drugs, I will not be offended if you reject this pull request.

Best,

Checklist

  • If this PR is a bug fix, the bug is documented in the test suite.
  • Changes were documented in the changelog (pending section).
  • If necessary, changes were made to the documentation (eg new pipeline).

@percevalw
Copy link
Member

Fantastic work, thank you for this very clean PR ! Very much looking forward to using it.
I have a few comments and questions as we haven't fully decided how to interact with concepts and normalized entities yet:

  • the drugs.csv file is static, do you think this resource can be expected to evolve? In which case, we could then do the preprocessing from external files downloaded (and cached) as the user installs the library ?
  • I see that you use the label attribute to store the drug concept. I think we could instead store a name like "drug" in the label, and store the concept in the ent_kb_id field

@bdura
Copy link
Contributor

bdura commented Apr 25, 2022

Thank you for the tremendous addition!

I'm wondering if we should adapt the matchers (be they RegexMatcher or PhraseMatcher) to include the definition of the kb_id_ field and handle the case of single label/multiple subconcepts natively.

We should probably handle this case by hand on this one, to avoid losing too much time. But it would be a nice addition, i'm creating an issue to discuss this further.

@scossin
Copy link
Contributor Author

scossin commented Apr 26, 2022

@percevalw @bdura
Thanks for your quick feedback!

  • the drugs.csv file is static

do you think this resource can be expected to evolve?

Yes, the dictionary will evolve (Romedi has not been updated for a year but we will update it soon, recently marketed drugs are absent from this dictionary)

do the preprocessing from external files downloaded (and cached) as the user installs the library

I think there are three options you need to consider:

  1. Retrieve the dictionary with the SparqlEndpoint of Romedi; you are sure to have the latest version : 1) you need to add another dependency to perform Sparql queries, 2) you add a dependence to an external website that may be blocked inside a hospital 3) there is no warning when Romedi is updated
  2. Cach and download the file on-demand or at installation as you suggest : 1) the file versioning will be done outside of this repository so it will be more be difficult to keep track of the changes ; 2) the external website to download the dictionary could be blocked
  3. Put the file in the repository: it makes the package heavier to download, it will not scale if you want to add other terminologies

My opinion is that option 2 is the best but the decision is up to you.

  • label attribute to store

we could instead store a name like "drug" in the label

I totally agree.

Another point I would like to raise is that I made choices (brand names and ATC labels) when creating the dictionary without discussing the best options.
At CHU de Bordeaux, we map mention of drugs to Romedi's URIs. For example, Glucophage is mapped to http://www.romedi.fr/romedi/BN06012kovrrmbkrvjlvhjdnqc468rkg1e ; if you want a json representation: http://www.romedi.fr/romedi/BN06012kovrrmbkrvjlvhjdnqc468rkg1e?json ; from this URI users can retrieve ATC codes but also other information such as its active ingredient etc... The RDF graph can also be queried with URIs.
This way is more "powerful" but the counterpart is that users need to understand how to retrieve information from an URI. I think the dictionary with ATC codes is easiest to start with but a dictionary with Romedi URIs would be more appropriate for advanced users. In a second step, we could propose another drugs dictionary with Romedi URIs.

@gozat
Copy link
Contributor

gozat commented Apr 26, 2022

@scossin I think you should either put your name in the doc, or refer to the IAM team. It might be easier to find some referent people than the mysterious CHU de Bordeaux's Data Science team that refers to nobody.

@percevalw
Copy link
Member

Thank you for these precisions ! Here are a few elements discussed with @bdura.
At first we can keep the Romedi json file as suggested in your PR and work on fetching these synonyms from outside later.
Regarding the output of the pipeline, we propose to store the concept in the attribute

ent._.value = Concept(
  concept_source="romedi", # name of the terminology
  concept_id="ATC:...", # unique concept id
  preferred_term="preferred lexical variant for this concept",
  url="optional url to get more info about the concept (CIM10, UMLS Browser, Romedi, ...)"
) 

and set the ent.label to drug. To align with spacy, we can also store the concept_id in the ent.kb_id.

Following this scheme means that we cannot store multiple concepts for a given entity, which is often the right choice IMO.

What do you think of this proposal?

@scossin
Copy link
Contributor Author

scossin commented May 2, 2022

@percevalw

we cannot store multiple concepts for a given entity

I agree with you, one single concept for a given entity seems fine to me. The documentation should explain how ambiguous terms are handled (rare scenarios), ex: 'irc' that stands for 'insuffisance renale chronique' and 'insuffisance respiratoire chronique'. I guess one of two concepts will be picked at random by the phrasematcher, the user should know the terminology contains ambiguous terms and that he needs another pipeline to disambiguate those. Is that correct ?

@bdura
Copy link
Contributor

bdura commented May 25, 2022

Hello @scossin,

So sorry for the delay. I've just proposed a very simple terminology matcher (see #75) that could suit our needs. It would only require you to modify the factory to use the TerminologyMatcher instead of the GenericMatcher.

I think we'll be good to go after that! What do you think?

The philosophical questions that we raised can be experimented with once the pipeline is up 🙂

Copy link
Contributor

@bdura bdura left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took the liberty of proposing a change that makes use of the new TerminologyMatcher. If that seems good to you, I think we can merge your feature!

Thanks again!

edsnlp/pipelines/ner/drugs/factory.py Outdated Show resolved Hide resolved
Copy link
Contributor

@bdura bdura left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot to adapt the tests...

tests/pipelines/ner/test_drugs.py Outdated Show resolved Hide resolved
@bdura bdura self-requested a review May 31, 2022 07:32
Copy link
Contributor

@bdura bdura left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took the liberty of accepting the changes to finally merge your PR. Thanks for the great work, and sorry again for the delay!

docs/pipelines/ner/drugs.md Outdated Show resolved Hide resolved
@bdura bdura merged commit f1c9bf0 into aphp:master May 31, 2022
@scossin
Copy link
Contributor Author

scossin commented May 31, 2022

I took the liberty of proposing a change that makes use of the new TerminologyMatcher. If that seems good to you, I think we can merge your feature!

Thanks again!

That's great thanks for making the changes and for merging !

@scossin scossin deleted the drugs_pipeline branch May 31, 2022 10:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants