New NER drugs pipeline #58
…d names and active ingredients with their ATC codes
…er it in setup.py and add a unit test
Fantastic work, thank you for this very clean PR! Very much looking forward to using it.
Thank you for the tremendous addition! I'm wondering if we should adapt the matchers (be they …
We should probably handle this case by hand on this one, to avoid losing too much time. But it would be a nice addition, so I'm creating an issue to discuss this further.
@percevalw @bdura
Yes, the dictionary will evolve. Romedi has not been updated for a year, so recently marketed drugs are absent from this dictionary, but we will update it soon.
I think there are three options you need to consider:
My opinion is that option 2 is the best but the decision is up to you.
I totally agree. Another point I would like to raise is that I made choices (brand names and ATC labels) when creating the dictionary without discussing the best options.
@scossin I think you should either put your name in the doc, or refer to the IAM team. It might be easier to find a contact person than the mysterious …
Thank you for these clarifications! Here are a few elements discussed with @bdura.

    ent._.value = Concept(
        concept_source="romedi",  # name of the terminology
        concept_id="ATC:...",  # unique concept id
        preferred_term="preferred lexical variant for this concept",
        url="optional url to get more info about the concept (CIM10, UMLS Browser, Romedi, ...)",
    )

and set the …
Following this scheme means that we cannot store multiple concepts for a given entity, which is often the right choice IMO. What do you think of this proposal?
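To make the proposal concrete, here is a minimal sketch of what such a `Concept` container could look like. Only the four field names come from the snippet above; the dataclass definition, the `frozen=True` choice, and the example values are assumptions made for illustration, not the project's actual implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Concept:
    """Hypothetical container holding a single normalised concept."""

    concept_source: str  # name of the terminology, e.g. "romedi"
    concept_id: str  # unique concept id within that terminology
    preferred_term: str  # preferred lexical variant for this concept
    url: Optional[str] = None  # optional link to more information


# Illustrative value only: N02BE01 is the ATC code for paracetamol.
paracetamol = Concept(
    concept_source="romedi",
    concept_id="ATC:N02BE01",
    preferred_term="paracetamol",
)
```

Making the container immutable means a concept attached to an entity cannot be altered by a later pipeline component, which fits the "one concept per entity" choice discussed here.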
I agree with you, one single concept for a given entity seems fine to me. The documentation should explain how ambiguous terms are handled (rare scenarios), e.g. 'irc', which stands for both 'insuffisance renale chronique' and 'insuffisance respiratoire chronique'. I guess one of the two concepts will be picked at random by the phrase matcher; users should know that the terminology contains ambiguous terms and that they need another pipeline to disambiguate those. Is that correct?
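One way to document such cases is to list the ambiguous terms ahead of time. The sketch below assumes the terminology is available as a plain mapping from concept id to its list of terms; the ids `IRC_RENALE` and `IRC_RESPI` and the function name are made up for the example.

```python
from collections import defaultdict

# Hypothetical terminology: concept id -> lexical variants.
terminology = {
    "IRC_RENALE": ["insuffisance renale chronique", "irc"],
    "IRC_RESPI": ["insuffisance respiratoire chronique", "irc"],
}


def find_ambiguous_terms(terminology):
    """Return the terms that map to more than one concept id."""
    concepts_by_term = defaultdict(set)
    for concept_id, terms in terminology.items():
        for term in terms:
            concepts_by_term[term.lower()].add(concept_id)
    return {
        term: sorted(ids)
        for term, ids in concepts_by_term.items()
        if len(ids) > 1
    }


ambiguous = find_ambiguous_terms(terminology)
```

The resulting mapping could be published alongside the terminology so that users know which matches need a separate disambiguation step.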
Hello @scossin,
So sorry for the delay. I've just proposed a very simple terminology matcher (see #75) that could suit our needs. It would only require you to modify the factory to use the …
I think we'll be good to go after that! What do you think? The philosophical questions that we raised can be experimented with once the pipeline is up 🙂
I took the liberty of proposing a change that makes use of the new TerminologyMatcher. If that seems good to you, I think we can merge your feature!
Thanks again!
I forgot to adapt the tests...
I took the liberty of accepting the changes to finally merge your PR. Thanks for the great work, and sorry again for the delay!
That's great, thanks for making the changes and for merging!
A new NER pipeline for drug detection
Description
Hello folks,
My first motivation was to understand how this package works and try to add a pipeline.
Congratulations on the work done, the documentation is very good and it was easy to add a pipeline.
At CHU de Bordeaux, we developed Romedi for drug detection (https://www.romedi.fr)
From Romedi we extracted a dictionary of brand names and active ingredients in order to detect drugs in textual content.
I added the dictionary (resources/drugs.json) and created a simple pipeline that uses the phrase matcher so you can test it on your data. You may have another way of detecting drugs; I will not be offended if you reject this pull request.
Best,
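For readers who want the general idea without the package, here is a rough, dependency-free sketch of dictionary-based drug detection. It is not the pipeline from this PR, which relies on the phrase matcher; it swaps in a plain longest-match lookup over token n-grams. The layout of the dictionary (ATC code mapped to a list of terms) and the function names are assumptions, and the two entries are illustrative only.

```python
import re

# Assumed layout: ATC code -> brand names and active ingredients.
drugs = {
    "N02BE01": ["paracetamol", "doliprane"],
    "M01AE01": ["ibuprofen", "advil"],
}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(drugs):
    """Map each term, as a tuple of tokens, to its ATC code."""
    index = {}
    for atc_code, terms in drugs.items():
        for term in terms:
            index[tuple(tokenize(term))] = atc_code
    return index


def find_drugs(text, index):
    """Return (matched term, ATC code) pairs, preferring longer matches."""
    tokens = tokenize(text)
    max_len = max((len(key) for key in index), default=0)
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate n-gram first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in index:
                matches.append((" ".join(key), index[key]))
                i += n
                break
        else:
            i += 1  # no term starts at this token
    return matches


index = build_index(drugs)
matches = find_drugs("Patient sous Doliprane 1g et ibuprofen.", index)
```

Because the index keeps a single ATC code per term, a term shared by two concepts would silently keep only the last one loaded.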
Checklist