New NER drugs pipeline #58

scossin · 2022-04-23T16:45:03Z

A new NER pipeline for drug detection

Description

Hello folks,

My first motivation was to understand how this package works and try to add a pipeline.
Congratulations on the work done, the documentation is very good and it was easy to add a pipeline.
At CHU de Bordeaux, we developed Romedi for drug detection (https://www.romedi.fr).
From Romedi we extract a dictionary of brand names and active ingredients in order to detect drugs in textual content.
I added the dictionary (resources/drugs.json) and created a simple pipeline that utilizes the phrase matcher so you can test it on your data. You may have another way of detecting drugs, I will not be offended if you reject this pull request.
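
As an illustration of the dictionary-based approach described here, a minimal sketch in plain Python follows. It is not the code from this PR: the PR relies on the package's phrase matcher, whereas this stand-in uses a case-insensitive regular expression. The `DRUGS` excerpt, its ATC-code-to-terms layout, and the function names are assumptions made for the example.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical excerpt of a drug dictionary: ATC code -> surface forms
# (brand names and active ingredients). The real drugs.json may differ.
DRUGS: Dict[str, List[str]] = {
    "A10BA02": ["glucophage", "metformine"],
    "N02BE01": ["doliprane", "paracetamol"],
}


def build_pattern(drugs: Dict[str, List[str]]) -> "re.Pattern[str]":
    """Compile one alternation over all terms, longest first."""
    terms = sorted({t for ts in drugs.values() for t in ts}, key=len, reverse=True)
    return re.compile(r"\b(" + "|".join(map(re.escape, terms)) + r")\b", re.IGNORECASE)


def find_drugs(
    text: str, drugs: Dict[str, List[str]] = DRUGS
) -> List[Tuple[int, int, str, str]]:
    """Return (start, end, matched text, ATC code) for each drug mention."""
    term_to_code = {t.lower(): code for code, ts in drugs.items() for t in ts}
    pattern = build_pattern(drugs)
    return [
        (m.start(), m.end(), m.group(0), term_to_code[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]
```

Calling `find_drugs("Patient sous Glucophage et doliprane.")` yields one tuple per mention, each carrying the character offsets and the associated ATC code.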

Best,

Checklist

If this PR is a bug fix, the bug is documented in the test suite.
Changes were documented in the changelog (pending section).
If necessary, changes were made to the documentation (eg new pipeline).

percevalw · 2022-04-25T16:17:05Z

Fantastic work, thank you for this very clean PR! Very much looking forward to using it.
I have a few comments and questions as we haven't fully decided how to interact with concepts and normalized entities yet:

the drugs.csv file is static, do you think this resource can be expected to evolve? In which case, we could then do the preprocessing from external files downloaded (and cached) as the user installs the library?
I see that you use the label attribute to store the drug concept. I think we could instead store a name like "drug" in the label, and store the concept in the ent_kb_id field.

bdura · 2022-04-25T16:30:18Z

Thank you for the tremendous addition!

I'm wondering if we should adapt the matchers (be they RegexMatcher or PhraseMatcher) to include the definition of the kb_id_ field and handle the case of single label/multiple subconcepts natively.

We should probably handle this case by hand on this one, to avoid losing too much time. But it would be a nice addition, so I'm creating an issue to discuss this further.

scossin · 2022-04-26T11:28:05Z

@percevalw @bdura
Thanks for your quick feedback!

the drugs.csv file is static

do you think this resource can be expected to evolve?

Yes, the dictionary will evolve. Romedi has not been updated for a year, so recently marketed drugs are absent from this dictionary, but we will update it soon.

do the preprocessing from external files downloaded (and cached) as the user installs the library

I think there are three options you need to consider:

1. Retrieve the dictionary from Romedi's SPARQL endpoint. You are sure to have the latest version, but (a) you need to add another dependency to perform SPARQL queries, (b) you add a dependency on an external website that may be blocked inside a hospital, and (c) there is no warning when Romedi is updated.
2. Cache and download the file on demand or at installation, as you suggest. However, (a) the file versioning will be done outside of this repository, so it will be more difficult to keep track of the changes, and (b) the external website hosting the dictionary could be blocked.
3. Put the file in the repository. This makes the package heavier to download, and it will not scale if you want to add other terminologies.

My opinion is that option 2 is the best but the decision is up to you.
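
For reference, option 2 could be sketched roughly as follows, using only the Python standard library. This is an assumption-laden illustration, not a proposal for the actual implementation: the function names, the cache location, and the idea of passing the fetcher as a parameter are all hypothetical, and no real download URL for the dictionary is implied.

```python
import json
import urllib.request
from pathlib import Path
from typing import Callable


def download(url: str) -> bytes:
    """Fetch the raw bytes behind `url` (may fail behind a hospital firewall)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def load_dictionary(
    url: str,
    cache_file: Path,
    fetch: Callable[[str], bytes] = download,
) -> dict:
    """Return the parsed dictionary, downloading it only if no cached copy exists."""
    if not cache_file.exists():
        # First use: create the cache directory and store the downloaded file.
        cache_file.parent.mkdir(parents=True, exist_ok=True)
        cache_file.write_bytes(fetch(url))
    # Subsequent uses read the cached copy and never touch the network.
    return json.loads(cache_file.read_text(encoding="utf-8"))
```

Making `fetch` a parameter keeps the network call replaceable, which helps in tests and in environments where the external site is blocked and the file has to be supplied some other way.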

label attribute to store

we could instead store a name like "drug" in the label

I totally agree.

Another point I would like to raise is that I made choices (brand names and ATC labels) when creating the dictionary without discussing the best options.
At CHU de Bordeaux, we map drug mentions to Romedi URIs. For example, Glucophage is mapped to http://www.romedi.fr/romedi/BN06012kovrrmbkrvjlvhjdnqc468rkg1e ; if you want a JSON representation: http://www.romedi.fr/romedi/BN06012kovrrmbkrvjlvhjdnqc468rkg1e?json ; from this URI, users can retrieve ATC codes as well as other information, such as the active ingredient. The RDF graph can also be queried with URIs.
This approach is more "powerful", but the downside is that users need to understand how to retrieve information from a URI. I think the dictionary with ATC codes is the easiest to start with, but a dictionary with Romedi URIs would be more appropriate for advanced users. As a second step, we could offer another drug dictionary with Romedi URIs.

gozat · 2022-04-26T12:55:44Z

@scossin I think you should either put your name in the doc, or refer to the IAM team. It would be easier to find a contact person that way than through the mysterious CHU de Bordeaux's Data Science team, which refers to nobody.

percevalw · 2022-04-29T12:08:57Z

Thank you for these clarifications! Here are a few points discussed with @bdura.
At first we can keep the Romedi json file as suggested in your PR and work on fetching these synonyms from outside later.
Regarding the output of the pipeline, we propose to store the concept in the attribute

ent._.value = Concept(
  concept_source="romedi", # name of the terminology
  concept_id="ATC:...", # unique concept id
  preferred_term="preferred lexical variant for this concept",
  url="optional url to get more info about the concept (CIM10, UMLS Browser, Romedi, ...)"
)

and set the ent.label to drug. To align with spacy, we can also store the concept_id in the ent.kb_id.

Following this scheme means that we cannot store multiple concepts for a given entity, which is often the right choice IMO.
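
To make the proposal concrete, here is a minimal sketch using plain dataclasses. The `Concept` field names follow the snippet in this comment; everything else is an assumption for illustration: `Entity` is a hypothetical stand-in for a spaCy span, `make_drug_entity` is a made-up helper, and the ATC code and preferred term for Glucophage are example values rather than an excerpt of the Romedi dictionary.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Concept:
    concept_source: str          # name of the terminology
    concept_id: str              # unique concept id
    preferred_term: str          # preferred lexical variant for this concept
    url: Optional[str] = None    # optional link to more information


@dataclass
class Entity:
    """Hypothetical stand-in for a spaCy span carrying one concept."""
    text: str
    label: str      # generic entity type, e.g. "drug"
    kb_id: str      # mirrors the concept id
    value: Concept  # the single concept attached to this entity


def make_drug_entity(text: str, concept: Concept) -> Entity:
    """Attach exactly one concept to a mention and label it as a drug."""
    return Entity(text=text, label="drug", kb_id=concept.concept_id, value=concept)


glucophage = make_drug_entity(
    "Glucophage",
    Concept(
        concept_source="romedi",
        concept_id="ATC:A10BA02",     # illustrative value
        preferred_term="metformine",  # illustrative value
    ),
)
```

Because `value` holds a single `Concept`, the one-concept-per-entity constraint is enforced by the data structure itself.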

What do you think of this proposal?

scossin · 2022-05-02T10:26:30Z

@percevalw

we cannot store multiple concepts for a given entity

I agree with you, a single concept per entity seems fine to me. The documentation should explain how ambiguous terms are handled (rare scenarios), e.g. 'irc', which stands for both 'insuffisance renale chronique' and 'insuffisance respiratoire chronique'. I guess one of the two concepts will be picked at random by the phrase matcher; users should know that the terminology contains ambiguous terms and that they need another pipeline to disambiguate them. Is that correct?
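
One way to surface such terms before matching, sketched in plain Python: invert the terminology and report every term that points to more than one concept. The concept identifiers and the concept-to-terms layout below are hypothetical and only serve the 'irc' example.

```python
from collections import defaultdict
from typing import Dict, List, Set


def ambiguous_terms(terminology: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Map each lowercased term to its concepts, keeping only ambiguous terms."""
    term_to_concepts: Dict[str, Set[str]] = defaultdict(set)
    for concept_id, terms in terminology.items():
        for term in terms:
            term_to_concepts[term.lower()].add(concept_id)
    return {t: ids for t, ids in term_to_concepts.items() if len(ids) > 1}


# Hypothetical concept ids; 'irc' appears under both concepts.
TERMINOLOGY = {
    "renal_chronic": ["insuffisance renale chronique", "irc"],
    "respiratory_chronic": ["insuffisance respiratoire chronique", "IRC"],
}
```

Here `ambiguous_terms(TERMINOLOGY)` reports 'irc' together with both concept ids, which could feed a documentation note or a warning at load time.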

bdura · 2022-05-25T09:35:54Z

Hello @scossin,

So sorry for the delay. I've just proposed a very simple terminology matcher (see #75) that could suit our needs. It would only require you to modify the factory to use the TerminologyMatcher instead of the GenericMatcher.

I think we'll be good to go after that! What do you think?

The philosophical questions that we raised can be experimented with once the pipeline is up 🙂

bdura

I took the liberty of proposing a change that makes use of the new TerminologyMatcher. If that seems good to you, I think we can merge your feature!

Thanks again!

edsnlp/pipelines/ner/drugs/factory.py

bdura

I forgot to adapt the tests...

tests/pipelines/ner/test_drugs.py

bdura

I took the liberty of accepting the changes to finally merge your PR. Thanks for the great work, and sorry again for the delay!

docs/pipelines/ner/drugs.md

scossin · 2022-05-31T10:12:07Z

I took the liberty of proposing a change that makes use of the new TerminologyMatcher. If that seems good to you, I think we can merge your feature!

Thanks again!

That's great, thanks for making the changes and for merging!

sebastien cossin added 3 commits April 23, 2022 13:14

drugs_pipeline: add drugs.json resource containing French drugs: bran…

8609936

…d names and active ingredients with their ATC codes

drugs_pipeline: add the new drugs_pipeline (factory/patterns), regist…

15be308

…er it in setup.py and add a unit test

drugs_pipeline: add the documentation, register it in mkdocs.yml

59805be

bdura mentioned this pull request Apr 25, 2022

Feature request: terminology matcher with normalisation #62

Open

Merge branch 'master' into drugs_pipeline

670eae4

bdura approved these changes May 25, 2022

bdura suggested changes May 25, 2022

bdura added 3 commits May 31, 2022 09:30

Update edsnlp/pipelines/ner/drugs/factory.py

1d6f8f0

Update tests/pipelines/ner/test_drugs.py

1977912

Merge branch 'master' into drugs_pipeline

9987d93

bdura self-requested a review May 31, 2022 07:32

bdura approved these changes May 31, 2022

Update docs/pipelines/ner/drugs.md

2fe4906

bdura merged commit f1c9bf0 into aphp:master May 31, 2022

scossin deleted the drugs_pipeline branch May 31, 2022 10:14
