refacto: unify span_getter & span_setter parameters #203
Co-Authored-By: Perceval Wajsbürt <perceval.wajsburt@aphp.fr> Co-Authored-By: Thomas Petit-Jean <thomas.petitjean@aphp.fr>
1 parent 9e64f77, commit ee2ce71
Showing 3 changed files with 396 additions and 7 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,56 @@ | ||
# Named Entity Recognition Components | ||
|
||
We provide several Named Entity Recognition (NER) components. | ||
Named Entity Recognition is the task of identifying short relevant spans of text, named entities, and classifying them into pre-defined categories. | ||
In the case of clinical documents, these entities can be scores, disorders, behaviors, codes, dates, measurements, etc. | ||
|
||
## Span setters: where are extracted entities stored? {: #edsnlp.pipelines.base.SpanSetterArg } | ||
|
||
A component assigns entities to a document by adding them to the `doc.ents` or `doc.spans[group]` attributes. `doc.ents` only supports non-overlapping | ||
entities: if two entities overlap, only the longest one is kept. `doc.spans[group]`, on the other hand, can contain overlapping entities. | ||
To control where entities are added, use the `span_setter` argument, available in all of these components. | ||
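The difference between the two containers can be sketched in plain Python. This is only an illustration of the "longest one is kept" rule described above, not EDS-NLP's implementation: the `(start, end, label)` tuples and the `keep_longest` helper are hypothetical stand-ins for spaCy spans and for whatever filtering the library actually applies.

```python
from typing import List, Tuple

# Hypothetical stand-in for an entity: (start, end, label), end exclusive.
Span = Tuple[int, int, str]


def keep_longest(spans: List[Span]) -> List[Span]:
    """Drop overlapping spans, preferring the longest one."""
    # Visit the longest spans first; ties are broken by the earliest start.
    ordered = sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0]))
    kept: List[Span] = []
    for start, end, label in ordered:
        # Keep the span only if it overlaps none of the spans already kept.
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)


candidates = [(0, 2, "diabetes"), (1, 5, "disorder"), (6, 7, "tobacco")]
ents = keep_longest(candidates)  # plays the role of doc.ents
group = sorted(candidates)       # plays the role of doc.spans[group]
```

Here `ents` loses the shorter of the two overlapping candidates, while `group` keeps all three.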
|
||
::: edsnlp.pipelines.base.SpanSetterArg | ||
options: | ||
heading_level: 2 | ||
show_bases: false | ||
show_source: false | ||
only_class_level: true | ||
|
||
## Available components | ||
|
||
<!-- --8<-- [start:components] --> | ||
|
||
| Component | Description | | ||
|-------------------------------------------------------------------------------------------|---------------------------------------| | ||
| [`eds.covid`](/pipelines/ner/covid) | A COVID mentions detector | | ||
| [`eds.charlson`](/pipelines/ner/scores/charlson) | A Charlson score extractor | | ||
| [`eds.sofa`](/pipelines/ner/scores/sofa) | A SOFA score extractor | | ||
| [`eds.elston_ellis`](/pipelines/ner/scores/elston-ellis) | An Elston & Ellis code extractor | | ||
| [`eds.emergency_priority`](/pipelines/ner/scores/emergency-priority) | A priority score extractor | | ||
| [`eds.emergency_ccmu`](/pipelines/ner/scores/emergency-ccmu) | A CCMU score extractor | | ||
| [`eds.emergency_gemsa`](/pipelines/ner/scores/emergency-gemsa) | A GEMSA score extractor | | ||
| [`eds.tnm`](/pipelines/ner/tnm) | A TNM score extractor | | ||
| [`eds.adicap`](/pipelines/ner/adicap) | An ADICAP code extractor | | ||
| [`eds.drugs`](/pipelines/ner/drugs) | A drug mentions extractor | | ||
| [`eds.cim10`](/pipelines/ner/cim10) | A CIM10 terminology matcher | | ||
| [`eds.umls`](/pipelines/ner/umls) | A UMLS terminology matcher | | ||
| [`eds.ckd`](/pipelines/ner/disorders/ckd) | CKD extractor | | ||
| [`eds.copd`](/pipelines/ner/disorders/copd) | COPD extractor | | ||
| [`eds.cerebrovascular_accident`](/pipelines/ner/disorders/cerebrovascular-accident) | Cerebrovascular accident extractor | | ||
| [`eds.congestive_heart_failure`](/pipelines/ner/disorders/congestive-heart-failure) | Congestive heart failure extractor | | ||
| [`eds.connective_tissue_disease`](/pipelines/ner/disorders/connective-tissue-disease) | Connective tissue disease extractor | | ||
| [`eds.dementia`](/pipelines/ner/disorders/dementia) | Dementia extractor | | ||
| [`eds.diabetes`](/pipelines/ner/disorders/diabetes) | Diabetes extractor | | ||
| [`eds.hemiplegia`](/pipelines/ner/disorders/hemiplegia) | Hemiplegia extractor | | ||
| [`eds.leukemia`](/pipelines/ner/disorders/leukemia) | Leukemia extractor | | ||
| [`eds.liver_disease`](/pipelines/ner/disorders/liver-disease) | Liver disease extractor | | ||
| [`eds.lymphoma`](/pipelines/ner/disorders/lymphoma) | Lymphoma extractor | | ||
| [`eds.myocardial_infarction`](/pipelines/ner/disorders/myocardial-infarction) | Myocardial infarction extractor | | ||
| [`eds.peptic_ulcer_disease`](/pipelines/ner/disorders/peptic-ulcer-disease) | Peptic ulcer disease extractor | | ||
| [`eds.peripheral_vascular_disease`](/pipelines/ner/disorders/peripheral-vascular-disease) | Peripheral vascular disease extractor | | ||
| [`eds.solid_tumor`](/pipelines/ner/disorders/solid-tumor) | Solid tumor extractor | | ||
| [`eds.alcohol`](/pipelines/ner/behaviors/alcohol) | Alcohol consumption extractor | | ||
| [`eds.tobacco`](/pipelines/ner/behaviors/tobacco) | Tobacco consumption extractor | | ||
|
||
<!-- --8<-- [end:components] --> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,98 @@ | ||
# Qualifier Overview | ||
|
||
In EDS-NLP, we call _qualifiers_ the suite of components designed to _qualify_ a | ||
pre-extracted entity for a linguistic modality. | ||
|
||
## Available components | ||
|
||
<!-- --8<-- [start:components] --> | ||
|
||
| Pipeline | Description | | ||
|----------------------------------------------------------------|--------------------------------------| | ||
| [`eds.negation`](/pipelines/qualifiers/negation) | Rule-based negation detection | | ||
| [`eds.family`](/pipelines/qualifiers/family) | Rule-based family context detection | | ||
| [`eds.hypothesis`](/pipelines/qualifiers/hypothesis) | Rule-based speculation detection | | ||
| [`eds.reported_speech`](/pipelines/qualifiers/reported-speech) | Rule-based reported speech detection | | ||
| [`eds.history`](/pipelines/qualifiers/history) | Rule-based medical history detection | | ||
|
||
<!-- --8<-- [end:components] --> | ||
|
||
## Rationale | ||
|
||
In a typical medical NLP pipeline, a group of clinicians would define a list of synonyms for a given concept of interest (say, for example, diabetes), and look for that terminology in a corpus of documents. | ||
|
||
Now, consider the following example: | ||
|
||
=== "French" | ||
|
||
``` | ||
Le patient n'est pas diabétique. | ||
Le patient est peut-être diabétique. | ||
Le père du patient est diabétique. | ||
``` | ||
|
||
=== "English" | ||
|
||
``` | ||
The patient is not diabetic. | ||
The patient could be diabetic. | ||
The patient's father is diabetic. | ||
``` | ||
|
||
There is an obvious problem: none of these examples should lead us to include this particular patient in the cohort. | ||
|
||
!!! warning | ||
|
||
We show an English example just to explain the issue. | ||
EDS-NLP remains a **French-language** medical NLP library. | ||
|
||
To address this issue, EDS-NLP provides rule-based pipelines that qualify entities, helping the user make an informed decision about which patients should be included in a real-world data cohort. | ||
|
||
## Which spans are qualified? {: #edsnlp.pipelines.base.SpanGetterArg } | ||
|
||
A component gets entities from a document by looking up `doc.ents` or `doc.spans[group]`. This behavior is set by the `span_getter` argument in components that support it. | ||
|
||
::: edsnlp.pipelines.base.SpanGetterArg | ||
options: | ||
heading_level: 2 | ||
show_bases: false | ||
show_source: false | ||
only_class_level: true | ||
|
||
## Under the hood | ||
|
||
Our _qualifier_ pipelines all follow the same basic pattern: | ||
|
||
1. The pipeline extracts cues. We define three (possibly overlapping) kinds: | ||
|
||
- `preceding`, i.e. cues that _precede_ modulated entities; | ||
- `following`, i.e. cues that _follow_ modulated entities; | ||
- in some cases, `verbs`, i.e. verbs that convey a modulation (treated as preceding cues). | ||
|
||
2. The pipeline splits the text into sentences and propositions, using annotations from a sentencizer pipeline and `termination` patterns, which define syntagma/proposition terminations. | ||
|
||
3. For each pre-extracted entity, the pipeline checks whether there is a preceding cue between the start of the proposition and the start of the entity, or a following cue between the end of the entity and the end of the proposition. | ||
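The three steps can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the library's code: the cue and termination lists are hypothetical, tokens are plain strings, and the input is assumed to be a single sentence, so sentence splitting is left out.

```python
from typing import List, Tuple

# Hypothetical cue lists; the real pipelines ship their own patterns.
PRECEDING = {"pas", "sans"}
FOLLOWING = {"exclu"}
TERMINATION = {"mais", ","}


def split_propositions(tokens: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) token ranges, cut at each termination token."""
    bounds, start = [], 0
    for i, tok in enumerate(tokens):
        if tok in TERMINATION:
            bounds.append((start, i))
            start = i + 1
    bounds.append((start, len(tokens)))
    return bounds


def is_qualified(tokens: List[str], ent_start: int, ent_end: int) -> bool:
    """Look for a preceding cue before, or a following cue after, the entity."""
    for start, end in split_propositions(tokens):
        if start <= ent_start and ent_end <= end:
            before = tokens[start:ent_start]  # proposition start -> entity start
            after = tokens[ent_end:end]       # entity end -> proposition end
            return any(t in PRECEDING for t in before) or any(
                t in FOLLOWING for t in after
            )
    return False


tokens = "le patient n' est pas diabétique mais il est fumeur".split()
diabetic_negated = is_qualified(tokens, 5, 6)  # entity "diabétique"
smoker_negated = is_qualified(tokens, 9, 10)   # entity "fumeur"
```

Because `mais` closes the first proposition, the cue `pas` qualifies `diabétique` but does not reach `fumeur`.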
|
||
Albeit simple, this algorithm can achieve very good performance depending on the modality. For instance, our `eds.negation` pipeline reaches 88% F1-score on our dataset. | ||
|
||
!!! note "Dealing with pseudo-cues" | ||
|
||
The pipeline can also detect **pseudo-cues**, i.e. phrases that contain cues but **are not cues themselves**. For instance, `sans doute`/`without doubt` contains `sans`/`without`, but does not convey negation. | ||
|
||
Detecting pseudo-cues lets the pipeline filter out any cue that overlaps with a pseudo-cue. | ||
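As a rough sketch of that filtering step, assuming cues and pseudo-cues are matched as plain substrings (the helper names below are hypothetical and the real pipelines rely on their own matchers):

```python
from typing import List, Tuple

Match = Tuple[int, int]  # character offsets, end exclusive


def find_all(text: str, phrase: str) -> List[Match]:
    """Return every occurrence of `phrase` in `text` as (start, end) offsets."""
    matches, pos = [], text.find(phrase)
    while pos != -1:
        matches.append((pos, pos + len(phrase)))
        pos = text.find(phrase, pos + 1)
    return matches


def filter_cues(text: str, cues: List[str], pseudo_cues: List[str]) -> List[Match]:
    """Keep only the cue matches that overlap no pseudo-cue match."""
    pseudo = [m for p in pseudo_cues for m in find_all(text, p)]
    kept = []
    for cue in cues:
        for start, end in find_all(text, cue):
            overlaps = any(start < p_end and p_start < end for p_start, p_end in pseudo)
            if not overlaps:
                kept.append((start, end))
    return sorted(kept)


text = "sans doute une infection, sans fièvre"
cues = filter_cues(text, cues=["sans"], pseudo_cues=["sans doute"])
```

Only the second `sans` survives, since the first one sits inside the pseudo-cue `sans doute`.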
|
||
!!! warning "Sentence boundaries are required" | ||
|
||
The rule-based algorithm detects cues and propagates their modulation to the rest of the [syntagma](https://en.wikipedia.org/wiki/Syntagma_(linguistics)){target=_blank}. For that reason, a qualifier pipeline needs a sentencizer component to be defined, and will fail otherwise. | ||
|
||
You may use EDS-NLP's sentencizer: | ||
|
||
```{ .python .no-check } | ||
nlp.add_pipe("eds.sentences") | ||
``` | ||
|
||
## Persisting the results | ||
|
||
Our qualifier pipelines write their results to a custom [spaCy extension](https://spacy.io/usage/processing-pipelines#custom-components-attributes){target=_blank}, defined on both `Span` and `Token` objects. We follow the convention of naming this attribute after the pipeline itself, e.g. `Span._.negation` for the `eds.negation` pipeline. | ||
|
||
We also provide a string representation of the result, computed on the fly by declaring a getter that reads the boolean result of the pipeline. Following spaCy convention, we give this attribute the same name, followed by a `_`. |
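
The getter idea can be illustrated without spaCy by a read-only property on a small stand-in class. Everything below is an assumption made for illustration: `QualifiedSpan` is hypothetical, and the `"NEG"`/`"AFF"` labels are placeholders rather than values confirmed by this page. In spaCy itself, the equivalent mechanism is the `getter` argument of `Span.set_extension`.

```python
from dataclasses import dataclass


@dataclass
class QualifiedSpan:
    """Hypothetical stand-in for a spaCy span carrying a qualifier result."""

    text: str
    negation: bool = False  # boolean result written by the pipeline

    @property
    def negation_(self) -> str:
        # Computed on the fly from the boolean; the labels are assumptions.
        return "NEG" if self.negation else "AFF"


span = QualifiedSpan("diabétique", negation=True)
label = span.negation_
```

Since the string is derived on each access, it cannot drift out of sync with the boolean it describes.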