refacto: unify span_getter & span_setter parameters #203

Co-Authored-By: Perceval Wajsbürt <perceval.wajsburt@aphp.fr> Co-Authored-By: Thomas Petit-Jean <thomas.petitjean@aphp.fr>
aphp · Sep 13, 2023 · ee2ce71
1 parent 9e64f77
commit ee2ce71
Showing 3 changed files with 396 additions and 7 deletions.
diff --git a/docs/pipelines/ner/overview.md b/docs/pipelines/ner/overview.md
@@ -0,0 +1,56 @@
+# Named Entity Recognition Components
+
+We provide several Named Entity Recognition (NER) components.
+Named Entity Recognition is the task of identifying short relevant spans of text, named entities, and classifying them into pre-defined categories.
+In the case of clinical documents, these entities can be scores, disorders, behaviors, codes, dates, measurements, etc.
+
+## Span setters: where are extracted entities stored? {: #edsnlp.pipelines.base.SpanSetterArg }
+
+A component assigns entities to a document by adding them to the `doc.ents` or `doc.spans[group]` attributes. `doc.ents` only supports non-overlapping
+entities: if two entities overlap, only the longest one is kept. `doc.spans[group]`, on the other hand, can contain overlapping entities.
+To control where entities are added, use the `span_setter` argument of any of these components.
+
+::: edsnlp.pipelines.base.SpanSetterArg
+    options:
+        heading_level: 2
+        show_bases: false
+        show_source: false
+        only_class_level: true
+
+## Available components
+
+<!-- --8<-- [start:components] -->
+
+| Component                                                                                 | Description                           |
+|-------------------------------------------------------------------------------------------|---------------------------------------|
+| [`eds.covid`](/pipelines/ner/covid)                                                       | A COVID mentions detector             |
+| [`eds.charlson`](/pipelines/ner/scores/charlson)                                          | A Charlson score extractor            |
+| [`eds.sofa`](/pipelines/ner/scores/sofa)                                                  | A SOFA score extractor                |
+| [`eds.elston_ellis`](/pipelines/ner/scores/elston-ellis)                                  | An Elston & Ellis code extractor      |
+| [`eds.emergency_priority`](/pipelines/ner/scores/emergency-priority)                      | A priority score extractor            |
+| [`eds.emergency_ccmu`](/pipelines/ner/scores/emergency-ccmu)                              | A CCMU score extractor                |
+| [`eds.emergency_gemsa`](/pipelines/ner/scores/emergency-gemsa)                            | A GEMSA score extractor               |
+| [`eds.tnm`](/pipelines/ner/tnm)                                                           | A TNM score extractor                 |
+| [`eds.adicap`](/pipelines/ner/adicap)                                                     | An ADICAP code extractor              |
+| [`eds.drugs`](/pipelines/ner/drugs)                                                       | A drug mentions extractor             |
+| [`eds.cim10`](/pipelines/ner/cim10)                                                       | A CIM10 terminology matcher           |
+| [`eds.umls`](/pipelines/ner/umls)                                                         | A UMLS terminology matcher            |
+| [`eds.ckd`](/pipelines/ner/disorders/ckd)                                                 | CKD extractor                         |
+| [`eds.copd`](/pipelines/ner/disorders/copd)                                               | COPD extractor                        |
+| [`eds.cerebrovascular_accident`](/pipelines/ner/disorders/cerebrovascular-accident)       | Cerebrovascular accident extractor    |
+| [`eds.congestive_heart_failure`](/pipelines/ner/disorders/congestive-heart-failure)       | Congestive heart failure extractor    |
+| [`eds.connective_tissue_disease`](/pipelines/ner/disorders/connective-tissue-disease)     | Connective tissue disease extractor   |
+| [`eds.dementia`](/pipelines/ner/disorders/dementia)                                       | Dementia extractor                    |
+| [`eds.diabetes`](/pipelines/ner/disorders/diabetes)                                       | Diabetes extractor                    |
+| [`eds.hemiplegia`](/pipelines/ner/disorders/hemiplegia)                                   | Hemiplegia extractor                  |
+| [`eds.leukemia`](/pipelines/ner/disorders/leukemia)                                       | Leukemia extractor                    |
+| [`eds.liver_disease`](/pipelines/ner/disorders/liver-disease)                             | Liver disease extractor               |
+| [`eds.lymphoma`](/pipelines/ner/disorders/lymphoma)                                       | Lymphoma extractor                    |
+| [`eds.myocardial_infarction`](/pipelines/ner/disorders/myocardial-infarction)             | Myocardial infarction extractor       |
+| [`eds.peptic_ulcer_disease`](/pipelines/ner/disorders/peptic-ulcer-disease)               | Peptic ulcer disease extractor        |
+| [`eds.peripheral_vascular_disease`](/pipelines/ner/disorders/peripheral-vascular-disease) | Peripheral vascular disease extractor |
+| [`eds.solid_tumor`](/pipelines/ner/disorders/solid-tumor)                                 | Solid tumor extractor                 |
+| [`eds.alcohol`](/pipelines/ner/behaviors/alcohol)                                         | Alcohol consumption extractor         |
+| [`eds.tobacco`](/pipelines/ner/behaviors/tobacco)                                         | Tobacco consumption extractor         |
+
+<!-- --8<-- [end:components] -->
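
The difference between the two containers described in the span setter section can be illustrated with a minimal, library-free sketch. It is not EDS-NLP's implementation: the `keep_longest` helper and the `(start, end, label)` tuples are hypothetical stand-ins for real spans, and the tie-breaking rule (earliest start wins) is an assumption. spaCy ships a comparable utility, `spacy.util.filter_spans`, which also prefers the longest span.

```python
from typing import List, Tuple

# Hypothetical stand-in for a span: (start, end, label), end exclusive.
Span = Tuple[int, int, str]


def keep_longest(spans: List[Span]) -> List[Span]:
    """Greedily keep the longest spans, dropping any span that overlaps a kept one."""
    kept: List[Span] = []
    # Visit the longest spans first; ties are broken by earliest start (assumption).
    for span in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        start, end, _ = span
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append(span)
    return sorted(kept)


candidates = [(0, 2, "diabetes"), (1, 5, "disorder"), (6, 8, "drug")]

# A `doc.ents`-like container cannot hold overlapping entities.
ents = keep_longest(candidates)

# A `doc.spans[group]`-like container keeps every candidate, overlapping or not.
span_group = sorted(candidates)
```

Here `(0, 2, "diabetes")` overlaps the longer `(1, 5, "disorder")`, so it is absent from `ents` but still present in `span_group`.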
diff --git a/docs/pipelines/qualifiers/overview.md b/docs/pipelines/qualifiers/overview.md
@@ -0,0 +1,98 @@
+# Qualifier Overview
+
+In EDS-NLP, we call _qualifiers_ the suite of components designed to _qualify_ a
+pre-extracted entity for a linguistic modality.
+
+## Available components
+
+<!-- --8<-- [start:components] -->
+
+| Pipeline                                                       | Description                          |
+|----------------------------------------------------------------|--------------------------------------|
+| [`eds.negation`](/pipelines/qualifiers/negation)               | Rule-based negation detection        |
+| [`eds.family`](/pipelines/qualifiers/family)                   | Rule-based family context detection  |
+| [`eds.hypothesis`](/pipelines/qualifiers/hypothesis)           | Rule-based speculation detection     |
+| [`eds.reported_speech`](/pipelines/qualifiers/reported-speech) | Rule-based reported speech detection |
+| [`eds.history`](/pipelines/qualifiers/history)                 | Rule-based medical history detection |
+
+<!-- --8<-- [end:components] -->
+
+## Rationale
+
+In a typical medical NLP pipeline, a group of clinicians would define a list of synonyms for a given concept of interest (say, for example, diabetes), and look for that terminology in a corpus of documents.
+
+Now, consider the following example:
+
+=== "French"
+
+    ```
+    Le patient n'est pas diabétique.
+    Le patient est peut-être diabétique.
+    Le père du patient est diabétique.
+    ```
+
+=== "English"
+
+    ```
+    The patient is not diabetic.
+    The patient could be diabetic.
+    The patient's father is diabetic.
+    ```
+
+There is an obvious problem: none of these examples should lead us to include this particular patient into the cohort.
+
+!!! warning
+
+    We show an English example just to explain the issue.
+    EDS-NLP remains a **French-language** medical NLP library.
+
+To curb this issue, EDS-NLP provides rule-based pipelines that qualify entities, helping the user make an informed decision about which patients should be included in a real-world data cohort.
+
+## Which spans are qualified? {: #edsnlp.pipelines.base.SpanGetterArg }
+
+A component gets entities from a document by looking up `doc.ents` or `doc.spans[group]`. This behavior is set by the `span_getter` argument in components that support it.
+
+::: edsnlp.pipelines.base.SpanGetterArg
+    options:
+        heading_level: 2
+        show_bases: false
+        show_source: false
+        only_class_level: true
+
+## Under the hood
+
+Our _qualifier_ pipelines all follow the same basic pattern:
+
+1.  The pipeline extracts cues. We define three (possibly overlapping) kinds:
+
+    - `preceding`, i.e. cues that _precede_ modulated entities;
+    - `following`, i.e. cues that _follow_ modulated entities;
+    - in some cases, `verbs`, i.e. verbs that convey a modulation (treated as preceding cues).
+
+2.  The pipeline splits the text into sentences and propositions, using annotations from a sentencizer pipeline and `termination` patterns, which define syntagma/proposition terminations.
+
+3.  For each pre-extracted entity, the pipeline checks whether there is a preceding cue between the start of the proposition and the start of the entity, or a following cue between the end of the entity and the end of the proposition.
+
+Albeit simple, this algorithm can achieve very good performance depending on the modality. For instance, our `eds.negation` pipeline reaches 88% F1-score on our dataset.
+
+!!! note "Dealing with pseudo-cues"
+
+    The pipeline can also detect **pseudo-cues**, i.e. phrases that contain cues but **are not cues themselves**. For instance, `sans doute`/`without doubt` contains `sans`/`without`, but does not convey negation.
+
+    Detecting pseudo-cues lets the pipeline filter out any cue that overlaps with a pseudo-cue.
+
+!!! warning "Sentence boundaries are required"
+
+    The rule-based algorithm detects cues and propagates their modulation to the rest of the [syntagma](https://en.wikipedia.org/wiki/Syntagma_(linguistics)){target=_blank}. For that reason, a qualifier pipeline needs a sentencizer component to be defined, and will fail otherwise.
+
+    You may use EDS-NLP's:
+
+    ```{ .python .no-check }
+    nlp.add_pipe("eds.sentences")
+    ```
+
+## Persisting the results
+
+Our qualifier pipelines write their results to a custom [spaCy extension](https://spacy.io/usage/processing-pipelines#custom-components-attributes){target=_blank}, defined on both `Span` and `Token` objects. We follow the convention of naming said attribute after the pipeline itself, e.g. `Span._.negation` for the `eds.negation` pipeline.
+
+We also provide a string representation of the result, computed on the fly by declaring a getter that reads the boolean result of the pipeline. Following spaCy convention, we give this attribute the same name, followed by a `_`.
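
The three steps listed under "Under the hood", together with the boolean and string results described above, can be sketched in plain Python. This is a toy illustration under stated assumptions, not EDS-NLP's code: the cue lists, the `find`, `is_negated` and `negation_label` helpers, and the `NEG`/`AFF` labels are all hypothetical, the input is assumed to be a single sentence that has already been split upstream, and offsets are character-based rather than token-based.

```python
import re
from typing import List, Tuple

# Hypothetical toy cue lists, for illustration only.
PRECEDING = ["pas", "sans"]
FOLLOWING = ["exclu"]
PSEUDO = ["sans doute"]
TERMINATION = ["mais"]


def find(terms: List[str], text: str) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets of whole-word matches."""
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b"
    return [m.span() for m in re.finditer(pattern, text, flags=re.IGNORECASE)]


def is_negated(sentence: str, ent_start: int, ent_end: int) -> bool:
    # Step 1: extract cues, dropping any cue that overlaps a pseudo-cue.
    pseudo = find(PSEUDO, sentence)

    def real(cues: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
        return [c for c in cues if not any(c[0] < p[1] and p[0] < c[1] for p in pseudo)]

    preceding = real(find(PRECEDING, sentence))
    following = real(find(FOLLOWING, sentence))

    # Step 2: split the sentence into propositions at termination patterns.
    bounds = [0] + [s for s, _ in find(TERMINATION, sentence)] + [len(sentence)]
    left = max(b for b in bounds if b <= ent_start)
    right = min(b for b in bounds if b >= ent_end)

    # Step 3: look for a preceding cue before the entity, or a following cue
    # after it, within the same proposition.
    return any(left <= s and e <= ent_start for s, e in preceding) or any(
        ent_end <= s and e <= right for s, e in following
    )


def negation_label(flag: bool) -> str:
    """String counterpart of the boolean result (labels are illustrative)."""
    return "NEG" if flag else "AFF"
```

For example, with `s = "Le patient n'est pas diabétique."` and `i = s.index("diabétique")`, `is_negated(s, i, i + len("diabétique"))` is true because `pas` precedes the entity, whereas the same call on `"Le patient est sans doute diabétique."` is false because `sans` overlaps the pseudo-cue `sans doute`.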