In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import sys
sys.path.insert(0, "..")

In [4]:
from rehoused_nlp import build_nlp, visualize_doc_classification
from medspacy.visualization import visualize_ent, visualize_dep

from helpers import ENT_COLORS

In [5]:
import warnings
warnings.filterwarnings("ignore")

In [6]:
%%capture
nlp = build_nlp()

# Appendix. Customizing ReHouSED NLP
Like any clinical NLP system, this model's performance will vary greatly with your data and specific task. The system implemented in this package approximates the one used in the manuscript, but it was modified to be more general and to remove references specific to VA documentation practices. If you apply it to a new dataset, you will need to adapt the system to your EHR, the language used in your clinical documents, and any changes to the definitions.

## Adding rules using a config file

### Resource files
Most of the system's logic lives in the `resources` directory: `rehoused_nlp/resources/*`. Each file contains rules for one of the components described in the notebook. Most are `.json` files, and I've attempted to organize them by class.


```
- rehoused_nlp/
    - resources/
        - target_rules/
            - doubling_up.json
            - homeless.json
            - ...
        - callbacks.py
        - concept_tag_rules/
        - configs/
        - context_rules/
        - postprocess_rules.py
        - preprocess_rules.py
        - section_rules/
    - ...
```

We didn't discuss `preprocess_rules` or `callbacks` in these notebooks, but the medspaCy repo contains examples and documentation.

### Config files
The `configs` directory contains a JSON file that specifies which rules should be loaded into ReHouSED. This lets you add new rules or disable existing ones by pointing to a different config file. We'll show an example of how to do this.

First, let's look at what one of the resource files looks like:

In [16]:
import json

fp = "../rehoused_nlp/resources/target_matcher/homeless.json"

with open(fp) as f:
    json_str = json.dumps(json.load(f), indent=4)

print(json_str[:1000])

{
    "target_rules": [
        {
            "category": "EVIDENCE_OF_HOMELESSNESS",
            "literal": "(SCT 266935003)",
            "attributes": {
                "is_historical": true
            }
        },
        {
            "category": "EVIDENCE_OF_HOMELESSNESS",
            "literal": "<RESIDES> ... <HOMELESS_LOCATION>",
            "pattern": [
                {
                    "_": {
                        "concept_tag": "HOMELESS_LOCATION"
                    },
                    "OP": "+"
                }
            ]
        },
        {
            "category": "EVIDENCE_OF_HOMELESSNESS",
            "literal": "admitted from <HOMELESS_LOCATION>",
            "pattern": [
                {
                    "LOWER": {
                        "REGEX": "^admit"
                    }
                },
                {
                    "LOWER": "from"
                },
                {
                    "OP": "?"
                },
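To get a quick overview of a rule file like this one, it can help to count its rules per category. The following is a minimal sketch that uses an inline sample mirroring the first two entries printed above rather than reading the real file; the helper name `count_rules_by_category` is hypothetical and not part of `rehoused_nlp`.

```python
from collections import Counter

# Inline sample mirroring the structure printed above (not the full file).
sample = {
    "target_rules": [
        {
            "category": "EVIDENCE_OF_HOMELESSNESS",
            "literal": "(SCT 266935003)",
            "attributes": {"is_historical": True},
        },
        {
            "category": "EVIDENCE_OF_HOMELESSNESS",
            "literal": "<RESIDES> ... <HOMELESS_LOCATION>",
            "pattern": [{"_": {"concept_tag": "HOMELESS_LOCATION"}, "OP": "+"}],
        },
    ]
}


def count_rules_by_category(rules_json, key="target_rules"):
    """Hypothetical helper: count the rules stored under `key` by category."""
    return Counter(rule["category"] for rule in rules_json.get(key, []))


counts = count_rules_by_category(sample)
print(counts)
```

To run this against a real resource file, replace `sample` with the result of `json.load` on that file.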
               

And what the config file looks like:

In [17]:
fp = "../rehoused_nlp/resources/configs/rehoused_v1_config.json"

with open(fp) as f:
    json_str = json.dumps(json.load(f), indent=4)

print(json_str)

{
    "resources": [
        {
            "concept_tagger": [
                "concept_tagger/employment.json",
                "concept_tagger/family.json",
                "concept_tagger/homeless.json",
                "concept_tagger/sheltered_homeless.json",
                "concept_tagger/unsheltered_homeless.json",
                "concept_tagger/patient.json",
                "concept_tagger/resides.json",
                "concept_tagger/temporary_housing.json",
                "concept_tagger/va_service.json"
            ],
            "context": [
                "context/at_risk.json",
                "context/current.json",
                "context/family.json",
                "context/historical.json",
                "context/housing_related.json",
                "context/hypothetical.json",
                "context/negated_existence.json",
                "context/other.json",
                "context/on_match_rules.json"
            ],
            "sectionizer": [
  

The config file maps each medspaCy pipeline component to the JSON files containing its rules. The exceptions are the Postprocessor and Preprocessor; their rules are defined in Python and added directly in the `build_nlp` function.
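The mapping can be walked with a few lines of plain Python. This is a rough sketch, not how `build_nlp` is implemented: the helper `iter_rule_files` is hypothetical, the inline config is a small illustration with the same shape as the file above, and it assumes that file names in the config are relative to the resources folder.

```python
from pathlib import Path

# Small inline config with the same shape as the config file: each entry in
# "resources" maps a pipeline component name to a list of rule files.
config = {
    "resources": [
        {
            "context": ["example_context_rules.json"],
            "sectionizer": ["example_section_rules.json"],
            "target_matcher": ["example_target_rules.json"],
        }
    ]
}


def iter_rule_files(config, resources_dir):
    """Hypothetical helper: yield (component, path) for every rule file.

    Assumes the file names in the config are relative to `resources_dir`.
    """
    base = Path(resources_dir)
    for group in config["resources"]:
        for component, filenames in group.items():
            for filename in filenames:
                yield component, base / filename


rule_files = list(iter_rule_files(config, "./example_custom_resources"))
for component, path in rule_files:
    print(f"{component:15s} {path}")
```

A listing like this is a convenient place to check `path.exists()` before building the pipeline, so that a typo in the config shows up early.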

Let's create a new config file and build the NLP pipeline from it. We need to specify the folder containing the resources and the path to the config file. I've created some simple rule files here in the `notebooks/` directory:

In [32]:
fp = "./example_custom_resources/example_config.json"

with open(fp) as f:
    json_str = json.dumps(json.load(f), indent=4)

print(json_str)

{
    "resources": [
        {
            "context": [
                "example_context_rules.json"
            ],
            "sectionizer": [
                "example_section_rules.json"
            ],
            "target_matcher": [
                "example_target_rules.json"
            ]
        }
    ]
}
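If you want to generate a config like this yourself, the sketch below writes one target rule file and a matching config into a temporary folder. The file names `my_target_rules.json` and `my_config.json` are hypothetical, and the rule file reuses the `target_rules` layout shown earlier for `homeless.json`.

```python
import json
import tempfile
from pathlib import Path

# Work in a temporary directory so nothing in the repository is overwritten.
resources_dir = Path(tempfile.mkdtemp(prefix="custom_resources_"))

# One target rule, in the same "target_rules" layout as homeless.json.
target_rules = {
    "target_rules": [
        {
            "category": "TEMPORARY_HOUSING",
            "literal": "SLC Downtown Shelter for the Homeless",
        }
    ]
}
rules_path = resources_dir / "my_target_rules.json"  # hypothetical file name
rules_path.write_text(json.dumps(target_rules, indent=4))

# Config that lists only that rule file for the target matcher.
config = {"resources": [{"target_matcher": [rules_path.name]}]}
cfg_path = resources_dir / "my_config.json"  # hypothetical file name
cfg_path.write_text(json.dumps(config, indent=4))

print(cfg_path.read_text())
```

The folder and config path could then be passed to `build_nlp` through its `resources_dir` and `cfg_file` arguments. Whether `build_nlp` accepts a config that leaves out the other components is an assumption that has not been checked here.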


In [33]:
nlp2 = build_nlp(
    resources_dir="./example_custom_resources/",
    cfg_file="./example_custom_resources/example_config.json",
)

Now when we look at each of our components, we only have the rules that we've specified in `example_config.json`.

In [34]:
nlp2.get_pipe("medspacy_target_matcher").rules

[TargetRule(literal="SLC Downtown Shelter for the Homeless", category="TEMPORARY_HOUSING", pattern=None, attributes=None, on_match=None)]

In [35]:
nlp2.get_pipe("medspacy_context").rules

[ConTextRule(literal='in Xxx 20xx', category='HISTORICAL', pattern=[{'LOWER': 'in'}, {'OP': '?'}, {'LOWER': {'REGEX': '20[01]\\d$'}}], direction='BIDIRECTIONAL')]

In [37]:
nlp2.get_pipe("medspacy_sectionizer").rules

[SectionRule(literal="Previous medical information:", category="past_medical_history", pattern=None, on_match=None, parents=None, parent_required=False)]

## Adding rules programmatically
The best way to customize rules is to edit or create resource files like the ones listed above. But you can also add rules directly to pipeline components, and some components, like the `Postprocessor`, require this. Each example below adds a rule to the main `nlp` pipeline, which still contains the original rules.

### TargetMatcher

In [39]:
from medspacy.target_matcher import TargetRule

target_matcher = nlp.get_pipe("medspacy_target_matcher")
# Add a phrase for a specific homeless shelter
rule = TargetRule("SLC Downtown Shelter for the Homeless", "TEMPORARY_HOUSING")
target_matcher.add([rule])

visualize_ent(nlp("He is staying at SLC Downtown Shelter for the Homeless."), colors=ENT_COLORS)

### ConText

In [40]:
from medspacy.context import ConTextRule

context = nlp.get_pipe("medspacy_context")
# Add a pattern that matches dates and marks them as historical
rule = ConTextRule(
    "in Xxx 20xx",
    "HISTORICAL",
    direction="BIDIRECTIONAL",
    pattern=[
        {"LOWER": "in"},
        {"OP": "?"},
        {"LOWER": {"REGEX": r"20[01]\d$"}},
    ],
)
context.add([rule])

visualize_dep(nlp("He was homeless in September 2016."))
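It is worth checking which tokens the `REGEX` in the last pattern element accepts before relying on a rule like this. The sketch below tests the same expression with Python's `re` module on a few made-up token strings. It assumes that spaCy applies `REGEX` with search semantics (a match anywhere in the token text), which is my understanding of the matcher rather than something stated in this notebook.

```python
import re

# Same regular expression as in the rule's last token pattern.
year_regex = re.compile(r"20[01]\d$")

# Lowercased token texts, since the pattern matches on the LOWER attribute.
tokens = ["2016", "2009", "2019", "2020", "1999", "12016"]
matches = {tok: bool(year_regex.search(tok)) for tok in tokens}

for tok, matched in matches.items():
    print(f"{tok!r:10} {'match' if matched else 'no match'}")
```

Under that assumption the rule covers the years 2000 through 2019 only, so "in September 2020" would not be marked as historical. Because the expression has no `^` anchor, a longer token that merely ends in such a year (like `12016`) would also match.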

### Section detection

In [41]:
from medspacy.section_detection import SectionRule

sectionizer = nlp.get_pipe("medspacy_sectionizer")
# Add a specific note header
rule = SectionRule("Previous medical information:", "past_medical_history")
sectionizer.add([rule])

visualize_ent(nlp("Previous medical information: Homelessness"), colors=ENT_COLORS)