In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import sys
sys.path.insert(0, "../..")

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

from medspacy.visualization import visualize_ent, visualize_dep
from rehoused import build_nlp, calculate_rehoused, visualize_doc_classification

In [None]:
%%capture
nlp = build_nlp()

# Appendix. Customizing ReHouSED NLP
Like any clinical NLP system, this model's performance will vary greatly with your data and your specific task. The system implemented in this package approximates what was used in the manuscript, but it was modified to be more general and to remove specific references to VA documentation practices. If you apply it to a new dataset, you will need to adapt the system to your EHR, to the language used in your clinical documents, and to any changes in definitions.

## Resource files
Most of the system's logic lives in the `resources` directory: `rehoused/resources/*`. Each file contains rules for one of the components described in the notebook. Most are `.py` files, although many rules can also be stored as `.json` files (the exceptions are `postprocessing` rules and rules that use more advanced callback functions). Each file in the `target_rules` subfolder contains rules for a different entity class.
```
- rehoused/
    - resources/
        - target_rules/
            - doubling_up.py
            - evidence_of_homelessness.py
            - ...
        - callbacks.py
        - concept_tag_rules.py
        - context_rules.py
        - postprocess_rules.py
        - preprocess_rules.py
        - section_rules.py
    - ...
```

We didn't discuss `preprocess_rules` or `callbacks` in these notebooks, but the medspaCy repo contains examples and documentation.
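
As a rough illustration of rules stored as `.json`, the sketch below writes two target rules to a JSON file and reads them back with the standard library. The top-level `target_rules` key, the `literal`/`category` field names, the file name, and the `DOUBLING_UP` label are assumptions made for illustration; check the files under `rehoused/resources/` and the medspaCy documentation for the actual schema.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical rule data; the key names are assumptions, not a confirmed schema
rules = {
    "target_rules": [
        {"literal": "SLC Downtown Shelter for the Homeless", "category": "TEMPORARY_HOUSING"},
        {"literal": "staying with a friend", "category": "DOUBLING_UP"},
    ]
}


def load_rule_dicts(path):
    """Read a rules file and return its list of rule dictionaries."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return data["target_rules"]


# Round-trip through a temporary file so the example leaves nothing behind
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example_target_rules.json"
    path.write_text(json.dumps(rules, indent=2), encoding="utf-8")
    loaded = load_rule_dicts(path)

for rule in loaded:
    print(rule["category"], "<-", rule["literal"])
```

Each loaded dictionary could then be turned into a rule object, for example with `TargetRule(rule["literal"], rule["category"])`, in the same way as the TargetMatcher example in this notebook.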

## Loading rules
The helper function `rehoused_nlp.utils.build_nlp()` instantiates the NLP pipeline and adds the rules, but you can always load a model and add rules manually yourself (again, see medspaCy for more examples).

## Adding rules programmatically
The best way to customize rules is to edit or create resource files like the ones listed above, but you can also add rules directly to pipeline components. Each example below shows how to add a rule to one of the components we discussed.

### TargetMatcher

In [None]:
from medspacy.target_matcher import TargetRule

target_matcher = nlp.get_pipe("target_matcher")
# Add a phrase for a specific homeless shelter
rule = TargetRule("SLC Downtown Shelter for the Homeless", "TEMPORARY_HOUSING")
target_matcher.add([rule])

visualize_ent(nlp("He is staying at SLC Downtown Shelter for the Homeless."))

### ConText

In [None]:
from medspacy.context import ConTextRule

context = nlp.get_pipe("context")
# Match "in <month> <year>" date phrases and treat them as historical
rule = ConTextRule(
    "in Xxx 20xx",
    "HISTORICAL",
    direction="BIDIRECTIONAL",
    pattern=[
        {"LOWER": "in"},
        {"OP": "?"},  # optional token, e.g. a month name
        {"LOWER": {"REGEX": r"^20[01]\d$"}},  # years 2000-2019
    ],
)
context.add([rule])

visualize_dep(nlp("He was homeless in September 2016."))
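
The year regex in a token pattern like this can be checked on its own before it goes into a rule. The sketch below compares two variants, with and without a leading `^`, using Python's `re` module. It assumes spaCy applies `REGEX` to each token's text with `re.search` semantics; the token list is made up for illustration.

```python
import re

unanchored = re.compile(r"20[01]\d$")
anchored = re.compile(r"^20[01]\d$")

# Hypothetical token texts, lowercased as the LOWER attribute would see them
tokens = ["2016", "2009", "2024", "1999", "12015", "september"]

for tok in tokens:
    # search() scans the whole token, so only the anchored form rejects "12015"
    print(
        f"{tok!r:12} "
        f"unanchored={bool(unanchored.search(tok))} "
        f"anchored={bool(anchored.search(tok))}"
    )
```

Both variants cover only the years 2000 through 2019, so a phrase such as "in March 2021" would need a wider character class.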

### Section detection

In [None]:
from medspacy.section_detection import SectionRule

sectionizer = nlp.get_pipe("sectionizer")
# Add a specific note header
rule = SectionRule("Previous medical information:", "past_medical_history")
sectionizer.add([rule])

visualize_ent(nlp("Previous medical information: Homelessness"))

### Postprocessing

In [None]:
from medspacy.postprocess import PostprocessingRule

postprocessor = nlp.get_pipe("postprocessor")
# Add a rule to consider mentions in the 
# rule = SectionRule("Previous medical information:", "past_medical_history")
# sectionizer.add([rule])

# visualize_ent(nlp("Previous medical information: Homelessness"))

In [None]:
doc = nlp("Discharge instructions: Learn more about resources for stable housing.")

In [None]:
visualize_doc_classification(doc)

In [None]:
visualize_ent(nlp("History of present illness: Met today at the clinic. Patient is a 30-year-old gentleman who has experienced homelessness."))

In [None]:
rule = PostprocessingRule(
    patterns=[]
)

In [None]:
df = pd.read_csv("./example_rehoused_longitudinal.tsv", sep="\t")

In [None]:
df

First, we'll process all of these documents with our NLP model:

In [None]:
docs = list(nlp.pipe(df["text"]))

In [None]:
df["doc"] = docs

In [None]:
df["document_classification"] = [doc._.document_classification for doc in docs]

Let's look at how each of these documents was processed. Notice that some examples may not be exactly correct, which could introduce some noise into our classification:

In [None]:
for i, row in df.iterrows():
    print("Time:", row["time_to_index"])
    visualize_doc_classification(row["doc"])
    print("----"*5)
    print()

The helper function `calculate_rehoused` will group the DataFrame into 30-day windows (or whatever time window is specified) and calculate the ReHouSED score at each time point. Because this is a simple step, you could also easily do this manually.

The resulting DataFrame contains columns for the time window, the number of documents classified as **"STABLY_HOUSED"**, the number classified as **"UNSTABLY_HOUSED"**, and the ReHouSED score calculated as the proportion. Note that although the input had 4 documents per time window, some documents were classified as **"UNKNOWN"** and dropped from the output.
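
To make the manual route concrete, here is a minimal pure-Python sketch of the same idea. It rests on several assumptions that are not confirmed by the package: windows are formed by integer division of `time_to_index` (in days) by the window size, the score is the share of **"STABLY_HOUSED"** documents among those classified as stably or unstably housed, and the input data and the `stable`/`unstable` field names are hypothetical. The actual `calculate_rehoused` implementation may differ.

```python
from collections import defaultdict

# Hypothetical (time_to_index in days, document_classification) pairs for one patient
documents = [
    (2, "UNSTABLY_HOUSED"), (10, "UNSTABLY_HOUSED"), (18, "UNKNOWN"), (25, "UNSTABLY_HOUSED"),
    (33, "UNSTABLY_HOUSED"), (41, "STABLY_HOUSED"), (50, "UNKNOWN"), (58, "STABLY_HOUSED"),
    (65, "STABLY_HOUSED"), (72, "STABLY_HOUSED"), (80, "STABLY_HOUSED"), (88, "UNKNOWN"),
]


def manual_rehoused(documents, window_size=30):
    """Count classifications per time window and return one row per window."""
    counts = defaultdict(lambda: {"STABLY_HOUSED": 0, "UNSTABLY_HOUSED": 0})
    for days, label in documents:
        if label not in ("STABLY_HOUSED", "UNSTABLY_HOUSED"):
            continue  # UNKNOWN documents are dropped
        window = days // window_size  # assumed windowing rule
        counts[window][label] += 1

    rows = []
    for window in sorted(counts):
        stable = counts[window]["STABLY_HOUSED"]
        unstable = counts[window]["UNSTABLY_HOUSED"]
        rows.append({
            "time_window": window,
            "stable": stable,
            "unstable": unstable,
            # assumed definition: proportion of stably housed documents
            "rehoused": stable / (stable + unstable),
        })
    return rows


rows = manual_rehoused(documents)
for row in rows:
    print(row)
```

With real data, the same grouping could be done on the DataFrame directly, for example with a pandas `groupby` on the patient and window columns.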

In [None]:
rh = calculate_rehoused(df, window_size=30, patient_col="pt_id")

In [None]:
rh.head()

The ReHouSED score can now be used to show a longitudinal representation of a patient's housing stability. If we plot the ReHouSED score over time, we can see the patient's change from homelessness to stable housing. 

In [None]:
plt.step(rh["time_window"], rh["rehoused"], marker="o", where="post")

plt.xticks([0, 1, 2, 3])
plt.xlabel("Time step")
plt.ylabel("ReHouSED")
plt.title("Patient ReHouSED score over 90 days");

This aggregation method tolerates variation between documents. A patient with a large number of documents will likely have a more consistent and accurate ReHouSED score, and scores calculated for a large number of patients may support population-level analysis. Future work will look further at how to interpret this score and use it for population-level analysis.
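
One way to see why more documents give a steadier score is a back-of-the-envelope calculation. The sketch below treats each document as an independent draw that is classified as stably housed with a fixed probability `p`. This is a simplifying assumption for illustration only, and the value of `p` is hypothetical. Under that assumption the standard error of the observed proportion shrinks with the square root of the number of documents.

```python
import math


def proportion_standard_error(p, n):
    """Standard error of a proportion estimated from n independent documents."""
    return math.sqrt(p * (1 - p) / n)


p = 0.75  # hypothetical underlying proportion of stably housed documents
for n in (4, 16, 64):
    print(f"n={n:3d}  standard error={proportion_standard_error(p, n):.3f}")
```

Quadrupling the number of documents in a window halves the standard error, which matches the intuition that sparse windows produce noisier scores.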