In [None]:
import medspacy
from IPython.display import Image

In [None]:
from medspacy.visualization import visualize_dep, visualize_ent
from medspacy.context import ConTextItem
from medspacy.ner import TargetRule

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
import seaborn as sns
sns.set()

# Homework: Clinical Information Extraction
Over the last two weeks, you've been introduced to a number of tools for extracting information from clinical text:
- A rule-based matcher using the `TargetMatcher` class
- A pre-trained statistical `NER` model for extracting **"PROBLEM"**, **"TREATMENT"**, and **"TEST"** entities
- `ConTextComponent` for extracting contextual information such as negation, uncertainty, and family history

For your homework assignment, we'll put it all together, improve our model, and deploy it on MIMIC data. Here is an outline of this assignment:

- Build a medspaCy model which includes the `TargetMatcher`, statistical `NER`, and `ConTextComponent`
- Load a sample of discharge summaries from MIMIC
- Review the output of your NLP model on a small number of documents and make improvements by adding patterns or ConTextItems
- Deploy your NLP model on the entire dataset and convert it to structured data
- Analyze the classes and spans of text extracted by your model

As usual, let me know on Slack or Canvas if you have any questions or issues. Let's get started!

# I. Build your model
We'll create a new model by loading the various pieces which we have.

### TODO
Load a clinical `nlp` model using medspaCy.

In [None]:
nlp = medspacy.load("en_info_3700_i2b2_2012", 
                    disable=["tagger", "parser"] # Disable the POS tagger and dependency parser to speed up performance
                   )

In [None]:
nlp.pipe_names

Here are the two components that we will customize:

In [None]:
target_matcher = nlp.get_pipe("target_matcher")

In [None]:
context = nlp.get_pipe("context")

# II. Get Discharge Summaries from MIMIC
A **discharge summary** is written at the end of a patient's stay in the hospital. It typically contains a summary of the patient, the diagnoses for which they were admitted, and the treatment that they received during their stay. The rich content of these documents makes them an excellent candidate for processing with NLP.

Clinical documents are stored in MIMIC in the table `noteevents`. We will query a number of notes from this table and limit them to discharge summaries through the **"category"** column. 

In [None]:
import pandas as pd
import pymysql
import getpass

In [None]:
# Change to your username
username = ""

conn = pymysql.connect(host="35.233.174.193",port=3306,
                       user=username,passwd=getpass.getpass("Enter password for MIMIC2 database"),
                       db='mimic2')

In [None]:
query = """
SELECT subject_id, text
FROM noteevents
WHERE category = 'DISCHARGE_SUMMARY'
ORDER BY RAND()
LIMIT 100
"""
df = pd.read_sql(query, conn)

In [None]:
df.head()

In [None]:
len(df)
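
If you want to try the shape of this query without database credentials, the sketch below runs a similar one against a small in-memory SQLite table. The table contents, the names `conn_demo`, `query_demo`, and `df_demo`, and the `RADIOLOGY_REPORT` category are made up for illustration; they are not taken from MIMIC. Note that SQLite spells the random-order function `RANDOM()`, while the MySQL query above uses `RAND()`.

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory stand-in for the MIMIC `noteevents` table
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute(
    "CREATE TABLE noteevents (subject_id INTEGER, category TEXT, text TEXT)"
)
conn_demo.executemany(
    "INSERT INTO noteevents VALUES (?, ?, ?)",
    [
        (1, "DISCHARGE_SUMMARY", "Admitted with pneumonia."),
        (2, "RADIOLOGY_REPORT", "Chest x-ray shows no infiltrate."),
        (3, "DISCHARGE_SUMMARY", "Discharged on aspirin."),
        (4, "DISCHARGE_SUMMARY", "No evidence of sepsis."),
    ],
)
conn_demo.commit()

# Same structure as the MIMIC query: filter on category, shuffle, then limit
query_demo = """
SELECT subject_id, text
FROM noteevents
WHERE category = ?
ORDER BY RANDOM()
LIMIT 2
"""
df_demo = pd.read_sql(query_demo, conn_demo, params=("DISCHARGE_SUMMARY",))
print(df_demo)
```

Because the rows are shuffled before the limit is applied, each run can return a different sample, just as the query against MIMIC does.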

# III. Process your texts and review the output
Next, we'll process the discharge summaries and review what our system extracts. Processing full notes is a computationally expensive process, so we'll start by looking at just a few texts before processing the entire batch later.

In [None]:
%%time
texts = df["text"].iloc[:5] # Small sample to start with
docs = list(nlp.pipe(texts))

In [None]:
from medspacy.visualization import visualize_ent, visualize_dep
from medspacy.visualization import MedspaCyVisualizerWidget

In [None]:
w = MedspaCyVisualizerWidget(docs)

In [None]:
# idx = 0
# visualize_ent(docs[idx])

## Optional: Improve your model
As we've seen, our default model is not going to be perfect. If you'd like to spend some time improving your model, go through a few docs above and find mistakes. Then fix them using the methods we saw in previous notebooks.

- **False negatives**: A target entity is missed. You will see this when a clinical problem, treatment, or test in the text is not highlighted. You can fix this by **adding patterns** to the `target_matcher`
- **False positives**: Spans of text which are highlighted but should not be. These are harder to fix. You could write rules to remove an entity from `doc.ents`, but this is a little tricky and difficult to generalize
- **Missing modifiers**: ConText modifiers, such as **"NEGATED_EXISTENCE"**, will be highlighted in the text as well. If you see one that is missing, add it to ConText by creating a new `ConTextItem`. You can also visualize which targets the modifiers are applied to by using the `visualize_dep` function.
    - **A note about `visualize_dep`**: This function works best on a *single* sentence rather than an entire doc. So instead of calling `visualize_dep(doc)`, manually add some text, process it with the nlp, and then view the output by calling:  `visualize_dep(nlp("..."))`
    
Edit the cells below to add `TargetRules` and `ConTextItems` to fix mistakes you find in the texts.

In [None]:
from medspacy.ner import TargetRule
from medspacy.context import ConTextItem

In [None]:
target_matcher = nlp.get_pipe("target_matcher")

target_rules = [
    # TargetRule(...),
]

In [None]:
target_matcher.add(target_rules)

In [None]:
context = nlp.get_pipe("context")

context_rules = [
    # ConTextItem(...)
]

In [None]:
context.add(context_rules)

Once you've added new rules, go back to the cells at the beginning of this section, reprocess your docs, and reload your visualizer.

### Now go back, reprocess the doc, and see if your changes worked!

# IV. Deploy your model and convert text to structured data
Now that you've fine-tuned and improved your model, we're ready to run it on the entire dataset and analyze it! In this step, we'll show how you can use NLP to convert text to **structured** data, which you can then analyze in the same way that we previously analyzed structured EHR data like **labs** and **vitals**. We'll now extract all of the entities from our docs and convert them into a pandas DataFrame.

Start by creating a list called `docs` which contains all the `doc` objects created by our model. We can do this by calling `nlp.pipe()` on the column of the DataFrame containing the text notes and then converting it to a list. This might take a minute or two. We'll measure how long it takes by using the `%%time` magic function.

In [None]:
%%time
texts = df["text"] # Process all of the texts
docs = list(nlp.pipe(texts))

Now we'll add the processed `docs` to our DataFrame:

In [None]:
df["doc"] = docs

In [None]:
df.head()

## Convert to structured data
One of the primary tasks of NLP is to take **unstructured, free-text data** and convert it to **structured data** which can be analyzed using the methods we've used previously. In this next section, we'll convert all of the entities in our docs into a DataFrame.

Below is a helper function which will take our DataFrame with one row per document and return a new DataFrame with one row per entity, along with attributes of the entities as columns. Here are the attributes we want to save for each ent:

- `subject_id`: The patient identifier so that we can do patient-level analysis. This is stored in the **"subject_id"** column of `df`
- `text`: The text which is included in the span. To normalize all of these phrases, we'll lowercase it by accessing the `ent.lower_` attribute
- `sent`: The text of the sentence containing the entity. This will be helpful later if we want to look at the context of an entity. This is available as a string in the attribute `ent.sent.text`
- `label`: The label assigned by our NER model or target matcher. You can access this through the `ent.label_` attribute
- `is_negated`, `is_historical`, `is_uncertain`, `is_family`, and `is_hypothetical`: The attributes assigned by ConText. We will use these later to find conditions that occur in a patient's family history and to exclude conditions which the patient never experienced

In [None]:
def create_ents_df(df):
    ent_dicts = []
    for i, row in df.iterrows():
        ent_dicts += process_row(row)
    return pd.DataFrame(ent_dicts)
        
def process_row(row):
    ent_dicts = []
    for ent in row["doc"].ents:
        ent_dicts.append(ent_to_dict(row["subject_id"], ent))
    return ent_dicts
        
def ent_to_dict(subject_id, ent):
    d = {}
    d["subject_id"] = subject_id
    d["text"] = ent.lower_
    d["label"] = ent.label_
    d["sent"] = ent.sent.text
    
    # ConText attributes
    d["is_negated"] = ent._.is_negated
    d["is_historical"] = ent._.is_historical
    d["is_uncertain"] = ent._.is_uncertain
    d["is_family"] = ent._.is_family
    d["is_hypothetical"] = ent._.is_hypothetical
    
    return d
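
To see the one-row-per-entity reshaping without running the full pipeline, here is a minimal sketch on toy data. The `make_ent` helper and the `SimpleNamespace` objects are hypothetical stand-ins that mimic only the few attributes read here (`ents`, `lower_`, `label_`, `_.is_negated`); they are not real spaCy objects.

```python
from types import SimpleNamespace

import pandas as pd


def make_ent(text, label, negated=False):
    # Hypothetical stand-in for an entity span; exposes only what we read below
    return SimpleNamespace(
        lower_=text.lower(),
        label_=label,
        _=SimpleNamespace(is_negated=negated),
    )


# One row per document, as in `df`
toy_df = pd.DataFrame({
    "subject_id": [10, 11],
    "doc": [
        SimpleNamespace(ents=[make_ent("Pneumonia", "PROBLEM"),
                              make_ent("Aspirin", "TREATMENT")]),
        SimpleNamespace(ents=[make_ent("Fever", "PROBLEM", negated=True)]),
    ],
})

# Flatten to one row per entity, carrying the patient identifier along
rows = [
    {
        "subject_id": row["subject_id"],
        "text": ent.lower_,
        "label": ent.label_,
        "is_negated": ent._.is_negated,
    }
    for idx, row in toy_df.iterrows()
    for ent in row["doc"].ents
]
toy_ents_df = pd.DataFrame(rows)
print(toy_ents_df)
```

Two documents with three entities in total become a three-row DataFrame, and the first patient's `subject_id` is repeated once per entity.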

Run the cell below to generate the new DataFrame:

In [None]:
ents_df = create_ents_df(df)

In [None]:
ents_df.head()

# V. Analysis
Now, we can analyze our extracted dataset using pandas and matplotlib. Go through each of the sections below and follow the instructions to analyze the text.

## I. Label distribution
Let's see how many **problems**, **treatments**, and **tests** are extracted. Plot the count of entity labels in the dataset as a bar graph.

In [None]:
ents_df.____("label").size().____.bar()

## II. Treatment texts
Let's see what treatments are being used in these patient visits.

### TODO
- Using boolean indexing, create a DataFrame called `treatments` which contains only **TREATMENT** entities
- Then identify the 10 most common texts in that DataFrame by calling `treatments["text"].value_counts().head(10)`. 
    - This is similar to `treatments.groupby("text").size()`, but it will sort it and select the 10 most frequent. Save the output of this as `common_treatment_texts`. 
- Then, plot a horizontal bar graph of `common_treatment_texts`. (Horizontal because that will make the labels easier to read)
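
The difference between `value_counts()` and `groupby(...).size()` mentioned above can be seen on a toy example. The DataFrame `toy_treatments` below is made-up data, not output from the model.

```python
import pandas as pd

# Made-up treatment mentions for illustration
toy_treatments = pd.DataFrame(
    {"text": ["aspirin", "heparin", "aspirin", "insulin", "aspirin", "heparin"]}
)

# groupby + size: same counts, ordered by the group key (alphabetical here)
by_group = toy_treatments.groupby("text").size()

# value_counts: same counts, ordered from most to least frequent
by_count = toy_treatments["text"].value_counts()

# head(n) after value_counts keeps the n most frequent texts
top2 = by_count.head(2)
print(by_group)
print(top2)
```

Both calls count the same things; `value_counts()` simply saves you the sorting step before `head()`.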

In [None]:
treatments = ents_df[ents_df["label"] == ____]

In [None]:
common_treatment_texts = treatments["text"].value_counts().head(10)
common_treatment_texts

In [None]:
____.____.barh()

## III. Problems relevant to a visit
As we saw in the previous notebook, many of the conditions mentioned in a document were not actually experienced by the patient during the hospital stay. That is why we ran **context** to generate attributes such as **is_negated**. Let's now look at all problems in the dataset which are **relevant** to the visit, meaning that all of the context attributes are `False` (i.e., the problem is **not** historical, **not** negated, etc.).

### TODO
- Using boolean indexing, create a new DataFrame called `problems` where the **label** is **"PROBLEM"**
- Next, filter the rows to keep those where all of the ConText attributes are `False`
- Save this as a DataFrame called `relv_problems`
- Plot the 10 most frequent spans of text

In [None]:
problems = ____

In [None]:
relv_problems = problems[
    (problems["is_negated"] == False)
        &
    (problems["is_historical"] == False)
        &
    (problems["is_uncertain"] == False)
        &
    (problems["is_family"] == False)
        &
    (problems["is_hypothetical"] == False)
]
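
As a side note, a chain of `== False` comparisons can also be written more compactly by checking whether *any* flag column is `True` and then inverting the result. The sketch below compares the two forms on made-up data; `toy_problems` and `flag_cols` are illustrative names, and it assumes the flag columns hold plain booleans with no missing values.

```python
import pandas as pd

# Made-up rows with a subset of the ConText flags
toy_problems = pd.DataFrame({
    "text": ["pneumonia", "fever", "stroke", "diabetes"],
    "is_negated": [False, True, False, False],
    "is_historical": [False, False, False, True],
    "is_family": [False, False, True, False],
})
flag_cols = ["is_negated", "is_historical", "is_family"]

# Chained form, as in the cell above
chained = toy_problems[
    (toy_problems["is_negated"] == False)
    & (toy_problems["is_historical"] == False)
    & (toy_problems["is_family"] == False)
]

# Compact form: keep rows where no flag column is True
compact = toy_problems[~toy_problems[flag_cols].any(axis=1)]
print(compact)
```

Both forms select the same rows. The compact form is easier to extend when more attributes are added, while the chained form makes each condition explicit.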

In [None]:
relv_problems.head()

In [None]:
len(relv_problems)

Now plot the 10 most common spans of text:

In [None]:
____[____].value_counts().head(10).____.barh()

## IV. Patients with a family history of cancer
In addition to **excluding** conditions which are not experienced by a patient, context can also help us target conditions which occurred in a patient's family history. While these conditions may not directly affect a patient, they are important to the patient's health because the patient may be at heightened risk for the same condition or for related complications.

In ConText, we can detect this by using the `is_family` attribute.

**Note:** In ConText, modifiers like **"girlfriend"** or **"husband"** are also considered **"FAMILY"**. In a real analysis of a patient's family history, you would restrict the lexicon to a smaller number of modifiers which are actually family members.

Let's now find patients with a family history of cancer and see what types of cancer are mentioned.

### TODO
- Using boolean indexing, create a new DataFrame called `problems` where the **label** is **"PROBLEM"**
- Filter the problems to rows where: 
    - `is_family` is `True`, meaning that someone other than the patient experienced this
    - The `text` attribute contains the word **"cancer"**. You can do this by using `problems["text"].str.contains("keyword")` in your filtering
- Call this filtered DataFrame `fh_cancer`
- Once you have created the filtered dataset, use the medspaCy widget to look at some of the sentences containing family history of cancer.
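
Here is a small sketch of how `.str.contains` combines with a boolean column, using made-up data and illustrative names (`toy_fh`, `mask`, `fh_toy`). The `na=False` argument is an addition not used in the cells below: it treats missing text as "no match" rather than propagating a missing value into the mask.

```python
import pandas as pd

# Made-up entity rows; one row has missing text
toy_fh = pd.DataFrame({
    "text": ["breast cancer", "pneumonia", "colon cancer", None, "lung cancer"],
    "is_family": [True, True, False, False, True],
})

# Both conditions must hold: family modifier present and text mentions cancer
mask = toy_fh["is_family"] & toy_fh["text"].str.contains("cancer", na=False)
fh_toy = toy_fh[mask]
print(fh_toy)
```

Only the rows that are both flagged as family and mention cancer remain. Since the entity text was lowercased when the DataFrame was built, a lowercase keyword is enough here.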

In [None]:
problems = ____

In [None]:
fh_cancer = problems[(problems["is_family"] == ____)
                    &
                    (problems[____].str.contains("cancer"))]

In [None]:
# Phrases extracted
list(fh_cancer["text"])

In [None]:
fh_docs = list(nlp.pipe(fh_cancer["sent"]))

In [None]:
w = MedspaCyVisualizerWidget(fh_docs)

In [None]:
# idx = 0
# visualize_dep(fh_docs[idx])
# visualize_ent(fh_docs[idx])