In [None]:
import spacy
from IPython.display import Image

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
import seaborn as sns
sns.set()

# Homework: Clinical Information Extraction
Over the last two weeks, you've been introduced to a number of tools for extracting information from clinical text:
- A rule-based matcher using the `EntityRuler` class
- A pre-trained statistical model for extracting **"PROBLEM"**, **"TREATMENT"**, and **"TEST"** entities
- `ConText` for extracting contextual information such as negation, uncertainty, and family history

For your homework assignment, we'll put it all together, improve our model, and deploy it on MIMIC data. Here is an outline of this assignment:

- Build an NLP model which includes the `EntityRuler`, statistical `ner`, and `ConText`
- Load a sample of discharge summaries from MIMIC
- Review the output of your NLP model on a small number of documents and make improvements by adding patterns or ConTextItems
- Deploy your NLP model on the entire dataset and convert it to structured data
- Analyze the classes and spans of text extracted by your model

As usual, let me know on Slack or Canvas if you have any questions or issues. Let's get started!

# I. Build your model
We'll create a new model by loading the various pieces which we have.

In [None]:
from spacy.pipeline import EntityRuler

In [None]:
from cycontext import ConTextComponent, ConTextItem
from cycontext.viz import visualize_dep, visualize_ent

### TODO
Load a clinical `nlp` model using spacy.

In [None]:
### Your code here


### TODO
- Create a rule-based matcher using spaCy. For the rule-based component, we'll add a few extra arguments:
    - **phrase_matcher_attr="LOWER"**: This will make the string matching case insensitive
    - **overwrite_ents=False**: This will add new entities alongside those extracted by the `ner` component rather than overwriting them
- Next add the empty list of patterns to your rule-based component
- Add the rule-based component to your nlp pipeline

In [None]:
ruler = ____(____, phrase_matcher_attr="LOWER", overwrite_ents=False)

In [None]:
nlp.____(____)

### TODO
Load `ConText` and add it to your pipeline. You can either load ConText with the predefined rules by setting `rules="default"` **or**, if you want to build all of the rules yourself, start with a completely blank one by setting `rules=None`.

In [None]:
%%capture
context = ____(____, rules=____)

In [None]:
nlp.____(____)

# II. Get Discharge Summaries from MIMIC
A **discharge summary** is written at the end of a patient's stay in the hospital. It typically contains a summary of the patient, the diagnoses for which they were admitted, and the treatment that they received during their stay. The rich content of these documents makes them an excellent candidate for processing with NLP.

Clinical documents are stored in MIMIC in the table `noteevents`. We will query a number of random notes from this table and limit them to discharge summaries through the **"category"** column.

For some reason, the UVU VPN has not been connecting to the MIMIC database, so a set of 100 randomly selected documents has been saved to the **data/** folder and will be loaded automatically if the connection fails or times out. If you don't want to try to connect through the VPN, uncomment the `raise ValueError(...)` line and the code will go straight to the cached data.
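
The fallback works because any exception raised inside a `try` block, including a deliberately raised `ValueError`, hands control to the `except` branch. Here is a minimal sketch of that control flow. The functions `query_database`, `read_cache`, and `load_notes` are hypothetical stand-ins and are not part of this notebook:

```python
def query_database():
    # Hypothetical stand-in for the MIMIC query; simulates a timed-out connection.
    raise TimeoutError("connection timed out")


def read_cache():
    # Hypothetical stand-in for reading the cached JSON file on disk.
    return ["cached note 1", "cached note 2"]


def load_notes(skip_database=False):
    """Return notes from the database, falling back to the cache on any error."""
    try:
        if skip_database:
            # Raising here jumps straight to the except branch below.
            raise ValueError("Skipping the database")
        return query_database()
    except Exception as err:
        print(f"Falling back to cached data ({err})")
        return read_cache()


notes = load_notes(skip_database=True)
```

Either path ends with the same kind of object, so the rest of the notebook does not need to know where the notes came from.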

In [None]:
import pandas as pd
import pymysql
import getpass

In [None]:
# First, try to connect to MIMIC
# If the connection fails, fall back to the cached data on disk
try:
    # raise ValueError("Skipping MIMIC and going straight to the data on disk")
    conn = pymysql.connect(host="35.233.174.193",port=3306,
                           user="jovyan",passwd=getpass.getpass("Enter password for MIMIC2 database"),
                           db='mimic2')
    query = """

        SELECT subject_id, text
        FROM noteevents
        WHERE category = 'DISCHARGE_SUMMARY'
        ORDER BY RAND()
        LIMIT 100

        """
    df = pd.read_sql(query, conn)
except Exception:  # Any connection or query error falls through to the cached data
    print("Failed connection to MIMIC. Reading cached data instead.")
    df = pd.read_json("./data/100_mimic_discharge_summaries.json")

In [None]:
df.head()

In [None]:
len(df)

# III. Process your texts and review the output
Next, we'll process the discharge summaries and review what our system extracts. Before doing any additional analysis on our dataset, we'll make sure that our system is performing well and see what changes we can make to improve it. You would typically do this by splitting the dataset into **training** and **testing** sets, but for now we'll just combine them into one.

## Review output
### TODO
Go through at least the first 10 or so discharge summaries. Process each one individually and review the output. To improve the system, look for mistakes in the NLP output. These mistakes can be:
- **False negatives**: Missing a target entity. This will happen when you see a clinical problem, treatment or test in the text that is not highlighted. You can fix this by **adding patterns** to the `ruler`
- **False positives**: Spans of text which are highlighted but should not be. These are harder to fix. You could write rules to remove an entity from `doc.ents`, but this is a little tricky and difficult to generalize
- **Missing modifiers**: ConText modifiers, such as **"NEGATED_EXISTENCE"**, will be highlighted in the text as well. If you see one that is missing, add it to ConText by creating a new `ConTextItem`. You can also visualize which targets the modifiers are applied to by using the `visualize_dep` function.
    - **A note about `visualize_dep`**: This function works best on a *single* sentence rather than an entire doc. So instead of calling `visualize_dep(doc)`, manually copy a short piece of text, process it with `nlp`, and view the output by calling: `visualize_dep(nlp("..."))`


Remember, **NLP will never be perfect!** Mistakes are expected, and some of them will seem weird and confusing. As NLP developers, it's our role to identify where improvements can be made and to decide how much error is acceptable.

### Come back here once you've added patterns or item_data!

In [None]:
# Process text
text = df.iloc[0]["text"] # Change this number to go through all of the output
doc = nlp(text)

In [None]:
# Raw text output
print(text)

In [None]:
# Highlighted entities and modifiers
visualize_ent(doc)

In [None]:
# ConText target-modifier graph for short sentences
visualize_dep(nlp("There is no prior history of such an episode."))

## Make your changes here
Below are empty lists of `patterns` and `ConTextItems` which you can add to the `ruler` and `context` components, respectively. After running the cells to add the patterns or rules, go back up to the cell where you call `doc = nlp(text)` to reprocess the text with your updated model.
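
As a rough guide to the expected shape, each `EntityRuler` pattern is a dictionary with a `"label"` and a `"pattern"`, where the pattern is either a phrase string or a list of token-attribute dictionaries. The entries below are made-up illustrations rather than patterns taken from the MIMIC notes, and `is_valid_pattern` is a hypothetical helper for sanity-checking a list before adding it:

```python
example_patterns = [
    # Phrase pattern: matches the whole phrase (case-insensitively when
    # the ruler is built with phrase_matcher_attr="LOWER").
    {"label": "PROBLEM", "pattern": "shortness of breath"},
    # Token pattern: one dictionary of token attributes per token.
    {"label": "TREATMENT", "pattern": [{"LOWER": "chest"}, {"LOWER": "tube"}]},
]


def is_valid_pattern(p):
    """Check that p has a string label and a phrase or token-list pattern."""
    if not isinstance(p, dict):
        return False
    if not isinstance(p.get("label"), str):
        return False
    pattern = p.get("pattern")
    if isinstance(pattern, str):
        return True
    # A token pattern must be a non-empty list of dictionaries.
    return (
        isinstance(pattern, list)
        and len(pattern) > 0
        and all(isinstance(token, dict) for token in pattern)
    )


all_valid = all(is_valid_pattern(p) for p in example_patterns)
```

Choose labels that match the ones your `ner` component already produces, so that rule-based and statistical entities can be analyzed together later.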

In [None]:
patterns = [
    
]

In [None]:
ruler.add_patterns(patterns)

In [None]:
context_item_data = [
    
]

In [None]:
context.add(context_item_data)

### Now go back, reprocess the doc, and see if your changes worked!

# IV. Deploy your model and convert text to structured data
Now that you've fine-tuned and improved your model, we're ready to run it on the entire dataset and analyze it! In this step, we'll show how you can use NLP to convert text to **structured** data, which you can then analyze in the same way that we previously analyzed structured EHR data like **labs** and **vitals**. We'll now extract all of the entities from our docs and convert them into a pandas DataFrame.

Start by creating a list called `docs` which contains all the `doc` objects created by our model. We can do this by calling `nlp.pipe()` on the column of the DataFrame containing the text notes and then converting it to a list. This might take a minute or two. We'll measure how long it takes by using the `%%time` magic function.

In [None]:
%%time
docs = list(nlp.pipe(df["text"]))

## Convert entities to dictionaries
Next, we want to convert the **entities** in our docs to **dictionaries**, which can later be turned into a DataFrame. Here are the attributes we want to save for each ent:

- `subject_id`: The patient identifier so that we can do patient-level analysis. This is stored in the **"subject_id"** column of `df`
- `text`: The text covered by the span. To normalize these phrases, we'll lowercase the text by accessing the `ent.lower_` attribute
- `sent`: The text of the sentence containing the entity. This will be helpful later if we want to look at the context of an entity. This is available as a string in the attribute `ent.sent.text`
- `label`: The label assigned by our NER model or entity ruler. You can access this through the `ent.label_` attribute
- `is_negated`, `is_historical`, `is_uncertain`, `is_family`, and `is_hypothetical`: The attributes set by cycontext. We will use these later to find conditions that occur in a patient's family history and to exclude conditions which the patient never experienced

### TODO
Write a function called `ent_to_dict` which takes two arguments, `subject_id` and `ent`, and returns a dictionary with all of the attributes described above. Then, run this function on a short example and make sure that the output looks correct.

In [None]:
def ____(subject_id, ent):
    d = {}
    d["subject_id"] = ____
    d["____"] = ent.____
    d[____] = ____.label_
    d[____] = ent.sent.text
    
    # ConText attributes
    d["is_negated"] = ent._.is_negated
    d[____] = ent._.is_historical
    d["is_uncertain"] = ent._.____
    d[____] = ent._.is_family
    d["is_hypothetical"] = ent._.____
    
    return d

In [None]:
# Look at an example
small_doc = nlp("The patient presents for treatment of his kidney cancer. He has a family history of diabetes.")

for ent in small_doc.ents:
    print(____("fake_patient", ent))
    print()

### TODO
Now, iterate through all of the subject_ids and docs. We can iterate through both at the same time using Python's `zip` function, which **"zips up"** two sequences into pairs. Process each `ent` in the `doc` with your function and append the output to `ent_dicts`. Then create a DataFrame called `ents_df` by calling `pd.DataFrame(ent_dicts)`.
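
If `zip` is new to you, here is a small sketch of the pairing on toy data. The IDs and entity strings are invented, and each "doc" is just a plain list of strings standing in for a real spaCy `doc`:

```python
toy_subject_ids = [101, 102]
# Stand-ins for docs: each one is a list of entity strings.
toy_docs = [["pneumonia", "aspirin"], ["chest x-ray"]]

rows = []
# zip yields (101, first doc), then (102, second doc).
for subject_id, toy_doc in zip(toy_subject_ids, toy_docs):
    for ent_text in toy_doc:
        rows.append({"subject_id": subject_id, "text": ent_text})

for row in rows:
    print(row)
```

Each entity becomes one dictionary, which is why a patient with many entities ends up with many rows in the final DataFrame.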

In [None]:
ent_dicts = []
for subject_id, doc in zip(df["subject_id"], docs):
    for ____ in doc.____:
        ent_dict = ____(subject_id, ent)
        ent_dicts.append(ent_dict)

In [None]:
ents_df = pd.____(____)

Look at the first 5 rows of the DataFrame. The columns should look like this screenshot below:

In [None]:
Image("./images/ents_df_columns.png")

In [None]:
ents_df.head()

# V. Analysis
Now, we can analyze our extracted dataset using pandas and matplotlib. Go through each of the sections below and follow the instructions to analyze the text.

## I. Label distribution
Let's see how many **problems**, **treatments**, and **tests** are extracted. Plot the count of each entity label in the dataset as a bar graph.
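
The counting step can be sketched in plain Python with `collections.Counter` on a made-up list of labels. In the notebook itself, one likely route is to call `value_counts()` on the **"label"** column of `ents_df` and then `.plot.bar()` on the result, but treat that as a suggestion rather than the required solution:

```python
from collections import Counter

# Invented labels standing in for the "label" column of ents_df.
toy_labels = ["PROBLEM", "TEST", "PROBLEM", "TREATMENT", "PROBLEM", "TEST"]

label_counts = Counter(toy_labels)

# Print a rough text bar chart, most frequent label first.
for label, count in label_counts.most_common():
    print(f"{label:<10} {'#' * count} ({count})")
```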

## II. Treatment texts
Let's see what treatments are being used in these patient visits.

### TODO
- Using boolean indexing, create a DataFrame called `treatments` which contains only **TREATMENT** entities
- Then identify the 10 most common texts in that DataFrame by calling `treatments["text"].value_counts().head(10)`. 
    - This is similar to `treatments.groupby("text").size()`, but it will sort it and select the 10 most frequent. Save the output of this as `common_treatment_texts`. 
- Then, plot a horizontal bar graph of `common_treatment_texts`. (Horizontal because that will make the labels easier to read)

In [None]:
treatments = ents_df[ents_df[____] == ____]

In [None]:
common_treatment_texts = treatments["text"].____().____(10)
common_treatment_texts

In [None]:
# Horizontal bar graph
____.plot.barh()

## III. Problems relevant to a visit
As we saw in the previous notebook, many of the conditions mentioned in a document were not actually experienced by the patient during the hospital stay. That is why we ran **context** to generate attributes such as **is_negated**. Let's now look at all problems in the dataset which are **relevant** to the visit, meaning that all of the context attributes are `False` (i.e., the problem is **not** historical, **not** negated, etc.)

### TODO
- Using boolean indexing, create a new DataFrame called `problems` where the **label** is **"PROBLEM"**
- Next, filter the rows to keep those where all of the ConText attributes are `False`
- Save this as a DataFrame called `relv_problems`. An outline of the code for checking the ConText attributes has already been started for you
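
The filtering logic itself, "keep a row only if none of its context flags are set", can be sketched on toy dictionaries without pandas. The rows and the `make_row` helper below are hypothetical, and the flag names simply mirror the columns described earlier:

```python
CONTEXT_FLAGS = [
    "is_negated",
    "is_historical",
    "is_uncertain",
    "is_family",
    "is_hypothetical",
]


def make_row(text, **flags):
    """Build a toy row with every context flag False unless overridden."""
    row = {"text": text}
    for flag in CONTEXT_FLAGS:
        row[flag] = flags.get(flag, False)
    return row


toy_rows = [
    make_row("pneumonia"),                 # current and asserted
    make_row("fever", is_negated=True),    # e.g. "denies fever"
    make_row("diabetes", is_family=True),  # e.g. "mother has diabetes"
]

# Keep only rows where none of the context flags are True.
relevant = [
    row for row in toy_rows if not any(row[flag] for flag in CONTEXT_FLAGS)
]
```

In pandas, chaining the comparisons with `&` expresses the same "all flags are False" condition over whole columns at once.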

In [None]:
# Use boolean indexing to filter ents_df to only problems
problems = ____ 

In [None]:
# Create relv_problems
____ = problems[
    (problems["is_negated"] == False)
        &
    (problems["is_historical"] == ____)
        &
    (problems[____] == False)
        &
    (problems["is_family"] == ____)
        &
    (problems[____] == False)
]

In [None]:
relv_problems.head()

In [None]:
# Horizontal bar graph
____["text"].value_counts().____(10).plot.____()

## IV. Patients with a family history of cancer
In addition to **excluding** conditions which are not experienced by a patient, context can also help us target conditions which occurred in a patient's family history. While these conditions may not directly affect a patient, they are important to the patient's health because the patient may be at heightened risk for the same condition or for other complications.

In cycontext, we can detect this by using the `is_family` attribute. 

**Note:** In cycontext, modifiers like **"girlfriend"** or **"husband"** are also considered **"FAMILY"**. In a real analysis of a patient's family history you would restrict the lexicon to a smaller number of modifiers which are actually family members.

Let's now find patients with a family history of cancer and see which types of cancer are mentioned.

### TODO
- Using boolean indexing, create a new DataFrame called `problems` where the **label** is **"PROBLEM"**
- Filter the problems to rows where:
    - `is_family` is `True`, meaning that someone other than the patient experienced this
    - `is_negated` is `False`, meaning that that person actually did experience it
    - The `text` attribute contains the word **"cancer"**. You can do this by using `problems["text"].str.contains("keyword")` in your filtering
        - This will miss out on abbreviations like **"ca"** or synonyms like **"metastasis"**, but will be good enough. If you'd like to be more extensive in your analysis, you can include these other terms
- Call this filtered DataFrame `fh_cancer`
- Once you have created a filtered dataset, look at some of the sentences in the `fh_cancer["sent"]` column and see who the experiencers are. You can use `visualize_ent(nlp(sentence))` to visualize this with your model 
- In a markdown cell at the bottom, write some of the experiencers that you've found
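
If you do want to catch abbreviations and synonyms as well, one option is a regular expression with word boundaries, so that **"ca"** matches as a standalone token but not inside a longer word. This is a sketch on invented strings. The keyword list is an assumption to adapt, and **"ca"** can also be an abbreviation for calcium, so review what it matches:

```python
import re

# Hypothetical keyword list; extend or trim it to suit your analysis.
# (?:...) is a non-capturing group and \b marks a word boundary.
CANCER_PATTERN = re.compile(r"\b(?:cancer|ca|metastasis)\b")

toy_texts = [
    "breast cancer",
    "colon ca",
    "hypercalcemia",    # contains "ca", but not as a standalone word
    "lung metastasis",
    "cad",
]

matches = [text for text in toy_texts if CANCER_PATTERN.search(text)]
print(matches)
```

Because pandas `str.contains` treats its argument as a regular expression by default, the same pattern string could be passed to it in place of a single keyword.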

In [None]:
# Create a DataFrame called problems


In [None]:
fh_cancer = problems[(problems["is_family"] == ____) 
                                & 
                    (problems["is_negated"] == ____)
                                &
                    (problems[____].str.contains(____))]

In [None]:
# Phrases extracted
list(fh_cancer["text"])

In [None]:
sent = fh_cancer.iloc[0]["sent"]
visualize_ent(nlp(sent))

### Family members with cancer:
- ...