In [None]:
import medspacy
from IPython.display import Image

In [None]:
from medspacy.visualization import visualize_dep, visualize_ent, MedspaCyVisualizerWidget
from medspacy.context import ConTextItem
from medspacy.ner import TargetRule

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
import seaborn as sns
sns.set()

# Homework: Clinical Information Extraction
Over the last two weeks, you've been introduced to a number of tools for extracting information from clinical text:
- A rule-based matcher using the `TargetMatcher` class
- A pre-trained statistical `NER` model for extracting **"PROBLEM"**, **"TREATMENT"**, and **"TEST"** entities
- `ConTextComponent` for extracting contextual information such as negation, uncertainty, and family history

For your homework assignment, we'll put it all together, improve our model, and deploy it on MIMIC data. Here is an outline of this assignment:

- Build a medspaCy model which includes the `TargetMatcher`, statistical `NER`, and `ConTextComponent`
- Load a sample of discharge summaries from MIMIC
- Review the output of your NLP model on a small number of documents and make improvements by adding patterns or ConTextItems
- Deploy your NLP model on the entire dataset and convert it to structured data
- Analyze the classes and spans of text extracted by your model

As usual, let me know on Slack or Canvas if you have any questions or issues. Let's get started!

# 1. Build your model
We'll create a new model by loading the various pieces which we have.

### TODO
Load a clinical `nlp` model using medspaCy.

In [None]:
nlp = medspacy.load("en_info_3700_i2b2_2012", 
                    enable=["sentencizer", "target_matcher", "context", "sectionizer"]
                   )

Here are the two components that we will customize:

In [None]:
target_matcher = nlp.get_pipe("target_matcher")

In [None]:
context = nlp.get_pipe("context")

# 2. Get Discharge Summaries from MIMIC
A **discharge summary** is written at the end of a patient's stay in the hospital. It typically contains a summary of the patient, the diagnoses for which they were admitted, and the treatment that they received during their stay. The rich content of these documents makes them an excellent candidate for processing with NLP.

Clinical documents are stored in MIMIC in the table `noteevents`. We will query a number of notes from this table and limit them to discharge summaries through the **"category"** column. We'll just look at 100 notes for now, but if you'd like to increase the number later to get a larger sample size you can.

In [None]:
import pandas as pd
import pymysql
import getpass

In [None]:
# Change to your username
username = "uvu10919523"

conn = pymysql.connect(host="35.233.174.193", port=3306,
                       user=username, passwd=getpass.getpass("Enter password for MIMIC2 database"),
                       db="mimic2")

In [None]:
query = """
SELECT subject_id, text
FROM noteevents
WHERE category = 'DISCHARGE_SUMMARY'
LIMIT 100
"""
df = pd.read_sql(query, conn)

In [None]:
df.head()

# 3. Process some texts and review the output
Next, we'll process the discharge summaries and review what our system extracts. Processing full notes is computationally expensive, so we'll start by looking at just a few texts before processing the entire batch later.

In [None]:
%%time
texts = df["text"].iloc[:5] # Small sample to start with
docs = list(nlp.pipe(texts))

In [None]:
from medspacy.visualization import visualize_ent, visualize_dep
from medspacy.visualization import MedspaCyVisualizerWidget

In [None]:
w = MedspaCyVisualizerWidget(docs)

In [None]:
# idx = 0
# visualize_ent(docs[idx])

## Optional: Improve your model
As we've seen, our default model is not going to be perfect. If you'd like to spend some time improving your model, go through a few docs above and find mistakes. Then fix them using the methods we saw in previous notebooks.

- **False negatives**: Missing a target entity. This happens when you see a clinical problem, treatment, or test in the text that is not highlighted. You can fix this by **adding patterns** to the `target_matcher`
- **False positives**: Spans of text which are highlighted but should not be. These are harder to fix. You could write rules to remove an entity from `doc.ents`, but this is a little tricky and difficult to generalize
- **Missing modifiers**: ConText modifiers, such as **"NEGATED_EXISTENCE"** will be highlighted in the text as well. If you see one that is missing, add it to ConText by creating a new `ConTextItem`. You can also visualize what targets the modifiers are applied to by using the `visualize_dep` function.
    - **A note about `visualize_dep`**: This function works best on a *single* sentence rather than an entire doc. So instead of calling `visualize_dep(doc)`, manually add some text, process it with `nlp`, and then view the output by calling: `visualize_dep(nlp("..."))`
    
Edit the cells below to add `TargetRules` and `ConTextItems` to fix mistakes you find in the texts.

In [None]:
from medspacy.ner import TargetRule
from medspacy.context import ConTextItem

In [None]:
target_matcher = nlp.get_pipe("target_matcher")

target_rules = [
    # TargetRule(...),
]
# Register the new rules with the matcher so they take effect
target_matcher.add(target_rules)

In [None]:
context = nlp.get_pipe("context")

context_rules = [
    # ConTextItem(...)
]
# Register the new rules with ConText so they take effect
context.add(context_rules)

Once you've added new rules, go back to the cells at the beginning of this section, reprocess your docs, and reload your visualizer.

### Now go back, reprocess the doc, and see if your changes worked!

# 4. Deploy your model and convert text to structured data
Now that you've fine-tuned and improved your model, we're ready to run it on the entire dataset and analyze it! In this step, we'll show how you can use NLP to convert text to **structured** data, which you can then analyze in the same way that we previously analyzed structured EHR data like **labs** and **vitals**. We'll now extract all of the entities from our docs and write them to a sqlite database.

The function below will take your DataFrame, process all of the texts with your NLP model, and write the results to a file called `"nlp.db"`. 

In [None]:
from helpers import write_nlp_db

In [None]:
# This may take up to 2-3 minutes
write_nlp_db(nlp, df)

We can now connect to this local database using `sqlite3` and treat it like any other structured data. All of our data was written to a table called `ents`. We'll first load all of the results as a pandas dataframe, and in the next section we'll write queries to answer specific questions to explore the NLP-extracted data.
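
If you are unsure what a SQLite file contains, you can ask the database itself. The sketch below uses an in-memory stand-in rather than `nlp.db`; the table name `demo_ents` and its three columns are made up for illustration and are not the real schema written by `write_nlp_db`.

```python
import sqlite3

# In-memory stand-in for a database file; table and columns are illustrative only.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE demo_ents (doc_id INTEGER, text TEXT, label_ TEXT)")
demo_conn.execute("INSERT INTO demo_ents VALUES (1, 'aspirin', 'TREATMENT')")

# sqlite_master lists every object in the database; keep only the tables.
tables = [row[0] for row in demo_conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk). Keep the name and declared type.
columns = [(row[1], row[2]) for row in demo_conn.execute(
    "PRAGMA table_info(demo_ents)")]

print(tables)
print(columns)
```

Running the same two statements against `nlp_conn` (with `ents` in place of `demo_ents`) should list the columns that are actually available to query.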

In [None]:
import sqlite3
nlp_conn = sqlite3.connect("nlp.db")

In [None]:
query = """SELECT * FROM ents"""

In [None]:
ents_df = pd.read_sql(query, nlp_conn)

Take a look at the DataFrame below. What does each row correspond to? What do the various columns mean?

In [None]:
ents_df.head()

In [None]:
len(ents_df)

In [None]:
ents_df.columns

# 5. Analysis
Now, we can analyze our extracted dataset using SQL, pandas, and matplotlib, just like we did with MIMIC data in the past. As a reminder, you can run queries by passing them into `pd.read_sql` along with our connection object, which in this case is `nlp_conn`.

The table you will be querying is called `ents`.

Go through each of the sections below and analyze the data to answer the question. You can either write queries to get the numbers directly (i.e., `SELECT ... FROM ents`), or run the analyses in pandas using the DataFrame we created above, `ents_df`.

If you need a reminder of how to use SQL/pandas, you can refer to the notebooks in [../week_6_clinical_data](../week_6_clinical_data) and [../week_7_terminologies](../week_7_terminologies).
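
As a quick refresher on the aggregate pattern used throughout this section, here is a minimal sketch on a toy table. The table `demo_labs` and its values are invented for this example and have nothing to do with `ents`; only the shape of the query (count per group, sort, keep the top rows) carries over.

```python
import sqlite3

# Toy in-memory table; names and values are hypothetical.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE demo_labs (subject_id INTEGER, test_name TEXT)")
demo_conn.executemany(
    "INSERT INTO demo_labs VALUES (?, ?)",
    [(1, "glucose"), (1, "sodium"), (2, "glucose"),
     (3, "glucose"), (3, "sodium"), (3, "lactate")],
)

# Count rows per group, sort by the count (ties broken by name), keep the top 2.
query = """
SELECT test_name, COUNT(1) AS n
FROM demo_labs
GROUP BY test_name
ORDER BY n DESC, test_name
LIMIT 2
"""
top_tests = demo_conn.execute(query).fetchall()
print(top_tests)
```

The same query string can be passed to `pd.read_sql(query, demo_conn)` if you would rather get a DataFrame back. Giving the count an alias such as `n` also makes the column easier to reference when plotting.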

## I. Label distribution
- Find the counts of **problems**, **treatments**, and **tests** which were extracted from our corpus 
- Plot the count of entity labels in the dataset using a bar graph

In [None]:
query = """
SELECT 
    ____
    ,____
FROM ents
GROUP BY ____
"""

In [None]:
labels = pd.read_sql(query, nlp_conn)

In [None]:
labels

In [None]:
sns.____(x="label_", y="COUNT(1)", data=labels)

## II. Treatment texts
Let's see what treatments are being used in these patient visits.
- Find the 10 most common `"text"` values for **"TREATMENT"** entities
- Plot a horizontal bar graph of the texts and counts. (Horizontal because that will make the labels easier to read)
- Do any of these "treatments" look like NLP mistakes?

In [None]:
query = """

"""

In [None]:
treatments = pd.read_sql(query, nlp_conn) 
treatments.head()

In [None]:
# Plot a horizontal barplot

## III. Problems relevant to a visit
As we saw in the previous notebook, many of the conditions mentioned in a document were not actually experienced by the patient during the hospital stay. That is why we ran **context** to generate attributes such as **is_negated**. Let's now look at all problems in the dataset which are **relevant** to the visit, meaning that all of the context attributes are `False` (i.e., the problem is **not** historical, **not** negated, etc.)

- Write a query which gets all **"PROBLEM"** entities from the database where all of the following columns are **0**:
    - `is_negated`
    - `is_historical`
    - `is_uncertain`
    - `is_family`
    - `is_hypothetical`
- Group them by **"text"** and find the 10 most common **"text"** spans
- Plot a horizontal bar plot showing the counts

In [None]:
query = """

"""

In [None]:
relv_problems = pd.read_sql(query, nlp_conn)

In [None]:
relv_problems.head()

In [None]:
# Plot a horizontal bar plot


## IV. Patient family history
In addition to **excluding** conditions which are not experienced by a patient, context can also help us target conditions which occurred in a patient's family history. While these conditions may not directly affect the patient, they are important to the patient's health because the patient may be at heightened risk for the same condition or for related complications.

In medspaCy, we can detect this by using the `is_family` attribute, or by seeing that an entity occurred in the `family_history` section of a note, which is shown by the `section_category` attribute. 

Let's now find patients with family history of cancer and see what types of cancer they have.

### TODO
- Write a query to get rows where:
    - `label_` is **"PROBLEM"**
    - `is_family` = **1** **OR** `section_category` = **'family_history'**
- Find the 10 most common text spans and plot them with a horizontal bar plot
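
One thing to watch when mixing `AND` and `OR` in a `WHERE` clause: SQL evaluates `AND` before `OR`, so parentheses change which rows come back. The sketch below shows the difference on a toy table; `demo_rows` and its columns are hypothetical and chosen only to make the two results easy to compare.

```python
import sqlite3

# Toy in-memory table; names and values are hypothetical.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE demo_rows (kind TEXT, flag INTEGER, section TEXT)")
demo_conn.executemany(
    "INSERT INTO demo_rows VALUES (?, ?, ?)",
    [
        ("A", 1, "x"),  # kind A with the flag set
        ("A", 0, "y"),  # kind A in section y
        ("A", 0, "x"),  # kind A, neither condition
        ("B", 0, "y"),  # kind B in section y
    ],
)

# No parentheses: parsed as (kind = 'A' AND flag = 1) OR section = 'y',
# so the kind B row is included because it is in section y.
loose = demo_conn.execute(
    "SELECT COUNT(1) FROM demo_rows WHERE kind = 'A' AND flag = 1 OR section = 'y'"
).fetchone()[0]

# Parentheses: kind must be 'A', and at least one of the two conditions must hold.
strict = demo_conn.execute(
    "SELECT COUNT(1) FROM demo_rows WHERE kind = 'A' AND (flag = 1 OR section = 'y')"
).fetchone()[0]

print(loose, strict)
```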

In [None]:
query = """

"""

In [None]:
fh = pd.read_sql(query, nlp_conn)

In [None]:
fh.head()

In [None]:
# Plot a horizontal bar plot