In [None]:
import pymysql
import pandas as pd
import getpass

import re
import matplotlib.pyplot as plt

In [None]:
conn = pymysql.connect(host="35.233.174.193",port=3306,
                       user="jovyan",passwd=getpass.getpass("Enter password for MIMIC2 database"),
                       db='mimic2')

# Clinical Notes
Free-text clinical notes are one of the most information-rich sources of clinical data. These are written narratives which discuss many aspects of a patient's care, such as diagnoses, history, treatment, and complications. Free-text notes offer very detailed accounts of a patient's clinical course, making them very useful for many purposes. However, free text is not inherently meaningful to a computer. Unlike ICD-9 or LOINC codes, clinical language does not offer standardized representations of clinical concepts, and unlike vitals and labs, the information found in text cannot be easily quantified and represented in a computable way.

The field of **natural language processing** offers methods for dealing with text and extracting **structured information** from an **unstructured data source**. Later in the semester we'll have a module specifically devoted to NLP. Today we'll look at a couple of text processing methods which give us some insight into the information contained in text.

## Note types
There are many different types of notes in the clinical domain. Each note type contains different information, often specific to clinical specialties like cardiology or surgery. One such specialty is radiology. MIMIC-II contains a large number of **radiology reports**, each of which contains a radiologist's interpretation of an image. For example, if a physician suspects a patient has pneumonia, they might order a CT scan. The radiologist will view the image from the scan and determine whether or not the patient has pneumonia.

## Querying notes
Clinical notes are contained in the table `noteevents`. We'll specifically query radiology notes. Additionally, since there are a large number of notes we'll limit our queries to only look at 1000 notes, although later we'll use a language model which was trained using all of the radiology notes in the database.

### TODO
Finish the query below to query the `noteevents` table and limit the results to notes where the category is "RADIOLOGY_REPORT".

In [None]:
query = """
select text from noteevents
where ___ = '___'
limit 1000;
"""

In [None]:
df = pd.read_sql(query, conn)

In [None]:
len(df)

In [None]:
df.head()

Let's take a quick look at what one of these notes looks like:

In [None]:
print(df['text'].iloc[0])

# Keyword search
One of the most basic things we can do with free text is a **keyword search**. Similar to a Google search, we want to find the set of documents which contain a specific phrase. In SQL, we can do this with the `like` operator, which allows you to use wildcards: `%` matches any sequence of characters. For example, the SQL clause `where text like '%adve%'` would return documents containing the words "adventure", "adventures", "advertisement", "advertise", and so on.
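
To see the wildcard behavior in isolation, here is a minimal sketch that uses Python's built-in `sqlite3` module instead of the MIMIC-II database. The table `docs`, its rows, and the name `toy_conn` are made up for illustration, and pattern matching details such as case sensitivity can differ between database engines.

```python
import sqlite3

# In-memory toy table standing in for a notes table (hypothetical data).
toy_conn = sqlite3.connect(":memory:")
toy_conn.execute("create table docs (id integer, text text)")
toy_conn.executemany(
    "insert into docs values (?, ?)",
    [
        (1, "An adventure in the mountains"),
        (2, "A new advertisement aired today"),
        (3, "Nothing relevant here"),
    ],
)

# % matches any run of characters (including none), so '%adve%' matches
# any text that contains the substring "adve".
rows = toy_conn.execute(
    "select id from docs where text like '%adve%' order by id"
).fetchall()
matching_ids = [row[0] for row in rows]
print(matching_ids)
toy_conn.close()
```

Only the first two rows contain "adve", so only their ids are returned.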

### TODO
Limit the query below to only return documents where the text contains the word "pneumonia".

In [None]:
query = """
select subject_id, text from noteevents
where category = 'RADIOLOGY_REPORT'
    and text like '___'
limit 100
"""

In [None]:
df = pd.read_sql(query, conn)

In [None]:
df.head()

Read through a few of the documents. Where is pneumonia discussed? Do the patients actually have pneumonia? If not, why is it being mentioned?

In [None]:
print(df.iloc[0]['text'])
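
Reading whole reports can be slow. One possible shortcut is to print only the text surrounding each mention of the keyword. The helper `keyword_in_context` below is a hypothetical function written for this sketch, and the example sentence is made up rather than taken from MIMIC-II.

```python
import re

def keyword_in_context(text, keyword, width=40):
    """Return snippets of up to `width` characters on each side of each match."""
    snippets = []
    for match in re.finditer(re.escape(keyword), text, flags=re.IGNORECASE):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(text))
        snippets.append(text[start:end])
    return snippets

# Made-up example text, not taken from MIMIC-II.
example = (
    "REASON FOR EXAM: Fever and cough. Please evaluate for pneumonia. "
    "IMPRESSION: No evidence of pneumonia."
)
snippets = keyword_in_context(example, "pneumonia", width=25)
for snippet in snippets:
    print(repr(snippet))
```

In the notebook, the same helper could be called on `df.iloc[0]['text']` to check whether a mention is a finding, a question, or a negation.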

## Text processing
Before doing any sort of computation with these texts, there are a number of steps to take to make the data easier to work with. Clinical text is **very** messy: it is very inconsistent, confusing, and often ugly. **Preprocessing** is where we clean up the text a little bit. Some possible steps for preprocessing include:
- Converting the text to lower case
- Removing **"stop words"**: words or phrases which occur so often that they don't contain any useful information ("and", "or", "the", etc.)
- Merging phrases of 2 or more words
- Splitting texts into **"tokens"** (i.e., individual words)
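
As a rough illustration of these steps, here is a minimal sketch using only the standard library. The functions `simple_preprocess` and `simple_tokenize` and the tiny `STOP_WORDS` set are hypothetical stand-ins written for this example. They are not the `text_processing` module used in this notebook, and phrase merging is left out.

```python
import re

# A tiny illustrative stop word list; real lists are much longer.
STOP_WORDS = {"and", "or", "the", "a", "of", "is", "in"}

def simple_preprocess(text):
    """Lowercase the text and collapse runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def simple_tokenize(text, rm_stopwords=True):
    """Split lowercased text into word tokens, optionally dropping stop words."""
    tokens = re.findall(r"[a-z0-9]+", text)
    if rm_stopwords:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens

# Made-up example sentence.
raw = "The lungs are CLEAR,\n  and the heart is normal in size."
cleaned = simple_preprocess(raw)
tokens = simple_tokenize(cleaned)
print(cleaned)
print(tokens)
```

Note that stop word lists need care in clinical text: a word like "no" is very frequent but carries important meaning, so it is deliberately kept out of the list here.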

In [None]:
import text_processing

In [None]:
from importlib import reload
reload(text_processing)

In [None]:
# Run this command to download some data needed for text processing:
import nltk
nltk.download('punkt')

In [None]:
# Original texts
texts = df['text']

In [None]:
# Preprocessed texts
preprocessed_texts = [text_processing.preprocess(text) for text in texts]

In [None]:
# Tokenized texts
tokenized_texts = [text_processing.tokenize(text, rm_stopwords=True) for text in preprocessed_texts]

In [None]:
print(texts[0])

In [None]:
print(preprocessed_texts[0])

In [None]:
print(tokenized_texts[0])

# Simple word counts
Now that we've preprocessed our text, we can do some very basic operations on it. Let's count how many times each word occurs and see what the most frequent words are. This will be useful for getting a high-level sense of what information is in our corpus.

In [None]:
from collections import defaultdict
counter = defaultdict(int)

In [None]:
for tokens in tokenized_texts:
    for token in tokens:
        counter[token] += 1

In [None]:
srtd_word_counts = sorted(counter.items(), key=lambda x:x[1], reverse=True)
srtd_word_counts
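
The counting and sorting steps above can also be written with `collections.Counter`, which provides a `most_common` method. This is a sketch on made-up tokens; `toy_tokenized_texts` and `toy_counter` are hypothetical names used so that the real variables are not overwritten.

```python
from collections import Counter

# Toy tokenized documents standing in for `tokenized_texts` (made-up tokens).
toy_tokenized_texts = [
    ["chest", "clear", "effusion"],
    ["chest", "pneumonia"],
    ["chest", "effusion"],
]

toy_counter = Counter()
for tokens in toy_tokenized_texts:
    toy_counter.update(tokens)  # add one count per token in this document

# The two most frequent words with their counts, highest count first.
top_words = toy_counter.most_common(2)
print(top_words)
```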

One nice way to visualize this is with a wordcloud:

In [None]:
from wordcloud import WordCloud

In [None]:
wordcloud = WordCloud()
wordcloud.generate_from_frequencies(counter)

In [None]:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

# Word2Vec
Now, let's get into something a little more sophisticated. In the steps above, all we did was iterate through the documents and count how many times each word occurred. This gives us a high-level sense of what words are in this vocabulary, but it doesn't tell us anything about the **meaning** or **semantics** of these words.

In this next exercise we'll look at how we can use **machine learning** to generate some **semantic meaning** from the text. A method called **word embeddings** transforms words, which by default have no computational or semantic meaning, into vectors which contain some representation of what the words mean. We won't go into the details here, but a quick summary is that we look at the **context** of a word, the words near a target word, to estimate what it means. Words which occur in similar contexts probably mean similar things.

For example, consider these 3 sentences:
- "I have a **dog** for a pet"
- "I have a **cat** for a pet"
- "I have a **fish** for a pet"

Since the context around "dog", "cat", and "fish" is very similar, they are probably similar semantically. We translate words into vectors, and words which have similar meanings have similar vectors. These vectors are called **word embeddings** and we can use them to measure the similarity between different words. If you're interested in learning more, here's a tutorial to get you started with word embeddings: https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa
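
To make the idea of "similar contexts give similar vectors" concrete, here is a toy sketch that represents each word by counts of its neighboring words and compares those count vectors with cosine similarity. This is a simple count-based illustration, not the word2vec algorithm. The functions `context_counts` and `cosine_similarity` are hypothetical helpers, and the fourth sentence is an added contrast example.

```python
import math
from collections import Counter

def context_counts(sentences, target, window=2):
    """Count the words that appear within `window` positions of `target`."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        for i, word in enumerate(words):
            if word == target:
                left = words[max(i - window, 0):i]
                right = words[i + 1:i + 1 + window]
                counts.update(left + right)
    return counts

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

sentences = [
    "I have a dog for a pet",
    "I have a cat for a pet",
    "I have a fish for a pet",
    "I drive a car to work",  # hypothetical contrast sentence
]

dog = context_counts(sentences, "dog")
cat = context_counts(sentences, "cat")
car = context_counts(sentences, "car")

dog_cat = cosine_similarity(dog, cat)
dog_car = cosine_similarity(dog, car)
print("dog vs cat:", round(dog_cat, 3))
print("dog vs car:", round(dog_car, 3))
```

"dog" and "cat" share exactly the same neighbors, so their similarity is at the maximum, while "dog" and "car" share only the word "a" and score much lower.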

One algorithm for generating word embeddings is called `word2vec`. I pretrained a word2vec model on all of the MIMIC-II radiology reports and saved it as a pickle file. Let's read this model in and do some experiments:

In [None]:
import pickle

In [None]:
with open('./trained_word2vec.pkl', 'rb') as f:
    model = pickle.load(f)

Let's pick a target word which occurs in our vocabulary: **"abdomen"**. First, let's see what the embedding for "abdomen" looks like:

In [None]:
model.wv['abdomen']

Now, let's take two other words: "chest" and "radiograph". Which do you think is more similar to abdomen?

In [None]:
print("Similarity between 'abdomen' and 'chest':", model.wv.similarity('abdomen', 'chest'))

In [None]:
print("Similarity between 'abdomen' and 'radiograph':", model.wv.similarity('abdomen', 'radiograph'))

Let's find what terms our model thinks are most similar to "abdomen":

In [None]:
# Query the word vectors through `model.wv`, as in the other cells
model.wv.most_similar(['abdomen'], topn=10)

Pretty cool! This is an example of how machine learning can be used to derive insights from raw data. It also shows how we can transform raw text, which lacks any defined structure or semantics, and generate some meaning out of it. 

### TODO
**Vocabulary expansion**
Let's look at two concept classes: *medications* and *diagnoses*. Let's say that we know 1-2 words for each class, but we want to come up with a more complete list. Rather than asking a physician to list all of the medications and diseases that they know, can we use word embeddings to find similar words?

Below I've given seed words for each class. Go through the suggestions from word2vec and see how many of each class you can identify using similarity metrics with word2vec. Note that you can give the model a list of words and it will find words which are similar to all of the words in that list, which can help guide your model to find the most similar terms. 

As you're doing this, consider what kinds of words are being returned. Are they similar to the seed words you're starting with? How are they related? Try doing some other classes as well.

You can google abbreviations or words you don't know to see what they mean.
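
To build some intuition for what passing a list of seed words can mean, here is a sketch of one common approach: average the unit-length seed vectors and rank the remaining words by cosine similarity to that average. The exact computation inside the library is not shown in this notebook and may differ in detail. The function `most_similar_to_seeds` and the 2-dimensional vectors are made up for illustration.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def most_similar_to_seeds(embeddings, seeds, topn=2):
    """Rank non-seed words by cosine similarity to the mean of the seed vectors."""
    unit = {word: normalize(vec) for word, vec in embeddings.items()}
    dims = len(next(iter(unit.values())))
    mean = [sum(unit[s][d] for s in seeds) / len(seeds) for d in range(dims)]
    query = normalize(mean)
    scores = [
        (word, sum(q * x for q, x in zip(query, vec)))
        for word, vec in unit.items()
        if word not in seeds
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up 2-dimensional vectors; real embeddings come from the trained model.
toy_embeddings = {
    "heparin": [0.9, 0.1],
    "coumadin": [0.8, 0.3],
    "warfarin": [0.85, 0.2],
    "pneumonia": [0.1, 0.9],
    "effusion": [0.2, 0.8],
}

ranked = most_similar_to_seeds(toy_embeddings, ["heparin", "coumadin"], topn=2)
print(ranked)
```

With these toy vectors, the word lying between the two seeds ranks first, which is why adding a second seed word can steer the results toward the concept class you have in mind.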

In [None]:
# Diagnoses
model.wv.most_similar(['pneumonia'], topn=20)

In [None]:
# Medications
model.wv.most_similar(['heparin', 'coumadin'], topn=20)