# Phrases

So far we have only thought in terms of single words: "lower", "lobe", "University", "of", "Utah". But in reality, multiple words often form one unit of thought: "University of Utah". Our word vectors will do a better job of representing our text if we first recognize these phrases. We are going to use the [gensim](https://radimrehurek.com/gensim/models/phrases.html) package to detect and transform these phrases.

For example, the sentence, "I am a faculty member in the departments of Biomedical Informatics and Radiology and Imaging Sciences at the University of Utah." would be transformed to "I am a faculty member in the departments of Biomedical_Informatics and Radiology_and_Imaging_Sciences at the University_of_Utah."
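To make the transformation concrete, here is a minimal sketch that joins phrases from a hand-written list. The function `join_phrases` and the `phrases` set are hypothetical names used only for illustration; a trained detector would learn the phrases from data rather than being handed them.

```python
def join_phrases(tokens, phrases):
    """Greedily join the longest known phrase starting at each position."""
    max_len = max((len(p) for p in phrases), default=1)
    out, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first, down to two-word phrases.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in phrases:
                out.append("_".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No phrase starts here, so keep the single word.
            out.append(tokens[i])
            i += 1
    return out

# Hypothetical, hand-written phrase list (an assumption for this example).
phrases = {
    ("Biomedical", "Informatics"),
    ("Radiology", "and", "Imaging", "Sciences"),
    ("University", "of", "Utah"),
}
sentence = ("the departments of Biomedical Informatics and Radiology and "
            "Imaging Sciences at the University of Utah")
joined = " ".join(join_phrases(sentence.split(), phrases))
print(joined)
```

Trying the longest match first keeps "Radiology and Imaging Sciences" together as one unit.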

"Biomedical_Informatics is an example of a **bigram phrase** and "University_of_Utah" is a **trigram phrase**. I guess "Radiology_and_Imaging_Sciences" is a quadgram phrase, but we will likely not try to detect phrases that long.

# Using the Gensim Phrases Module

In [None]:
%matplotlib inline

In [None]:
from nose.tools import assert_almost_equal, assert_true, assert_equal, assert_raises
from numbers import Number

## Upgrade to the latest version of gensim

In [None]:
#!conda install gensim -y

In [None]:
import pymysql
import pandas as pd
import getpass
from textblob import TextBlob
import re
from gensim.models.phrases import Phraser, Phrases
from IPython.display import clear_output, display, HTML
import seaborn as sns

In [None]:
import gensim
gensim.__version__

In [None]:
conn = pymysql.connect(host="mysql",
                       port=3306,user="jovyan",
                       passwd=getpass.getpass("Enter MySQL passwd for jovyan"),db='mimic2')
cursor = conn.cursor()

## Select Some Text from the MIMIC2 Database

In [None]:
rad_data = \
pd.read_sql("""SELECT noteevents.subject_id, 
                      noteevents.hadm_id,
                      noteevents.text 
               FROM noteevents
               WHERE noteevents.category = 'RADIOLOGY_REPORT' LIMIT 5000""",conn)
rad_data.head(5)

In [None]:
rad_data.shape

### Define Regular expressions for data cleansing

* Write a regular expression to replace dates in the reports with ``[**DATE**]``
* Write a regular expression to replace times in the reports with ``[**TIME**]``
* Write a regular expression to replace each digit with "d" (e.g. "43 cc" would become "dd cc")

#### Hints, etc.

* Look at some sample reports to see what dates and times look like in the reports
* What order would you need to apply the regular expressions?
* Could we just use the digit recognizer and skip the date and time strippers?
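As a starting point for the hints above, here is a rough sketch of the three substitutions and why their order matters. The date and time formats are assumptions (numeric dates such as `2682-8-25` or `8/25/2682`, times such as `10:30` or `10:30 AM`); look at real reports before relying on these patterns. The names `date_re`, `time_re`, `digit_re` and `scrub` are hypothetical.

```python
import re

# Assumed formats only; inspect real reports before relying on these patterns.
date_re = re.compile(r"\b\d{1,4}[-/]\d{1,2}[-/]\d{1,4}\b")
time_re = re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?(?:\s*[ap]m\b)?", re.IGNORECASE)
digit_re = re.compile(r"\d")

def scrub(txt):
    """Replace dates, then times, then any remaining digits."""
    txt = date_re.sub("[**DATE**]", txt)
    txt = time_re.sub("[**TIME**]", txt)
    return digit_re.sub("d", txt)

sample = "CT on 2682-8-25 at 10:30 AM showed a 43 cc collection."
print(scrub(sample))

# Applying the digit pattern first destroys what the date pattern looks for.
wrong_order = date_re.sub("[**DATE**]", digit_re.sub("d", sample))
print(wrong_order)
```

Running the digit replacement last is what lets the date and time patterns still see real digits.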

In [None]:
rd = re.compile(r"""\d""")

### Write a function to pre-process our text

* Lower case?
* Digits?
* Strip dates/times?

### But first, write unit tests to test whether `preprocess` is functioning correctly
#### Then write functionality to pass tests

You might want to use the `string` module

In [None]:
import string
string.ascii_uppercase

In [None]:
def preprocess(txt):
    pass

In [None]:
assert_true???

In [None]:
assert_equal???

In [None]:
assert_raises???
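One possible shape for the tests and the function, written with plain `assert` statements so it runs without `nose`. This is only a sketch: `preprocess_sketch` is a hypothetical name, and the choice to raise `TypeError` on non-string input is an assumption, not a requirement stated in this notebook.

```python
import re

_digit = re.compile(r"\d")

def preprocess_sketch(txt):
    """Lower-case the text and replace every digit with 'd'."""
    if not isinstance(txt, str):
        # Assumed behavior: reject anything that is not a string.
        raise TypeError("preprocess_sketch expects a string")
    return _digit.sub("d", txt.lower())

# Equality check: case is folded.
assert preprocess_sketch("Lower LOBE") == "lower lobe"
# Equality check: digits are masked.
assert preprocess_sketch("43 cc") == "dd cc"
# Truth check: no digit survives.
assert not any(ch.isdigit() for ch in preprocess_sketch("T1 and T2"))
# Exception check: non-string input is rejected.
try:
    preprocess_sketch(None)
except TypeError:
    pass
else:
    raise AssertionError("expected TypeError for non-string input")
```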

## Create a TextBlob from all the text in `rad_data["text"]`

In [None]:
blob = TextBlob(preprocess(" ".join(rad_data["text"])))


## Write a function `train_phrases` that will train bigram and trigram detectors

* We want to be able to ignore common terms in our phrase detection
* We want to be able to specify the minimum number of occurrences in our text to be considered a phrase
* Return a dictionary of detectors
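To build intuition for what a minimum count does, here is a simplified pure-Python sketch of count-based bigram scoring. It is modeled on the formula from Mikolov et al. that gensim documents as its default scorer, but it is not gensim's code and it ignores common terms entirely. The function name `find_bigrams` and the toy sentences are assumptions made for illustration.

```python
from collections import Counter

def find_bigrams(sentences, min_count=2, threshold=1.0):
    """Return {(a, b): score} for adjacent word pairs that pass both cut-offs."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    vocab_size = len(unigrams)
    scores = {}
    for (a, b), n_ab in bigrams.items():
        if n_ab < min_count:
            continue  # too rare to be trusted as a phrase
        # Pairs that co-occur more often than their parts suggest score higher.
        score = (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        if score > threshold:
            scores[(a, b)] = score
    return scores

# Toy corpus (an assumption, not MIMIC2 data).
sentences = [
    ["right", "lower", "lobe", "opacity"],
    ["left", "lower", "lobe", "clear"],
    ["lower", "lobe", "effusion"],
    ["no", "effusion", "in", "the", "lower", "lobe"],
]
found = find_bigrams(sentences, min_count=2, threshold=1.0)
print(found)
```

Trigrams can be found the same way by joining the detected bigrams into single tokens and running the scoring a second time.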

### Write unit tests to determine whether `train_phrases` is working as expected

In [None]:
def train_phrases(blob, common_terms=None, min_count=5):
    pass
        

In [None]:
common_terms = ["of", "with", "without", "and", "or", "the", "a"]
generators = train_phrases(blob, common_terms=common_terms, min_count=5)

### Write a function that takes a `TextBlob` instance and phrase generators and returns a string of text
#### Unit tests first

In [None]:
def get_phrased_text(blob, generators):
    pass

In [None]:
phrased_txt = get_phrased_text(blob, generators)

## What phrases did we detect?

In [None]:
found_phrases = set([w for w in phrased_txt.split() if "_" in w])
print(len(found_phrases))

### How often did each phrase occur?

In [None]:
from collections import ???

In [None]:
counted_phrases = None

In [None]:
def sorted_counter(cntr):
    lcntr = list(cntr.items())
    lcntr.sort(key=lambda x:x[1], reverse=True)
    return lcntr

In [None]:
lcounted_phrases = sorted_counter(counted_phrases)

In [None]:
for phrase, count in lcounted_phrases:
    print("%s\t%03d"%(phrase.ljust(40),count))
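The count-then-sort pattern above can also be sketched with `collections.Counter`, whose `most_common()` method returns `(item, count)` pairs sorted by descending count. The string `phrased_example` is a made-up stand-in for `phrased_txt`.

```python
from collections import Counter

# Made-up stand-in for phrased_txt (an assumption for this example).
phrased_example = "lower_lobe opacity in the right lower_lobe and pleural_effusion"

# Count only the tokens that contain the phrase delimiter.
phrase_counts = Counter(w for w in phrased_example.split() if "_" in w)

for phrase, count in phrase_counts.most_common():
    print("%s\t%03d" % (phrase.ljust(40), count))
```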


## Create a word vector vocabulary using only words and phrases that occur more than N times
### How to choose N?

### What is our vocabulary from phrased_txt (how many unique words)?

Why use `TextBlob.words` instead of just `phrased_txt.split()`?
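One reason is punctuation: `str.split()` only breaks on whitespace, so "effusion." and "effusion," count as different words. The sketch below uses a simple regular expression as a stand-in tokenizer. That stand-in is an assumption and is not how TextBlob tokenizes, but it shows the effect on vocabulary size.

```python
import re

text = "No pleural_effusion. Small pleural_effusion, unchanged."

# Whitespace splitting leaves punctuation attached to the tokens.
split_vocab = set(text.lower().split())

# Stand-in tokenizer (not TextBlob's): keep runs of word characters,
# which includes the underscore used as the phrase delimiter.
token_vocab = set(re.findall(r"\w+", text.lower()))

print(len(split_vocab), sorted(split_vocab))
print(len(token_vocab), sorted(token_vocab))
```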

#### Why is `phrased_blob = TextBlob(phrased_txt)` fast and `print(len(set(phrased_blob.words)))` slow?

In [None]:
phrased_blob = TextBlob(phrased_txt)

In [None]:
print(len(set(phrased_blob.words)))

In [None]:
phrased_blob_count = None


In [None]:
phrased_blob_count[:100]

### Based on these most frequent words, create a list of stop words to drop from our vocabulary

In [None]:
stop_words = []

### What are our infrequent words?

In [None]:
phrased_blob_count[-2000:-1000]

In [None]:
sns.distplot([c[1] for c in phrased_blob_count if c[1] > 500])

In [None]:
len([w for w in phrased_blob_count if w[1]>10])

In [None]:
vwords = [w for w in phrased_blob_count if w[1]>0 and w[0] not in stop_words]

In [None]:
vocabulary = {}
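One way the vocabulary could be filled in is as a word-to-index mapping. This is a sketch under assumptions: the input is a list of `(word, count)` pairs sorted by descending count, like `phrased_blob_count`, and `build_vocabulary`, `example_counts` and the cut-off of 5 are hypothetical choices rather than values taken from the data.

```python
def build_vocabulary(counts, min_count=1, stop_words=()):
    """Map each kept word to an integer index, in the order given."""
    stop = set(stop_words)
    kept = [w for w, c in counts if c >= min_count and w not in stop]
    return {word: idx for idx, word in enumerate(kept)}

# Made-up counts (an assumption for this example).
example_counts = [("the", 120), ("lower_lobe", 14), ("effusion", 9), ("atelectatic", 1)]
vocab = build_vocabulary(example_counts, min_count=5, stop_words=["the"])
print(vocab)
```

Raising `min_count` shrinks the vocabulary by dropping rare words, which is the trade-off behind choosing N.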

### Determining Similarity Between Reports
* CXR vs CT vs MR
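As a simple baseline before using word vectors, two reports can be compared by the cosine similarity of their word counts. This sketch is one possible approach, not necessarily the method intended here, and the three short "reports" are invented examples.

```python
import math
from collections import Counter

def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two bag-of-words count vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Invented example reports (assumptions, not MIMIC2 text).
cxr = "chest_x-ray shows clear lungs".split()
ct = "ct of the chest shows clear lungs".split()
mr = "mri of the brain with gadolinium".split()

print(cosine_similarity(cxr, ct))
print(cosine_similarity(cxr, mr))
print(cosine_similarity(ct, mr))
```

Shared function words such as "of" and "the" inflate the CT versus MR score, which is one motivation for the stop word list built earlier.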

In [None]:
rad_data[rad_data["text"].str.contains("MRI")]

## Create a Report Browser

In [None]:
num_reports = rad_data.shape[0]
while True:
    try:
        i = int(input("Enter a number between 0 and %d (anything else quits): " % (num_reports - 1)))
        clear_output()

        if i < 0 or i >=num_reports:
            break
        txt = TextBlob(rd.sub("""d""", rad_data.iloc[i]['text'].strip().lower()))
        display(HTML("<p>%s</p>"%" ".join(trigram_generator[bigram_generator[txt.tokens]])))
        
    except ValueError:
        break


In [None]:
type(txt)

## Wrangling Doesn't Always Do What You Want

>technique : multiplanar_td and td-weighted_images of the brain with gadolinium_according to standard departmental protocol .
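A likely cause of the odd "td" tokens is the digit replacement. The source text below is an assumption about what the report said (T1 and T2 are common MRI sequence names); the sketch shows how masking digits makes the two terms indistinguishable before phrase detection ever runs.

```python
import re

# Assumed original wording; the actual report text is not shown here.
original = "multiplanar t1 and t2-weighted images"

# The same digit masking used in preprocessing.
collapsed = re.sub(r"\d", "d", original)
print(collapsed)
```

Once both sequence names become "td", the phrase detector has no way to tell them apart.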