## BMI 6115 Module 3: Pipelines, where the different levels of processing come together
review of downloading and using useful NLTK functions
review of accessing the MIMIC II deceased data set
reading a small number of notes and processing for POS (including histograms and parse trees)
example of crude search across the MIMIC II deceased data set for keywords in the Module to use case of peripheral artery disease (meant to show them how to get a rough idea how many notes might contain some target concepts of interest to stu
   
If you want to use the techniques of biomedical text processing to accomplish some goal, then typically you start with some corpus (a collection of texts of interest to you) and then process it in some project-specific way. A useful way to think about your project is to conceive of it as a pipeline.

## A simple NLP general purpose pipeline
![Pipeline graphic](../../media/m3_levels_of_processing/simple_pipeline_final.png)


### The NLP pipeline paradigm
Almost all NLP systems that do something useful use some form of a *pipeline*. Like any good programming system, a pipeline breaks a big problem into small, manageable tasks. The pipeline shown above has four tasks and these particular tasks are very common in NLP. We will cover pipeline design in more detail later in the course. But for now, note that in Python the simplest pipeline is just a program with several defined functions that are called in order.
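
To make the "functions called in order" idea concrete, here is a minimal sketch that uses only the Python standard library. The four stage names (`read_notes`, `segment_sentences`, `tokenize`, `count_tokens`) and the two toy notes are made up for illustration; they are not necessarily the four tasks in the graphic, and the regular expressions are much cruder than the `NLTK` functions used later in this notebook.

```python
import re
from collections import Counter

def read_notes():
    """Stage 1 (note reader): return raw documents. Toy strings stand in for a database."""
    return ["Pt admitted with chest pain. Pain resolved overnight.",
            "No chest pain today. Pt ambulating."]

def segment_sentences(note):
    """Stage 2 (sentence segmenter): crude split after . ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", note) if s]

def tokenize(sentence):
    """Stage 3 (tokenizer): lower-case the sentence and keep runs of word characters."""
    return re.findall(r"\w+", sentence.lower())

def count_tokens(token_lists):
    """Stage 4 (consumer): tally tokens across all sentences."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    return counts

def run_pipeline():
    """Call the stages in order, feeding each stage's output to the next."""
    sentences = [s for note in read_notes() for s in segment_sentences(note)]
    return sentences, count_tokens(tokenize(s) for s in sentences)

sentences, counts = run_pipeline()
print(len(sentences), "sentences")
print(counts.most_common(3))
```

Because each stage is an ordinary function, any one of them can be swapped for a better implementation without touching the others.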

**This in-class notebook will run through some common tasks that make up typical pipeline components.**

To get started, let's import what we need. If you have worked through the NLTK_basics notebook as required in Module 1, then the `NLTK` commands below should already be available for importing. BUT NOTE: if we had to rebuild the JupyterHub system to add content, then you need to re-load the `NLTK` package.


In [None]:
# skip this cell if you already have executed it in a previous notebook
import nltk
nltk.download() #enter 'd' at 'Downloader>' prompt; then enter 'book' at 'Identifier>' prompt

In [None]:
from nltk.tokenize import sent_tokenize       #imports a sentence splitter
from nltk.tokenize import word_tokenize       #imports NLTK's standard word tokenizer (Treebank-style; splits off most punctuation)
from nltk.tokenize import wordpunct_tokenize  #imports a tokenizer that splits into runs of word characters and runs of punctuation
from nltk.tokenize import WhitespaceTokenizer #used for generating word spans later
from nltk import pos_tag                      #a part-of-speech tagger
import re                                     #imports Python's regular expression functions
#import PyConTextNLP
import pyRuSH


For this session we will be using the [MIMIC II database](https://physionet.org/mimic2/demo/) often. It is a collection of ICU data on about 4,000 patients available in the public domain without needing a Data Use Agreement.

**Some useful code for reading notes from a database (i.e., a simple *Note Reader Process* in the pipeline model):**


In [None]:
import pymysql       #imports the Python mysql module
import pandas as pd  #imports the Python data analysis library 
import getpass       #imports the getpass module


Now let's connect to MIMIC II:

In [None]:
dbconn = pymysql.connect(host="mysql",
                       port=3306,user="jovyan",
                       passwd=getpass.getpass("Enter MySQL passwd for user jovyan"),db='mimic2')
cursor = dbconn.cursor()


Let's look at the tables in MIMIC II that we can access:

In [None]:
tables = pd.read_sql("SELECT table_name FROM information_schema.tables where table_schema='mimic2'", dbconn)
print(tables)


How many notes do we have in MIMIC II?

In [None]:
note_count = pd.read_sql('SELECT count(*) AS n FROM noteevents', dbconn)  #count(*) returns a single row, so no LIMIT is needed
print(note_count['n'][0], "notes in the noteevents table")

Thinking about the **suggested project use case** of extracting concepts related to peripheral artery disease or PAD, let's see how many notes contain 'PAD'.

In [None]:
#note that "text" is the name of the column in noteevents that holds the actual notes
#This query takes about 20 seconds to run, be patient!
print(pd.read_sql("SELECT count(text) from noteevents WHERE text like '%PAD%'",dbconn))


Well, that's encouraging! Ok, let's read in 10 notes that contain the string 'PAD'. 

In [None]:
#note that "text" is the name of the column in noteevents that holds the actual notes
num_notes = cursor.execute("SELECT text from noteevents WHERE text like '%PAD%' limit 10")
print("Read", num_notes,"notes from noteevents.\n")
note_list = []
for note in cursor:                   #grab each row from the SELECT results; each row is a 1-tuple: (text,)
    note_list.append(str(note))       #str() of the tuple keeps its repr, so newlines appear as a literal backslash + n
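
The difference between `str(row)` and `row[0]` matters later when we clean up newlines, so here is a small self-contained sketch of it. It is an illustration under stated assumptions, not the course database: an in-memory SQLite table stands in for the MySQL server, the two notes are made up, and the keyword is passed as a query parameter rather than pasted into the SQL string. Note that `sqlite3` uses `?` as its placeholder, whereas `pymysql` uses `%s`.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL server; both notes are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE noteevents (text TEXT)")
conn.executemany("INSERT INTO noteevents (text) VALUES (?)",
                 [("PAD DOWN TO 20.\nUNABLE TO WEDGE SWAN.",),
                  ("NO BM.\nSKIN INTACT.",)])

# Crude keyword search: the keyword is bound as a parameter, not concatenated.
keyword = "PAD"
rows = conn.execute("SELECT text FROM noteevents WHERE text LIKE ?",
                    (f"%{keyword}%",)).fetchall()

row = rows[0]        # each row is a tuple with one element per selected column
as_repr = str(row)   # repr of the 1-tuple: the newline shows up as backslash + n
as_text = row[0]     # the note itself, containing a real newline character

print(len(rows), "matching note(s)")
print(as_repr)
print(as_text)
conn.close()
```

Taking `row[0]` gives you the note exactly as stored, while `str(row)` wraps it in parentheses and quotes and escapes the newlines.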

#### A simple *Sentence Segmenter* in the pipeline model
Before we explore the clinical text sentences, let's look at `NLTK` running against proper English sentences. In a pipeline approach to NLP, Stage 1 of the pipeline has grabbed the notes we want, so now it's time to extract the sentences from the notes. As an example, copy the Introduction paragraph from the Canvas Module 4 Web page and paste it between the single quotes in the code below; then run the cell:

In [None]:
# The Introduction paragraph from Canvas Module 4 goes between the single quotes below
sentence = 'Manual text annotation is an important part of an NLP system development project. It often is the only way you can create something you can test your clinical information extraction system against. In a nutshell: in a small set of notes you have humans manually annotate (highlight) mentions of the concepts that your system is trying to extract automatically. The human annotations comprise a reference standard (sometimes called the "gold standard"). You can measure the performance of your system by comparing its output with the output generated by humans.'
print(sentence)
sentences = sent_tokenize(sentence)
for i, next_sent in enumerate(sentences, start=1):   #number the sentences starting at 1
    print(i, next_sent)

Great. `NLTK` did a perfect job extracting each sentence from the Introduction paragraph and placing them in a simple list of sentences. 

#### A simple *Sentence Tokenizer and Part of Speech* in the pipeline model
The next level of processing is to take sentences and break them up into useful parts. The simplest way to break down a sentence is to find each of its tokens: words, numbers, and punctuation marks. `NLTK` has a couple of simple tokenizer functions: `word_tokenize()`, which splits on whitespace and also separates most punctuation from words following Penn Treebank conventions; and `wordpunct_tokenize()`, which splits the text into runs of alphanumeric characters and runs of punctuation characters. Let's run both on the fourth sentence in the Introduction. Can you spot the difference between the output from these two functions?

In [None]:
word_tokenize(sentences[3])

In [None]:
wordpunct_tokenize(sentences[3])
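
To see why punctuation handling matters, it can help to compare against plain whitespace splitting. The sketch below uses only the standard library: `str.split()` breaks on whitespace alone, while the regular expression `\w+|[^\w\s]+` approximates what `wordpunct_tokenize()` does (to the best of my knowledge it is the same pattern `NLTK` uses for that tokenizer, but treat that as an assumption). The sentence is the fourth sentence of the Introduction.

```python
import re

text = 'The human annotations comprise a reference standard (sometimes called the "gold standard").'

# Whitespace-only splitting leaves punctuation glued to the neighboring word.
whitespace_tokens = text.split()

# Runs of word characters OR runs of non-word, non-space characters.
wordpunct_like = re.findall(r"\w+|[^\w\s]+", text)

print(whitespace_tokens[-2:])   # ['"gold', 'standard").']
print(wordpunct_like[-4:])      # ['"', 'gold', 'standard', '").']
```

With whitespace splitting, `standard` and `standard").` would be counted as two different tokens, which distorts any downstream frequency count.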


You can start to see how the *sentence tokenizer* component of an NLP pipeline could use tokenizer functions to extract tokens from segments and, say, count how often they appear in a text corpus, or even start to build word vectors used by a downstream pipeline component. Let's do a simple example: build a dictionary that maps each token in the Introduction paragraph to a count of the number of times it appears in the paragraph.

In [None]:
#
#
#
#
#My version: it counts the occurrence of tokens and then lists the 10 most frequent ones
word_count = dict()
for sentence in sentences:
    tokens = word_tokenize(sentence)
    for token in tokens:
        token = token.lower()  # make all the words lower case
        word_count[token] = word_count.get(token, 0) + 1  #if the token already has a count add 1 to it, else start it at 1
t = []
for key, value in word_count.items():
    t.append((value, key))
t.sort(reverse=True)
print('The most common tokens in the Introduction are:')
for freq, word in t[:10]:
    print(word, freq, sep='\t')   
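
The same counting can be written more compactly with `collections.Counter` from the standard library. This is an alternative sketch, not a replacement for the version above: to stay self-contained it tokenizes with a simple regular expression instead of `word_tokenize()`, and it uses only the first two sentences of the Introduction, so the counts will differ from the cell above.

```python
import re
from collections import Counter

paragraph = ("Manual text annotation is an important part of an NLP system "
             "development project. It often is the only way you can create "
             "something you can test your clinical information extraction "
             "system against.")

# Lower-case first, then split into word runs and punctuation runs.
tokens = re.findall(r"\w+|[^\w\s]+", paragraph.lower())

word_count = Counter(tokens)                  # maps token -> number of occurrences
for word, freq in word_count.most_common(3):  # highest counts first
    print(word, freq, sep="\t")
```

One difference to be aware of: `most_common()` keeps tied tokens in the order they were first seen, whereas sorting `(count, token)` tuples in reverse breaks ties in reverse alphabetical order.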

#### Real quick Parsing
Part-of-speech processing and sentence parsing can get messy very quickly. Working in those areas necessarily means studying formal language and grammar theory. But our lab showed that about 50% of segments in clinical text are not composed in a proper grammar like standard English **anyway**, so parsing is of limited use in clinical NLP. The `pos_tag()` function used below is a quick and dirty way to generate part-of-speech tags that are reasonably good.

To get a feel for what parsers can do, navigate to the [Online Stanford Parser](http://nlp.stanford.edu:8080/parser/) and enter a few of the sentences from the MIMIC II example below. Try "HTN, DM,PVD, ADMITTED W/ SOB, CP." and "PT VERY SOMNOLENT AT BEGINNING OF SHIFT." from `sentences[1]` **below** (note: the current `sentences[]` list still holds the Introduction sentences). Note that I corrected two misspellings ("SOMULENT" and "BEGING"). Some sophisticated machine learning techniques use POS and parse results in their modeling, but we don't plan to cover that in this course.

Try this simple part-of-speech tagger:

In [None]:
pos_tag(word_tokenize(sentences[3]),tagset='universal')

The parameter `tagset` tells the part-of-speech generator to use a simple style for POS:

|Tag  |Meaning             |English Examples                        |
|:----|:------------------:|---------------------------------------:|
|ADJ  |adjective           |new, good, high, special, big, local    |
|ADP  |adposition          |on, of, at, with, by, into, under       |
|ADV  |adverb              |really, already, still, early, now      |
|CONJ |conjunction         |and, or, but, if, while, although       |
|DET  |determiner, article |the, a, some, most, every, no, which    |
|NOUN |noun                |year, home, costs, time, Africa         |
|NUM  |numeral             |twenty-four, fourth, 1991, 14:24        |
|PRT  |particle            |at, on, out, over per, that, up, with   |
|PRON |pronoun             |he, their, her, its, my, I, us          |
|VERB |verb                |is, say, told, given, playing, would    |
|.    |punctuation         |. , ; !                                 |
|X    |other               |ersatz, esprit, dunno, gr8, univeristy  |
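
Once tokens are tagged, a quick histogram of the tags gives a rough picture of a text. The sketch below shows one way to do that with `collections.Counter`. The `(token, tag)` pairs are hand-written in the shape that `pos_tag()` returns; they are an illustrative assumption, not actual tagger output, so substitute the result of the `pos_tag()` cell above when you try it.

```python
from collections import Counter

# Hand-written (token, tag) pairs in the shape pos_tag() returns; NOT real tagger output.
tagged = [("The", "DET"), ("human", "ADJ"), ("annotations", "NOUN"),
          ("comprise", "VERB"), ("a", "DET"), ("reference", "NOUN"),
          ("standard", "NOUN"), (".", ".")]

# Count how often each tag occurs, ignoring the tokens themselves.
tag_counts = Counter(tag for _, tag in tagged)

# Print a simple text histogram, most frequent tag first.
for tag, n in tag_counts.most_common():
    print(f"{tag:<5}{'#' * n} {n}")
```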



We won't go into detail now, but note that when dealing with annotated files like those in the next course Module, we are interested in the span of words. Usually this is a pair of character offsets measured from the first character of a clinical note. For example, the word "annotation" in the first sentence of the Introduction starts at character offset 12 and ends just before offset 22, so its span is (12, 22). `NLTK` can help us with tasks like determining word spans:

In [None]:
print(sentences[0])
list(WhitespaceTokenizer().span_tokenize(sentences[0]))
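
If you want to see what a span calculation involves without `NLTK`, the following standard-library sketch produces the same kind of `(start, end)` pairs for whitespace-separated tokens by matching runs of non-whitespace characters. It is meant as an illustration of the idea, not as a description of how `NLTK` implements `span_tokenize()`. The sentence is the first sentence of the Introduction.

```python
import re

sentence = "Manual text annotation is an important part of an NLP system development project."

# Each match of \S+ is one whitespace-delimited token; end() is exclusive.
spans = [(m.start(), m.end()) for m in re.finditer(r"\S+", sentence)]

start, end = spans[2]          # third token
print(spans[:3])               # [(0, 6), (7, 11), (12, 22)]
print(sentence[start:end])     # annotation
```

Because the end offset is exclusive, slicing the original text with `sentence[start:end]` recovers the token exactly, which is what annotation tools rely on.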


Okay, now let's return to the real, and messy, world of clinical text. The first note we fetched earlier, note_list[0], looks like this when printed for humans:

CCU NURSING PROGRESS NOTES
S:"MY THROAT IS SORE"

O:PT IS A 87YR OLD W/ CRF, SP AV GRAFT  [**9-19**]. HTN, DM,PVD, ADMITTED W/ SOB, CP. PT HAD ISCHEMIC CP, PUMP FAILURE, NO RESPONSE TO ANTIANGINALS, DIURETICS. TO CATH LAB YESTERDAY, LAD, OM AND DIAG STENTED. PCWP UP, IABP PLACED.

CV:HEMODYNAMICALLY IMPROVED, IABP 1:1W/ GOOD AUGMENTATION, SYS UNLOADING [**5-30**], DIA UNLOADING 2-15MMHG. MOST RECENT CO/CI 5.3/CI 2.96, SVR 1042. PAD DOWN TO 20. UNABLE TO WEDGE SWAN THIS AM. AT 0500 SWAN OUT IN RV, AND ADVANCED BY DR. [**Last Name (STitle) 88**], POSITION CONFIRMED BY CXR.  PT TOL LOW DOSE LOPRESSOR. PRESENTLY PT OFF IV NTG, IV HEPARIN TITRATED AS PER SS, NOW AT 500UNITS/HR. RIGHT GROIN W/ SM AMT OF OOZING AT SITE. BOTH DP PULSES AUDIBLE W/ [**Last Name (un) 89**], L RADIAL PALP. TOES AND HEELS MOTTLED/ DUSKY, WARM TO TOUCH.
CVVHD/ABLE TO PULL 200CC/HR. NOW PULLING 50CC PER/HR. SEE CAREVUE FOR FULL I/O DETAILS.
RESP:LUNS COARSE LUL, CLEAR RUL, BOTH BASES W/BRONCHIAL BS. O2 WEANED TO 40% THIS AM ABG 7.32/44/91/24, SAT 97%. PT RECEIVED BICARB 1 AMP LAST EVENING FOR PH OF 7.30 PER RENAL. PEDAL EDEMA NOTED.

GI:HYPO , COFFEE GROUND COLORED NG ASPIRATE, GUIAC POS 100CC. NO BM.\\nGU:MIN U/O, WAS BR COLOR NOW MORE STRAW COLOR, REMAINS OLIGURIC.

SKIN:DUODERM INTACT ON COCCUX.

NEURO:PT VERY SOMULENT AT BEGING OF SHIFT. BUT EASILY AROUSABLE ORIENTED X3, PT MORE AWAKE, THIS AM AND C/O RIGHT LEG PAIN, MORPHINE GIVEN.

___
So `NLTK` should tokenize these sentences properly, right? Let's try segmenting the first note:

In [None]:
sentences = sent_tokenize(note_list[0])
for i, sentence in enumerate(sentences):   #number the sentences starting at 0
    print(i, ":", sentence, "\n")

The `sent_tokenize` function really did a pretty good job. It wasn't confused by the periods that occur in numbers like "ABG 7.32". There are some newlines that snuck through, like in "\nCVVHD/ABLE TO PULL 200CC/HR." How would you strip those out?

In [None]:
#
#
#
#
# Here's a simple way: split on the literal two-character sequence backslash + n
s = sentences[15]
print(s.split('\\n'))
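
Splitting gives you a list of pieces. If you would rather get back a single cleaned string, one option is to replace the newline markers with spaces and then collapse repeated whitespace. The helper name `strip_newlines` and the sample string are made up for this sketch; the function handles both the literal backslash + n pair seen in our notes and real newline characters.

```python
import re

def strip_newlines(text):
    """Replace literal backslash-n pairs and real newlines with spaces, then collapse whitespace."""
    text = text.replace("\\n", " ").replace("\n", " ")
    return re.sub(r"\s+", " ", text).strip()

# Toy input modeled on the note above; '\\n' is a backslash followed by the letter n.
s = "PEDAL EDEMA NOTED.\\n\\nGI:HYPO , COFFEE GROUND COLORED NG ASPIRATE"
print(strip_newlines(s))
```

A caveat on this design: replacing every backslash + n pair would also alter text that legitimately contains those two characters, which is rare in clinical notes but worth keeping in mind.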