<a href="https://colab.research.google.com/github/christinium/AIMed_Workshop_2018/blob/master/MIT%20Tutorial%20-%20Part%20B%20-%20Regular%20Expressions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Part B: Fun With Regular Expressions!

In this section, we will learn what a regular expression is and use our newly learned skills to get information from notes in MIMIC.

**What is a Regular Expression:**
* A regular expression (RegEx) is a sophisticated search command that uses patterns to describe the text you are looking for
* Regular expressions are available in many different languages. The specific syntax used in each language may vary, but the concepts are the same!

Please refer to this for some basic regular expression definitions: 
http://web.mit.edu/hackl/www/lab/turkshop/slides/regex-cheatsheet.pdf


### B.0 Fancy Pattern Matching
To visualize how regular expressions work, we will use the following website: 
https://regex101.com/ 

Please open this website in another browser tab.
<br><br><br>

You will see a window like this.
<img src="https://raw.githubusercontent.com/christinium/JapanRegEx/316caa5e0f10011b5483c299ec417ed78bf563b0/images/regex101_demo.png" alt="Demo" style="width:700px;"/>

<br><br>
### Example 1:
1) In the **Test String**  box, please paste the following:

```
Lisinopril 40 MG PO Daily
LISINOPRIL 20 MG PO DAILY
lisinoprl 10 mg PO Daily
The patient is allergic to lisinopril.
April showers bring may flowers.
metoprolol XL 100 mg PO Daily
```

2) In the **Regular Expression** box, please try out each one of these patterns and observe the difference in items that are highlighted.

Pattern | Meaning
--------|--------
. |	A period matches any single character except a line break (each match is a different color)
pril |	this matches only the literal text pril
.\*pril |	this matches 0 or more characters followed by pril
[a-z] |	this matches any single lower-case letter
[abcdefghijklmnopqrstuvwxyz] | this also matches any single lower-case letter
[abcde]|this matches just a, b, c, d, or e
[a-z]\*pril |	this matches 0 or more lower-case letters followed by pril, <br> but does not match spaces, numbers, etc.
[a-zA-Z]+pril| this matches words with one or more letters before pril
[a-zA-Z]{2,}pril | this matches words with 2 or more letters before pril
lisinopril&#124;losartan |	this matches lisinopril or losartan
\d	| this matches a single numerical digit
\d{2} |	this matches two consecutive numerical digits
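
The same kinds of patterns can also be tried in Python with the built-in `re` module. The block below is a minimal sketch rather than part of the regex101 walkthrough: the variable names are our own, and `[a-zA-Z]` is used to mean "any upper- or lower-case letter". `re.findall` returns every non-overlapping match, which is roughly what regex101 highlights.

```python
import re

# The test strings from Example 1, one per line.
test_string = """Lisinopril 40 MG PO Daily
LISINOPRIL 20 MG PO DAILY
lisinoprl 10 mg PO Daily
The patient is allergic to lisinopril.
April showers bring may flowers.
metoprolol XL 100 mg PO Daily"""

# Words made of one or more letters followed by the lower-case text "pril".
words_ending_in_pril = re.findall(r'[a-zA-Z]+pril', test_string)

# Every run of exactly two digits, scanning left to right.
two_digit_runs = re.findall(r'\d{2}', test_string)

# Either of two literal drug names (matching is case sensitive by default).
either_drug = re.findall(r'lisinopril|losartan', test_string)

print(words_ending_in_pril)
print(two_digit_runs)
print(either_drug)
```

Note that matching is case sensitive, so `LISINOPRIL` is not found by either pattern that contains lower-case `pril`, while `April` is.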

<br><br>


### Exercise 1: 
1) In the Test String box, please paste the following:
```
Metoprolol 10 mg PO daily
Omeprazole 10 mg PO BID
Lasix 10 mg PO BID
Carvedilol 3.125 mg PO BID
Amlodipine 10 mg PO Daily
Labetalol 100 mg PO TID
```

2) What would you type in the **Regular Expression** box to answer each of the following?<br>
a) How do you write an expression to pull out just the beta blockers, a type of medication that can decrease the heart rate and blood pressure (their names end in lol)?<br>
b) You want to help someone figure out which drugs need to be dosed twice daily (BID). How do you print out only lines that are dosed twice a day (BID)?  
c) How do you print lines that are dosed more than once a day (including both BID and TID, which is three times daily)?

_\*\*Answers on the Bottom_


## B.1 Regular Expressions using MIMIC Clinical Notes!
In this section, we will use a Python (yay!) notebook to apply regular expressions to real clinical notes from MIMIC-III.  There are over 2 million (!) free text notes in here, so there is a ton of data to be used!<br><br>
We will now go through each Python code block.<br>
To run a block, select it and press **shift** + **enter**


**Import Libraries**:
The cell below imports the necessary libraries so that our Python notebook can talk to the MIMIC database using Google's BigQuery library and pull the notes from the database.

In [0]:
import os
import pandas as pd

from google.colab import auth
from google.cloud import bigquery
from google.colab import files

**Authenticate:** The line of code below ensures you are an authenticated user accessing the MIMIC database. You will need to rerun this each time you open the notebook.

In [0]:
auth.authenticate_user() #This will allow you to authenticate access to BigQuery

**Query Function:** This function executes a desired SQL query on the database.  If you want to run a query, you can call the function below, which we named  *run_query()*

In [0]:
project_id='new-zealand-2018-datathon'
os.environ["GOOGLE_CLOUD_PROJECT"]=project_id
# Read data from BigQuery into pandas dataframes.
def run_query(query):
  # The 'verbose' argument is not passed because newer pandas releases no longer accept it.
  return pd.io.gbq.read_gbq(query, project_id=project_id, configuration={'query':{'useLegacySql': False}})

**Actual Query:** This is the actual SQL query. Notes are contained in the NOTEEVENTS table. The column with the actual text of the report is the TEXT column. Here, we extract the TEXT column, along with the row_id, subject_id, and hadm_id identifiers, from ten rows of the NOTEEVENTS table whose category is 'Echo'.  <br><br>

(* Side note, if you want to run this in bigquery, you can also go to https://bigquery.cloud.google.com, click "Try the new UI" on the top right, and paste the text between the quotes into the "Query Editor" )

In [0]:
first_ten_echo_reports = run_query('''
SELECT row_id, subject_id, hadm_id, TEXT
FROM `physionet-data.mimiciii_notes.noteevents`
WHERE CATEGORY = 'Echo'
LIMIT 10
''')


Let us examine the result of our query.

In [0]:
#This prints the first ten (or only 10 in this case) rows
#If we wanted to print out all of the rows, we can also use:
# print(first_ten_echo_reports)
# (You can try it in another code block if you want)
first_ten_echo_reports.head(10) 

Let us dig deeper and view the full content of the first report

In [0]:
report = first_ten_echo_reports["TEXT"][0] 
print(report)
#Arrays start numbering at 0.  If you want to print out the second row, you can type:
#report = first_ten_echo_reports["TEXT"][1] 
#Don't forget to rerun the block after you make changes!

We are going to extract the heart rate from this note using regular expressions, a powerful tool that allows us to do simple text analytics.

To use regular expressions in Python, we import the regular expression library, `re` (typically this is done at the top of the file).

In [0]:
import re

Let us see how we can extract the line containing heart rate from the report.  
*Remember, the variable "report" was established in the code block above.  If you want to look at a different report - you can change the row number and rerun that block and then this block.*

In [0]:
regular_expression_query = r'HR.*'
hit = re.search(regular_expression_query,report) 
if hit:
  print(hit.group())
else:
  print('No hit for the regular expression')

This is great, but we want to extract just the value (85) from this line. Let us see how we can extract two-digit numbers from the report.

In [0]:
regular_expression_query = r'\d\d'
hit = re.search(regular_expression_query,report)
if hit:
  print(hit.group())
else:
  print('No hit for the regular expression')

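`re.search` returns the first (leftmost) match in the input text, and the first two-digit number in the report is not the heart rate. Therefore, we see that we get the height using our current regular expression. Let us modify the regular expression so that it captures a two-digit number on the same line as **HR**. Note that quantifiers such as `.*` are **greedy**: they match as many characters as possible, so `(HR).*(\d\d)` captures the last two consecutive digits on the HR line. When the heart rate is the only number on that line, this is the value we want.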

In [0]:
regular_expression_query = r'(HR).*(\d\d)'
hit = re.search(regular_expression_query,report)
if hit:
  print(hit.group(0))
  print(hit.group(1))
  print(hit.group(2))
else:
  print('No hit for the regular expression')

Great! This is exactly what we wanted. Now let us try to run our regular expression on each of the first ten reports and print the result.

In [0]:
#This runs a for loop - which means for the first 10 rows in our first_ten_echo_reports, we will run our regular expression.  
#We wrote the number 10 in the loop because we know there are 10 rows.
for i in range(10):
  report = first_ten_echo_reports["TEXT"][i]
  regular_expression_query = r'(HR).*(\d\d)'
  hit = re.search(regular_expression_query,report)
  if hit:    
    print('{} :: {}'.format(i,hit.group(2)))
  else:
    print('{} :: No hit for the regular expression'.format(i))
  

We do not get any hits for reports 2 and 6. Let us look at report 2 to see why this is the case.

In [0]:
print(first_ten_echo_reports["TEXT"][2])
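
Whatever the reason a given report produces no hit, it helps to wrap the extraction in a small function that returns `None` instead of failing. The block below is a sketch under stated assumptions: `sample_report` is a made-up snippet (not a MIMIC note), `extract_heart_rate` is a hypothetical helper name, and the choice to accept two- or three-digit values is ours. It also contrasts the greedy `.*` with the non-greedy `.*?`.

```python
import re

# Hypothetical snippet, loosely modelled on a report header.
# It is NOT taken from MIMIC.
sample_report = """Height: (in) 64
Weight (lb): 170
BP (mm Hg): 120/80
HR (bpm): 85 at 10:30
Status: Inpatient"""

# Greedy: .* runs to the end of the line, then backs up to the LAST two digits.
greedy = re.search(r'HR.*(\d\d)', sample_report)
# Non-greedy: .*? stops at the FIRST two digits after HR.
lazy = re.search(r'HR.*?(\d\d)', sample_report)

print(greedy.group(1))
print(lazy.group(1))


def extract_heart_rate(report_text):
    """Return the first 2-3 digit number after 'HR' on the same line, or None."""
    # \b keeps us from matching HR inside a longer word.
    hit = re.search(r'\bHR\b.*?(\d{2,3})', report_text)
    return int(hit.group(1)) if hit else None


print(extract_heart_rate(sample_report))
print(extract_heart_rate('No vitals recorded'))
```

On a line that contains more than one number, the greedy and non-greedy patterns can disagree, so it is worth checking a few reports by eye before trusting either one.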

**Exercise 2:** The pulmonary therapists make a note for patients who are on mechanical ventilation.  We will analyze these notes.<br>
a) Save the first 10 respiratory notes where the "description" column is "Respiratory Care Shift Note" into a variable called "first_ten_resp_reports" and then print the results. <br>
b) Save the first respiratory note as variable "resp_report".<br>
c) Print out the line that contains right upper lobe (RUL) lung sounds. Then do the same for RLL (right lower lobe), LUL, LLL.

In [0]:
#Use this box to get the first 10 respiratory reports
#The category is 'Respiratory ' (note the space after respiratory)


In [0]:
##Use this box to print out the first report

In [0]:
#Printing out lines with RUL

  

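To list the note categories available in the table, you can run the following SQL query (for example, by passing it to *run_query()*):

SELECT distinct category
FROM `physionet-data.mimiciii_notes.noteevents`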


### Answers to Exercises:



**Exercise 1:**<br>
a) How do you write an expression to just pull out the betablockers, a type of medication that can decrease the heart rate and blood pressure (they end in lol)<br>
```
[a-zA-Z]*lol
```
b) You want to help someone figure out which drugs need to be dosed twice daily (BID). How do you print out only lines that are dosed twice a day (BID)?  
```
.*BID
```
c) How do you print lines that are more than once a day (including both BID and TID, which is three times daily)?
```
.*BID|.*TID
```
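
One way to check answers like these outside regex101 is to run them in Python. The block below is a minimal sketch with variable names of our own choosing. It also includes the pattern `.*[BID]` to illustrate a common pitfall: square brackets define a character class, so `[BID]` means "one of B, I, or D" rather than the literal text BID.

```python
import re

# The medication list from Exercise 1, one per line.
medications = """Metoprolol 10 mg PO daily
Omeprazole 10 mg PO BID
Lasix 10 mg PO BID
Carvedilol 3.125 mg PO BID
Amlodipine 10 mg PO Daily
Labetalol 100 mg PO TID"""

# a) Drug names ending in "lol".
beta_blockers = re.findall(r'[a-zA-Z]*lol', medications)

# b) Lines dosed twice daily. The dot does not cross line breaks,
#    so each match stays within a single line.
bid_lines = re.findall(r'.*BID', medications)

# c) Lines dosed more than once daily. (?:...) groups without capturing,
#    so findall still returns the whole match.
multi_dose_lines = re.findall(r'.*(?:BID|TID)', medications)

# Pitfall: [BID] matches a single B, I, or D, so the "Daily" and "TID"
# lines are matched as well.
char_class_lines = re.findall(r'.*[BID]', medications)

print(beta_blockers)
print(bid_lines)
print(multi_dose_lines)
print(char_class_lines)
```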


**Exercise 2:** The pulmonary therapists make a note for patients who are on mechanical ventilation.  We will analyze these notes.<br>
a) Save the first 10 respiratory notes where the "description" column is "Respiratory Care Shift Note" into a variable called "first_ten_resp_reports" and then print the results. <br>
```
#Use this box to get the first 10 respiratory reports
#The category is 'Respiratory ' (note the space after respiratory)
first_ten_resp_reports = run_query('''
SELECT row_id, subject_id, hadm_id, category, description, TEXT
FROM `physionet-data.mimiciii_notes.noteevents`
WHERE category = "Respiratory " AND description = 'Respiratory Care Shift Note'
limit 10
''')

first_ten_resp_reports.head(10)
```

b) Save the first respiratory note as variable "resp_report".<br>
```
resp_report = first_ten_resp_reports["TEXT"][0] 
print(resp_report)
```
c) Print out the line that contains right upper lobe (RUL) lung sounds. Then do the same for RLL (right lower lobe), LUL, LLL.
```
regular_expression_query = r'RUL.*'
hit = re.search(regular_expression_query,resp_report) 
if hit:
  print(hit.group())
else:
  print('No hit for the regular expression')
  
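# Repeat the search for each of the ten reports
for i in range(len(first_ten_resp_reports)):
  hit = re.search(regular_expression_query, first_ten_resp_reports["TEXT"][i])
  if hit:
    print('{} :: {}'.format(i, hit.group()))
  else:
    print('{} :: No hit for the regular expression'.format(i))

## Replace RUL with RLL, LUL, LLL to look at the other lobes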
```
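
Instead of editing the pattern by hand for each lobe, the four searches can be run in a loop. The block below is a sketch only: `sample_resp_note` is a made-up note written for illustration (it is not a MIMIC record, and real notes may be laid out differently), and the variable names are our own. In the notebook you would pass `resp_report` in its place.

```python
import re

# Hypothetical note text for illustration only; it is NOT a MIMIC record.
sample_resp_note = """Lung Sounds
RLL Lung Sounds: Clear
RUL Lung Sounds: Clear
LUL Lung Sounds: Rhonchi
LLL Lung Sounds: Diminished"""

lobes = ['RUL', 'RLL', 'LUL', 'LLL']
lobe_lines = {}
for lobe in lobes:
    # Build the pattern for this lobe: the abbreviation, then the rest of the line.
    hit = re.search(lobe + r'.*', sample_resp_note)
    lobe_lines[lobe] = hit.group() if hit else None

for lobe, line in lobe_lines.items():
    print('{} :: {}'.format(lobe, line))
```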