# Part B - Fun with regular expressions!!

In this section, we will learn what a regular expression is and use our newly learned skills to get information from notes in MIMIC.

**What is a Regular Expression:**
* A regular expression (RegEx) is a sophisticated search command that makes use of patterns.
* This can be implemented in many different languages. The specific syntax used in each language may vary, but the concepts are the same!

Please refer to this for some basic regular expression definitions: 
http://web.mit.edu/hackl/www/lab/turkshop/slides/regex-cheatsheet.pdf


### B.0 Fancy Pattern Matching
To visualize how regular expressions work, we will use the following website: 
https://regex101.com/ 

Please open this website in another browser tab or window.
<br><br><br>

You will see a window like this.
<img src="https://raw.githubusercontent.com/christinium/JapanRegEx/316caa5e0f10011b5483c299ec417ed78bf563b0/images/regex101_demo.png" alt="Demo" style="width:700px;"/>

<br><br>
### Example 1:
1) In the **Test String**  box, please paste the following:

```
Lisinopril 40 MG PO Daily
LISINOPRIL 20 MG PO DAILY
lisinoprl 10 mg PO Daily
The patient is allergic to lisinopril.
April showers bring may flowers.
metoprolol XL 100 mg PO Daily
```

2) In the **Regular Expression** box, please try out each one of these patterns and observe the difference in items that are highlighted.

Pattern | Meaning
--------|--------
. | A period matches any single character except a line break (each match is a different color)
pril | this matches only the literal text pril
.\*pril | this matches 0 or more characters of any kind before pril
[a-z] | this matches any single lower-case letter
[abcdefghijklmnopqrstuvwxyz] | this also matches any single lower-case letter
[abcde] | this matches just a, b, c, d, or e
[a-z]\*pril | this matches 0 or more lower-case letters before pril, <br> but does not match spaces, numbers, etc.
[a-zA-Z]+pril | this matches words with one or more letters before the ending pril
[a-zA-Z]{2,}pril | this matches words with 2 or more letters before the ending pril
lisinopril&#124;losartan | this matches lisinopril or losartan
\d | this matches a single numerical digit
\d{2} | this matches two numerical digits in a row
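
The same kind of experiment can be run outside regex101. The snippet below is a minimal sketch, assuming Python's built-in `re` module and the test strings from this example; it uses `[a-zA-Z]` as the "any letter" class and `re.findall` to list every match on each line. The variable names are only illustrative.

```python
import re

# Test strings copied from Example 1
test_lines = [
    "Lisinopril 40 MG PO Daily",
    "LISINOPRIL 20 MG PO DAILY",
    "lisinoprl 10 mg PO Daily",
    "The patient is allergic to lisinopril.",
    "April showers bring may flowers.",
    "metoprolol XL 100 mg PO Daily",
]

# A few of the patterns from the table
patterns = [r"pril", r"[a-z]*pril", r"[a-zA-Z]+pril", r"\d{2}"]

# For each pattern, collect the list of matches found on each line
matches = {}
for pattern in patterns:
    matches[pattern] = [re.findall(pattern, line) for line in test_lines]
    print(pattern, matches[pattern])
```

Matching is case sensitive by default, which is why the all-capitals line gives no hit for `pril`.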

<br><br>


### Exercise 1: 
1) In the Test String box, please paste the following:
```
Metoprolol 10 mg PO daily
Omeprazole 10 mg PO BID
Lasix 10 mg PO BID
Carvedilol 3.125 mg PO BID
Amlodipine 10 mg PO Daily
Labetalol 100 mg PO TID
```

2) What would you type in the **Regular Expression** box to do each of the following?<br>
a) Pull out just the beta blockers, a type of medication that can decrease the heart rate and blood pressure (their names end in lol).<br>
b) You want to help someone figure out which drugs need to be dosed twice daily (BID). How do you match only the lines for drugs dosed twice a day (BID)?<br>
c) How do you match the lines for drugs dosed more than once a day (including both BID and TID, which is three times daily)?

_\*\*Answers at the bottom_


## B.1 Regular Expressions using MIMIC Clinical Notes!
In this section, we will use a Python (yay!) notebook to run regular expressions on real clinical notes from MIMIC-III.  There are over 2 million (!) free-text notes in the database, so there is a ton of data to work with!<br><br>
We will now go through each python code block.<br>
To run a block select it and press **shift** + **enter**


**Import Libraries**:
The cell below imports the necessary libraries so that our Python notebook can load the notes with pandas and, if you have a local copy of the MIMIC database, pull the notes from it with pymysql.

In [1]:
import os
import pandas as pd
import pymysql

## Accessing notes data

#### Option 1: Copy, paste and run the following SQL command in Query Builder and rename the downloaded file as "part_b.csv". Make sure the file is in the same directory as this notebook.

SELECT row_id, subject_id, hadm_id, TEXT
FROM noteevents
WHERE CATEGORY = 'Echo'
LIMIT 10;


This is the actual SQL query. Notes are contained in the NOTEEVENTS table. The column with the actual text of the report is the "text" column. Here, we are extracting the TEXT column, along with the row, subject, and admission identifiers, from ten rows of the NOTEEVENTS table whose category is 'Echo'.  <br><br>
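
To see what the `WHERE` and `LIMIT` clauses do without access to MIMIC, here is a minimal sketch that runs the same query against a toy in-memory SQLite table. The three rows and their values are made up for illustration, and the toy table has only a handful of columns; it is not the real NOTEEVENTS schema.

```python
import sqlite3

# Toy stand-in for the NOTEEVENTS table (made-up rows, reduced set of columns)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE noteevents "
    "(row_id INTEGER, subject_id INTEGER, hadm_id INTEGER, category TEXT, text TEXT)"
)
toy_rows = [
    (1, 100, 200, "Echo", "HR (bpm): 89"),
    (2, 101, 201, "Nursing", "Patient resting comfortably."),
    (3, 102, None, "Echo", "HR (bpm): 95"),
]
conn.executemany("INSERT INTO noteevents VALUES (?, ?, ?, ?, ?)", toy_rows)

# Same query as above: keep only Echo notes, at most ten of them
echo_rows = conn.execute(
    "SELECT row_id, subject_id, hadm_id, text "
    "FROM noteevents WHERE category = 'Echo' LIMIT 10;"
).fetchall()
conn.close()

print(echo_rows)
```

Only the two 'Echo' rows come back, each with the four selected columns.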

In [2]:
# Then import the data into the notebook with the following code
with open('part_b.csv') as echo_reports:
    first_ten_echo_reports = pd.read_csv(echo_reports)

In [3]:
print(first_ten_echo_reports.shape)

(10, 4)


#### Option 2: Uncomment (command+/) if you already have mimiciii locally set up as a SQL database

In [4]:
# sql ='''
# SELECT row_id, subject_id, hadm_id, TEXT
# FROM mimiciii.NOTEEVENTS
# WHERE CATEGORY = 'Echo'
# LIMIT 10;
# '''

In [5]:
# # Data access - Uncomment this block if you have set up mimiciii with MySQL
# import pymysql
# params = {'database': 'mimic', 'user': 'XXXXX', 'password': 'YYYYY', 'host': 'localhost'}
# conn = pymysql.connect(**params)

# # Now load the data.
# first_ten_echo_reports = pd.read_sql_query(sql,conn)

In [6]:
# # Data access - Uncomment this block if you have set up mimiciii with Postgres
# import psycopg2
# params = {'database': 'mimic', 'user': 'XXXXX', 'password': 'YYYYY', 'host': 'localhost'}
# conn = psycopg2.connect(**params)

# # Now load the data.
# first_ten_echo_reports = pd.read_sql(sql,conn)

## Start NLP Exercises

Let us examine the result of our query.

In [7]:
#This prints the first ten (or only 10 in this case) rows
#If we wanted to print out all of the rows, we can also use:
# print(first_ten_echo_reports)
# (You can try it in another code block if you want)
first_ten_echo_reports.head(10) 

Unnamed: 0,row_id,subject_id,hadm_id,text
0,59658,65696,167705.0,PATIENT/TEST INFORMATION:\nIndication: Left ve...
1,59659,82208,188268.0,PATIENT/TEST INFORMATION:\nIndication: Acute M...
2,59660,82208,,PATIENT/TEST INFORMATION:\nIndication: Congest...
3,59669,15472,,PATIENT/TEST INFORMATION:\nIndication: Left ve...
4,59670,15472,118185.0,PATIENT/TEST INFORMATION:\nIndication: Left ve...
5,59671,2961,130443.0,PATIENT/TEST INFORMATION:\nIndication: Left ve...
6,59672,7429,110364.0,PATIENT/TEST INFORMATION:\nIndication: Endocar...
7,59767,17513,124736.0,PATIENT/TEST INFORMATION:\nIndication: Pericar...
8,59768,17513,124736.0,PATIENT/TEST INFORMATION:\nIndication: Pericar...
9,59769,17513,124736.0,PATIENT/TEST INFORMATION:\nIndication: Cath la...


Let us dig deeper and view the full content of the first report

In [8]:
report = first_ten_echo_reports["text"][0] 
print(report)
#Arrays start numbering at 0.  If you want to print out the second row, you can type:
#report = first_ten_echo_reports["text"][1] 
#Don't forget to rerun the block after you make changes!

PATIENT/TEST INFORMATION:
Indication: Left ventricular function. Myocardial infarction.
Height: (in) 73
Weight (lb): 200
BSA (m2): 2.15 m2
BP (mm Hg): 100/42
HR (bpm): 89
Status: Inpatient
Date/Time: [**2196-9-16**] at 01:19
Test: Portable TTE (Focused views)
Doppler: Limited Doppler and color Doppler
Contrast: None
Technical Quality: Adequate


INTERPRETATION:

Findings:

LEFT VENTRICLE: Normal LV wall thickness and cavity size. Severe regional LV
systolic dysfunction. No resting LVOT gradient.

LV WALL MOTION: Regional LV wall motion abnormalities include: mid anterior -
akinetic; mid anteroseptal - akinetic; anterior apex - akinetic; septal apex-
akinetic; apex - akinetic;

RIGHT VENTRICLE: Mildly dilated RV cavity. Mild global RV free wall
hypokinesis.

AORTIC VALVE: Bioprosthetic aortic valve prosthesis (AVR). AVR leaflets move
normally.

MITRAL VALVE: Mildly thickened mitral valve leaflets. Moderate (2+) MR.

TRICUSPID VALVE: Indeterminate PA systolic pressure.

PERICARDIUM: No p

We are going to extract the heart rate from this note using regular expressions, a powerful tool that allows us to do simple text analytics.

To use regular expressions in Python, we import the regular expression library `re` (typically this is done at the top of the file).

In [9]:
import re

Let us see how we can extract the line containing heart rate from the report.  
*Remember, the variable "report" was established in the code block above.  If you want to look at a different report - you can change the row number and rerun that block and then this block.*

In [10]:
regular_expression_query = r'HR.*'
hit = re.search(regular_expression_query,report) 
if hit:
    print(hit.group())
else:
    print('No hit for the regular expression')

HR (bpm): 89


This is great. But we want to extract the value (89) from this line. Let us see how we can extract two-digit numbers from the report.

In [11]:
regular_expression_query = r'\d\d'
hit = re.search(regular_expression_query,report)
if hit:
    print(hit.group())
else:
    print('No hit for the regular expression')

73
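
To see why `73` comes back, it can help to list every two-digit match rather than only the first. The snippet below is a small sketch on a shortened copy of the report header shown earlier; `re.findall` is used here purely for inspection and is not part of the original notebook.

```python
import re

# A few header lines copied from the first echo report
sample_report = """Height: (in) 73
Weight (lb): 200
BSA (m2): 2.15 m2
BP (mm Hg): 100/42
HR (bpm): 89"""

# re.search stops at the first (leftmost) match
first_hit = re.search(r"\d\d", sample_report)

# re.findall returns every non-overlapping match, in order of appearance
all_hits = re.findall(r"\d\d", sample_report)

print(first_hit.group())
print(all_hits)
```

The heart rate is in the list, but it is the last match; the height comes first simply because it appears first in the text.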


`re.search` returns the first (leftmost) match in the input text, which is why our current regular expression returns the height (73). Let us modify the regular expression so that we get a two-digit number that follows **HR** on the same line. The `.*` in the new pattern is **greedy**: it consumes as many characters as it can (without crossing a line break) and then backs off just enough for `\d\d` to match.

In [12]:
regular_expression_query = r'(HR).*(\d\d)'
hit = re.search(regular_expression_query,report)
if hit:
    print(hit.group(0))
    print(hit.group(1))
    print(hit.group(2))
else:
    print('No hit for the regular expression')

HR (bpm): 89
HR
89
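
One caveat worth checking: because `.*` takes as much of the line as it can, the second group holds the last two digits on the line. The sketch below compares the pattern above with a more specific alternative on a made-up three-digit heart rate. Both the `105` line and the `anchored` pattern are illustrative assumptions, not part of the original notebook.

```python
import re

line_two_digits = "HR (bpm): 89"
line_three_digits = "HR (bpm): 105"  # made-up value for illustration

# Pattern used in the notebook: greedy .* followed by exactly two digits
greedy = r"(HR).*(\d\d)"

# Hypothetical alternative: spell out the label, then capture all the digits
anchored = r"HR \(bpm\):\s*(\d+)"

greedy_two = re.search(greedy, line_two_digits).group(2)
greedy_three = re.search(greedy, line_three_digits).group(2)
anchored_three = re.search(anchored, line_three_digits).group(1)

print(greedy_two, greedy_three, anchored_three)
```

For two-digit heart rates both patterns agree; for a three-digit value the greedy pattern keeps only the trailing `05`, while the anchored one captures the whole number.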


Great! This is exactly what we wanted. Now let us try to run our regular expression on each of the first ten reports and print the result.

In [13]:
#This runs a for loop - which means for the first 10 rows in our first_ten_echo_reports, we will run our regular expression.  
#We wrote the number 10 in the loop because we know there are 10 rows.
for i in range(10):
    report = first_ten_echo_reports["text"][i]
    regular_expression_query = r'(HR).*(\d\d)'
    hit = re.search(regular_expression_query,report)
    if hit:    
        print('{} :: {}'.format(i,hit.group(2)))
    else:
        print('{} :: No hit for the regular expression'.format(i))
  

0 :: 89
1 :: 95
2 :: 92
3 :: No hit for the regular expression
4 :: No hit for the regular expression
5 :: 86
6 :: 90
7 :: No hit for the regular expression
8 :: No hit for the regular expression
9 :: No hit for the regular expression
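
The same loop can be wrapped in a small helper that returns `None` when a report has no heart rate line, which makes the missing cases explicit. This is a sketch under assumptions: `extract_heart_rate` and the two toy reports are hypothetical, and the pattern assumes the label is written exactly as `HR (bpm):`.

```python
import re

# Assumes the heart rate line looks like "HR (bpm): 89"
HEART_RATE_PATTERN = re.compile(r"HR \(bpm\):\s*(\d+)")

def extract_heart_rate(report):
    """Return the heart rate as an int, or None if no HR line is found."""
    hit = HEART_RATE_PATTERN.search(report)
    return int(hit.group(1)) if hit else None

# Two toy reports: one with an HR line, one without
toy_reports = [
    "Height: (in) 73\nHR (bpm): 89\nStatus: Inpatient",
    "Height: (in) 72\nBP (mm Hg): 122/90\nStatus: Inpatient",
]

heart_rates = [extract_heart_rate(report) for report in toy_reports]
print(heart_rates)
```

With the DataFrame from this notebook, something like `first_ten_echo_reports["text"].apply(extract_heart_rate)` should give one value per report.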


We do not get any hits for reports 3, 4, 7, 8 and 9. Let us print report 3 to see why this is the case.

In [14]:
print(first_ten_echo_reports["text"][3])

PATIENT/TEST INFORMATION:
Indication: Left ventricular function. Shortness of breath. History of lung cancer with right pneumonectomy.
Height: (in) 72
Weight (lb): 200
BSA (m2): 2.13 m2
BP (mm Hg): 122/90
Status: Inpatient
Date/Time: [**2176-5-28**] at 08:58
Test: TTE(Focused views)
Doppler: Focused pulse and color flow
Contrast: None
Technical Quality: Suboptimal


INTERPRETATION:

Findings:

This study was compared to the prior study of [**2175-9-19**].


RIGHT ATRIUM/INTERATRIAL SEPTUM: The right atrium is not well visualized.

LEFT VENTRICLE: Left ventricular wall thicknesses and cavity size are normal.
Overall left ventricular systolic function is moderately depressed.

RIGHT VENTRICLE: The right ventricular free wall is hypertrophied. Right
ventricular systolic function appears depressed.

AORTA: The aortic root is normal in diameter. The ascending aorta is mildly
dilated.

AORTIC VALVE: The aortic valve leaflets are mildly thickened. There is no
significant aortic valve stenosis
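
This report has no `HR` line at all, so there is nothing for the pattern to match. One way to check which fields a report header contains is to pull out every `Label: value` line at once. The sketch below does this on a shortened copy of the header above; `field_pattern` is an illustrative assumption about how the header lines are laid out, not a rule documented for MIMIC.

```python
import re

# Shortened copy of the header from the report printed above
sample_header = """PATIENT/TEST INFORMATION:
Indication: Left ventricular function.
Height: (in) 72
Weight (lb): 200
BP (mm Hg): 122/90
Status: Inpatient"""

# With re.MULTILINE, ^ and $ match at the start and end of each line.
# Group 1 is the label before the first colon, group 2 is the rest of the line.
field_pattern = re.compile(r"^([A-Za-z][^:\n]*):[ \t]*(.+)$", re.MULTILINE)

fields = {
    label.strip(): value.strip()
    for label, value in field_pattern.findall(sample_header)
}

print(fields)
print("HR (bpm)" in fields)
```

Lines with nothing after the colon, such as the section title, are skipped, and the missing `HR (bpm)` key shows directly why the search came back empty.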

**Exercise 2:** The pulmonary therapists make a note for patients who are on mechanical ventilation.  We will analyze these notes.<br>
a) Save the first 10 respiratory notes where the "description" column is "Respiratory Care Shift Note" into a variable called "first_ten_resp_reports" and then print the results. <br>
b) Save the first respiratory note as variable "resp_report".<br>
c) Print out the line that contains right upper lobe (RUL) lung sounds. Then do the same for RLL (right lower lobe), LUL, LLL.

In [15]:
#Use this box to get the first 10 respiratory reports
#The category is 'Respiratory ' (note the space after respiratory)


In [16]:
##Use this box to print out the first report

In [17]:
#Printing out lines with RUL

  

SELECT distinct category
FROM `physionet-data.mimiciii_notes.noteevents`


### Answers to Exercises:



**Exercise 1:**<br>
a) How do you write an expression to pull out just the beta blockers, a type of medication that can decrease the heart rate and blood pressure (their names end in lol)?<br>
```
[a-zA-Z]*lol
```
b) You want to help someone figure out which drugs need to be dosed twice daily (BID). How do you match only the lines for drugs dosed twice a day (BID)?
```
.*BID
```
c) How do you match the lines for drugs dosed more than once a day (including both BID and TID, which is three times daily)?
```
.*BID|.*TID
```
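
Candidate answers for this exercise can also be checked in Python. The sketch below, assuming the test strings from Exercise 1, reports which lines each pattern hits. It also includes the character class `[BID]` for contrast, since square brackets mean "any one of these letters" rather than the literal text BID. The helper `matching_lines` is hypothetical.

```python
import re

# Test strings copied from Exercise 1
medication_lines = [
    "Metoprolol 10 mg PO daily",
    "Omeprazole 10 mg PO BID",
    "Lasix 10 mg PO BID",
    "Carvedilol 3.125 mg PO BID",
    "Amlodipine 10 mg PO Daily",
    "Labetalol 100 mg PO TID",
]

def matching_lines(pattern):
    """Return the lines in which the pattern matches somewhere."""
    return [line for line in medication_lines if re.search(pattern, line)]

# a) Drug names ending in lol
beta_blockers = []
for line in medication_lines:
    hit = re.search(r"[a-zA-Z]*lol", line)
    if hit:
        beta_blockers.append(hit.group())

# b) Literal BID versus the character class [BID]
twice_daily = matching_lines(r"BID")
char_class = matching_lines(r"[BID]")

# c) BID or TID
more_than_once = matching_lines(r"BID|TID")

print(beta_blockers)
print(len(twice_daily), len(char_class), len(more_than_once))
```

The character class also picks up the `Daily` and `TID` lines, because each contains a capital `D` or `I`.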


**Exercise 2:** The pulmonary therapists make a note for patients who are on mechanical ventilation.  We will analyze these notes.<br>
a) Save the first 10 respiratory notes where the "description" column is "Respiratory Care Shift Note" into a variable called "first_ten_resp_reports" and then print the results. <br>
```
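#Use this box to get the first 10 respiratory reports
#The category is 'Respiratory ' (note the space after respiratory)
#run_query is assumed to be a helper defined elsewhere in the notebook that runs SQL on BigQuery and returns a pandas DataFrame
first_ten_resp_reports = run_query('''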
SELECT row_id, subject_id, hadm_id, category, description, TEXT
FROM `physionet-data.mimiciii_notes.noteevents`
WHERE category = "Respiratory " AND description = 'Respiratory Care Shift Note'
limit 10
''')

first_ten_resp_reports.head(10)
```

b) Save the first respiratory note as variable "resp_report".<br>
```
resp_report = first_ten_resp_reports["TEXT"][0] 
print(resp_report)
```
c) Print out the line that contains right upper lobe (RUL) lung sounds. Then do the same for RLL (right lower lobe), LUL, LLL.
```
regular_expression_query = r'RUL.*'
hit = re.search(regular_expression_query,resp_report)
if hit:
    print(hit.group())
else:
    print('No hit for the regular expression')

# Repeat the search on each of the ten reports
for i in range(len(first_ten_resp_reports)):
    hit = re.search(regular_expression_query,first_ten_resp_reports["TEXT"][i])
    if hit:
        print('{} :: {}'.format(i,hit.group()))
    else:
        print('{} :: No hit for the regular expression'.format(i))

## Replace RUL with RLL, LUL, LLL to look at the other lobes
```