# Extracting values from Echo reports

The cell below imports the necessary libraries so that our Python notebook can talk to the MIMIC database using Google's BigQuery library and pull the notes.

In [0]:
import os
import pandas as pd

from google.colab import auth
from google.cloud import bigquery
from google.colab import files

The line of code below ensures that you are an authenticated user when accessing the MIMIC database.

In [0]:
auth.authenticate_user() #This will allow you to authenticate access to BigQuery

This is a function that executes a desired SQL query on the database and returns the result as a pandas DataFrame.

In [0]:
project_id='hst-953-2018'
os.environ["GOOGLE_CLOUD_PROJECT"]=project_id
# Read data from BigQuery into pandas dataframes.
def run_query(query):
  return pd.io.gbq.read_gbq(query, project_id=project_id, verbose=False, configuration={'query':{'useLegacySql': False}})

This is the actual SQL query. Here, we extract the TEXT column from ten rows of the NOTEEVENTS table whose CATEGORY is 'Echo'.

In [0]:
first_ten_echo_reports = run_query('''
SELECT TEXT 
FROM `physionet-data.mimiciii_notes.noteevents`
WHERE CATEGORY = 'Echo'
LIMIT 10
''')

Let us examine the result of our query.

In [0]:
print(first_ten_echo_reports["TEXT"])

0    PATIENT/TEST INFORMATION:\nIndication: Coronar...
1    PATIENT/TEST INFORMATION:\nIndication: Endocar...
2    PATIENT/TEST INFORMATION:\nIndication: Aortic ...
3    PATIENT/TEST INFORMATION:\nIndication: Congest...
4    PATIENT/TEST INFORMATION:\nIndication: Shortne...
5    PATIENT/TEST INFORMATION:\nIndication: Pre-Cor...
6    PATIENT/TEST INFORMATION:\nIndication: Congest...
7    PATIENT/TEST INFORMATION:\nIndication: rule-ou...
8    PATIENT/TEST INFORMATION:\nIndication: LV func...
9    PATIENT/TEST INFORMATION:\nIndication: Congest...
Name: TEXT, dtype: object


Let us dig deeper and view the full content of the first report.

In [0]:
report = first_ten_echo_reports["TEXT"][0]
print(report)

PATIENT/TEST INFORMATION:
Indication: Coronary artery disease. Left ventricular function. Patient on IABP 1:1.
Height: (in) 68
Weight (lb): 178
BSA (m2): 1.95 m2
BP (mm Hg): 112/61
HR (bpm): 85
Status: Inpatient
Date/Time: [**2133-6-15**] at 10:20
Test: Portable TTE (Complete)
Doppler: Full Doppler and color Doppler
Contrast: None
Technical Quality: Adequate


INTERPRETATION:

Findings:

This study was compared to the prior study of [**2130-10-4**].


LEFT ATRIUM: Normal LA size.

RIGHT ATRIUM/INTERATRIAL SEPTUM: Mildly dilated RA.

LEFT VENTRICLE: Mild symmetric LVH. Moderately dilated LV cavity. Apical LV
aneurysm. Severe global LV hypokinesis. TVI E/e' >15, suggesting PCWP>18mmHg.
No resting LVOT gradient. No LV mass/thrombus.

RIGHT VENTRICLE: Normal RV chamber size and free wall motion.

AORTA: Focal calcifications in aortic root. Moderately dilated ascending
aorta.

AORTIC VALVE: Mildly thickened aortic valve leaflets (3). No AS. No AS.

MITRAL VALVE: Mildly thickened mitral valv

We are going to extract the heart rate from this note using regular expressions, a powerful tool that allows us to do simple text analytics.
Christina to add regex101 example here from her [notebook](https://github.com/christinium/JapanRegEx/blob/master/1.1%20-%20RegEx%20-%20Regular%20Expressions.ipynb)

To use regular expressions in Python, we import the `re` library.

In [0]:
import re

Let us see how we can extract the line containing heart rate from the report

In [0]:
regular_expression_query = r'HR.*'
hit = re.search(regular_expression_query,report)
if hit:
  print(hit.group())
else:
  print('No hit for the regular expression')

HR (bpm): 85
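
The match stops at the end of the HR line because, by default, `.` does not match a newline character. The sketch below illustrates this on a shortened excerpt of the report header; the `sample` string is typed in by hand for illustration and is not read from the database.

```python
import re

# Shortened excerpt of the report header, copied by hand for illustration.
sample = "BP (mm Hg): 112/61\nHR (bpm): 85\nStatus: Inpatient"

# By default '.' stops at the newline, so only the HR line is returned.
line_only = re.search(r'HR.*', sample).group()
print(line_only)

# With re.DOTALL, '.' also matches newlines, so the match runs to the end
# of the text instead of the end of the line.
to_end = re.search(r'HR.*', sample, flags=re.DOTALL).group()
print(repr(to_end))
```

For line-based reports like these, the default behaviour is usually the one we want.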


This is great, but we want to extract the value (85) from this line. Let us see how we can extract two-digit numbers from the report.

In [0]:
regular_expression_query = r'\d\d'
hit = re.search(regular_expression_query,report)
if hit:
  print(hit.group())
else:
  print('No hit for the regular expression')

68
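
One way to see where the 68 comes from is to list every two-digit match in the text rather than only the first one. The sketch below does this with `re.findall` on a hand-copied excerpt of the report header (the `sample` string is an illustration, not the full report).

```python
import re

# Hand-copied excerpt of the report header, for illustration only.
sample = ("Height: (in) 68\nWeight (lb): 178\nBSA (m2): 1.95 m2\n"
          "BP (mm Hg): 112/61\nHR (bpm): 85")

# re.search returns only the leftmost match in the text.
first_match = re.search(r'\d\d', sample).group()
print(first_match)

# re.findall returns all non-overlapping matches, scanning left to right.
all_matches = re.findall(r'\d\d', sample)
print(all_matches)
```

The heart rate is in the list, but it is the last match, not the first.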


`re.search` returns the first (leftmost) match in the input text. Therefore, we see that we get the height using our current regular expression. Let us modify the regular expression so that we get a two-digit number that follows the occurrence of **HR** on the same line. Note that the `.*` in the new pattern is **greedy**: it consumes as much of the line as it can, so `(\d\d)` captures the last two digits on that line.

In [0]:
regular_expression_query = r'(HR).*(\d\d)'
hit = re.search(regular_expression_query,report)
if hit:
  print(hit.group(0))
  print(hit.group(1))
  print(hit.group(2))
else:
  print('No hit for the regular expression')

HR (bpm): 85
HR
85


Great! This is exactly what we wanted. Now let us try to run our regular expression on each of the first ten reports and print the result.

In [0]:
for i in range(10):
  report = first_ten_echo_reports["TEXT"][i]
  regular_expression_query = r'(HR).*(\d\d)'
  hit = re.search(regular_expression_query,report)
  if hit:    
    print('{} :: {}'.format(i,hit.group(2)))
  else:
    print('{} :: No hit for the regular expression'.format(i))

0 :: 85
1 :: 89
2 :: No hit for the regular expression
3 :: 68
4 :: 90
5 :: 57
6 :: No hit for the regular expression
7 :: 75
8 :: 52
9 :: 82
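
A side note on the pattern: `(\d\d)` always captures exactly two digits, and because `.*` consumes as much of the line as possible, a three-digit heart rate would lose its leading digit. The sketch below shows this on a made-up line; the value 112 and the stricter pattern `HR \(bpm\):\s*(\d+)` are assumptions for illustration and are not part of the original notebook.

```python
import re

loose = r'(HR).*(\d\d)'            # pattern used in the notebook
strict = r'HR \(bpm\):\s*(\d+)'    # assumed alternative: match the full label

line = "HR (bpm): 112"             # made-up value, not taken from a report

# The greedy '.*' leaves only the last two digits for the capture group.
loose_value = re.search(loose, line).group(2)
print(loose_value)

# Matching the label explicitly and using \d+ keeps every digit.
strict_value = re.search(strict, line).group(1)
print(strict_value)
```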


We do not get any hits for reports 2 and 6. Let us print report 2 to see why this is the case.

In [0]:
print(first_ten_echo_reports["TEXT"][2])

PATIENT/TEST INFORMATION:
Indication: Aortic valve disease. Coronary artery disease. Left ventricular function. Mitral valve disease.
Height: (in) 68
Weight (lb): 163
BSA (m2): 1.88 m2
BP (mm Hg): 120/70
Status: Inpatient
Date/Time: [**2178-5-8**] at 10:56
Test: Portable TTE (Complete)
Doppler: Complete pulse and color flow
Contrast: None
Technical Quality: Adequate


INTERPRETATION:

Findings:

LEFT ATRIUM: The left atrium is normal in size.

RIGHT ATRIUM/INTERATRIAL SEPTUM: The right atrium is normal in size.

LEFT VENTRICLE: There is moderate symmetric left ventricular hypertrophy. The
left ventricular cavity size is normal. Overall left ventricular systolic
function is mildly depressed.

LV WALL MOTION: The following resting regional left ventricular wall motion
abnormalities are seen: basal anteroseptal - hypokinetic; mid anteroseptal -
hypokinetic; basal inferolateral - hypokinetic; septal apex - hypokinetic;
apex - hypokinetic;

RIGHT VENTRICLE: Right ventricular chamber size an
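
The header of report 2 printed above has no `HR (bpm)` line, so there is nothing for the regular expression to match. When a field can be absent, it may be convenient to wrap the search in a small function that returns `None` instead of printing a message. The following is a minimal sketch; the function name `extract_heart_rate`, the pattern, and the two sample strings are hypothetical and only for illustration.

```python
import re
from typing import Optional

# Assumed pattern: the literal label "HR (bpm):" followed by the digits.
HR_PATTERN = re.compile(r'HR \(bpm\):\s*(\d+)')

def extract_heart_rate(report: str) -> Optional[int]:
    """Return the heart rate as an int, or None if the report has no HR line."""
    hit = HR_PATTERN.search(report)
    return int(hit.group(1)) if hit else None

# Hand-written excerpts that mimic a report with and without an HR line.
with_hr = "BP (mm Hg): 112/61\nHR (bpm): 85\nStatus: Inpatient"
without_hr = "BP (mm Hg): 120/70\nStatus: Inpatient"

print(extract_heart_rate(with_hr))
print(extract_heart_rate(without_hr))
```

In the notebook, such a function could be applied to every report with `first_ten_echo_reports["TEXT"].apply(extract_heart_rate)`.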

Can you modify the notebook to print the height of the patient mentioned in the first ten echo reports?