# Part A - Spotting NASH

Natural Language Processing (NLP) doesn't have to be hard! For many tasks, simply finding the notes that mention a useful term is enough. In this example we have a term, NASH (nonalcoholic steatohepatitis), that is fairly unambiguous, and we just want to find patients who may have NASH for further study.

In [1]:
# First off - load all the silly python libraries we are going to need
import pandas as pd
import numpy as np
import random
from IPython.display import display, HTML

## Accessing notes data

#### Option 1: Copy and run the following SQL command in Query Builder, then save the downloaded file as "part_a.csv" in the same directory as this notebook.

SELECT SETSEED(0.5);
SELECT *, RANDOM() as random_id 
FROM (        
    SELECT row_id, subject_id, text 
    FROM noteevents 
    WHERE text LIKE '%cirrhosis%' 
    ORDER BY row_id, subject_id 
    LIMIT 1000
) A;

#### Option 1, continued: Reading the data into the notebook

In [2]:
filename = 'part_a.csv'
with open(filename) as cirrhosis_notes:
    notes = pd.read_csv(cirrhosis_notes)

In [3]:
notes.head()

Unnamed: 0,row_id,subject_id,text
0,27,23194,Admission Date: [**2157-4-9**] D...
1,36,5458,Admission Date: [**2137-3-7**] D...
2,45,68109,Admission Date: [**2189-9-7**] D...
3,63,80030,Admission Date: [**2119-6-7**] D...
4,113,89633,Admission Date: [**2168-5-6**] D...


#### Option 2: Uncomment (command+/) the cells below if you already have mimiciii set up locally as a SQL database

In [4]:
# sql = """
# SELECT SETSEED(0.5);
# SELECT *, RANDOM() as random_id 
# FROM (        
#     SELECT row_id, subject_id, text 
#     FROM mimiciii.noteevents 
#     WHERE text LIKE '%cirrhosis%' 
#     ORDER BY row_id, subject_id 
#     LIMIT 1000
# ) A;
# """

In [5]:
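# # Data access - Uncomment this block if you have set up mimiciii with MySQL
# # Note: SETSEED() and RANDOM() in the query above are PostgreSQL functions,
# # so the query needs adapting (e.g. RAND()) before it will run on MySQL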
# import pymysql
# # Point to your own database
# params = {'database': 'mimic', 'user': 'XXXXX', 'password': 'YYYYY', 'host': 'localhost'}
# conn = pymysql.connect(**params)

# # Now load the data. In general you'd load the whole set of notes but that would take
# # several minutes so for this example we're just going to use a subset
# notes = pd.read_sql_query(sql, conn)

In [6]:
# # Data access - Uncomment this block if you have set up mimiciii with Postgres
# import psycopg2
# # Point to your own database
# params = {'database': 'mimic', 'user': 'XXXXX', 'password': 'YYYYY', 'host': 'localhost'}
# conn = psycopg2.connect(**params)

# # Now load the data
# notes = pd.read_sql(sql, conn)

## Start NLP Exercises

In [7]:
# Here is the list of terms we are going to consider "good"
terms = ['NASH', 'nonalcoholic steatohepatitis']

In [8]:
# Now scan through all of the notes. Do any of the terms appear (as an exact,
# case-sensitive substring)? If so, stash the note's row_id for future use

matches = []

for index, row in notes.iterrows():
    if any(x in row['text'] for x in terms):
        matches.append(row['row_id'])

print("Found " + str(len(matches)) + " matching notes.")

Found 63 matching notes.
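
The scan above uses exact, case-sensitive substring matching, so it would miss "Nonalcoholic Steatohepatitis" or a phrase broken across a line wrap, and it would also fire on "NASH" embedded in a longer uppercase word. A minimal sketch of a more forgiving matcher, using only the standard `re` module, is shown below. The names `TERMS`, `TERM_PATTERN` and `has_term` are hypothetical, and the hyphenated spelling variant is an assumption rather than part of the original term list.

```python
import re

# Hypothetical term list mirroring the notebook's `terms`; the hyphenated
# variant is an assumption added for illustration.
TERMS = ["NASH", "nonalcoholic steatohepatitis", "non-alcoholic steatohepatitis"]

def _term_to_regex(term):
    # Allow any run of whitespace (including newlines) between words,
    # because clinical notes are often hard-wrapped mid-phrase.
    return r"\s+".join(re.escape(word) for word in term.split())

# \b keeps "NASH" from matching inside longer words such as "gnashing";
# re.IGNORECASE also accepts "Nash" and "Nonalcoholic Steatohepatitis".
TERM_PATTERN = re.compile(
    r"\b(?:" + "|".join(_term_to_regex(t) for t in TERMS) + r")\b",
    flags=re.IGNORECASE,
)

def has_term(text):
    """Return True if any term occurs in `text` as a whole word or phrase."""
    return isinstance(text, str) and TERM_PATTERN.search(text) is not None

# Made-up snippets for illustration only; these are not real notes.
samples = [
    "History of NASH cirrhosis.",
    "Dx: Nonalcoholic\nsteatohepatitis",
    "Patient was gnashing his teeth.",
]
for sample in samples:
    print(has_term(sample), repr(sample))
```

Applied to the notebook's data, something like `notes[notes['text'].apply(has_term)]` would select the matching rows. Ignoring case is a trade-off: it also matches the surname "Nash", so dropping `re.IGNORECASE` for the abbreviation may be preferable on some note sets.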


In [9]:
# Display a random note that matches. You can rerun this cell to get another note.
# The fancy stuff is just highlighting the match to make it easier to find.

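import html

display_id = random.choice(matches)
text = notes[notes['row_id'] == display_id].iloc[0]['text']
# Escape the raw note first so characters like < and & render literally
text = html.escape(text)
for term in terms:
    text = text.replace(term, '<span style="color: red">' + term + '</span>')
display(HTML("<pre>" + text + "</pre>"))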
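
Since the stated goal is to find patients rather than individual notes, the matching notes can be grouped by `subject_id`. The sketch below shows one way to do that with plain Python; `matches_by_patient` is a hypothetical helper, and the rows are made-up examples, not MIMIC data.

```python
from collections import defaultdict

def matches_by_patient(rows, terms):
    """Map subject_id -> list of row_ids whose text contains any term.

    `rows` is any iterable of (row_id, subject_id, text) tuples.
    Matching is the same case-sensitive substring test used in the notebook.
    """
    by_patient = defaultdict(list)
    for row_id, subject_id, text in rows:
        if isinstance(text, str) and any(term in text for term in terms):
            by_patient[subject_id].append(row_id)
    return dict(by_patient)

# Made-up rows for illustration only.
toy_rows = [
    (1, 100, "Cirrhosis secondary to NASH."),
    (2, 100, "Follow-up for NASH cirrhosis."),
    (3, 200, "Alcoholic cirrhosis."),
    (4, 300, "Likely NASH given metabolic syndrome."),
]
patients = matches_by_patient(toy_rows, ["NASH"])
print(f"{len(patients)} patients with at least one matching note")
```

With the notebook's DataFrame, `notes[['row_id', 'subject_id', 'text']].itertuples(index=False)` yields tuples in the shape this helper expects. A mention of the term is only a candidate signal: notes that say a patient does not have NASH will match too, so the resulting list still needs review.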