<a href="https://colab.research.google.com/github/christinium/AIMed_Workshop_2018/blob/master/MIT%20Tutorial%20-%20Part%20C%20-%20Pneumothorax.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pneumothorax example

## Sentence tokenization, and spotting term + negation

This example spots mentions of terms from a "pneumothorax" lexicon in chest X-ray (CXR) reports and checks whether each spotted mention is negated.

*Joy Wu <joy.wu@ibm.com>*, *Daniel Gruhl <dgruhl@us.ibm.com>*

In [0]:
# Required imports
import requests
from requests.auth import HTTPBasicAuth
import json
import matplotlib.pyplot as plt
import subprocess
import tempfile
import os
import sys
import pandas as pd
import numpy as np
import re
import random
from nltk.tokenize import sent_tokenize



from google.colab import auth
from google.cloud import bigquery
from google.colab import files

**Authenticate:** The line of code below ensures you are an authenticated user accessing the MIMIC database. You will need to rerun this each time you open the notebook.

In [0]:
auth.authenticate_user() #This will allow you to authenticate access to BigQuery

**Query function:** `run_query()` executes a SQL query against the database and returns the result as a pandas DataFrame. Call it whenever you want to run a query.


In [0]:
project_id='new-zealand-2018-datathon'
os.environ["GOOGLE_CLOUD_PROJECT"]=project_id
# Read data from BigQuery into pandas dataframes.
def run_query(query):
  # The `verbose` argument is omitted because recent pandas versions no longer accept it
  return pd.io.gbq.read_gbq(query, project_id=project_id, configuration={'query':{'useLegacySql': False}})

### Sentence splitting:

In [0]:
# Read the sample CXR reports into a pandas dataframe, and print out a random report
#CXRreports = pd.read_csv('mimic3_1000cxrReports.csv')
#CXRreports.head()


CXRreports = run_query('''
SELECT * 
FROM `hst-953-2018.NLP_workshop.cxr`
''')
CXRreports.head()


In [0]:
# Print a random report (randint is inclusive at both ends, so stop at the last valid index)
report = CXRreports.text[random.randint(0, len(CXRreports) - 1)]
print(report)

In [0]:
# Import nltk and download the punkt sentence tokenizer models
import nltk
nltk.download('punkt')

In [0]:
# PunktSentenceTokenizer is used in the next cell to keep track of the character offsets of sentences
from nltk.tokenize.punkt import PunktSentenceTokenizer
# Tokenize the report into sentences with sent_tokenize from NLTK
sents = sent_tokenize(report.replace('\n', ' '))  # replace line breaks with spaces
# Print out the list of sentences:
for sent_count, s in enumerate(sents):
    print("Sentence " + str(sent_count) + ":")
    print(s)
    print()

In [0]:
sent_count = 0
for s_start, s_finish in PunktSentenceTokenizer().span_tokenize(report):
    print("Sentence " + str(sent_count) +": " + str([s_start, s_finish]))
    print(report[s_start:s_finish].replace('\n',' '))
    print()
    sent_count = sent_count + 1

### Spot occurrence(s) of word(s) related to your concept in a sentence or document

In [0]:
# Simple spotter: spot the occurrence of any term from a given lexicon anywhere within a document or sentence
def spotter(text, lexicon):
    text = text.lower()
    # Case-insensitive substring match against each (lower-case) lexicon term,
    # not worrying about negation detection yet.
    # Returns 1 if any term is mentioned, otherwise 0.
    return int(any(term in text for term in lexicon))

In [0]:
# The lexicon is a list of word(s) or phrase(s) referring to a concept of interest, e.g.
ptx = ['pneumothorax', 'ptx']
sent1 = 'Large left apical ptx present.'
sent2 = 'Hello world for NLP'


In [0]:
# Lexicon mentioned in text, so spotter returns 1 (yes)
spotter(sent1, ptx)

In [0]:
# Lexicon not mentioned in text, so spotter returns 0 (no)
spotter(sent2, ptx)
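The `spotter` above uses plain substring matching, so a short term can also match inside a longer, unrelated word. As a minimal sketch of one alternative, the hypothetical helper `spot_terms` below (not part of the original notebook) matches whole words only, using regular-expression word boundaries, and also returns the character offsets of each hit.

```python
import re

def spot_terms(text, lexicon):
    """Return (term, start, end) for each whole-word lexicon match in text."""
    if not lexicon:
        return []
    # Try longer terms first so multi-word phrases win over their sub-terms
    ordered = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(term) for term in ordered) + r")\b",
        re.IGNORECASE,
    )
    return [(m.group(0).lower(), m.start(), m.end()) for m in pattern.finditer(text)]

hits = spot_terms("Large left apical ptx present.", ["pneumothorax", "ptx"])
print(hits)  # [('ptx', 18, 21)]
```

The offsets can be compared with the sentence spans from `span_tokenize` if you need to know which sentence a mention belongs to.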

**How can we do better?**
We can make concept spotting far more sensitive if we curate a list of all the ways the concept could be expressed in raw text, including abbreviations and misspellings. This is what the NLP tool can help you achieve.

### Download a lexicon from the NLP tool:

In [0]:
import getpass
# Enter your team's username between the quotation marks:
user = "team?"
# Enter your team's password
#password = getpass.getpass()
# If the above doesn't work, then comment out the password variable above and hard code your team's password below:
password = '(your teams password)'

In [0]:
# This is the id of the lexicon - you can see it in the URL line when you are working with the lexicon
# For example, for pneumothorax, it is:
oid = ".2.48"
# You can do this in a loop to download all relevant lexicons into a data format you prefer too

In [0]:
# Don't spam with insecure warnings - some machines do not have all signing authority
# root certificates preinstalled
import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

In [0]:
# The endpoints for the REST API
host = "https://dla.sl.res.ibm.com"
lexurl = host + "/oid" + oid.replace('.', '/')
quartermaster =  host + "/search"

In [0]:
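# Set up auth and get the lexicon, then pull the terms out and lower-case them.
# The credentials get their own name so they do not shadow the google.colab `auth` module imported above.
credentials = HTTPBasicAuth(user, password)
# verify=False skips TLS certificate verification (see the warning suppression above)
lex = requests.get(lexurl, verify=False, auth=credentials).json()
terms = [member["surfaceForm"].lower() for member in lex["members"]]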
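To make the URL construction and term extraction easier to reuse in a loop over several lexicon ids, they could be wrapped in small helpers. This is only a sketch: `lexicon_url`, `extract_terms`, and the `mock_response` dictionary are hypothetical names and data, and the response shape (a `members` list with `surfaceForm` entries) is assumed from the request above. No network call is made here.

```python
def lexicon_url(host, oid):
    """Build the REST URL for a lexicon id such as '.2.48'."""
    return host + "/oid" + oid.replace(".", "/")

def extract_terms(lex):
    """Lower-case the surface forms and drop duplicates, keeping first-seen order."""
    return list(dict.fromkeys(m["surfaceForm"].lower() for m in lex["members"]))

# Hypothetical response, shaped like the JSON the notebook reads from the service
mock_response = {
    "members": [
        {"surfaceForm": "Pneumothorax"},
        {"surfaceForm": "PTX"},
        {"surfaceForm": "ptx"},
    ]
}

print(lexicon_url("https://dla.sl.res.ibm.com", ".2.48"))  # https://dla.sl.res.ibm.com/oid/2/48
print(extract_terms(mock_response))                        # ['pneumothorax', 'ptx']
```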

In [0]:
# Printing out the pneumothorax lexicon (after 5 minutes of curating work on the NLP tool)
ptx = terms.copy()
print(ptx)

['pneumothorax', 'ptx', 'pneumothoraces', 'pnuemothorax', 'pnumothorax', 'pntx', 'penumothorax', 'pneomothorax', 'pneumonthorax', 'pnemothorax', 'pneumothoraxes', 'pneumpthorax', 'pneuomthorax', 'pneumothorx', 'pneumothrax', 'pneumothroax', 'pneumothraces', 'pneunothorax', 'enlarging pneumo', 'pneumothoroax', 'pneuothorax']


### Negation detection

In [0]:
# But it's not enough to just spot word occurrences to determine if a concept is affirmative (positive/present) or not.

# e.g. lexicon mentioned in text but negated, a simple spotter would still return 1 (yes)
sent3 = 'Pneumothorax has resolved.'
spotter(sent3, ptx)

In [0]:
# However, if negation-related words occur in close proximity (e.g. same sentence) to a spotted concept,
# then we can write some rules to determine whether the concept was negated or not

# e.g. spotting negation words in the same sentence:
neg = ['no','never','not','removed', 'ruled out', 'resolved']
spotter(sent3, neg)
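Because `spotter` matches substrings, a cue such as 'no' also fires on words like "normal" or "noted". The sketch below shows one way such a rule could be written: it splits the sentence into word tokens and looks for a cue within a few tokens of the spotted term. The function `is_negated`, the `window` size, and the split into cues that precede or follow the concept are assumptions made for illustration; this is a toy rule, not the NegEx algorithm. It handles single-word terms and cues only, so 'ruled out' from the `neg` list is left out.

```python
import re

# Single-word cues from the `neg` list above, split by where they usually sit
PRE_CUES = {"no", "never", "not"}      # expected before the concept
POST_CUES = {"removed", "resolved"}    # expected after the concept

def is_negated(sentence, lexicon, window=5):
    """Toy rule: True if a cue lies within `window` tokens of a spotted term."""
    terms = {t.lower() for t in lexicon}
    tokens = re.findall(r"[a-z]+", sentence.lower())
    for i, tok in enumerate(tokens):
        if tok not in terms:
            continue
        before = set(tokens[max(0, i - window):i])
        after = set(tokens[i + 1:i + 1 + window])
        if before & PRE_CUES or after & POST_CUES:
            return True
    return False

ptx_terms = ["pneumothorax", "ptx"]
print(is_negated("There is no evidence of ptx.", ptx_terms))         # True
print(is_negated("Pneumothorax has resolved.", ptx_terms))           # True
print(is_negated("Normal heart size, small ptx noted.", ptx_terms))  # False
```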

### Using off-the-shelf python library for negation, e.g. Negex

In [0]:
# Download a copy of negex.py and negex_triggers.txt into this environment; the next cell shows how to use them
!wget https://stuff.mit.edu/~cwc76/hst953/negex.py
!wget https://stuff.mit.edu/~cwc76/hst953/negex_triggers.txt
  


In [0]:
import negex
# Load and sort the Negex trigger rules; the file is closed automatically
with open('negex_triggers.txt') as rfile:
    irules = negex.sortRules(rfile.readlines())

# Example:
sent = "There is no evidence of ptx."
ptx = ['pneumothorax', 'ptx']
tagger = negex.negTagger(sentence = sent, phrases = ptx, rules = irules, negP=False)
negation = tagger.getNegationFlag()
negation

In [0]:
# Applying Negex to a random note:
# Specify the lexicon of interest ("phrases" for Negex)
ptx = terms.copy()
# Get a random note from the dataset (randint is inclusive at both ends):
note = CXRreports['text'][random.randint(0, len(CXRreports) - 1)]
# Tokenize the sentences:
sents = sent_tokenize(note.replace('\n', ' '))  # replace line breaks with spaces
# Apply the spotter function to each sentence; count is the sentence's index within the note
#neg_output = []
for count, sent in enumerate(sents):
    # Apply Negex only if a term in the ptx lexicon is spotted
    if spotter(sent, ptx) == 1:
        tagger = negex.negTagger(sentence = sent, phrases = ptx, rules = irules, negP=False)
        negation = tagger.getNegationFlag()
        #neg_output.append(negation)
        print("Sentence " + str(count) + ":\n" + sent + "\nNegex output: " + negation + '\n')

In [0]:
# Show the relevant CXR report for the analysis
print(note)

### Exercise for you:

You can use a similar or improved pipeline to loop through all the notes in your dataset and through different concepts/lexicons!
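As a starting point, the loop could be organised roughly as follows. This is a sketch under stated assumptions rather than a reference solution: `label_note`, `toy_split`, `toy_is_negated`, the sample `notes`, and the `lexicons` dictionary are all hypothetical stand-ins so that the block runs on its own. In the notebook you would pass in `sent_tokenize` and a Negex-based check instead, and read the notes from `CXRreports`.

```python
import re

def label_note(note, lexicon, split_sentences, is_negated):
    """Return 'affirmed', 'negated', or 'not mentioned' for one note."""
    spotted = [s for s in split_sentences(note)
               if any(term in s.lower() for term in lexicon)]
    if not spotted:
        return "not mentioned"
    # Affirmed if at least one mentioning sentence is not negated
    if any(not is_negated(s) for s in spotted):
        return "affirmed"
    return "negated"

# Toy stand-ins (assumptions for illustration, not the notebook's tools)
def toy_split(text):
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def toy_is_negated(sentence):
    return re.search(r"\b(no|not|resolved)\b", sentence.lower()) is not None

notes = [
    "Heart size is normal. No pneumothorax.",
    "Small right apical ptx. No effusion.",
    "Lungs are clear.",
]
lexicons = {"pneumothorax": ["pneumothorax", "ptx"], "effusion": ["effusion"]}

# One list of labels per concept, in the same order as `notes`
results = {
    name: [label_note(n, lex, toy_split, toy_is_negated) for n in notes]
    for name, lex in lexicons.items()
}
print(results["pneumothorax"])  # ['negated', 'affirmed', 'not mentioned']
print(results["effusion"])      # ['not mentioned', 'negated', 'not mentioned']
```

Passing the sentence splitter and the negation check in as arguments keeps the loop independent of any one tool, so you can compare a simple rule against Negex on the same notes.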