# Pneumothorax example

## Sentence tokenization, and spotting term + negation

This example spots mentions of terms from the "pneumothorax" lexicon in chest X-ray (CXR) reports and checks whether each spotted mention is negated.

*Joy Wu <joy.wu@ibm.com>*, *Daniel Gruhl <dgruhl@us.ibm.com>*

In [92]:
# Required imports
import requests
from requests.auth import HTTPBasicAuth
import json
import matplotlib.pyplot as plt
import subprocess
import tempfile
import os
import sys
import pandas as pd
import numpy as np
import re
import random
from nltk.tokenize import sent_tokenize

### Sentence splitting:

In [93]:
# Read the sample CXR reports into a pandas dataframe and preview the first rows
CXRreports = pd.read_csv('mimic3_1000cxrReports.csv')
CXRreports.head()

Unnamed: 0,subject_id,hadm_id,row_id,text
0,6451,183196.0,750185,[**2164-12-6**] 8:26 PM\n CHEST (PORTABLE AP) ...
1,23781,195460.0,745781,[**2165-9-28**] 2:50 AM\n CHEST (PORTABLE AP) ...
2,24552,,738661,[**2153-1-12**] 10:03 PM\n CHEST (PORTABLE AP)...
3,10118,146001.0,745821,[**2194-10-11**] 4:02 PM\n CHEST (PORTABLE AP)...
4,13101,123718.0,740116,[**2123-8-22**] 9:46 AM\n CHEST (PORTABLE AP) ...


In [94]:
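# Pick a random report (random.randint includes both endpoints, so stop at the last valid index)
report = CXRreports.text[random.randint(0, len(CXRreports) - 1)]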
print(report)

[**2194-1-23**] 4:13 PM
 CHEST (PORTABLE AP)                                             Clip # [**Clip Number (Radiology) 94843**]
 Reason: 29 yo wiht recent r. ij plecement-pulled rij out 2 cm-check
 ______________________________________________________________________________
 [**Hospital 3**] MEDICAL CONDITION:
  29 year old man with
 REASON FOR THIS EXAMINATION:
  29 yo wiht recent r. ij plecement-pulled rij out 2 cm-check for line placement
  thank you
 ______________________________________________________________________________
                                 FINAL REPORT
 PORTABLE CHEST: Compared to previous study of earlier the same date.

 INDICATION: Central line placement repositioning.

 A right internal jugular central venous catheter is present and terminates in
 the distal superior vena cava. An ETT is in satisfactory position and an NG
 tube is coiled in the stomach.

 Cardiac and mediastinal contours are stable. Bilateral asymmetric alveolar
 pattern affecting the

In [95]:
# Tokenize the sentences with sent_tokenize from NLTK
sents = sent_tokenize(report.replace('\n',' ')) # removing new line breaks
# Print out list of sentences:
for sent_count, s in enumerate(sents):
    print("Sentence " + str(sent_count) + ":")
    print(s)
    print()

Sentence 0:
[**2194-1-23**] 4:13 PM  CHEST (PORTABLE AP)                                             Clip # [**Clip Number (Radiology) 94843**]  Reason: 29 yo wiht recent r. ij plecement-pulled rij out 2 cm-check  ______________________________________________________________________________  [**Hospital 3**] MEDICAL CONDITION:   29 year old man with  REASON FOR THIS EXAMINATION:   29 yo wiht recent r. ij plecement-pulled rij out 2 cm-check for line placement   thank you  ______________________________________________________________________________                                  FINAL REPORT  PORTABLE CHEST: Compared to previous study of earlier the same date.

Sentence 1:
INDICATION: Central line placement repositioning.

Sentence 2:
A right internal jugular central venous catheter is present and terminates in  the distal superior vena cava.

Sentence 3:
An ETT is in satisfactory position and an NG  tube is coiled in the stomach.

Sentence 4:
Cardiac and mediastinal contours are st

In [96]:
from nltk.tokenize.punkt import PunktSentenceTokenizer
# Alternatively, tokenize with PunktSentenceTokenizer from NLTK if you want to keep track of character offsets of sentences
for sent_count, (s_start, s_finish) in enumerate(PunktSentenceTokenizer().span_tokenize(report)):
    print("Sentence " + str(sent_count) + ": " + str([s_start, s_finish]))
    print(report[s_start:s_finish].replace('\n',' '))
    print()

Sentence 0: [0, 659]
[**2194-1-23**] 4:13 PM  CHEST (PORTABLE AP)                                             Clip # [**Clip Number (Radiology) 94843**]  Reason: 29 yo wiht recent r. ij plecement-pulled rij out 2 cm-check  ______________________________________________________________________________  [**Hospital 3**] MEDICAL CONDITION:   29 year old man with  REASON FOR THIS EXAMINATION:   29 yo wiht recent r. ij plecement-pulled rij out 2 cm-check for line placement   thank you  ______________________________________________________________________________                                  FINAL REPORT  PORTABLE CHEST: Compared to previous study of earlier the same date.

Sentence 1: [662, 711]
INDICATION: Central line placement repositioning.

Sentence 2: [714, 823]
A right internal jugular central venous catheter is present and terminates in  the distal superior vena cava.

Sentence 3: [824, 900]
An ETT is in satisfactory position and an NG  tube is coiled in the stomach.

Sentence 
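
Character offsets become useful once you want to record where in the report a term was found. The following is a minimal sketch, not part of the original notebook: `locate_terms` is a hypothetical helper that takes sentence spans in the same `(start, end)` form that `span_tokenize` yields and returns document-level offsets for each lexicon match. The two-sentence text and its spans are hand-written for illustration.

```python
import re

def locate_terms(text, spans, lexicon):
    """Return (sentence_index, term, doc_start, doc_end) for every lexicon match.

    `spans` is a list of (start, end) character offsets, one per sentence.
    """
    hits = []
    for idx, (start, end) in enumerate(spans):
        sentence = text[start:end]
        for term in lexicon:
            # re.escape keeps any punctuation in a term from acting as regex syntax
            for m in re.finditer(re.escape(term), sentence, flags=re.IGNORECASE):
                # Shift sentence-relative offsets back into document coordinates
                hits.append((idx, term, start + m.start(), start + m.end()))
    return hits

# Hand-written example text and sentence spans (assumed, for illustration only)
example_text = "No pneumothorax. Small left ptx is present."
example_spans = [(0, 16), (17, 43)]
example_lexicon = ['pneumothorax', 'ptx']

for hit in locate_terms(example_text, example_spans, example_lexicon):
    print(hit)
# (0, 'pneumothorax', 3, 15)
# (1, 'ptx', 28, 31)
```

Because the offsets refer to the original report string, `report[doc_start:doc_end]` would recover the matched text without re-tokenizing.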

### Spot occurrence(s) of word(s) related to your concept in a sentence or document

In [97]:
# Simple spotter: Spot occurrence of a term in a given lexicon anywhere within a text document or sentence:
def spotter(text, lexicon):
    text = text.lower()
    # Spot if a document mentions any of the terms in the lexicon (not worrying about negation detection yet)
    match = [x in text for x in lexicon]
    mentioned = 1 if any(match) else 0
    return mentioned

In [98]:
# Where the lexicon is a list of word(s) or phrase(s) referring to a concept of interest to you, e.g.
ptx = ['pneumothorax', 'ptx']
sent1 = 'Large left apical ptx present.'
sent2 = 'Hello world for NLP'


In [99]:
# lexicon mentioned in text, spotter returns 1 (yes)
spotter(sent1, ptx)

1

In [100]:
# lexicon not mentioned in text, spotter returns 0 (no)
spotter(sent2, ptx)

0

**How can we do better?**
Spotting becomes much more sensitive if we curate a list of all the ways the concept can be expressed in raw text, including abbreviations and misspellings. This is what the NLP tool helps you do.

### Download a lexicon from the NLP tool:

In [106]:
import getpass
# Enter your team's username between the quotation marks:
user = "team1"
# Enter your team's password
#password = getpass.getpass()
# If the above doesn't work, then comment out the password variable above and hard code your team's password below:
password = 'sends reforms capture mileage'

In [107]:
# This is the id of the lexicon - you can see it in the URL line when you are working with the lexicon
# For example, for pneumothorax, it is:
oid = ".2.48"
# You can do this in a loop to download all relevant lexicons into a data format you prefer too

In [108]:
# Don't spam with insecure warnings - some machines do not have all signing authority
# root certificates preinstalled
import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

In [109]:
# The REST endpoints
host = "https://dla.sl.res.ibm.com"
lexurl = host + "/oid" + oid.replace('.', '/')
quartermaster =  host + "/search"

In [110]:
# Set up auth and get the lexicon. Then pull the terms out and lower case them
auth=(user,password)
lex = requests.get(lexurl, verify=False, auth=auth).json()
terms = [x["surfaceForm"].lower() for x in lex["members"]]

In [111]:
# Printing out the pneumothorax lexicon (after 5 minutes of curating work on the NLP tool)
ptx = terms.copy()
print(ptx)

['pneumothorax', 'ptx', 'pneumothoraces', 'pnuemothorax', 'pnumothorax', 'pntx', 'penumothorax', 'pneomothorax', 'pneumonthorax', 'pnemothorax', 'pneumothoraxes', 'pneumpthorax', 'pneuomthorax', 'pneumothorx', 'pneumothrax', 'pneumothroax', 'pneumothraces', 'pneunothorax', 'enlarging pneumo', 'pneumothoroax', 'pneuothorax']
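
The earlier note about downloading several lexicons in a loop could be sketched roughly as follows. This is an illustration under assumptions, not the tool's documented API: the helper names (`lexicon_url`, `extract_terms`, `download_lexicons`) are hypothetical, the URL scheme and the `members`/`surfaceForm` fields are taken from the cells above, and the fetch step is passed in as a function so the example runs offline against a fake payload.

```python
def lexicon_url(host, oid):
    # Same URL construction as the cell above: ".2.48" becomes "/oid/2/48"
    return host + "/oid" + oid.replace(".", "/")

def extract_terms(lex):
    # Pull the lower-cased surface forms out of one lexicon payload
    return [member["surfaceForm"].lower() for member in lex["members"]]

def download_lexicons(host, oids, fetch_json):
    """Map each concept name to its list of terms.

    `oids` maps a concept name to a lexicon id; `fetch_json` is any callable
    that takes a URL and returns the decoded JSON payload.
    """
    return {name: extract_terms(fetch_json(lexicon_url(host, oid)))
            for name, oid in oids.items()}

# Offline stand-in for the server (fabricated payload, for illustration only)
demo_host = "https://dla.sl.res.ibm.com"
fake_server = {
    demo_host + "/oid/2/48": {"members": [{"surfaceForm": "Pneumothorax"},
                                          {"surfaceForm": "PTX"}]},
}
demo_lexicons = download_lexicons(demo_host, {"pneumothorax": ".2.48"}, fake_server.get)
print(demo_lexicons)
# {'pneumothorax': ['pneumothorax', 'ptx']}
```

Against the real service, `fetch_json` could be `lambda url: requests.get(url, verify=False, auth=auth).json()`, reusing the `auth` tuple defined above, with one entry in `oids` per lexicon you have curated.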


### Negation detection

In [112]:
# But it's not enough to just spot word occurrences to determine if a concept is affirmative (positive/present) or not.

# e.g. lexicon mentioned in text but negated, a simple spotter would still return 1 (yes)
sent3 = 'Pneumothorax has resolved.'
spotter(sent3, ptx)

1

In [113]:
# However, if negation-related words occur in close proximity (e.g. the same sentence) to a spotted concept,
# then we can write some rules to determine whether the concept was negated or not

# e.g. spotting negation words in the same sentence:
neg = ['no','never','not','removed', 'ruled out']
spotter(sent2, neg)

0
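
One caveat with the simple spotter is that it matches substrings, so a short negation word such as 'no' is also found inside 'normal' or 'nothing'. A possible refinement, sketched here as an assumption rather than as part of the original notebook, is to match whole words or phrases only. The helper name `spot_terms` is hypothetical, and it returns the matched terms rather than a 0/1 flag.

```python
import re

def spot_terms(text, lexicon):
    """Return the lexicon terms that occur in text as whole words or phrases."""
    found = []
    for term in lexicon:
        # (?<!\w) and (?!\w) require that the term is not glued to other word characters
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(term)
    return found

neg_words = ['no', 'never', 'not', 'removed', 'ruled out']

# 'Normal' and 'nothing' contain 'no' and 'not' only as substrings, so nothing is spotted
print(spot_terms('Normal heart size, nothing acute.', neg_words))
# []

# Whole-word and multi-word matches are still found
print(spot_terms('Pneumothorax has not been ruled out.', neg_words))
# ['not', 'ruled out']
```

The same whole-word matching could be applied to the concept lexicon itself, at the cost of missing terms that are run together with other characters in the raw text.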

### Using an off-the-shelf Python library for negation, e.g. Negex

In [114]:
import negex
with open('negex_triggers.txt') as rfile:
    irules = negex.sortRules(rfile.readlines())

# Example:
sent = "There is no evidence of ptx."
#ptx = ['pneumothorax', 'ptx']
tagger = negex.negTagger(sentence = sent, phrases = ptx, rules = irules, negP=False)
negation = tagger.getNegationFlag()
negation

'negated'

In [121]:
# Applying Negex to a random note:
# Specify the lexicon of interest ("phrases" for Negex)
ptx = terms.copy()
# Get a random note from the dataset (random.randint includes both endpoints):
note = CXRreports['text'][random.randint(0, len(CXRreports) - 1)]
# Tokenize the sentences:
sents = sent_tokenize(note.replace('\n',' ')) # replacing new line breaks
# Apply the spotter function to each sentence:
#neg_output = []
count = 0
for sent in sents:
    # Apply Negex if a term in the ptx lexicon is spotted
    if spotter(sent,ptx) == 1:
        tagger = negex.negTagger(sentence = sent, phrases = ptx, rules = irules, negP=False)
        negation = tagger.getNegationFlag()
        #neg_output.append(negation)
        print("Sentence " + str(count) + ":\n" + sent + "\nNegex output: " + negation + '\n')
        count = count + 1

Sentence 0:
There is no focal  consolidation, pleural effusion, or pneumothorax.
Negex output: negated



In [122]:
# Show the relevant CXR report for the analysis
print(note)

[**2100-7-17**] 7:22 AM
 CHEST (PORTABLE AP)                                             Clip # [**Clip Number (Radiology) 70021**]
 Reason: pt with acidosis hx chf
 ______________________________________________________________________________
 [**Hospital 2**] MEDICAL CONDITION:
  74 year old man with ERSD, CAD, PVD, DM s/p AKA w/ worsening acidosis, rales
 REASON FOR THIS EXAMINATION:
  pt with acidosis hx chf
 ______________________________________________________________________________
                                 FINAL REPORT
 INDICATION:  Worsening acidosis and rales.

 COMPARISONS:  [**2100-7-16**]

 PORTABLE AP CHEST:  The left internal jugular central venous catheter tip is
 in the left brachiocephalic vein. The cardiomediastinal silhouette is stable.
 There are multiple bilateral calcified pleural plaques. There is no focal
 consolidation, pleural effusion, or pneumothorax.  Allowing for differences in
 technique and positioning, there has been no significant change.

 

### Exercise for you:

You can use a similar or improved pipeline to loop through all the notes in your dataset and through different concepts/lexicons!
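
As a starting point for the exercise, here is one possible shape for such a loop. It is a sketch under stated assumptions: `label_notes` is a hypothetical helper, the rule "a note is affirmed if any sentence mentions the concept without negation" is one choice among several, and the sentence splitter and negation check are passed in as functions so that crude stand-ins can be used here in place of NLTK and Negex.

```python
import re

def label_notes(notes, lexicon, split_sentences, is_negated):
    """Label each note as 'affirmed', 'negated' or 'not mentioned' for one lexicon."""
    labels = []
    for note in notes:
        mentioned = False
        affirmed = False
        for sent in split_sentences(note.replace('\n', ' ')):
            # Same substring test as the simple spotter above
            if any(term in sent.lower() for term in lexicon):
                mentioned = True
                if not is_negated(sent):
                    affirmed = True
        if affirmed:
            labels.append('affirmed')
        elif mentioned:
            labels.append('negated')
        else:
            labels.append('not mentioned')
    return labels

# Crude stand-ins so the sketch runs without NLTK or Negex (assumptions, not the real tools)
def naive_split(text):
    return re.split(r"(?<=\.)\s+", text)

def naive_negation(sent):
    return re.search(r"\bno\b", sent, flags=re.IGNORECASE) is not None

demo_notes = [
    "Heart size is normal. There is no pneumothorax.",
    "Large left apical ptx present.",
    "Lungs are clear.",
]
print(label_notes(demo_notes, ['pneumothorax', 'ptx'], naive_split, naive_negation))
# ['negated', 'affirmed', 'not mentioned']
```

In the notebook itself, `split_sentences` could be `sent_tokenize`, and `is_negated` could wrap the `negex.negTagger(...)` call shown above and compare `getNegationFlag()` with `'negated'`. Calling the function once per lexicon would cover multiple concepts.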