# Part C - Sentence Tokenization and Negation Detection

A term can be mentioned in different contexts in a sentence or paragraph. For example, the mention could be affirmative, negated, hypothetical, probable (hedged), or about an unrelated subject.

We will use "pneumothorax" in chest X-ray (CXR) reports as an example here. After spotting CXR reports that mention pneumothorax, we will first tokenize the sentences in each report with NLTK, then try to determine whether pneumothorax was affirmed or negated with Negex.

In [1]:
# Basic required libraries
import pandas as pd
import numpy as np
import random
import nltk

## Accessing notes data

#### Option 1: Copy, paste and run the following SQL command in Query Builder and rename the downloaded file as "part_c.csv". Make sure the file is in the same directory as this notebook.

SELECT row_id, subject_id, hadm_id, description, text
FROM NOTEEVENTS 
WHERE description IN (
'P CHEST (PORTABLE AP) PORT',  'P CHEST PORT. LINE PLACEMENT PORT', 'TRAUMA #3 (PORT CHEST ONLY)', 'OP CHEST (SINGLE VIEW) IN O.R. PORT', 'P CHEST (PRE-OP AP ONLY) PORT',
'CHEST PORT. LINE PLACEMENT', 'CHEST PORTABLE LINE PLACMENT', 'P CHEST (SINGLE VIEW) PORT',
'CHEST AP ONLY', 'O CHEST SGL VIEW/LINE PLACEMENT IN O.R.', 'CHEST (PORTABLE AP)',
'PO CHEST (SINGLE VIEW) PORT IN O.R.', 'O CHEST (PORTABLE AP) IN O.R.', 'CHEST (PRE-OP AP ONLY)',
'CHEST (SINGLE VIEW)', 'P CHEST SGL VIEW/LINE PLACEMENT PORT')
LIMIT 100;


In [2]:
# Then import the data into the notebook with the following code
with open('part_c.csv') as cxr_reports:
    df_cxr = pd.read_csv(cxr_reports)

#### Option 2: Uncomment the cells below (command+/) if you already have mimiciii set up locally as a SQL database

In [3]:
# Query CXR reports from MIMICIII database
# cxr_query ="""SELECT row_id, subject_id, hadm_id, description, text
# FROM mimiciii.NOTEEVENTS 
# WHERE description IN (
# 'P CHEST (PORTABLE AP) PORT', 
# 'P CHEST PORT. LINE PLACEMENT PORT',
# 'TRAUMA #3 (PORT CHEST ONLY)',
# 'OP CHEST (SINGLE VIEW) IN O.R. PORT',
# 'P CHEST (PRE-OP AP ONLY) PORT',
# 'CHEST PORT. LINE PLACEMENT',
# 'CHEST PORTABLE LINE PLACMENT',
# 'P CHEST (SINGLE VIEW) PORT',
# 'CHEST AP ONLY',
# 'O CHEST SGL VIEW/LINE PLACEMENT IN O.R.',
# 'CHEST (PORTABLE AP)',
# 'PO CHEST (SINGLE VIEW) PORT IN O.R.',
# 'O CHEST (PORTABLE AP) IN O.R.',
# 'CHEST (PRE-OP AP ONLY)',
# 'CHEST (SINGLE VIEW)',
# 'P CHEST SGL VIEW/LINE PLACEMENT PORT')
# LIMIT 100;"""

In [4]:
# # Data access - Uncomment this block of code if you have set up mimiciii with MySQL
# import pymysql
# params = {'database': 'mimic', 'user': 'XXXXX', 'password': 'YYYYY', 'host': 'localhost'}
# conn = pymysql.connect(**params)
# #conn = pymysql.connect(db='mimiciii', user='popcorn', password='Butter!', host='localhost')

# # Now load the data.
# df_cxr = pd.read_sql_query(cxr_query, conn)

In [5]:
# # Data access - Uncomment this block of code if you have set up mimiciii with Postgres
# import psycopg2
# params = {'database': 'mimic', 'user': 'XXXXX', 'password': 'YYYYY', 'host': 'localhost'}
# conn = psycopg2.connect(**params)

# # Now load the data
# df_cxr = pd.read_sql(cxr_query, conn)

## Start NLP Exercises

In [6]:
print(df_cxr.shape)
df_cxr.head()

(100, 5)


Unnamed: 0,row_id,subject_id,hadm_id,description,text
0,739392,65535,178280.0,CHEST (PORTABLE AP),[**2193-5-26**] 4:25 PM\n CHEST (PORTABLE AP) ...
1,739434,16913,141587.0,CHEST (PORTABLE AP),[**2175-7-6**] 6:21 PM\n CHEST (PORTABLE AP) ...
2,739875,24248,109789.0,CHEST (PORTABLE AP),[**2158-5-31**] 3:15 PM\n CHEST (PORTABLE AP) ...
3,740139,1287,,CHEST (PORTABLE AP),[**2138-7-8**] 9:27 AM\n CHEST (PORTABLE AP) ...
4,740182,2686,195175.0,CHEST (PORTABLE AP),[**2179-8-4**] 7:28 PM\n CHEST (PORTABLE AP) ...


In [7]:
len(df_cxr)

100

### Spotting reports that mention pneumothorax:

In [8]:
# Simple spotter: Spot occurrence of a term in a given lexicon anywhere within a text document or sentence:
def spotter(text, lexicon):
    text = text.lower()
    # Spot if a document mentions any of the terms in the lexicon (not worrying about negation detection yet)
    # Returns 1 if at least one term occurs as a substring of the text, otherwise 0
    return int(any(term in text for term in lexicon))

In [9]:
# Lexicon for pneumothorax: this is a list of terms that are synonyms for pneumothorax which we will use to spot the concept in CXR reports:
ptx = ['pneumothorax', 'ptx', 'pneumothoraces', 'pnuemothorax', 'pnumothorax', 'pntx', 'penumothorax', 'pneomothorax', 'pneumonthorax', 'pnemothorax', 'pneumothoraxes', 'pneumpthorax', 'pneuomthorax', 'pneumothorx', 'pneumothrax', 'pneumothroax', 'pneumothraces', 'pneunothorax', 'enlarging pneumo', 'pneumothoroax', 'pneuothorax']

In [10]:
# Example sentences for the spotter:
sent1 = 'Large left apical ptx present.'
sent2 = 'Hello world for NLP negation'

In [11]:
# Pneumothorax mentioned in text, spotter returns 1 (yes)
spotter(sent1, ptx)

1

In [12]:
# Pneumothorax not mentioned in text, spotter returns 0 (no)
spotter(sent2, ptx)

0

In [13]:
# Now loop spotter through all the "reports" and output report ids (row_id) that mention pneumothorax
rowids = []
for i in df_cxr.index:
    text = df_cxr["text"][i]
    rowid = df_cxr["row_id"][i]
    if spotter(text, ptx) == 1:
        rowids.append(rowid)

In [14]:
# Obviously, not every report that mentions pneumothorax implies that the patient has the condition
len(rowids)

53

#### Edge cases to think about:
1. Are spaces before and/or after a term important, i.e. could they alter the meaning of the spot?
2. Does punctuation before and/or after a term matter?

#### What could you do to deal with edge cases?
1. Use regular expressions for spotting
2. Or simply add variations with punctuation or spaces on either end of each term in the lexicon (e.g. " ptx/")
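
As a minimal sketch of the regular-expression option, the function below matches lexicon terms only at word boundaries. The name `regex_spotter` and the short `demo_lexicon` are illustrative assumptions, not part of the original notebook.

```python
import re

def regex_spotter(text, lexicon):
    """Return 1 if any lexicon term occurs as a whole word or phrase in text, else 0."""
    # re.escape protects any special characters in a term; \b anchors the match
    # at word boundaries so 'ptx' does not match inside a longer token.
    pattern = r'\b(?:' + '|'.join(re.escape(term) for term in lexicon) + r')\b'
    return int(re.search(pattern, text, flags=re.IGNORECASE) is not None)

# Hypothetical short lexicon, for illustration only
demo_lexicon = ['pneumothorax', 'ptx']
print(regex_spotter('Large left apical ptx present.', demo_lexicon))  # 1
print(regex_spotter('R/O PTX, line placement.', demo_lexicon))        # 1
print(regex_spotter('File saved as report.aptx', demo_lexicon))       # 0
```

One trade-off to keep in mind: with word boundaries, a deliberately partial term such as 'enlarging pneumo' would no longer match 'enlarging pneumothorax' on its own, so the lexicon and the pattern need to be designed together.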

### Negation detection at its simplest:

But it's not enough to just spot word occurrences to determine if a term mentioned in text is affirmative (positive/present) or not (negated).

In [15]:
# e.g. Pneumothorax mentioned in text but negated, a simple spotter would still return 1 (yes)
sent3 = 'Pneumothorax has resolved.'
spotter(sent3, ptx)

1

However, if negation-related words occur in close proximity (e.g. in the same sentence) to a spotted term, then we can write some rules to determine whether the concept was negated.

In [16]:
# e.g. Simply spotting negation words in the same sentence:
neg = ['no','never','not','removed', 'ruled out', 'resolved']
spotter(sent3, neg)

1
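
The same-sentence idea can be sketched as one small rule that combines both lexicons. This is only an illustration: `simple_negation`, `demo_terms` and `demo_cues` are hypothetical names, and the rule uses word-boundary matching because plain substring matching would find the cue 'no' inside words such as 'now'.

```python
import re

def simple_negation(sentence, terms, neg_cues):
    """Label one sentence as 'not mentioned', 'negated' or 'affirmed'."""
    def has_any(words):
        # Whole-word, case-insensitive match of any entry in words
        pattern = r'\b(?:' + '|'.join(re.escape(w) for w in words) + r')\b'
        return re.search(pattern, sentence, flags=re.IGNORECASE) is not None

    if not has_any(terms):
        return 'not mentioned'
    # Any negation cue in the same sentence flips the label to 'negated'
    return 'negated' if has_any(neg_cues) else 'affirmed'

# Hypothetical short lexicons, for illustration only
demo_terms = ['pneumothorax', 'ptx']
demo_cues = ['no', 'never', 'not', 'removed', 'ruled out', 'resolved']
print(simple_negation('Pneumothorax has resolved.', demo_terms, demo_cues))          # negated
print(simple_negation('There is now a large pneumothorax.', demo_terms, demo_cues))  # affirmed
print(simple_negation('Heart size is normal.', demo_terms, demo_cues))               # not mentioned
```

A rule this crude ignores word order and scope, so a cue anywhere in the sentence negates the term even when it refers to something else.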

Luckily, smarter NLP folks have already written negation libraries that spot negated mentions of terms for us and work in more complicated cases! We just have to do some pre-processing of the input text.

### Sentence splitting before running negation detection with Negex:

Instructions for installing NLTK: https://www.nltk.org/install.html

In [17]:
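# Let's print a random report from df_cxr
# random.randint is inclusive at both ends, so the upper bound must be the last valid index
report = df_cxr.text[random.randint(0, len(df_cxr) - 1)]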
print(report)

[**2193-5-25**] 4:34 PM
 CHEST (PORTABLE AP)                                             Clip # [**Clip Number (Radiology) 48916**]
 Reason: s/p open heart surgery.Desaturating/agitated
 ______________________________________________________________________________
 [**Hospital 3**] MEDICAL CONDITION:
  66 year old woman with CODE AND PROBABLE STROKE, S/P AVR
 REASON FOR THIS EXAMINATION:
  s/p open heart surgery.Desaturating/agitated
 ______________________________________________________________________________
                                 FINAL REPORT
 PORTABLE CHEST AT 4:43PM [**2193-5-25**]

 HISTORY: Cardiac arrest, resent heart surgery.

 Limited examination due to considerable motion shows bilateral pleural
 effusions. A subclavian venous catheter is identified. ETT not currently
 identified. There is no obvious pneumothorax. Changes at the right base as
 before.

 IMPRESSION: Limited examination. Parenchymal changes similar to [**2193-5-24**].




In [18]:
# Simplest: Tokenize the sentences with sent_tokenize from NLTK
from nltk.tokenize import sent_tokenize
sents = sent_tokenize(report.replace('\n','  ')) # removing new line breaks

# Print out list of sentences:
sent_count = 0
for s in sents:
    print("Sentence " + str(sent_count) +":")
    print(s)
    print()
    sent_count = sent_count + 1

Sentence 0:
[**2193-5-25**] 4:34 PM   CHEST (PORTABLE AP)                                             Clip # [**Clip Number (Radiology) 48916**]   Reason: s/p open heart surgery.Desaturating/agitated   ______________________________________________________________________________   [**Hospital 3**] MEDICAL CONDITION:    66 year old woman with CODE AND PROBABLE STROKE, S/P AVR   REASON FOR THIS EXAMINATION:    s/p open heart surgery.Desaturating/agitated   ______________________________________________________________________________                                   FINAL REPORT   PORTABLE CHEST AT 4:43PM [**2193-5-25**]     HISTORY: Cardiac arrest, resent heart surgery.

Sentence 1:
Limited examination due to considerable motion shows bilateral pleural   effusions.

Sentence 2:
A subclavian venous catheter is identified.

Sentence 3:
ETT not currently   identified.

Sentence 4:
There is no obvious pneumothorax.

Sentence 5:
Changes at the right base as   before.

Sentence 6:
IMPRESSIO

In [19]:
# Alternatively, tokenize with PunktSentenceTokenizer from NLTK 
# This may be useful if, for some reason, you want to keep track of character offsets of sentences

from nltk.tokenize.punkt import PunktSentenceTokenizer
sent_count = 0
for s_start, s_finish in PunktSentenceTokenizer().span_tokenize(report):
    print("Sentence " + str(sent_count) +": " + str([s_start, s_finish]))
    print(report[s_start:s_finish].replace('\n','  ')) #important not to accidentally alter the character offsets
    print()
    sent_count = sent_count + 1

Sentence 0: [0, 655]
[**2193-5-25**] 4:34 PM   CHEST (PORTABLE AP)                                             Clip # [**Clip Number (Radiology) 48916**]   Reason: s/p open heart surgery.Desaturating/agitated   ______________________________________________________________________________   [**Hospital 3**] MEDICAL CONDITION:    66 year old woman with CODE AND PROBABLE STROKE, S/P AVR   REASON FOR THIS EXAMINATION:    s/p open heart surgery.Desaturating/agitated   ______________________________________________________________________________                                   FINAL REPORT   PORTABLE CHEST AT 4:43PM [**2193-5-25**]     HISTORY: Cardiac arrest, resent heart surgery.

Sentence 1: [658, 740]
Limited examination due to considerable motion shows bilateral pleural   effusions.

Sentence 2: [741, 784]
A subclavian venous catheter is identified.

Sentence 3: [785, 815]
ETT not currently   identified.

Sentence 4: [816, 849]
There is no obvious pneumothorax.

Sentence 5: [850, 88

### Using an open-source Python library for negation - Negex:

Download negex.python from: https://github.com/mongoose54/negex/tree/master/negex.python

You just need to save the "negex.py" and "negex_triggers.txt" files in your working directory for the notebook to run the negex examples below.

In [20]:
import negex
# Read the trigger negation rule file that comes with negex
with open('negex_triggers.txt') as rfile:
    irules = negex.sortRules(rfile.readlines())

In [21]:
# Simple general example:
sent = "There is no evidence of ptx."
ptx = ['pneumothorax', 'ptx', 'pneumothoraces', 'pnuemothorax', 'pnumothorax', 'pntx', 'penumothorax', 'pneomothorax', 'pneumonthorax', 'pnemothorax', 'pneumothoraxes', 'pneumpthorax', 'pneuomthorax', 'pneumothorx', 'pneumothrax', 'pneumothroax', 'pneumothraces', 'pneunothorax', 'enlarging pneumo', 'pneumothoroax', 'pneuothorax']
tagger = negex.negTagger(sentence = sent, phrases = ptx, rules = irules, negP=False)
negation = tagger.getNegationFlag()
negation

'negated'

In [22]:
# Subset from df_cxr notes that mention pneumothorax:
df_ptx = df_cxr.loc[df_cxr['row_id'].isin(rowids)].copy()
len(df_ptx)

53

In [23]:
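# Apply negex to the first note in the ptx dataset
# (.iloc selects by position; df_ptx keeps the row labels of df_cxr, so label 0 is not guaranteed to exist)
note = df_ptx.text.iloc[0]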

In [24]:
# Show the relevant CXR report for the analysis
print(note)

[**2193-5-26**] 4:25 PM
 CHEST (PORTABLE AP)                                             Clip # [**Clip Number (Radiology) 28825**]
 Reason: s/p placement of right chest tube for pleural effusion
 ______________________________________________________________________________
 [**Hospital 2**] MEDICAL CONDITION:
  66 year old woman with CODE AND PROBABLE STROKE, S/P AVR
 REASON FOR THIS EXAMINATION:
  s/p placement of right chest tube for pleural effusion
 ______________________________________________________________________________
                                 FINAL REPORT
 INDICATION:  66 year old woman status post code with probable stroke.  Status
 post placement of right chest tube for pleural effusion.

 Single AP portable upright view of the chest performed at 16:44 hours on [**5-26**].
 Comparison is made with prior portable AP view of the chest performed [**5-25**] at
 16:43 hours.

 SINGLE VIEW OF THE CHEST:  Clips and wires are again noted, consistent with
 prior median 

In [25]:
# Tokenize the sentences:
sents = sent_tokenize(note.replace('\n','  ')) # replacing new line breaks
# Applying spotter function to each sentence:
neg_output = []
count = 0
for sent in sents:
    # Apply Negex if a term in the ptx lexicon is spotted
    if spotter(sent,ptx) == 1:
        tagger = negex.negTagger(sentence = sent, phrases = ptx, rules = irules, negP=False)
        negation = tagger.getNegationFlag()
        neg_output.append(negation)
        print("Sentence " + str(count) + ":\n" + sent + "\nNegex output: " + negation + '\n')
        count = count + 1

Sentence 0:
There is no pneumothorax.
Negex output: negated



In [26]:
# If there are multiple sentences in the note that mention pneumothorax, we would be storing individual sentence's output in a list
neg_output

['negated']

In [27]:
# Example: Now loop through the first 100 notes in df_ptx (otherwise it would take a while to run on all 105233 reports)
results_ptx = df_ptx[:100].copy()
for i in results_ptx.index:
    note = results_ptx.text[i]
    sents = sent_tokenize(note.replace('\n','  '))
    neg_output = []
    rel_sents = []
    # If a sentence mentions pneumothorax
    for sent in sents:    
        if spotter(sent,ptx) == 1:
            tagger = negex.negTagger(sentence = sent, phrases = ptx, rules = irules, negP=False)
            negation = tagger.getNegationFlag()
            neg_output.append(negation)
            rel_sents.append(sent)
            print("Sentence: " + sent + "|" + "Negex output: " + negation + '\n')
    # Add a column in the results_ptx dataframe to "structure" the extracted ptx data
    results_ptx.loc[i, 'ptx_prediction'] = '|'.join(neg_output)
    # Add a column in the results_ptx dataframe to store the relevant sentences that mentioned ptx
    results_ptx.loc[i, 'ptx_sentences'] = '|'.join(rel_sents)

Sentence: There is no pneumothorax.|Negex output: negated

Sentence: No pneumothorax is present.|Negex output: negated

Sentence: There is no pneumothorax.|Negex output: negated

Sentence: No gross pneumothorax on the supine view.|Negex output: negated

Sentence: [**2115-6-11**] 4:18 AM   CHEST (PORTABLE AP)                                             Clip # [**Clip Number (Radiology) 92385**]   Reason: [**Name (NI) 1734**] FOR PTX, LINE PLACEMENT S/P L IJ IN OR   ______________________________________________________________________________   [**Hospital 2**] MEDICAL CONDITION:    68 year old man with ESRD due to DM, CVA, S/P KIDNEY TX AND L IJ PLACEMENT.|Negex output: affirmed

Sentence: REASON FOR THIS EXAMINATION:    [**Name (NI) 1734**] FOR PTX, LINE PLACEMENT S/P L IJ IN OR   ______________________________________________________________________________                                   FINAL REPORT   INDICATION:  Line placement in the O.R.|Negex output: affirmed

Sentence: No pn

Sentence: No pneumothorax is seen.|Negex output: negated

Sentence: No pneumothorax is visualized .|Negex output: negated

Sentence: No   pneumothorax is seen.|Negex output: negated

Sentence: Please assess chest tube position, PTX   ______________________________________________________________________________   [**Hospital 2**] MEDICAL CONDITION:    63 year old man with s/p LUL and chest wall resection for left sup sulcus    tumor.|Negex output: affirmed

Sentence: Please assess chest tube position, PTX   ______________________________________________________________________________                                   FINAL REPORT   INDICATION: S/P left upper lobe lobectomy for superior sulcus tumor.|Negex output: affirmed

Sentence: No definite   pneumothorax is demonstrated, however, the left CP angle is not well   visualized so pneumothorax cannot be entirely excluded.|Negex output: negated

Sentence: 2) No definite pneumothorax   demonstrated, however, pneumothorax cannot be exclud

In [28]:
results_ptx.head()

Unnamed: 0,row_id,subject_id,hadm_id,description,text,ptx_prediction,ptx_sentences
0,739392,65535,178280.0,CHEST (PORTABLE AP),[**2193-5-26**] 4:25 PM\n CHEST (PORTABLE AP) ...,negated,There is no pneumothorax.
1,739434,16913,141587.0,CHEST (PORTABLE AP),[**2175-7-6**] 6:21 PM\n CHEST (PORTABLE AP) ...,negated|negated,No pneumothorax is present.|There is no pneumo...
8,739960,13101,123718.0,CHEST (PORTABLE AP),[**2123-8-10**] 8:58 PM\n CHEST (PORTABLE AP) ...,negated,No gross pneumothorax on the supine view.
11,739334,17807,122482.0,CHEST (PORTABLE AP),[**2115-6-11**] 4:18 AM\n CHEST (PORTABLE AP) ...,affirmed|affirmed|negated|negated,[**2115-6-11**] 4:18 AM CHEST (PORTABLE AP) ...
12,739336,82202,161103.0,CHEST (PORTABLE AP),[**2102-5-23**] 11:22 AM\n CHEST (PORTABLE AP)...,negated,"There is no evidence of consolidations, pleura..."


In [29]:
# Export your structured results!
# tab delimited
results_ptx.to_csv("ptx_results.txt", sep = '\t', encoding='utf-8', index=False)
# as csv:
#results_ptx.to_csv("ptx_results.csv", index=False)

#### Comments:
You can see that Negex is not perfect at sentence-level prediction. Here, it does not pick up hypothetical mentions of pneumothorax: it treats "r/o ptx" as affirmed. However, at the whole-report level, later sentences might give a more accurate, negated prediction.
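
One possible direction, sketched under assumptions: the sentences wrongly labelled as affirmed in the output above come from the request header that precedes the "FINAL REPORT" line, so one could tokenize only the text after that marker. The helper name `report_body` and the miniature note are made up for illustration, and the fallback assumes that some reports may not contain the marker at all.

```python
def report_body(note, marker='FINAL REPORT'):
    """Return the text after the first marker, or the whole note if the marker is absent."""
    head, sep, body = note.partition(marker)
    # str.partition returns an empty separator when the marker is not found
    return body if sep else note

# Made-up miniature note, for illustration only
demo_note = ("Reason: r/o ptx\n"
             "FINAL REPORT\n"
             "There is no pneumothorax.")
print(report_body(demo_note).strip())   # There is no pneumothorax.
print(report_body("No header here."))   # No header here.
```

Dropping the header also discards the stated reason for the examination, which may itself be useful for other tasks, so whether this is appropriate depends on the question being asked.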

#### Exercise for you:
So what could you do to further improve the output?

### Summary points:

1. The same medical concept can be described in many different equivalent ways in unstructured text. A more robust vocabulary that recognizes a concept of interest in many forms would help you spot the concept with higher sensitivity.
2. After spotting an occurrence of a term of interest in unstructured text, it may be important to interpret its context next.
3. Negation detection is one type of NLP context interpretation. There are many others and the importance of each depends on your task.
4. Negation detection at its simplest may be the detection of a negation-related term (e.g. no) in the same sentence. More complex NLP libraries, such as Negex and spaCy, can help you do a better job in more complicated cases.
5. At a whole-document level, a term or concept may be mentioned in multiple sentences in different contexts. It is up to experts (you) to determine how to put together all the information to give the best prediction for the patient.
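
To make the last point concrete, here is a hedged sketch of collapsing the '|'-joined sentence labels stored in `ptx_prediction` into one report-level label. The function name and both strategies are assumptions chosen for illustration, not a validated clinical rule. The sketch also assumes the only labels are 'affirmed' and 'negated', as in the outputs above.

```python
def report_level_label(ptx_prediction, strategy='last'):
    """Collapse a '|'-joined string of sentence labels into a single report label."""
    labels = [label for label in ptx_prediction.split('|') if label]
    if not labels:
        return 'not mentioned'
    if strategy == 'last':
        # Trust the final mention, e.g. the findings or impression near the end
        return labels[-1]
    if strategy == 'any_affirmed':
        # More sensitive: a single affirmed sentence marks the whole report
        return 'affirmed' if 'affirmed' in labels else 'negated'
    raise ValueError('unknown strategy: ' + strategy)

print(report_level_label('affirmed|affirmed|negated|negated'))                           # negated
print(report_level_label('affirmed|affirmed|negated|negated', strategy='any_affirmed'))  # affirmed
print(report_level_label(''))                                                            # not mentioned
```

With pandas, a rule like this could be applied to every report through `results_ptx['ptx_prediction'].apply(report_level_label)`. Which strategy is preferable depends on whether missed cases or false alarms are more costly for the task.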