# Identifying Evidence of Surgical Site Infections in Radiology Reports using Machine Learning
BMI 6106
Final Project by **Alec Chapman** and **Jason McNamee**

In our final project, we use Machine Learning to identify evidence of Surgical Site Infections (SSIs) in free-text radiology reports.

The project is structured as follows:
1. `Project Introduction` - an overview of the task and the data
2. `Data Exploration` - analyzing the data using descriptive statistics and probability measurements of vocabulary
3. `Report Classification` - training and evaluating a number of different Machine Learning classifiers
4. `Analysis` - analyzing the results and testing for statistical significance
5. `Discussion`

## Abstract
*Free-text reports in electronic health records (EHRs) contain medically significant information - signs, symptoms,
findings, diagnoses - recorded by clinicians during patient encounters. These reports contain rich clinical
information which can be leveraged for surveillance of disease and occurrence of adverse events. In order to gain
meaningful knowledge from these text reports to support surveillance efforts, information must first be converted
into a structured, computable format. Traditional methods rely on manual review of charts, which can be costly and
inefficient. Natural language processing (NLP) methods offer an efficient alternative approach to extracting the
information and can achieve a similar level of accuracy. We utilized statistical and probabilistic methods to examine properties of radiology reports and determine whether or not text-based Machine Learning (ML) algorithms could effectively classify reports containing evidence of surgical site infections leveraging these mentions. We evaluated our system using a reference standard of reports annotated by domain experts and tested for statistical significance.*

# Introduction

## Data Hypothesis
- **Null Hypothesis**: the vocabulary used in reports for patients who have fluid collections is the same as the vocabulary used in reports for patients who do not.
- **Alternative Hypothesis**: the vocabulary used in reports for patients who have fluid collections is significantly different from the vocabulary used in reports for patients who do not, and this difference can be used to classify reports.
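
One way to make this hypothesis testable for a single term is a chi-square test of independence on a 2x2 table of document counts (term present or absent, by positive or negative class). The sketch below computes the Pearson statistic by hand. The counts are invented for illustration and do not come from the dataset, `chi_square_2x2` is a hypothetical helper name, and this is not necessarily the test used later in the project.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for the table

                     term present   term absent
        positive          a              b
        negative          c              d

    Assumes every row and column total is non-zero.
    """
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if term presence were independent of class
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat


# Illustrative counts only (not taken from the annotated reports)
stat = chi_square_2x2(40, 60, 10, 190)

# Critical value of the chi-square distribution, 1 degree of freedom, alpha = 0.05
CRITICAL_05_DF1 = 3.841
print(round(stat, 2), stat > CRITICAL_05_DF1)
```

A statistic above the critical value would lead us to reject the null hypothesis for that term. Testing many terms this way would call for a multiple-comparison correction.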

## Task Overview
SSIs are adverse outcomes of surgeries. There are multiple types of SSIs, and they can result in rehospitalization or death. Many SSIs occur after discharge, which makes them difficult to detect. Natural Language Processing (NLP) offers a way to automatically and effectively detect surgical site infections. In this project we will be focusing on Deep/Organ Space SSIs, which are often identified in radiology reports as **collections of fluid.**

In a past project that was presented at AMIA [see *Discussion/References - 10*], a rule-based NLP system was developed to identify fluid collections in radiology reports. A hand-crafted lexicon was used to identify mentions of fluid collection. This lexicon included terms such as:

- "fluid collection"
- "hematoma"
- "abscess"
- "biloma"
- "multiloculated fluid"
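
To give a feel for lexicon-based matching, here is a naive sketch that searches a report for the example terms listed above. It is not the rule-based system from the earlier project: the lexicon here holds only those five examples, and `find_mentions` is a hypothetical helper name.

```python
import re

# Only the example terms listed above; the full lexicon is not reproduced here.
LEXICON = ["fluid collection", "hematoma", "abscess", "biloma", "multiloculated fluid"]

# Try longer terms first, anchor on word boundaries, and allow simple plurals
# ("collections", "abscesses").
_pattern = re.compile(
    r"\b(?:"
    + "|".join(re.escape(term) for term in sorted(LEXICON, key=len, reverse=True))
    + r")(?:es|s)?\b",
    flags=re.IGNORECASE,
)


def find_mentions(text):
    """Return (start, end, matched_text) for each lexicon hit in `text`."""
    return [(m.start(), m.end(), m.group(0)) for m in _pattern.finditer(text)]


print(find_mentions("A drainage catheter coursing through 6 cm fluid collection in Morison's pouch"))
print(find_mentions("No abnormal intra-abdominal fluid collections identified."))
```

Both sentences produce a hit, even though the second one is negated. Plain string matching cannot tell these apart, so a practical rule-based system also needs some form of negation handling.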

This lexicon was handcrafted by having clinicians hand-annotate data. Creating such detailed annotations was a costly and difficult process. The purpose of this project is to use a much coarser annotation as labels: a binary decision of whether or not a fluid collection is present. We will then use probabilistic methods to identify salient terms that could belong to the lexicon, and train Machine Learning (ML) algorithms to distinguish between positive and negative reports.
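
As a rough illustration of how document-level labels alone can surface candidate lexicon terms, the sketch below ranks tokens by a smoothed log ratio of P(term | positive) to P(term | negative). The toy reports are invented, the helper names are hypothetical, and the measure is one plausible choice rather than the one used in the Data Exploration notebook.

```python
import math
import re
from collections import Counter


def doc_frequencies(docs):
    """Count, for each token, the number of documents it appears in."""
    counts = Counter()
    for doc in docs:
        counts.update(set(re.findall(r"[a-z]+", doc.lower())))
    return counts


def log_ratio(term, pos_df, neg_df, n_pos, n_neg):
    """log[P(term | positive) / P(term | negative)] with add-one smoothing."""
    p_pos = (pos_df[term] + 1) / (n_pos + 2)
    p_neg = (neg_df[term] + 1) / (n_neg + 2)
    return math.log(p_pos / p_neg)


# Toy reports, invented for illustration
pos_docs = ["fluid collection in the pelvis", "large abscess with fluid", "rim-enhancing collection"]
neg_docs = ["no fluid collection identified", "liver enhances homogeneously", "no acute abnormality"]

pos_df, neg_df = doc_frequencies(pos_docs), doc_frequencies(neg_docs)
scores = {
    term: log_ratio(term, pos_df, neg_df, len(pos_docs), len(neg_docs))
    for term in set(pos_df) | set(neg_df)
}

# Highest score first; ties are broken alphabetically so the order is stable.
ranked = sorted(scores, key=lambda term: (-scores[term], term))
print(ranked[:5], ranked[-1])
```

Terms with large positive scores are candidates for the lexicon, while terms with large negative scores are characteristic of negative reports.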

# Methods
- Data
- Data Exploration
- Report Classification - Training ML Algorithms
- Analysis - Testing for significance and comparing results

# Data
In this project, we will use a dataset that consists of 645 de-identified CT reports (545 training, 100 validation) from MIMIC that have been annotated by two expert clinicians. The annotation study included two goals:
1. **Mention-level annotation**: specific spans of text were annotated that represented a clinical concept related to fluid collections
2. **Document-level annotation**: an overall judgement about whether or not a fluid collection was present in the document.

For this project we will be focusing exclusively on the second task.

Let's take a look at the data:

In [2]:
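import os
import pandas as pd
import sqlite3 as sqlite

# Location of the annotated reference-standard database
DATADIR = '../stats_data'
DB = os.path.join(DATADIR, 'Reference Standard', 'radiology_reports.sqlite')
# Check the path first: sqlite.connect() creates an empty database if the file is missing
os.path.exists(DB)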

True

In [4]:
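# Load every row of the training table into a DataFrame
conn = sqlite.connect(DB)
df = pd.read_sql("SELECT * FROM training_notes;", conn)

conn.close()

# Report the number of rows and columns, then preview the first rows
print(df.shape)
df.head()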

(545, 8)


|   | rowid | name | text | referenceXML | doc_class | subject | HADM_ID | CHARTDATE |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | No_10792_131562_05-29-20 | `\n CT ABDOMEN W/CONTRAST; CT PELVIS W/CONTRAS...` | `<?xml version="1.0" encoding="UTF-8"?>\n<annot...` | 0 | 32 | 131562 | 05-29-20 |
| 1 | 2 | No_11050_126785_11-03-33 | `\n CT CHEST W/CONTRAST; CT ABDOMEN W/CONTRAST...` | `<?xml version="1.0" encoding="UTF-8"?>\n<annot...` | 0 | 34 | 126785 | 11-03-33 |
| 2 | 3 | No_11879_166554_06-22-37 | `\n CTA CHEST W&W/O C &RECONS; CT 100CC NON IO...` | `<?xml version="1.0" encoding="UTF-8"?>\n<annot...` | 0 | 35 | 166554 | 06-22-37 |
| 3 | 4 | No_11879_166554_06-23-37 | `\n CT ABDOMEN W/O CONTRAST; CT PELVIS W/O CON...` | `<?xml version="1.0" encoding="UTF-8"?>\n<annot...` | 0 | 35 | 166554 | 06-23-37 |
| 4 | 5 | No_11879_166554_07-02-37 | `\n CT CHEST W/O CONTRAST \n ~ Reason: r/o ste...` | `<?xml version="1.0" encoding="UTF-8"?>\n<annot...` | 0 | 35 | 166554 | 07-02-37 |


The training set consists of 545 reports. We will only be using two columns:
- ``text``: the original text of the radiology report
- ``doc_class``: whether or not a fluid collection is present. **1** indicates a positive document while **0** indicates a negative.
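
Before training a classifier it helps to know how many reports fall into each class. The sketch below selects only these two columns and counts documents per class. So that it runs without the project data, it uses an in-memory SQLite table as a stand-in for `training_notes`; the three rows are invented and the reduced schema is an assumption.

```python
import sqlite3

# In-memory stand-in for radiology_reports.sqlite; the rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE training_notes (name TEXT, text TEXT, doc_class INTEGER)")
conn.executemany(
    "INSERT INTO training_notes VALUES (?, ?, ?)",
    [
        ("toy_1", "No abnormal fluid collections identified.", 0),
        ("toy_2", "6 cm fluid collection in Morison's pouch.", 1),
        ("toy_3", "Liver enhances homogeneously.", 0),
    ],
)

# Pull only the two columns needed for classification.
rows = conn.execute("SELECT text, doc_class FROM training_notes").fetchall()

# Count documents per class to check for class imbalance.
class_counts = dict(
    conn.execute("SELECT doc_class, COUNT(*) FROM training_notes GROUP BY doc_class")
)
conn.close()
print(class_counts)
```

With the DataFrame loaded above, `df[['text', 'doc_class']]` and `df.doc_class.value_counts()` give the equivalent selection and counts.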

Let's look at two reports:

First, a negative report. Note the following excerpt in the Impression section:
- "No abnormal intra-abdominal fluid collections identified."

In [8]:
neg_report = df[df.doc_class == 0].text.iloc[0]
print(neg_report)

 
 CT ABDOMEN W/CONTRAST; CT PELVIS W/CONTRAST 
 ~ Reason: assess for undrained collections w/PO and IV contrast
 ~ Admitting Diagnosis: BILE DUCT INJURY
  Contrast: OPTIRAY Amt: 130
 

 ~ UNDERLYING MEDICAL CONDITION:
  83 year old man with hepaticojejunostomy , now w/some bilious output in JP
  drain
 ~ REASON FOR THIS EXAMINATION:
  assess for undrained collections w/PO and IV contrast
 No contraindications for IV contrast
 

 ~ FINAL REPORT
 HISTORY:  83-year-old man status post hepaticojejunostomy, , with
 increased bilious output in the JP drain.  Evaluate for fluid collections.
 Comparison is made to a prior CT examination dated .
 ~ TECHNIQUE:  MDCT-acquired axial images were obtained through the abdomen and
 pelvis with oral and intravenous contrast.  Coronal and sagittal reformations
 were evaluated.
 ~ CT OF THE ABDOMEN WITH INTRAVENOUS CONTRAST:  Limited examination of the lung
 bases displays persistent small bilateral pleural effusions (right greater
 than left) with adja

Now a positive report. Note the following excerpt in the Impression section:
- "A drainage catheter coursing through 6 cm fluid collection in Morison's pouch"

In [9]:
pos_report = df[df.doc_class == 1].text.iloc[0]
print(pos_report)

 
 CT ABDOMEN W/CONTRAST; CT PELVIS W/CONTRAST 
 ~ Reason: assess for bowel obstruction, r/o collectionplease use po/iv
 ~ Admitting Diagnosis: LIVER TRANSPLANT
 Field of view: 36 Contrast: OPTIRAY Amt: 150
 

 ~ UNDERLYING MEDICAL CONDITION:
 30 year old man pod 11 s/p liver transplant with distended abd, nausea
 ~ REASON FOR THIS EXAMINATION:
  assess for bowel obstruction, r/o collectionplease use po/iv contrast
 No contraindications for IV contrast
 

 ~ FINAL REPORT
 ~ INDICATION:  Postop day eleven status post liver transplant with abdominal
 pain.
 ~ COMPARISON:  .
 ~ TECHNIQUE:  Contrast-enhanced axial CT imaging with multiplanar reformats of
 the abdomen and pelvis was reviewed.
 CT ABDOMEN WITH CONTRAST:  There is a small right pleural effusion and a
 bibasilar atelectasis.  Heart is unremarkable.  There is gynecomastia. The
 liver enhances homogeneously.  The hepatic vasculature including hepatic
 arteries, portal veins, and hepatic veins are patent.  A round 15 mm soft
 tis
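
The report text shown above ends before the Impression section. Here is a minimal sketch for pulling out a single section, assuming that headers follow the `~ NAME:` pattern visible in the excerpts and that the section is labeled `IMPRESSION:`. The excerpts do not confirm that label, so treat it as an assumption; `get_section` is a hypothetical helper and the toy report is invented.

```python
import re


def get_section(report, header="IMPRESSION"):
    """Return the text after `header:` up to the next '~' header line, or None.

    Assumes headers start a line and are optionally preceded by '~'.
    """
    pattern = re.compile(
        r"^[ \t]*(?:~[ \t]*)?" + re.escape(header) + r":[ \t]*(.*?)(?=^[ \t]*~|\Z)",
        flags=re.IGNORECASE | re.MULTILINE | re.DOTALL,
    )
    match = pattern.search(report)
    if match is None:
        return None
    # Collapse the hard line wraps used in the reports into single spaces.
    return " ".join(match.group(1).split())


# Invented report fragment that mimics the layout of the excerpts above
toy_report = (
    " ~ TECHNIQUE:  Axial images were obtained.\n"
    " ~ IMPRESSION:\n"
    " 1. No abnormal intra-abdominal fluid\n"
    " collections identified.\n"
    " ~ ADDENDUM: none\n"
)
print(get_section(toy_report))
```

Something like `get_section(neg_report)` could then be used to check the quoted excerpts, provided the real reports follow the assumed layout.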

# Up Next
[Data Exploration](./Data%20Exploration.ipynb)