
## Abstract

- contains 224,316 chest radiographs of 65,240 patients
- designed a labeler to automatically detect the presence of 14 observations in radiology reports
    - investigate different approaches to using uncertainty labels that output probability of these observations given available frontal and lateral radiographs
- Validation set = 200 chest radiographic studies
    - manually annotated by 3 board-certified radiologists
        - found different uncertainty approaches were useful for different pathologies
- Results: on Cardiomegaly, Edema, and Pleural Effusion, model ROC and PR curves lie above all 3 radiologist operating points

## Introduction

- Automated chest radiograph interpretation at level of practicing radiologists
    - benefit in many medical settings
        - improved workflow prioritization and clinical decision support
        - large-scale screening and global population health initiatives
- Designed labeler that extracted observations from free-text radiology reports
    - captured uncertainties present in reports by using an uncertainty label
- Pay particular attention to uncertainty labels

### Table 1 From CheXpert Paper
![Table 1 From Chexpert Paper](images/chexpert_table_1.png)

## Dataset

### _Data Collection and Label Selection_

- collected chest radiographic studies from Stanford Hospital, performed between October 2002 and July 2017
    - from both inpatient and outpatient centers, along with associated radiology reports
    - from these sampled 1000 reports for manual review by board-certified radiologist
- determined 14 observations (i.e. pathologies) based on prevalence in reports and clinical relevance
    - `Pneumonia` was included as a label to represent images that suggest primary infection as diagnosis
    - `No Finding` observation captured absence of all pathologies
    
### _Label Extraction from Radiology Reports_

- team developed automated rule-based labeler to extract observations from free text radiology reports
    - set up in three distinct stages:
        - __mention extraction__
        - __mention classification__
        - __mention aggregation__
        
#### Mention Extraction

- extracts mentions of observations (from a fixed list) from the _Impression_ section of the report
    - the _Impression_ section summarizes key findings in the study
    - team also put together manually curated list of phrases to match alternative names for pathologies in reports
    
#### Mention Classification

- after extraction, aim is to classify each mention as negative, uncertain or positive
- `uncertain` label can catch both uncertainty of radiologist in diagnosis as well as ambiguity inherent in report (__HOW?__)
- Is 3-phase pipeline consisting of:
    - pre-negation uncertainty
    - negation
    - post-negation uncertainty
        - if match is found, mention is classified accordingly 
        - if mention is not matched in any of the phases, it is classified as positive
- Rules for mention classification designed on universal dependency parse of report
    - first, split and tokenize sentences using `NLTK`
    - then, sentences parsed using Bllip parser trained using __David McClosky's__ biomedical model [see here](https://nlp.stanford.edu/~mcclosky/papers/dmcc-thesis-2010.pdf)
    - finally, universal dependency graph of each sentence is computed using Stanford CoreNLP [see here](https://nlp.stanford.edu/pubs/USD_LREC14_paper_camera_ready.pdf)
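
The three-phase ordering above can be sketched with a few toy rules. This is a minimal illustration only: the actual labeler matches rules on a dependency parse, whereas the regular expressions, rule lists, and the name `classify_mention` below are assumptions made for the example.

```python
import re

# Toy rule lists (assumptions). The real labeler applies rules to a
# universal dependency parse, not to regular expressions.
PRE_NEGATION_UNCERTAINTY = [r"\bcannot exclude\b", r"\bcannot rule out\b"]
NEGATION = [r"\bno\b", r"\bwithout\b", r"\bexclude\b"]
POST_NEGATION_UNCERTAINTY = [r"\bmay represent\b", r"\bpossible\b"]


def _matches(patterns, sentence):
    """Return True if any pattern occurs in the sentence (case-insensitive)."""
    return any(re.search(p, sentence, flags=re.IGNORECASE) for p in patterns)


def classify_mention(sentence):
    """Classify one mention as 'uncertain', 'negative', or 'positive'."""
    # Phase 1: uncertainty phrases that would otherwise trigger a negation rule.
    if _matches(PRE_NEGATION_UNCERTAINTY, sentence):
        return "uncertain"
    # Phase 2: negation.
    if _matches(NEGATION, sentence):
        return "negative"
    # Phase 3: remaining uncertainty phrases.
    if _matches(POST_NEGATION_UNCERTAINTY, sentence):
        return "uncertain"
    # No phase matched: the mention is treated as positive.
    return "positive"
```

With these toy rules, `cannot exclude pneumothorax` is caught in the first phase, before the `exclude` negation rule can fire.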
    
#### Mention Aggregation

- use classification for each mention of observations to determine label for each of the 12 pathologies as well as `Support Devices` and `No Finding`
    - observation with at least one positively classified mention is assigned a positive (1) label
    - observation assigned uncertain (u) label if no positively classified mentions and at least one uncertain mention
    - observation assigned negative (0) label if there is at least one negatively classified mention and no positive or uncertain mention
    - assign _blank_ if there is no mention of an observation
    - `No Finding` observation assigned a positive label (1) if no pathology classified as positive or uncertain
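
The aggregation rules above amount to a precedence order (positive, then uncertain, then negative, then blank). A minimal sketch follows. The function names and the use of `None` for a blank label are assumptions, as is returning blank for `No Finding` when the condition is not met (the notes do not say what is assigned in that case).

```python
def aggregate_mentions(mention_classes):
    """Combine per-mention classes for one observation into a report label.

    Returns 1 (positive), "u" (uncertain), 0 (negative), or None (blank).
    """
    if "positive" in mention_classes:
        return 1
    if "uncertain" in mention_classes:
        return "u"
    if "negative" in mention_classes:
        return 0
    return None  # the observation is never mentioned


def no_finding_label(pathology_labels):
    """Return 1 if no pathology is positive or uncertain, else None (assumed)."""
    if any(label in (1, "u") for label in pathology_labels.values()):
        return None
    return 1
```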
    
### Table 2 From CheXpert Paper
![Table 2](images/chexpert_table_2.png)

## Labeler Results

### Report Evaluation Set
- report evaluation set = 1000 radiology reports from 1000 distinct randomly sampled patients
    - do not overlap with patients whose studies were used to develop the labeler
- two board-certified radiologists (w/o access to additional info) label each observation
    - confidently present (1)
    - confidently absent (0)
    - uncertainly present (u)
    - not mentioned (blank)
- resulting annotations serve as ground truth on the report evaluation set

### Comparison to NIH labeler
- compared labeler against method used in NIH medical image dataset
- Table 2 (see above) shows the performance of the CheXpert labeler vs. NIH labeler
    - across all observations CheXpert labeler achieved higher F1 score
    - The F1 score: harmonic mean of the precision and recall, with the best value at 1 and worst score at 0. 
        - The relative contribution of precision and recall to the F1 score is equal. 
        - The formula for the F1 score is:
`F1 = 2 * (precision * recall) / (precision + recall)`
-  Three key differences between CheXpert method and NIH method
    - did __not__ use automatic mention extractors like MetaMap or DNorm
    - incorporated several additional rules to capture large variation in ways negation and uncertainty are conveyed
    - split uncertainty classification of mentions into pre-negation and post-negation
        - allowed them to resolve cases of uncertainty rules double matching with negation rules in the reports
        - for example, the phrase `cannot exclude pneumothorax` conveys uncertainty about the presence of pneumothorax
        - without the pre-negation stage, the pneumothorax mention would be classified as negative due to the `exclude XXX` rule
        - by applying 'cannot exclude' rule in pre-negation, this observation can be correctly classified as uncertain
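
As a small aside on the metric used in the labeler comparison, the F1 formula given earlier in this section can be computed from raw counts. This is a sketch: the function name and the convention of returning 0.0 when there are no true positives are assumptions.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """F1 from raw counts; returns 0.0 when there are no true positives."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)
```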

## Model

- Trained models that take as input a single-view chest radiograph and output the probability of each of the 14 observations
    - when more than one view is available, the models output the maximum probability of the observations across the views
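
Combining views by maximum probability can be sketched as an element-wise max over the per-view outputs. The function name and the list-of-lists input layout are assumptions for illustration.

```python
def study_probabilities(view_probabilities):
    """Element-wise maximum over per-view probability lists.

    view_probabilities: one list of per-observation probabilities per view.
    """
    if not view_probabilities:
        raise ValueError("at least one view is required")
    # zip(*...) groups the probabilities of the same observation across views.
    return [max(per_observation) for per_observation in zip(*view_probabilities)]
```

For example, a frontal view giving `[0.2, 0.9]` and a lateral view giving `[0.6, 0.1]` yield a study-level output of `[0.6, 0.9]`.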
    
### Uncertainty Approaches
- training labels in the dataset for each observation are 0 (negative), 1 (positive), or _u_ (uncertain)

#### Ignoring (_U-Ignore_)
- simple approach is to ignore the _u_ labels during training
    - can serve as baseline to compare approaches which incorporate uncertainty labels
- optimized the sum of the _masked_ binary cross-entropy losses over observations
    - masked the loss for the observations which are marked as uncertain for the study
- Can produce biased models if the cases are not missing completely at random
- In this dataset, uncertainty labels are quite prevalent for some observations
    - Consolidation, for example has uncertainty label ~2x as prevalent as positive label
    - as a result, this approach ignores a large proportion of labels (i.e. reduces effective size of the dataset)
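
The masked loss described above can be sketched for a single study as follows. This is a plain-Python illustration rather than the training code: the function name, the `"u"` marker, and the clipping constant `eps` are assumptions.

```python
import math


def masked_bce_loss(probabilities, labels, eps=1e-7):
    """Sum of binary cross-entropy terms, skipping labels marked 'u'."""
    total = 0.0
    for p, y in zip(probabilities, labels):
        if y == "u":
            continue  # masked: uncertain labels contribute nothing to the loss
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total
```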
    
#### Binary Mapping
- investigated whether the uncertain labels for any of the observations could be replaced by 0 or 1 label
    - map all instances of _u_ to 0 (_U-Zeros_ model), or all to 1 (_U-Ones_ model)
- similar to zero imputation strategies in statistics
- however if uncertainty label does convey useful info to classifier, then it can distort decision making and degrade performance
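
Both binary mappings are a one-line relabeling of the training targets. A minimal sketch, where the function name and the `"zeros"` / `"ones"` policy strings are hypothetical:

```python
def map_uncertain(labels, policy):
    """Replace 'u' with 0 ('zeros') or 1 ('ones'); other labels are unchanged."""
    replacement = {"zeros": 0, "ones": 1}[policy]
    return [replacement if y == "u" else y for y in labels]
```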

#### Self-Training
- another framework is to consider uncertainty labels as unlabeled examples
    - lending itself to semi-supervised learning
    - _multi-label learning with missing labels_ (MLML)
        - aims to handle multi-label classification given training instances that have a partial annotation of their labels
- Investigated self-training approach (_U-SelfTrained_) for using the uncertainty label
    - first trained a model using _U-Ignore_ (ignores _u_ labels during training) to convergence
    - then used the model to make predictions that re-label each of the uncertainty labels with the probability prediction outputted by the model
    - do __not__ replace any instances of 1 or 0s
    - on the relabeled examples, set up loss as the mean of the binary cross-entropy losses over the observations
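
The relabeling step and the loss on soft targets can be sketched as below. The predictions of the _U-Ignore_ model are passed in as a plain list, and both function names and the clipping constant are assumptions.

```python
import math


def relabel_uncertain(labels, u_ignore_probabilities):
    """Replace each 'u' with the U-Ignore model's probability; keep 0 and 1."""
    return [
        p if y == "u" else float(y)
        for y, p in zip(labels, u_ignore_probabilities)
    ]


def mean_soft_bce(probabilities, targets, eps=1e-7):
    """Mean binary cross-entropy against targets in [0, 1]."""
    total = 0.0
    for p, t in zip(probabilities, targets):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(targets)
```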
    
#### 3-Class Classification (_U-MultiClass_)
- this approach treats _u_ label as its own class
    - as opposed to mapping it to a binary label for each of the 14 observations
- hypothesis: can better incorporate information from image by supervising uncertainty
    - allows network to find own representation of uncertainty on different pathologies
- Output the probability of each of the 3 possible classes (summing to 1)
    - set up loss as the mean of the multi-class cross-entropy losses over the observations
    - at test time, output probability of positive label after applying softmax restricted to the positive and negative classes
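
The test-time step can be sketched for one observation as a two-way softmax that leaves out the uncertain logit. The class ordering `(negative, uncertain, positive)` and the function name are assumptions.

```python
import math


def positive_probability(logits):
    """Softmax restricted to the negative and positive logits.

    logits: (negative, uncertain, positive) scores for one observation.
    """
    negative, _uncertain, positive = logits
    shift = max(negative, positive)  # subtract the max for numerical stability
    exp_neg = math.exp(negative - shift)
    exp_pos = math.exp(positive - shift)
    return exp_pos / (exp_neg + exp_pos)
```

Note that a large uncertain logit does not change the result, since only the negative and positive scores enter the normalization.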
    
### Training Procedure
- Followed same architecture and training process for each of the uncertainty approaches
- Experimented with the following:
    - `ResNet152`
    - `DenseNet121` (__found to have the best results__)
    - `Inception-v4`
    - `SE-ResNeXt101`
- Images fed into network with:
    - size = 320 x 320 pixels
    - used Adam optimizer with default β-parameters of β1 = 0.9, β2 = 0.999
    - learning rate = 1 x 10^-4 (i.e. `1e-4`)
        - was fixed for the duration of the training
    - batches are sampled using a fixed batch size of 16 images
    - trained for 3 epochs, saving checkpoints every 4800 iterations

## Validation Results

### Figure 3 from CheXpert Paper
![Figure 3](images/figure_3.png)

### Validation Set
- 200 studies from 200 patients
    - randomly sampled from full dataset
    - Three board-certified radiologists annotated each of the studies in validation set
        - classified each observation into present, uncertain likely, uncertain unlikely and absent
        - their annotations were binarized such that all present and uncertain likely cases treated as positive & all absent and uncertain unlikely cases treated as negative
        
### Comparison of Uncertainty Approaches

#### Procedure
- evaluate approaches using area under the receiver operating characteristic curve (AUC) metric
- focus on the evaluation of 5 observations which we call __competition tasks__
    - based on clinical importance and prevalence in validation set:
        - Atelectasis
        - Cardiomegaly
        - Consolidation
        - Edema
        - Pleural Effusion
        
#### Model Selection
- for each of uncertainty approaches, chose best 10 checkpoints per run using avg. AUC across competition tasks
    - run each model 3x, take ensemble of 30 generated checkpoints on validation set
        - computed the mean of the output probabilities over the 30 models
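
Checkpoint selection and ensembling can be sketched as below, assuming per-checkpoint AUCs on the competition tasks are already available. The function names and the dictionary layout are hypothetical.

```python
def top_checkpoints(checkpoint_aucs, k=10):
    """Return the names of the k checkpoints with the highest mean AUC.

    checkpoint_aucs: {checkpoint_name: [AUC for each competition task]}.
    """
    mean_auc = {
        name: sum(aucs) / len(aucs) for name, aucs in checkpoint_aucs.items()
    }
    return sorted(mean_auc, key=mean_auc.get, reverse=True)[:k]


def ensemble_mean(model_probabilities):
    """Average per-observation probabilities over all ensemble members."""
    n_models = len(model_probabilities)
    return [sum(column) / n_models for column in zip(*model_probabilities)]
```

In the setup described above, `top_checkpoints` would be applied once per run with `k=10`, and `ensemble_mean` would then average the outputs of the resulting 30 checkpoints.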
        
#### Results
- On Atelectasis, _U-Ones model_ (AUC=0.858) significantly outperformed _U-Zeros model_ (AUC=0.811)
- On Cardiomegaly, _U-MultiClass model_ (AUC=0.854) performed significantly better than _U-Ignore model_ (AUC=0.828)
- For Consolidation, Edema and Pleural Effusion, did not find the best models to be significantly better than the worst

#### Analysis
- Ignoring uncertainty label is not effective approach to handling uncertainty
    - particularly ineffective on Cardiomegaly
        - most of uncertain Cardiomegaly cases are borderline (i.e. "minimal cardiac enlargement")
        - if ignored can cause model to perform poorly on cases that are hard to distinguish
    - _U-MultiClass_ approach could enable model to better disambiguate borderline cases
- For detection of Atelectasis & Edema, _U-Ones_ approach has high performance
    - hints that uncertainty label for these observations is effectively utilized when treated as positive
- For Consolidation label, _U-Zeros_ approach performed the best
    - noted that Atelectasis and Consolidation often mentioned together in radiology reports
    - example: 'findings may represent atelectasis vs. consolidation' is very common
        - for this, the labeler assigned uncertain for both observations
        - found from ground truth panel, many of these cases resolved as Atelectasis-positive and Consolidation-negative

## Test Results

- selected final model based on best performing ensemble on each competition task on the validation set
    - _U-Ones_ for Atelectasis and Edema
    - _U-MultiClass_ for Cardiomegaly and Pleural Effusion
    - _U-SelfTrained_ for Consolidation
    
### Test Set
- consisted of 500 studies from 500 patients randomly sampled from 1000 studies in report test set
- individually annotated by eight board-certified radiologists
    - majority vote of 5 radiologist annotations serves as strong ground truth
    - remaining 3 radiologist annotations used to benchmark radiologist performance
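
A majority vote over binary annotations for one observation can be sketched as follows. The function name and the odd-count check are assumptions for illustration.

```python
def majority_vote(binary_annotations):
    """Return 1 if more than half of the annotators marked the finding present."""
    if len(binary_annotations) % 2 == 0:
        raise ValueError("an odd number of annotators avoids ties")
    return int(sum(binary_annotations) > len(binary_annotations) / 2)
```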
    
### Comparison to Radiologists

#### Procedure
- computed sensitivity (recall), specificity, and precision against test set ground truth
- to compare model to radiologists, plotted radiologist operating points with model on both the ROC and Precision-Recall (PR) space
    - examined whether points lie below curves to determine if model is superior
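
The comparison above can be sketched in two steps: compute a reader's operating point from confusion counts, then check whether it falls below the model's ROC curve. Linear interpolation between curve points, the function names, and the input layout are assumptions. The counts must include at least one positive and one negative case, and at least one positive prediction.

```python
def operating_point(tp, fp, tn, fn):
    """Return (sensitivity, specificity, precision) from confusion counts."""
    sensitivity = tp / (tp + fn)  # recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    return sensitivity, specificity, precision


def lies_below_roc(point_fpr, point_tpr, roc_curve):
    """Check whether an operating point is strictly below an ROC curve.

    roc_curve: (fpr, tpr) pairs sorted by increasing fpr, covering point_fpr.
    """
    for (x0, y0), (x1, y1) in zip(roc_curve, roc_curve[1:]):
        if x0 <= point_fpr <= x1:
            if x1 == x0:
                curve_tpr = max(y0, y1)  # vertical segment
            else:
                curve_tpr = y0 + (y1 - y0) * (point_fpr - x0) / (x1 - x0)
            return point_tpr < curve_tpr
    raise ValueError("point_fpr is outside the curve's FPR range")
```

The false positive rate of a reader is `1 - specificity`, so the operating point in ROC space is `(1 - specificity, sensitivity)`.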
    
#### Results
- best AUC on Pleural Effusion (0.97) and worst on Atelectasis (0.85)
    - all other AUC's were > 0.9
- On Cardiomegaly, Edema, and Pleural Effusion, model achieves higher performance than all 3 radiologists but not their majority vote
- On Consolidation, model performance exceeds 2 of the 3 radiologists
- On Atelectasis, all 3 radiologists perform better than the model

#### Limitations
- First, neither the radiologists nor the model had access to patient history or previous examinations
    - lack of this context has been shown to decrease diagnostic performance in chest radiograph interpretation
- Second, no statistical test was performed to assess whether difference between the performance of the model and radiologists was statistically significant

### Visualization
- visualized areas of radiograph which model predicts to be most indicative of each observation
- uses Gradient-weighted Class Activation Mappings (Grad-CAMs)
    - utilize gradient of an output class into the final convolutional layer to produce low resolution map highlighting portions of image important to detection of output class
- Constructed the map using the gradient of the final linear layer as the weights 
    - then performed a weighted sum of the final feature maps using those weights
    - upscaled resulting map to the dimensions of the original image
    - overlaid map on the images
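
The weighted-sum and upscaling steps can be sketched on nested lists. This is a simplified illustration: the weights are taken as given, nearest-neighbour upscaling is an assumption (the notes do not name the interpolation method), and both function names are hypothetical.

```python
def weighted_feature_map(feature_maps, weights):
    """Weighted sum of K feature maps (each a 2-D list) into one 2-D map."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    return [
        [
            sum(w * fmap[r][c] for w, fmap in zip(weights, feature_maps))
            for c in range(width)
        ]
        for r in range(height)
    ]


def upscale_nearest(low_res_map, factor):
    """Repeat every cell `factor` times along both axes (assumed method)."""
    return [
        [value for value in row for _ in range(factor)]
        for row in low_res_map
        for _ in range(factor)
    ]
```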

## Existing Chest Radiograph Datasets

- Indiana Network for Patient Care hosts the OpenI dataset
    - consists of 7,470 frontal-view radiographs and radiology reports that have been labeled with key findings by human annotators
- National Cancer Institute hosts the PLCO Lung dataset
    - contains ~185k full resolution images
    - due to nature of collection process though, has low prevalence of clinically important pathologies
        - such as Pneumothorax, Consolidation, Effusion, and Cardiomegaly
- MIMIC-CXR dataset from MIT (recently announced)
- ChestX-ray14 dataset from NIH
    - using this as a benchmark is problematic because the labels in its test set were extracted from reports using an automatic labeler