# Module 9 Week 2
## Overview
In the next three notebooks, we'll develop an NLP system to extract mentions of pneumonia from clinical text. To do this, we'll use an open-source package called [pyConText](https://github.com/chapmanbe/pyConTextNLP/tree/master/pyConTextNLP), which leverages regular expressions and NetworkX graphs to identify concepts and their contexts in text.

First, we'll get familiar with the dataset and task by looking at a **gold standard** of human-annotated documents, against which we'll later compare the output of our NLP system.

## IMPORTANT NOTE
pyConText **does not support** Python 2.7, which appears to be the default on a number of the computers at UVU. Use Python 3.5 or higher instead (I use 3.7).

To check your Python version, either run the cell below or type the same command, without the leading `!`, into Anaconda Prompt. If you are running Python 2, follow the instructions on Canvas to create a conda environment with Python 3.

In [None]:
!python -V
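
The same check can also be made from inside Python, which is handy if you are unsure which interpreter the notebook kernel is actually using. This is only a minimal sketch: the helper name `is_supported_python` is made up for illustration, and the `(3, 5)` minimum simply mirrors the requirement stated above.

```python
import sys


def is_supported_python(version_info=sys.version_info, minimum=(3, 5)):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum


# Report the version of the interpreter running this code
print("Running Python {0}.{1}".format(*sys.version_info[:2]))

if not is_supported_python():
    print("This interpreter is too old for pyConText; "
          "switch to a Python 3 environment.")
```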

In [None]:
import os, glob
import pandas as pd
import re

In [None]:
from IPython.display import display, HTML, Image
import sklearn.metrics

# packages for interaction
# (the widgets now live in ipywidgets; IPython.html.widgets is deprecated)
import ipywidgets
from ipywidgets import interact, interactive, fixed

# and also our utilities for this class
from nlp_pneumonia_utils import *

# NLP Annotation
When designing an NLP system, we need examples that let us compare our system's output with a human's judgment. This allows us to see where our system makes mistakes and to compute metrics such as **accuracy**, **precision**, and **recall**.
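
To make these metrics concrete, here is a small sketch that computes them for a binary task from a list of human labels and a list of system labels. The function names and the toy label lists are invented for illustration and are not part of the class utilities.

```python
def confusion_counts(gold, predicted):
    """Count true/false positives and negatives for binary labels."""
    tp = fp = fn = tn = 0
    for g, p in zip(gold, predicted):
        if g and p:
            tp += 1      # system and human both say positive
        elif p:
            fp += 1      # system says positive, human says negative
        elif g:
            fn += 1      # system missed a positive
        else:
            tn += 1      # both say negative
    return tp, fp, fn, tn


def evaluate(gold, predicted):
    """Return accuracy, precision, and recall as a dict."""
    tp, fp, fn, tn = confusion_counts(gold, predicted)
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy labels: 1 = pneumonia, 0 = no pneumonia (made-up values)
gold = [1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0]
metrics = evaluate(gold, predicted)
print(metrics)
```

In practice, `sklearn.metrics` offers `accuracy_score`, `precision_score`, and `recall_score`, which compute the same quantities.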

One way to gather this information is by **annotating** clinical text. In an annotation study, human experts will read through a small number of clinical documents and manually extract the information of interest. These annotations then become part of a **reference standard** which we use to evaluate our system.

# Pneumonia Dataset
Today, we'll be working with an annotated dataset of MIMIC-II radiology reports. Our training set will consist of 100 documents which were reviewed and marked for:
- **Mention-level evidence**: Individual phrases or sentences which the annotators considered evidence of pneumonia
- **Document-level classification**: Whether or not the document indicates the patient has pneumonia
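
One way to picture the two annotation levels is as a document record that carries a single label plus a list of text spans. The sketch below is purely illustrative: apart from `positive_label`, which the annotated documents used in this notebook expose, the type names, fields, and the example sentence are assumptions and may not match the classes in `nlp_pneumonia_utils`.

```python
from collections import namedtuple

# Hypothetical containers, for illustration only
MentionAnnotation = namedtuple("MentionAnnotation", ["start", "end", "text"])
AnnotatedDocument = namedtuple(
    "AnnotatedDocument", ["text", "positive_label", "mentions"]
)

# An invented one-sentence "report"
report = "Findings consistent with right lower lobe pneumonia."
phrase = "right lower lobe pneumonia"
start = report.index(phrase)

doc = AnnotatedDocument(
    text=report,
    positive_label=True,  # document-level classification
    mentions=[MentionAnnotation(start, start + len(phrase), phrase)],  # mention-level evidence
)

# Each mention's character offsets should point back at its text
for mention in doc.mentions:
    print(mention.start, mention.end, repr(doc.text[mention.start:mention.end]))
```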

We'll start by looking through the annotated dataset to get a sense of what our task is.

In [None]:
# Read in the data from our training folder
annotated_doc_map = read_doc_annotations('pneumonia_data/training_v2')
annotated_docs = list(annotated_doc_map.values())

print('Total Annotated Documents : {0}'.format(len(annotated_docs)))

# Count the documents whose document-level label is positive
total_positives = sum(1 for anno_doc in annotated_docs if anno_doc.positive_label)
    
print('Total Positive Pneumonia Documents : {0}'.format(total_positives))

In [None]:
df = annotated_doc_map_to_df(annotated_doc_map)

In [None]:
df.head()

In [None]:
df['annotation_level'].value_counts()

In [None]:
df['type'].value_counts()

Let's see what this dataset looks like. We can scroll through one document at a time, viewing a marked-up version of the text alongside its structured annotations.

Take a few minutes to scroll through the documents. Positive mention-level annotations of pneumonia will be highlighted in red within the text.

**Discussion**
- What phrases/words seem to mean "pneumonia"?
- Are there any documents which have the word "pneumonia" but aren't highlighted?
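
To help with the second question, a plain keyword search can flag every place the word "pneumonia" occurs so you can compare it with the highlighting. This is only a rough sketch using Python's `re` module, not how pyConText works internally; the function name and the toy sentences are invented for illustration.

```python
import re

# Case-insensitive match on the whole word "pneumonia"
PNEUMONIA_PATTERN = re.compile(r"\bpneumonia\b", re.IGNORECASE)


def find_keyword_spans(text, pattern=PNEUMONIA_PATTERN):
    """Return (start, end, matched_text) for each keyword occurrence."""
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(text)]


# Made-up sentences, not taken from the dataset
toy_reports = [
    "Right lower lobe opacity, likely pneumonia.",
    "No evidence of pneumonia.",
    "Lungs are clear.",
]

for report in toy_reports:
    print(report, find_keyword_spans(report))
```

The second toy sentence contains the keyword in a negated context, which is the kind of mention that a keyword search finds but that might not count as positive evidence.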

In [None]:
# This function lets us iterate through all documents and view the markup
def view_annotation_markup(anno_docs):
    @interact(i=ipywidgets.IntSlider(min=0, max=len(anno_docs)-1))
    def _view_markup(i):
        report_html = pneumonia_annotation_html_markup(anno_docs[i])
        report_html = report_html.replace('\n', '<br>')
        display(HTML(report_html))

In [None]:
index = 2
sub_df = df[df['document_idx'] == index]
sub_df

In [None]:
view_annotation_markup(annotated_docs)