# Keyword Searching and Error Analysis

A good starting point for gauging the complexity of your problem is a keyword search: if a keyword or any of a set of keywords appears in the document, classify the document as positive.

In [3]:
#First import packages
import urllib.request
import os
import codecs
import zipfile
import pandas as pd
from IPython.display import display, HTML
import ipywidgets
import sklearn.metrics

In [4]:
from nlp_pneumonia_utils import read_doc_annotations
from nlp_pneumonia_utils import calculate_prediction_metrics
from nlp_pneumonia_utils import mark_text
from nlp_pneumonia_utils import pneumonia_annotation_html_markup
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets

## 1. Create a keyword classifier
### Team exercise:
Iterate through a set of keywords (self.keywords). If a keyword in the set appears in 'text', return prediction = 1; otherwise, return prediction = 0.

Complete the function below.

In [6]:
class KeywordClassifier(object):
    def __init__(self):
        self.keywords = set()
    def predict(self, text):
        prediction = 0
        
#         your code here
        return prediction
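
If you get stuck, here is one possible way to complete the exercise. It is only a sketch: the class name `KeywordClassifierSketch` is hypothetical, and matching keywords as case-insensitive substrings is an assumption, so its counts may differ from the outputs shown in this notebook.

```python
class KeywordClassifierSketch(object):
    """One possible solution to the exercise (hypothetical name)."""

    def __init__(self, keywords=None):
        self.keywords = set(keywords or [])

    def predict(self, text):
        # Assumption: keywords are matched as case-insensitive substrings.
        lowered = text.lower()
        for keyword in self.keywords:
            if keyword.lower() in lowered:
                return 1  # at least one keyword was found
        return 0  # no keyword was found


sketch = KeywordClassifierSketch({'pneumonia'})
print(sketch.predict('Findings consistent with Pneumonia.'))
print(sketch.predict('The lungs are clear.'))
```

Note that plain substring matching also fires on negated mentions such as "no pneumonia", which is one likely source of false positives.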
    


Test the function you just wrote by adding one keyword to the set: 'pneumonia'

In [7]:
keyword_classifier = KeywordClassifier()
keyword_classifier.keywords.add('pneumonia')
annotated_doc_map = read_doc_annotations('data/training_v2.zip')
print('Total Annotated Documents : {0}'.format(len(annotated_doc_map)))

keyword_positives = 0
actual_positives = 0
for doc in annotated_doc_map.values():
    if keyword_classifier.predict(doc.text) == 1:
        keyword_positives+=1
        
    if doc.positive_label == 1:
        actual_positives+=1
        
print('Number of documents labeled positive by keyword classifier: ' + str(keyword_positives))
print('Number of documents labeled positive by gold standard: ' + str(actual_positives))


Reading annotations from file : data/training_v2.zip
Opening local file : data/training_v2.zip
Total Annotated Documents : 70
Number of documents labeled positive by keyword classifier: 30
Number of documents labeled positive by gold standard: 34


## 2. Error analysis of keyword classifier

### Let's first look at false negatives 



In [8]:
def list_false_negatives(gold_docs, prediction_function):
    fn_docs={}
    for doc_name, gold_doc in gold_docs.items():
        gold_label=gold_doc.positive_label
        pred_label = prediction_function(gold_doc.text)
        if gold_label==1 and pred_label==0:
            fn_docs[doc_name]=gold_doc            
    return fn_docs     
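
The same pattern can be mirrored to collect false positives. The sketch below is an illustration, not part of `nlp_pneumonia_utils`: `list_false_positives`, `ToyDoc`, and `toy_predict` are hypothetical names, and the toy documents only model the two attributes (`text`, `positive_label`) used above.

```python
from collections import namedtuple

# Stand-in for the annotated document objects; only the two attributes
# used in this notebook (text, positive_label) are modeled here.
ToyDoc = namedtuple('ToyDoc', ['text', 'positive_label'])


def list_false_positives(gold_docs, prediction_function):
    """Return the documents predicted positive but labeled negative."""
    fp_docs = {}
    for doc_name, gold_doc in gold_docs.items():
        pred_label = prediction_function(gold_doc.text)
        if gold_doc.positive_label == 0 and pred_label == 1:
            fp_docs[doc_name] = gold_doc
    return fp_docs


def toy_predict(text):
    # Toy prediction function: plain substring match on one keyword.
    return 1 if 'pneumonia' in text else 0


toy_docs = {
    'doc_a': ToyDoc('No evidence of pneumonia.', 0),
    'doc_b': ToyDoc('Right lower lobe pneumonia.', 1),
    'doc_c': ToyDoc('The lungs are clear.', 0),
}
toy_fp = list_false_positives(toy_docs, toy_predict)
print(list(toy_fp.keys()))
```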


In [9]:
fn=list_false_negatives(annotated_doc_map, keyword_classifier.predict)
docs=list(fn.keys())
print(len(docs))

13


### How do we fix false negatives?

Let's look through each document that is a false negative, showing one document at a time:

In [10]:
@interact(i=ipywidgets.IntSlider(min=0, max=len(docs)-1))
def display_doc(i):
    doc_name=docs[i]    
    print(doc_name)
    anno_doc=fn[doc_name]
    display(HTML(pneumonia_annotation_html_markup(anno_doc).replace('\n', '<br>')))    



## 3. More efficient review:
Full documents are not convenient to read, so let's try a snippet view instead. We need to write another function to replace "*pneumonia_annotation_html_markup*".

Although we are measuring document-level annotations, we will focus the error analysis on mention-level annotations ("**EVIDENCE_OF_PNEUMONIA**"), because the latter is where the errors originate.



In [11]:

def snippets_markup(annotated_doc_map):
    html = ["<html>","<table width=100% >",
            "<col style=\"width:25%\"><col style=\"width:75%\">",
            "<tr><th style=\"text-align:center\">document name</th><th style=\"text-align:center\">Snippets</th></tr>"]
    for doc_name, anno_doc in annotated_doc_map.items():
        html.extend(snippet_markup(doc_name,anno_doc))
    html.append("</table>")
    html.append("</html>")
    return ''.join(html) 


def snippet_markup(doc_name,anno_doc):
#   only __insert_color is needed to highlight the annotated span
    from pyConTextNLP.display.html import __insert_color
    html=[]
    color= 'blue'    
    window_size=50    
    html.append("<tr>")
    html.append("<td style=\"text-align:left\">{0}</td>".format(doc_name))
    html.append("<td></td>")
    html.append("</tr>")
    for anno in anno_doc.annotations:
        if anno.type == 'EVIDENCE_OF_PNEUMONIA':
#           make sure the snippet stays inside the text boundaries
            begin=anno.start_index-window_size
            end=anno.end_index+window_size
            begin=begin if begin>0 else 0
            end=end if end<len(anno_doc.text) else len(anno_doc.text)    
#           render a highlighted snippet; both span offsets are relative to the snippet start
            cell=__insert_color(anno_doc.text[begin:end],[anno.start_index-begin,anno.end_index-begin],color)
#           add the snippet into table
            html.append("<tr>")
            html.append("<td></td>")
            html.append("<td style=\"text-align:left\">{0}</td>".format(cell))
            html.append("</tr>") 
    return html
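
The window arithmetic behind the snippets can be checked on a plain string, without any HTML. In this minimal sketch, `snippet_with_offsets` is a hypothetical helper: it clips the window to the text boundaries and returns the span offsets relative to the start of the snippet.

```python
def snippet_with_offsets(text, start, end, window_size=50):
    """Return (snippet, relative_start, relative_end) for a span in text."""
    # Clip the window so it never runs past either end of the text.
    begin = max(start - window_size, 0)
    stop = min(end + window_size, len(text))
    # Offsets inside the snippet are measured from the snippet's start.
    return text[begin:stop], start - begin, end - begin


text = 'There is a small focal opacity in the right upper lobe.'
start = text.index('opacity')
end = start + len('opacity')
snippet, rel_start, rel_end = snippet_with_offsets(text, start, end, window_size=10)
print(snippet)
print(snippet[rel_start:rel_end])
```

Slicing the snippet with the relative offsets should always give back the annotated mention, including when the mention sits at the very beginning or end of the document.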

Let's try it out:

In [12]:
fn=list_false_negatives(annotated_doc_map, keyword_classifier.predict)
docs=list(fn.keys())
display(HTML(snippets_markup(fn)))

| document name | Snippets |
|---|---|
| subject_id_146_hadm_id_18965 | |
| | al effusion.  Right CPA not included on film. There is obscuration of left hemidiaphragm  likely secondary to atelectasis/consolidation in left lower lobe. |
| subject_id_150_hadm_id_12121 | |
| | es are  unremarkable.  IMPRESSION: Small focal opacity in right upper lobe and right paratracheal  opacity. In the sett |
| | CHEST PA AND LATERAL: The heart size is normal. There is an area of  increased opacity lateral to the right paratracheal stripe. In the |
| | pacity lateral to the right paratracheal stripe. In the right  upper lobe, there is a small focal opacity. The lungs are otherwise clear.  There are no |
| subject_id_157_hadm_id_26180 | |
| | ung is incompletely imaged  on this study and there is a questionable area of abnormality partially  obscuring the mid portion of the right hemidiaphragm, incompletely evaluated.  IMPRESSION: |
| subject_id_261_hadm_id_19250 | |
| | ignificant change compared with [**3197-12-9**]. Bilateral pulmonary  opacities involving the lower and mid lung zones. |


## 4. Add more keywords to your keyword classifier to decrease the number of false negatives


In [13]:
keyword_classifier.keywords = {'pneumonia', 'consolidation'}
print(keyword_classifier.keywords)

{'pneumonia', 'consolidation'}
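
One way to decide which keywords to add is to count how many false-negative documents each candidate would catch. The sketch below uses toy texts shortened from the snippets shown earlier; the candidate list is hypothetical and only for illustration.

```python
from collections import Counter

# Toy false-negative texts, shortened from the snippets shown earlier.
fn_texts = [
    'likely secondary to atelectasis/consolidation in left lower lobe',
    'Small focal opacity in right upper lobe and right paratracheal opacity',
    'Bilateral pulmonary opacities involving the lower and mid lung zones',
]
# Hypothetical candidate keywords to evaluate.
candidates = ['consolidation', 'opacity', 'opacities', 'infiltrate']

# Count each candidate at most once per document.
doc_counts = Counter()
for text in fn_texts:
    lowered = text.lower()
    for keyword in candidates:
        if keyword in lowered:
            doc_counts[keyword] += 1

for keyword, count in doc_counts.most_common():
    print(keyword, count)
```

Keep in mind that a broad keyword may recover more false negatives while also introducing new false positives, so it is worth recomputing the metrics after each change.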


## 5. Compute the outcome metrics using your KeywordClassifier

Let's copy and paste the *keyword_classifier* definition below for editing convenience and reuse the *calculate_prediction_metrics* function. Its first parameter must be a list, but our *annotated_doc_map* is a dictionary, so we pass *list(annotated_doc_map.values())* instead of *annotated_doc_map* directly.


In [19]:
res=calculate_prediction_metrics(list(annotated_doc_map.values()), keyword_classifier.predict)

Precision : 0.6829268292682927
Recall :    0.8235294117647058
F1:         0.7466666666666667

Confusion Matrix : 


| Actual \ Predicted | 0 | 1 |
|---|---|---|
| 0 | 23 | 13 |
| 1 | 6 | 28 |


In [24]:
gold_docs=list(annotated_doc_map.values())
prediction_function=keyword_classifier.predict
gold_labels = [x.positive_label for x in gold_docs]
pred_labels = []
for gold_doc in gold_docs:
    pred_label = prediction_function(gold_doc.text)
    pred_labels.append(pred_label)

# now let's use scikit-learn to compute some metrics
precision = sklearn.metrics.precision_score(gold_labels, pred_labels)
recall = sklearn.metrics.recall_score(gold_labels, pred_labels)
f1 = sklearn.metrics.f1_score(gold_labels, pred_labels)
# let's use Pandas to make a confusion matrix for us
confusion_matrix_df = pd.crosstab(pd.Series(gold_labels, name='Actual'),
                                  pd.Series(pred_labels, name='Predicted'))

print('Precision : {0}'.format(precision))
print('Recall :    {0}'.format(recall))
print('F1:         {0}'.format(f1))

print('\nConfusion Matrix : ')
display(confusion_matrix_df)

Precision : 0.6829268292682927
Recall :    0.8235294117647058
F1:         0.7466666666666667

Confusion Matrix : 


| Actual \ Predicted | 0 | 1 |
|---|---|---|
| 0 | 23 | 13 |
| 1 | 6 | 28 |
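
As a sanity check, the three scores can be recomputed by hand from the four cells of the confusion matrix above. This is a small arithmetic sketch; the variable names are chosen here and are not part of the notebook's utilities.

```python
# Cells of the confusion matrix (rows: actual, columns: predicted).
n_tn, n_fp = 23, 13  # actual 0: predicted 0, predicted 1
n_fn, n_tp = 6, 28   # actual 1: predicted 0, predicted 1

# Precision: share of predicted positives that are truly positive.
manual_precision = n_tp / (n_tp + n_fp)
# Recall: share of actual positives that were found.
manual_recall = n_tp / (n_tp + n_fn)
# F1: harmonic mean of precision and recall.
manual_f1 = 2 * manual_precision * manual_recall / (manual_precision + manual_recall)

print('Precision : {0}'.format(manual_precision))
print('Recall :    {0}'.format(manual_recall))
print('F1:         {0}'.format(manual_f1))
```

The values should agree with the scikit-learn results: 28/41 for precision, 28/34 for recall, and 56/75 for F1.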


## 6. Quiz
Try the following questions to check whether you've understood this notebook.

In [38]:
from quiz_utils import error_analyses_1
error_analyses_1()


In [37]:
from quiz_utils import error_analyses_2
error_analyses_2()


In [39]:
from quiz_utils import error_analyses_3
error_analyses_3()

<br/><br/>This material was presented as part of the DeCART Data Science for the Health Science Summer Program at the University of Utah in 2019.<br/>
Presenters: Dr. Wendy Chapman, Kelly Peterson, Alec Chapman, Jianlin Shi <br> Acknowledgement: Many thanks to Olga Patterson; part of these materials is adapted from Patterson's previous work.