# Error Analysis (1)

From the previous notebook, you have seen that our current NLP solution does not classify every document correctly. Although reaching 100% is unrealistic, we can definitely get closer.

This notebook will show you how to analyze errors related to named entity recognition and guide you step by step through improving recall. We will talk about how to improve precision tomorrow.

## 1. Locate the errors

In [1]:
#First import packages
import urllib.request
import os
import codecs
import zipfile
import pandas as pd
from IPython.display import display, HTML
import ipywidgets
import sklearn.metrics

Reuse the classes and functions that we created in the previous notebook.
Note: we are going to use *read_doc_annotations* (returns a dictionary with the document name as the key and the annotations as the value) instead of *read_annotations* (returns a list of documents' annotations), so that we can list the errors with the corresponding document names.
<br/><br/>

In [2]:
from nlp_pneumonia_utils import read_doc_annotations
from nlp_pneumonia_utils import calculate_prediction_metrics
from nlp_pneumonia_utils import mark_text
from nlp_pneumonia_utils import pneumonia_annotation_html_markup
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets

<a id="cell1"></a>Next, we tweak the function **calculate_prediction_metrics** to list the differences (the errors) instead of calculating the measurements.

Question before we move on:

Why do we only care about *false negatives* for now?<br/><br/>


In [3]:
def list_false_negatives(gold_docs, prediction_function):
    # Collect documents that are positive in the gold standard but predicted negative
    fn_docs = {}
    for doc_name, gold_doc in gold_docs.items():
        gold_label = gold_doc.positive_label
        pred_label = prediction_function(gold_doc.text)
        if gold_label == 1 and pred_label == 0:
            fn_docs[doc_name] = gold_doc
    return fn_docs
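
To see what this function returns, here is a minimal sketch on toy data. `ToyDoc` and `toy_predict` are hypothetical stand-ins for the annotated documents and the classifier used in this notebook; only the `text` and `positive_label` attributes are assumed. The sketch also shows why false negatives matter for recall, which is computed as TP / (TP + FN).

```python
from collections import namedtuple

# Hypothetical stand-in for an annotated document: only the two attributes
# that list_false_negatives reads are modeled here.
ToyDoc = namedtuple('ToyDoc', ['text', 'positive_label'])


def list_false_negatives(gold_docs, prediction_function):
    """Same logic as the notebook cell above."""
    fn_docs = {}
    for doc_name, gold_doc in gold_docs.items():
        if gold_doc.positive_label == 1 and prediction_function(gold_doc.text) == 0:
            fn_docs[doc_name] = gold_doc
    return fn_docs


def toy_predict(text):
    # Toy rule: positive only if the literal word 'pneumonia' appears.
    return 1 if 'pneumonia' in text else 0


# Made-up documents, for illustration only.
toy_docs = {
    'doc_a': ToyDoc('Findings consistent with pneumonia.', 1),
    'doc_b': ToyDoc('Consolidation in left lower lobe.', 1),
    'doc_c': ToyDoc('The lungs are clear.', 0),
}

toy_fn = list_false_negatives(toy_docs, toy_predict)
true_positives = sum(1 for d in toy_docs.values()
                     if d.positive_label == 1 and toy_predict(d.text) == 1)
toy_recall = true_positives / (true_positives + len(toy_fn))
print(sorted(toy_fn))
print(toy_recall)
```

Every document removed from the false-negative list by a better rule moves from FN to TP, which raises recall directly.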


## 2. Display errors
Now we put everything together to display errors:<br/><br/>

In [4]:
class KeywordClassifier(object):
    def __init__(self):
        self.keywords = set()
    def predict(self, text):
        prediction = 0
        for keyword in self.keywords:
            if keyword in text:
                prediction = 1
        return prediction
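
A quick check of how this classifier behaves on short strings may help before running it on the corpus. The sketch below repeats the class in a compact form so that it runs on its own; the sample sentences are made up for illustration. Note that `keyword in text` is a case-sensitive substring test, so lowercasing the text first (an assumption, not something the original notebook does) is one possible way to catch capitalized mentions.

```python
class KeywordClassifier(object):
    def __init__(self):
        self.keywords = set()

    def predict(self, text):
        # 1 if any keyword occurs as a substring of text, else 0
        return 1 if any(keyword in text for keyword in self.keywords) else 0


demo_classifier = KeywordClassifier()
demo_classifier.keywords.add('pneumonia')

# Made-up sentences, for illustration only.
lower_case_hit = demo_classifier.predict('findings suggest pneumonia')
upper_case_miss = demo_classifier.predict('PNEUMONIA cannot be excluded')
lowered_hit = demo_classifier.predict('PNEUMONIA cannot be excluded'.lower())
print(lower_case_hit, upper_case_miss, lowered_hit)
```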
    


In [5]:
keyword_classifier = KeywordClassifier()
# let's load in some manual keywords...
keyword_classifier.keywords.add('pneumonia')
annotated_doc_map = read_doc_annotations('data/training_v2.zip')
print('Total Annotated Documents : {0}'.format(len(annotated_doc_map)))

fn=list_false_negatives(annotated_doc_map, keyword_classifier.predict)

docs = list(fn.keys())

Reading annotations from file : data/training_v2.zip
Opening local file : data/training_v2.zip
Total Annotated Documents : 70


Show one document at a time:<br/><br/>

In [6]:
@interact(i=ipywidgets.IntSlider(min=0, max=len(docs)-1))
def display_doc(i):
    doc_name=docs[i]    
    print(doc_name)
    anno_doc=fn[doc_name]
    display(HTML(pneumonia_annotation_html_markup(anno_doc).replace('\n', '<br>')))    


## 3. More efficient review:
Not convenient to read? Let's try a snippet view instead. We need another function to replace "*pneumonia_annotation_html_markup*".

Although we are measuring document-level annotation, we will focus the error analysis on the mention level ("**EVIDENCE_OF_PNEUMONIA**"), because that is where the errors originate.<br/><br/>



In [7]:

def snippets_markup(annotated_doc_map):
    html = ["<html>","<table width=100% >",
            "<col style=\"width:25%\"><col style=\"width:75%\">",
            "<tr><th style=\"text-align:center\">document name</th><th style=\"text-align:center\">Snippets</th></tr>"]
    for doc_name, anno_doc in annotated_doc_map.items():
        html.extend(snippet_markup(doc_name,anno_doc))
    html.append("</table>")
    html.append("</html>")
    return ''.join(html) 


def snippet_markup(doc_name,anno_doc):
    from pyConTextNLP.display.html import __sort_by_span
    from pyConTextNLP.display.html import __insert_color
    html=[]
    color= 'blue'    
    window_size=50    
    html.append("<tr>")
    html.append("<td style=\"text-align:left\">{0}</td>".format(doc_name))
    html.append("<td></td>")
    html.append("</tr>")
    for anno in anno_doc.annotations:
        if anno.type == 'EVIDENCE_OF_PNEUMONIA':
            # make sure the snippet is cut inside the text boundary
            begin=max(anno.start_index-window_size, 0)
            end=min(anno.end_index+window_size, len(anno_doc.text))
            # render a highlighted snippet; both offsets are relative to the snippet start
            cell=__insert_color(anno_doc.text[begin:end],[anno.start_index-begin,anno.end_index-begin],color)
            # add the snippet into the table
            html.append("<tr>")
            html.append("<td></td>")
            html.append("<td style=\"text-align:left\">{0}</td>".format(cell))
            html.append("</tr>") 
    return html
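
The offset arithmetic is the part that is easiest to get wrong, so here is a minimal standalone sketch of it. `highlight_snippet` is a hypothetical helper written for illustration; it does not use pyConTextNLP, and the sample sentence is made up.

```python
def highlight_snippet(text, start, end, window_size=50, color='blue'):
    """Return an HTML snippet of text around [start, end) with the span colored.

    Hypothetical helper for illustration; it does not use pyConTextNLP.
    """
    # Clip the window to the text boundaries.
    begin = max(start - window_size, 0)
    stop = min(end + window_size, len(text))
    snippet = text[begin:stop]
    # Shift the annotation offsets so they are relative to the snippet.
    rel_start = start - begin
    rel_end = end - begin
    return (snippet[:rel_start]
            + '<span style="color:{0}">'.format(color)
            + snippet[rel_start:rel_end]
            + '</span>'
            + snippet[rel_end:])


sample_text = 'There is a small focal opacity in the right upper lobe.'
span_start = sample_text.index('opacity')
span_end = span_start + len('opacity')
print(highlight_snippet(sample_text, span_start, span_end, window_size=10))
```

Both offsets are shifted by the same amount (`begin`), which keeps the highlight correct even when the annotation sits at the very start or end of the document.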

Let's try it out:<br/><br/>

In [8]:
display(HTML(snippets_markup(fn)))

| document name | Snippets |
| --- | --- |
| subject_id_146_hadm_id_18965 | |
| | al effusion.  Right CPA not included on film. There is obscuration of left hemidiaphragm  likely secondary to atelectasis/consolidation in left lower lobe. |
| subject_id_150_hadm_id_12121 | |
| | es are  unremarkable.  IMPRESSION: Small focal opacity in right upper lobe and right paratracheal  opacity. In the sett |
| | CHEST PA AND LATERAL: The heart size is normal. There is an area of  increased opacity lateral to the right paratracheal stripe. In the |
| | pacity lateral to the right paratracheal stripe. In the right  upper lobe, there is a small focal opacity. The lungs are otherwise clear.  There are no |
| subject_id_157_hadm_id_26180 | |
| | ung is incompletely imaged  on this study and there is a questionable area of abnormality partially  obscuring the mid portion of the right hemidiaphragm, incompletely evaluated.  IMPRESSION: |
| subject_id_261_hadm_id_19250 | |
| | ignificant change compared with [**3197-12-9**]. Bilateral pulmonary  opacities involving the lower and mid lung zones. |


## 4. Now what?<br/><br/>

## 5. Try to compute the measurement scores using your KeywordClassifier

Let's copy the *keyword_classifier* definition below so that it is easy to edit, and reuse the *calculate_prediction_metrics* function. Its first parameter must be a list, but our *annotated_doc_map* is a dictionary, so we pass *list(annotated_doc_map.values())* instead of *annotated_doc_map*.<br/><br/>


In [9]:
calculate_prediction_metrics(list(annotated_doc_map.values()), keyword_classifier.predict)

Precision : 0.7
Recall :    0.6176470588235294
F1:         0.65625

Confusion Matrix : 


| Actual \ Predicted | 0 | 1 |
| --- | --- | --- |
| 0 | 27 | 9 |
| 1 | 13 | 21 |
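
The reported scores can be reproduced by hand from the confusion matrix. This is a small sketch, assuming rows are actual labels and columns are predicted labels (as the output suggests); the variable names are illustrative and chosen so they do not overwrite the notebook's `fn` dictionary.

```python
# Counts read from the confusion matrix above (rows = actual, columns = predicted).
tn, fp = 27, 9
fn_count, tp = 13, 21

precision = tp / (tp + fp)          # share of predicted positives that are correct
recall = tp / (tp + fn_count)       # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)

print('Precision : {0}'.format(precision))
print('Recall :    {0}'.format(recall))
print('F1:         {0}'.format(f1))
```

The 13 false negatives are the documents listed by *list_false_negatives*; each one that a new keyword recovers raises recall.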


In [10]:
#check what is left
keyword_classifier = KeywordClassifier()
# let's load in some manual keywords...
keyword_classifier.keywords.add('pneumonia')
keyword_classifier.keywords.add('extra keyword')
keyword_classifier.keywords.add('....')
fn=list_false_negatives(annotated_doc_map, keyword_classifier.predict)
display(HTML(snippets_markup(fn)))

| document name | Snippets |
| --- | --- |
| subject_id_146_hadm_id_18965 | |
| | al effusion.  Right CPA not included on film. There is obscuration of left hemidiaphragm  likely secondary to atelectasis/consolidation in left lower lobe. |
| subject_id_150_hadm_id_12121 | |
| | es are  unremarkable.  IMPRESSION: Small focal opacity in right upper lobe and right paratracheal  opacity. In the sett |
| | CHEST PA AND LATERAL: The heart size is normal. There is an area of  increased opacity lateral to the right paratracheal stripe. In the |
| | pacity lateral to the right paratracheal stripe. In the right  upper lobe, there is a small focal opacity. The lungs are otherwise clear.  There are no |
| subject_id_157_hadm_id_26180 | |
| | ung is incompletely imaged  on this study and there is a questionable area of abnormality partially  obscuring the mid portion of the right hemidiaphragm, incompletely evaluated.  IMPRESSION: |
| subject_id_261_hadm_id_19250 | |
| | ignificant change compared with [**3197-12-9**]. Bilateral pulmonary  opacities involving the lower and mid lung zones. |
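
One way to decide which keywords to try is to test candidate terms against the missed snippets before editing the classifier. The sketch below copies shortened snippets from the output above; `candidate_keywords` is an illustrative guess, not the authors' keyword list. New keywords may also introduce false positives, so precision should be rechecked after each change.

```python
# Snippets copied (shortened) from the false-negative output above.
missed_snippets = [
    'likely secondary to atelectasis/consolidation in left lower lobe.',
    'IMPRESSION: Small focal opacity in right upper lobe',
    'Bilateral pulmonary  opacities involving the lower and mid lung zones.',
]

# Illustrative guesses only, not the authors' keyword list.
# 'opacit' is a stem that covers both 'opacity' and 'opacities'.
candidate_keywords = ['consolidation', 'opacit']

hit_counts = {}
for keyword in candidate_keywords:
    hits = [s for s in missed_snippets if keyword in s]
    hit_counts[keyword] = len(hits)
    print('{0!r} matches {1} of {2} snippets'.format(
        keyword, len(hits), len(missed_snippets)))

still_missed = [s for s in missed_snippets
                if not any(k in s for k in candidate_keywords)]
print('Still missed: {0}'.format(len(still_missed)))
```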


## 6. Quiz
Try the following questions to see if you've understood this notebook.

In [11]:
from quiz_utils import error_analyses_1
error_analyses_1()

In [None]:
from quiz_utils import error_analyses_2
error_analyses_2()

In [None]:
from quiz_utils import error_analyses_3
error_analyses_3()

<br/><br/>This material was presented as part of the DeCART Data Science for the Health Science Summer Program at the University of Utah in 2018.<br/>
Presenters: Dr. Wendy Chapman, Jianlin Shi