# Error analysis

Let's build our pipeline using the simpler regex approach.


In [1]:
from pipeUtils import Annotation
from pipeUtils import Document
import re

doc = Document()
doc.load_document_from_file('data/test.txt')
doc.load_annotations_from_brat('data/test.ann')

target_regexes = []
regexes = ['pain',
          'depres\\w+',
          'suicidal\\s*ideation'
          ]
for reg in regexes:
    target_regexes.append(re.compile(reg, re.IGNORECASE))
    
neg_regex = '(\\bno\\b|denies)'

for reg in target_regexes:
    for match in reg.finditer(doc.text):
        new_annotation = Annotation(start_index=int(match.start()), end_index=int(match.end()), type='DepressionSymptoms')
        new_annotation.spanned_text = doc.text[new_annotation.start_index:new_annotation.end_index]
        # Check for a negation keyword in the 30 characters before the target,
        # clamping the window start so it never goes below index 0.
        pre_text_start = max(0, new_annotation.start_index - 30)
        pre_text = doc.text[pre_text_start:new_annotation.start_index]

        # We do not need the exact location of the negation keyword, so re.search is sufficient.
        if re.search(neg_regex, pre_text, re.IGNORECASE):
            new_annotation.attributes["Negation"] = 'Negated'
        doc.annotations.append(new_annotation)
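
The same target-plus-negation-window logic can be tried without `pipeUtils`. The sketch below is only an illustration: `find_symptoms`, `TARGET_PATTERNS`, `NEGATION`, and `WINDOW` are hypothetical names introduced here, and plain tuples stand in for `Annotation` objects.

```python
import re

# Same patterns as the cell above, written as raw strings.
TARGET_PATTERNS = [r'pain', r'depres\w+', r'suicidal\s*ideation']
NEGATION = re.compile(r'(\bno\b|denies)', re.IGNORECASE)
WINDOW = 30  # number of characters inspected before each target


def find_symptoms(text):
    """Return (start, end, matched_text, negated) for every target match."""
    results = []
    for pattern in TARGET_PATTERNS:
        for match in re.finditer(pattern, text, re.IGNORECASE):
            # Clamp the window so it never starts before the text does.
            window_start = max(0, match.start() - WINDOW)
            pre_text = text[window_start:match.start()]
            negated = NEGATION.search(pre_text) is not None
            results.append((match.start(), match.end(), match.group(), negated))
    return results


sample = "Knee pain is well controlled. She denies suicidal ideation."
for hit in find_symptoms(sample):
    print(hit)
```

Because the window is a fixed number of characters, it can reach back into the previous sentence, so a negation keyword there will also mark the target as negated.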

In [7]:
print(doc.toString())

test.txt
-------
General Medical Clinic
05/28/2010 13:00


CC
Follow up depression.

Subjective

Depression
The pt indicates Citalopram is helping control her depression symptoms but she continues to feel depressed most days.  Her sleep and fatigue have improved significantly with use of Citalopram.  She denies suicidal ideation.  Her PHQ-9 score is 18 today.

Hypertension
No Light-headedness.  The pt reports compliance with use of lisinopril and metoprolol.  She has been on these two medications for several years and has never used any other antihypertensive medications in her life.  She has not been checking her BP at home.

Osteoarthritis
Knee pain is well controlled currently.  No knee pain today.

Coronary Artery Disease
No angina.  No dyspnea.


Allergies
NKDA

PMH
Depression
Hypertension
Iron deficiency anemia
Osteoarthritis
Coronary Artery Disease
Hyperlipidemia

PSurgHx
None 

FamHx
None significant

SocHx
Lifetime non-user of tobacco.
Drinks alcohol rarely.
Has 5 adult chil

We have processed the document and can now compare annotations of two types:

    Symptom - the reference standard annotation type
    DepressionSymptoms - the annotation type created by our pipeline

The [pipeUtils](pipeUtils_doc.html) framework provides a method for comparing annotations of two types.

In [2]:
# The third argument (False) allows overlapping spans to count as matches.
tp, fp, fn, tp_list, fp_list, fn_list = doc.compare_types_by_span('Symptom', 'DepressionSymptoms', False)

In [8]:
print(tp, fp, fn)

4 4 3


------------

**tp_list** is a list of True Positive pairs. We can print each pair side by side for comparison.


In [9]:
for a in tp_list:
    print(a[0].toString(),'||', a[1].toString())

0 DepressionSymptoms 682 686 pain [Negation:Negated] || T10 Symptom 677 686 knee pain [Negation:Negated][Experiencer:Patient]
0 DepressionSymptoms 142 152 depression  || T2 Symptom 142 161 depression symptoms [Negation:Affirmed]
0 DepressionSymptoms 188 197 depressed  || T8 Symptom 188 197 depressed 
0 DepressionSymptoms 296 313 suicidal ideation [Negation:Negated] || T5 Symptom 296 313 suicidal ideation [Negation:Negated]


----------
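**fn_list** and **fp_list** are simple lists of unmatched annotations: **fn_list** holds the reference annotations the pipeline missed, and **fp_list** holds the pipeline annotations that have no reference match.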

In [10]:
for a in fn_list:
    print(a.toString())

for a in fp_list:
    print(a.toString())

T3 Symptom 224 231 fatigue 
T6 Symptom 362 378 Light-headedness [Negation:Negated]
T9 Symptom 734 741 dyspnea [Negation:Negated]
0 DepressionSymptoms 638 642 pain 
0 DepressionSymptoms 55 65 depression 
0 DepressionSymptoms 80 90 Depression 
0 DepressionSymptoms 765 775 Depression 
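
The counts reported above can be reproduced from the printed spans alone. The sketch below is not the `pipeUtils` implementation; `overlaps` and `compare_spans` are hypothetical helpers, and the greedy one-to-one matching is an assumption made for illustration. The span tuples are copied from the output printed above.

```python
def overlaps(a, b):
    """True when half-open spans (start, end) share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def compare_spans(reference, system, exact=False):
    """Count TP/FP/FN, matching each system span to at most one reference span."""
    unmatched_ref = list(reference)
    tp = 0
    fp = 0
    for span in system:
        # Take the first still-unmatched reference span that qualifies.
        hit = next((r for r in unmatched_ref
                    if (r == span if exact else overlaps(r, span))), None)
        if hit is None:
            fp += 1
        else:
            tp += 1
            unmatched_ref.remove(hit)
    # Reference spans that were never matched are the false negatives.
    return tp, fp, len(unmatched_ref)


# Symptom (reference) spans and DepressionSymptoms (system) spans from the output above.
reference = [(142, 161), (188, 197), (224, 231), (296, 313),
             (362, 378), (677, 686), (734, 741)]
system = [(55, 65), (80, 90), (142, 152), (188, 197),
          (296, 313), (638, 642), (682, 686), (765, 775)]

print(compare_spans(reference, system))
print(compare_spans(reference, system, exact=True))
```

In this sketch, requiring exact span equality turns partial matches such as `pain` versus `knee pain` into one false positive plus one false negative, which is why allowing overlap gives more forgiving counts.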


-----------

Since we have **True Positive**, **False Positive**, and **False Negative** counts, we can calculate **Precision** and **Recall**.

In [11]:
print('TP =', tp, 'FP =', fp, 'FN =', fn)

# Guard on the denominators so both values are always defined.
precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
print('Precision=', precision)

recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
print('Recall=', round(recall, 3))

TP = 4 FP = 4 FN = 3
Precision= 0.5
Recall= 0.571


Now we can calculate the **F1 score**, the harmonic mean of precision and recall.

In [None]:
F1 = (2*precision*recall)/(precision+recall) if (precision+recall) > 0 else 0.0
print('F1=', round(F1, 3))
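
As a worked check, all three metrics can be wrapped in one function and applied to the counts from this document (TP = 4, FP = 4, FN = 3). This is a minimal sketch: `precision_recall_f1` is a hypothetical helper, and returning 0.0 when a denominator is zero is an assumed convention, not something `pipeUtils` prescribes.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1); each metric is 0.0 when its denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


p, r, f1 = precision_recall_f1(4, 4, 3)
print(round(p, 3), round(r, 3), round(f1, 3))  # prints: 0.5 0.571 0.533
```

F1 can also be written directly in terms of counts as 2·TP / (2·TP + FP + FN), which here is 8 / 15.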