# FHI Module 7 Hands-on: Breast Cancer Workbook

## Time to get your hands dirty

You've learned what we need to do and how the tool works. Now it is time for you to make it actually work.

You are welcome to spend your time however you'd like, but here are a few ideas for improving your system:
* Improve targets. Are there any positive mentions your system is missing (false negatives)? Are there regular expressions that would help?
* Improve modifiers. Not all modifiers typically used in practice are in the modifiers starter file. Are there some to add? Do some existing modifiers cause problems in your processing? They can be changed or removed.
* Improve document classification rules. This is **optional**, because the default rules are ready to go. If you are interested, feel free to read the comments in the file to see how it works.
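
As an illustration of the first idea, a single regular expression can cover several surface forms of a target. The sketch below uses only Python's `re` module. The pattern, the helper name `find_target_mentions`, and the sample sentences are made up for illustration and are not taken from the starter targets file.

```python
import re

# Hypothetical pattern (not from the starter targets file): matches
# "breast cancer", "breast carcinoma", "breast ca", and
# "cancer/carcinoma of the breast", ignoring case.
BREAST_CA_PATTERN = re.compile(
    r"\bbreast\s+(?:cancer|carcinoma|ca)\b"
    r"|\b(?:cancer|carcinoma)\s+of\s+the\s+breast\b",
    re.IGNORECASE,
)


def find_target_mentions(text):
    """Return every substring of `text` that the pattern matches."""
    return [m.group(0) for m in BREAST_CA_PATTERN.finditer(text)]


samples = [
    "Mother had breast CA at age 52.",
    "Family history of carcinoma of the breast.",
    "No breast tenderness reported.",
]
for s in samples:
    print(s, "->", find_target_mentions(s))
```

Trying a candidate pattern on a handful of sentences like this, including ones it should not match, is a cheap check before adding it to the targets file.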

## 1. Let's go

In [None]:
# import packages that we will need
from nlp_pneumonia_utils import read_doc_annotations
from DocumentClassifier import DocumentClassifier
from nlp_pneumonia_utils import list_errors
from visual import Vis
from visual import snippets_markup
from visual import view_pycontext_output
from visual import display_doc_text
# packages for interaction
from IPython.display import display, HTML
import ipywidgets

## 2. Load our training set

In [None]:
# Any document with this kind of annotation is a positive document
pos_doc_type='FAM_BREAST_CA_DOC'
annotated_doc_map = read_doc_annotations(archive_file='data/bc_train.zip', pos_type=pos_doc_type)
print('Total Annotated Documents : {0}'.format(len(annotated_doc_map)))

## 3. Read in our Knowledge Base files
- The targets file is seeded with two targets: **"breast cancer"** and **"breast carcinoma"**  ([target rule file](/edit/KB/fam_bc_targets.yml))  
- The modifier file has all the modifiers for `Negation` and `Temporality`, but the `Family` modifiers are incomplete ([modifier rule file](/edit/KB/fam_bc_modifiers.yml)).
- **You'll need to add some to capture all of the positive instances of family history.**

In [None]:
TARGETS_FILE_PATH = 'KB/fam_bc_targets.yml'
MODIFIERS_FILE_PATH = 'KB/fam_bc_modifiers.yml'
FEATURE_INFERENCER_FILE_PATH = 'KB/fam_bc_featurer_inferences.csv'
DOC_INFERENCER_FILE_PATH = 'KB/fam_bc_doc_inferences.csv'
# Build the classifier from the knowledge base files
classifier = DocumentClassifier(TARGETS_FILE_PATH, MODIFIERS_FILE_PATH,
                               FEATURE_INFERENCER_FILE_PATH, DOC_INFERENCER_FILE_PATH,
                               {pos_doc_type})
# Clear saved predictions in case the rule files or regular expressions have been updated
classifier.reset_saved_predictions()

Here are the target terms that we're looking for:

In [None]:
classifier.targets

And here are the modifiers:

In [None]:
classifier.modifiers

## 4. Let's classify some documents
The function *list_errors* wraps several functions together. It compares the classifier's conclusions against the reference standard (manually annotated documents) and returns the false positive documents (with pyConText markups), the false negative documents (with manual annotations), and the measurements (precision, recall, and F1).


For the detailed implementation of this *list_errors* function, you can check the code in [nlp_pneumonia_utils](/edit/nlp_pneumonia_utils.py).
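
The comparison described above can be sketched in a few lines. This is a minimal, self-contained illustration, not the code in `nlp_pneumonia_utils`. The function name `compare_to_reference` and the toy document names are hypothetical, and labels are assumed to be 1 for positive and 0 for negative.

```python
def compare_to_reference(reference, predicted):
    """Compare predicted document labels against reference labels.

    Both arguments map a document name to 1 (positive) or 0 (negative).
    A document missing from `predicted` is treated as negative.
    Returns (false_negatives, false_positives, measurements).
    """
    true_positives = [d for d, gold in reference.items()
                      if gold == 1 and predicted.get(d, 0) == 1]
    false_negatives = sorted(d for d, gold in reference.items()
                             if gold == 1 and predicted.get(d, 0) == 0)
    false_positives = sorted(d for d, gold in reference.items()
                             if gold == 0 and predicted.get(d, 0) == 1)

    tp, fn, fp = len(true_positives), len(false_negatives), len(false_positives)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    measurements = {"precision": precision, "recall": recall, "f1": f1}
    return false_negatives, false_positives, measurements


# Toy data, made up for illustration
reference = {"doc1": 1, "doc2": 1, "doc3": 0, "doc4": 0, "doc5": 1}
predicted = {"doc1": 1, "doc2": 0, "doc3": 1, "doc4": 0, "doc5": 1}

fns, fps, scores = compare_to_reference(reference, predicted)
print("False negatives:", fns)
print("False positives:", fps)
print(scores)
```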

In [None]:
%%time
print('****************')
print('Performance for Classifier :')
current_false_negatives, current_false_positives, measurements, confusion_matrix_df = classifier.eval(annotated_doc_map)
print(measurements)
display(confusion_matrix_df)
print('****************')

Our classifier achieved the following scores:<br>
**Precision** = 0.958<br>
**Recall** = 0.719<br>
**F1** = 0.821
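
As a quick consistency check, F1 is the harmonic mean of precision ($P$) and recall ($R$). Plugging in the rounded scores reported above gives:

$$
F_1 = \frac{2 \cdot P \cdot R}{P + R} = \frac{2 \times 0.958 \times 0.719}{0.958 + 0.719} = \frac{1.3776}{1.677} \approx 0.821
$$

Because recall is the lower of the two scores, reducing false negatives is the most direct way to raise F1 here.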

## 5. Development of your system: can you improve the performance?
* The tools below will highlight and graph False Positives and False Negatives. Use them to look at errors and find ways to fix them.

Instructions:
1. Run the system and calculate performance
2. Review false negatives and positives and make changes to the target file or the modifier file
3. Repeat

### 5.1 Review the False Negatives - we have provided two viewers below

There are three reasons our pipeline may produce false negative errors:

1. A term is missing from the target lexicon file. If so, add the new term to the target rule file: `./KB/fam_bc_targets.yml`
2. A modifier rule for **family context**, such as "grandmother", is missing. Add it to the modifier rule file: `./KB/fam_bc_modifiers.yml`
3. A context rule **excluded** the target concept. If so, locate the context rule and remove or modify it in the modifier rule file: `./KB/fam_bc_modifiers.yml`
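
One way to narrow down which reason applies is to check whether any known target term appears in the text of a false negative document at all. If none does, the target lexicon is the likely gap; if one does, the modifiers are the place to look. The sketch below is self-contained and hypothetical: `triage_false_negative`, the term list, and the sample sentences are illustrative only and are not part of the workbook's utilities.

```python
import re

# Illustrative target terms; these two mirror the seeded targets.
TARGET_TERMS = ["breast cancer", "breast carcinoma"]


def triage_false_negative(text, target_terms=TARGET_TERMS):
    """Suggest where to look first for a missed positive document."""
    found = [t for t in target_terms
             if re.search(r"\b" + re.escape(t) + r"\b", text, re.IGNORECASE)]
    if not found:
        return "no known target found: consider adding a term to the targets file"
    return "target found ({}): review the modifier rules".format(", ".join(found))


print(triage_false_negative("Maternal aunt with breast ca, dx at 45."))
print(triage_false_negative("Her grandmother had breast cancer."))
```

The first sentence contains no seeded target ("breast ca" is not in the list), which points to reason 1. The second contains a target, which points to reasons 2 or 3.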

## False Negative Viewer - reference standard snippet annotations

In [None]:
# Keep only the annotated documents that the classifier missed
fn_docs = dict((k, v) for k, v in annotated_doc_map.items() if k in current_false_negatives)
display(HTML(snippets_markup(fn_docs, 'FAM_BREAST_CA')))

If you are sure the target terms are included in the targets file, then these false negative errors must come from the modifiers: either a family modifier is missing or an existing modifier excluded the target. Let's take a look at what the pyConText output looks like:

## False Negative Viewer - pyConText annotations

In [None]:
# set up the visualizer for pyConText output
vis=Vis(MODIFIERS_FILE_PATH)
fn_docs = dict((k,v) for k, v in classifier.saved_markups_map.items() if k in current_false_negatives)
view_pycontext_output(fn_docs,vis)

### 5.2 Review the false positives
For false positives, the pyConText graph is the most useful view, because modifiers may need to be adjusted so that targets are handled correctly in classification.

In [None]:
fp_docs = dict((k,v) for k, v in classifier.saved_markups_map.items() if k in current_false_positives)
view_pycontext_output(fp_docs,vis)
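
When reading a false positive, it can help to think about how modifiers combine with a target to produce a document-level decision. The toy sketch below is not pyConText's algorithm. It only checks, sentence by sentence, whether a family cue and a negation cue occur before the target. The cue lists and the function name `toy_family_history` are assumptions made for illustration.

```python
import re

# Illustrative cue lists; these are assumptions, not the workbook's rules.
FAMILY_CUES = ["mother", "sister", "aunt", "grandmother", "family history"]
NEGATION_CUES = ["no", "denies", "negative for"]
TARGET = "breast cancer"


def _has_cue(cues, text):
    """True if any cue occurs in `text` as a whole word or phrase."""
    return any(re.search(r"\b" + re.escape(c) + r"\b", text) for c in cues)


def toy_family_history(document):
    """Return 1 if some sentence has a family cue and no negation cue
    before the target term; otherwise return 0."""
    for sentence in re.split(r"[.!?]", document.lower()):
        position = sentence.find(TARGET)
        if position == -1:
            continue  # no target in this sentence
        before_target = sentence[:position]
        if (_has_cue(FAMILY_CUES, before_target)
                and not _has_cue(NEGATION_CUES, before_target)):
            return 1
    return 0


print(toy_family_history("His sister was dx breast cancer 20 years ago."))
print(toy_family_history("No family history of breast cancer."))
print(toy_family_history("Patient was diagnosed with breast cancer in 2010."))
```

If the negation cue list were empty, the second sentence would be classified as positive. A missing or overly narrow modifier rule can produce a false positive in much the same way, and the graph in the viewer shows which modifiers were actually attached to each target.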

### 5.3 pyConText playground
After you change your target and modifier rules, type a sentence into `sentence` below and make sure the rules do what you think they do.

In [None]:
# Refresh the classifier with updated rules
classifier = DocumentClassifier(TARGETS_FILE_PATH, MODIFIERS_FILE_PATH,
                               FEATURE_INFERENCER_FILE_PATH, DOC_INFERENCER_FILE_PATH,
                               {pos_doc_type})

# Named `sentence` to avoid shadowing the built-in `str`
sentence = '''his sister was dx breast cancer 20 years ago'''
res = classifier.predict(sentence)
print("Positive" if res == 1 else "Negative")
view_pycontext_output(classifier.get_last_doc_markups(), vis)

## 6. Test Set Evaluation 

In [None]:
%%time
annotated_doc_map = read_doc_annotations(archive_file='img/bc_test.zip', pos_type=pos_doc_type)

classifier.reset_saved_predictions()
print('****************')
print('Performance for Classifier on test set:')
current_false_negatives, current_false_positives, measurements, confusion_matrix_df = classifier.eval(annotated_doc_map)
print(measurements)
display(confusion_matrix_df)
print('****************')

<br/><hr/>This material is presented as part of the Foundations of Healthcare Informatics Course, 2017 Fall, BMI, University of Utah. It is revised from the <a href="https://github.com/UUDeCART/decart_rule_based_nlp">material</a> of the DeCART Summer Program (Data, exploration, Computation, and Analytics Real-world Training for the Health Sciences) at the University of Utah in 2017. <br/><br/>Original presenters: Dr. Wendy Chapman, Jianlin Shi and Kelly Peterson.<br/>
Revised by: Jianlin Shi and Dr. Wendy Chapman<br/>
<img align="left" src="https://wiki.creativecommons.org/images/1/10/Cc.org_cc_by_license.jpg" alt="Except where otherwise noted, this website is licensed under a Creative Commons Attribution 3.0 Unported License.">