# Now it's time to work on the final group project.  
All of the pieces we've learned are integrated into this initial template. Your goal is to improve the F1 measure for classifying evidence of pneumonia.

## A few ideas of how to improve your system:
* Improve targets.  Are there any False Negatives your system is missing?  Are there regular expressions that would help?
* Improve modifiers.  Not all modifiers typically used in practice are in the modifiers starter file.  Are there some to add?  Do some existing modifiers cause problems in your processing?  They can be changed or removed.
* Improve document classification rules.  What rules work best?  What is the best "default" classification?
* Consider handling of document "sections".  Are there certain headers or subsections which are more or less likely to contain evidence?  You could modify your own "markup" function to do this, or in some cases you could add Modifiers to do it.
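
As a minimal sketch of the regular-expression idea above, the snippet below tries a candidate pattern against a few example phrases before it goes into a targets file. The pattern and the example phrases are illustrative assumptions, not entries from `pneumonia_targets.yml`.

```python
import re

# Hypothetical candidate pattern: matches "pneumonia", "pneumonias",
# "bronchopneumonia", and the abbreviation "PNA" as whole words.
CANDIDATE_TARGET = re.compile(r"\b(?:(?:broncho)?pneumonias?|pna)\b", re.IGNORECASE)

def find_target_spans(text):
    """Return (start, end, matched_text) for each candidate target in text."""
    return [(m.start(), m.end(), m.group(0)) for m in CANDIDATE_TARGET.finditer(text)]

# Made-up phrases for a quick check; replace with text from your false negatives.
examples = [
    "Findings consistent with bronchopneumonia.",
    "r/o PNA",
    "No acute cardiopulmonary process.",
]
for sentence in examples:
    print(sentence, "->", find_target_spans(sentence))
```

Checking a pattern this way against sentences taken from your False Negatives makes it easier to see whether a new regular expression is too narrow or too greedy before you re-run the full evaluation.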

## Also before we get going, a few Pro Tips:
* Remember that pyConText files need to be tab delimited.  If you edit these files in JupyterHub, it might be difficult to see the tabs, and pressing "TAB" will actually insert spaces, so try to copy and paste an existing tab instead.
* Classification rules and modifiers are difficult.  Don't be afraid to ask for help.
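
For any knowledge file that is still tab delimited, a quick check like the one below can catch lines where spaces were typed instead of tabs. This is only a sketch: the helper name, the sample lines, and the expected field count of 4 are assumptions, so adjust them to the file you are editing.

```python
def lines_missing_tabs(lines, expected_fields):
    """Return 1-based line numbers whose tab-separated field count is wrong."""
    bad = []
    for number, line in enumerate(lines, start=1):
        stripped = line.rstrip("\n")
        if not stripped.strip():
            continue  # ignore blank lines
        if len(stripped.split("\t")) != expected_fields:
            bad.append(number)
    return bad

# Made-up sample: the second line uses spaces where the first tab should be.
sample = [
    "pneumonia\tEVIDENCE_OF_PNEUMONIA\t\t\n",
    "no evidence of    DEFINITE_NEGATED_EXISTENCE\t\tforward\n",
]
print(lines_missing_tabs(sample, expected_fields=4))
```

To check a real file, pass the result of `open(path).readlines()` as `lines`.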

## We are going to use yaml for our knowledge representation

In [1]:
#!conda install pyyaml -y

## Before we continue we need to open our knowledge file(s) for editing.

### In the File menu select open

![opening a file](./opening_files.png)

### Navigate to the `KB` folder and select `pneumonia_targets.yml`

![selecting yaml file](./select_yaml_file.png)

### now open `pneumonia_targets.yml` to edit

![editing yaml file](./edit_yaml_file.png)

In [2]:
import more_utils as mu
import os

## Load our training set

In [3]:
annotated_doc_map = mu.read_doc_annotations('data/training_v2.zip')

# let's also use a simple list of documents as well as this map
annotated_docs = list(annotated_doc_map.values())

print('Total Annotated Documents : {0}'.format(len(annotated_docs)))

Reading annotations from file : data/training_v2.zip
Opening local file : data/training_v2.zip
Total Annotated Documents : 70


## Setting up our resources
You're welcome to start new files or continue in the files we've used, but let's set up the defaults we have used in the course.

In [4]:
TARGETS_FILE_PATH = 'file:///' + os.path.join(os.getcwd(), 'KB/pneumonia_targets.yml')
MODIFIERS_FILE_PATH = 'file:///' + os.path.join(os.getcwd(),'KB/pneumonia_modifiers.yml')
CLASSIFIER_FILE_PATH = 'KB/classifierRules.csv'
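
The string concatenation above works in the course environment. On Linux or macOS, `os.getcwd()` already starts with `/`, so the result has four slashes after `file:`. If that ever causes trouble, `pathlib` can build the URI instead. This is a sketch only: `kb_file_uri` is a hypothetical helper, and whether `mu.get_items` accepts its output is an assumption to verify.

```python
from pathlib import Path

def kb_file_uri(relative_path):
    """Build a file:// URI for a path relative to the current working directory."""
    # as_uri() requires an absolute path, so anchor the path at the cwd first.
    return (Path.cwd() / relative_path).as_uri()

print(kb_file_uri('KB/pneumonia_targets.yml'))
```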

## Load our targets and resources now

In [5]:
# clear just in case files/regular expressions have been updated
mu.clearPyConTextRegularExpressions()

targets = mu.get_items(TARGETS_FILE_PATH)
modifiers = mu.get_items(MODIFIERS_FILE_PATH)
#targets = pyConTextNLP.itemData.instantiateFromCSVtoitemData(TARGETS_FILE_PATH)
#modifiers = pyConTextNLP.itemData.instantiateFromCSVtoitemData(MODIFIERS_FILE_PATH)

print('Targets loaded : {0}'.format(len(targets)))
print('Modifiers loaded : {0}'.format(len(modifiers)))

Targets loaded : 3
Modifiers loaded : 363


## Construct our Document Classifier

In [6]:
debug_classifier = False
docClassifier = mu.DocumentClassifier(CLASSIFIER_FILE_PATH, debug_classifier, modifiers, targets) 

## Let's attempt some predictions
* You will do a lot of iterations modifying content and then coming back here to check performance
* Remember : the prediction function passed here takes a string (text) and returns 0 or 1
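
Because the evaluation only needs a function from text to 0 or 1, one way to experiment with section handling is to wrap an existing prediction function. The sketch below is an assumption-laden illustration: the header pattern, the skipped header names, and `toy_predict` (a stand-in for `docClassifier.predict`) are all hypothetical.

```python
import re

# Hypothetical section headers to drop before classification; choose your own
# after inspecting the training reports.
SKIPPED_HEADERS = ("INDICATION", "HISTORY")

# Assumes headers are upper-case words at the start of a line ending in a colon.
HEADER_PATTERN = re.compile(r"^([A-Z][A-Z /]+):", re.MULTILINE)

def drop_sections(text, skipped=SKIPPED_HEADERS):
    """Remove every section (header and body) whose header is in `skipped`."""
    matches = list(HEADER_PATTERN.finditer(text))
    if not matches:
        return text
    kept = [text[:matches[0].start()]]  # any text before the first header
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        if m.group(1).strip() not in skipped:
            kept.append(text[m.start():end])
    return "".join(kept)

def make_section_aware_predict(base_predict, skipped=SKIPPED_HEADERS):
    """Wrap a text -> 0/1 function so it only sees the retained sections."""
    def predict(text):
        return base_predict(drop_sections(text, skipped))
    return predict

# Stand-in for docClassifier.predict so the sketch runs on its own.
def toy_predict(text):
    return 1 if "pneumonia" in text.lower() else 0

report = "INDICATION: rule out pneumonia\nFINDINGS: Lungs are clear.\n"
section_aware = make_section_aware_predict(toy_predict)
print(toy_predict(report), section_aware(report))
```

The wrapped function keeps the same text-in, 0-or-1-out contract, so it could be passed to `mu.calculate_prediction_metrics` in place of `docClassifier.predict` to compare scores with and without the filtered sections.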

In [7]:
print('****************')
print('Performance for Classifier :')
mu.calculate_prediction_metrics(annotated_docs, docClassifier.predict)

****************
Performance for Classifier :
Precision : 0.9
Recall :    0.7941176470588235
F1:         0.84375

Confusion Matrix (rows: actual, columns: predicted):

| Actual \ Predicted | 0 | 1 |
|---|---|---|
| 0 | 33 | 3 |
| 1 | 7 | 27 |
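
The reported scores follow directly from the confusion matrix above, where rows are actual labels and columns are predicted labels: 27 true positives, 3 false positives, and 7 false negatives. The helper below is a small sketch for reproducing the arithmetic; `metrics_from_counts` is a hypothetical name, not part of `more_utils`.

```python
def metrics_from_counts(tp, fp, fn):
    """Compute precision, recall and F1 from raw counts (0.0 if undefined)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 = 2PR / (P + R), written in terms of counts to avoid rounding drift.
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return precision, recall, f1

precision, recall, f1 = metrics_from_counts(tp=27, fp=3, fn=7)
print(precision, recall, f1)
```

Since there are more false negatives (7) than false positives (3), recall is the weaker of the two scores, which suggests starting with missed targets.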


## Development of your system:
* We have found the tools below for highlighting and graphing False Positives and False Negatives to be very useful.  We've provided them in case they help you as well.

In [8]:
# NOTE : You may need to modify this color mapping if you add  Target or Modifier categories not found here
# prepare some colors for displaying any markup we might see
colors = {
    "evidence_of_pneumonia": "orange",
    "definite_negated_existence": "red",
    "probable_negated_existence": "indianred",
    "ambivalent_existence": "orange",
    "probable_existence": "forestgreen",
    "definite_existence": "green",
    "historical": "goldenrod",
    "indication": "pink",
    "acute": "gold"
}

### We may need to download the textblob corpora

In [9]:
#!python -m textblob.download_corpora

In [10]:

# get our current set of false negatives and false positives if we use our simple toy classifier
# which uses targets and a simplified implementation of modifiers
current_false_negatives = mu.list_false_negatives(annotated_doc_map, docClassifier.predict)
current_false_positives = mu.list_false_positives(annotated_doc_map, docClassifier.predict)

fn_report_results = mu.marking_false_negatives(current_false_negatives, modifiers, targets)
fp_report_results = mu.marking_false_positives(current_false_positives, modifiers, targets)

print('Current total False Negatives : {0}'.format(len(current_false_negatives)))
print('Current total False Positives : {0}'.format(len(current_false_positives)))

Marking up False Negatives
Marking up False Positives
Current total False Negatives : 7
Current total False Positives : 3


## For False Negatives, it's most useful to see the expert span annotations for positive pneumonia evidence to see if there may be targets that should be added

In [11]:
mu.view_pycontext_graph(fn_report_results, colors)

## For False Positives, it's most useful to see a pyConText graph, since modifiers may need to be adjusted so that targets are properly used in classification

In [12]:
mu.view_pycontext_graph(fp_report_results, colors)

# TEST SET Evaluation 
* We've been waiting for the test set.  It will not be available until the morning of the final class session.
* At that time, you can uncomment this code and make any changes to it as instructed by the class instructors:

In [None]:
#test_doc_map = mu.read_doc_annotations('data/training_v2.zip')

# let's also use a simple list of documents as well as this map
#test_docs = list(test_doc_map.values())

#print('Total Test Documents : {0}'.format(len(test_docs)))

# and now let's check performance on the TEST set...
#print('****************')
#print('Performance for Classifier on TEST set :')
#mu.calculate_prediction_metrics(test_docs, docClassifier.predict)

<br/><br/>This material was presented as part of the DeCART Data Science for the Health Science Summer Program at the University of Utah in 2017.<br/>
Presenters : Dr. Wendy Chapman, Jianlin Shi and Kelly Peterson