# Machine Learning Comparison
In previous notebooks, we used a rule-based Information Extraction system to classify documents for family history of cancer. Our two classes are:
- **Positive document** - has evidence that a family member has had cancer. 
- **Negative document** - does not have evidence that a family member has had cancer.

With rule-based NLP systems, you can easily model a concept and write rules to capture it. However, rule-based systems also have certain disadvantages.

### Discussion
* What are some disadvantages to rule-based NLP systems?

In this notebook, we'll use **Machine Learning** to classify the documents as either **positive** or **negative** and then compare the two methods.

In [None]:
# import packages that we will need
from nlp_pneumonia_utils import read_doc_annotations
from DocumentClassifier import DocumentClassifier
from nlp_pneumonia_utils import list_errors
from visual import Vis
from visual import snippets_markup
from visual import view_pycontext_output
from visual import display_doc_text
# packages for interaction
from IPython.display import display, HTML
import ipywidgets

import sklearn

# Helper functions
from ml_utils import *

# Representing Text Data
We need to convert the raw text into a format that we can compute with. To do this, we'll convert each document into a numerical vector using a **Bag of Words** model.

The idea behind a Bag of Words (BOW) model is simple: for each document, we'll jumble together all of its words, ignoring the order in which they occurred, and represent the whole collection as a matrix. Each row represents a document and each column represents a word in our vocabulary. If a word is present in a document, that column is 1 for that row. If it isn't, that column is 0.

To get an intuition, here's a simple example: Suppose we have these 3 very short lower-cased documents:
1. "the dog ate."
2. "the cat sat."
3. "the cat sat on the dog."

In this example, we have a total of 7 words in our vocabulary:

V = {the, dog, ate, cat, sat, on, "."}

To represent this as a vector, here's what our matrix will look like:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import word_tokenize
import pandas as pd

In [None]:
example_docs = ["The dog ate.", "The cat sat.", "The cat sat on the dog."]
X_example, vectorizer_example = vectorize_documents(example_docs)
display_word_matrix(X_example, vectorizer_example)
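The same idea can be sketched in plain Python, without the `ml_utils` helpers. This is only a minimal illustration and not the implementation behind `vectorize_documents`: the names `tokenize` and `bag_of_words` are hypothetical, and the regex tokenizer is an assumption chosen so that "." counts as a token, matching the 7-word vocabulary listed earlier.

```python
import re

def tokenize(text):
    # Lower-case, then split into word tokens and single punctuation marks
    return re.findall(r"\w+|[^\w\s]", text.lower())

def bag_of_words(docs):
    """Return (vocabulary, matrix) with one binary row per document."""
    tokenized = [tokenize(doc) for doc in docs]
    # The vocabulary is every distinct token, sorted for a stable column order
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    # 1 if the word occurs anywhere in the document, 0 otherwise
    matrix = [[1 if word in toks else 0 for word in vocabulary]
              for toks in tokenized]
    return vocabulary, matrix

docs = ["The dog ate.", "The cat sat.", "The cat sat on the dog."]
vocabulary, matrix = bag_of_words(docs)
print(vocabulary)
for row in matrix:
    print(row)
```

Note that "the" appears twice in the third document but its column is still 1, because this representation records presence rather than counts.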

# Transform our dataset

In [None]:
texts_train, labels_train, texts_test, labels_test = read_in_data()

In [None]:
# Transform Data
X_train, vectorizer = vectorize_documents(texts_train)
X_test, _ = vectorize_documents(texts_test, vectorizer=vectorizer)

Let's look at what features we have:

In [None]:
display_word_matrix(X_train, vectorizer)

# Naive approach
When you Google "Best machine learning algorithms", the first result Google suggests is `LogisticRegression`. So, let's try that!

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

clf = LogisticRegression()
clf.fit(X_train, labels_train)
pred = clf.predict(X_test)

print(classification_report(labels_test, pred, labels=["Positive Document"])) # Just look at scores for positive docs
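The report printed by `classification_report` lists precision, recall, and F1 for the positive class. As a reminder of how these relate, here is a small sketch that computes them from raw counts. The function name `precision_recall_f1` is hypothetical and the counts are made up for illustration; they are not results from this notebook.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from raw counts for one class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts, NOT results from this notebook
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print("precision={:.3f} recall={:.3f} f1={:.3f}".format(precision, recall, f1))
```

Because F1 is a harmonic mean, it stays low whenever either precision or recall is low, which is why it is a common single-number summary for the positive class.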

# Better approach
Our first attempt got a pretty low score for predicting positive documents. This is much lower than the rule-based system, which got an F1 of 0.821.

But machine learning is rarely an "out-of-the-box" kind of task. A first try rarely does well, and there are plenty of tricks for improving performance. We'll try a few of them now and see how much they help.

1. **Data Clean-Up** - Looking at our features above, we can see a lot of useless information like punctuation, numbers, and very specific combinations of words that probably don't matter at all for our task. To address this, we'll convert our documents to lower-case and use regular expressions to clean up the text a bit.
2. **Features** - We'll first set a *document frequency threshold* of 0.2, which will restrict our vocabulary to terms that occur in at least 20% of the documents. We'll also expand our features to include bigrams and trigrams in addition to unigrams (single words).
3. **Data** - A disadvantage of machine learning is that it typically requires a larger amount of data. To maximize the amount of data that we can use, we're going to pool all of our data and use *5-fold cross-validation*. Every document is used for training in some folds and for testing in exactly one fold, so we can use all of our data for both purposes while, importantly, never training and testing on the same document within a fold.
4. **Different Models** - We just picked the first classifier we found on Google, but it's important to try lots of different algorithms and see if one works significantly better than the others.
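To make step 2 concrete, the sketch below builds a vocabulary of unigrams, bigrams, and trigrams and keeps only the terms that meet a document frequency threshold. It is a plain-Python illustration under stated assumptions: `ngrams` and `build_vocabulary` are hypothetical names, whitespace tokenization is a simplification, and the toy sentences are invented rather than taken from our dataset.

```python
from collections import Counter

def ngrams(tokens, n_min=1, n_max=3):
    """Return all n-grams (as space-joined strings) with n_min <= n <= n_max."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

def build_vocabulary(docs, min_df=0.2):
    """Keep n-grams that occur in at least a fraction `min_df` of the documents."""
    doc_freq = Counter()
    for doc in docs:
        # set() so that a term is counted once per document
        doc_freq.update(set(ngrams(doc.split())))
    n_docs = len(docs)
    return sorted(g for g, df in doc_freq.items() if df / n_docs >= min_df)

# Invented toy documents, for illustration only
toy_docs = [
    "mother had breast cancer",
    "no family history of cancer",
    "sister had breast cancer",
    "father is healthy",
    "no known family history",
]
vocab = build_vocabulary(toy_docs, min_df=0.4)
print(vocab)
```

With a threshold of 0.4, a term must occur in at least 2 of the 5 toy documents, so one-off terms such as "mother" are dropped while repeated phrases such as "breast cancer" survive.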

In [1]:
# 1. Data clean-up
import re

def preprocess(text):
    text = text.lower()
    # Remove punctuation, special symbols
    ## Your code here
    
    # Change any combination digits to be a special NUM symbol
    ## Your code here
    
    # Remove excess whitespace for human readability
    ## Your code here
    
    # Anything else to try?
    
    return text
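One possible way to fill in the exercise above is sketched here. It is only a suggestion, not the intended solution: the name `preprocess_sketch` is hypothetical (chosen so it does not overwrite your own `preprocess`), and the specific regular expressions are assumptions you may want to adjust for clinical text.

```python
import re

def preprocess_sketch(text):
    text = text.lower()
    # Replace any run of digits with the placeholder token NUM
    text = re.sub(r"\d+", " NUM ", text)
    # Replace punctuation and special symbols with a space
    # (upper-case letters are kept so the NUM placeholder survives)
    text = re.sub(r"[^a-zA-Z\s]", " ", text)
    # Collapse runs of whitespace and trim the ends
    text = re.sub(r"\s+", " ", text).strip()
    return text

example = "Mother dx'd w/ Breast CA at age 45!!  "
cleaned = preprocess_sketch(example)
print(cleaned)
```

One consequence of this ordering is that a date or a decimal number becomes several `NUM` tokens in a row, since the separators are removed after the digits are replaced.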

In [None]:
# 3. Data
# Combine the training and test sets into one dataset for cross-validation.
texts = texts_train + texts_test
y = labels_train + labels_test
print("Total number of documents: {}".format(len(texts)))
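To see what 5-fold cross-validation does with this combined dataset, here is a minimal plain-Python sketch of how sample indices can be partitioned. The function `k_fold_indices` is a hypothetical name used for illustration. `sklearn` provides its own splitters, which for classifiers typically also stratify by label, so this sketch only shows the basic partitioning idea.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) pairs; each sample is in exactly one test fold."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(12, k=5))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Within each fold the training and test indices never overlap, and across the five folds every sample is tested exactly once.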

In [None]:
print("****Before clean-up:****")
print(texts[0][:250])
print()
texts = [preprocess(text) for text in texts]
print("****After clean-up:****")
print(texts[0][:250])

In [None]:
# 2. Features: vectorize the cleaned-up texts with a minimum document frequency (`min_df`) and n-grams up to length 3
X, vectorizer = vectorize_documents(texts, ngrams=(1,3), min_df=0.2)
display_word_matrix(X, vectorizer)

In [None]:
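# Here are the classifiers we'll use (imported explicitly in case `ml_utils` does not re-export them)
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
clfs = [LogisticRegression(), RandomForestClassifier(random_state=0),
        DecisionTreeClassifier(random_state=0), SVC(), MultinomialNB()]
clf_names, scores = evaluate_cross_val_clfs(X, y, clfs)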

Some of our steps clearly worked: the `DecisionTreeClassifier` got an F1 above 0.9, much higher than both our baseline `LogisticRegression` model and the rule-based system. Let's look at a more detailed analysis:

In [None]:
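from sklearn.model_selection import cross_val_predict

# Fix random_state so the result is reproducible, as in the comparison above
dtree = DecisionTreeClassifier(random_state=0)
pred = cross_val_predict(dtree, X, y, cv=5)
print(classification_report(y, pred, labels=['Positive Document'])) # Just look at positive labels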

# What To Do Next:
We tried a few things to improve our machine learning scores. Here are a few more steps we could take:
- **Additional Data Clean-Up** - Remove stopwords, stem words, etc.
- **Hyperparameter Tuning** - Every machine learning model has hyperparameters that you can adjust. Pick a model and try training it with different hyperparameter combinations until you find the best score.
- **CV** - Try different cross-validation partitions (for example, a different number of folds).
- **Feature Selection** - Try feature selection methods to reduce the number of features in our model.
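As an illustration of the hyperparameter tuning step, the sketch below runs a small grid search over two decision tree settings using `sklearn`'s `GridSearchCV`. The data is an invented toy matrix, not our document features, and the grid values are arbitrary assumptions chosen only to show the mechanics.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Invented toy data: column 0 perfectly predicts the label
X_toy = [[1, 0], [1, 1], [0, 0], [0, 1]] * 5
y_toy = ["Positive Document", "Positive Document",
         "Negative Document", "Negative Document"] * 5

# Arbitrary example grid; real grids depend on the model and the data
param_grid = {"max_depth": [1, 2, 3], "min_samples_leaf": [1, 2]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,                # 5-fold cross-validation for each combination
    scoring="f1_macro",  # works with string labels
)
search.fit(X_toy, y_toy)
print(search.best_params_, search.best_score_)
```

On the real data you would pass `X` and `y` instead of the toy arrays. Keep in mind that a score chosen as the best of many combinations is an optimistic estimate, so a held-out set or nested cross-validation gives a fairer final number.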

# Interpreting Results
One advantage of rule-based systems is that the decision-making process *makes sense*. A domain expert can look at a finding, look at a rule, and tell you whether or not it's correct. 

This is much more challenging with machine learning, which is often viewed as a *black box* that doesn't necessarily make sense to a human. It's important to take a look at results and confirm that they make sense and to identify any potential problems.

Our highest-performing algorithm is a Decision Tree, which is one of the easier algorithms to interpret. `sklearn` has utilities that allow you to visualize the classification process.

### Discussion 
- Look at the decision tree below and trace through the classification process.
- Do these rules make sense? If not, why is the classifier still performing so well? 
- What are the potential problems of this?

In [None]:
# Retrain DT classifier using all of the data
dtree = DecisionTreeClassifier(random_state=0)
dtree.fit(X, y)

In [None]:
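# List of vocabulary terms
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() when available)
get_names = getattr(vectorizer, "get_feature_names_out", None) or vectorizer.get_feature_names
feature_names = list(get_names())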

In [None]:
visualize_tree(dtree, feature_names)
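`visualize_tree` is a helper from `ml_utils`. If you prefer a plain-text view, `sklearn.tree.export_text` prints a fitted tree as nested if/else rules. The sketch below shows it on an invented toy dataset with a hypothetical two-term vocabulary, so the rules it prints are illustrative only and say nothing about our real documents.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical vocabulary and invented binary features, for illustration only
toy_feature_names = ["breast cancer", "denies"]
X_toy = [[1, 0], [1, 1], [0, 0], [0, 1], [1, 0], [0, 0]]
y_toy = ["Positive Document", "Negative Document", "Negative Document",
         "Negative Document", "Positive Document", "Negative Document"]

toy_tree = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# Each indented line is one split or one leaf of the tree
rules = export_text(toy_tree, feature_names=toy_feature_names)
print(rules)
```

For the real model you could call `export_text(dtree, feature_names=list(feature_names))` and read the rules alongside the documents in the viewer.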

You can use the document viewer below to look at some examples of positive and negative documents and see if the decision tree's rules apply:

In [None]:
pos_doc_type = 'FAM_BREAST_CA_DOC'
annotated_doc_map = read_doc_annotations(archive_file='data/bc_train.zip', pos_type=pos_doc_type)
pos_docs = {k: v for k, v in annotated_doc_map.items() if v.annotations[0].type == pos_doc_type}
neg_docs = {k: v for k, v in annotated_doc_map.items() if v.annotations[0].type != pos_doc_type}

In [None]:
display_doc_text(pos_docs)

In [None]:
display_doc_text(neg_docs)

# Conclusion