# Accuracy is Not Enough
## The Story of the Confusion Matrix

So far, we've used accuracy as the only measure of our models.  Let's see why accuracy alone is not enough and what we can use alongside it.

We'll start with our three tree-based models and the heart attack dataset once more.  So let's generate some results.

In [None]:
import numpy as np
import pandas as pd
from sklearn import tree, ensemble
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import xgboost as xgb


clf_cart = tree.DecisionTreeClassifier()
clf_rf = ensemble.RandomForestClassifier()


Now let's prep the data.  Because we'll do it the same way for each, we only need to do this once.  I'll also remove the bits where we analyze the data, as we've seen it enough times already.

In [None]:
heart_attack_data = "../data/HeartAttackData.csv"
df = pd.read_csv(heart_attack_data, header=0)
y = df['output']
X = df.drop('output', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1740)

Now let's train each model in turn, starting with CART, then random forest, and finally XGBoost.

In [None]:
clf_cart = clf_cart.fit(X_train, y_train)
predicted_cart = clf_cart.predict(X_test)

clf_rf = clf_rf.fit(X_train, y_train)
predicted_rf = clf_rf.predict(X_test)

clf_xgb = xgb.XGBClassifier(max_depth=5, n_estimators=45, use_label_encoder=False, eval_metric='logloss')
clf_xgb = clf_xgb.fit(X_train, y_train)
predicted_xgb = clf_xgb.predict(X_test)

Let's take a quick look at the three accuracy scores:

In [None]:
print("CART = %f; RF = %f; XGB = %f" % (accuracy_score(y_test, predicted_cart), accuracy_score(y_test, predicted_rf), accuracy_score(y_test, predicted_xgb)))

Next up, let's look at the confusion matrix and classification report.

In [None]:
from sklearn.metrics import confusion_matrix, classification_report

confusion_matrix(y_test, predicted_cart)

The confusion matrix shows us four things.  Rows are actual outcomes and columns are predicted outcomes, so reading left to right, top to bottom:

1. Actual negative, predicted negative.  In other words, CART predicted no heart attack and the person had no heart attack.
2. Actual negative, predicted positive.  In other words, CART predicted a heart attack, but the person had no heart attack.
3. Actual positive, predicted negative.  In other words, CART predicted no heart attack, but the person had a heart attack.
4. Actual positive, predicted positive.  In other words, CART predicted a heart attack and the person had a heart attack.
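To make that ordering concrete, here is a minimal sketch that unpacks the four cells by name. The `y_true` and `y_pred` lists are toy labels made up for illustration, not the heart attack data.

```python
from sklearn.metrics import confusion_matrix

# Toy labels for illustration only (not the heart attack data):
# 0 = no heart attack, 1 = heart attack.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

# ravel() flattens the 2x2 matrix left to right, top to bottom,
# which matches the order of the numbered list above.
tn, fp, fn, tp = cm.ravel()
print(f"TN = {tn}, FP = {fp}, FN = {fn}, TP = {tp}")
```

Naming the cells this way avoids having to remember which index holds which outcome.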

Now let's look at the classification report.

In [None]:
print(classification_report(y_test, predicted_cart))

The classification report generates several metrics based on the confusion matrix.

- Precision says, "If we predict a particular outcome, how likely is it that the outcome actually occurred?"  To compute it, divide the correct predictions for a class by the total of that class's **column** (the vertical elements):  `32 / (32 + 7) = 0.82` for the no-heart attack scenario and `37 / (37 + 15) = 0.71` for the heart attack scenario.
- Recall says, "Given a particular outcome, how likely were we to have predicted it?"  To compute it, divide the correct predictions for a class by the total of that class's **row** (the horizontal elements):  `32 / (32 + 15) = 0.68` for the no-heart attack scenario and `37 / (37 + 7) = 0.84` for the heart attack scenario.
- F1 score (or F-score) combines precision and recall into a single number, their harmonic mean.  The formula is `2 * (precision * recall) / (precision + recall)`.
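The arithmetic above can be reproduced by hand. This is a minimal sketch that assumes the CART confusion matrix implied by the worked figures (32, 15, 7, and 37); a fresh run of the notebook may give different counts.

```python
import numpy as np

# Counts taken from the worked figures above; a fresh run may differ.
# Rows are actual classes, columns are predicted classes.
cm = np.array([[32, 15],
               [7, 37]])

correct = np.diag(cm)                  # correct predictions per class
precision = correct / cm.sum(axis=0)   # divide by column totals (predicted)
recall = correct / cm.sum(axis=1)      # divide by row totals (actual)
f1 = 2 * (precision * recall) / (precision + recall)

for label in (0, 1):
    print(f"class {label}: precision = {precision[label]:.2f}, "
          f"recall = {recall[label]:.2f}, f1 = {f1[label]:.2f}")
```

These should agree with the per-class rows that `classification_report` prints for the same matrix.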


In [None]:
confusion_matrix(y_test, predicted_rf)

In [None]:
print(classification_report(y_test, predicted_rf))

In [None]:
confusion_matrix(y_test, predicted_xgb)

In [None]:
print(classification_report(y_test, predicted_xgb))

Based on these results, random forest and XGBoost perform similarly.  Their accuracy differs only slightly, and the other measures trade off against each other, so either model could work about equally well.

## Where It Matters:  Imbalanced Classes

In this first example, the number of patients with heart attacks was similar to the number without.  Now let's look at a different scenario, in which one class has very few members.  This is the same heart attack data, except that only 24 entries have `output = 1`.  Let's see how the three algorithms perform.
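Before fitting anything, it can help to confirm the imbalance with a quick class count. The sketch below uses a hypothetical stand-in series, `y_demo`, with the 24 positives mentioned above and an assumed 100 negatives. The real negative count is not stated here, so in the notebook you would call the same methods on `y`.

```python
import pandas as pd

# Hypothetical stand-in for df['output']: 24 positives as in the text,
# plus an assumed 100 negatives (the real count is not stated here).
y_demo = pd.Series([1] * 24 + [0] * 100, name="output")

counts = y_demo.value_counts()                # rows per class
shares = y_demo.value_counts(normalize=True)  # fraction of rows per class
print(counts)
print(shares.round(3))
```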

In [None]:
heart_attack_data = "../data/HeartAttackDataImbalanced.csv"
df = pd.read_csv(heart_attack_data, header=0)
y = df['output']
X = df.drop('output', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1740)

In [None]:
clf_cart = clf_cart.fit(X_train, y_train)
predicted_cart = clf_cart.predict(X_test)

clf_rf = clf_rf.fit(X_train, y_train)
predicted_rf = clf_rf.predict(X_test)

clf_xgb = xgb.XGBClassifier(max_depth=5, n_estimators=45, use_label_encoder=False, eval_metric='logloss')
clf_xgb = clf_xgb.fit(X_train, y_train)
predicted_xgb = clf_xgb.predict(X_test)

In [None]:
print("CART = %f; RF = %f; XGB = %f" % (accuracy_score(y_test, predicted_cart), accuracy_score(y_test, predicted_rf), accuracy_score(y_test, predicted_xgb)))

Random forest and XGBoost look equally good, so this is just like the prior scenario, right?

Well, before we move too much further, let's check out the confusion matrix and classification report for each, starting with CART.

In [None]:
confusion_matrix(y_test, predicted_cart)

CART got 34 of 41 non-heart attacks right, but only 3 of 8 heart attacks right.

In [None]:
print(classification_report(y_test, predicted_cart))

In [None]:
confusion_matrix(y_test, predicted_rf)

Random forest got an amazing 40 of 41 non-heart attack cases right, but only 2 of 8 heart attacks.

In [None]:
print(classification_report(y_test, predicted_rf))

A little bit more terminology here. **Specificity** is recall on the negative case:  that is, how high is your recall for non-heart attack scenarios?  **Sensitivity** is recall on the positive case: that is, how high is your recall for heart attack scenarios?

The random forest model is an example of a **specific** test.  In other words, it does a really good job of rejecting the patients who did not have a heart attack:  it correctly rejected 40 of 41 cases.

It is, however, not a particularly **sensitive** test.  In other words, the model does **not** do a good job of detecting patients who did have a heart attack:  it caught only 2 of 8.
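Both terms map onto scikit-learn's `recall_score` by changing `pos_label`. The following is a minimal sketch that rebuilds label arrays from the random forest counts quoted above (40 of 41 negatives and 2 of 8 positives correct). In the notebook you would pass `y_test` and `predicted_rf` instead.

```python
import numpy as np
from sklearn.metrics import recall_score

# Labels rebuilt from the random forest counts quoted above:
# 41 actual negatives (40 predicted correctly) and
# 8 actual positives (2 predicted correctly).
y_true = np.array([0] * 41 + [1] * 8)
y_pred = np.array([0] * 40 + [1] * 1 + [0] * 6 + [1] * 2)

sensitivity = recall_score(y_true, y_pred, pos_label=1)  # recall on positives
specificity = recall_score(y_true, y_pred, pos_label=0)  # recall on negatives
print(f"sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}")
```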

In [None]:
confusion_matrix(y_test, predicted_xgb)

XGBoost was not as good at predicting non-heart attacks compared to random forest, but it was much better at predicting heart attacks, getting 5 of 8 right rather than 2 of 8.

In [None]:
print(classification_report(y_test, predicted_xgb))

In other words, if you needed a single test to separate heart attack cases from non-heart attack cases, even though random forest and XGBoost are equally **accurate**, XGBoost shows considerably better sensitivity (recall on the positive case) without giving up too much specificity (recall on the negative case).

XGBoost's better sensitivity does come at the cost of a few more false positives (4 versus 1 for random forest), but in most medical scenarios, false negatives are considerably worse than false positives:  if you treat a person who didn't need it, the risk is lower than if you fail to treat a person who does need it.
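To put the comparison in one place, here is a minimal sketch that computes accuracy and sensitivity from the confusion matrices implied by the counts quoted in the text. The "always negative" entry is a hypothetical baseline added for illustration: a model that never predicts a heart attack on this 41-to-8 test split.

```python
import numpy as np

# Confusion matrices rebuilt from the counts quoted in the text
# (rows = actual, columns = predicted). "always negative" is a
# hypothetical baseline that never predicts a heart attack.
matrices = {
    "random forest": np.array([[40, 1], [6, 2]]),
    "XGBoost": np.array([[37, 4], [3, 5]]),
    "always negative": np.array([[41, 0], [8, 0]]),
}

results = {}
for name, cm in matrices.items():
    tn, fp, fn, tp = cm.ravel()
    accuracy = (tn + tp) / cm.sum()
    sensitivity = tp / (tp + fn)
    results[name] = (accuracy, sensitivity)
    print(f"{name}: accuracy = {accuracy:.3f}, sensitivity = {sensitivity:.3f}")
```

Under these counts, the two real models tie on accuracy while differing sharply on sensitivity. The do-nothing baseline lands within a couple of points of both on accuracy, despite catching no heart attacks at all.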