# Measuring Predictive Models

Reading References:

- http://blog.revolutionanalytics.com/2016/03/com_class_eval_metrics_r.html
- http://scikit-learn.org/stable/modules/model_evaluation.html

## Performance Metrics

|**Term**                   |**Definition**
|--------------------------:|:-------------
|**Confusion Matrix:**      |A contingency table of observations by their predicted vs. actual class labels. In this notebook, the columns are the predicted labels and the rows are the actual labels.
|**Accuracy:**              |The fraction of instances that are correctly classified. `sum(diagonal) / observations`
|**Precision:**             |The fraction of correct predictions for a certain class. `diagonal / colsums`
|**Recall:**                |The fraction of instances of a class that were correctly predicted. `diagonal / rowsums`
|**F1:**                    |Harmonic mean of precision and recall. `2 * precision * recall / (precision + recall)`
|**One-vs-all Matrices:**   |A 2x2 confusion matrix for one class at a time, treating that class as positive and all other classes as negative. The sum of these matrices gives the counts needed for average accuracy and micro-averaged metrics.
|**Average Accuracy:**      |The fraction of correctly classified instances in the sum of the one-vs-all matrices.
|**Macro-averaged Metrics:**|The unweighted mean of a per-class metric, so every class counts equally.
|**Micro-averaged Metrics:**|A metric computed from the sum of the one-vs-all matrices, so every instance counts equally. Favors classes with a larger number of instances.
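
The formulas in the table can be sketched directly with NumPy. This is a minimal sketch, assuming a confusion matrix with rows as actual labels and columns as predicted labels; the counts are copied from the confusion matrix output later in this notebook, and the variable names are illustrative.

```python
import numpy as np

# Confusion matrix: rows = actual labels, columns = predicted labels.
# Counts copied from the notebook's confusion matrix output (assumption:
# that output follows the rows-are-actual convention stated in the table).
cm = np.array([[33, 3, 3],
               [3, 29, 2],
               [3, 3, 21]])

n = cm.sum()               # total number of observations
diagonal = np.diag(cm)     # correctly classified instances per class
rowsums = cm.sum(axis=1)   # number of actual instances per class
colsums = cm.sum(axis=0)   # number of predictions per class

accuracy = diagonal.sum() / n
precision = diagonal / colsums
recall = diagonal / rowsums
f1 = 2 * precision * recall / (precision + recall)

# Macro-averaging: unweighted mean over classes.
macro_precision = precision.mean()
macro_recall = recall.mean()
macro_f1 = f1.mean()

print("accuracy:", accuracy)
print("precision:", precision)
print("recall:", recall)
print("f1:", f1)
print("macro F1:", macro_f1)
```

Precision divides by column sums because each column collects everything predicted as that class, while recall divides by row sums because each row collects everything that actually belongs to it.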

## Baseline Models

|**Term**                   |**Definition**
|--------------------------:|:-------------
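|**Majority-class Metrics:**|The metrics obtained by always predicting the most frequent class.
|**Random-guess Metrics:**  |The expected metrics obtained by guessing labels at random.
|**Kappa Statistic:**       |Agreement between predicted and actual labels, corrected for the agreement expected by chance. `(accuracy - expected accuracy) / (1 - expected accuracy)`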
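
A minimal sketch of these baselines, assuming the usual definitions: a majority-class model always predicts the most frequent actual class, a random-guess model draws labels either uniformly or in proportion to the actual class frequencies, and Cohen's kappa compares observed accuracy with the accuracy expected by chance. The confusion matrix counts are copied from the notebook output below, and all variable names are illustrative.

```python
import numpy as np

# Confusion matrix: rows = actual labels, columns = predicted labels
# (counts copied from the notebook's confusion matrix output).
cm = np.array([[33, 3, 3],
               [3, 29, 2],
               [3, 3, 21]])

n = cm.sum()
actual_counts = cm.sum(axis=1)      # instances per actual class
predicted_counts = cm.sum(axis=0)   # instances per predicted class
accuracy = np.trace(cm) / n

# Majority-class baseline: always predict the most frequent actual class.
majority_accuracy = actual_counts.max() / n

# Random-guess baselines (assumed definitions):
# uniform guessing is right 1 / (number of classes) of the time;
# guessing in proportion to class frequency p is right sum(p^2) of the time.
p = actual_counts / n
random_uniform_accuracy = 1 / len(cm)
random_weighted_accuracy = np.sum(p ** 2)

# Cohen's kappa: observed accuracy corrected for chance agreement.
expected_accuracy = np.sum(actual_counts * predicted_counts) / n ** 2
kappa = (accuracy - expected_accuracy) / (1 - expected_accuracy)

print("majority-class accuracy:", majority_accuracy)
print("random-guess accuracy (uniform):", random_uniform_accuracy)
print("random-guess accuracy (weighted):", random_weighted_accuracy)
print("kappa:", kappa)
```

A model is only useful if its metrics clearly exceed these baselines; a kappa near 0 means the predictions agree with the actual labels no better than chance.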

In [24]:
import numpy as np
import pandas as pd

In [12]:
# Generate data
np.random.seed(0)
data_classes = np.array(['a', 'b', 'c'])
actual = np.random.choice(data_classes, size = 100, replace = True)

predicted = actual.copy()
mix_indices_actual = np.random.choice(range(100), size = 30, replace = False)
mix_indices_predicted = np.random.choice(range(100), size = 30, replace = False)
predicted[mix_indices_predicted] = actual[mix_indices_actual]

### Confusion Matrix

References for further implementation:

- https://stackoverflow.com/questions/2148543/how-to-write-a-confusion-matrix-in-python

In [66]:
# manual implementation
cm = np.zeros([3, 3]).astype(int)

for cm_index_actual, i in enumerate(data_classes):
    for cm_index_predict, j in enumerate(data_classes):
        cm[cm_index_actual, cm_index_predict] = np.sum((actual == i) & (predicted == j))

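# rows of cm are actual labels, columns are predicted labels
cm_index = ['actual ' + x for x in data_classes]
cm_columns = ['predict ' + x for x in data_classes]
pd.DataFrame(cm, index = cm_index, columns = cm_columns)

|            |**predict a**|**predict b**|**predict c**
|-----------:|------------:|------------:|------------:
|**actual a**|33           |3            |3
|**actual b**|3            |29           |2
|**actual c**|3            |3            |21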
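
The nested loop in the manual implementation can also be written without explicit Python loops. The following is a sketch of one possible vectorized variant, not the notebook's own method; it assumes the class array is sorted (as `['a', 'b', 'c']` is) and uses a small hand-made sample rather than the notebook's generated data.

```python
import numpy as np

# Small illustrative sample (hypothetical data, not the notebook's).
classes = np.array(['a', 'b', 'c'])
actual = np.array(['a', 'b', 'c', 'a', 'b', 'c', 'a'])
predicted = np.array(['a', 'b', 'a', 'a', 'c', 'c', 'b'])

# Map each label to its position in `classes`.
# np.searchsorted requires `classes` to be sorted.
actual_idx = np.searchsorted(classes, actual)
predicted_idx = np.searchsorted(classes, predicted)

# Count each (actual, predicted) pair; np.add.at accumulates correctly
# even when the same index pair occurs more than once.
cm = np.zeros((len(classes), len(classes)), dtype=int)
np.add.at(cm, (actual_idx, predicted_idx), 1)

print(cm)
```

`np.add.at` is used instead of `cm[actual_idx, predicted_idx] += 1` because plain fancy-index assignment would count repeated pairs only once.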


In [63]:
# sklearn implementation
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(actual, predicted, labels = data_classes)

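# confusion_matrix returns rows as actual labels and columns as predicted labels
pd.DataFrame(cm, index = ['actual ' + x for x in data_classes], columns = ['predict ' + x for x in data_classes])

|            |**predict a**|**predict b**|**predict c**
|-----------:|------------:|------------:|------------:
|**actual a**|33           |3            |3
|**actual b**|3            |29           |2
|**actual c**|3            |3            |21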
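
The one-vs-all matrices, average accuracy, and micro-averaged metrics from the Performance Metrics table can be derived from this confusion matrix. The sketch below assumes rows are actual labels and columns are predicted labels, and lays out each 2x2 matrix as `[[TP, FN], [FP, TN]]`; that layout and the variable names are choices made here for illustration.

```python
import numpy as np

# Confusion matrix: rows = actual labels, columns = predicted labels
# (counts copied from the output above).
cm = np.array([[33, 3, 3],
               [3, 29, 2],
               [3, 3, 21]])
n = cm.sum()

# Build one 2x2 matrix per class: [[TP, FN], [FP, TN]].
one_vs_all = []
for k in range(len(cm)):
    tp = cm[k, k]                # class k predicted as class k
    fn = cm[k, :].sum() - tp     # class k predicted as something else
    fp = cm[:, k].sum() - tp     # other classes predicted as class k
    tn = n - tp - fn - fp        # everything not involving class k
    one_vs_all.append(np.array([[tp, fn], [fp, tn]]))

# Element-wise sum of the per-class matrices.
summed = np.sum(one_vs_all, axis=0)

average_accuracy = np.trace(summed) / summed.sum()
micro_precision = summed[0, 0] / (summed[0, 0] + summed[1, 0])
micro_recall = summed[0, 0] / (summed[0, 0] + summed[0, 1])

print(summed)
print("average accuracy:", average_accuracy)
print("micro precision:", micro_precision)
print("micro recall:", micro_recall)
```

In a single-label multiclass problem every misclassification is one false positive and one false negative, so micro-averaged precision and recall both equal the overall accuracy.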
