## Evaluating a classification model

* What is the purpose of model evaluation and what are some common evaluation procedures?
* What is classification accuracy used for, and what are its limitations?
* How does a confusion matrix describe the performance of a classifier?
* What metrics can be computed from a confusion matrix?
* How can you adjust classifier performance by changing the classification threshold?
* What is the purpose of an ROC curve?
* How does Area Under the Curve (AUC) differ from classification accuracy?

## Review of model evaluation

* We need a way to choose between models.
* We use a **model evaluation procedure** to estimate how well a model will generalize to out-of-sample data.
* This requires a **model evaluation metric** to quantify the model performance.

## Model evaluation metrics

* **Regression problems**. Mean Absolute Error, Mean Squared Error, Root Mean Squared Error.
* **Classification problems**. Classification Accuracy.

However, there are many other metrics for both problem types.
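To make the metrics listed above concrete, here is a minimal sketch that computes each one by hand. All four lists (`y_true_reg`, `y_pred_reg`, `y_true_cls`, `y_pred_cls`) are made-up illustrative values, not taken from the diabetes dataset.

```python
import math

# Hypothetical regression targets and predictions (illustrative values only)
y_true_reg = [100, 50, 30, 20]
y_pred_reg = [90, 50, 50, 30]

errors = [t - p for t, p in zip(y_true_reg, y_pred_reg)]

mae = sum(abs(e) for e in errors) / len(errors)   # Mean Absolute Error
mse = sum(e ** 2 for e in errors) / len(errors)   # Mean Squared Error
rmse = math.sqrt(mse)                             # Root Mean Squared Error

# Hypothetical class labels and predictions (illustrative values only)
y_true_cls = [0, 1, 1, 0, 1]
y_pred_cls = [0, 1, 0, 0, 1]

# Classification accuracy: fraction of predictions that match the true label
accuracy = sum(t == p for t, p in zip(y_true_cls, y_pred_cls)) / len(y_true_cls)

print(mae, mse, round(rmse, 3), accuracy)
```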

In [2]:
# read the data into a pandas DataFrame
import pandas as pd
path = 'https://raw.githubusercontent.com/justmarkham/scikit-learn-videos/master/data/pima-indians-diabetes.data'
col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
pima = pd.read_csv(path, header=None, names=col_names)

**Question**. Can we predict the diabetes status of a patient given their health measurements?

In [4]:
pima.head()

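| | pregnant | glucose | bp | skin | insulin | bmi | pedigree | age | label |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |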


In [6]:
feature_cols = ["pregnant", "insulin","bmi","age"]
X = pima[feature_cols]
y = pima["label"]

In [7]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [8]:
from sklearn.linear_model import LogisticRegression

In [10]:
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

LogisticRegression()

In [11]:
y_pred_class = logreg.predict(X_test)

**Classification accuracy**. The proportion of correct predictions.

In [12]:
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))

0.6770833333333334


However, every time we train a classification model, we should compare its accuracy with the **null accuracy**: the accuracy that could be achieved by always predicting the most frequent class in the testing set.

In [13]:
# Calculate the proportion of ones
y_test.mean()

0.3229166666666667

In [15]:
# Calculate the proportion of zeros
1-y_test.mean()

0.6770833333333333

In [16]:
max(y_test.mean(), 1-y_test.mean())

0.6770833333333333

In [17]:
y_test.value_counts()/len(y_test)

0    0.677083
1    0.322917
Name: label, dtype: float64

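This shows a weakness of classification accuracy. The model's accuracy (0.677) is no better than the null accuracy, yet the accuracy score alone does not reveal that, because it tells us nothing about the underlying distribution of classes in the testing set.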
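The null-accuracy cells above rely on the labels being 0 and 1. A minimal sketch of a more general helper follows; `null_accuracy` is a hypothetical name, and the list of 130 zeros and 62 ones is an assumption chosen to match the class proportions printed above.

```python
from collections import Counter

def null_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)

# Assumed test set: 130 zeros and 62 ones (matches the proportions above)
binary_labels = [0] * 130 + [1] * 62
print(null_accuracy(binary_labels))

# The same helper works for more than two classes (made-up labels)
print(null_accuracy(["a", "b", "b", "c"]))
```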

In [21]:
# Compare the first 25 true and predicted labels:
# the model predicts the zeros well but misses most of the ones.
print(y_test.values[0:25])
print(y_pred_class[0:25])

[1 0 0 1 0 0 1 1 0 0 1 1 0 0 0 0 1 0 0 0 1 1 0 0 0]
[0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0]


## Confusion matrix

A confusion matrix is a contingency table that describes the performance of a classification model by counting each combination of actual and predicted class.

In [25]:
cf = metrics.confusion_matrix(y_test, y_pred_class)
print(cf)

[[114  16]
 [ 46  16]]


**Important**. The metric functions in `sklearn.metrics` used here expect the true values as their first argument and the predicted values as their second.

**Basic terminology**

* Upper left. True negatives (TN).
* Lower right. True positives (TP).
* Upper right. False positives (FP) (Type I error).
* Lower left. False negatives (FN) (Type II error).
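As a rough illustration of the four cells, the sketch below counts them by hand for the first 25 true and predicted labels printed earlier. `confusion_counts` is a hypothetical helper that assumes binary 0/1 labels. It is not how sklearn computes the matrix. It also shows why argument order matters: swapping the two arguments exchanges the false positives and false negatives.

```python
def confusion_counts(y_true, y_pred):
    """Return (TN, FP, FN, TP) for binary 0/1 labels."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 0 and predicted == 0:
            tn += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        else:  # actual == 1 and predicted == 0
            fn += 1
    return tn, fp, fn, tp

# First 25 true and predicted labels, copied from the output printed earlier
y_true_25 = [1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0]
y_pred_25 = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

tn, fp, fn, tp = confusion_counts(y_true_25, y_pred_25)
# Same layout as the matrix above: rows are actual classes, columns are predicted
print([[tn, fp], [fn, tp]])  # [[15, 1], [7, 2]]

# Passing the arguments in the wrong order swaps FP and FN
tn_s, fp_s, fn_s, tp_s = confusion_counts(y_pred_25, y_true_25)
print([[tn_s, fp_s], [fn_s, tp_s]])  # [[15, 7], [1, 2]]
```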

## Metrics computed from a confusion matrix

**Classification accuracy**. Overall, how often is the classifier correct?

In [29]:
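# Rows of cf are actual classes, columns are predicted classes
TP = cf[1, 1]  # actual 1, predicted 1
TN = cf[0, 0]  # actual 0, predicted 0
FP = cf[0, 1]  # actual 0, predicted 1
FN = cf[1, 0]  # actual 1, predicted 0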

In [30]:
accuracy = (TP+TN)/float(TP+TN+FP+FN)
print(accuracy)

0.6770833333333334


**Classification error**. Overall, how often is the classifier incorrect? Also known as the **misclassification rate**.

In [32]:
class_error = (FP+FN)/float(TP+TN+FP+FN)
print(class_error)

0.3229166666666667


**Sensitivity**. This answers the question: When the actual value is positive, how often is the prediction correct?

Also known as the **true positive rate** or **recall**.

In [33]:
sens = TP/float(TP+FN)
print(sens)

0.25806451612903225


**Specificity**. This answers the question: When the actual value is negative, how often is the prediction correct?

This measures how specific (or selective) the classifier is in predicting positive instances.

In [34]:
spec = TN/float(TN+FP)
print(spec)

0.8769230769230769


**False positive rate**. When the actual value is negative, how often is the prediction incorrect?

This is equal to 1 - specificity.

In [36]:
fpr = FP/float(TN+FP)
print(fpr)

0.12307692307692308


**Precision**. When a positive value is predicted, how often is the prediction correct?

This measures how precise the classifier is when predicting positive instances.

In [37]:
prec = TP/float(TP+FP)
print(prec)

0.5


Other scores include the Matthews correlation coefficient and the F1 score.
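As a sketch of how these two scores relate to the confusion matrix, the block below computes both from the four counts printed earlier, using their standard formulas. The counts are hard-coded here so the example runs on its own.

```python
import math

# Counts taken from the confusion matrix printed earlier
TP, TN, FP, FN = 16, 114, 16, 46

precision = TP / (TP + FP)
recall = TP / (TP + FN)

# F1 score: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

# Matthews correlation coefficient: ranges from -1 to 1, where 0 means
# the predictions are no better than chance
numerator = TP * TN - FP * FN
denominator = math.sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
mcc = numerator / denominator

print(round(f1, 4))   # about 0.34
print(round(mcc, 4))  # about 0.17
```

In practice, `metrics.f1_score` and `metrics.matthews_corrcoef` compute these directly from the true and predicted labels.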

Which metric to optimize depends largely on the business objective.
