In [18]:
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

import pandas as pd

From *Machine Learning with R*:
"The goal of evaluating a classification model is to have a better understanding of how its performance will extrapolate to future cases. Since it is usually not feasible to test a still-unproven model in a live environment, we typically simulate future conditions by asking the model to classify a dataset made of cases that resemble what it will be asked to do in the future. By observing the learner's responses to this examination, we can learn about its strengths and weaknesses."

We accomplish this by comparing the model's predicted class values to the actual class values.

### Accuracy

The simplest measure of a classifier's performance is overall accuracy. To compute it, we divide the number of predictions the classifier got right by the total number of predictions made.
<br/><br/>

$$\text{Accuracy} = \frac{\text{# Correct Predictions}}{\text{Total # of Predictions}}$$

<br/><br/>
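As a minimal sketch of the formula above, the snippet below counts matching labels by hand. The six labels are made up for illustration and are not taken from the SMS dataset.

```python
# Toy labels made up for illustration; they are not from the SMS dataset.
toy_actual_labels = ["ham", "ham", "spam", "ham", "spam", "ham"]
toy_predicted_labels = ["ham", "ham", "ham", "ham", "spam", "spam"]

# Count the positions where the prediction matches the actual label.
n_correct = sum(a == p for a, p in zip(toy_actual_labels, toy_predicted_labels))
toy_accuracy = n_correct / len(toy_actual_labels)

print(f"{n_correct} of {len(toy_actual_labels)} correct -> accuracy = {toy_accuracy:.2f}")
# 4 of 6 correct -> accuracy = 0.67
```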

Imagine that we are trying to build a model to filter out spam email. The following data contains the text of email messages along with their actual type (i.e., label). "Ham" messages constitute real email messages whereas "spam" messages are, well, spam.  


In [181]:
#Data from https://github.com/PacktPublishing/Machine-Learning-with-R-Third-Edition
spam_ham = pd.read_csv('sms_spam.csv')
spam_ham = spam_ham[['type']]
spam_ham.head()

Unnamed: 0,type
0,ham
1,ham
2,ham
3,spam
4,spam


If we have a quick look at the type column, we can see that most messages are ham.

In [114]:
counts_df = spam_ham[['type']].groupby(['type']).size().reset_index(name='outcome_counts')

counts_df['percentage'] = counts_df['outcome_counts'] / counts_df['outcome_counts'].sum()

counts_df

Unnamed: 0,type,outcome_counts,percentage
0,ham,4812,0.865623
1,spam,747,0.134377


In [172]:
t= np.zeros(spam_ham.size)
t.size

5559

Using this data, we can build a very simple classifier, which just predicts the majority class. In other words, the classifier will always predict "ham."  

In [182]:
dummy_clf = DummyClassifier(strategy="most_frequent")
#dummy_clf = DummyClassifier(strategy="stratified")
# Note: by definition, this model ignores the features, so X is only a
# placeholder of zeros with one row per message.
X_placeholder = np.zeros((len(spam_ham), 1))
dummy_clf.fit(X_placeholder, spam_ham['type'])
dummy_predictions = dummy_clf.predict(X_placeholder)

accuracy = round(accuracy_score(spam_ham[['type']], dummy_predictions),2)
print(f'Accuracy of majority class model is: {accuracy}')

Accuracy of majority class model is: 0.87


Judging by overall accuracy alone, we might conclude that the dummy classifier is doing a decent job. But of course, this is not a good model since it never predicts "spam," which is what we are actually trying to get the model to do correctly.

## Confusion Matrix

A confusion matrix summarizes the predictions of a classification model. It allows us to break down our results in terms of true positives, true negatives, false positives, and false negatives. This gives us a better sense of the kinds of errors our model is making.
<br/><br/>

|            | Predicted true | Predicted false |
|------------|----------------|-----------------|
|Actual true | True Positive  | False Negative  |
|Actual false| False Positive | True Negative   |

<br/><br/>

Accordingly, we can reframe our formula for accuracy as

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
<br/><br/>
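To make the reframed formula concrete, here is a small worked example for the always-"ham" classifier. The four counts are not read from a fitted model; they are derived from the class counts shown earlier (4812 ham, 747 spam) under the assumption that "spam" is the positive class.

```python
# Counts for a classifier that always predicts "ham" ("spam" is positive).
# Derived from the class counts shown earlier: 4812 ham and 747 spam.
tp_majority, fp_majority = 0, 0        # the model never predicts "spam"
fn_majority, tn_majority = 747, 4812   # every spam is missed, every ham is right

accuracy_from_counts = (tp_majority + tn_majority) / (
    tp_majority + tn_majority + fp_majority + fn_majority
)
print(round(accuracy_from_counts, 4))  # 0.8656
```

The result matches the 0.87 reported by `accuracy_score` after rounding.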

We can create a confusion matrix from our majority class model:

In [183]:
cm = confusion_matrix(spam_ham[['type']], dummy_predictions)

# confusion_matrix sorts the labels alphabetically (ham, spam);
# flip both axes so that spam, the positive case, comes first
cm = np.flip(cm)


cm_df = pd.DataFrame(cm, 
               columns=['predicted_spam', 'predicted_ham'], 
               index = ['actual_spam', 'actual_ham'])



cm_df

Unnamed: 0,predicted_spam,predicted_ham
actual_spam,0,747
actual_ham,0,4812


In this case, we'll consider "spam" the positive case since that is what we are trying to detect. So, cases where the model predicted "ham" but the message was actually "spam" are false negatives. Cases where our model predicted "spam" but the message was actually "ham" are false positives. 
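As an alternative to flipping the array, `confusion_matrix` accepts a `labels` argument that fixes the row and column order directly. The sketch below uses five made-up labels (not the SMS data) and checks that both approaches give the same layout.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels made up for illustration.
toy_actual = ["spam", "ham", "ham", "spam", "ham"]
toy_predicted = ["spam", "ham", "spam", "ham", "ham"]

# labels=... sets the order of rows (actual) and columns (predicted),
# so listing "spam" first makes it the positive class in the top-left cell.
toy_cm = confusion_matrix(toy_actual, toy_predicted, labels=["spam", "ham"])

# Row 0 is actual spam (TP, FN); row 1 is actual ham (FP, TN).
(toy_tp, toy_fn), (toy_fp, toy_tn) = toy_cm
print(toy_cm)

# For two classes, this is the same as flipping the default (alphabetical) matrix.
same_as_flip = np.array_equal(toy_cm, np.flip(confusion_matrix(toy_actual, toy_predicted)))
print(same_as_flip)  # True
```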

Even though our model performed well in terms of overall accuracy, we can see from the confusion matrix that it performed abysmally in terms of false negatives. In machine learning, overall accuracy provides a poor measure of a model's performance when we have a "class imbalance," meaning one label occurs much more frequently than the other(s). 



Often with machine learning, we want a model that is neither too conservative nor too aggressive in predicting the positive class. Two metrics that help us assess this are precision and recall.

### Precision 

$$\text{Precision} = \frac{TP}{TP + FP}$$

### Recall 

$$\text{Recall} = \frac{TP}{TP + FN}$$


<br/><br/>
In plain language, precision captures the proportion of positive predictions that are actually positive. That is, when the model predicts the positive label, how often is it correct? A precise model will only predict the positive class when the example is very likely to be positive.

Recall is a measure of how complete the positive predictions are. A model that has high recall will capture a large proportion of the actual positive examples. 
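The two definitions can be tried out on a tiny example. This is a sketch with ten made-up labels, using scikit-learn's `precision_score` and `recall_score` with `pos_label="spam"` to mark the positive class.

```python
from sklearn.metrics import precision_score, recall_score

# Toy labels made up for illustration: 4 actual spam, then 6 actual ham.
toy_y_true = ["spam"] * 4 + ["ham"] * 6
# The model flags only 2 messages as spam: one real spam and one ham.
toy_y_pred = ["spam", "ham", "ham", "ham"] + ["spam"] + ["ham"] * 5

# Precision: of the 2 "spam" predictions, 1 is correct.
toy_precision = precision_score(toy_y_true, toy_y_pred, pos_label="spam")
# Recall: of the 4 actual spam messages, 1 is caught.
toy_recall = recall_score(toy_y_true, toy_y_pred, pos_label="spam")

print(f"Precision: {toy_precision}")  # 0.5
print(f"Recall:    {toy_recall}")     # 0.25
```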

We will calculate precision and recall for our dummy classifier, but first we need to make a minor modification to our predictions. Since our dummy classifier never predicts the "spam" label, TP + FP is 0, and computing precision as is would mean dividing by 0. To avoid this, we'll add a few cases of correct and incorrect "spam" predictions to our dataset.

In [184]:
new_data = pd.DataFrame(['spam', 'spam', 'spam', 'spam', 
                         'spam', 'spam', 'ham', 'ham'], columns=['type'])


new_predictions = np.array(['spam', 'spam', 'spam', 'spam', 
                                'spam', 'spam', 'spam', 'spam'])

new_spam_ham = pd.concat([spam_ham, new_data], ignore_index=True)
new_dummy_predictions = np.append(dummy_predictions, new_predictions)

new_cm = confusion_matrix(new_spam_ham[['type']], new_dummy_predictions)

# confusion_matrix sorts the labels alphabetically (ham, spam);
# flip both axes so that spam, the positive case, comes first
new_cm = np.flip(new_cm)


new_cm_df = pd.DataFrame(new_cm, 
               columns=['predicted_spam', 'predicted_ham'], 
               index = ['actual_spam', 'actual_ham'])



new_cm_df


Unnamed: 0,predicted_spam,predicted_ham
actual_spam,6,747
actual_ham,2,4812


In [186]:
TP = new_cm_df.loc['actual_spam', 'predicted_spam']
TN = new_cm_df.loc['actual_ham', 'predicted_ham']
FP = new_cm_df.loc['actual_ham', 'predicted_spam']
FN = new_cm_df.loc['actual_spam', 'predicted_ham']

print(f'Precision is {TP/(TP + FP)}')
print(f'Recall is {TP/(TP + FN)}')


Precision is 0.75
Recall is 0.00796812749003984



Precision is relatively better than recall. Why is this? We have designed our dummy classifier so that it almost never predicts "spam." The data we added contained only 8 predictions of "spam," 6 of which were correct (6/8 = 0.75). So, when our dummy model does predict the positive class (i.e., "spam"), it does a pretty decent job. It's a reasonably precise model.

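What our model does not do well is cover the actual positive cases. Of the 753 messages that are actually "spam," it flags only 6 (6/753 ≈ 0.008), so recall is very low.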
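Padding the data is one way around the division by zero. As a hedged alternative, scikit-learn's `precision_score` and `recall_score` accept a `zero_division` argument that sets the value returned when the denominator is 0. The sketch below does not read `sms_spam.csv`; it rebuilds stand-in labels from the class counts shown earlier (4812 ham, 747 spam).

```python
from sklearn.metrics import precision_score, recall_score

# Stand-in labels with the same class counts as the SMS data (an assumption
# made so the sketch runs without the CSV file).
y_true_all = ["ham"] * 4812 + ["spam"] * 747
y_pred_all = ["ham"] * len(y_true_all)  # the majority-class model: always "ham"

# With no "spam" predictions, TP + FP == 0; zero_division=0 returns 0.0
# for precision instead of raising a warning.
precision_majority = precision_score(
    y_true_all, y_pred_all, pos_label="spam", zero_division=0
)
recall_majority = recall_score(
    y_true_all, y_pred_all, pos_label="spam", zero_division=0
)

print(precision_majority, recall_majority)  # 0.0 0.0
```

Recall is a genuine 0 here (0 of 747 spam messages caught), whereas the precision of 0 is only a convention for an undefined ratio.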


Class imbalance also matters for precision. In our data, only about 13% of the examples are actually spam, so a classifier that instead guessed "spam" 50% of the time, regardless of the message, would overpredict "spam" and would not be very precise.



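Sensitivity (another name for recall) looks at all the examples that are actually "spam" and asks what proportion the classifier labeled "spam."
Specificity looks at all the examples that are actually "ham" and asks what proportion the classifier labeled "ham." This is different from asking how often the classifier's "ham" guesses are correct. In actuality, about 87% of the messages were "ham," so for a classifier that ignores the message content we would expect roughly that percentage of its "ham" guesses to be correct.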
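As a worked example, the snippet below computes both quantities from the counts in the padded confusion matrix shown earlier. The specificity formula, TN / (TN + FP), is the standard definition and is not spelled out elsewhere in this notebook.

```python
# Counts from the padded confusion matrix ("spam" is the positive class).
tp_padded, fn_padded = 6, 747     # actual spam: caught vs. missed
fp_padded, tn_padded = 2, 4812    # actual ham: wrongly flagged vs. passed

# Sensitivity (recall): share of actual spam that was labeled "spam".
sensitivity = tp_padded / (tp_padded + fn_padded)
# Specificity: share of actual ham that was labeled "ham".
specificity = tn_padded / (tn_padded + fp_padded)

print(f"Sensitivity: {sensitivity:.4f}")  # 0.0080
print(f"Specificity: {specificity:.4f}")  # 0.9996
```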





### Accuracy

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

### Precision (Positive Predictive Value) 

$$\text{Precision} = \frac{TP}{TP + FP}$$

Intuitively, precision asks: of the times your model predicts true, how often is it correct? This metric penalizes false positives heavily. Consider it when it's OK to have some false negatives but not false positives. Imagine your model is predicting the verdict of a trial: it's better to let a criminal go free than to punish an innocent person.

### Recall (Sensitivity) 

$$\text{Recall} = \frac{TP}{TP + FN}$$

Intuitively, recall asks: of the times the actual outcome is true, how often does your model predict true? This metric penalizes false negatives heavily. Consider it when it's OK to have some false positives but not false negatives.


### F1 Score

F1 score is the harmonic mean of precision and recall. 


$$\text{F}_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
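To see how the harmonic mean behaves, the sketch below plugs in the precision (6/8) and recall (6/753) of the padded dummy classifier and compares the result with a plain arithmetic mean. The comparison with the arithmetic mean is added here for illustration.

```python
# Precision and recall of the padded dummy classifier.
dummy_precision = 6 / 8      # TP / (TP + FP)
dummy_recall = 6 / 753       # TP / (TP + FN)

# Harmonic mean (F1) vs. arithmetic mean of the same two numbers.
dummy_f1 = 2 * dummy_precision * dummy_recall / (dummy_precision + dummy_recall)
arithmetic_mean = (dummy_precision + dummy_recall) / 2

print(f"F1 = {dummy_f1:.4f}")                      # F1 = 0.0158
print(f"Arithmetic mean = {arithmetic_mean:.4f}")  # Arithmetic mean = 0.3790
```

The F1 score stays close to the weaker of the two metrics, so a model cannot earn a good F1 with high precision alone.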

In [8]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

X, y = make_classification()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)
clf = SVC().fit(X_train, y_train)
confusion_matrix(y_test, clf.predict(X_test))

array([[10,  0],
       [ 1,  9]], dtype=int64)

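scikit-learn orders the rows and columns of this array by label, so class 0 (false) comes first: the top-left cell holds the true negatives and the bottom-right cell holds the true positives. Rearranged to match the layout used earlier, with the positive class first:

|            | predicted true | predicted false |
|------------|----------------|-----------------|
|actual true |         9      |        1        |
|actual false|         0      |       10        |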
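The four counts can be pulled out of the array with `ravel()`. The snippet hard-codes the array printed above, since `make_classification()` is called without a fixed seed and a rerun may give different numbers.

```python
import numpy as np

# The array printed above; rows are actual classes and columns are predicted
# classes, both ordered as class 0 (negative), then class 1 (positive).
cm_svc = np.array([[10, 0],
                   [1, 9]])

# With that ordering, ravel() yields TN, FP, FN, TP.
tn_svc, fp_svc, fn_svc, tp_svc = cm_svc.ravel()

svc_accuracy = (tp_svc + tn_svc) / cm_svc.sum()
svc_precision = tp_svc / (tp_svc + fp_svc)
svc_recall = tp_svc / (tp_svc + fn_svc)

print(f"Accuracy:  {svc_accuracy:.2f}")   # 0.95
print(f"Precision: {svc_precision:.2f}")  # 1.00
print(f"Recall:    {svc_recall:.2f}")     # 0.90
```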