In [18]:
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

import pandas as pd

In [16]:
# spam_ham.size
# X = np.zeros([spam_ham.size, 1])
spam_ham.shape[0]

1390

From ML with R:
"The goal of evaluating a classification model is to have a better understanding of how its performance will extrapolate to future cases. Since it is usually not feasible to test a still unproven model in a live environment, we typically simulate future conditions by asking the model to classify a dataset made of cases that resemble what it will be asked to do in the future. By observing the learner's responses to this examination, we can learn about its strengths and weaknesses."

We accomplish this by comparing the model's predicted class values to the actual class values.

### Accuracy

The simplest measure of a classifier's performance is overall accuracy: the number of predictions the classifier got right divided by the total number of predictions made.
<br/><br/>

$$\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}$$
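As a minimal sketch of the formula, accuracy can be computed by hand by counting matching labels. The two label lists below are hypothetical and invented purely for illustration; they are not drawn from the SMS dataset.

```python
# Hypothetical labels, invented purely for illustration
y_true = ["ham", "ham", "spam", "ham", "spam"]
y_pred = ["ham", "spam", "spam", "ham", "ham"]

# Count the predictions that match the actual class
n_correct = sum(actual == predicted for actual, predicted in zip(y_true, y_pred))
accuracy_by_hand = n_correct / len(y_true)

print(f"{n_correct} of {len(y_true)} correct, accuracy = {accuracy_by_hand}")
```

On real data, `accuracy_score` from `sklearn.metrics` performs the same calculation.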

<br/><br/>

Imagine that we are trying to build a model to filter out spam email. The following data contains the text of messages along with their actual type. "Ham" messages are legitimate messages, whereas "spam" messages are, well, spam.


In [39]:
#Data from https://github.com/PacktPublishing/Machine-Learning-with-R-Third-Edition
spam_ham = pd.read_csv('sms_spam.csv')
spam_ham.head()

Unnamed: 0,type,text
0,ham,Hope you are having a good week. Just checking in
1,ham,K..give back my thanks.
2,ham,Am also doing in cbe only. But have to pay.
3,spam,"complimentary 4 STAR Ibiza Holiday or £10,000 ..."
4,spam,okmail: Dear Dave this is your final notice to...


If we have a quick look at the type column, we can see that most messages are ham.

In [45]:
spam_ham[['type']].groupby(['type']).size().reset_index(name='outcome counts')

Unnamed: 0,type,outcome counts
0,ham,4812
1,spam,747


Using this data, let's build the simplest classifier possible, which just predicts the majority class. Sometimes this is referred to as ZeroR for "zero rules" since it does not learn any associations between features and outcomes. 

In [46]:
dummy_clf = DummyClassifier(strategy="most_frequent")
# Note: by definition, this model ignores the features
dummy_clf.fit(spam_ham[['text']], spam_ham[['type']])
dummy_predictions = dummy_clf.predict(spam_ham[['text']])

# Confirm that all of the predictions are the same
print(pd.unique(dummy_predictions))

['ham']


Let's quickly confirm that all of the predictions are indeed "ham".

In [47]:
np.unique(dummy_predictions, return_counts=True)

(array(['ham'], dtype='<U3'), array([5559]))

In [44]:
accuracy = round(accuracy_score(spam_ham[['type']], dummy_predictions),2)
print(f'Accuracy of majority class model is: {accuracy}')

Accuracy of majority class model is: 0.87


Judging by accuracy alone, we might conclude that the ZeroR classifier is doing a pretty decent job. But of course, this is a pretty terrible model if we actually want to use it to filter out spam since it catches a whopping 0% of spam messages. 
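The gap between the two views can be made concrete with the class counts from the table above (4812 ham, 747 spam). This is a small sketch that treats spam as the positive class and works directly from those counts rather than from the fitted model; the variable names are illustrative.

```python
# Class counts taken from the outcome-counts table above
n_ham, n_spam = 4812, 747

# A majority-class model labels every message "ham"
true_positives = 0          # spam messages correctly flagged as spam
false_negatives = n_spam    # spam messages that slipped through as ham

accuracy = n_ham / (n_ham + n_spam)
spam_recall = true_positives / (true_positives + false_negatives)

print(f"accuracy:    {accuracy:.2f}")
print(f"spam recall: {spam_recall:.2f}")
```

The accuracy rounds to 0.87, matching the value reported above, while the share of spam caught is exactly zero.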

In [9]:
spam_ham [['actual_type']].groupby(['actual_type']).size().reset_index(name='counts')

Unnamed: 0,actual_type,counts
0,ham,1207
1,spam,183


## Confusion Matrix

A confusion matrix applies only to classification tasks. It cross-tabulates the actual classes against the predicted classes:

|            | predicted true | predicted false |
|------------|----------------|-----------------|
|actual true | True Positive  | False Negative  |
|actual false| False Positive | True Negative   |
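The four cells can be pulled out of scikit-learn's `confusion_matrix` as sketched below. The labels are hypothetical and invented for illustration. Note that scikit-learn sorts the labels in ascending order, so for 0/1 labels its array is laid out as `[[TN, FP], [FN, TP]]`, which puts the cells in a different order from the table above.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = positive class, 0 = negative class
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

# Rows are actual classes, columns are predicted classes, labels sorted
# ascending, so flattening the 2x2 array yields TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
```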

---------------------------------------------------

### Accuracy

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

### Precision (Positive Predictive Value)

$$\text{Precision} = \frac{TP}{TP + FP}$$

Intuitively, precision asks: of all the times the model predicted the positive class, how often was it correct? This metric heavily penalizes false positives, so it should be prioritized when some false negatives are acceptable but false positives are not. Imagine a model that predicts the verdict of a criminal trial: letting a guilty person go free is preferable to punishing an innocent one.
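A minimal sketch of precision computed from raw counts follows. The helper name `precision_from_counts` and the counts are hypothetical. Returning 0.0 when the model makes no positive predictions is a convention chosen for this sketch; strictly speaking, precision is undefined in that case.

```python
def precision_from_counts(tp: int, fp: int) -> float:
    """Fraction of positive predictions that were correct.

    Returns 0.0 when there are no positive predictions at all; that
    fallback is a convention chosen for this sketch.
    """
    predicted_positive = tp + fp
    if predicted_positive == 0:
        return 0.0
    return tp / predicted_positive


# Hypothetical spam filter: 50 messages flagged, 40 of them really spam
flagged_precision = precision_from_counts(tp=40, fp=10)

# A model that never predicts the positive class has nothing to be precise about
never_positive_precision = precision_from_counts(tp=0, fp=0)

print(flagged_precision, never_positive_precision)
```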

### Recall (Sensitivity) 

$$\text{Recall} = \frac{TP}{TP + FN}$$

Intuitively, recall asks: of all the cases that are actually positive, how many did the model identify? This metric heavily penalizes false negatives, so it should be prioritized when some false positives are acceptable but false negatives are not.
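With string labels, scikit-learn's `recall_score` needs to be told which class counts as positive through its `pos_label` parameter. The sketch below uses hypothetical labels invented for illustration.

```python
from sklearn.metrics import recall_score

# Hypothetical labels, invented for illustration
y_true = ["spam", "ham", "spam", "spam", "ham", "spam"]
y_pred = ["spam", "ham", "ham", "spam", "spam", "ham"]

# pos_label names the class treated as "positive"; here 2 of the 4 actual
# spam messages are caught and 2 are missed
spam_recall = recall_score(y_true, y_pred, pos_label="spam")

print(spam_recall)
```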


### F1 Score

F1 score is the harmonic mean of precision and recall. 


$$\text{F}_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
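The harmonic mean is pulled toward the smaller of its two inputs, which a quick comparison with the arithmetic mean illustrates. The helper `f1_from` and the precision and recall values below are hypothetical.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical model: always right when it predicts positive,
# but it finds only 10% of the actual positives
precision, recall = 1.0, 0.1

arithmetic_mean = (precision + recall) / 2
f1 = f1_from(precision, recall)

print(f"arithmetic mean: {arithmetic_mean:.2f}")
print(f"F1:              {f1:.2f}")
```

The arithmetic mean suggests a middling model, whereas F1 stays close to the weak recall.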

In [8]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

X, y = make_classification()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)
clf = SVC().fit(X_train, y_train)
confusion_matrix(y_test, clf.predict(X_test))

array([[10,  0],
       [ 1,  9]], dtype=int64)

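scikit-learn puts actual classes in rows and predicted classes in columns, with labels in sorted order, so the array above reads `[[TN, FP], [FN, TP]]`. Rearranged into the layout used earlier, with class 1 as the positive class:

|            | predicted true | predicted false |
|------------|----------------|-----------------|
|actual true |         9      |        1        |
|actual false|         0      |       10        |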
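As a worked check, the four metrics defined earlier can be computed from the array printed above. This sketch assumes scikit-learn's `[[TN, FP], [FN, TP]]` layout for 0/1 labels and treats class 1 as the positive class.

```python
import numpy as np

# The matrix printed by confusion_matrix above
cm = np.array([[10, 0],
               [1, 9]])

# Flattening row by row gives TN, FP, FN, TP for 0/1 labels
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

The single misclassified test case is a false negative, so precision is unaffected and only recall drops below 1.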