<font size=5>Confusion Matrix</font>

<p>When we build a model, we need to assess how well it performs, particularly on unseen data. Several evaluation metrics exist for this, from predictive measures such as accuracy to practical ones such as the time taken. This section covers some of the most important and useful ones.</p>
<p>Which metrics can we use to evaluate a classification model? The obvious one is accuracy on an unseen test set. Accuracy is the fraction of samples predicted correctly, i.e. the number of correct predictions divided by the total number of predictions.</p>
<p>Another tool is the confusion matrix, which tabulates the number of samples for each combination of actual and predicted class. It is useful because it shows not only how many predictions the model got right but also which kinds of mistakes it makes, and that tells us where to focus when improving the model.</p>
<p>For binary classification, the four cells of the confusion matrix have standard names that are worth knowing.</p>
<ol>
    <li>True Positives: This is the number of samples predicted positive which were actually positive.</li>
    <li>True Negatives: This is the number of samples predicted negative which were actually negative.</li>
    <li>False Positives: This is the number of samples predicted positive which were <b>not</b> actually positive.</li>
    <li>False Negatives: This is the number of samples predicted negative which were <b>not</b> actually negative.</li>
</ol>
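<p>In multi-class classification, the confusion matrix has one row and one column per class. Instead of true positives and so on, it shows for each class how many samples were predicted correctly (the diagonal entries) and how many were confused with each other class (the off-diagonal entries).</p>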
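<p>As a small illustration of the multi-class case, the sketch below uses three made-up classes (the labels are dummy data, not the output of a real model) and counts the correctly and wrongly predicted samples per class. The variable names are only illustrative.</p>

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Dummy three-class labels, invented for illustration only
y_true_mc = [0, 1, 2, 2, 1, 0]
y_pred_mc = [0, 2, 2, 2, 1, 1]

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true_mc, y_pred_mc)
print(cm)

# Diagonal: samples of each class that were predicted correctly
correct_per_class = np.diag(cm)
# Row sum minus diagonal: samples of each class predicted as another class
wrong_per_class = cm.sum(axis=1) - correct_per_class

print(correct_per_class)
print(wrong_per_class)
```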

In [1]:
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 0, 1]  # dummy actual labels
y_pred = [1, 1, 0, 0, 1]  # dummy predicted labels

# Rows are actual classes, columns are predicted classes
print(confusion_matrix(y_true, y_pred))

[[1 2]
 [1 1]]
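
<p>The four terms defined earlier can be read directly off a binary confusion matrix. A minimal sketch, reusing the same dummy labels: flattening the 2x2 matrix with ravel() gives the counts in the order TN, FP, FN, TP, and accuracy follows as the share of correct predictions.</p>

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [0, 0, 1, 0, 1]  # dummy actual labels
y_pred = [1, 1, 0, 0, 1]  # dummy predicted labels

# For binary labels, ravel() flattens the matrix row by row: TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)

# Accuracy by hand: correct predictions divided by all predictions
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy, accuracy_score(y_true, y_pred))
```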


<font size=5>Classification Measures</font>

<p>Measures other than the confusion matrix can also help us understand and analyse a model's performance. We discuss two of them here: precision and recall.</p>
<p>Note that precision and recall are defined per class label, not for the dataset as a whole. Precision is the fraction of samples predicted as a given class that actually belong to that class, i.e. TP / (TP + FP). Recall is the fraction of samples actually belonging to a given class that were predicted as that class, i.e. TP / (TP + FN).</p>
<p>How do we choose between precision and recall? Which is better depends on which kind of error matters more for the problem at hand. When a single number is needed, we can use a metric that combines both: the F1 score. The F1 score is the harmonic mean of precision and recall, 2 * precision * recall / (precision + recall), so it is high only when both are high. It is usually a better indicator of model performance than either precision or recall alone.</p>
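
<p>To make the definitions concrete, the sketch below computes precision, recall and F1 for class 1 by hand from the dummy labels used earlier, then compares them with scikit-learn's precision_score, recall_score and f1_score. The variable names are only illustrative.</p>

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [0, 0, 1, 0, 1]  # dummy actual labels
y_pred = [1, 1, 0, 0, 1]  # dummy predicted labels

# Counts for class 1, treated here as the positive class
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(precision, recall, f1)
# The library functions default to class 1 as the positive label
print(precision_score(y_true, y_pred),
      recall_score(y_true, y_pred),
      f1_score(y_true, y_pred))
```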


In [3]:
from sklearn.metrics import classification_report

print(classification_report(y_true,y_pred))

             precision    recall  f1-score   support

          0       0.50      0.33      0.40         3
          1       0.33      0.50      0.40         2

avg / total       0.43      0.40      0.40         5



<p>Tip: Accuracy is not always a good measure of model performance. It is misleading when the class labels are highly imbalanced, because a model that predicts the majority class for almost every sample still achieves a high accuracy while getting most of the minority class wrong. In such cases, the F1 score is a better metric. Another option is ROC AUC, which stands for Receiver Operating Characteristic - Area Under Curve: it is the area under the ROC curve, and scikit-learn returns it from roc_auc_score.</p>
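
<p>The effect is easy to reproduce. The sketch below assumes a made-up dataset with 95 negative and 5 positive samples and a hypothetical model that always predicts the majority class, giving every sample the same score. The data and variable names are invented for illustration, and the zero_division argument assumes a reasonably recent scikit-learn version.</p>

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Made-up imbalanced labels: 95 negatives, 5 positives
y_true_imb = [0] * 95 + [1] * 5
# Hypothetical model that always predicts the majority class
y_pred_imb = [0] * 100
# Its score for the positive class is the same for every sample
y_score_imb = [0.0] * 100

acc = accuracy_score(y_true_imb, y_pred_imb)
# zero_division=0 silences the warning raised because class 1 is never predicted
f1 = f1_score(y_true_imb, y_pred_imb, zero_division=0)
# ROC AUC is computed from scores, not from hard class predictions
auc = roc_auc_score(y_true_imb, y_score_imb)

print(acc, f1, auc)
```

<p>Accuracy comes out at 0.95 even though not a single positive sample is found, while the F1 score for the positive class is 0 and the ROC AUC is 0.5, the value obtained when the scores cannot separate the two classes at all.</p>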