# Evaluation Metrics for Classification

## Confusion Matrix

A **confusion matrix** summarizes a classification model's results by cross-tabulating actual labels (rows) against predicted labels (columns):

|                  | **Predicted -**      | **Predicted +**      |
|------------------|----------------------|----------------------|
| **Actual -**      | True Negative (TN)   | False Positive (FP)   |
| **Actual +**      | False Negative (FN)  | True Positive (TP)    |

### Key Metrics

- **Accuracy**: 
  $$ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} $$

- **Precision**: 
  $$ \text{Precision} = \frac{TP}{TP + FP} $$

- **Recall**: 
  $$ \text{Recall} = \frac{TP}{TP + FN} $$

- **F1-Score**: 
  $$ \text{F1-Score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} $$

Each metric summarizes a different aspect of the model's performance, so they are usually read together rather than in isolation.


In [7]:
from sklearn.metrics import confusion_matrix

actual = [1, 0, 0, 1, 1, 1, 0, 1, 1, 1]
predicted = [0, 1, 1, 1, 1, 0, 1, 0, 1, 0]

# Tally each outcome by comparing actual and predicted labels pairwise.
true_positives = 0
true_negatives = 0
false_positives = 0
false_negatives = 0

for a, p in zip(actual, predicted):
  if a == 1 and p == 1:
    true_positives += 1
  elif a == 0 and p == 0:
    true_negatives += 1
  elif a == 0 and p == 1:
    false_positives += 1
  elif a == 1 and p == 0:
    false_negatives += 1

print(true_positives, true_negatives, false_positives, false_negatives)

conf_matrix = confusion_matrix(actual, predicted)

print(conf_matrix)

3 0 3 4
[[0 3]
 [4 3]]
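
The printed matrix uses scikit-learn's layout: rows are actual labels and columns are predicted labels, both in sorted label order, so for 0/1 labels it reads `[[TN, FP], [FN, TP]]`, matching the table above. As a minimal sketch, the four counts can be unpacked directly with `.ravel()`; the short names `tn`, `fp`, `fn`, and `tp` are illustrative and not part of the original cell.

```python
from sklearn.metrics import confusion_matrix

actual = [1, 0, 0, 1, 1, 1, 0, 1, 1, 1]
predicted = [0, 1, 1, 1, 1, 0, 1, 0, 1, 0]

# For binary 0/1 labels the matrix is [[TN, FP], [FN, TP]];
# ravel() flattens it row by row into those four counts.
tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()

print(tp, tn, fp, fn)  # 3 0 3 4, the same counts as the manual loop
```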


## Accuracy

**Accuracy** is a common metric for evaluating classification models. It is the ratio of correct predictions (true positives and true negatives) to the total number of predictions.

$$ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} $$

Where:
- **TP** = True Positives
- **TN** = True Negatives
- **FP** = False Positives
- **FN** = False Negatives

Let's calculate the accuracy of the predictions above, using the counts from the confusion matrix.


In [8]:
accuracy = (true_positives + true_negatives) / len(predicted)
accuracy

0.3

## Recall

**Recall** is useful when the goal is to capture as many true positive cases as possible. It measures the ratio of correct positive predictions (True Positives) to the total number of actual positive cases.

$$ \text{Recall} = \frac{TP}{TP + FN} $$

Where:
- **TP** = True Positives
- **FN** = False Negatives

For example, in a spam classifier, recall is the number of correctly labeled spam emails divided by the number of actual spam emails in the dataset.

A model that always predicts "not spam" can have high accuracy when spam is rare, but its recall is 0 because it never identifies a true positive.
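
To make the always-"not spam" case concrete, here is a minimal sketch on a made-up inbox of 95 legitimate emails and 5 spam emails. The data and all variable names are illustrative assumptions, chosen so they do not overwrite the variables used in the other cells.

```python
# Hypothetical inbox: 95 legitimate emails (0) and 5 spam emails (1).
inbox_actual = [0] * 95 + [1] * 5
# A trivial "model" that labels every email as not spam.
never_spam_predicted = [0] * len(inbox_actual)

pairs = list(zip(inbox_actual, never_spam_predicted))
demo_tp = sum(a == 1 and p == 1 for a, p in pairs)
demo_tn = sum(a == 0 and p == 0 for a, p in pairs)
demo_fn = sum(a == 1 and p == 0 for a, p in pairs)

demo_accuracy = (demo_tp + demo_tn) / len(inbox_actual)
demo_recall = demo_tp / (demo_tp + demo_fn)

print(demo_accuracy)  # 0.95
print(demo_recall)    # 0.0
```

The accuracy looks strong only because the negative class dominates; recall exposes that no spam was caught.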


In [9]:
recall = true_positives / (true_positives + false_negatives)
recall

0.42857142857142855

## Precision

**Precision** measures how trustworthy the model's positive predictions are. It is the ratio of correct positive predictions (True Positives) to the total number of positive predictions (True Positives plus False Positives).

$$ \text{Precision} = \frac{TP}{TP + FP} $$

Where:
- **TP** = True Positives
- **FP** = False Positives

For example, in a spam classifier, precision is the number of correctly labeled spam emails divided by the number of emails predicted as spam, whether correctly or not.

A model that labels every email as spam has a recall of 1 but, when most emails are legitimate, very low precision because of the large number of false positives.
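
The opposite extreme can be sketched the same way. The inbox below (95 legitimate emails, 5 spam) is made up for illustration, and the variable names are assumptions chosen to avoid clashing with the other cells.

```python
# Hypothetical inbox: 95 legitimate emails (0) and 5 spam emails (1).
mail_actual = [0] * 95 + [1] * 5
# A trivial "model" that labels every email as spam.
all_spam_predicted = [1] * len(mail_actual)

pairs = list(zip(mail_actual, all_spam_predicted))
spam_tp = sum(a == 1 and p == 1 for a, p in pairs)
spam_fp = sum(a == 0 and p == 1 for a, p in pairs)
spam_fn = sum(a == 1 and p == 0 for a, p in pairs)

spam_recall = spam_tp / (spam_tp + spam_fn)
spam_precision = spam_tp / (spam_tp + spam_fp)

print(spam_recall)     # 1.0
print(spam_precision)  # 0.05
```

Every spam email is caught, but only 5 of the 100 positive predictions are correct.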


In [10]:
precision = true_positives / (true_positives + false_positives)
precision

0.5

## F1-Score

The **F1-score** combines precision and recall into a single statistic by taking their harmonic mean, which weights the two equally.

$$ \text{F1-Score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} $$

We use the harmonic mean instead of the arithmetic mean because we want the F1-score to be low when either precision or recall is close to 0.

For example, if recall = 1 and precision = 0.02:

- Arithmetic mean:
  $$ \frac{1 + 0.02}{2} = 0.51 $$

  This looks respectable even though only 2% of the positive predictions are correct.

- Harmonic mean (F1-score):
  $$ \frac{2 \times 1 \times 0.02}{1 + 0.02} \approx 0.039 $$

  This result more accurately reflects the effectiveness of the classifier.
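
The comparison above can be checked numerically. This is a small sketch; the helper names `arithmetic_mean` and `harmonic_mean` are illustrative, and returning 0.0 when both inputs are zero is an assumed convention rather than something the text specifies.

```python
def arithmetic_mean(precision_value, recall_value):
    return (precision_value + recall_value) / 2


def harmonic_mean(precision_value, recall_value):
    total = precision_value + recall_value
    # Assumed convention: define the score as 0.0 when both inputs are 0.
    if total == 0:
        return 0.0
    return 2 * precision_value * recall_value / total


# (precision, recall) pairs: the example from the text, a balanced case,
# and a case where recall rather than precision is the weak side.
for p, r in [(0.02, 1.0), (0.5, 0.5), (0.9, 0.1)]:
    print(p, r, round(arithmetic_mean(p, r), 3), round(harmonic_mean(p, r), 3))
```

When precision and recall are equal, the two means agree; the further apart they are, the more the harmonic mean is pulled toward the smaller value.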


In [11]:
f_1 = 2 * precision * recall / (precision + recall)
f_1

0.4615384615384615

## Review

There is no perfect metric for evaluating a classification model. The decision to use **accuracy**, **precision**, **recall**, **F1-score**, or another metric depends on the specific context of the problem.

For example, in email spam classification we may prefer a model with high **precision**, so that important emails are not mistakenly labeled as spam, even if that means some spam emails end up in the inbox (lower recall).

Understanding the question you're trying to answer will guide you in choosing the most relevant statistic for your problem.
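
As a sketch of that trade-off, the snippet below compares two hypothetical spam filters. The filter names and their counts are invented for illustration only, and `precision_recall` is a helper defined here, not a library function.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from raw confusion-matrix counts."""
    return tp / (tp + fp), tp / (tp + fn)


# Made-up counts for two filters evaluated on the same 100 spam emails.
filters = {
    "cautious": {"tp": 60, "fp": 2, "fn": 40},     # flags little, rarely wrong
    "aggressive": {"tp": 95, "fp": 30, "fn": 5},   # flags a lot, catches more
}

results = {name: precision_recall(**counts) for name, counts in filters.items()}

for name, (p, r) in results.items():
    print(f"{name}: precision={p:.2f}, recall={r:.2f}")
# cautious: precision=0.97, recall=0.60
# aggressive: precision=0.76, recall=0.95
```

If losing an important email is the costlier mistake, the cautious filter is the better fit despite its lower recall; if missed spam is costlier, the aggressive one is.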

The Python library **scikit-learn** provides functions to calculate all of these metrics.

Key Takeaways:
- Classifications can result in **True Positive (TP)**, **True Negative (TN)**, **False Positive (FP)**, or **False Negative (FN)**. These values are summarized in a **confusion matrix**.
- **Accuracy** measures the proportion of correct classifications out of all classifications made.
- **Recall** is the ratio of correct positive classifications to all actual positives.
- **Precision** is the ratio of correct positive classifications to all predicted positives.
- **F1-score** combines precision and recall. It will be low if either precision or recall is low.


In [12]:
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

actual = [1, 0, 0, 1, 1, 1, 0, 1, 1, 1]
predicted = [0, 1, 1, 1, 1, 0, 1, 0, 1, 0]

print(accuracy_score(actual, predicted))

print(recall_score(actual, predicted))

print(precision_score(actual, predicted))

print(f1_score(actual, predicted))

0.3
0.42857142857142855
0.5
0.46153846153846156
