<img src="../figs/holberton_logo.png" alt="logo" width="500"/>

# Error Analysis

Evaluation measures are essential for **estimating the performance of machine learning models**. They help us understand how well a model performs on new data and how accurately it predicts outcomes.


These measures provide **quantifiable insights** into the model's effectiveness, guiding decisions on model selection, refinement, and deployment. 


Without thorough evaluation, it's challenging to determine if a model meets its objectives.

## What is the confusion matrix?

The **confusion matrix is a key tool for error analysis in machine learning**. It is a matrix that shows the number of correct and incorrect predictions made by a classification model on a dataset. The matrix is usually represented as a table with rows and columns representing the predicted and actual classes, respectively.

<img src="../figs/3-supervised/confusion.png" alt="logo" width="500"/>

The four possible outcomes in a binary classification problem are:

1. **True Positive (TP)**: when the model correctly predicts a positive example (i.e., it correctly identifies the presence of a particular class)


2. **False Positive (FP)**: when the model incorrectly predicts a positive example (i.e., it identifies the presence of a particular class when it is not actually present)


3. **True Negative (TN)**: when the model correctly predicts a negative example (i.e., it correctly identifies the absence of a particular class)


4. **False Negative (FN)**: when the model incorrectly predicts a negative example (i.e., it fails to identify the presence of a particular class)

#### Example: Binary Classification (dog vs cat)

|             | Actual Dog   | Actual Cat   |
|-------------|--------------|--------------|
| Predicted Dog   | True Positive (TP)   | False Positive (FP)   |
| Predicted Cat   | False Negative (FN)  | True Negative (TN)    |
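The four cells of the table can be counted directly from pairs of actual and predicted labels. The sketch below is a minimal illustration, assuming "dog" is the positive class; the helper name `confusion_counts` and the label lists are hypothetical and invented for this example.

```python
def confusion_counts(y_true, y_pred, positive="dog"):
    """Count TP, FP, TN and FN for a binary classification problem."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted dog, actually dog
            else:
                fp += 1  # predicted dog, actually cat
        else:
            if actual == positive:
                fn += 1  # predicted cat, actually dog
            else:
                tn += 1  # predicted cat, actually cat
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}


# Hypothetical labels, invented for illustration only.
y_true = ["dog", "dog", "dog", "cat", "cat", "cat", "cat", "dog"]
y_pred = ["dog", "cat", "dog", "cat", "dog", "cat", "cat", "dog"]

counts = confusion_counts(y_true, y_pred)
print(counts)
```

Every sample falls into exactly one cell, so the four counts always sum to the number of samples.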


## What is type I error? type II?

Type I and Type II errors are two types of errors that can occur in hypothesis testing.


- **Type I error**, also known as a **false positive**, occurs when a null hypothesis that is actually true is rejected based on sample data. In other words, it is the error of concluding that there is a significant effect or difference when in reality there is none. The probability of making a Type I error is denoted by the Greek letter alpha ($\alpha$).



- **Type II error**, also known as a **false negative**, occurs when a null hypothesis that is actually false is not rejected based on sample data. In other words, it is the error of failing to detect a significant effect or difference when one actually exists. The probability of making a Type II error is denoted by the Greek letter beta ($\beta$), and is affected by factors such as the sample size, the magnitude of the effect being tested, and the level of statistical significance chosen for the test.

**Example**

Type I error (False Positive): This occurs when the model predicts a sample as positive (dog) when it is actually negative (cat).

Type II error (False Negative): This occurs when the model predicts a sample as negative (cat) when it is actually positive (dog). 

<img src="../figs/3-supervised/errors.png" alt="logo" width="500"/>


Both Type I and Type II errors can have serious consequences, depending on the context in which the hypothesis testing is being conducted. For example, in medical testing, a Type I error could lead to a patient being diagnosed with a condition they do not have, while a Type II error could lead to a patient being given a clean bill of health when they actually have the condition.
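For a classifier, the two error types can be turned into rates: the share of actual negatives flagged as positive (false positive rate) and the share of actual positives that were missed (false negative rate). These are loosely the empirical counterparts of $\alpha$ and $\beta$. The sketch below assumes the four confusion-matrix counts are already known; the function name `error_rates` and the counts are hypothetical.

```python
def error_rates(tp, fp, tn, fn):
    """Return (Type I rate, Type II rate) from confusion-matrix counts."""
    type_1_rate = fp / (fp + tn)  # false positives among actual negatives
    type_2_rate = fn / (fn + tp)  # false negatives among actual positives
    return type_1_rate, type_2_rate


# Hypothetical counts, invented for illustration only.
type_1, type_2 = error_rates(tp=40, fp=10, tn=90, fn=10)
print(f"Type I (false positive) rate:  {type_1:.2f}")
print(f"Type II (false negative) rate: {type_2:.2f}")
```

Which of the two rates matters more depends on the cost of each mistake in the application, as in the medical example above.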

## What is sensitivity? specificity? precision? recall?

Sensitivity, specificity, precision, and recall are all metrics commonly used to evaluate the performance of a classification model. These metrics are particularly important in cases where the positive and negative classes are not evenly distributed in the dataset.

- **Sensitivity** (also known as recall or true positive rate): **measures the proportion of actual positive instances that are correctly identified by the model**. It is calculated as 

$$
\frac{TP}{TP + FN}
$$
- `TP` is the number of true positives 
- `FN` is the number of false negatives.

Sensitivity is an important metric when the goal is to minimize false negatives (i.e., to avoid missing positive instances).


- **Specificity** (also known as true negative rate): measures the proportion of actual negative instances that are correctly identified by the model. It is calculated as
$$
\frac{TN}{TN + FP}
$$

- `TN` is the number of true negatives  
- `FP` is the number of false positives. 

Specificity is an important metric when the goal is to minimize false positives (i.e., to avoid labeling negative instances as positive).

- **Precision** (also known as positive predictive value): measures the proportion of instances that are correctly classified as positive out of all instances predicted as positive. It is calculated as 
$$
\frac{TP}{TP + FP}
$$

- `TP` is the number of true positives 
- `FP` is the number of false positives. 

Precision is an important metric when the goal is to minimize false positives (i.e., to avoid labeling negative instances as positive).

- **Recall** (also known as sensitivity): measures the proportion of actual positive instances that are correctly identified by the model out of all actual positive instances. It is calculated as 
$$
\frac{TP}{TP + FN}
$$

- `TP` is the number of true positives 
- `FN` is the number of false negatives.

Recall is an important metric when the goal is to minimize false negatives (i.e., to avoid missing positive instances).

In general, high sensitivity and high specificity are desirable, but in some cases, there may be a trade-off between these two metrics. Precision is also important when there is a high cost associated with false positives, while recall is important when there is a high cost associated with false negatives.
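One common source of the trade-off between sensitivity and specificity is the decision threshold applied to a model's scores. The sketch below is an illustration under stated assumptions: the classifier outputs a score between 0 and 1, a sample is called positive when its score reaches the threshold, and the scores and labels are hypothetical values invented for this example.

```python
# Hypothetical scores and true labels (1 = positive, 0 = negative).
scores = [0.95, 0.85, 0.70, 0.60, 0.55, 0.45, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]


def sensitivity_specificity(scores, labels, threshold):
    """Compute (sensitivity, specificity) when score >= threshold means positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 0)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    return tp / (tp + fn), tn / (tn + fp)


for threshold in (0.25, 0.50, 0.75):
    sens, spec = sensitivity_specificity(scores, labels, threshold)
    print(f"threshold={threshold:.2f}  sensitivity={sens:.2f}  specificity={spec:.2f}")
```

On this toy data, a low threshold catches every positive but mislabels most negatives, while a high threshold does the opposite.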

### Example

- **Sensitivity** (True Positive Rate): This is the proportion of actual positives (dogs) that are correctly identified as positive by the model. For example, if the model correctly identifies 90 out of 100 dogs in the dataset, the sensitivity would be 0.9 or 90%.


- **Specificity** (True Negative Rate): This is the proportion of actual negatives (cats) that are correctly identified as negative by the model. For example, if the model correctly identifies 85 out of 100 cats in the dataset, the specificity would be 0.85 or 85%.


- **Recall** (also known as Sensitivity): This is the proportion of actual positives (dogs) that are correctly identified by the model. For example, if the model correctly identifies 90 out of 100 dogs in the dataset, the recall would be 0.9 or 90%.


- **Precision**: This is the proportion of samples predicted as positive (dogs) that are actually positive. For example, if the model identifies 95 samples as dogs, and 90 of them are actually dogs, the precision would be 90 / 95 ≈ 0.947 or 94.7%.


- **Accuracy**: This is the proportion of all samples that are correctly classified by the model. It is calculated as the sum of true positives and true negatives divided by the total number of samples. For example, if the model correctly identifies 90 dogs and 85 cats out of a total of 200 samples, the accuracy would be (90 + 85) / 200 = 0.875 or 87.5%.
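The figures in these examples can be reproduced with a few small helper functions. This is a minimal sketch; the function names are hypothetical. Each bullet above is treated as an independent example, so the counts are not meant to form a single confusion matrix.

```python
def sensitivity(tp, fn):
    """Recall / true positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True negative rate: TN / (TN + FP)."""
    return tn / (tn + fp)


def precision(tp, fp):
    """Positive predictive value: TP / (TP + FP)."""
    return tp / (tp + fp)


def accuracy(tp, tn, total):
    """Share of all samples that were classified correctly."""
    return (tp + tn) / total


# 90 of 100 dogs identified: TP = 90, FN = 10.
print(f"sensitivity / recall: {sensitivity(tp=90, fn=10):.3f}")
# 85 of 100 cats identified: TN = 85, FP = 15.
print(f"specificity:          {specificity(tn=85, fp=15):.3f}")
# 95 samples predicted as dogs, 90 of them correct: TP = 90, FP = 5.
print(f"precision:            {precision(tp=90, fp=5):.3f}")
# 90 dogs and 85 cats correct out of 200 samples.
print(f"accuracy:             {accuracy(tp=90, tn=85, total=200):.3f}")
```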

## What is an F1 score?

The `F1` score is a common metric used to evaluate the performance of a classification model. It is the harmonic mean of precision and recall, and is calculated as 

$$
\frac{2 \cdot (precision \cdot recall)}{precision + recall}
$$

The `F1` score takes into account both precision and recall, and is a useful metric when the positive and negative classes are not evenly distributed in the dataset. It is particularly useful when the goal is to balance both precision and recall, as it gives equal weight to both metrics.

A high `F1` score indicates that the model has both high precision and high recall, meaning it identifies most positive instances while producing few false positives and false negatives. Because the harmonic mean is pulled towards the smaller of its two inputs, a low `F1` score indicates that precision, recall, or both are low, and the model may need to be adjusted or improved.

In general, the `F1` score is a good metric to use when the positive class is rare or when there is an imbalance in the number of positive and negative instances in the dataset. However, it is important to note that the `F1` score does not take into account the true negatives, and may not be the best metric to use in cases where the negative class is also important. 
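The sketch below illustrates two points from this section: the harmonic mean penalizes a gap between precision and recall, and the score can be written purely in terms of `TP`, `FP`, and `FN`, without true negatives. The function names and the input values are hypothetical.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def f1_from_counts(tp, fp, fn):
    """Equivalent form 2TP / (2TP + FP + FN); true negatives do not appear."""
    return 2 * tp / (2 * tp + fp + fn)


# Both pairs have the same arithmetic mean (0.8), but different F1 scores.
balanced = f1_from_precision_recall(0.8, 0.8)
lopsided = f1_from_precision_recall(1.0, 0.6)
print(f"balanced precision/recall: F1 = {balanced:.3f}")
print(f"lopsided precision/recall: F1 = {lopsided:.3f}")

# Same score computed from counts (TP = 90, FP = 5, FN = 10).
print(f"F1 from counts: {f1_from_counts(tp=90, fp=5, fn=10):.3f}")
```

Changing the number of true negatives leaves `f1_from_counts` unchanged, which is why `F1` can be misleading when the negative class also matters.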

## What is bias? variance?

Bias and variance are two important concepts in machine learning that describe the properties of a model and its ability to generalize to new data.

- **Bias** refers to the tendency of a model to consistently make systematic errors in its predictions, regardless of the training data. **A model with high bias may oversimplify the problem and make assumptions that are not valid for the data**. Such a model may underfit the data and have low accuracy on both the training and test sets.


- **Variance** refers to the tendency of a model to be overly sensitive to the noise or randomness in the training data, leading to high variability in its predictions. **A model with high variance may overfit the training data and perform well on the training set**, but have poor generalization to new data.

<img src="../figs/3-supervised/biasvariance.png" alt="logo" width="350"/>


The **bias-variance tradeoff** is a key concept in machine learning, as it describes the tradeoff between these two properties. A model with high complexity (e.g., a deep neural network with many parameters) may have low bias but high variance, while a model with low complexity (e.g., a linear regression model) may have low variance but high bias. The goal is to find the right balance between bias and variance to achieve good performance on both the training and test sets.

Regularization techniques such as L1 and L2 regularization, dropout, and early stopping can be used to reduce variance and prevent overfitting, while techniques such as feature engineering and ensemble methods like boosting can be used to reduce bias and improve the model's accuracy.

**Example**

- **Low bias, low variance**: A model that fits the training data well and generalizes well to new data. It has both low training and test error. This is the ideal situation.


- **Low bias, high variance**: A model that fits the training data well but does not generalize well to new data. It has low training error but high test error. This is often caused by overfitting, where the model is too complex and captures noise in the training data.


- **High bias, low variance**: A model that does not fit the training data well and does not generalize well to new data. It has high training error and high test error. This is often caused by underfitting, where the model is too simple and cannot capture the underlying patterns in the data.


- **High bias, high variance**: A model that has both high training error and high test error, and whose predictions also change substantially from one training sample to another. The model misses the underlying pattern while still fitting noise in the data it was given. This is often seen in models with high complexity but insufficient data.
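The first three cases can be observed by comparing training and test error on synthetic data. The sketch below uses k-nearest-neighbour regression, chosen here only because it is short to write in plain Python. The sine target, the noise level, and the values of `k` are all assumptions made for this illustration: `k = 1` memorizes the training set (low bias, high variance), while `k` equal to the training set size always predicts the mean (high bias, low variance).

```python
import math
import random


def true_function(x):
    """Underlying pattern the models try to learn (assumed for this demo)."""
    return math.sin(2 * math.pi * x)


def knn_predict(train_x, train_y, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k


def mse(train_x, train_y, xs, ys, k):
    """Mean squared error of the k-NN model on the points (xs, ys)."""
    errors = [(knn_predict(train_x, train_y, x, k) - y) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)


rng = random.Random(0)  # fixed seed so the run is repeatable
noise_sd = 0.3          # assumed noise level

train_x = [i / 40 for i in range(40)]
train_y = [true_function(x) + rng.gauss(0, noise_sd) for x in train_x]
test_x = [(j + 0.5) / 200 for j in range(200)]
test_y = [true_function(x) + rng.gauss(0, noise_sd) for x in test_x]

results = {}
for k in (1, 5, len(train_x)):
    train_error = mse(train_x, train_y, train_x, train_y, k)
    test_error = mse(train_x, train_y, test_x, test_y, k)
    results[k] = (train_error, test_error)
    print(f"k={k:2d}  train MSE={train_error:.3f}  test MSE={test_error:.3f}")
```

With `k = 1` the training error is exactly zero while the test error is not, which is the overfitting pattern. With `k = 40` both errors are high, which is the underfitting pattern. An intermediate `k` sits between the two extremes.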


## What is irreducible error?

Irreducible error, also known as noise, is a type of error that cannot be reduced by improving the model or by increasing the amount of training data. It represents the inherent variability or randomness in the data that is beyond the control of the model.

Irreducible error arises from many sources, such as measurement errors, natural variations in the data, or unpredictable events that affect the target variable. It is also influenced by external factors that are not included in the model, such as environmental factors, social and economic conditions, and human behavior.

Since irreducible error cannot be reduced by the model, the best way to deal with it is to minimize its impact by collecting high-quality data, using appropriate preprocessing techniques, and selecting appropriate features that are relevant to the problem. In addition, it is important to acknowledge the existence of irreducible error and to evaluate the performance of the model in terms of its ability to capture the underlying patterns in the data while minimizing the impact of the noise.
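A small simulation can make this concrete: even a "perfect" model that knows the true underlying function still has an error roughly equal to the variance of the noise. The sketch below is an illustration under assumed conditions, namely a sine-shaped true function and Gaussian noise with a standard deviation of 0.5; neither comes from a real dataset.

```python
import math
import random

rng = random.Random(42)  # fixed seed so the run is repeatable
noise_sd = 0.5           # assumed noise level


def true_function(x):
    """Underlying pattern in the data (assumed for this demo)."""
    return math.sin(2 * math.pi * x)


# Observations = true pattern + random noise the model cannot predict.
xs = [rng.random() for _ in range(10_000)]
ys = [true_function(x) + rng.gauss(0, noise_sd) for x in xs]

# "Oracle" model: predicts with the true function itself.
oracle_mse = sum((true_function(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

print(f"oracle MSE     = {oracle_mse:.3f}")
print(f"noise variance = {noise_sd ** 2:.3f}")
```

No choice of model can push the expected squared error below the noise variance on this data, so that value acts as a floor when judging how much room for improvement remains.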


### Happy Coding