# Model Evaluation: Evaluation Metrics and Cross-Validation Methods

- After building a machine learning model, it's crucial to assess how well the model performs on unseen data. 

- Model evaluation helps ensure that the model generalizes well beyond the training set. 

- This step involves using various evaluation metrics and cross-validation methods to test the model’s performance.

## 1. Evaluation Metrics
Different types of machine learning tasks (classification, regression, etc.) require different evaluation metrics. Below are some common metrics for classification and regression models.

### Classification Metrics
1. Accuracy

- Definition: The ratio of correctly predicted instances to the total instances.
- Use Case: Best for balanced datasets, where the classes are equally represented.
- Formula: Accuracy = Correct Predictions / Total Predictions

           Accuracy = 10 [Correct Predictions] / 20 [Total Predictions] = 50%

- Example: If your model correctly predicts 90 out of 100 images as either cats or dogs, the accuracy is 90%.
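
The accuracy formula above can be checked with a few lines of code. This is a minimal sketch: the labels are hypothetical (1 = cat, 0 = dog, only 10 images for brevity), and `accuracy_score` from scikit-learn is used to confirm the manual calculation.

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels for illustration: 1 = cat, 0 = dog
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0, 1, 0]  # one mistake, at index 3

# Manual calculation: correct predictions / total predictions
correct = sum(t == p for t, p in zip(y_true, y_pred))
manual_accuracy = correct / len(y_true)

# The same quantity computed by scikit-learn
sk_accuracy = accuracy_score(y_true, y_pred)

print(f"Correct predictions: {correct} of {len(y_true)}")
print(f"Manual accuracy:  {manual_accuracy:.2f}")
print(f"sklearn accuracy: {sk_accuracy:.2f}")
```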


A note on dataset balance, using a loan prediction system that decides whether a loan is granted or not granted:

- Unbalanced (biased) dataset: 1000 records, with 800 male applicants and 200 female applicants (each labeled granted / not granted).
- Balanced (less biased) dataset: 1000 records, with 550 male applicants and 450 female applicants.

A model trained on the unbalanced dataset sees far fewer examples from the smaller group, so its predictions may be biased, and a single overall accuracy figure can hide this.


2. Precision, Recall, and F1-Score

Precision: Measures how many of the predicted positive instances are actually positive.

    Precision = True Positives / (True Positives + False Positives)

Let’s take the example of spam detection in emails:

- True Positive (TP): The model correctly identifies an email as spam (the email is actually spam and the model predicted it as spam).

Example: A marketing email gets correctly flagged as spam.

- False Positive (FP): The model incorrectly identifies a non-spam email as spam (the email is not spam, but the model flagged it as spam).

Example: An important work email gets flagged as spam incorrectly.

In general, a binary classifier has four possible outcomes:

- True Positive (TP): Actual positive, predicted positive.
- False Positive (FP): Actual negative, predicted positive.
- False Negative (FN): Actual positive, predicted negative.
- True Negative (TN): Actual negative, predicted negative.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Create synthetic true labels (y_true) and predictions (y_pred)
np.random.seed(42)  # For reproducibility
y_true = np.random.randint(0, 2, size=10)  # True labels (binary: 0 or 1)
y_pred = np.random.randint(0, 2, size=10)  # Model predictions (binary: 0 or 1)

print("True Labels (y_true):", y_true)
print("Predictions (y_pred):", y_pred)

# Calculate the confusion matrix and unpack its four cells
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Output the results
print(f"True Positives (TP): {tp}")
print(f"False Positives (FP): {fp}")
```

Output:

```
True Labels (y_true): [0 1 0 0 0 1 0 0 0 1]
Predictions (y_pred): [0 0 0 0 1 0 1 1 1 0]
True Positives (TP): 0
False Positives (FP): 4
```


### Recall (Sensitivity)
- Measures how many actual positives are correctly predicted.

    Recall = True Positives / (True Positives + False Negatives)

### F1-Score
- Harmonic mean of precision and recall. Useful for imbalanced datasets, where accuracy alone can be misleading.

    F1 = 2 * (Precision * Recall) / (Precision + Recall)
    
- Example: In medical diagnosis, where the model predicts the presence of a disease, precision tells you how often the model is correct when it says the disease is present, and recall tells you what fraction of the actual disease cases the model detects.
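
To make the three formulas concrete, here is a minimal sketch that computes precision, recall, and F1 by hand and compares them with scikit-learn's `precision_score`, `recall_score`, and `f1_score`. The spam labels below are hypothetical and chosen only so the counts are easy to follow.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels for illustration: 1 = spam, 0 = not spam
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# Count the outcomes needed by the formulas
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # spam caught
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # good mail flagged
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # spam missed

# Apply the formulas directly
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (precision * recall) / (precision + recall)

print(f"TP={tp}, FP={fp}, FN={fn}")
print(f"Manual:  precision={precision:.2f}, recall={recall:.2f}, F1={f1:.2f}")

# Cross-check against scikit-learn
sk_precision = precision_score(y_true, y_pred)
sk_recall = recall_score(y_true, y_pred)
sk_f1 = f1_score(y_true, y_pred)
print(f"sklearn: precision={sk_precision:.2f}, recall={sk_recall:.2f}, F1={sk_f1:.2f}")
```

With these labels the model catches 3 of 4 spam emails and raises 1 false alarm, so precision, recall, and F1 all come out to 0.75.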


### Confusion Matrix

- Definition: A table used to describe the performance of a classification model by displaying true positives, true negatives, false positives, and false negatives.

- Example: For binary classification, it helps identify how well the model differentiates between two classes.
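
As a sketch of how the table is laid out, the snippet below builds the full 2x2 matrix with scikit-learn's `confusion_matrix`. The label arrays are the ten true labels and predictions printed earlier in this document, hard-coded here so the example is self-contained. In scikit-learn's convention, rows are actual classes and columns are predicted classes.

```python
from sklearn.metrics import confusion_matrix

# The ten labels and predictions printed earlier in this document
y_true = [0, 1, 0, 0, 0, 1, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 1, 0, 1, 1, 1, 0]

# Rows = actual class, columns = predicted class, ordered as given in `labels`
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

print("             Pred 0  Pred 1")
print(f"Actual 0     {cm[0, 0]:>6}  {cm[0, 1]:>6}   (TN, FP)")
print(f"Actual 1     {cm[1, 0]:>6}  {cm[1, 1]:>6}   (FN, TP)")

# Flattening the matrix row by row yields TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()
```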

### ROC-AUC (Receiver Operating Characteristic - Area Under the Curve)

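- Definition: The ROC curve plots the true positive rate against the false positive rate at different classification thresholds. AUC, the area under that curve, summarizes the trade-off in a single number: 1.0 means the classes are perfectly separated, and 0.5 is no better than random guessing.


- Use Case: Useful when the model outputs probability estimates or scores and you want to evaluate them across all thresholds rather than at one fixed cutoff.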
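
The following sketch shows how ROC-AUC is computed from predicted scores rather than hard class labels. The four labels and scores are hypothetical; `roc_auc_score` and `roc_curve` are the scikit-learn functions for the area and the curve points.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and predicted probabilities of the positive class
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Area under the ROC curve
auc = roc_auc_score(y_true, y_score)
print(f"ROC-AUC: {auc:.2f}")

# Points of the curve: one (FPR, TPR) pair per threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t in zip(fpr, tpr):
    print(f"FPR={f:.2f}  TPR={t:.2f}")
```

AUC can also be read as the probability that a randomly chosen positive gets a higher score than a randomly chosen negative. Here 3 of the 4 positive/negative pairs are ranked correctly, which gives 0.75.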

### Regression Metrics


### Mean Absolute Error (MAE)

- Definition: The average of the absolute differences between the predicted and actual values.

- Use Case: Provides a clear idea of how far predictions deviate from actual values.

- Example: If the model predicts house prices, MAE will tell the average dollar amount by which the predicted price is wrong.
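
A minimal sketch of the MAE calculation, using three hypothetical house prices in dollars. The manual NumPy computation is compared with scikit-learn's `mean_absolute_error`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical house prices in dollars
actual = np.array([200_000, 350_000, 500_000])
predicted = np.array([210_000, 340_000, 530_000])

# Manual calculation: average of the absolute differences
mae_manual = np.mean(np.abs(actual - predicted))

# Same result using scikit-learn
mae_sklearn = mean_absolute_error(actual, predicted)

print(f"MAE (manual):  ${mae_manual:,.2f}")
print(f"MAE (sklearn): ${mae_sklearn:,.2f}")
```

The absolute errors are 10,000, 10,000, and 30,000 dollars, so on average the predictions are off by about $16,667.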

### Mean Squared Error (MSE)

- Definition: The average of the squared differences between the predicted and actual values.

- Use Case: Punishes larger errors more heavily than MAE.

- Example: Useful in regression problems like stock price predictions, where large errors can have bigger consequences.
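
To illustrate how MSE penalizes large errors, the sketch below compares two hypothetical sets of predictions that have the same MAE but very different MSE. The numbers are made up for the demonstration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical target values and two sets of predictions
y_true = np.array([100.0, 100.0, 100.0, 100.0])
pred_small_errors = np.array([110.0, 110.0, 110.0, 110.0])   # four errors of 10
pred_one_big_error = np.array([100.0, 100.0, 100.0, 140.0])  # one error of 40

for name, y_pred in [("small errors", pred_small_errors),
                     ("one big error", pred_one_big_error)]:
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    print(f"{name:>14}: MAE={mae:.1f}  MSE={mse:.1f}")

# Keep the values for comparison
mse_small = mean_squared_error(y_true, pred_small_errors)
mse_big = mean_squared_error(y_true, pred_one_big_error)
```

Both sets have an MAE of 10, but the MSE is 100 for the evenly spread errors and 400 when the same total error is concentrated in one prediction.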


### Root Mean Squared Error (RMSE)

- Definition: The square root of MSE, giving the error in the same units as the target variable.

- Use Case: A more interpretable version of MSE.

- Example: If predicting energy consumption, RMSE tells how off the predictions are in kilowatt-hours.
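
A short sketch of RMSE on hypothetical energy consumption readings in kilowatt-hours. RMSE is computed here as the square root of scikit-learn's `mean_squared_error`, which works across library versions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical energy consumption in kWh
actual_kwh = np.array([30.0, 45.0, 50.0, 60.0])
predicted_kwh = np.array([28.0, 48.0, 50.0, 56.0])

# MSE is in squared units (kWh^2); the square root brings it back to kWh
mse = mean_squared_error(actual_kwh, predicted_kwh)
rmse = np.sqrt(mse)

print(f"MSE:  {mse:.2f} kWh^2")
print(f"RMSE: {rmse:.2f} kWh")
```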

### R-Squared (R²)

- Definition: Measures the proportion of the variance in the dependent variable that is predictable from the independent variables.
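
One common way to write this definition is R² = 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the actual values from their mean. The sketch below, on hypothetical values, computes it by hand and compares the result with scikit-learn's `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actual and predicted values
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Residual sum of squares: variation the model fails to explain
ss_res = np.sum((y_true - y_pred) ** 2)

# Total sum of squares: variation around the mean of the actual values
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
r2_sklearn = r2_score(y_true, y_pred)

print(f"R^2 (manual):  {r2_manual:.4f}")
print(f"R^2 (sklearn): {r2_sklearn:.4f}")
```

A value close to 1 means the predictions explain most of the variance; a value of 0 means the model does no better than always predicting the mean.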


## 2. Cross-Validation Methods
Cross-validation helps to estimate the performance of a model on unseen data. It divides the data into subsets and evaluates the model on data it was not trained on, which helps detect overfitting or underfitting.

### Holdout Method
- Definition: The dataset is split into two sets: a training set and a test set.
- Drawback: The evaluation is dependent on a single split, which can lead to high variance.
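
A minimal sketch of the holdout method with scikit-learn's `train_test_split`. The dataset is synthetic (generated with `make_classification`), and the 80/20 split and the logistic regression model are assumptions made only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (illustration only)
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# Hold out 20% of the rows as the test set; stratify keeps the class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train on the training set only, then score on the held-out test set
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

print(f"Training rows: {len(X_train)}, test rows: {len(X_test)}")
print(f"Holdout accuracy: {test_accuracy:.2f}")
```

Changing `random_state` in the split will usually change the reported accuracy, which is the single-split variance described in the drawback above.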


### K-Fold Cross-Validation
- Definition: The data is split into K roughly equal-sized subsets, or folds. The model is trained K times, each time using a different fold as the test set and the remaining folds as the training set.
- How it Works:
    - Split the dataset into K parts (e.g., 5 or 10).
    - Train the model on K-1 parts, then test it on the remaining part.
    - Repeat the process K times, each time using a different fold as the test set.
    - Calculate the average of the evaluation metric across all K iterations.
- Example: For customer segmentation, the dataset is split into 5 folds, and the averaged score indicates whether the model generalizes well across all customer groups.

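In each iteration, one fold is held out for testing and the model is trained on the other folds. The roles then rotate until every fold has served as the test set once.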
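
The K-fold steps can be sketched as follows. The dataset is synthetic, and the choice of K = 5, shuffling, and a logistic regression model are assumptions for illustration. The explicit loop mirrors the steps listed above, and `cross_val_score` is scikit-learn's shortcut for the same procedure.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic binary classification data (illustration only)
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# Step 1: split the dataset into K = 5 folds
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# Steps 2-3: train on K-1 folds, test on the remaining fold, repeat K times
manual_scores = []
for train_idx, test_idx in kfold.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    manual_scores.append(model.score(X[test_idx], y[test_idx]))

# Step 4: average the metric across all K iterations
print("Fold accuracies:", np.round(manual_scores, 2))
print(f"Mean accuracy:   {np.mean(manual_scores):.2f}")

# The same procedure in one call
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=kfold, scoring="accuracy"
)
print(f"cross_val_score mean: {scores.mean():.2f}")
```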