# Model Validation

Model validation is a crucial step in the machine learning workflow that ensures the performance and reliability of a model before it is deployed in real-world applications. It involves assessing how well a model generalizes to unseen data, which helps in identifying potential issues such as overfitting or underfitting.



## Training and Test Sets

For effective model validation, the dataset is typically divided into at least two subsets:

1. **Training Set**: This subset is used to train the model. Training involves feeding the model with input data and corresponding labels so that it can learn patterns and relationships.

2. **Test Set**: This subset is used to assess the final performance of the model after training and validation. It provides an unbiased estimate of the model's generalization ability. <u>It is extremely important that the test set remains completely unseen during training; otherwise, the evaluation of the model's performance is no longer unbiased.</u>


<center><img src="https://fahadsultan.com/csc272_f23/_images/train_test.png" alt="Train, Validation, Test Split" width="100%" style="filter:invert(1)"/></center>

In some cases, a third subset called the **Validation Set** is also used during the training process to tune hyperparameters and make decisions about model architecture. However, in simpler workflows, cross-validation techniques can be employed instead of a separate validation set.

In sklearn, the `train_test_split` function from the `model_selection` module is commonly used to split the dataset into training and test sets. Here is an example:




```python
import pandas as pd
from sklearn.model_selection import train_test_split

url = "https://raw.githubusercontent.com/fahadsultan/csc272/main/data/elections.csv"

elections = pd.read_csv(url)

elections.head()
```

| Unnamed: 0 | Year | Candidate | Party | Popular vote | Result | % |
| ---: | ---: | :--- | :--- | ---: | :--- | ---: |
| 0 | 1824 | Andrew Jackson | Democratic-Republican | 151271 | loss | 57.210122 |
| 1 | 1824 | John Quincy Adams | Democratic-Republican | 113142 | win | 42.789878 |
| 2 | 1828 | Andrew Jackson | Democratic | 642806 | win | 56.203927 |
| 3 | 1828 | John Quincy Adams | National Republican | 500897 | loss | 43.796073 |
| 4 | 1832 | Andrew Jackson | Democratic | 702735 | win | 54.574789 |

```python
X = elections[['Year', 'Popular vote']]

y = elections['Result']
```

```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
```

```python
X_train.shape, y_train.shape
```

```
((127, 2), (127,))
```

```python
X_test.shape, y_test.shape
```

```
((55, 2), (55,))
```
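When a separate validation set is wanted, one common approach is to call `train_test_split` twice. The sketch below is only an illustration: the synthetic arrays `X_demo` and `y_demo` and the 60/20/20 proportions are assumptions, not part of the elections example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption): 100 rows, 2 features, binary labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = rng.integers(0, 2, size=100)

# Step 1: set aside 20% of the rows as the test set.
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Step 2: split the remaining 80 rows into training and validation sets.
# 25% of 80 rows is 20 rows, which gives a 60/20/20 split overall.
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

print(X_train_demo.shape, X_val_demo.shape, X_test_demo.shape)
# (60, 2) (20, 2) (20, 2)
```

The test rows are removed first so that nothing done with the training or validation sets can touch them.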




## Performance Metrics

Depending on the type of problem (classification, regression, etc.), different metrics are used to evaluate model performance. Common metrics include accuracy, precision, recall, and F1 score for classification tasks, and mean squared error (MSE) and R-squared for regression tasks.

The most common metric for evaluating a classifier is **accuracy**: the proportion of predictions that are correct, i.e., the number of correct predictions divided by the total number of predictions.

$$\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}$$

For example, if we have a test set of 100 documents, and our classifier correctly predicts the class of 80 of them, then the accuracy is 80%.

Accuracy is a good metric when the classes are _balanced_, i.e., $N_{\text{class1}} \approx N_{\text{class2}}$. However, when the classes are imbalanced, accuracy can be misleading. For example, if we have a test set of 100 documents, of which 95 are positive and 5 are negative, then a classifier that always predicts positive will have an accuracy of 95%. This classifier is nevertheless not useful, because it never identifies a negative document.
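The imbalanced example can be reproduced in a few lines. This is a minimal sketch; the label arrays are made up to match the 95/5 split described above, with 1 standing for positive and 0 for negative.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# 95 positive (1) and 5 negative (0) documents, as in the example above.
y_true = np.array([1] * 95 + [0] * 5)

# A "classifier" that always predicts positive.
y_pred = np.ones_like(y_true)

acc = accuracy_score(y_true, y_pred)
negatives_found = int(np.sum((y_pred == 0) & (y_true == 0)))

print(acc)              # 0.95
print(negatives_found)  # 0
```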

**Multi-class classification as multiple Binary classifications**

Every multi-class classification problem can be decomposed into multiple binary classification problems, for example by treating each class in turn as the positive class and all remaining classes as the negative class (one-vs-rest). A problem with 3 classes then becomes 3 binary classification problems.

<br/>

<img src="../assets/binary_multiclass.png" width="100%" style="filter:invert(1)"/>

<br/>
<br/>
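One way to picture this decomposition is a one-vs-rest relabeling, where each class gets its own 0/1 target vector. The sketch below uses made-up labels (`cat`, `dog`, `bird`) purely for illustration.

```python
import numpy as np

# Hypothetical multi-class labels with 3 classes.
y_multi = np.array(["cat", "dog", "bird", "dog", "cat"])

# One binary target per class: 1 where the label equals that class, 0 elsewhere.
classes = [str(c) for c in np.unique(y_multi)]
binary_targets = {c: (y_multi == c).astype(int) for c in classes}

for c, target in binary_targets.items():
    print(c, target)
# bird [0 0 1 0 0]
# cat [1 0 0 0 1]
# dog [0 1 0 1 0]
```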

Assuming the categorical variable that we are trying to predict is binary, we can define the accuracy in terms of the four possible outcomes of a binary classifier: 

1. True Positive (TP): The classifier correctly predicted positive for an instance whose true class is positive.
2. False Positive (FP): The classifier **incorrectly** predicted positive for an instance whose true class is negative.
3. True Negative (TN): The classifier correctly predicted negative for an instance whose true class is negative.
4. False Negative (FN): The classifier **incorrectly** predicted negative for an instance whose true class is positive.

In each name, the second word (positive or negative) is the class the classifier predicted, and the first word (true or false) says whether that prediction matched the truth.

These definitions are summarized in the table below: 

| Outcome | Prediction $\hat{y} = f'(x)$ | Truth $y = f(x)$ |
| :--- | :----: | :----: |
| True Negative (TN) | 0 | 0 |
| False Negative (FN) | 0 | 1 |
| False Positive (FP) | 1 | 0 |
| True Positive (TP) | 1 | 1 |



In terms of the four outcomes above, the accuracy is:

$$ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
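As a quick check of this formula, the four counts can be read off scikit-learn's `confusion_matrix`. The ten labels below are invented for illustration; for binary labels ordered `[0, 1]`, flattening the matrix yields the counts in the order TN, FP, FN, TP.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground truth and predictions (1 = positive, 0 = negative).
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# Rows are true classes and columns are predicted classes,
# so ravel() returns TN, FP, FN, TP for labels ordered [0, 1].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)

print(tp, fp, tn, fn)  # 4 1 4 1
print(accuracy)        # 0.8
```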

Accuracy is a useful metric, but as noted above, it can be misleading when the classes are imbalanced.

Other metrics that are often used to evaluate classifiers are: 

* **Precision**: The proportion of positive predictions that are correct. Mathematically, it is defined as:

$$\text{Precision} = \frac{TP}{TP + FP}$$

* **Recall**: The proportion of positive instances that are correctly predicted. Mathematically, it is defined as:

$$\text{Recall} = \frac{TP}{TP + FN}$$

Precision and recall are often combined into a single metric called the **F1 score**:

* **F1 Score**: The harmonic mean of precision and recall. Mathematically, it is defined as:

$$\text{F1-Score} = 2 \times \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
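These three metrics are available in `sklearn.metrics`. The following sketch uses a small made-up set of labels with TP = 2, FP = 1, and FN = 2, so the values can be checked by hand against the formulas above.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels: 4 positives and 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2 / 3
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 2 / 4
f1 = f1_score(y_true, y_pred)                # 2 * P * R / (P + R) = 4 / 7

print(round(precision, 3), round(recall, 3), round(f1, 3))
# 0.667 0.5 0.571
```

The F1 score sits below the arithmetic mean of precision and recall because the harmonic mean is pulled toward the smaller of the two values.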

<!-- $$Baseline\ Accuracy = \frac{Number\ of\ majority\ class\ predictions}{Total\ number\ of\ predictions}$$ -->

<!-- The baseline accuracy is the accuracy of the majority class classifier. It is the accuracy we would get if we just guessed the majority class for every instance. It is a useful baseline to compare our classifier to. If our classifier is not better than the baseline, then we should probably just use the baseline classifier.

Another way to evaluate a classifier is to look at the confusion matrix. A confusion matrix is a table that shows the number of correct and incorrect predictions for each class. For example, if we have a test set of 100 documents, and our classifier correctly predicts the class of 80 of them, then the accuracy is 80%. But if we had just guessed the majority class for all of them, we would have gotten 50% accuracy. This is called the baseline accuracy.

<img src="../assets/confusion_matrix.png">



<img src="../assets/classification.png">



<img src="../assets/training_testing.png">
 -->




## Cross-Validation

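- Cross-validation is a technique used to assess how well a model's results will generalize to an independent dataset. Common methods include k-fold cross-validation, where the data is split into k folds and each fold serves once as the held-out set while the model is trained on the remaining k-1 folds, and leave-one-out cross-validation, the special case in which every fold contains a single instance.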

<img src="../assets/cross_validation.png" width="100%" style="filter:invert(1)"/>
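A minimal sketch of k-fold cross-validation with scikit-learn's `cross_val_score` follows. The dataset (the bundled iris data), the logistic regression model, and the choice of 5 folds are assumptions made for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Small dataset that ships with scikit-learn (no download needed).
X_iris, y_iris = load_iris(return_X_y=True)

model = LogisticRegression(max_iter=1000)

# cv=5: the model is trained and scored 5 times, each time holding out
# a different fifth of the data. The default score for a classifier is accuracy.
scores = cross_val_score(model, X_iris, y_iris, cv=5)

print(scores)         # one accuracy value per fold
print(scores.mean())  # average accuracy across the folds
```

The mean across folds is usually reported together with the spread of the per-fold scores, since a single split can be lucky or unlucky.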

## Overfitting and Underfitting

- **Overfitting**: When a model learns the training data too well, including noise and outliers, leading to poor generalization on new data.

- **Underfitting**: When a model is too simple to capture the underlying patterns in the data, resulting in poor performance on both training and test sets.
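Both failure modes can often be spotted by comparing training accuracy with test accuracy. The sketch below is an illustration under stated assumptions: the data is synthetic and deliberately noisy, and decision trees of different depths stand in for "too simple" and "too flexible" models.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% label noise (flip_y), chosen for illustration.
X_syn, y_syn = make_classification(
    n_samples=500, n_features=10, n_informative=5, flip_y=0.2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=0
)

# max_depth=1 is a very simple model; max_depth=None lets the tree
# grow until it fits the training data, noise included.
results = {}
for depth in (1, 4, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    results[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))

for depth, (train_acc, test_acc) in results.items():
    print(f"max_depth={depth}: train={train_acc:.2f}, test={test_acc:.2f}")
```

A large gap between training and test accuracy points to overfitting, while low accuracy on both points to underfitting.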

## Hyperparameter Tuning

- Hyperparameter tuning is the process of optimizing hyperparameters, i.e., the settings chosen before training that govern the model or its training process (e.g., learning rate, number of trees in a random forest), to improve performance on the validation set.
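One common way to do this in scikit-learn is a grid search with cross-validation via `GridSearchCV`. The sketch below is illustrative only: the iris data, the random forest model, and the small parameter grid are all assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X_iris, y_iris = load_iris(return_X_y=True)

# Hold out a test set first; tuning only ever sees the training portion.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=42, stratify=y_iris
)

# Candidate hyperparameter values (a deliberately tiny, hypothetical grid).
param_grid = {"n_estimators": [10, 50], "max_depth": [2, None]}

# Each combination is scored with 5-fold cross-validation on the training data.
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X_tr, y_tr)

test_accuracy = search.score(X_te, y_te)  # evaluated once, after tuning

print(search.best_params_)
print(test_accuracy)
```

Here cross-validation plays the role of the validation set, and the test set is used only once at the end.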

## Baselines 

Establishing a baseline is an essential step in model validation. A baseline provides a reference point against which the performance of more complex models can be compared. It helps to determine whether a new model is actually improving upon simpler approaches.



## Baseline Models

1. **Simple Heuristic Baseline**:
   - A straightforward approach that uses basic rules or averages. For example, in a regression task, predicting the mean of the training targets for every instance.

2. **Random Baseline**:
   - A model that makes random predictions. This is often used to demonstrate that a more sophisticated model performs better than chance.

3. **Majority Class Baseline**:
   - In classification tasks, this baseline predicts the most frequent class in the training data for all instances.
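scikit-learn provides `DummyClassifier` for baselines of this kind. The sketch below compares a majority class baseline with a random one; the tiny made-up dataset (70% positive in training, 14 of 20 positive in testing) is an assumption for illustration, and the features are placeholders because dummy models ignore them.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels; the feature columns are placeholders.
y_train_b = np.array([1] * 70 + [0] * 30)
X_train_b = np.zeros((100, 1))
y_test_b = np.array([1] * 14 + [0] * 6)
X_test_b = np.zeros((20, 1))

# Majority class baseline: always predicts the most frequent training class.
majority = DummyClassifier(strategy="most_frequent").fit(X_train_b, y_train_b)

# Random baseline: predicts each class with equal probability.
random_guess = DummyClassifier(strategy="uniform", random_state=0).fit(
    X_train_b, y_train_b
)

majority_acc = majority.score(X_test_b, y_test_b)
random_acc = random_guess.score(X_test_b, y_test_b)

print(majority_acc)  # 0.7, since 14 of the 20 test labels are the majority class
print(random_acc)    # depends on the random draws
```

A trained model is worth keeping only if it clearly beats both numbers on the same test set.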