# Critical evaluation of models and strategies for healthcare applications

## Introduction to Model Performance Evaluation 

1. **Data Preparation**:
   - **80/20 or 70/30 Split**: Common splits for data where 70-80% is used for training and validation, and the rest is for testing.
   - **Importance of Test Set**: Set aside for final performance evaluation, especially critical for time series where training is on earlier data, and testing is on later data.

2. **Train-Validation-Test Split**:
   - **Train-Test Split**: Basic split for training and evaluating models.
   - **Train-Validation-Test Split**: Validation set is used for tuning hyperparameters. Often confused with the test set, but test set is only used for final evaluation after model training.
   - **Validation = Development Set**: Sometimes called the "dev set." Validation data is held out from weight fitting but is drawn from the same distribution as the training set.

3. **Hyperparameter Tuning**:
   - Multiple models are trained on the training set with different hyperparameters, and each is scored on the validation set.
   - The best-performing hyperparameters are chosen, and the resulting model is evaluated once on the test set.

4. **Cross-Validation (K-Fold)**:
   - Data is split into k subsets (folds).
   - Train on k-1 subsets and test on the remaining subset, rotating the test subset each time.
   - Useful when data is limited: it yields k performance estimates from different splits, which can be averaged.

5. **Terminology Clarifications**:
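   - Validation ≠ External validation. External validation means evaluating the finished model on data from a different source (e.g., another hospital or patient population).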
   - Validation set is for model development, not final testing.
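The splitting strategies above can be sketched with index lists. This is a minimal illustration using only the standard library; the function names `train_val_test_split` and `k_fold_indices`, the default fractions, and the `shuffle` flag are assumptions made for this example, not part of any particular library.

```python
import random


def train_val_test_split(n_samples, val_frac=0.15, test_frac=0.15,
                         shuffle=True, seed=0):
    """Return three disjoint lists of row indices: train, validation, test."""
    indices = list(range(n_samples))
    if shuffle:
        # For time series, pass shuffle=False so the latest rows end up in the test set.
        random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_frac))
    n_val = int(round(n_samples * val_frac))
    n_train = n_samples - n_val - n_test
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, held_out_idx) pairs, rotating the held-out fold."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        held_out = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, held_out
        start += size


# Chronological 70/15/15 split of 100 time-ordered rows.
train_idx, val_idx, test_idx = train_val_test_split(100, shuffle=False)
print(len(train_idx), len(val_idx), len(test_idx))
```

With `shuffle=False` the test indices are the last rows, matching the time-series advice above. In k-fold cross-validation every row is held out exactly once.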


## Overfitting and Underfitting

1. **Model Training & Generalization**:
   - Models start by finding weights that fit the training patterns and then progressively refine them.
   - **Overfitting**: Occurs when the model memorizes training data, including noise, which leads to poor performance on unseen data.
   - **Underfitting**: Happens when the model is too simple to capture the underlying structure of the data, leading to poor performance on both training and new data.

2. **Overfitting**:
   - Results from fitting noise and anomalies in the data.
   - More common with complex models (e.g., deep neural networks) and small datasets.
   - Leads to excellent training performance but poor generalization to unseen data (test/validation).
   - **Early stopping**: A common method to avoid overfitting by stopping training when validation error starts to increase.

3. **Underfitting**:
   - Occurs when the model is too simple (e.g., linear models on complex data).
   - Poor performance on both training and test sets.
   - Solutions include using more complex models or algorithms.

4. **Learning Curves**:
   - **Training Loss Curve**: Shows how well the model is fitting the training data over time.
   - **Validation Loss Curve**: Shows performance on unseen validation data.
   - **Good Fit**: Both training and validation loss decrease initially and then stabilize at similar values (plateau). The difference between the two curves is called the **generalization gap**; a small gap indicates a good fit.
   - **Underfitting Diagnosis**: Training and validation loss stay high and flat over time, indicating the model cannot reduce error.
   - **Overfitting Diagnosis**: Training loss decreases steadily, but validation loss starts increasing after a point, signaling memorization of noise in the training set.

5. **Debugging with Learning Curves**:
   - **Overfitting Signs**: Validation loss increases after an initial decrease while training loss keeps decreasing.
   - **Underfitting Signs**: Both training and validation loss curves remain high, indicating the model isn’t learning from the data.

6. **Performance Metrics**:
   - **Accuracy Curve**: Plotted alongside loss curves to visualize actual model performance.
   - While loss gives insights into learning, accuracy helps assess if the model is achieving desirable performance levels.

### Key Concepts:
- **Generalization Gap**: Difference between training loss and validation loss; a small gap is ideal.
- **Plateau**: The point where both loss curves stabilize, indicating that training has converged. A plateau at low loss with a small gap indicates a good fit; a plateau at high loss indicates underfitting.
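The early-stopping rule and the generalization gap described in this section can be sketched on recorded loss curves. The loss values below are invented for illustration, and the function names and the `patience` and `min_delta` parameters are assumptions for this example rather than a specific framework's API.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a recorded validation-loss curve.

    Training stops once the validation loss has failed to improve by more
    than min_delta for `patience` consecutive epochs. Epochs are 0-indexed.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # Patience was never exhausted: training ran to the last recorded epoch.
    return len(val_losses) - 1, best_epoch


def generalization_gap(train_losses, val_losses):
    """Per-epoch difference between validation loss and training loss."""
    return [v - t for t, v in zip(train_losses, val_losses)]


# Illustrative (made-up) curves showing the overfitting pattern:
# training loss keeps falling while validation loss turns upward.
train_losses = [1.00, 0.70, 0.50, 0.35, 0.25, 0.18, 0.12, 0.08]
val_losses = [1.05, 0.80, 0.65, 0.60, 0.62, 0.66, 0.71, 0.78]

stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
gaps = generalization_gap(train_losses, val_losses)
print(stop_epoch, best_epoch)
```

In practice the weights saved at `best_epoch` would be restored, since the later epochs only widened the gap.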


## Strategies to Address Overfitting, Underfitting and Introduction to Regularization 

### Addressing Underfitting

Underfitting happens when a model fails to capture the underlying patterns in the training data, leading to poor performance on both the training and validation sets. Here are some ways to address it:

1. **Train the model longer**: The model might need more time to learn meaningful patterns. Often, what seems like underfitting could simply be that the model hasn't been trained enough to capture the complex features in the data.
   
2. **Increase model capacity**: A model with low capacity, like a simple linear model, may not be complex enough to learn the relationships in the data. Neural networks, known as *universal function approximators*, can in principle approximate a very wide class of functions given enough capacity, so capacity can be raised in two ways:
   - Add more layers to the network.
   - Increase the number of neurons in each layer.

3. **Ensure data contains enough information**: In cases like clinical image data, downsampling images (e.g., from 2000x3000 to 224x224 pixels) can remove critical information, leading to underfitting. Ensure that preprocessing doesn’t strip away important features.

### Addressing Overfitting

Overfitting occurs when a model learns patterns too specific to the training set, leading to poor generalization to unseen data. Solutions include:

1. **Weight Decay (L1/L2 Regularization)**:
   - This adds a penalty on the size of the weights to the loss function. L2 regularization (the form usually called weight decay) penalizes squared weights and shrinks them toward zero; L1 regularization penalizes absolute values and tends to drive many weights to exactly zero. Large or numerous non-zero weights indicate a complex model, so either penalty constrains model complexity and encourages simpler, more general patterns.

2. **Dropout**:
   - Dropout randomly sets certain neurons’ output to zero during training. This forces the model to build redundancy into its structure, preventing it from relying too heavily on any one neuron and thus discouraging overly complex solutions.

3. **Data Augmentation**:
   - By altering the input data (rotating, cropping, flipping, adjusting brightness, etc.), data augmentation makes it harder for the model to memorize the training data. This technique encourages learning general features that are robust to variations in the input.
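Two of the techniques above, the L2 weight penalty and dropout, can be sketched in a few lines. This is a framework-free illustration with hypothetical function names. The dropout variant shown is "inverted" dropout, which rescales the surviving activations during training; that is a common convention, but the notes do not say which variant is intended.

```python
import random


def l2_penalized_loss(data_loss, weights, weight_decay=1e-2):
    """Add an L2 penalty (weight_decay * sum of squared weights) to the data loss."""
    return data_loss + weight_decay * sum(w * w for w in weights)


def dropout(activations, drop_prob=0.5, training=True, rng=None):
    """Inverted dropout.

    During training, each activation is zeroed with probability drop_prob and
    the survivors are divided by (1 - drop_prob) so the expected value is
    unchanged. At evaluation time the activations pass through untouched.
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    if not training or drop_prob == 0.0:
        return list(activations)
    rng = rng or random.Random()
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


# Same data loss, but larger weights cost more once the penalty is added.
small = l2_penalized_loss(1.0, [0.1, -0.2], weight_decay=0.1)
large = l2_penalized_loss(1.0, [1.0, -2.0], weight_decay=0.1)
print(small < large)

dropped = dropout([1.0] * 1000, drop_prob=0.5, rng=random.Random(0))
```

Because of the rescaling, the mean of `dropped` stays close to the mean of the input, so no correction is needed at inference time.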

### Regularization and Its Role

Regularization is a key strategy to control overfitting, where we impose constraints to ensure the model generalizes better to unseen data. Techniques like weight decay, dropout, and data augmentation help regulate the model's learning, making it more robust in real-world scenarios.


## Statistical Approaches to Model Evaluation

### Evaluating Model Performance in Medical Machine Learning: Key Metrics and Statistical Testing

After training a model and minimizing the loss, it's crucial to assess whether the model performs well enough to be actionable in medical contexts. Simple metrics like **accuracy** can be misleading, especially in imbalanced datasets, such as those used to detect rare diseases. A model predicting the majority class can achieve high accuracy while completely missing critical cases. Therefore, more nuanced metrics are needed, especially in healthcare where stakes are high.

#### 1. **Confusion Matrix**: 
A confusion matrix provides a breakdown of model predictions:
- **True Positives (TP)**: Correctly predicted positive cases.
- **True Negatives (TN)**: Correctly predicted negative cases.
- **False Positives (FP)**: Negative cases incorrectly predicted as positive (Type I error).
- **False Negatives (FN)**: Positive cases incorrectly predicted as negative (Type II error).

This matrix is the basis for calculating other important metrics.
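As a minimal sketch, the four cells can be counted directly from binary labels, assuming 1 denotes the positive (disease) class and 0 the negative class. The function name `confusion_counts` and the toy labels are illustrative.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = positive, 0 = negative)."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 0 and predicted == 0:
            tn += 1
        elif actual == 0 and predicted == 1:
            fp += 1  # healthy patient flagged as diseased
        else:
            fn += 1  # diseased patient missed
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
counts = confusion_counts(y_true, y_pred)
print(counts)  # {'TP': 2, 'TN': 4, 'FP': 1, 'FN': 1}
```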

#### 2. **Metrics for Evaluating Medical Models**:
Here are key performance metrics derived from the confusion matrix:
- **Accuracy**: $(TP + TN) / (TP + TN + FP + FN)$  
  Measures the overall correctness of the model, but may not reflect performance well on imbalanced datasets.
  
- **Sensitivity/Recall**: $TP / (TP + FN)$  
  Also known as **recall**, this measures the proportion of actual positives correctly identified. This is critical in healthcare to avoid missing true cases (e.g., correctly identifying disease cases).

- **Specificity**: $TN / (TN + FP)$  
  Measures the proportion of negatives correctly identified. High specificity is essential to minimize false alarms (e.g., predicting disease in healthy individuals).

- **Precision**: $TP / (TP + FP)$  
  Also called **positive predictive value**, this measures how often positive predictions are correct. High precision means that a patient flagged as positive is likely to truly have the disease.

- **Negative Predictive Value (NPV)**: $TN / (TN + FN)$  
  Measures how often negative predictions are correct.
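The formulas above translate directly into code. The sketch below assumes the four confusion-matrix counts are already available; the function name `clinical_metrics` and the example counts are hypothetical, and undefined ratios (zero denominators) are returned as NaN by choice.

```python
def clinical_metrics(tp, tn, fp, fn):
    """Compute the metrics listed above from confusion-matrix counts."""
    def ratio(numerator, denominator):
        # A metric is undefined when its denominator is zero.
        return numerator / denominator if denominator else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),   # recall
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),     # positive predictive value
        "npv": ratio(tn, tn + fn),
    }


# Hypothetical cohort of 1,000 patients with 100 true cases.
metrics = clinical_metrics(tp=80, tn=810, fp=90, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this made-up cohort sensitivity and specificity are both at least 0.8, yet fewer than half of the positive predictions are correct, which is why precision is reported separately.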

#### 3. **Imbalanced Datasets and the Need for Robust Metrics**:
In rare disease detection, accuracy can be misleading. For instance, if only 1% of cases are positive, a model that predicts everything as negative can still have 99% accuracy but be useless for detecting the disease. This is why sensitivity, specificity, precision, and other measures are crucial.
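The arithmetic behind this example can be checked with a hypothetical cohort of 1,000 patients at 1% prevalence and a "model" that always predicts the negative class:

```python
# Hypothetical cohort: 1,000 patients, 1% prevalence (10 true cases).
y_true = [1] * 10 + [0] * 990
# A "model" that always predicts the majority (negative) class.
y_pred = [0] * len(y_true)

correct = sum(1 for actual, pred in zip(y_true, y_pred) if actual == pred)
true_positives = sum(1 for actual, pred in zip(y_true, y_pred)
                     if actual == 1 and pred == 1)

accuracy = correct / len(y_true)
sensitivity = true_positives / sum(y_true)
print(f"accuracy={accuracy:.2f}, sensitivity={sensitivity:.2f}")
# accuracy=0.99, sensitivity=0.00
```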

#### 4. **Receiver Operating Characteristic (ROC) Curve**:
The ROC curve is a graphical representation that plots the **true positive rate** (sensitivity) against the **false positive rate** (1 - specificity) across all possible thresholds. It helps in evaluating the model's performance for different classification thresholds, beyond just relying on an arbitrary threshold like 0.5.

- **Area Under the ROC Curve (AUC)**: This single value summarizes the model's performance across all thresholds. AUC ranges from 0 to 1: 0.5 corresponds to random guessing, and values closer to 1 indicate better discrimination. In multi-class classification, you can compute one ROC curve per class by treating that class as positive and all others as negative (one-vs-rest).
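A minimal sketch of how an ROC curve and its AUC can be computed by sweeping the threshold over the observed scores. The function names and the four-sample toy data are illustrative, and the code assumes both classes are present. A sample is predicted positive when its score is at or above the threshold.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) points from sweeping the threshold high to low."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """Area under a piecewise-linear curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
points = roc_points(y_true, scores)
auc = auc_trapezoid(points)
print(points)
print(auc)  # 0.75
```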

#### 5. **Precision-Recall Curve**:
For imbalanced datasets, the **precision-recall curve** might be more informative than the ROC curve. It plots **precision** against **recall** for different thresholds. This curve focuses on the balance between catching positive cases and avoiding false positives, which is especially relevant when positive cases (such as disease) are rare.
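The points of a precision-recall curve can be obtained with the same kind of threshold sweep. The sketch below uses a hypothetical function name and toy data, and assumes at least one positive label. True negatives never enter the computation.

```python
def precision_recall_points(y_true, scores):
    """Return (threshold, precision, recall) for each distinct score, high to low."""
    n_pos = sum(y_true)
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        # tp + fp >= 1 here because the sample whose score equals the
        # threshold is always predicted positive.
        precision = tp / (tp + fp)
        recall = tp / n_pos
        points.append((threshold, precision, recall))
    return points


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
pr_points = precision_recall_points(y_true, scores)
for threshold, precision, recall in pr_points:
    print(f"threshold={threshold:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Lowering the threshold raises recall, while precision can move in either direction depending on which cases are newly flagged.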

### Key Considerations for Medical Machine Learning:
- **Threshold selection**: Choosing a classification threshold (e.g., 0.5) should be done carefully based on the model's intended use. For instance, in critical scenarios, a lower threshold may prioritize sensitivity (detecting true positives) over specificity.
  
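- **Prevalence of the condition**: The **positive predictive value (precision)** and **negative predictive value** depend strongly on the prevalence of the condition in the evaluated population, whereas sensitivity and specificity do not. When disease is rare, even a model with high sensitivity and specificity can have a low positive predictive value, because false positives from the large healthy group outnumber the true positives.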
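The effect of prevalence on predictive values follows from Bayes' rule. The sketch below is illustrative: the function name `predictive_values` and the 90% sensitivity and specificity figures are assumptions chosen for the example.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) implied by sensitivity, specificity, and prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    true_neg = specificity * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Same test characteristics, three different populations.
for prevalence in (0.01, 0.10, 0.50):
    ppv, npv = predictive_values(0.9, 0.9, prevalence)
    print(f"prevalence={prevalence:.2f} PPV={ppv:.3f} NPV={npv:.3f}")
```

At 1% prevalence the PPV is roughly 8%, even though sensitivity and specificity are both 90%; at 50% prevalence it rises to 90%.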

### Conclusion:
When evaluating medical machine learning models, relying on accuracy alone is inadequate. Metrics like recall, specificity, precision, and NPV, combined with tools like ROC curves, provide a deeper understanding of how the model performs, particularly in high-stakes, imbalanced settings common in healthcare.

# Important metrics for clinical machine learning

## Receiver Operating Characteristic and Precision-Recall Curves as Evaluation Metrics

The ROC (Receiver Operating Characteristic) curve is a key tool for evaluating the performance of a classification model, particularly in healthcare and biomedical sciences. It plots **sensitivity** (true positive rate) on the y-axis against **1 - specificity** (false positive rate) on the x-axis, providing insight into the model's performance across various thresholds, ranging from 0.0 to 1.0. 

The **area under the ROC curve** (AUC or AUROC) is often reported as a single metric that summarizes the model’s ability to distinguish between positive and negative classes. 
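One way to read the AUC is as a ranking probability: the chance that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it by comparing all positive-negative pairs. The function name and toy data are illustrative, and this quadratic-time version is meant for clarity, not for large datasets.

```python
def auc_by_ranking(y_true, scores):
    """AUC as the fraction of positive-negative pairs ranked correctly."""
    pos_scores = [s for y, s in zip(y_true, scores) if y == 1]
    neg_scores = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(pos_scores) * len(neg_scores))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc_by_ranking(y_true, scores))  # 0.75
```

Three of the four positive-negative pairs are ordered correctly, giving 0.75. A model that assigns the same score to every case gets 0.5.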

### Key Concepts:

1. **Random Classifier**:
   - A random classifier has an AUC of 0.5: at every threshold its true positive rate equals its false positive rate. This traces the diagonal of the ROC plot, reflecting random guessing.

2. **Perfect Classifier**:
   - A perfect classifier has an AUC of 1.0. Every positive case receives a higher score than every negative case, so some threshold separates the two classes with no errors. The ROC curve rises vertically from (0, 0) to (0, 1) and then runs horizontally to (1, 1).

3. **Real-World Classifiers**:
   - Most classifiers fall between these two extremes, with good classifiers achieving AUCs closer to 1.0 (e.g., 0.9). Their ROC curves bow toward the top-left corner without reaching it: at intermediate thresholds they achieve a true positive rate well above the false positive rate, though not a perfect separation.

4. **Threshold Selection**:
   - The choice of threshold is critical and varies based on the clinical context:
     - **Cancer Screening**: Prioritize sensitivity (true positives) to catch as many cases as possible, even at the cost of more false positives.
     - **Risky Treatment Decisions**: Prioritize specificity (avoiding false positives) to prevent unnecessary treatment for those who don’t need it.

5. **Operating Points**:
   - Each threshold corresponds to an operating point on the ROC curve, representing a specific trade-off between sensitivity and specificity. Comparing models often involves looking at AUCs, but operating points must be chosen based on clinical requirements.

6. **Utility-Based Thresholds**:
   - Instead of minimizing overall misclassifications, some models use utility-based thresholds to reflect different costs associated with false negatives or false positives, depending on the severity of the outcomes (e.g., missing a cancer diagnosis versus overdiagnosing).

7. **Precision-Recall Curves**:
   - In cases of **imbalanced datasets** (common in healthcare), precision-recall curves may be more informative than ROC curves. Precision-recall curves plot **precision** (the proportion of predicted positives that are true positives) on the y-axis against **recall** (sensitivity) on the x-axis.
   - Precision-recall curves focus on the true positives and ignore the true negatives, making them more suitable when positive examples are rare. 
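The utility-based threshold idea from the list above can be sketched as a search for the threshold with the lowest average misclassification cost. The function names, the cost values, and the toy data are assumptions for illustration; real costs would come from clinical judgment.

```python
def expected_cost(y_true, scores, threshold, cost_fn, cost_fp):
    """Average cost per patient when predicting positive for score >= threshold."""
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    return (cost_fn * fn + cost_fp * fp) / len(y_true)


def best_threshold(y_true, scores, cost_fn=1.0, cost_fp=1.0):
    """Pick the candidate threshold with the lowest expected cost."""
    # Candidates: every observed score, plus "never predict positive".
    candidates = sorted(set(scores)) + [float("inf")]
    return min(candidates,
               key=lambda t: expected_cost(y_true, scores, t, cost_fn, cost_fp))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# Screening setting: a missed case is assumed to cost 10x a false alarm.
screening = best_threshold(y_true, scores, cost_fn=10.0, cost_fp=1.0)
# Risky-treatment setting: a false alarm is assumed to cost 10x a missed case.
treatment = best_threshold(y_true, scores, cost_fn=1.0, cost_fp=10.0)
print(screening, treatment)  # 0.35 0.8
```

Weighting false negatives heavily pushes the operating point toward a lower threshold (higher sensitivity), while weighting false positives heavily pushes it toward a higher threshold (higher specificity).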

In summary, ROC curves and AUC are valuable for assessing classification models, but threshold choice and understanding the clinical context are crucial. Additionally, precision-recall curves may offer better insight in imbalanced datasets. Both tools are essential for evaluating and fine-tuning machine learning models in healthcare applications.