# Practice Activity: Applying PEFT

## Introduction
Parameter-efficient fine-tuning (PEFT) is a technique that reduces the computational cost and memory requirements of fine-tuning large pretrained models. Instead of updating all of the model’s parameters, PEFT focuses on fine-tuning a smaller subset of the model’s parameters while keeping most of the model's original weights frozen. This approach allows for faster training times and lower memory usage, making fine-tuning more feasible for large-scale models.

In this reading, we'll explore the key steps for applying PEFT to a pretrained model and the benefits of using this technique.

By the end of this activity, you will be able to:

*   Understand the concept of PEFT and its advantages.
*   Identify key parameters for fine-tuning and apply the PEFT technique to a pretrained model.
*   Implement a fine-tuning process with reduced computational cost and memory usage.
*   Evaluate and optimise the performance of the fine-tuned model using PEFT.

## Why Use PEFT?
Traditional fine-tuning methods require updating all of the model’s parameters, which can be computationally expensive, especially for large models such as GPT-3, BERT, or T5. PEFT offers several benefits:

*   **Reduced computational cost:** By only fine-tuning a subset of the model’s parameters, you can significantly reduce the amount of computational resources needed.
*   **Lower memory requirements:** Gradients and optimiser states are stored only for the trainable parameters, so PEFT uses less memory and makes it easier to fine-tune on smaller GPUs or machines with limited resources.
*   **Faster training times:** With fewer parameters to update, the training process is much faster, allowing for quicker iterations and experiments.
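
To make the memory argument concrete, here is a rough back-of-the-envelope sketch. It assumes 32-bit values and an Adam-style optimiser that keeps two extra values per trainable parameter, and it ignores activation memory. The helper `training_memory_bytes` and the parameter counts are illustrative assumptions, not measurements.

```python
def training_memory_bytes(total_params, trainable_params, bytes_per_value=4):
    """Rough memory for weights, gradients, and Adam state (activations ignored)."""
    weights = total_params * bytes_per_value              # every weight is stored
    gradients = trainable_params * bytes_per_value        # gradients only for trainable weights
    adam_state = 2 * trainable_params * bytes_per_value   # two moment estimates per trainable weight
    return weights + gradients + adam_state

TOTAL = 110_000_000       # assumed: roughly the size of a BERT-base model
HEAD_ONLY = 768 * 3 + 3   # assumed: a 3-class linear head on 768-dimensional features

full = training_memory_bytes(TOTAL, TOTAL)
peft = training_memory_bytes(TOTAL, HEAD_ONLY)
print(f"full fine-tuning: {full / 1e9:.2f} GB")
print(f"head-only tuning: {peft / 1e9:.2f} GB")
```

Under these assumptions, almost all of the saving comes from not storing gradients and optimiser state for the frozen weights.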

## Step-by-Step Process for Applying PEFT
This reading will guide you through the following steps:

1.  Prepare your data and identify the subset of parameters for fine-tuning
2.  Set up fine-tuning with PEFT
3.  Monitor and evaluate performance
4.  Optimise PEFT for your task

## Step 1: Prepare Your Data and Identify the Subset of Parameters for Fine-Tuning
Before beginning the fine-tuning process, ensure that your dataset is properly prepared. You should be working with a task-specific dataset (e.g., sentiment analysis, text classification) that aligns with the pretrained model you’ll be using. Preprocess the data, ensuring it’s tokenised and ready for input into the model. For this activity, we’ll assume you’re working with a classification task, but this process can also be adapted for other tasks.

**Instructions for Preparing Your Data:**

*   Ensure that your dataset is cleaned and preprocessed.
*   Tokenise the data using a tokenizer compatible with the pretrained model (e.g., BERT tokenizer for a BERT model).
*   Split your dataset into training, validation, and test sets.

Once your data is ready, the next step is identifying which parameters to fine-tune. In PEFT, we often fine-tune the parameters in the task-specific heads, which are the layers responsible for generating predictions based on the task. For models like BERT, the task-specific heads are the final few layers, usually the classification head.

**Locate the Task-Specific Heads:**
In a BERT-based model, task-specific heads typically refer to the layers at the end of the model used for tasks such as classification, where the model generates outputs based on the input data.
You can inspect the model architecture to find these heads and determine which layers are responsible for your task.

**Approach:**
To implement PEFT, you will freeze most of the model’s parameters, allowing only the parameters in the task-specific heads (final layers) to be updated. This strategy minimises computational cost while allowing the model to adapt to your specific task.

**Customise Fine-Tuning:**
You can also choose to fine-tune multiple layers if your task requires more adaptation. For example, you might fine-tune the last two or three layers instead of just the final classification head. This gives you more flexibility in training while still taking advantage of the efficiency of PEFT.

**Code Example:**
```python
# Load pre-trained BERT model
from transformers import BertForSequenceClassification
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=3)

# Step 1: Freeze all layers except the last one (classification head)
for param in model.base_model.parameters():
    param.requires_grad = False
```
This loop freezes all the layers except for the final classification head. If you wish to fine-tune more than just the last layer, you can modify the loop to unfreeze the last two or three layers for retraining.
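
The selection logic for unfreezing more than the head can be sketched on plain strings, so it runs without downloading a model. The helper `trainable_parameter_names` is hypothetical, and the example names assume the naming pattern used by Hugging Face BERT classification models (`bert.encoder.layer.<i>.` and `classifier.` prefixes); confirm the names on your own model with `model.named_parameters()`.

```python
def trainable_parameter_names(names, num_encoder_layers, last_n):
    """Return the names to leave trainable: the classification head
    plus the last `last_n` encoder layers."""
    keep_layers = range(num_encoder_layers - last_n, num_encoder_layers)
    # The trailing dot stops "layer.1." from matching "layer.10."
    prefixes = tuple(f"bert.encoder.layer.{i}." for i in keep_layers)
    return [n for n in names
            if n.startswith("classifier.") or n.startswith(prefixes)]

# Assumed sample of parameter names from a 12-layer BERT classifier
example_names = [
    "bert.embeddings.word_embeddings.weight",
    "bert.encoder.layer.0.attention.self.query.weight",
    "bert.encoder.layer.10.output.dense.weight",
    "bert.encoder.layer.11.attention.self.query.weight",
    "bert.pooler.dense.weight",
    "classifier.weight",
    "classifier.bias",
]

selected = trainable_parameter_names(example_names, num_encoder_layers=12, last_n=2)
print(selected)
```

With a loaded model, you could then iterate over `model.named_parameters()` and set `param.requires_grad` to `True` only for the selected names.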

## Step 2: Set Up Fine-Tuning with PEFT
Once you've identified the fine-tuning parameters, you can set up the process. For this example, we will use the Hugging Face Transformers library, which provides an easy interface for model fine-tuning.

**Instructions for Fine-Tuning with PEFT:**

*   Freeze the layers of the model (as shown in the previous code block).
*   Set up the fine-tuning process using Hugging Face’s `Trainer` class and `TrainingArguments`.
*   Fine-tune the model based on the trainer setup.

**Code Example:**
```python
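import numpy as np
from transformers import Trainer, TrainingArguments

def compute_metrics(eval_pred):  # makes trainer.evaluate() report 'eval_accuracy'
    logits, labels = eval_pred
    return {"accuracy": float((np.argmax(logits, axis=-1) == labels).mean())}

# Set training arguments for fine-tuning the model
training_args = TrainingArguments(
    output_dir='./results',             # Directory where results will be stored
    num_train_epochs=3,                 # Number of epochs (full passes through the dataset)
    per_device_train_batch_size=16,     # Batch size per GPU/CPU during training
    evaluation_strategy="epoch",        # Evaluate each epoch (named eval_strategy in recent releases)
)
# train_data and val_data are assumed names for the tokenised splits prepared in Step 1
trainer = Trainer(model=model, args=training_args, train_dataset=train_data,
                  eval_dataset=val_data, compute_metrics=compute_metrics)
trainer.train()  # Only parameters with requires_grad=True are updated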
```

**Note:**

*   The `Trainer` class from Hugging Face is responsible for setting up the fine-tuning process.
*   The line `trainer.train()` fine-tunes the model with PEFT, leveraging the frozen layers from Step 1.

## Step 3: Monitor and Evaluate Performance
After fine-tuning the model with PEFT, it is important to evaluate the model's performance and compare it to traditional fine-tuning methods. PEFT often achieves performance close to that of full fine-tuning at a lower computational cost, although the gap depends on the task and on how many parameters are left trainable.

**Evaluation:**
Use standard evaluation metrics (e.g., accuracy, F1 score) to monitor the fine-tuned model's performance on the validation and test sets.

**Code Example:**
```python
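# Evaluate the model on the held-out test set
# Note: 'eval_accuracy' is only present if the Trainer was created with a
# compute_metrics function that returns an 'accuracy' entry
results = trainer.evaluate(eval_dataset=test_data)
print(f"Test Accuracy: {results['eval_accuracy']}")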
```

## Step 4: Optimise PEFT for Your Task
PEFT can be further optimised for specific tasks by experimenting with different sets of parameters or layers to fine-tune. You can also try adjusting the learning rate or batch size to see how they impact the model’s performance.

**Optimisation Ideas:**

*   Fine-tune additional layers (e.g., the last two to three layers instead of just the final classification head).
*   Adjust hyperparameters such as learning rate and number of epochs to find the best configuration for your task.

**Code Example:**
```python
# Example of adjusting learning rate for PEFT optimisation
training_args = TrainingArguments(
    output_dir='./results',
    learning_rate=5e-5,  # Experiment with different learning rates
    num_train_epochs=5,
    per_device_train_batch_size=16,
)
```

## Conclusion
PEFT is an efficient method for fine-tuning large pretrained models, allowing you to save computational resources and time without sacrificing performance. By focusing on fine-tuning a subset of parameters, you can achieve task-specific improvements while keeping the rest of the model intact. This makes PEFT particularly useful when hardware resources are limited or rapid experimentation is needed.


# Practice activity: Applying QLoRA

## Introduction
Quantized Low-Rank Adaptation (QLoRA) is a fine-tuning technique designed to drastically reduce memory and computational requirements while maintaining model performance. It builds on Low-Rank Adaptation (LoRA) principles but adds quantization to the process, storing the frozen base model's weights at lower precision to further reduce their memory footprint. This allows even large-scale language models to be fine-tuned on smaller hardware, making them accessible for more practical use cases.

In this reading, we’ll explore how QLoRA works, its advantages, and the steps to apply it effectively to fine-tune pretrained models.

By the end of this reading, you will be able to:
- Describe how QLoRA combines quantization and low-rank adaptation for efficient fine-tuning.
- Apply QLoRA to a pretrained model to reduce memory and computational costs.
- Fine-tune a quantized low-rank model on task-specific data and evaluate its performance.
- Optimize QLoRA for specific tasks by adjusting quantization levels and rank values.

## Why use QLoRA?
Traditional fine-tuning approaches require updating all the parameters in a model, which can be resource-intensive, especially for large models. LoRA addresses this issue by introducing low-rank adaptations, but even LoRA can require significant memory for very large models. QLoRA enhances the fine-tuning process by applying quantization, which reduces the precision of the model's weights (e.g., from 32-bit to 8-bit or even 4-bit), lowering the memory and computational requirements. Quantizing a model involves approximating the model's weight values to lower-precision numbers, significantly reducing the memory footprint while preserving much of the model's performance. This makes fine-tuning feasible on smaller hardware such as consumer graphics processing units (GPUs).
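
The idea of approximating weights with lower-precision numbers can be illustrated with a minimal sketch. It uses plain symmetric "absmax" rounding to signed integers, which is a simplification chosen for illustration and not the 4-bit NormalFloat scheme described in the QLoRA paper. The function names and the sample weights are hypothetical.

```python
def quantize_absmax(weights, bits=8):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for 8 bits, 7 for 4 bits
    largest = max(abs(w) for w in weights)
    scale = largest / qmax if largest > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]

weights = [0.5, -1.0, 0.25, 0.0]               # assumed toy weight values
for bits in (8, 4):
    q, scale = quantize_absmax(weights, bits)
    restored = dequantize(q, scale)
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"{bits}-bit: stored={q}, worst rounding error={worst:.4f}")
```

Fewer bits mean a coarser grid of representable values, so the rounding error grows as the memory footprint shrinks.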

### Benefits of QLoRA
- **Lower memory requirements**: by quantizing model parameters, QLoRA reduces the memory needed for storing and processing large models.
- **Reduced computational costs**: similar to LoRA, QLoRA reduces the number of parameters that need to be fine-tuned, so gradients and optimizer states are kept only for the small adapter matrices.
- **Faster iteration**: because the model fits on smaller hardware, QLoRA makes rapid experiments practical, although dequantizing the weights during each forward pass can make individual training steps somewhat slower than with unquantized LoRA.

## Step-by-step guide to fine-tune with QLoRA
The remainder of this reading will guide you through the following steps:
1. Step 1: Data setup for QLoRA fine-tuning
2. Step 2: Apply QLoRA to a pretrained model
3. Step 3: Fine-tune the QLoRA-enhanced model
4. Step 4: Evaluate the QLoRA-fine-tuned model
5. Step 5: Optimize QLoRA for specific tasks

### Step 1: Data setup for QLoRA fine-tuning
To begin fine-tuning using QLoRA, you must set up your data properly. This includes preparing the dataset by splitting it into training, validation, and test sets. This step is crucial for ensuring that the model is trained effectively and can generalize well to unseen data.

**Steps**
- Collect or load the dataset you want to use for fine-tuning.
- Split the dataset into training (for model learning), validation (for tuning hyperparameters), and test sets (for evaluating performance).
- Preprocess the data by tokenizing it, ensuring that it aligns with the input format expected by the model.

### Step 2: Apply QLoRA to a pretrained model
To apply QLoRA, you quantize the pretrained model's weights and add low-rank adapters to specific layers, such as attention layers or feed-forward networks. The quantized base weights stay frozen; only the small adapter matrices, which are kept in higher precision, are trained.

Depending on the tooling, you can also choose which layers to quantize and which to adapt. You can experiment by targeting only certain layers, such as the attention layers or feed-forward networks, rather than all layers. This flexibility allows you to explore different configurations and adjust the setup to fit your specific task.

Both GPT-2 and BERT are pretrained transformer models widely used for natural language processing tasks. While GPT-2 is a generative model focusing on text generation, and BERT is optimized for tasks such as classification and question answering, they share a similar architecture based on the transformer model. This makes them both suitable candidates for QLoRA, demonstrating how the method can be applied to a variety of pretrained models.

**Steps**
- Load a pretrained model (e.g., GPT-2, BERT).
- Quantize the model to reduce precision.
- Apply LoRA to specific layers.
- Fine-tune the low-rank matrices while keeping the quantized base parameters frozen.

**Code example**
```python
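from transformers import BitsAndBytesConfig, GPT2ForSequenceClassification
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the pre-trained GPT-2 model with its weights quantized to 8 bits
# (uses the bitsandbytes backend, which typically needs a CUDA GPU)
model = GPT2ForSequenceClassification.from_pretrained(
    'gpt2',
    num_labels=3,  # assumed number of classes
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)
model.config.pad_token_id = model.config.eos_token_id  # GPT-2 defines no padding token

# Prepare the quantized model for training
quantized_model = prepare_model_for_kbit_training(model)

# Apply LoRA to specific layers (e.g., attention layers; c_attn is GPT-2's attention projection)
lora_config = LoraConfig(task_type="SEQ_CLS", r=8, lora_alpha=16, target_modules=["c_attn"])
quantized_model = get_peft_model(quantized_model, lora_config)
quantized_model.print_trainable_parameters()  # reports how few parameters remain trainable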
```

**Explanation**
In this example, the pretrained GPT-2 model is quantized to 8 bits, drastically reducing its memory requirements. LoRA is then applied to specific layers, such as attention heads, to ensure that only a small subset of parameters is fine-tuned.

### Step 3: Fine-tune the QLoRA-enhanced model
Once QLoRA is applied, the fine-tuning process begins. You will fine-tune the quantized model's low-rank matrices on your task-specific dataset, allowing the model to adapt to the task efficiently.

**Steps**
- Prepare the dataset by splitting it into training, validation, and test sets.
- Fine-tune the model, updating only the low-rank matrices.

**Code example**
```python
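import numpy as np
from transformers import Trainer, TrainingArguments

def compute_metrics(eval_pred):  # makes trainer.evaluate() report 'eval_accuracy'
    logits, labels = eval_pred
    return {"accuracy": float((np.argmax(logits, axis=-1) == labels).mean())}

# Set up training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    evaluation_strategy="epoch",  # named eval_strategy in recent transformers releases
)
# train_data and val_data are assumed names for the tokenized splits from Step 1
trainer = Trainer(model=quantized_model, args=training_args, train_dataset=train_data,
                  eval_dataset=val_data, compute_metrics=compute_metrics)
trainer.train()  # Only the trainable adapter parameters are updated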
```

**Explanation**
The model is fine-tuned using the Trainer API, but only the low-rank matrices are updated during training while the quantized base weights stay frozen, making the process more efficient compared to traditional fine-tuning.

### Step 4: Evaluate the QLoRA-fine-tuned model
After fine-tuning, it’s important to evaluate the model’s performance on the test set to determine how well it generalizes to unseen data. While quantization can sometimes introduce small performance trade-offs, QLoRA aims to balance efficiency with high performance.

**Code example**
```python
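# Evaluate the model on the test set
# Note: 'eval_accuracy' is only present if the Trainer was created with a
# compute_metrics function that returns an 'accuracy' entry
results = trainer.evaluate(eval_dataset=test_data)
print(f"Test Accuracy: {results['eval_accuracy']}")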
```

**Explanation**
After fine-tuning, the model is evaluated using the test set. Standard evaluation metrics such as accuracy, precision, recall, and F1 score can be used to assess the model’s performance.

### Step 5: Optimize QLoRA for specific tasks
You can optimize QLoRA by adjusting the rank of the low-rank matrices or experimenting with different quantization levels. You can find the best balance between model efficiency and performance for your specific task by tuning these parameters.

**Optimization ideas**
- Adjust the rank of the low-rank matrices (e.g., increasing or decreasing the rank).
- Experiment with different quantization levels (e.g., 4-bit or 8-bit quantization) to see how they affect the model’s performance.
- Consider experimenting with other parameters, such as dropout rate, learning rate, or layer-wise adaptation, to see how they influence fine-tuning results. This provides additional flexibility in customizing the model for task-specific requirements.

**Code example**
```python
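from peft import LoraConfig

# The rank is the `r` field of the LoRA configuration. To try another value,
# build a new config and wrap a freshly loaded quantized base model with get_peft_model.
lora_config = LoraConfig(task_type="SEQ_CLS", r=4, lora_alpha=8, target_modules=["c_attn"])  # Experiment with different rank values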
```
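
To see what different quantization levels mean for memory, the following sketch estimates the storage needed for the base weights alone at several bit widths. The parameter count is an assumption (roughly the size of the smallest GPT-2 model), `weight_memory_mb` is a hypothetical helper, and the estimate ignores the small overhead of the quantization constants.

```python
def weight_memory_mb(num_params, bits):
    """Approximate storage for the weights alone, in megabytes."""
    return num_params * bits / 8 / 1e6

NUM_PARAMS = 124_000_000  # assumed: approximately the smallest GPT-2 model

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_mb(NUM_PARAMS, bits):7.1f} MB")
```

Halving the bit width halves the weight storage, which is why 4-bit and 8-bit settings are the usual candidates to compare against each other.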

## Conclusion
QLoRA is an advanced fine-tuning technique that combines the benefits of quantization and low-rank adaptation. By reducing the memory and computational requirements, QLoRA makes it feasible to fine-tune large models even on consumer-grade hardware. With careful application, QLoRA can deliver efficient fine-tuning without sacrificing performance, making it ideal for resource-constrained environments.


# Practice activity: Applying LoRA

**Disclaimer:**
Azure libraries are regularly updated, and changes may occasionally affect the behavior of this exercise. If you experience any issues, consider rolling back the affected library to an earlier version to maintain compatibility. Always refer to official Microsoft documentation for the most current guidance.

## Introduction
Low-rank adaptation (LoRA) is a parameter-efficient fine-tuning technique that allows us to adapt large pretrained models to specific tasks with a substantial reduction in computational and memory costs. Instead of adjusting all model parameters, LoRA applies low-rank matrix modifications to key layers, such as attention heads, which means only a small subset of parameters needs to be fine-tuned. This method makes LoRA ideal for adapting large models to task-specific data without the significant resource demands of full model fine-tuning. In this reading, we’ll examine how LoRA functions, the steps for implementing it, and its benefits for fine-tuning large language models efficiently.

By the end of this reading, you will be able to:
- Describe the key concepts and benefits of low-rank adaptation (LoRA) in fine-tuning large models.
- Apply LoRA to a pretrained model for task-specific fine-tuning.
- Fine-tune a model using LoRA with minimized computational and memory resources.
- Evaluate and optimize the performance of a LoRA-fine-tuned model.

## Why use LoRA?
Traditional fine-tuning methods require adjusting all the parameters in a model, which is resource-intensive, especially for large transformer-based models like BERT, RoBERTa, and GPT. As models grow larger, the computational and memory costs of full fine-tuning increase substantially. LoRA addresses these challenges by adding trainable low-rank matrices to specific layers: the original weight matrices stay frozen, and only a pair of small matrices, whose product represents a low-rank update to those weights, is fine-tuned. The benefits of LoRA include the following:

- **Reduced memory usage**: LoRA drastically reduces the memory footprint by fine-tuning only low-rank matrices rather than all model parameters, making it ideal for environments with limited memory capacity.
- **Lower computational cost**: since fewer parameters are being optimized, LoRA requires less computation, reducing both time and energy consumption.
- **Faster training and experimentation**: with fewer parameters to update, LoRA shortens training time, enabling faster experimentation and quicker iterations for model improvement.

LoRA is particularly advantageous when working with large models in environments with constrained resources, such as edge devices or research environments in which computational budgets are limited. It also makes fine-tuning large models more feasible for a broader range of applications without requiring access to powerful hardware.

## Step-by-step process to fine-tune a model using LoRA
The remainder of this reading will guide you through the following steps:
1. Step 1: Prepare your dataset.
2. Step 2: Apply LoRA to the model.
3. Step 3: Fine-tune the model with LoRA.
4. Step 4: Evaluate the LoRA-fine-tuned model.
5. Step 5: Optimize LoRA for your task.

### Step 1: Prepare your dataset
Before you can fine-tune a model using LoRA, it’s essential to ensure that your dataset is preprocessed and structured correctly. Proper dataset preparation is key to achieving reliable performance during fine-tuning and evaluation.

**Instructions**
- **Clean and preprocess the data**: remove irrelevant entries, handle missing values, and standardize the text as needed to ensure the data is ready for processing.
- **Tokenize the data**: use a tokenizer compatible with your chosen model (e.g., a BERT tokenizer for BERT models). This step prepares the text for input into the model.
- **Split the dataset**: divide the dataset into training, validation, and test sets to allow for reliable performance evaluation. A typical split is 70 percent for training, 15 percent for validation, and 15 percent for testing.

By preparing the dataset carefully, you enable efficient fine-tuning and ensure that your model has access to high-quality, representative data for learning task-specific patterns.
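
The 70/15/15 split described above can be sketched with the standard library alone. The helper `split_dataset` and the toy data are hypothetical; in practice you would more likely rely on the splitting utilities of your dataset library.

```python
import random

def split_dataset(examples, train_frac=0.70, val_frac=0.15, seed=42):
    """Shuffle once with a fixed seed, then cut into train/validation/test."""
    items = list(examples)
    random.Random(seed).shuffle(items)        # fixed seed keeps the split reproducible
    n_train = round(len(items) * train_frac)
    n_val = round(len(items) * val_frac)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]            # whatever remains becomes the test set
    return train, val, test

examples = [f"example {i}" for i in range(100)]   # assumed toy dataset
train, val, test = split_dataset(examples)
print(len(train), len(val), len(test))
```

Shuffling before cutting avoids splits that merely reflect the order in which the data was collected.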

### Step 2: Apply LoRA to the model
Once you have prepared your dataset, you can modify specific layers of a pretrained model using LoRA. The goal is to introduce low-rank matrices to key layers, often the attention layers in transformer models. This modification allows you to fine-tune only the parameters of the low-rank matrices while keeping the rest of the model frozen, significantly reducing computational requirements.
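
What "introducing low-rank matrices" means for a single layer can be sketched in pure Python. The frozen weight `W` is left untouched, and a scaled product of two small matrices `B` and `A` is added to its output. The names, the toy dimensions, and the `alpha / r` scaling follow the usual presentation of LoRA but are written here as an illustrative assumption, not as any library's implementation.

```python
def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha):
    """Compute W x + (alpha / r) * B (A x), where r is the rank (rows of A)."""
    r = len(A)
    base = matvec(W, x)                    # frozen pretrained path
    update = matvec(B, matvec(A, x))       # trainable low-rank path
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

W = [[1, 0], [0, 1]]      # frozen 2x2 weight (toy values)
A = [[1, 1]]              # rank-1 "down" projection, shape r x d_in
B_zero = [[0], [0]]       # B starts at zero, so training begins from the pretrained model
B_trained = [[1], [2]]    # assumed values after some training
x = [1, 2]

print(lora_forward(W, A, B_zero, x, alpha=2))
print(lora_forward(W, A, B_trained, x, alpha=2))
```

Only `A` and `B` would receive gradients during fine-tuning, which is where the parameter savings come from.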

**Instructions for preparation**
- **Ensure dataset readiness**: confirm that you have preprocessed and tokenized the dataset as outlined in Step 1.
- **Understand the model’s architecture**: review the structure of the model you’re working with, typically a transformer such as BERT or GPT, to identify layers where you can apply LoRA.
- **Identify relevant layers**: in transformer-based models, attention layers are often the primary targets for LoRA because they manage most of the information flow in these architectures. By printing out the model’s named modules, you can identify the specific attention layers where LoRA can be introduced. These layers typically have "attention" in their names.

**Approach**
- **Load the pretrained model**: start with a pretrained model such as BERT to leverage its existing language understanding capabilities.
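- **Apply LoRA to attention layers**: use a LoRA library, such as Hugging Face PEFT (`LoraConfig` with `get_peft_model`), to add low-rank matrices only to the attention layers.
- **Freeze remaining parameters**: keep all other parameters in the model frozen so that only the LoRA-modified layers are adjusted during training. PEFT does this for you when it wraps the model.

**Code example**
```python
from peft import LoraConfig, get_peft_model
from transformers import BertForSequenceClassification

# Load a pre-trained BERT model for classification tasks
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=3)

# Print model layers to identify attention layers where LoRA can be applied
for name, module in model.named_modules():
    print(name)  # This output helps locate attention layers

# Add LoRA matrices to the attention query and value projections; the other weights are frozen
lora_config = LoraConfig(task_type="SEQ_CLS", r=8, lora_alpha=16, lora_dropout=0.1,
                         target_modules=["query", "value"])
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # reports how few parameters remain trainable
```

**Explanation**
- `print(name)`: prints each model component to help locate the attention layers where LoRA can be applied.
- `LoraConfig(...)`: sets the rank (`r`), the scaling factor (`lora_alpha`), and the modules to adapt (`target_modules`), here the query and value projections inside BERT's attention layers. The specific values are starting points to experiment with.
- `get_peft_model(model, lora_config)`: inserts the LoRA matrices and freezes all other parameters (the equivalent of setting `param.requires_grad = False`), meaning only LoRA-modified layers will be fine-tuned. With `task_type="SEQ_CLS"`, the classification head is also kept trainable.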

This setup enables a targeted fine-tuning approach, in which only specific, low-rank parameters are adjusted, minimizing resource use.

### Step 3: Fine-tune the model with LoRA
With LoRA applied to specific layers, you’re ready to fine-tune the model on your task-specific dataset. The goal is to update only the low-rank matrices in the attention layers, optimizing them for the task while keeping the rest of the model’s parameters static.

**Approach**
- **Start training**: fine-tune the model using the prepared dataset from Step 1.
- **Monitor progress**: use the validation dataset to track the model’s performance during training.
- **Focus on LoRA layers**: since LoRA was applied to the attention layers, only the low-rank matrices in these layers will be updated during training, reducing overall computational demand.

**Code example**
```python
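import numpy as np
from transformers import Trainer, TrainingArguments

def compute_metrics(eval_pred):  # makes trainer.evaluate() report 'eval_accuracy'
    logits, labels = eval_pred
    return {"accuracy": float((np.argmax(logits, axis=-1) == labels).mean())}

# Configure training parameters
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    evaluation_strategy="epoch",  # named eval_strategy in recent transformers releases
)
# train_data and val_data are assumed names for the tokenized splits from Step 1
trainer = Trainer(model=model, args=training_args, train_dataset=train_data,
                  eval_dataset=val_data, compute_metrics=compute_metrics)
trainer.train()  # Updates only the LoRA-modified layers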
```

**Explanation**
- `TrainingArguments(...)`: specifies key training parameters, such as the number of epochs, batch size, and evaluation frequency.
- `Trainer(...)`: initializes the trainer, linking it to the model, training arguments, and datasets.
- `trainer.train()`: starts the fine-tuning process, which updates only LoRA-modified layers.

By focusing on just the low-rank matrices, you achieve efficient task-specific fine-tuning without the overhead of updating the entire model.

### Step 4: Evaluate the LoRA-fine-tuned model
After fine-tuning, evaluate the model’s performance using standard metrics such as accuracy, F1 score, and precision/recall. Since LoRA optimizes only a small subset of parameters, memory and computational costs are reduced, yet the model can still deliver performance that rivals traditional fine-tuning.

**Code example**
```python
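# Evaluate the LoRA fine-tuned model on the test set
# Note: 'eval_accuracy' is only present if the Trainer was created with a
# compute_metrics function that returns an 'accuracy' entry
results = trainer.evaluate(eval_dataset=test_data)
print(f"Test Accuracy: {results['eval_accuracy']}")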
```

**Explanation**
- `trainer.evaluate(...)`: runs an evaluation on the test dataset.
- `results['eval_accuracy']`: retrieves the test accuracy, indicating how well the model generalizes to unseen data.

This evaluation step confirms the model’s effectiveness and highlights the efficiency gains from fine-tuning only low-rank matrices, which helps maintain strong performance despite reduced computational overhead.

### Step 5: Optimize LoRA for your task
To achieve even better results, consider experimenting with the rank of the low-rank matrices in LoRA. By adjusting the rank, you can control the number of parameters in the low-rank matrices, balancing the trade-off between computational efficiency and model performance. A higher rank can capture more complexity but may require additional resources, while a lower rank further reduces resource demands.
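
The effect of the rank on parameter count follows from the shapes of the two low-rank matrices: for a weight matrix of shape `d_out x d_in`, LoRA trains one `r x d_in` matrix and one `d_out x r` matrix. The sketch below assumes a square 768 x 768 projection, as found in BERT-base-sized models; `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable values in the two low-rank matrices for one adapted layer."""
    return rank * (d_in + d_out)

d = 768                    # assumed hidden size
full_matrix = d * d        # parameters in the frozen weight matrix

for rank in (2, 4, 8, 16):
    n = lora_param_count(d, d, rank)
    print(f"rank={rank:>2}: {n:>6} LoRA parameters "
          f"({n / full_matrix:.2%} of the {full_matrix} in the full matrix)")
```

The count grows linearly with the rank, so doubling the rank doubles the adapter's size while it remains a small fraction of the full matrix.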

**Optimization ideas**
- **Adjust the rank**: experiment with different ranks in the low-rank matrices to find an optimal balance for your specific task.
- **Extend LoRA application**: apply LoRA to additional layers to capture more complex task-specific features.

**Code example**
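```python
# Example of adjusting the rank in LoRA
from peft import LoraConfig, get_peft_model
from transformers import BertForSequenceClassification

# The rank is fixed when the adapter is created, so reload the base model and wrap it again
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=3)
# Set a lower rank for fine-tuning, experiment with values for optimal performance
lora_config = LoraConfig(task_type="SEQ_CLS", r=2, lora_alpha=4, target_modules=["query", "value"])
model = get_peft_model(model, lora_config)
```

**Explanation**
- `LoraConfig(..., r=2)`: sets a lower rank for LoRA, which further reduces the number of parameters involved in fine-tuning. Repeat the experiment with different ranks to find the value that works best for your task.

Adjusting the rank in this way lets you trade adapter capacity against resource use and find the balance that suits your task.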

## Conclusion
LoRA provides a resource-efficient alternative to traditional full model fine-tuning, allowing large pretrained models to be tailored to specific tasks with a fraction of the computational cost. By fine-tuning only low-rank update matrices within key layers, LoRA enables significant reductions in memory and computational demands while retaining effective performance. This technique is particularly valuable for applications in resource-constrained environments or when experimenting with large models on specialized tasks. By following this guide, you have learned how to apply LoRA to fine-tune models efficiently, making it feasible to leverage powerful language models in various real-world applications without the prohibitive resource requirements typically associated with full fine-tuning.


# Evaluating fine-tuned models

## Introduction
After fine-tuning a pretrained model, it is critical to evaluate its performance on a task-specific dataset. Evaluation helps determine how well the model has adapted to the new task and whether it can generalize to unseen data. In this reading, we will cover key metrics and methods for evaluating fine-tuned models, including accuracy, precision, recall, and F1 score.

By the end of this reading, you will be able to:
- Explain the importance of evaluating fine-tuned models on unseen data.
- Use key evaluation metrics such as accuracy, precision, recall, and F1 score to assess model performance.
- Recognize signs of overfitting and underfitting during fine-tuning.
- Compare the effectiveness of different fine-tuning techniques, including traditional fine-tuning, LoRA, and QLoRA.
- Optimize the trade-off between performance and resource efficiency when evaluating fine-tuning methods.

## Why evaluation matters
The goal of fine-tuning is to adapt a general-purpose, pretrained model to perform well on a specific task. However, even after fine-tuning, there is no guarantee that the model will perform optimally. Evaluating the model is necessary to:
- Ensure the model can generalize to unseen data.
- Identify potential overfitting or underfitting issues.
- Compare the performance of different fine-tuning techniques, such as traditional fine-tuning, LoRA, or QLoRA.

## Key metrics for evaluation

### Accuracy
Accuracy measures the proportion of correctly predicted instances out of the total instances. This is a common metric for classification tasks.

**Formula:**
$$Accuracy = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}$$

**Example:**
Imagine a binary classification task with 100 instances:
- Correct predictions: 80
- Total predictions: 100

$$Accuracy = \frac{80}{100} = 0.8 \text{ (or } 80\%)$$

**When to use:** Accuracy is useful when class distribution is balanced and the cost of false positives and false negatives is roughly the same.
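
As a minimal sketch, accuracy can be computed directly from two lists of labels. The `accuracy` helper and the toy labels are illustrative; the data is constructed to reproduce the 80-out-of-100 example above.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

y_true = [1] * 100                 # assumed toy labels
y_pred = [1] * 80 + [0] * 20       # 80 correct predictions, 20 incorrect
print(accuracy(y_true, y_pred))
```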

### Precision
Precision measures how many of the model's positive predictions are actually correct. Precision is especially useful when false positives are costly.

**Formula:**
$$Precision = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$$

**Example:**
Imagine a binary classification task in which the model labels 100 instances as positive:
- True positives (correct positive predictions): 80
- False positives: 20

$$Precision = \frac{80}{80 + 20} = \frac{80}{100} = 0.8$$

**When to use:** Precision is important in tasks in which minimizing false positives is more critical than false negatives, such as spam detection or fraud detection.

### Recall
Recall (also known as sensitivity) measures how many actual positives the model successfully identifies. It is particularly useful when minimizing false negatives is essential.

**Formula:**
$$Recall = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$$

**Example:**
In a spam detection system:
- True positives: 70 (emails correctly classified as spam)
- False negatives: 20 (spam emails incorrectly classified as non-spam)

$$Recall = \frac{70}{70 + 20} = \frac{70}{90} \approx 0.778$$

**When to use:** Recall is essential in tasks in which false negatives are more costly than false positives, such as medical screening or safety-critical fault detection, where missing a true positive case is worse than raising a false alarm.

### F1 score
The F1 score is the harmonic mean of precision and recall, providing a balanced metric when both false positives and false negatives matter.

**Formula:**
$$F1 = 2 \times \left( \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \right)$$

**When to use:** The F1 score is useful when you need a balance between precision and recall, particularly in imbalanced datasets.
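
A short sketch of the formula, using the precision (0.8) and recall (70/90) from the earlier examples purely as illustrative inputs, even though those examples describe different scenarios. The `f1_score` helper is hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

precision = 0.8        # from the precision example
recall = 70 / 90       # from the recall example
print(f"F1 = {f1_score(precision, recall):.3f}")
```

Because the harmonic mean is pulled toward the smaller of the two values, a model cannot reach a high F1 score by excelling at only one of precision or recall.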

### Confusion matrix
A confusion matrix shows the true positives, true negatives, false positives, and false negatives in a table format. It helps visualize the model’s performance.

| | Predicted positive | Predicted negative |
|---|---|---|
| **Actual positive** | True positive (TP) | False negative (FN) |
| **Actual negative** | False positive (FP) | True negative (TN) |

**When to use:** A confusion matrix is valuable for understanding where the model is making errors and identifying class imbalances.
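
The four cells of the table can be counted with a few lines of Python, and precision and recall then follow from the counts. The helper `confusion_counts` and the toy labels are assumptions made for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for a binary task."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1      # predicted positive, actually positive
        elif p == positive:
            fp += 1      # predicted positive, actually negative
        elif t == positive:
            fn += 1      # predicted negative, actually positive
        else:
            tn += 1      # predicted negative, actually negative
    return tp, fp, fn, tn

y_true = [1, 1, 1, 0, 0, 0, 1, 0]   # assumed toy labels
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision={tp / (tp + fp):.2f} recall={tp / (tp + fn):.2f}")
```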

## Evaluating performance on unseen data
When evaluating fine-tuned models, it is crucial to test their performance on a test set that the model has not seen during training or validation. This provides an unbiased measure of the model’s ability to generalize to real-world data.

- **Validation set:** used during fine-tuning to tune hyperparameters and monitor performance.
- **Test set:** used after fine-tuning to evaluate the model’s final performance on unseen data.
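One simple way to keep the three sets separate is to shuffle the data once and cut it into disjoint slices. The sketch below assumes a 10 percent validation and 10 percent test fraction and a fixed seed; these values and the function name `split_dataset` are illustrative choices, not recommendations from this reading.

```python
import random


def split_dataset(examples, val_fraction=0.1, test_fraction=0.1, seed=42):
    """Shuffle once, then cut into disjoint train/validation/test lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


examples = list(range(100))  # stand-in for real (input, label) pairs
train, val, test = split_dataset(examples)
print(len(train), len(val), len(test))
```

The test slice should be set aside and used only once, after hyperparameters have been chosen on the validation slice.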

## Overfitting and underfitting
Two key risks during fine-tuning are overfitting and underfitting.

### Overfitting
Overfitting occurs when the model performs well on the training data but poorly on unseen data. This happens when the model has memorized the training set instead of learning features that generalize to new data.

- **Signs of overfitting:** High accuracy on the training set but low accuracy on the validation or test sets.
- **Solutions:** Use regularization techniques (e.g., dropout), reduce model complexity, or use data augmentation.

### Underfitting
Underfitting happens when the model performs poorly on both the training and test data. This indicates that the model is too simple to capture the underlying patterns in the data.

- **Signs of underfitting:** Low accuracy on both training and test sets.
- **Solutions:** Increase the complexity of the model, train for more epochs, or provide more informative training data (adding more data alone rarely helps if the model is too simple).
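The signs described above can be turned into a rough first-pass check that compares training and validation accuracy. In this sketch, the 0.10 gap and the 0.70 floor are arbitrary illustrative thresholds, and `diagnose_fit` is a hypothetical helper; sensible values depend on the task.

```python
def diagnose_fit(train_acc: float, val_acc: float,
                 gap_threshold: float = 0.10,
                 low_threshold: float = 0.70) -> str:
    """Rough heuristic; both thresholds are illustrative assumptions."""
    if train_acc < low_threshold and val_acc < low_threshold:
        return "possible underfitting"  # poor on both sets
    if train_acc - val_acc > gap_threshold:
        return "possible overfitting"   # large train/validation gap
    return "no obvious problem"


print(diagnose_fit(0.98, 0.75))
print(diagnose_fit(0.62, 0.60))
print(diagnose_fit(0.91, 0.89))
```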

## Comparing techniques: Traditional fine-tuning, LoRA, and QLoRA
When comparing the performance of different fine-tuning techniques, you should consider both the model’s performance metrics and the resource efficiency of each technique.

- **Performance:** compare accuracy, F1 score, precision, and recall across techniques.
- **Efficiency:** consider training time, memory usage, and computational cost.

For example, if comparing traditional fine-tuning, LoRA, and QLoRA techniques:
- **Traditional fine-tuning:** typically achieves high performance but requires significant memory and time.
- **LoRA:** reduces memory usage by fine-tuning only low-rank matrices, often without a major loss in performance.
- **QLoRA:** combines quantization with low-rank adaptation, further reducing memory usage while maintaining competitive performance.
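To get a feel for the efficiency difference, the arithmetic can be sketched for a single weight matrix. LoRA trains two low-rank matrices of shapes d × r and r × k in place of a full d × k update, so it trains r(d + k) values instead of d × k. The layer size (4096), rank (8), and 7-billion-parameter model size below are assumed values for illustration, and the memory estimate covers frozen base weights only, ignoring activations, optimizer state, and quantization overhead.

```python
def full_params(d: int, k: int) -> int:
    """Trainable values when the whole d x k matrix is updated."""
    return d * k


def lora_params(d: int, k: int, r: int) -> int:
    """Trainable values for a rank-r pair of matrices (d x r and r x k)."""
    return r * (d + k)


def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


d = k = 4096  # assumed layer size
r = 8         # assumed LoRA rank

full = full_params(d, k)
lora = lora_params(d, k, r)
print(f"full: {full:,}  LoRA: {lora:,}  ratio: {lora / full:.4%}")

n = 7e9  # assumed model size
print(f"16-bit base weights: {weight_memory_gb(n, 16):.1f} GB")
print(f" 4-bit base weights: {weight_memory_gb(n, 4):.1f} GB")
```

Under these assumptions, LoRA trains well under 1 percent of the values in the layer, and storing the frozen base weights in 4 bits rather than 16, as QLoRA does, cuts their footprint to a quarter.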

## Conclusion
Evaluating fine-tuned models is a critical step in understanding their performance and generalization ability. By using a combination of metrics such as accuracy, precision, recall, and F1 score, you can get a comprehensive picture of how well your model performs on the task at hand. It’s also important to assess the model’s resource efficiency, particularly when comparing different fine-tuning techniques such as LoRA and QLoRA.


# Detailed explanation of evaluation metrics

## Introduction
When evaluating a fine-tuned model, it is essential to use appropriate metrics to understand its performance. Different metrics can provide insights into various aspects of model behavior, such as its ability to classify correctly, its sensitivity to certain classes, and how well it generalizes to new data. In this reading, we will take a detailed look at the most commonly used evaluation metrics: accuracy, precision, recall, F1 score, and others such as the confusion matrix, receiver operating characteristic–area under the curve (ROC-AUC), loss, and specificity.

By the end of this reading, you will be able to:
- Explain the importance of evaluating fine-tuned models using different metrics.
- Identify when to use various metrics, such as accuracy, precision, recall, and F1 score, based on the specific goals of the task.
- Interpret confusion matrices and ROC-AUC curves to visualize model performance.
- Explain loss and specificity metrics and how they relate to model learning.
- Select the most appropriate evaluation metrics for balanced, imbalanced, or cost-sensitive tasks.

## Evaluation metrics explained
Explore the following evaluation metrics:
- Evaluation Metric 1: Accuracy
- Evaluation Metric 2: Precision
- Evaluation Metric 3: Recall (sensitivity or true positive rate)
- Evaluation Metric 4: F1 score
- Evaluation Metric 5: Confusion matrix
- Evaluation Metric 6: Specificity (true negative rate)
- Evaluation Metric 7: ROC-AUC
- Evaluation Metric 8: Loss

### Evaluation Metric 1: Accuracy
**Definition**
Accuracy measures the proportion of correctly classified instances out of the total number of instances. It is the most straightforward metric for classification tasks.

**Formula**
$$Accuracy = \frac{TP + TN}{TP + FP + TN + FN}$$

Where:
- **TP (true positives):** correct positive predictions
- **TN (true negatives):** correct negative predictions
- **FP (false positives):** incorrect positive predictions
- **FN (false negatives):** incorrect negative predictions

**When to use**
Accuracy is a good metric when the class distribution is balanced. However, in the case of imbalanced datasets (in which one class occurs far more frequently than another), accuracy can be misleading. For example, in a dataset in which 90 percent of instances are of class A and only 10 percent are of class B, a model that always predicts class A would achieve 90 percent accuracy, but it would perform poorly for class B.
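The 90/10 example above can be reproduced in a few lines. The sketch below builds a hypothetical dataset with that class split and scores a model that always predicts class A.

```python
# Hypothetical dataset: 90 instances of class "A", 10 of class "B"
y_true = ["A"] * 90 + ["B"] * 10
y_pred = ["A"] * 100  # a model that always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall for class B: share of actual B instances that were predicted as B
b_total = sum(t == "B" for t in y_true)
b_found = sum(t == "B" and p == "B" for t, p in zip(y_true, y_pred))
recall_b = b_found / b_total

print(f"accuracy = {accuracy:.2f}, recall for class B = {recall_b:.2f}")
```

The accuracy looks strong, yet not a single class B instance is identified, which is why accuracy should be read alongside per-class metrics on imbalanced data.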

### Evaluation Metric 2: Precision
**Definition**
Precision measures how many of the model's positive predictions are actually correct. It is useful when false positives are particularly costly (e.g., in spam or fraud detection).

**Formula**
$$Precision = \frac{TP}{TP + FP}$$

**When to use**
Precision is important when the cost of a false positive is high, meaning it’s better to be cautious when predicting positives. A high precision means fewer false positives, but it doesn’t account for false negatives.

### Evaluation Metric 3: Recall (sensitivity or true positive rate)
**Definition**
Recall measures how many of the actual positives in the dataset the model correctly identifies. It is essential in scenarios where missing positives (false negatives) can have serious consequences (e.g., in medical diagnosis or fraud detection).

**Formula**
$$Recall = \frac{TP}{TP + FN}$$

**When to use**
Recall is crucial when the goal is to capture as many positive instances as possible, even if that means allowing some false positives. For example, in a cancer detection model, it’s more important to catch as many true cases of cancer as possible, even if some false positives are included.
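Precision and recall usually pull against each other: lowering the decision threshold catches more positives (higher recall) at the price of more false alarms (lower precision). The sketch below illustrates this with made-up scores and labels; `precision_recall_at` is a hypothetical helper.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when scores >= threshold are predicted positive."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical model scores and true labels (1 = positive)
scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

for threshold in (0.8, 0.5, 0.25):
    p, r = precision_recall_at(scores, labels, threshold)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, recall rises from 0.50 to 1.00 as the threshold drops, while precision falls from 1.00 to about 0.67.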

### Evaluation Metric 4: F1 score
**Definition**
The F1 score is the harmonic mean of precision and recall, providing a single metric that balances both. It’s useful when you want to find a middle ground between precision and recall, particularly in cases of imbalanced datasets.

**Formula**
$$F1 = 2 \times \left( \frac{Precision \times Recall}{Precision + Recall} \right)$$

**When to use**
The F1 score is most helpful when both precision and recall are important, and there is a need to balance the two. For example, in a fraud detection system, we want to minimize both false positives (to avoid unnecessary investigations) and false negatives (to catch as much fraud as possible).
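One way to see what the F1 score rewards is to substitute the precision and recall formulas into it. The following is a worked rearrangement of the formula above, not an additional definition. Adding the two fractions over a common denominator gives:

$$Precision + Recall = \frac{TP \times (2 \times TP + FP + FN)}{(TP + FP)(TP + FN)}$$

Their product is:

$$Precision \times Recall = \frac{TP^2}{(TP + FP)(TP + FN)}$$

Dividing the product by the sum and multiplying by 2 gives:

$$F1 = \frac{2 \times TP}{2 \times TP + FP + FN}$$

True negatives do not appear in the result, so the F1 score is unaffected by how many negatives the model correctly rejects, and false positives and false negatives are penalized equally.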

### Evaluation Metric 5: Confusion matrix
**Definition**
A confusion matrix is a table that allows you to visualize the performance of a classification model by comparing actual versus predicted values. It provides a detailed breakdown of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).

**Example**

| | Predicted positive | Predicted negative |
|---|---|---|
| **Actual positive** | True positive | False negative |
| **Actual negative** | False positive | True negative |

**When to use**
A confusion matrix is especially useful when analyzing a model’s errors. For example, in a binary classification problem, you can easily see whether the model is making more false positives or false negatives, helping you to adjust the model accordingly.

### Evaluation Metric 6: Specificity (true negative rate)
**Definition**
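Specificity measures the proportion of actual negatives that the model correctly identifies. It is the counterpart of recall for the negative class: recall measures how many actual positives are found, while specificity measures how many actual negatives are correctly left unflagged, so high specificity means few false positives.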

**Formula**
$$Specificity = \frac{TN}{TN + FP}$$

**When to use**
Specificity is useful when the cost of false positives is high. For example, in certain medical tests, it’s important to minimize the number of false positives to avoid unnecessary treatments.
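As a small worked example, the sketch below computes specificity for a hypothetical screening test applied to 1,000 healthy patients. The counts are invented for illustration, and the false positive rate is included because it is simply 1 minus specificity.

```python
def specificity(tn: int, fp: int) -> float:
    """True negative rate; returns 0.0 if there are no actual negatives."""
    return tn / (tn + fp) if tn + fp else 0.0


# Hypothetical results for 1,000 patients who do not have the condition
tn, fp = 950, 50

spec = specificity(tn, fp)
false_positive_rate = 1 - spec
print(f"specificity = {spec:.2f}, false positive rate = {false_positive_rate:.2f}")
```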

### Evaluation Metric 7: ROC-AUC
**Definition**
ROC-AUC measures the trade-off between the TP rate (recall) and the FP rate (1 - specificity) across different threshold values. The ROC curve plots the TP rate against the FP rate, and the area under the curve (AUC) quantifies the overall ability of the model to distinguish between classes.

**When to use**
ROC-AUC is a robust metric for evaluating binary classifiers, particularly when you want to compare how well different models perform at distinguishing between the positive and negative classes. It is often used when dealing with imbalanced datasets, although under severe imbalance it can look optimistic, so it is worth reading alongside precision and recall.

- **ROC curve:** the curve itself shows the trade-off between sensitivity (recall) and specificity (true negative rate).
- **AUC value:** an AUC of 1.0 indicates a perfect classifier, while an AUC of 0.5 indicates a model with no discriminatory ability.
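AUC also has a ranking interpretation: it equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as one half. The sketch below computes it that way by comparing every positive/negative pair. It is a plain-Python illustration on made-up scores, not how libraries typically implement it for large datasets.

```python
def roc_auc(scores, labels):
    """AUC as the share of positive/negative pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0  # positive ranked above negative
            elif p == n:
                wins += 0.5  # tie counts as half
    return wins / (len(pos) * len(neg))


# Hypothetical model scores and true labels (1 = positive)
scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

print(f"AUC = {roc_auc(scores, labels):.3f}")
```

A model that gives every instance the same score gets exactly 0.5 under this definition, matching the "no discriminatory ability" case above.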

### Evaluation Metric 8: Loss
**Definition**
Loss measures how well the model’s predictions align with the actual labels. During training, the goal is to minimize the loss to improve the model's performance. Two common loss functions are cross-entropy loss (for classification problems) and mean squared error (for regression tasks).

#### 1. Cross-entropy loss
**Definition**
Cross-entropy loss measures the difference between the predicted probabilities and the actual class labels in classification tasks. It penalizes confident but incorrect predictions more heavily.

**Formula**
$$Cross\text{-}Entropy\ Loss = - \sum_{i=1}^{N} y_i \log(\hat{y}_i)$$

Where:
- $y_i$ is the true label for class $i$ (1 if class $i$ is the correct class; 0 otherwise).
- $\hat{y}_i$ is the predicted probability for class $i$.
- $N$ is the number of classes.

Cross-entropy is particularly useful for multi-class classification problems, as minimizing it pushes the model to assign a high probability to the true class.
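The sketch below evaluates the formula for a single example with three classes, where the true label is a one-hot vector. The probability vectors are invented to show how the loss grows as the model becomes confidently wrong; the small `eps` guard against `log(0)` is an implementation choice, not part of the formula.

```python
import math


def cross_entropy(y_true, y_prob, eps=1e-12):
    """Cross-entropy for one example: -sum(y_i * log(p_i)) over classes."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(y_true, y_prob))


y_true = [0, 1, 0]  # one-hot label: the second class is correct

confident_right = [0.05, 0.90, 0.05]
unsure = [0.30, 0.40, 0.30]
confident_wrong = [0.90, 0.05, 0.05]

for name, probs in [("confident and right", confident_right),
                    ("unsure", unsure),
                    ("confident and wrong", confident_wrong)]:
    print(f"{name}: loss = {cross_entropy(y_true, probs):.3f}")
```

With a one-hot label, only the term for the true class survives, so the loss reduces to the negative log of the probability assigned to the correct class.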

#### 2. Mean squared error (MSE)
**Definition**
MSE is used in regression tasks and measures the average of the squared differences between predicted values and actual values.

**Formula**
$$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Where:
- $y_i$ is the true value.
- $\hat{y}_i$ is the predicted value.
- $n$ is the number of data points.

MSE gives more weight to larger errors, meaning the model is penalized more for predictions significantly off the actual values.
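To illustrate the effect of squaring, the sketch below compares two hypothetical sets of predictions with the same total absolute error: one spreads the error evenly, the other concentrates it in a single prediction.

```python
def mse(y_true, y_pred):
    """Mean of squared differences between true and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [3.0, 5.0, 2.0, 7.0]

small_errors = [2.0, 6.0, 3.0, 6.0]      # every prediction is off by 1
one_large_error = [3.0, 5.0, 2.0, 11.0]  # one prediction is off by 4

print(f"four errors of 1: MSE = {mse(y_true, small_errors):.1f}")
print(f"one error of 4:   MSE = {mse(y_true, one_large_error):.1f}")
```

Both prediction sets miss by a total of 4 units, but the single large miss produces an MSE four times higher.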

**When to use**
Loss is important during the training phase because it gives insight into how well the model is learning from the data. However, it is less interpretable as a standalone metric after training compared to accuracy or the F1 score.

## Choosing the right metric
The choice of evaluation metric depends on the task and the goals of the model. Here are some guidelines:
- For balanced datasets, accuracy is a reasonable choice.
- For imbalanced datasets, use precision, recall, F1 score, or ROC-AUC to get a more nuanced view of model performance.
- When false positives are costly, precision and specificity are crucial.
- When false negatives are costly, recall is the key metric.
- When both false positives and false negatives matter, the F1 score provides a balanced evaluation.

## Conclusion
Different evaluation metrics provide various insights into a model’s performance. By understanding these metrics, you can make better decisions about how to interpret the results of a fine-tuned model and how to improve its performance in future iterations. Always choose metrics that align with the task’s goals and the specific costs associated with false positives or false negatives.


# Summary: Beyond Accuracy - Evaluation Metrics for Machine Learning

## Introduction
In real-world machine learning applications, such as fraud detection or medical diagnosis, relying solely on model accuracy can be deceptive. This summary explores why high accuracy does not always equate to a successful model and details alternative metrics—Precision, Recall, and F1 Score—that are critical for assessing performance in mission-critical tasks.

## The Limitation of Accuracy
While accuracy is a common starting point for evaluation, it can be highly misleading, particularly with **imbalanced datasets**.
*   **The Fraud Detection Paradox:** In a scenario where only 1% of transactions are fraudulent, a model that simply predicts "not fraud" for every single transaction achieves 99% accuracy. However, this model is practically useless because it fails to detect any actual fraud.
*   **Insight:** Accuracy measures overall correctness but fails to capture the nuance of specific error types (false positives vs. false negatives), which is often where the real business or safety value lies.

## Key Metrics Explained

### 1. Precision
*   **Definition:** Measures the accuracy of positive predictions.
*   **Use Case:** Critical when **false positives are costly**.
*   **Example:** In fraud detection, low precision means many legitimate transactions are flagged as fraud. Investigating these false alarms is expensive and frustrates users, so high precision is preferred.

### 2. Recall
*   **Definition:** Measures the ability of the model to find all the relevant cases (positive instances).
*   **Use Case:** Critical when **false negatives are dangerous**.
*   **Example:** In medical diagnoses (e.g., cancer detection), missing a positive case (a false negative) can be life-threatening. Therefore, high recall is prioritized to ensure as many cases as possible are caught, even if it means accepting some false positives.

### 3. F1 Score
*   **Definition:** The harmonic mean of Precision and Recall.
*   **Use Case:** Best used when you need a **balance** between Precision and Recall, and when false positives and false negatives are both important. It prevents a model from being biased too heavily toward one metric at the expense of the other.

## Advanced Metrics & Tools

### ROC-AUC (Receiver Operating Characteristic - Area Under Curve)
*   Helps understand how well a model distinguishes between classes.
*   Particularly useful for assessing performance on imbalanced datasets.

### Confusion Matrix
*   A visualization tool that breaks down predictions into True Positives, True Negatives, False Positives, and False Negatives.
*   Allows for a granular analysis of exactly *where* the model is making mistakes.

## Strategic Selection of Metrics
Choosing the right metric is not a one-size-fits-all decision; it depends entirely on the specific goals of the task and the cost of errors.

| Task | Priority | Recommended Metric | Reason |
| :--- | :--- | :--- | :--- |
| **Cancer Detection** | Catching every case | **Recall** | The cost of missing a diagnosis (False Negative) is extremely high. |
| **Fraud Detection** | Minimizing false alarms | **Precision** | The cost of blocking legitimate users (False Positive) is high. |
| **General Classification** | Balanced performance | **F1 Score** | When both error types are undesirable. |

## Conclusion
To build models that serve real-world goals, developers must look beyond accuracy. By understanding the nuances of Precision, Recall, and F1 Score, and utilizing tools like the Confusion Matrix, practitioners can fine-tune models to minimize the specific errors that matter most to their application.
