# Lesson 11: Methods and Metrics for Model Evaluation

## Introduction (2 minutes)

Welcome to our lesson on Methods and Metrics for Model Evaluation. In this 30-minute session, we'll explore various techniques for assessing the performance of language models across different tasks.

## Lesson Objectives

By the end of this lesson, you will:
1. Understand key evaluation metrics for different NLP tasks
2. Learn how to implement these metrics
3. Recognize the importance of task-specific evaluation
4. Understand the process of model selection for different use cases

## 1. Evaluation Metrics for Various Tasks (20 minutes)

### 1.1 Classification Tasks (7 minutes)

These metrics apply to tasks such as sentiment analysis and spam detection, where the model assigns each input to one of a fixed set of classes.

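Key Metrics:
- Accuracy: the fraction of all predictions that are correct
- Precision: of the items predicted as a class, the fraction that truly belong to it
- Recall: of the items that truly belong to a class, the fraction the model finds
- F1 Score: the harmonic mean of precision and recall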

Example implementation using scikit-learn:

In [None]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

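def evaluate_classification(y_true, y_pred):
    # Fraction of predictions that exactly match the true labels.
    accuracy = accuracy_score(y_true, y_pred)
    # 'weighted' averages the per-class scores, weighting each class by its
    # number of true instances; zero_division=0 scores undefined ratios as 0.
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average='weighted', zero_division=0
    )

    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1 Score: {f1:.4f}")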

# Example usage
y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 1, 1]
evaluate_classification(y_true, y_pred)
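
To make the `average='weighted'` setting in the example above less opaque, here is a minimal sketch that recomputes the per-class scores from raw counts and then weights them by class support. The helper names `per_class_metrics` and `weighted_average` are hypothetical, and treating undefined ratios as 0 is an assumption of this sketch.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} from raw counts."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    pairs = list(zip(y_true, y_pred))
    metrics = {}
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # Assumption: a ratio with a zero denominator is scored as 0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[label] = (precision, recall, f1, support[label])
    return metrics


def weighted_average(metrics, index):
    """Average one score (0=precision, 1=recall, 2=f1), weighted by support."""
    total = sum(m[3] for m in metrics.values())
    return sum(m[index] * m[3] for m in metrics.values()) / total


y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 1, 1]
metrics = per_class_metrics(y_true, y_pred)
for label, (p, r, f, s) in metrics.items():
    print(f"class {label}: precision={p:.4f} recall={r:.4f} f1={f:.4f} support={s}")
print(f"Weighted precision: {weighted_average(metrics, 0):.4f}")
print(f"Weighted recall: {weighted_average(metrics, 1):.4f}")
print(f"Weighted F1: {weighted_average(metrics, 2):.4f}")
```

Class 2 is never predicted correctly here, so it pulls every weighted score down even though class 0 is classified perfectly. Inspecting per-class scores like this can reveal problems that a single averaged number hides.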

### 1.2 Generation Tasks (7 minutes)

These metrics apply to tasks such as machine translation and text summarization, where the model's output is compared against one or more reference texts.

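Key Metrics:
- BLEU (Bilingual Evaluation Understudy): n-gram precision against one or more references, with a brevity penalty for short outputs
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation): n-gram and longest-common-subsequence overlap with reference summaries, with an emphasis on recall
- METEOR (Metric for Evaluation of Translation with Explicit ORdering): unigram matching that also credits stems and synonyms, combining precision and recall with a penalty for fragmented word order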

Example implementation of BLEU score:

In [None]:
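from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

def calculate_bleu(reference, candidate):
    # Whitespace tokenization keeps the example simple; a real pipeline
    # should use a proper tokenizer.
    # Smoothing avoids a near-zero score (and an NLTK warning) when a short
    # sentence has no matching 4-grams, as in the example below.
    smoothing = SmoothingFunction().method1
    return sentence_bleu([reference.split()], candidate.split(),
                         smoothing_function=smoothing)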

# Example usage
reference = "The cat is on the mat"
candidate = "There is a cat on the mat"
bleu_score = calculate_bleu(reference, candidate)
print(f"BLEU Score: {bleu_score:.4f}")
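
ROUGE is listed above but not demonstrated, so here is a minimal sketch of ROUGE-1 (unigram overlap) written from scratch. It is a simplified illustration and not the official ROUGE package: the function name `rouge_1` is hypothetical, and lowercasing plus whitespace tokenization are assumptions made to keep the example short.

```python
from collections import Counter


def rouge_1(reference, candidate):
    """Return (precision, recall, f1) based on clipped unigram overlap."""
    # Assumption: lowercase and split on whitespace instead of real tokenization.
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((ref_counts & cand_counts).values())
    ref_total = sum(ref_counts.values())
    cand_total = sum(cand_counts.values())
    recall = overlap / ref_total if ref_total else 0.0
    precision = overlap / cand_total if cand_total else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


reference = "The cat is on the mat"
candidate = "There is a cat on the mat"
precision, recall, f1 = rouge_1(reference, candidate)
print(f"ROUGE-1 precision: {precision:.4f}")
print(f"ROUGE-1 recall: {recall:.4f}")
print(f"ROUGE-1 F1: {f1:.4f}")
```

On this sentence pair, five of the six reference words are recovered, so recall is high even though the word order differs. That contrast with BLEU's stricter higher-order n-gram matching is one reason to report more than one generation metric.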

### 1.3 Language Model Specific Metrics (6 minutes)

Metrics specific to language modeling tasks:

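- Perplexity: the exponential of the average negative log-likelihood per token; lower is better
- Bits per Character (BPC): the average negative base-2 log-likelihood per character; lower is better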

Example calculation of perplexity:

In [None]:
import numpy as np

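def calculate_perplexity(probabilities):
    # probabilities: the probability the model assigned to each actual token.
    # Perplexity is the exponential of the average negative log-likelihood.
    return np.exp(-np.mean(np.log(probabilities)))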

# Example usage
probabilities = [0.1, 0.2, 0.05, 0.6, 0.05]
perplexity = calculate_perplexity(probabilities)
print(f"Perplexity: {perplexity:.4f}")
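
Bits per Character can be computed in the same way, using a base-2 logarithm over per-character probabilities. The following is a minimal sketch: the function name `bits_per_character` and the probability values are made up for illustration.

```python
import math


def bits_per_character(char_probabilities):
    """Average negative log2 probability assigned to each actual character."""
    total_bits = -sum(math.log2(p) for p in char_probabilities)
    return total_bits / len(char_probabilities)


# Hypothetical probabilities a character-level model gave to four characters.
char_probabilities = [0.5, 0.25, 0.125, 0.125]
bpc = bits_per_character(char_probabilities)
print(f"BPC: {bpc:.4f}")
# Character-level perplexity and BPC carry the same information:
# perplexity = 2 ** BPC.
print(f"Character-level perplexity: {2 ** bpc:.4f}")
```

Because the two quantities differ only by the choice of logarithm base and unit (tokens versus characters), comparisons between models are only meaningful when both are measured over the same unit.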

## 2. Model Selection for Different Use Cases (6 minutes)

When selecting a model for a specific use case, consider:

1. Task requirements (e.g., classification vs. generation)
2. Performance on relevant metrics
3. Computational resources available
4. Inference time requirements
5. Fine-tuning potential

Example decision process:

In [None]:
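def select_model(task, performance_threshold, max_size, max_inference_time):
    # Illustrative catalog: performance is a score on the chosen metric,
    # size is in MB, and inference_time is in ms. `task` is not used in this
    # simplified example; a real catalog would store per-task scores.
    models = {
        "ModelA": {"performance": 0.85, "size": 500, "inference_time": 100},
        "ModelB": {"performance": 0.90, "size": 1000, "inference_time": 200},
        "ModelC": {"performance": 0.95, "size": 2000, "inference_time": 300}
    }

    # Return the first model (in insertion order) that meets every constraint.
    for model, specs in models.items():
        if (specs["performance"] >= performance_threshold and
            specs["size"] <= max_size and
            specs["inference_time"] <= max_inference_time):
            return model

    # No model satisfies all constraints.
    return None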

# Example usage
task = "classification"
performance_threshold = 0.88
max_size = 1500  # MB
max_inference_time = 250  # ms

selected_model = select_model(task, performance_threshold, max_size, max_inference_time)
print(f"Selected Model: {selected_model}")
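
The loop above returns the first model that satisfies the constraints, so the result depends on the order of the catalog. One possible variation, sketched below, collects every feasible model and returns the one with the highest performance. The function name `select_best_model` is hypothetical, and the catalog reuses the illustrative values from the example above.

```python
def select_best_model(models, performance_threshold, max_size, max_inference_time):
    """Return the highest-performing model that meets all constraints, or None."""
    feasible = {
        name: specs
        for name, specs in models.items()
        if specs["performance"] >= performance_threshold
        and specs["size"] <= max_size
        and specs["inference_time"] <= max_inference_time
    }
    if not feasible:
        return None
    # Among feasible models, prefer the best score on the chosen metric.
    return max(feasible, key=lambda name: feasible[name]["performance"])


# Illustrative catalog: size in MB, inference_time in ms.
models = {
    "ModelA": {"performance": 0.85, "size": 500, "inference_time": 100},
    "ModelB": {"performance": 0.90, "size": 1000, "inference_time": 200},
    "ModelC": {"performance": 0.95, "size": 2000, "inference_time": 300},
}

# With loose constraints every model qualifies, and the best one is chosen.
print(select_best_model(models, 0.80, 2000, 300))
# With the tighter constraints from the example above, only ModelB qualifies.
print(select_best_model(models, 0.88, 1500, 250))
```

Other tie-breaking rules are equally defensible, for example preferring the smallest or fastest feasible model when cost matters more than the last few points of performance.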

## Conclusion and Q&A (2 minutes)

We've covered various evaluation metrics for different NLP tasks, including classification, generation, and language modeling. Remember, the choice of metric depends on your specific task and requirements. Always consider multiple metrics for a comprehensive evaluation of your model's performance.

Are there any questions about the evaluation methods or metrics we've discussed?

## Additional Resources

1. "BLEU: a Method for Automatic Evaluation of Machine Translation" paper: https://www.aclweb.org/anthology/P02-1040.pdf
2. "ROUGE: A Package for Automatic Evaluation of Summaries" paper: https://www.aclweb.org/anthology/W04-1013.pdf
3. Hugging Face's Evaluate library: https://huggingface.co/docs/evaluate/index
4. scikit-learn metrics documentation: https://scikit-learn.org/stable/modules/model_evaluation.html

In our next lesson, we'll dive into practical aspects of model inference and function calling.