## Robustness Testing (Article 15)

### Adversarial Robustness

Test model resilience to adversarial inputs:

- [ ] Small perturbations to inputs
- [ ] Out-of-distribution samples
- [ ] Edge cases and boundary conditions
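
A minimal sketch of the small-perturbation check for text inputs such as FAQ questions. It assumes the model is wrapped in a `predict` callable that maps a list of strings to a list of labels or answer IDs; `add_typos` and `prediction_stability` are hypothetical helper names, and swapping adjacent letters is only one of many possible perturbations.

```python
import random
from typing import Callable, Sequence


def add_typos(text: str, rate: float, rng: random.Random) -> str:
    """Swap adjacent letters with probability `rate` to simulate typing errors."""
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the swapped pair so each character moves at most once
        else:
            i += 1
    return "".join(chars)


def prediction_stability(
    predict: Callable[[Sequence[str]], Sequence],
    texts: Sequence[str],
    rate: float = 0.05,
    seed: int = 0,
) -> float:
    """Return the fraction of inputs whose prediction is unchanged after perturbation."""
    if not texts:
        return float("nan")
    rng = random.Random(seed)  # fixed seed keeps the perturbed set reproducible
    perturbed = [add_typos(t, rate, rng) for t in texts]
    original_preds = list(predict(texts))
    perturbed_preds = list(predict(perturbed))
    unchanged = sum(a == b for a, b in zip(original_preds, perturbed_preds))
    return unchanged / len(texts)
```

A stability close to 1.0 suggests the model tolerates this kind of noise. The acceptable level is a project decision and is not defined in this notebook.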

### Data Distribution Shifts

Test performance under distribution shifts:

- [ ] Temporal shifts (data from different time periods)
- [ ] Geographic shifts (data from different regions)
- [ ] Demographic shifts (different population characteristics)
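
One way to approach these checks is to compute accuracy per slice and compare it with overall accuracy. The sketch below assumes a DataFrame with one row per test example; the column names `label` and `prediction`, and the function name `accuracy_by_slice`, are illustrative and not taken from the datasets in this notebook.

```python
import pandas as pd


def accuracy_by_slice(
    df: pd.DataFrame,
    slice_col: str,
    label_col: str = "label",
    pred_col: str = "prediction",
) -> pd.DataFrame:
    """Accuracy and sample count per value of `slice_col`, worst slice first."""
    tmp = pd.DataFrame(
        {
            "slice": df[slice_col],
            "correct": (df[label_col] == df[pred_col]).astype(float),
        }
    )
    summary = tmp.groupby("slice")["correct"].agg(["mean", "count"])
    summary = summary.rename(columns={"mean": "accuracy", "count": "n"})
    # Negative gap: the slice performs worse than the test set as a whole
    summary["gap_vs_overall"] = summary["accuracy"] - tmp["correct"].mean()
    return summary.sort_values("accuracy")
```

The same function can be called with a time-period, region, or demographic column. Slices with a small `n` deserve caution, since their accuracy estimate is noisy.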

### Error Analysis

Systematic analysis of failure modes:

- [ ] Confusion matrix analysis
- [ ] Error patterns by input characteristics
- [ ] Failure case documentation

In [None]:
# Robustness Testing

# 1. Confusion Matrix
# cm = confusion_matrix(y_test, y_pred)
# plt.figure(figsize=(10, 8))
# sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
# plt.title('Confusion Matrix')
# plt.ylabel('True Label')
# plt.xlabel('Predicted Label')
# plt.show()

# 2. Error Analysis
# errors = X_test[y_test != y_pred]
# print(f"Total errors: {len(errors)}")
# print(f"Error rate: {len(errors) / len(y_test):.2%}")

# 3. Performance by Confidence
# confidence = y_pred_proba.max(axis=1)
# confidence_bins = pd.cut(confidence, bins=[0, 0.6, 0.8, 0.9, 1.0])
#
# print("\nPerformance by Confidence Level:")
# # pd.cut returns a Categorical for array input, which has .categories but no .cat accessor
# for bin_range in confidence_bins.categories:
#     mask = confidence_bins == bin_range
#     if mask.sum() > 0:
#         acc = accuracy_score(y_test[mask], y_pred[mask])
#         print(f"  {bin_range}: {acc:.3f} ({mask.sum()} samples)")

print("Robustness testing complete. Document findings above.")

# Model Evaluation: GPT-4 Baseline for French Government FAQ

**Category**: evaluations
**Purpose**: Establish baseline performance metrics for GPT-4 on French government FAQ dataset before fine-tuning
**Author**: Data Science Team
**Created**: 2024-10-15

**Data Sources**:
- Training data: `data/faq-training-v2.csv` (10,000 samples, 2023-2024)
- Validation data: `data/faq-validation-v2.csv` (2,000 samples)
- Test data: `data/faq-test-v2.csv` (1,000 samples, held out)
- Baseline results: `results/gpt-3.5-baseline.json` (previous baseline)

**Dependencies**:
```
# Model evaluation dependencies
scikit-learn>=1.3.0
numpy>=1.24.0
pandas>=2.1.0
matplotlib>=3.8.0
seaborn>=0.13.0
openai>=1.3.0  # For GPT-4 API
```

**Model Requirements**:
- Model: GPT-4-turbo (gpt-4-1106-preview)
- API version: 2024-02-01
- Temperature: 0.7
- Max tokens: 500

**EU AI Act Context**:
- System risk level: High-risk (essential public services)
- Intended purpose: Automated FAQ responses for French government services
- Target performance: 
  - Accuracy ≥ 90% (correct answer identification)
  - Hallucination rate < 5% (factually incorrect responses)
  - Response time < 3 seconds (95th percentile)
  - French language quality ≥ 4.0/5.0 (human evaluation)
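
The four targets above can be checked mechanically once per-example results exist. This is a sketch under stated assumptions: `check_targets` is a hypothetical helper, response time is taken as the 95th percentile of per-request latencies in seconds, and language quality is taken as the mean of the human ratings (the notebook does not say how ratings are aggregated).

```python
import numpy as np


def check_targets(correct, hallucinated, latencies_s, quality_scores):
    """Compare measured values with the four performance targets.

    correct, hallucinated: per-example 0/1 flags
    latencies_s: per-request response times in seconds
    quality_scores: human ratings on a 1-5 scale
    Returns {target_name: (measured_value, passed)}.
    """
    measured = {
        "accuracy": float(np.mean(correct)),
        "hallucination_rate": float(np.mean(hallucinated)),
        "p95_latency_s": float(np.percentile(latencies_s, 95)),
        "french_quality": float(np.mean(quality_scores)),
    }
    passed = {
        "accuracy": measured["accuracy"] >= 0.90,
        "hallucination_rate": measured["hallucination_rate"] < 0.05,
        "p95_latency_s": measured["p95_latency_s"] < 3.0,
        "french_quality": measured["french_quality"] >= 4.0,
    }
    return {name: (measured[name], passed[name]) for name in measured}
```

Keeping the comparison operators identical to the written targets (`≥` versus `<`) avoids borderline results being reported differently in code and in the documentation.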

## EU AI Act Evaluation Requirements

### Article 15: Accuracy, Robustness and Cybersecurity

For high-risk AI systems, demonstrate:
- [ ] Appropriate level of accuracy
- [ ] Robustness against errors
- [ ] Resilience to manipulation
- [ ] Cybersecurity measures

### Fairness and Bias Assessment

Required for systems affecting individuals:
- [ ] Performance across demographic groups
- [ ] Disparate impact analysis
- [ ] Bias mitigation effectiveness
- [ ] Fairness metrics documented

### Validation Methodology

- [ ] Independent test set (not used in training)
- [ ] Representative of deployment conditions
- [ ] Edge cases and boundary conditions tested
- [ ] Methodology documented and reproducible
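
The first item can be partly verified in code by looking for test questions that also appear in the training data. The sketch below is one simple approach, assuming both files expose the question text in a column named `question` (a hypothetical name); it only catches duplicates that match after lowercasing and whitespace normalization, not paraphrases.

```python
import pandas as pd


def normalize(text) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(str(text).lower().split())


def find_leaked_rows(
    train: pd.DataFrame, test: pd.DataFrame, text_col: str = "question"
) -> pd.DataFrame:
    """Return the test rows whose normalized text also occurs in the training set."""
    train_keys = set(train[text_col].map(normalize))
    leaked = test[text_col].map(normalize).isin(train_keys)
    return test[leaked]
```

An empty result is necessary but not sufficient for independence. Near-duplicates and rephrased questions need a fuzzier comparison or manual review.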

In [None]:
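# Standard imports
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, precision_score, recall_score, roc_auc_score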

# Load test dataset
# test_df = pd.read_csv('data/faq-test-v2.csv')

## Model Inference

In [None]:
# Run model inference on test set
# predictions = model.predict(test_df)

In [None]:
# EU AI Act Evaluation Metrics


# Load model and test data
# model = load_model('path/to/model')
# X_test = pd.read_csv('data/test-features.csv')
# y_test = pd.read_csv('data/test-labels.csv').squeeze('columns')  # single label column as a Series

# Make predictions
# y_pred = model.predict(X_test)
# y_pred_proba = model.predict_proba(X_test)

# Core Performance Metrics (Article 15)
metrics = {
    "accuracy": 0.0,  # accuracy_score(y_test, y_pred)
    "precision": 0.0,  # precision_score(y_test, y_pred, average='weighted')
    "recall": 0.0,  # recall_score(y_test, y_pred, average='weighted')
    "f1_score": 0.0,  # f1_score(y_test, y_pred, average='weighted')
    "roc_auc": 0.0,  # roc_auc_score(y_test, y_pred_proba, multi_class='ovr')
}

print("Core Performance Metrics:")
for metric, value in metrics.items():
    print(f"  {metric}: {value:.4f}")

# Target thresholds (define based on use case)
thresholds = {
    "accuracy": 0.90,  # Minimum 90% accuracy
    "precision": 0.85,  # Minimum 85% precision
    "recall": 0.85,  # Minimum 85% recall
    "f1_score": 0.85,  # Minimum 85% F1
}

print("\nThreshold Compliance:")
for metric, threshold in thresholds.items():
    status = "✓ PASS" if metrics[metric] >= threshold else "✗ FAIL"
    print(f"  {metric}: {metrics[metric]:.4f} >= {threshold:.4f} {status}")
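
The pass/fail comparison above uses a point estimate, which can be misleading on a finite test set. As a hedged complement, the sketch below computes a Wilson score interval for a proportion such as accuracy; `wilson_interval` is a hypothetical helper and the default `z = 1.96` corresponds to roughly 95% confidence. With 1,000 test samples and 910 correct answers, the lower bound comes out at roughly 0.89, so an observed 91% does not by itself establish that true accuracy is at least 90%.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (e.g. accuracy)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


low, high = wilson_interval(successes=910, n=1000)
print(f"Accuracy 95% interval: [{low:.3f}, {high:.3f}]")
```

Whether the threshold must be met by the point estimate or by the lower bound is a policy choice that should be recorded alongside the results.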

## Results

## Fairness and Bias Assessment

### Demographic Parity

Evaluate if model predictions are independent of protected attributes:

```python
# For each protected attribute (e.g., gender, age, ethnicity)
# Calculate: P(ŷ=1 | A=a) for each group a
```

**Acceptable threshold**: The ratio of the lowest to the highest group positive rate should be > 0.8 (the "four-fifths rule")

### Equal Opportunity

Evaluate if true positive rates are similar across groups:

```python
# Calculate: TPR(A=a) = P(ŷ=1 | y=1, A=a) for each group
```

**Acceptable threshold**: The largest TPR difference between any two groups should be < 0.1

### Equalized Odds

Evaluate if both TPR and FPR are similar across groups:

```python
# Calculate both:
# TPR(A=a) = P(ŷ=1 | y=1, A=a)
# FPR(A=a) = P(ŷ=1 | y=0, A=a)
```

**Acceptable threshold**: The largest TPR difference and the largest FPR difference between groups should each be < 0.1
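
A compact sketch of the equalized odds check, assuming binary 0/1 labels and predictions. `equalized_odds_gaps` is a hypothetical helper name. Groups with no actual positives (or no actual negatives) are simply left out of the corresponding rate instead of being counted as zero.

```python
import pandas as pd


def equalized_odds_gaps(y_true, y_pred, group) -> dict:
    """Largest TPR and FPR differences between groups for binary labels."""
    df = pd.DataFrame(
        {"y_true": list(y_true), "y_pred": list(y_pred), "group": list(group)}
    )
    # TPR per group: share of actual positives predicted positive
    tpr = df[df["y_true"] == 1].groupby("group")["y_pred"].mean()
    # FPR per group: share of actual negatives predicted positive
    fpr = df[df["y_true"] == 0].groupby("group")["y_pred"].mean()
    return {
        "tpr_gap": float(tpr.max() - tpr.min()),
        "fpr_gap": float(fpr.max() - fpr.min()),
    }


def meets_equalized_odds(gaps: dict, max_gap: float = 0.1) -> bool:
    """Apply the < 0.1 threshold to both gaps."""
    return gaps["tpr_gap"] < max_gap and gaps["fpr_gap"] < max_gap
```

Checking only `tpr_gap` against the same limit gives the equal opportunity test.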

In [None]:
# Calculate metrics
# accuracy = accuracy_score(test_df['label'], predictions)
# f1 = f1_score(test_df['label'], predictions, average='weighted')
# print(f'Accuracy: {accuracy:.3f}')
# print(f'F1 Score: {f1:.3f}')

In [None]:
# Fairness Metrics Implementation


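def calculate_fairness_metrics(y_true, y_pred, protected_attribute):
    """Calculate per-group fairness metrics for binary (0/1) labels.

    `protected_attribute` must be a pandas Series aligned row-by-row
    with `y_true` and `y_pred`.
    """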
    groups = protected_attribute.unique()
    metrics = {}

    for group in groups:
        mask = protected_attribute == group
        y_true_group = y_true[mask]
        y_pred_group = y_pred[mask]

        # Demographic parity: P(ŷ=1 | A=a)
        positive_rate = (y_pred_group == 1).mean()

        # Equal opportunity: TPR (NaN when the group has no actual positives)
        tp = ((y_true_group == 1) & (y_pred_group == 1)).sum()
        fn = ((y_true_group == 1) & (y_pred_group == 0)).sum()
        tpr = tp / (tp + fn) if (tp + fn) > 0 else float("nan")

        # False positive rate for equalized odds (NaN when the group has no actual negatives)
        fp = ((y_true_group == 0) & (y_pred_group == 1)).sum()
        tn = ((y_true_group == 0) & (y_pred_group == 0)).sum()
        fpr = fp / (fp + tn) if (fp + tn) > 0 else float("nan")

        metrics[group] = {
            "positive_rate": positive_rate,
            "tpr": tpr,
            "fpr": fpr,
            "sample_size": len(y_true_group),
        }

    return metrics


# Example usage (uncomment when data available):
# protected_attr = X_test['gender']  # or age_group, ethnicity, etc.
# fairness_results = calculate_fairness_metrics(y_test, y_pred, protected_attr)
#
# print("Fairness Metrics by Group:")
# for group, group_metrics in fairness_results.items():  # avoid overwriting the global `metrics` dict
#     print(f"\n{group}:")
#     print(f"  Positive rate: {group_metrics['positive_rate']:.3f}")
#     print(f"  TPR: {group_metrics['tpr']:.3f}")
#     print(f"  FPR: {group_metrics['fpr']:.3f}")
#     print(f"  Sample size: {group_metrics['sample_size']}")

# Calculate disparate impact ratio
# max_rate = max(m['positive_rate'] for m in fairness_results.values())
# min_rate = min(m['positive_rate'] for m in fairness_results.values())
# disparate_impact = min_rate / max_rate if max_rate > 0 else 0
# print(f"\nDisparate Impact Ratio: {disparate_impact:.3f}")
# print(f"Status: {'✓ PASS' if disparate_impact > 0.8 else '✗ FAIL'} (threshold: 0.8)")

## Conclusion and Recommendations

### Performance Summary

**Core Metrics**:
- Accuracy: [X%] (Target: ≥Y%)
- Precision: [X%] (Target: ≥Y%)
- Recall: [X%] (Target: ≥Y%)
- F1 Score: [X] (Target: ≥Y)

**Fairness Assessment**:
- Disparate Impact: [X] (Target: >0.8)
- Equal Opportunity: [Max difference: X] (Target: <0.1)
- Protected attributes analyzed: [List]

**Robustness**:
- Adversarial testing: [Pass/Fail]
- Distribution shift testing: [Pass/Fail]
- Edge case handling: [Pass/Fail]

### EU AI Act Compliance

- [ ] Accuracy requirements met (Article 15)
- [ ] Fairness requirements met
- [ ] Robustness demonstrated
- [ ] Validation methodology documented

### Recommendation

- [ ] **Approved for production**: All requirements met
- [ ] **Approved with monitoring**: Meets requirements, recommend ongoing monitoring
- [ ] **Not approved**: Requires improvements (list below)

**Required improvements** (if any):
1. [Improvement 1]
2. [Improvement 2]

### Next Steps

1. **If approved**: Create compliance documentation in `notebooks/compliance/`
2. **Tag this evaluation**: 
   ```bash
   just notebook tag notebooks/evaluations/[this-notebook].ipynb \
     --identifier [model]-[version]-eval \
     --message "Evaluation approved by [Your Name]"
   ```
3. **Reference in deployment**: Link to this evaluation in deployment documentation

**Evaluation Date**: [YYYY-MM-DD]
**Evaluator**: [Your Name]
**Next Evaluation**: [YYYY-MM-DD] (recommended: quarterly for high-risk systems)