# Classification Pipeline Example

This notebook demonstrates the **pure config-driven binary classification pipeline**.

## Key Features
- ✅ **Pure config-driven**: All parameters defined in YAML with variable references
- ✅ **No parameter duplication**: Variables defined once and referenced throughout
- ✅ **Template resolution**: Uses `${variable}` syntax for dynamic paths
- ✅ **Binary classification**: Supports logistic regression, SVM, and random forest
- ✅ **Hyperparameter tuning**: Grid search with cross-validation
- ✅ **Auto class balancing**: Detects and handles imbalanced datasets
- ✅ **Comprehensive metrics**: Accuracy, precision, recall, F1-score, ROC-AUC
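
The pipeline's own imbalance handling is not shown in this notebook. As a rough illustration of what detecting and handling imbalance can involve, the sketch below computes the majority-to-minority ratio and inverse-frequency class weights (the formula scikit-learn uses for `class_weight='balanced'`). The function names and the `IMBALANCE_THRESHOLD` value are assumptions made for this example, not part of the pipeline.

```python
from collections import Counter

# Hypothetical cut-off: treat the data as imbalanced above this ratio.
IMBALANCE_THRESHOLD = 1.5


def imbalance_ratio(labels):
    """Return the majority-class count divided by the minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


labels = [0] * 8 + [1] * 2  # toy data: 8 negatives, 2 positives
ratio = imbalance_ratio(labels)
if ratio > IMBALANCE_THRESHOLD:
    weights = balanced_class_weights(labels)
    print(f"Imbalanced (ratio {ratio:.1f}); class weights: {weights}")
```

On this toy data the ratio is 4.0, so the minority class receives weight 2.5 and the majority class 0.625.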

## Prerequisites
This pipeline requires embeddings generated by the embedding pipeline.
Make sure you have:
- `train_embeddings.csv`
- `test_embeddings.csv`

If you don't have these, run the embedding pipeline first (see `01_embedding_pipeline_example.ipynb`).

## Step 1: Import Dependencies

In [ ]:
import sys
sys.path.append('..')

from models.config_models import PipelineConfig
from pipelines.classification_pipeline import ClassificationPipeline
import json
from pathlib import Path

## Step 2: Load Configuration

The configuration uses template variables so that every path is derived from values defined once in the config. With the example config, they resolve as follows:
- `${job.name}` → `binary_classification_example`
- `${job.output_dir}` → `examples/outputs`
- `${output.embeddings_dir}` → `examples/outputs/embeddings`
- `${output.logs_dir}` → `examples/outputs/logs`
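
How `PipelineConfig.resolve_template_string` is implemented is not shown here. The following is a minimal sketch of one way `${dotted.key}` references could be resolved against a nested mapping. The `config` dictionary, the `lookup` and `resolve` helpers, and the way `output.embeddings_dir` is defined in terms of `job.output_dir` are all assumptions for illustration.

```python
import re

# Hypothetical config values, chosen to match the resolutions listed above.
config = {
    "job": {"name": "binary_classification_example", "output_dir": "examples/outputs"},
    "output": {
        "embeddings_dir": "${job.output_dir}/embeddings",
        "logs_dir": "${job.output_dir}/logs",
    },
}

_PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def lookup(cfg, dotted_key):
    """Walk a nested dict using a dotted key such as 'job.output_dir'."""
    value = cfg
    for part in dotted_key.split("."):
        value = value[part]  # raises KeyError for unknown references
    return value


def resolve(cfg, text, max_passes=10):
    """Substitute ${...} references repeatedly until none remain."""
    passes = 0
    while _PLACEHOLDER.search(text):
        if passes == max_passes:
            raise ValueError(f"Unresolved or cyclic reference in: {text}")
        text = _PLACEHOLDER.sub(lambda m: str(lookup(cfg, m.group(1))), text)
        passes += 1
    return text


print(resolve(config, "${output.embeddings_dir}/train_embeddings.csv"))
```

The repeated passes handle nested references: `${output.embeddings_dir}` first expands to `${job.output_dir}/embeddings`, which a second pass expands to `examples/outputs/embeddings`.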

In [ ]:
# Load the pure config-driven configuration
config = PipelineConfig.from_yaml('configs/classification_example_config.yaml')

print(f"Job name: {config.job.name}")
print(f"Output directory: {config.job.output_dir}")
print(f"Available classifiers: {config.classification.models}")
print(f"Cross-validation folds: {config.classification.cross_validation.n_folds}")
print(f"Scoring metric: {config.classification.cross_validation.scoring}")

# Show template resolution
print("\n=== Template Resolution ===")
print(f"Input train path: {config.input.train_embeddings_path}")
print(f"Resolved train path: {config.resolve_template_string(config.input.train_embeddings_path)}")
print(f"Log file template: {config.logging.file}")
print(f"Resolved log file: {config.resolve_template_string(config.logging.file)}")

## Step 3: Initialize Pipeline

The pipeline takes the loaded config as its only argument; no other parameters are needed.

In [ ]:
# Initialize the classification pipeline
pipeline = ClassificationPipeline(config)
print("Classification pipeline initialized successfully!")
print(f"Random seed: {config.job.random_seed}")
print(f"Available classifiers: {list(pipeline.classifiers.keys())}")

## Step 4: Check Input Data

Verify that the embedding files exist and have the correct format.

In [ ]:
# Check if embedding files exist
import pandas as pd  # imported up front so the test-file check works even if the train file is missing

train_path = config.resolve_template_string(config.input.train_embeddings_path)
test_path = config.resolve_template_string(config.input.test_embeddings_path)

print(f"Train embeddings: {train_path}")
print(f"  Exists: {Path(train_path).exists()}")
if Path(train_path).exists():
    train_df = pd.read_csv(train_path)
    print(f"  Rows: {len(train_df)}")
    print(f"  Columns: {list(train_df.columns)}")
    print(f"  Label distribution: {train_df['label'].value_counts().to_dict()}")

print(f"\nTest embeddings: {test_path}")
print(f"  Exists: {Path(test_path).exists()}")
if Path(test_path).exists():
    test_df = pd.read_csv(test_path)
    print(f"  Rows: {len(test_df)}")
    print(f"  Columns: {list(test_df.columns)}")
    print(f"  Label distribution: {test_df['label'].value_counts().to_dict()}")
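
The cell above prints the columns and label counts but leaves the judgment to the reader. A format check could be automated along the following lines. This is only a sketch: the `validate_embeddings_csv` helper and the `emb_0`/`emb_1` column names are hypothetical, and it assumes labels are stored as the strings `0` and `1`.

```python
import csv
import io


def validate_embeddings_csv(handle, label_column="label", allowed_labels=("0", "1")):
    """Return a list of problems found in an embeddings CSV (empty if none)."""
    reader = csv.DictReader(handle)
    if reader.fieldnames is None or label_column not in reader.fieldnames:
        return [f"missing '{label_column}' column"]

    seen = set()
    n_rows = 0
    for row in reader:
        n_rows += 1
        seen.add(str(row[label_column]).strip())

    problems = []
    if n_rows == 0:
        problems.append("file has no data rows")
    unexpected = seen - set(allowed_labels)
    if unexpected:
        problems.append(f"unexpected labels: {sorted(unexpected)}")
    if n_rows and len(seen & set(allowed_labels)) < 2:
        problems.append("only one class present")
    return problems


# In-memory stand-ins for real files; the embedding column names are made up.
good = io.StringIO("emb_0,emb_1,label\n0.1,0.2,0\n0.3,0.4,1\n")
bad = io.StringIO("emb_0,emb_1\n0.1,0.2\n")
print(validate_embeddings_csv(good))
print(validate_embeddings_csv(bad))
```

With a real file, the same function can be called as `validate_embeddings_csv(open(train_path, newline=""))`.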

## Step 5: Run Classification Pipeline

### 5.1 Single Classifier Example

In [ ]:
# Run with logistic regression (first classifier in config)
print("=== Running Logistic Regression ===")
result_lr = pipeline.run(classifier_type='logistic_regression')

print("\nResults:")
print(f"Classifier: {result_lr['classifier_type']}")
print(f"Best CV Score: {result_lr['best_cv_score']:.4f}")
print(f"Best Parameters: {result_lr['best_params']}")
print("\nTest Metrics:")
for metric, value in result_lr['test_metrics'].items():
    if metric != 'confusion_matrix' and value is not None:
        print(f"  {metric}: {value:.4f}")
print(f"\nModel saved to: {result_lr['model_path']}")

### 5.2 All Classifiers Comparison

In [ ]:
# Run all classifiers from config and compare results
print("=== Running All Classifiers ===")
results = {}

for classifier_type in config.classification.models:
    print(f"\n--- {classifier_type.upper()} ---")
    result = pipeline.run(classifier_type=classifier_type)
    results[classifier_type] = result
    
    print(f"Best CV Score: {result['best_cv_score']:.4f}")
    print(f"Test Accuracy: {result['test_metrics']['accuracy']:.4f}")
    print(f"Test F1-Score: {result['test_metrics']['f1_score']:.4f}")
    if result['test_metrics']['roc_auc'] is not None:
        print(f"Test ROC-AUC: {result['test_metrics']['roc_auc']:.4f}")
    print(f"Model path: {result['model_path']}")

### 5.3 Results Summary

In [ ]:
# Create a comparison summary
print("=== CLASSIFICATION RESULTS SUMMARY ===")
print(f"{'Classifier':<20} {'CV Score':<10} {'Accuracy':<10} {'F1-Score':<10} {'ROC-AUC':<10}")
print("-" * 70)

for classifier_type, result in results.items():
    cv_score = result['best_cv_score']
    accuracy = result['test_metrics']['accuracy']
    f1_score = result['test_metrics']['f1_score']
    roc_auc = result['test_metrics']['roc_auc']
    roc_auc_str = f"{roc_auc:.4f}" if roc_auc is not None else "N/A"
    
    print(f"{classifier_type:<20} {cv_score:<10.4f} {accuracy:<10.4f} {f1_score:<10.4f} {roc_auc_str:<10}")

# Find best performer
best_classifier = max(results.keys(), key=lambda k: results[k]['test_metrics']['accuracy'])
print(f"\nBest Performer: {best_classifier} (Accuracy: {results[best_classifier]['test_metrics']['accuracy']:.4f})")
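
The summary picks the best performer by accuracy. To make the reported metrics concrete, here is a small sketch of how accuracy, precision, recall, and F1 follow from the confusion-matrix counts. The `binary_metrics` helper is hypothetical, and the `precision` and `recall` key names are assumed (only `accuracy` and `f1_score` appear in the cells above).

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute basic binary classification metrics from two label sequences."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    # Fall back to 0.0 when a denominator is zero (no predicted or actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1_score": f1,
    }


# Toy imbalanced data: a model that always predicts the majority class.
y_true = [0] * 8 + [1] * 2
always_negative = [0] * 10
print(binary_metrics(y_true, always_negative))
```

On this toy data the always-negative model reaches 0.8 accuracy with an F1-score of 0.0, which is why it can be worth comparing F1 or ROC-AUC alongside accuracy when the classes are imbalanced.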

## Step 6: Inspect Saved Models

In [ ]:
# Check what models were saved
import pickle

models_base_dir = config.resolve_template_string("${job.output_dir}/models")
models_path = Path(models_base_dir)

print(f"Models directory: {models_path}")
print("Models saved:")

if models_path.exists():
    for model_dir in models_path.iterdir():
        if model_dir.is_dir():
            print(f"\n  {model_dir.name}/")
            for file in model_dir.iterdir():
                print(f"    {file.name}")
                
                # Load and inspect the logistic regression model (only unpickle files you trust)
                if file.suffix == '.pkl' and 'logistic_regression' in file.name:
                    with open(file, 'rb') as f:
                        model_data = pickle.load(f)
                    print(f"      Pipeline version: {model_data.get('pipeline_version', 'N/A')}")
                    print(f"      Job name: {model_data.get('job_name', 'N/A')}")
                    print(f"      Timestamp: {model_data.get('timestamp', 'N/A')}")
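
The inspection above reads `pipeline_version`, `job_name`, and `timestamp` from the unpickled object, which suggests the model is stored together with metadata in a dictionary. A minimal sketch of such a round trip follows. The `save_model_bundle` and `load_model_bundle` helpers, the `model` key, the stand-in model object, and the version string are assumptions; the pipeline's actual save format may differ.

```python
import pickle
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def save_model_bundle(path, model, job_name, pipeline_version):
    """Pickle a model together with metadata describing how it was produced."""
    bundle = {
        "model": model,  # hypothetical key for the fitted estimator
        "job_name": job_name,
        "pipeline_version": pipeline_version,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    with open(path, "wb") as f:
        pickle.dump(bundle, f)


def load_model_bundle(path):
    """Load a bundle written by save_model_bundle. Only use on trusted files."""
    with open(path, "rb") as f:
        return pickle.load(f)


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "logistic_regression.pkl"
    # A plain dict stands in for a fitted classifier in this sketch.
    save_model_bundle(path, {"coef": [0.5, -1.2]}, "binary_classification_example", "0.0.0")
    loaded = load_model_bundle(path)

print(loaded["job_name"], loaded["pipeline_version"], loaded["timestamp"])
```

Keeping the metadata next to the model makes it possible to tell which job and pipeline version produced a given `.pkl` file without consulting the logs.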

## Step 7: Validate Configuration

In [ ]:
# Verify that all paths are properly resolved from config
print("=== Configuration Validation ===")
print(f"Job name: {config.job.name}")
print(f"Output directory: {config.job.output_dir}")
print(f"Random seed: {config.job.random_seed}")

print("\nPath Resolution:")
print(f"Train embeddings: {config.input.train_embeddings_path}")
print(f"  → {config.resolve_template_string(config.input.train_embeddings_path)}")
print(f"Test embeddings: {config.input.test_embeddings_path}")
print(f"  → {config.resolve_template_string(config.input.test_embeddings_path)}")
print(f"Log file: {config.logging.file}")
print(f"  → {config.resolve_template_string(config.logging.file)}")

print("\nHyperparameter Grids:")
for classifier in config.classification.models:
    params = getattr(config.classification.hyperparameter_search, classifier)
    print(f"  {classifier}: {len(params)} parameters")
    for param, values in params.items():
        print(f"    {param}: {values}")

print(f"\nCross-validation: {config.classification.cross_validation.n_folds} folds")
print(f"Scoring metric: {config.classification.cross_validation.scoring}")
print(f"Parallel jobs: {config.classification.cross_validation.n_jobs}")
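
The hyperparameter grids and fold count printed above determine how much work a grid search does: every combination of values is a candidate, and each candidate is fitted once per fold. The sketch below expands a grid and counts the fits. The `expand_grid` helper and the example grid values are hypothetical; the real grids come from `config.classification.hyperparameter_search`.

```python
from itertools import product


def expand_grid(param_grid):
    """Return every combination of values in a {name: [values]} grid."""
    names = list(param_grid)
    value_lists = (param_grid[name] for name in names)
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


# Hypothetical grid for a logistic regression classifier.
grid = {"C": [0.1, 1.0, 10.0], "class_weight": [None, "balanced"]}
n_folds = 5

candidates = expand_grid(grid)
total_fits = len(candidates) * n_folds
print(f"{len(candidates)} candidates x {n_folds} folds = {total_fits} fits")
for params in candidates:
    print(params)
```

Here 3 values of `C` and 2 of `class_weight` give 6 candidates, so 5-fold cross-validation needs 30 fits, plus one more if the search refits the best candidate on the full training set. Adding values to a grid multiplies this count, which is worth keeping in mind when setting `n_jobs`.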

## Summary

This notebook demonstrated the **pure config-driven classification pipeline**.

### ✅ **Key Achievements**
- **No parameter duplication**: All parameters defined once in config
- **Template variables**: Dynamic path resolution with `${variable}` syntax
- **Sequential config usage**: Load config → Initialize pipeline → Run
- **Multiple classifiers**: Logistic regression, SVM, random forest
- **Hyperparameter tuning**: Grid search with cross-validation
- **Comprehensive evaluation**: Multiple metrics with model saving

### 🔧 **Configuration Features**
- Variable references: `${job.name}`, `${job.output_dir}`, `${output.logs_dir}`
- Binary classification focus (labels: 0, 1)
- Automatic class imbalance detection
- Configurable hyperparameter grids
- Cross-validation settings
- Logging configuration

### 📊 **Output Structure**
```
examples/outputs/
├── models/
│   ├── binary_classification_example_logistic_regression/
│   ├── binary_classification_example_svm/
│   └── binary_classification_example_random_forest/
└── logs/
    └── classification_pipeline.log
```

With this setup, the pipeline's parameters all come from the YAML config rather than being hardcoded in the notebook.