# Classification Pipeline Example

This notebook demonstrates how to train binary classifiers using pre-generated embeddings from the MediClaimGPT embedding pipeline.

## Overview
- **Input**: Pre-generated embedding files from the embedding pipeline
- **Process**: Train multiple classifiers (Logistic Regression, SVM, Random Forest)
- **Output**: Trained models, performance metrics, and comparison reports

## Prerequisites
1. Run the embedding pipeline first (see `01_embedding_pipeline_example.ipynb`)
2. Ensure embedding files exist in `examples/outputs/embeddings/`

## Setup and Imports

In [1]:
import sys
import os
import pandas as pd
import numpy as np
from pathlib import Path
import json

# Add project root to path (assumes the notebook is run from the examples/ directory)
project_root = Path.cwd().parent
sys.path.append(str(project_root))

from pipelines.classification_pipeline import ClassificationPipeline
from models.config_models import PipelineConfig
from utils.logging_utils import get_logger

print(f"Project root: {project_root}")
print(f"Working directory: {os.getcwd()}")

Project root: /home/kosaraju/mgpt_eval
Working directory: /home/kosaraju/mgpt_eval/examples


## Configuration Setup

In [2]:
# Load configuration
config_path = "configs/classification_example_config.yaml"
config = PipelineConfig.from_yaml(config_path)

print("Configuration loaded successfully!")
print(f"Job name: {config.job.name}")
print(f"Output directory: {config.job.output_dir}")
print(f"Pipeline stages: {config.pipeline_stages}")
print(f"Classifiers: {config.classification.models}")

Configuration loaded successfully!
Job name: classification_training_example
Output directory: /home/kosaraju/mgpt_eval/examples/outputs
Pipeline stages: embeddings=False classification=True evaluation=True target_word_eval=False summary_report=True method_comparison=True
Classifiers: ['logistic_regression', 'svm', 'random_forest']


## Input Data Verification

In [3]:
# Check if embedding files exist
train_embeddings_path = Path(config.input.train_embeddings_path)
test_embeddings_path = Path(config.input.test_embeddings_path)

print("Checking for embedding files...")
print(f"Train embeddings: {train_embeddings_path}")
print(f"  - Exists: {train_embeddings_path.exists()}")
if train_embeddings_path.exists():
    train_df = pd.read_csv(train_embeddings_path)
    print(f"  - Shape: {train_df.shape}")
    print(f"  - Columns: {list(train_df.columns)}")
    print(f"  - Sample data:")
    print(train_df.head(2))

print(f"\nTest embeddings: {test_embeddings_path}")
print(f"  - Exists: {test_embeddings_path.exists()}")
if test_embeddings_path.exists():
    test_df = pd.read_csv(test_embeddings_path)
    print(f"  - Shape: {test_df.shape}")
    print(f"  - Columns: {list(test_df.columns)}")

Checking for embedding files...
Train embeddings: /home/kosaraju/mgpt_eval/examples/outputs/embeddings/train_embeddings.csv
  - Exists: True
  - Shape: (16, 3)
  - Columns: ['mcid', 'label', 'embedding']
  - Sample data:
   mcid  label                                          embedding
0    20      0  [0.02810053527355194, -0.2548440098762512, -1....
1    16      0  [-0.134785994887352, -0.01184933539479971, -0....

Test embeddings: /home/kosaraju/mgpt_eval/examples/outputs/embeddings/test_embeddings.csv
  - Exists: True
  - Shape: (4, 3)
  - Columns: ['mcid', 'label', 'embedding']
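The sample rows above show the `embedding` column as a bracketed list rendered as text, which suggests each cell is stored as a string in the CSV. A minimal sketch of converting such cells into numeric vectors, assuming every cell is a JSON-style list of floats of equal length (the helper name `parse_embeddings` is hypothetical and not part of the pipeline):

```python
import json

def parse_embeddings(cells):
    """Convert string-encoded vectors such as "[0.1, -0.2]" into lists of floats."""
    vectors = [json.loads(cell) for cell in cells]
    # All embeddings should share one dimensionality before training
    dims = {len(v) for v in vectors}
    if len(dims) > 1:
        raise ValueError(f"Inconsistent embedding sizes: {sorted(dims)}")
    return vectors

# Illustrative values only, not taken from the real embedding files
sample_cells = ["[0.5, -0.25, 1.0]", "[0.0, 0.75, -1.5]"]
matrix = parse_embeddings(sample_cells)
print(len(matrix), len(matrix[0]))
```

With the DataFrames loaded in this cell, the same helper could be called as `parse_embeddings(train_df["embedding"])` and the result wrapped in `np.array(...)` to obtain a feature matrix.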


### Fallback: Use Available Data if Embeddings Don't Exist

In [4]:
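# List the available embedding files and point the config at the first one found.
# Note: this cell overrides the configured paths whenever any CSV exists,
# even if the configured train/test files were found in the previous step.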
embeddings_dir = Path("outputs/embeddings")
if embeddings_dir.exists():
    embedding_files = list(embeddings_dir.glob("*.csv"))
    print(f"Available embedding files in {embeddings_dir}:")
    for file in embedding_files:
        print(f"  - {file.name}")
        
    if embedding_files:
        # Use first available embedding file for demonstration
        sample_file = embedding_files[0]
        print(f"\nUsing {sample_file.name} for this example...")
        
        # Update config to use available data
        config.input.train_embeddings_path = str(sample_file)
        config.input.test_embeddings_path = str(sample_file)  # Same file for train and test (demo only; test metrics will be optimistic)
        
        print("Updated config to use available data.")
else:
    print("No embedding files found. Please run the embedding pipeline first.")

Available embedding files in outputs/embeddings:
  - sample_embeddings.csv
  - train_embeddings.csv
  - test_embeddings.csv

Using sample_embeddings.csv for this example...
Updated config to use available data.
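In the run shown above, the fallback replaced both paths with `sample_embeddings.csv` even though `train_embeddings.csv` and `test_embeddings.csv` were found in the previous step, and training and testing on the same file inflates the test metrics. A minimal sketch of a fallback that only triggers when the configured file is missing (the helper `resolve_embedding_path` is hypothetical, and the demo uses a temporary directory rather than the real outputs):

```python
from pathlib import Path
import tempfile

def resolve_embedding_path(preferred, fallback_dir):
    """Return `preferred` if it exists, otherwise the first CSV (by name) in `fallback_dir`."""
    preferred = Path(preferred)
    if preferred.exists():
        return preferred
    candidates = sorted(Path(fallback_dir).glob("*.csv"))
    if not candidates:
        raise FileNotFoundError(f"No embedding CSVs found in {fallback_dir}")
    return candidates[0]

# Demo against a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "train_embeddings.csv").write_text("mcid,label,embedding\n")
    (tmp / "sample_embeddings.csv").write_text("mcid,label,embedding\n")
    kept = resolve_embedding_path(tmp / "train_embeddings.csv", tmp)
    swapped = resolve_embedding_path(tmp / "missing.csv", tmp)
    print(kept.name, swapped.name)
```

Sorting the candidates makes the choice deterministic, since `glob` does not guarantee an order.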


## Initialize Classification Pipeline

In [5]:
# Setup logging
logger = get_logger("classification_pipeline", config.logging)

# Initialize classification pipeline
pipeline = ClassificationPipeline(config)


print("Classification pipeline initialized successfully!")

Classification pipeline initialized successfully!


## Run Classification Pipeline

In [6]:
# Run the classification pipeline for each classifier
results_summary = {}

try:
    for classifier_type in config.classification.models:
        print(f"\n🚀 Training {classifier_type} classifier...")
        
        # Run pipeline for this classifier
        results = pipeline.run(
            train_embeddings=config.input.train_embeddings_path,
            test_embeddings=config.input.test_embeddings_path,
            classifier_type=classifier_type,
            output_dir=config.job.output_dir
        )
        
        results_summary[classifier_type] = results
        print(f"✅ {classifier_type} completed successfully!")
        # Format the accuracy only when it is numeric; otherwise print it as-is
        test_accuracy = results.get('test_accuracy', 'N/A')
        if isinstance(test_accuracy, float):
            test_accuracy = f"{test_accuracy:.4f}"
        print(f"   Test accuracy: {test_accuracy}")
    
    print(f"\n🎉 All classifiers completed successfully!")
    print(f"Trained {len(results_summary)} classifiers: {list(results_summary.keys())}")
    
except Exception as e:
    print(f"❌ Pipeline execution failed: {e}")
    import traceback
    traceback.print_exc()

Limited samples (20) for 48 parameter combinations. Consider reducing parameter grid complexity.



🚀 Training logistic_regression classifier...
Fitting 5 folds for each of 48 candidates, totalling 240 fits


Limited samples (20) for 144 parameter combinations. Consider reducing parameter grid complexity.


✅ logistic_regression completed successfully!
   Test accuracy: N/A

🚀 Training svm classifier...
Fitting 5 folds for each of 144 candidates, totalling 720 fits


Limited samples (20) for 432 parameter combinations. Consider reducing parameter grid complexity.


✅ svm completed successfully!
   Test accuracy: N/A

🚀 Training random_forest classifier...
Fitting 5 folds for each of 432 candidates, totalling 2160 fits
❌ Pipeline execution failed: Grid search failed: The 'max_features' parameter of RandomForestClassifier must be an int in the range [1, inf), a float in the range (0.0, 1.0], a str among {'log2', 'sqrt'} or None. Got 'None' instead.


joblib.externals.loky.process_executor._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/kosaraju/miniconda3/envs/mgpt-eval/lib/python3.13/site-packages/joblib/externals/loky/process_executor.py", line 463, in _process_worker
    r = call_item()
  File "/home/kosaraju/miniconda3/envs/mgpt-eval/lib/python3.13/site-packages/joblib/externals/loky/process_executor.py", line 291, in __call__
    return self.fn(*self.args, **self.kwargs)
           ~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kosaraju/miniconda3/envs/mgpt-eval/lib/python3.13/site-packages/joblib/parallel.py", line 598, in __call__
    return [func(*args, **kwargs)
            ~~~~^^^^^^^^^^^^^^^^^
  File "/home/kosaraju/miniconda3/envs/mgpt-eval/lib/python3.13/site-packages/sklearn/utils/parallel.py", line 139, in __call__
    return self.function(*args, **kwargs)
           ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/home/kosaraju/miniconda3/envs/mgpt-eval/lib/python3.13/site-packages/sklearn/model_sel
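The error message reports `Got 'None' instead`, with quotes, which indicates that `max_features` reached scikit-learn as the string `'None'` rather than Python's `None`. One plausible cause is a grid entry written as `None` in the YAML config: YAML spells null as `null` or `~`, and reads `None` as a plain string. A minimal sketch of normalizing a grid before it is handed to the search, assuming the grid is a dict of lists (the helper `normalize_param_grid` and the grid values are hypothetical):

```python
def normalize_param_grid(grid):
    """Replace string spellings of null ("None", "null") with Python None."""
    null_spellings = {"none", "null"}
    return {
        name: [
            None if isinstance(v, str) and v.strip().lower() in null_spellings else v
            for v in values
        ]
        for name, values in grid.items()
    }

# Hypothetical grid as it might arrive from a YAML file
raw_grid = {"max_features": ["sqrt", "log2", "None"], "n_estimators": [100, 200]}
clean_grid = normalize_param_grid(raw_grid)
print(clean_grid["max_features"])
```

Writing `null` in the YAML file instead of `None` would avoid the need for this step.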

## Results Analysis

In [9]:
# Check output directory for results
output_dir = Path(config.job.output_dir)
models_dir = output_dir / "models"
metrics_dir = output_dir / "metrics"

print(f"Output directory: {output_dir}")
print(f"Models directory: {models_dir}")
print(f"Metrics directory: {metrics_dir}")

# List generated models (they may be in the root output directory)
model_files = list(output_dir.glob("*_model_*.pkl"))
metric_files = list(output_dir.glob("*_metrics_*.json"))

print(f"\nGenerated model files: {len(model_files)}")
for model_file in model_files:
    print(f"  - {model_file.name}")

print(f"\nGenerated metrics files: {len(metric_files)}")
for metric_file in metric_files:
    print(f"  - {metric_file.name}")

# Also check if there are any files in the models subdirectory
if models_dir.exists():
    subdir_models = list(models_dir.glob("**/*.pkl"))
    subdir_metrics = list(models_dir.glob("**/*.json"))
    if subdir_models or subdir_metrics:
        print(f"\nFiles in models subdirectory:")
        print(f"  Models: {len(subdir_models)}, Metrics: {len(subdir_metrics)}")
        for file in subdir_models + subdir_metrics:
            print(f"    - {file.name}")
else:
    print(f"\nModels directory does not exist: {models_dir}")

Output directory: /home/kosaraju/mgpt_eval/examples/outputs
Models directory: /home/kosaraju/mgpt_eval/examples/outputs/models
Metrics directory: /home/kosaraju/mgpt_eval/examples/outputs/metrics

Generated model files: 3
  - svm_model_20250604_002255.pkl
  - logistic_regression_model_20250604_002255.pkl
  - logistic_regression_model_20250604_002100.pkl

Generated metrics files: 3
  - logistic_regression_metrics_20250604_002255.json
  - svm_metrics_20250604_002255.json
  - logistic_regression_metrics_20250604_002100.json

Models directory does not exist: /home/kosaraju/mgpt_eval/examples/outputs/models


## Load and Display Model Performance

In [10]:
# Load and display model metrics
output_dir = Path(config.job.output_dir)

# Get all metric files from the output directory
metric_files = list(output_dir.glob("*_metrics_*.json"))

if metric_files:
    performance_summary = []
    
    print("🏆 MODEL PERFORMANCE SUMMARY")
    print("=" * 50)
    
    for metrics_file in metric_files:
        try:
            with open(metrics_file, 'r') as f:
                metrics = json.load(f)
            
            # Extract classifier type from filename
            # Format: classifier_type_metrics_timestamp.json
            model_name = metrics_file.name.split('_metrics_')[0]
            
            # Get test metrics
            test_metrics = metrics.get('test_metrics', {})
            
            perf = {
                'model': model_name.replace('_', ' ').title(),
                'accuracy': test_metrics.get('accuracy', 'N/A'),
                'precision': test_metrics.get('precision', 'N/A'),
                'recall': test_metrics.get('recall', 'N/A'),
                'f1_score': test_metrics.get('f1_score', 'N/A'),
                'roc_auc': test_metrics.get('roc_auc', 'N/A')
            }
            
            performance_summary.append(perf)
            
            print(f"\n{perf['model'].upper()} Performance:")
            for metric, value in perf.items():
                if metric != 'model':
                    if isinstance(value, float):
                        print(f"  {metric.replace('_', ' ').title():12}: {value:.4f}")
                    else:
                        print(f"  {metric.replace('_', ' ').title():12}: {value}")
                        
        except Exception as e:
            print(f"Error loading metrics for {metrics_file.name}: {e}")
    
    # Create comparison DataFrame
    if performance_summary:
        comparison_df = pd.DataFrame(performance_summary)
        
        print("\n" + "="*60)
        print("MODEL COMPARISON SUMMARY")
        print("="*60)
        
        # Format numeric columns for better display
        numeric_cols = ['accuracy', 'precision', 'recall', 'f1_score', 'roc_auc']
        for col in numeric_cols:
            if col in comparison_df.columns:
                comparison_df[col] = comparison_df[col].apply(
                    lambda x: f"{x:.4f}" if isinstance(x, float) else str(x)
                )
        
        print(comparison_df.to_string(index=False))
        
        # Find best performing model for each metric
        print("\n🥇 BEST PERFORMERS:")
        for col in numeric_cols:
            if col in comparison_df.columns:
                # Convert back to float for comparison
                numeric_values = pd.to_numeric(comparison_df[col], errors='coerce')
                if not numeric_values.isna().all():
                    best_idx = numeric_values.idxmax()
                    best_model = comparison_df.loc[best_idx, 'model']
                    best_value = comparison_df.loc[best_idx, col]
                    print(f"  {col.replace('_', ' ').title():12}: {best_model} ({best_value})")
else:
    print("No model metrics found. The classification pipeline may not have completed successfully.")
    print(f"Checked directory: {output_dir}")
    print("Available files:", list(output_dir.glob("*.json")))

🏆 MODEL PERFORMANCE SUMMARY

LOGISTIC REGRESSION Performance:
  Accuracy    : 1.0000
  Precision   : 1.0000
  Recall      : 1.0000
  F1 Score    : 1.0000
  Roc Auc     : 1.0000

SVM Performance:
  Accuracy    : 1.0000
  Precision   : 1.0000
  Recall      : 1.0000
  F1 Score    : 1.0000
  Roc Auc     : 0.0000

LOGISTIC REGRESSION Performance:
  Accuracy    : 1.0000
  Precision   : 1.0000
  Recall      : 1.0000
  F1 Score    : 1.0000
  Roc Auc     : 1.0000

MODEL COMPARISON SUMMARY
              model accuracy precision recall f1_score roc_auc
Logistic Regression   1.0000    1.0000 1.0000   1.0000  1.0000
                Svm   1.0000    1.0000 1.0000   1.0000  0.0000
Logistic Regression   1.0000    1.0000 1.0000   1.0000  1.0000

🥇 BEST PERFORMERS:
  Accuracy    : Logistic Regression (1.0000)
  Precision   : Logistic Regression (1.0000)
  Recall      : Logistic Regression (1.0000)
  F1 Score    : Logistic Regression (1.0000)
  Roc Auc     : Logistic Regression (1.0000)
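Logistic Regression appears twice in the summary above because the output directory still holds a metrics file from an earlier run (`..._002100.json`) next to the newer one (`..._002255.json`). A minimal sketch of keeping only the newest file per classifier, assuming the `<model>_metrics_<timestamp>.json` naming seen in the listing (the helper `latest_per_model` is hypothetical):

```python
from pathlib import Path

def latest_per_model(metric_files):
    """Keep only the newest metrics file for each classifier name."""
    latest = {}
    for path in metric_files:
        model_name, _, stamp = Path(path).stem.partition("_metrics_")
        # Timestamps like 20250604_002255 sort correctly as plain strings
        if model_name not in latest or stamp > latest[model_name][0]:
            latest[model_name] = (stamp, Path(path))
    return {name: path for name, (stamp, path) in latest.items()}

files = [
    "logistic_regression_metrics_20250604_002255.json",
    "svm_metrics_20250604_002255.json",
    "logistic_regression_metrics_20250604_002100.json",
]
for name, path in sorted(latest_per_model(files).items()):
    print(name, path.name)
```

The SVM's ROC AUC of 0.0000 alongside perfect accuracy is also worth checking separately; a value of 0 can occur when the scores passed to the AUC computation are oriented toward the negative class.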


## Next Steps

After running this classification pipeline:

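1. **Model Evaluation**: Review the performance metrics above to compare classifiers; with only 20 samples and the same file used for training and testing, the perfect scores are not a reliable estimate of real performance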
2. **Best Model Selection**: Choose the model with the best performance for your use case
3. **Hyperparameter Tuning**: Adjust the hyperparameter grids in the config for better performance
4. **End-to-End Pipeline**: Run the complete pipeline (embeddings + classification) using the end-to-end example

## Configuration Customization

To customize this example for your data:

1. **Input Paths**: Update `train_embeddings_path` and `test_embeddings_path` in the config
2. **Classifiers**: Modify the `models` list to include/exclude classifiers
3. **Hyperparameters**: Adjust the search grids for each classifier
4. **Evaluation Metrics**: Add/remove metrics in the evaluation configuration
5. **Cross-Validation**: Change the number of folds or scoring metric
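The run log reports, for example, "Fitting 5 folds for each of 48 candidates, totalling 240 fits" on 20 samples. The number of candidates is the product of the number of values listed for each hyperparameter, so trimming any one list shrinks the search multiplicatively. A sketch with a hypothetical grid (the parameter names and values are illustrative and are not taken from the example config):

```python
from math import prod

def grid_size(grid):
    """Number of hyperparameter combinations in a dict-of-lists grid."""
    return prod(len(values) for values in grid.values())

# Hypothetical logistic regression grid: 6 * 2 * 2 * 2 = 48 candidates
full_grid = {
    "C": [0.001, 0.01, 0.1, 1, 10, 100],
    "penalty": ["l1", "l2"],
    "solver": ["liblinear", "saga"],
    "max_iter": [1000, 2000],
}
# Trimmed grid for very small datasets: 3 * 1 * 1 * 1 = 3 candidates
small_grid = {"C": [0.1, 1, 10], "penalty": ["l2"], "solver": ["liblinear"], "max_iter": [1000]}

cv_folds = 5
for name, grid in [("full", full_grid), ("small", small_grid)]:
    n = grid_size(grid)
    print(f"{name}: {n} candidates, {n * cv_folds} fits")
```

Checking the count before launching a search makes it easier to keep the grid proportionate to the amount of data.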

## Files Generated

This pipeline generates:
- **Models**: Trained classifier models (.pkl files)
- **Metrics**: Performance metrics (JSON files)
- **Logs**: Detailed execution logs
- **Reports**: Comparison reports (if enabled)