# Protein Analysis Pipeline Orchestrator

This notebook orchestrates the complete protein-based machine learning analysis pipeline, reproducing the R methodology with modular Python components.

## Overview

This pipeline implements:
1. **Data Loading & Validation** - Load protein expression and phenotype data
2. **PWAS Analysis** - Protein-Wide Association Study with multiple testing correction
3. **Feature Importance** - Random Forest importance ranking
4. **Feature Selection** - Recursive Feature Elimination (RFE)
5. **Model Training** - Multiple ML models with cross-validation
6. **Results Analysis** - Performance metrics and visualization
7. **Checkpoint System** - Save/load intermediate results

The analysis reproduces the methodology from:
`/home/itg/oleg.vlasovets/projects/protein-benchmark/archive/R_analysis/2_train_agnostic_with_checkpoints.Rmd`

---

### Setup: activate the `protein-benchmark` conda environment and run Jupyter inside it
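
To confirm from inside the notebook that the kernel really runs in the intended environment, one option is to inspect `CONDA_DEFAULT_ENV`, which conda sets when an environment is activated. This is a minimal sketch; the helper name `in_expected_env` is hypothetical, and the variable may be unset if the kernel was started through a registered kernelspec rather than from an activated shell.

```python
import os
import sys

def in_expected_env(expected="protein-benchmark", environ=None):
    """Return True if the active conda environment name matches `expected`."""
    environ = os.environ if environ is None else environ
    # conda sets CONDA_DEFAULT_ENV when an environment is activated
    return environ.get("CONDA_DEFAULT_ENV") == expected

print(f"Python executable: {sys.executable}")
print(f"In expected env: {in_expected_env()}")
```

If the check fails, the path printed for `sys.executable` usually shows which interpreter the kernel is actually using.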

## Section 1: Import Required Libraries and Modules

Import necessary Python libraries and custom pipeline modules for each analysis step.

In [1]:
# Standard libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
import sys
import os
from datetime import datetime

# Machine learning libraries
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import roc_auc_score, classification_report, confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Import custom pipeline modules
sys.path.append('/home/itg/oleg.vlasovets/projects/protein-benchmark')

try:
    from pipeline import (
        DataLoader, PWASAnalyzer, FeatureImportanceAnalyzer,
        FeatureSelector, ModelTrainer, CheckpointSystem,
        validate_data_compatibility, create_results_summary,
        print_pipeline_summary, timer, get_system_info
    )
    print("✅ Successfully imported all pipeline modules")
except ImportError as e:
    print(f"❌ Error importing pipeline modules: {e}")
    print("Make sure the pipeline modules are properly installed")

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
plt.style.use('default')
sns.set_palette("husl")

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=UserWarning)

print(f"📅 Notebook started at: {datetime.now()}")
print(f"💻 System info: {get_system_info() if 'get_system_info' in globals() else 'N/A'}")

   Install with: pip install imbalanced-learn
✅ Successfully imported all pipeline modules
📅 Notebook started at: 2025-10-17 20:30:26.376392
💻 System info: {'platform': 'Linux-5.14.0-570.25.1.el9_6.x86_64-x86_64-with-glibc2.34', 'python_version': '3.10.19', 'cpu_count': 32, 'memory_total': '754.0 GB', 'memory_available': '660.0 GB'}


## Section 2: Load and Validate Configuration

Set up configuration parameters and validate input paths, model parameters, and checkpoint settings.
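
The configuration cell below checks that the input files exist, but not that the numeric parameters are sensible. A small validator along the following lines could catch such mistakes early. It is only a sketch: the function name `validate_params` and the chosen bounds are assumptions, and the allowed `pred_mode` values are taken from the comment in the configuration.

```python
def validate_params(config):
    """Return a list of human-readable problems found in the config."""
    problems = []
    analysis = config.get("analysis_params", {})
    pwas = config.get("pwas_params", {})

    # Modes listed in the CONFIG comment: 'prot_only', 'sexagebmi', 'comb'
    if analysis.get("pred_mode") not in {"prot_only", "sexagebmi", "comb"}:
        problems.append(f"unknown pred_mode: {analysis.get('pred_mode')!r}")

    cv_folds = analysis.get("cv_folds")
    if not isinstance(cv_folds, int) or cv_folds < 2:
        problems.append("cv_folds must be an integer >= 2")

    # Assumed bound: thresholds are probabilities in (0, 1]
    for key in ("fdr_threshold", "p_threshold"):
        value = pwas.get(key)
        if not isinstance(value, (int, float)) or not 0 < value <= 1:
            problems.append(f"{key} must be in (0, 1]")
    return problems

example = {
    "analysis_params": {"pred_mode": "prot_only", "cv_folds": 10},
    "pwas_params": {"fdr_threshold": 0.05, "p_threshold": 0.05},
}
print(validate_params(example))
```

Returning a list of problems instead of raising on the first one lets the notebook report every issue in a single pass.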

In [3]:
# Use the %pip magic so the package is installed into the running kernel's environment
%pip install imbalanced-learn



In [6]:
# Configuration parameters
CONFIG = {
    # Data paths (updated to use raw UKBB data files from R analysis)
    'data_paths': {
        'protein_train': '/home/itg/oleg.vlasovets/projects/protein-benchmark/data/raw/ukbb_proteins_train.csv',
        'phenotype_train': '/home/itg/oleg.vlasovets/projects/protein-benchmark/data/raw/ukbb_phenotypes_train.csv',
        'protein_val': '/home/itg/oleg.vlasovets/projects/protein-benchmark/data/raw/ukbb_proteins_val.csv',
        'phenotype_val': '/home/itg/oleg.vlasovets/projects/protein-benchmark/data/raw/ukbb_phenotypes_val.csv'
    },
    
    # Analysis parameters
    'analysis_params': {
        'target_column': 'oa_status',
        'pred_mode': 'prot_only',  # 'prot_only', 'sexagebmi', 'comb'
        'random_state': 42,
        'cv_folds': 10
    },
    
    # PWAS parameters  
    'pwas_params': {
        'fdr_threshold': 0.05,
        'p_threshold': 0.05,
        'max_proteins': 200
    },
    
    # Feature selection parameters
    'feature_params': {
        'rf_n_estimators': 500,
        'rfe_n_features': 20,
        'rfe_step': 1,
        'use_rfe_cv': True
    },
    
    # Model parameters
    'model_params': {
        'hyperparameter_tuning': False,
        'scale_features': False,
        'model_types': ['random_forest', 'logistic_regression', 'xgboost'] 
    },
    
    # Output paths
    'output_paths': {
        'base_dir': '/home/itg/oleg.vlasovets/projects/protein-benchmark/results/pipeline_run',
        'checkpoints': '/home/itg/oleg.vlasovets/projects/protein-benchmark/results/pipeline_run/checkpoints',
        'plots': '/home/itg/oleg.vlasovets/projects/protein-benchmark/results/pipeline_run/plots',
        'reports': '/home/itg/oleg.vlasovets/projects/protein-benchmark/results/pipeline_run/reports'
    }
}

# Create output directories
for path_key, path_value in CONFIG['output_paths'].items():
    Path(path_value).mkdir(parents=True, exist_ok=True)
    print(f"📁 Created directory: {path_value}")

# Validate input files exist
print("\n📋 Validating input files...")
for file_key, file_path in CONFIG['data_paths'].items():
    if Path(file_path).exists():
        file_size = Path(file_path).stat().st_size / (1024**2)  # MB
        print(f"✅ {file_key}: {file_path} ({file_size:.1f} MB)")
    else:
        print(f"❌ {file_key}: {file_path} (NOT FOUND)")

print(f"\n🎯 Target variable: {CONFIG['analysis_params']['target_column']}")
print(f"🧬 Prediction mode: {CONFIG['analysis_params']['pred_mode']}")
print(f"🎲 Random seed: {CONFIG['analysis_params']['random_state']}")
print(f"📊 Cross-validation folds: {CONFIG['analysis_params']['cv_folds']}")
print(f"🤖 Models to compare: {', '.join(CONFIG['model_params']['model_types'])}")
print(f"💾 Results will be saved to: {CONFIG['output_paths']['base_dir']}")

📁 Created directory: /home/itg/oleg.vlasovets/projects/protein-benchmark/results/pipeline_run
📁 Created directory: /home/itg/oleg.vlasovets/projects/protein-benchmark/results/pipeline_run/checkpoints
📁 Created directory: /home/itg/oleg.vlasovets/projects/protein-benchmark/results/pipeline_run/plots
📁 Created directory: /home/itg/oleg.vlasovets/projects/protein-benchmark/results/pipeline_run/reports

📋 Validating input files...
✅ protein_train: /home/itg/oleg.vlasovets/projects/protein-benchmark/data/raw/ukbb_proteins_train.csv (109.4 MB)
✅ phenotype_train: /home/itg/oleg.vlasovets/projects/protein-benchmark/data/raw/ukbb_phenotypes_train.csv (0.3 MB)
✅ protein_val: /home/itg/oleg.vlasovets/projects/protein-benchmark/data/raw/ukbb_proteins_val.csv (12.1 MB)
✅ phenotype_val: /home/itg/oleg.vlasovets/projects/protein-benchmark/data/raw/ukbb_phenotypes_val.csv (0.0 MB)

🎯 Target variable: oa_status
🧬 Prediction mode: prot_only
🎲 Random seed: 42
📊 Cross-validation folds: 10
🤖 Models to comp

## Section 3: Data Loading and Preprocessing

Load protein benchmark data, perform preprocessing steps, and prepare training/validation datasets using the data processing module.
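
A step that loaders of this kind typically need is aligning the protein matrix and the phenotype table on shared sample IDs. How `DataLoader.load_matrices` does this is not shown here, so the following is only an illustrative sketch; it assumes both tables use the sample ID as their index, and the helper name `align_samples` is hypothetical.

```python
import pandas as pd

def align_samples(proteins, phenotypes):
    """Keep only samples present in both tables, in the same row order."""
    shared = proteins.index.intersection(phenotypes.index)
    return proteins.loc[shared], phenotypes.loc[shared]

# Toy data: sample s2 lacks a phenotype row, s4 lacks protein measurements
proteins = pd.DataFrame({"A1BG": [0.1, 0.2, 0.3]}, index=["s1", "s2", "s3"])
phenotypes = pd.DataFrame({"oa_status": [0, 1, 0]}, index=["s3", "s1", "s4"])

X, pheno = align_samples(proteins, phenotypes)
print(list(X.index), list(pheno["oa_status"]))
```

Indexing both tables with the same `shared` index guarantees that row *i* of the features and row *i* of the target refer to the same sample.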

In [7]:
# Initialize checkpoint system
checkpoint_system = CheckpointSystem(CONFIG['output_paths']['checkpoints'])

# Initialize data loader
data_loader = DataLoader(CONFIG['output_paths']['base_dir'])

print("📥 Loading protein expression and phenotype data...")

# Load data matrices
data_dict = data_loader.load_matrices(
    protein_train_path=CONFIG['data_paths']['protein_train'],
    phenotype_train_path=CONFIG['data_paths']['phenotype_train'],
    protein_val_path=CONFIG['data_paths']['protein_val'],
    phenotype_val_path=CONFIG['data_paths']['phenotype_val'],
    pred_mode=CONFIG['analysis_params']['pred_mode'],
    target_column=CONFIG['analysis_params']['target_column']
)

# Extract matrices for easier access
X_train = data_dict['p_mtx_traintest']
y_train = data_dict['pheno_train_test'][CONFIG['analysis_params']['target_column']]
X_val = data_dict['p_mtx_val'] 
y_val = data_dict['pheno_val'][CONFIG['analysis_params']['target_column']]

print(f"\n📊 Data loaded successfully!")
print(f"🏋️  Training: {X_train.shape[0]} samples × {X_train.shape[1]} features")
print(f"🧪 Validation: {X_val.shape[0]} samples × {X_val.shape[1]} features")
print(f"🎯 Target distribution (train): {y_train.value_counts().to_dict()}")
print(f"🎯 Target distribution (val): {y_val.value_counts().to_dict()}")

# Validate data compatibility
validation_results = validate_data_compatibility(
    X_train, data_dict['pheno_train_test'], 
    CONFIG['analysis_params']['target_column']
)

if validation_results['compatible']:
    print("✅ Data validation passed!")
else:
    print("⚠️ Data validation warnings:")
    for warning in validation_results['warnings']:
        print(f"  - {warning}")

# Save data loading checkpoint
checkpoint_data = {
    'data_dict': data_dict,
    'X_train': X_train,
    'y_train': y_train,
    'X_val': X_val,
    'y_val': y_val,
    'validation_results': validation_results
}

checkpoint_system.save_checkpoint(checkpoint_data, '01_data_loading', "Data loading and validation completed")
print("💾 Data loading checkpoint saved")

📁 Checkpoint directory: /home/itg/oleg.vlasovets/projects/protein-benchmark/results/pipeline_run/checkpoints
📥 Loading protein expression and phenotype data...
📥 Loading data for prediction mode: prot_only
  📊 Loading protein expression matrices...
  📋 Loading phenotype data...
  ⚠️  Found 11456 missing values in training phenotypes
  ⚠️  Found 1261 missing values in validation phenotypes

📊 DATA LOADING SUMMARY
  🎯 Prediction mode: prot_only
  🏋️  Training: 2963 samples × 2131 features
  🧪 Validation: 328 samples × 2131 features
  📋 Target variable: oa_status
  🎪 Training target distribution: {0: 2803, 1: 160}
  🎪 Validation target distribution: {0: 311, 1: 17}
  🧬 Features: ['A1BG', 'AAMDC', 'AARSD1', 'ABCA2', 'ABHD14B'] ... ['ZFYVE19', 'ZHX2', 'ZNRD2', 'ZNRF4', 'ZP3']

📊 Data loaded successfully!
🏋️  Training: 2963 samples × 2131 features
🧪 Validation: 328 samples × 2131 features
🎯 Target distribution (train): {0: 2803, 1: 160}
🎯 Target distribution (val): {0: 311, 1: 17}
✅ Data val

#### TO DO: check missing values
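
One possible starting point for this check is a per-column missing-value report in pandas. The sketch below uses toy data; the helper name `missing_report` is hypothetical.

```python
import numpy as np
import pandas as pd

def missing_report(df):
    """Per-column count and fraction of missing values, worst first."""
    counts = df.isna().sum()
    report = pd.DataFrame({"n_missing": counts, "frac_missing": counts / len(df)})
    # Drop fully observed columns and list the worst offenders first
    return report[report["n_missing"] > 0].sort_values("n_missing", ascending=False)

toy = pd.DataFrame({
    "bmi": [22.0, np.nan, 30.1],
    "Sex": ["Female", "Male", None],
    "oa_status": [0, 1, 0],
})
print(missing_report(toy))
```

Applied to `data_dict['pheno_train_test']`, a report like this should show which phenotype columns account for the missing values flagged in the loading output above.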

## Section 4: PWAS Analysis (Protein-Wide Association Study)

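Perform PWAS analysis to identify proteins significantly associated with the target variable. Protein selection follows the fallback strategy of the R analysis: use FDR-significant proteins if there are any, otherwise nominally significant proteins, otherwise the top proteins ranked by p-value.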
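
The multiple testing correction and the tiered protein selection can be sketched in plain NumPy. This is an illustration under stated assumptions, not the `PWASAnalyzer` implementation: it assumes Benjamini-Hochberg as the FDR method and strict `<` comparisons against the thresholds, and both function names are hypothetical.

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values)."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity from the largest p-value downwards, then cap at 1
    adjusted_sorted = np.minimum(np.minimum.accumulate(scaled[::-1])[::-1], 1.0)
    adjusted = np.empty(n)
    adjusted[order] = adjusted_sorted
    return adjusted

def select_with_fallback(names, p_values, fdr_threshold=0.05,
                         p_threshold=0.05, max_proteins=200):
    """FDR-significant first, then nominally significant, then top-N by p-value."""
    p = np.asarray(p_values, dtype=float)
    q = benjamini_hochberg(p)
    ranked = np.argsort(p)  # most significant first
    for mask in (q < fdr_threshold, p < p_threshold):
        chosen = [names[i] for i in ranked if mask[i]]
        if chosen:
            return chosen[:max_proteins]
    # Nothing passed either threshold: fall back to the smallest p-values
    return [names[i] for i in ranked][:max_proteins]

print(select_with_fallback(["A", "B", "C", "D"], [0.001, 0.04, 0.03, 0.5]))
```

In the toy call only protein `A` survives the FDR threshold, so the later tiers are never consulted.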

In [8]:
# Initialize PWAS analyzer
pwas_analyzer = PWASAnalyzer()

print("🧬 Starting PWAS analysis...")

# Run the PWAS: test each protein for association with the target
pwas_results = pwas_analyzer.perform_pwas(
    protein_matrix=X_train,
    phenotype_data=data_dict['pheno_train_test'],
    target_column=CONFIG['analysis_params']['target_column'],
    fdr_threshold=CONFIG['pwas_params']['fdr_threshold'],
    p_threshold=CONFIG['pwas_params']['p_threshold']
)

# Check if we got valid results
if len(pwas_results['results']) == 0:
    print("⚠️ No PWAS results obtained. This may be due to:")
    print("  - Categorical variable encoding issues")
    print("  - Insufficient sample size") 
    print("  - Data quality issues")
    print("\nUsing all proteins for downstream analysis...")
    
    # Fallback: use all proteins
    selected_proteins = X_train.columns.tolist()[:CONFIG['pwas_params']['max_proteins']]
    print(f"📊 Selected {len(selected_proteins)} proteins (all available, limited by max_proteins)")
    
else:
    # Implement R fallback strategy for protein selection
    print("\n🎯 Applying R fallback strategy for protein selection:")

    # Strategy 1: Try FDR-significant proteins
    selected_proteins = pwas_analyzer.get_selected_proteins(
        selection_strategy="fdr_significant",
        max_proteins=CONFIG['pwas_params']['max_proteins']
    )

    if len(selected_proteins) == 0:
        print("  📉 No FDR-significant proteins found. Trying nominal significance...")
        # Strategy 2: Try nominally significant proteins
        selected_proteins = pwas_analyzer.get_selected_proteins(
            selection_strategy="nominal_significant", 
            max_proteins=CONFIG['pwas_params']['max_proteins']
        )

    if len(selected_proteins) == 0:
        print("  📉 No nominally significant proteins found. Using top proteins by p-value...")
        # Strategy 3: Use top proteins by p-value
        selected_proteins = pwas_analyzer.get_selected_proteins(
            selection_strategy="top_n",
            max_proteins=CONFIG['pwas_params']['max_proteins']
        )

    print(f"\n✅ Selected {len(selected_proteins)} proteins for analysis")
    print(f"🏆 Top 10 selected proteins:")
    for i, protein in enumerate(selected_proteins[:10], 1):
        # Get p-value for this protein
        results_df = pwas_results['results']
        protein_row = results_df[results_df['protein'] == protein]
        if len(protein_row) > 0:
            p_val = protein_row['p_value'].iloc[0]
            fdr_sig = "***" if protein_row['fdr_significant'].iloc[0] else ""
            nom_sig = "**" if protein_row['nominal_significant'].iloc[0] else ""
            sig_marker = fdr_sig or nom_sig
            print(f"  {i:2d}. {protein} (p = {p_val:.2e}) {sig_marker}")

# Create feature matrix with selected proteins
X_train_selected = X_train[selected_proteins].copy()
X_val_selected = X_val[selected_proteins].copy()

print(f"\n📊 Selected feature matrices:")
print(f"🏋️  Training: {X_train_selected.shape[0]} samples × {X_train_selected.shape[1]} features")
print(f"🧪 Validation: {X_val_selected.shape[0]} samples × {X_val_selected.shape[1]} features")

# Save PWAS checkpoint
pwas_checkpoint = {
    'pwas_results': pwas_results,
    'selected_proteins': selected_proteins,
    'X_train_selected': X_train_selected,
    'X_val_selected': X_val_selected
}

checkpoint_system.save_checkpoint(pwas_checkpoint, '02_pwas_analysis', "PWAS analysis and protein selection completed")
print("💾 PWAS analysis checkpoint saved")

🧬 Starting PWAS analysis...
🧬 Starting PWAS analysis...
  🎯 Target: oa_status
  🧪 Proteins: 2131
  👥 Samples: 2963
  🔧 Covariates: ['Age_at_recruitment', 'Sex', 'bmi', 'mean_NPX', 'Plate0', 'Plate2', 'Plate3']
    🔧 Sex encoding: {'Female': np.int64(0), 'Male': np.int64(1)}
  🔧 Using covariates: ['Age_at_recruitment', 'Sex_encoded', 'bmi', 'mean_NPX', 'Plate0', 'Plate2', 'Plate3']
  📊 Covariate matrix shape: (2963, 7)
    Age_at_recruitment: min=39.000, max=70.000, has_nan=False
    Sex_encoded: min=0.000, max=1.000, has_nan=False
    bmi: min=16.159, max=53.370, has_nan=False
    mean_NPX: min=-0.697, max=0.969, has_nan=False
    Plate0: min=890000000001.000, max=890000000672.000, has_nan=False
    Plate2: min=890000000578.000, max=890000000671.000, has_nan=False
    Plate3: min=890000000633.000, max=890000000671.000, has_nan=False
  🔄 Testing 2131 proteins...
    Processing protein 1/2131
    Processing protein 501/2131
    Processing protein 1001/2131
    Processing protein 1501/213

## Section 5: Feature Importance Analysis

Analyze feature importance using Random Forest to rank the selected proteins by their predictive power.
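
For orientation, this is roughly what a Random Forest importance ranking looks like in scikit-learn. The sketch runs on synthetic, imbalanced data rather than the protein matrix, and it is not the code inside `FeatureImportanceAnalyzer`; the feature names are placeholders.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic, imbalanced stand-in for the protein matrix (not the real data)
X_arr, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                               weights=[0.9, 0.1], random_state=42)
X = pd.DataFrame(X_arr, columns=[f"protein_{i}" for i in range(10)])

rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
rf.fit(X, y)

# Impurity-based importances sum to 1 across features
importance = (pd.DataFrame({"feature": X.columns,
                            "importance": rf.feature_importances_})
              .sort_values("importance", ascending=False)
              .reset_index(drop=True))
print(importance.head())
print(f"OOB accuracy: {rf.oob_score_:.3f}")
```

For a classifier, `oob_score_` is an accuracy. With a strongly imbalanced target it tends to sit close to the majority-class rate, so it is best read alongside the cross-validated AUC.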

In [9]:
# Initialize feature importance analyzer
importance_analyzer = FeatureImportanceAnalyzer(
    random_state=CONFIG['analysis_params']['random_state']
)

print("🌲 Starting Random Forest feature importance analysis...")

# Analyze feature importance
importance_results = importance_analyzer.analyze_rf_importance(
    X_train=X_train_selected,
    y_train=y_train,
    n_estimators=CONFIG['feature_params']['rf_n_estimators'],
    cv_folds=CONFIG['analysis_params']['cv_folds']
)

# Get top features by importance
top_features_importance = importance_analyzer.get_top_features(
    n_features=CONFIG['feature_params']['rfe_n_features'],
    method='top_n'
)

print(f"\n🏆 Top {len(top_features_importance)} features by RF importance:")
importance_df = importance_results['feature_importance']
top_rows = importance_df.head(CONFIG['feature_params']['rfe_n_features'])
# Number rows by position rather than by DataFrame index, which need not be 0..n-1
for rank, (_, row) in enumerate(top_rows.iterrows(), 1):
    print(f"  {rank:2d}. {row['feature']:20s} {row['importance']:.4f}")

# Visualize feature importance (text-based)
print(f"\n📊 Feature Importance Distribution:")
importance_analyzer._print_text_importance_plot(20)

# Save importance checkpoint
importance_checkpoint = {
    'importance_results': importance_results,
    'top_features_importance': top_features_importance
}

# Argument order: data, step_name, description
checkpoint_system.save_checkpoint(
    importance_checkpoint, 
    '03_feature_importance', 
    "Feature importance analysis completed"
)
print("💾 Feature importance checkpoint saved")

🌲 Starting Random Forest feature importance analysis...
🌲 Starting Random Forest feature importance analysis...
  📊 Features: 200
  👥 Samples: 2963
  🌳 Trees: 500
  🔄 Training Random Forest...
  📈 Evaluating model performance...

🌲 RANDOM FOREST IMPORTANCE SUMMARY
  🎯 Training AUC: 1.0000
  📊 CV AUC: 0.8118 ± 0.0519
  🎲 OOB Score: 0.9457

  🏆 Top 10 most important features:
     1. TREM2                0.0206
     2. ADM                  0.0196
     3. BGLAP                0.0173
     4. ELN                  0.0165
     5. CA6                  0.0131
     6. TNFRSF10B            0.0121
     7. GDF15                0.0115
     8. PGF                  0.0113
     9. CCL23                0.0112
    10. SIGLEC8              0.0107

  📈 Importance distribution:
    Mean: 0.0050
    Std:  0.0030
    Max:  0.0206
    Min:  0.0017
🎯 Selected top 20 features by importance

🏆 Top 20 features by RF importance:
   1. TREM2                0.0206
   2. ADM                  0.0196
   3. BGLAP        

## Section 6: Recursive Feature Elimination (RFE)

Perform RFE to systematically select the most predictive features, reducing dimensionality while maintaining performance.

In [10]:
# Initialize feature selector
feature_selector = FeatureSelector(
    random_state=CONFIG['analysis_params']['random_state']
)

print("🔧 Starting Recursive Feature Elimination...")

# Run RFE with cross-validation
rfe_results = feature_selector.run_rfe(
    X_train=X_train_selected,
    y_train=y_train,
    estimator_type='random_forest',
    n_features_to_select=CONFIG['feature_params']['rfe_n_features'] if not CONFIG['feature_params']['use_rfe_cv'] else None,
    step=CONFIG['feature_params']['rfe_step'],
    cv_folds=CONFIG['analysis_params']['cv_folds'],
    use_cv=CONFIG['feature_params']['use_rfe_cv']
)

# Get selected features from RFE
selected_features_rfe = feature_selector.get_selected_features()

print(f"\n🎯 RFE selected {len(selected_features_rfe)} features:")
for i, feature in enumerate(selected_features_rfe, 1):
    print(f"  {i:2d}. {feature}")

# Create final feature matrices
X_train_final = feature_selector.transform_features(X_train_selected)
X_val_final = feature_selector.transform_features(X_val_selected)

print(f"\n📊 Final feature matrices after RFE:")
print(f"🏋️  Training: {X_train_final.shape[0]} samples × {X_train_final.shape[1]} features")
print(f"🧪 Validation: {X_val_final.shape[0]} samples × {X_val_final.shape[1]} features")

# Print feature reduction summary
print(f"\n📉 Feature reduction summary:")
print(f"  Original features: {X_train.shape[1]:,}")
print(f"  After PWAS: {X_train_selected.shape[1]:,}")
print(f"  After RFE: {X_train_final.shape[1]:,}")
print(f"  Total reduction: {(1 - X_train_final.shape[1]/X_train.shape[1]):.1%}")

# Save RFE checkpoint
rfe_checkpoint = {
    'rfe_results': rfe_results,
    'selected_features_rfe': selected_features_rfe,
    'X_train_final': X_train_final,
    'X_val_final': X_val_final
}

checkpoint_system.save_checkpoint(
    rfe_checkpoint, 
    '04_feature_selection', 
    "Feature selection and RFE completed"
)
print("💾 Feature selection checkpoint saved")

🔧 Starting Recursive Feature Elimination...
🔧 Starting Recursive Feature Elimination...
  📊 Initial features: 200
  🎯 Target features: auto (CV)
  🏗️  Estimator: random_forest
  🔄 Running RFECV with 10-fold CV...
  📈 Evaluating feature selection performance...

🔧 RFE FEATURE SELECTION SUMMARY
  📊 Features selected: 126/200
  🎯 Selection ratio: 63.00%
  🏗️  Estimator: random_forest

  📈 Performance with selected features:
    Training AUC: 1.0000
    CV AUC: 0.8178 ± 0.0522

  🏆 Selected features:
     1. ELN
     2. TNFRSF10B
     3. ADM
     4. TREM2
     5. IGFBP4
     6. CXCL17
     7. FUT3_FUT5
     8. GFRA1
     9. LAMP3
    10. PGF
    ... and 116 more

  📊 RFECV Results:
    Optimal features: 126
    Best CV score: No scores available

🎯 RFE selected 126 features:
   1. ELN
   2. TNFRSF10B
   3. ADM
   4. TREM2
   5. IGFBP4
   6. CXCL17
   7. FUT3_FUT5
   8. GFRA1
   9. LAMP3
  10. PGF
  11. LECT2
  12. COL9A1
  13. EDA2R
  14. GDF15
  15. CCN3
  16. DNER
  17. FSTL3
  18. PIK3I

## Section 7: Model Training and Evaluation

Train multiple machine learning models using the selected features and evaluate their performance with cross-validation.
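
The comparison idea can be sketched with scikit-learn alone: score each candidate with stratified cross-validation on the same folds and compare mean AUC. The example uses synthetic imbalanced data and omits XGBoost to stay dependency-free; it is not the `ModelTrainer.compare_models` implementation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data with roughly 10% positives
X, y = make_classification(n_samples=400, n_features=15, n_informative=5,
                           weights=[0.9, 0.1], random_state=42)

models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "logistic_regression": LogisticRegression(max_iter=1000),
}
# Stratified folds keep the case/control ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_auc = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    cv_auc[name] = (scores.mean(), scores.std())
    print(f"{name}: CV AUC {scores.mean():.4f} ± {scores.std():.4f}")

# Accuracy of always predicting the majority class, as a reference point
majority_baseline = max(y.mean(), 1 - y.mean())
print(f"Majority-class accuracy baseline: {majority_baseline:.4f}")
```

The baseline matters for this dataset: 311 of the 328 validation samples are controls, so always predicting "control" already yields an accuracy of about 0.948. AUC is therefore the more informative metric here.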

In [13]:
# Initialize model trainer
model_trainer = ModelTrainer(
    random_state=CONFIG['analysis_params']['random_state']
)

print("🤖 Starting model training and evaluation...")
print(f"📋 Models to compare: {CONFIG['model_params']['model_types']}")

# Compare the models listed in CONFIG['model_params']['model_types']
model_comparison = model_trainer.compare_models(
    X_train=X_train_final,
    y_train=y_train,
    X_val=X_val_final,
    y_val=y_val,
    model_types=CONFIG['model_params']['model_types'],
    cv_folds=CONFIG['analysis_params']['cv_folds']
)

print("\n🏆 Model Comparison Results:")
print("="*80)

best_model_type = None
best_cv_auc = 0

for model_type, results in model_comparison.items():
    performance = results['performance']
    
    # Extract key metrics
    train_auc = performance.get('train', {}).get('auc', 'N/A')
    cv_auc = performance.get('cross_validation', {}).get('auc_mean', 'N/A')
    val_auc = performance.get('validation', {}).get('auc', 'N/A')
    
    # Add emoji for model type
    model_emoji = {
        'random_forest': '🌲',
        'logistic_regression': '📈', 
        'xgboost': '🚀'
    }.get(model_type, '🤖')
    
    print(f"\n{model_emoji} {model_type.upper().replace('_', ' ')}")
    print(f"  🎯 Training AUC:   {train_auc:.4f}" if isinstance(train_auc, (int, float)) else f"  🎯 Training AUC:   {train_auc}")
    print(f"  📊 CV AUC:        {cv_auc:.4f}" if isinstance(cv_auc, (int, float)) else f"  📊 CV AUC:        {cv_auc}")
    print(f"  🧪 Validation AUC: {val_auc:.4f}" if isinstance(val_auc, (int, float)) else f"  🧪 Validation AUC: {val_auc}")
    
    # Track best model by CV AUC
    if isinstance(cv_auc, (int, float)) and cv_auc > best_cv_auc:
        best_cv_auc = cv_auc
        best_model_type = model_type

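if best_model_type is not None:
    print(f"\n🥇 Best model: {best_model_type.replace('_', ' ').title()} (CV AUC: {best_cv_auc:.4f})")
else:
    print("\n⚠️ No model reported a numeric CV AUC, so no best model was selected")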


# Train the best model with hyperparameter tuning if requested
if CONFIG['model_params']['hyperparameter_tuning'] and best_model_type:
    print(f"\n🔧 Training {best_model_type.replace('_', ' ').title()} with hyperparameter tuning...")
    
    final_model_results = model_trainer.train_model(
        X_train=X_train_final,
        y_train=y_train,
        X_val=X_val_final,
        y_val=y_val,
        model_type=best_model_type,
        hyperparameter_tuning=True,
        cv_folds=CONFIG['analysis_params']['cv_folds']
    )
else:
    final_model_results = model_comparison[best_model_type] if best_model_type else None

# Generate predictions on validation set
if best_model_type and final_model_results:
    predictions, probabilities = model_trainer.predict(
        X_val_final, 
        model_type=best_model_type,
        return_probabilities=True
    )
    
    # Calculate additional metrics
    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
    
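    # zero_division=0 avoids warnings when the model predicts no positive cases,
    # which can happen with a target this imbalanced
    final_metrics = {
        'accuracy': accuracy_score(y_val, predictions),
        'precision': precision_score(y_val, predictions, zero_division=0),
        'recall': recall_score(y_val, predictions, zero_division=0),
        'f1_score': f1_score(y_val, predictions, zero_division=0),
        'auc': roc_auc_score(y_val, probabilities)
    }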
    
    print(f"\n📊 Final validation metrics ({best_model_type.replace('_', ' ').title()}):")
    for metric, value in final_metrics.items():
        print(f"  {metric.upper()}: {value:.4f}")
    
    # Differences from the R pipeline; 0.9892 (CV AUC) and 0.6521 (validation AUC)
    # are the R reference values also listed in Section 8
    if isinstance(final_metrics['auc'], (int, float)):
        cv_diff = best_cv_auc - 0.9892
        val_diff = final_metrics['auc'] - 0.6521
        
        print(f"\nDifferences from R Pipeline:")
        print(f"  📊 CV AUC: {cv_diff:+.4f}")
        print(f"  🧪 Validation AUC: {val_diff:+.4f}")
        
        # Performance assessment
        if cv_diff > 0 and val_diff > 0:
            print("🎉 Python pipeline outperforms R pipeline on both metrics!")
        elif val_diff > 0:
            print("✅ Python pipeline has better validation performance")
        elif cv_diff > 0:
            print("✅ Python pipeline has better cross-validation performance")
        else:
            print("📝 Python pipeline does not exceed the R pipeline on either metric")

# Save model training checkpoint
model_checkpoint = {
    'model_comparison': model_comparison,
    'best_model_type': best_model_type,
    'final_model_results': final_model_results,
    'final_metrics': final_metrics if 'final_metrics' in locals() else None,
    'predictions': predictions if 'predictions' in locals() else None,
    'probabilities': probabilities if 'probabilities' in locals() else None
}

checkpoint_system.save_checkpoint(
    model_checkpoint, 
    '05_model_training', 
    f"Model training and evaluation completed with {len(CONFIG['model_params']['model_types'])} models including XGBoost"
)
print("💾 Model training checkpoint saved")

🤖 Starting model training and evaluation...
📋 Models to compare: ['random_forest', 'logistic_regression', 'xgboost']
🏆 Comparing 3 models...

  🤖 Training random_forest...
🤖 Training random_forest model...
  📊 Training: 2963 samples × 126 features
  🧪 Validation: 328 samples
  🔄 Training model...
  📈 Evaluating performance...

🤖 MODEL TRAINING SUMMARY: RANDOM_FOREST
  📊 Features: 126
  👥 Training samples: 2963
  🧪 Validation samples: 328
  🎯 Training AUC: 1.0000
  🎯 Training Accuracy: 1.0000
  📊 CV AUC: 0.8255 ± 0.0525
  🧪 Validation AUC: 0.7785
  🧪 Validation Accuracy: 0.9482
  🎲 OOB Score: 0.9457

  🤖 Training logistic_regression...
🤖 Training logistic_regression model...
  📊 Training: 2963 samples × 126 features
  🧪 Validation: 328 samples
  🔄 Training model...
  📈 Evaluating performance...

🤖 MODEL TRAINING SUMMARY: LOGISTIC_REGRESSION
  📊 Features: 126
  👥 Training samples: 2963
  🧪 Validation samples: 328
  🎯 Training AUC: 0.9168
  🎯 Training Accuracy: 0.8252
  📊 CV AUC: 0.7969 ±

## Section 8: Results Visualization and Summary

Generate a comprehensive text summary of the pipeline results, compare them with the original R analysis, and save the summary as JSON and plain-text reports.
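
Results dictionaries in a pipeline like this usually contain NumPy values, which the standard `json` module does not serialize by default. One common approach, sketched here with a hypothetical helper `to_jsonable` (not necessarily what `create_results_summary` does), is to pass a fallback converter through the `default` argument.

```python
import json
import numpy as np

def to_jsonable(obj):
    """Fallback converter for values json cannot serialize natively."""
    if isinstance(obj, np.generic):      # NumPy scalars such as np.int64
        return obj.item()
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    return str(obj)                      # last resort: readable string

summary = {
    "cv_auc": np.float32(0.5),
    "n_features": np.int64(126),
    "top_scores": np.array([0.25, 0.5]),
}
text = json.dumps(summary, default=to_jsonable, indent=2)
print(text)
```

The converter is only called for objects that `json` cannot handle itself, so plain Python numbers, strings, lists, and dicts pass through unchanged.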

In [14]:
# Create comprehensive results summary
print("📊 Generating comprehensive pipeline results summary...")

# Collect all results, loading each checkpoint only if it exists
checkpoint_steps = {
    'data_loading': '01_data_loading',
    'pwas_analysis': '02_pwas_analysis',
    'feature_importance': '03_feature_importance',
    'feature_selection': '04_feature_selection',
    'model_training': '05_model_training'
}
all_results = {
    name: checkpoint_system.load_checkpoint(step) if checkpoint_system.checkpoint_exists(step) else None
    for name, step in checkpoint_steps.items()
}

# Create results summary
summary_path = Path(CONFIG['output_paths']['reports']) / 'pipeline_summary.json'
pipeline_summary = create_results_summary(
    {k: v for k, v in all_results.items() if v is not None},
    save_path=str(summary_path)
)

# Print formatted summary
print_pipeline_summary(pipeline_summary)

# Generate text-based visualizations
print(f"\n📈 PERFORMANCE COMPARISON WITH R PIPELINE")
print("="*80)

# Compare with known R results (from the exact_pipeline_reproduction.py)
r_pipeline_results = {
    'proteins_selected': 200,  # Top 200 proteins selected in R
    'final_features': 8,      # Final features after RFE in R  
    'cv_auc': 0.9892,        # Cross-validation AUC in R
    'val_auc': 0.6521        # Validation AUC in R
}

print("R Pipeline Results (Reference):")
print(f"  🧬 Proteins selected: {r_pipeline_results['proteins_selected']}")
print(f"  🎯 Final features: {r_pipeline_results['final_features']}")
print(f"  📊 CV AUC: {r_pipeline_results['cv_auc']:.4f}")
print(f"  🧪 Validation AUC: {r_pipeline_results['val_auc']:.4f}")

print(f"\nPython Pipeline Results (Current):")
print(f"  🧬 Proteins selected: {len(selected_proteins) if 'selected_proteins' in globals() else 'N/A'}")
print(f"  🎯 Final features: {len(selected_features_rfe) if 'selected_features_rfe' in globals() else 'N/A'}")

if 'model_checkpoint' in locals() and model_checkpoint['final_model_results']:
    final_perf = model_checkpoint['final_model_results']['performance']
    cv_auc_python = final_perf.get('cross_validation', {}).get('auc_mean', 'N/A')
    val_auc_python = final_perf.get('validation', {}).get('auc', 'N/A')
    
    print(f"  📊 CV AUC: {cv_auc_python:.4f}" if isinstance(cv_auc_python, (int, float)) else f"  📊 CV AUC: {cv_auc_python}")
    print(f"  🧪 Validation AUC: {val_auc_python:.4f}" if isinstance(val_auc_python, (int, float)) else f"  🧪 Validation AUC: {val_auc_python}")

# Create feature importance visualization (text-based)
if 'importance_results' in locals():
    print(f"\n🧬 SELECTED FEATURES SUMMARY")
    print("="*80)
    
    print(f"Final selected features ({len(selected_features_rfe)}):")
    for i, feature in enumerate(selected_features_rfe, 1):
        # Find importance score
        importance_row = importance_results['feature_importance'][
            importance_results['feature_importance']['feature'] == feature
        ]
        if len(importance_row) > 0:
            importance_score = importance_row['importance'].iloc[0]
            print(f"  {i:2d}. {feature:25s} (importance: {importance_score:.4f})")
        else:
            print(f"  {i:2d}. {feature}")

# Save final summary
summary_text_path = Path(CONFIG['output_paths']['reports']) / 'pipeline_summary.txt'
with open(summary_text_path, 'w') as f:
    f.write("PROTEIN ANALYSIS PIPELINE SUMMARY\n")
    f.write("="*50 + "\n\n")
    f.write(f"Execution Date: {datetime.now()}\n")
    f.write(f"Configuration: {CONFIG}\n\n")
    
    f.write("RESULTS:\n")
    f.write(f"- Proteins selected: {len(selected_proteins) if 'selected_proteins' in globals() else 'N/A'}\n")
    f.write(f"- Final features: {len(selected_features_rfe) if 'selected_features_rfe' in globals() else 'N/A'}\n")
    f.write(f"- Best model: {best_model_type if 'best_model_type' in locals() else 'N/A'}\n")
    
    if 'final_metrics' in locals():
        f.write(f"- Final metrics: {final_metrics}\n")

print(f"\n💾 Results saved to:")
print(f"  📊 JSON summary: {summary_path}")
print(f"  📝 Text summary: {summary_text_path}")

# Display checkpoint system summary
checkpoint_system.print_summary()

print(f"\n🎉 Pipeline execution completed successfully!")
print(f"⏱️  Total execution time: {checkpoint_system.get_total_execution_time():.2f} seconds")

📊 Generating comprehensive pipeline results summary...
✅ Checkpoint loaded: 01_data_loading
   📅 Saved: 2025-10-17T20:37:52.076664
✅ Checkpoint loaded: 02_pwas_analysis
   📅 Saved: 2025-10-17T20:38:07.888061
✅ Checkpoint loaded: 03_feature_importance
   📅 Saved: 2025-10-17T20:39:59.376388
✅ Checkpoint loaded: 04_feature_selection
   📅 Saved: 2025-10-17T20:57:57.811384
✅ Checkpoint loaded: 05_model_training
   📅 Saved: 2025-10-17T21:04:57.974893
📊 Results summary saved to: /home/itg/oleg.vlasovets/projects/protein-benchmark/results/pipeline_run/reports/pipeline_summary.json

🧬 PROTEIN ANALYSIS PIPELINE SUMMARY

📋 PIPELINE OVERVIEW
  ✅ Steps completed: 5
  🕒 Timestamp: 2025-10-17T21:06:15.007180
  🔧 Steps: data_loading, pwas_analysis, feature_importance, feature_selection, model_training


📈 PERFORMANCE COMPARISON WITH R PIPELINE
R Pipeline Results (Reference):
  🧬 Proteins selected: 200
  🎯 Final features: 8
  📊 CV AUC: 0.9892
  🧪 Validation AUC: 0.6521

Python Pipeline Results (Current

## Section 9: Checkpoint Management and Recovery

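This section lists the saved checkpoints, reloads three of them (data loading, PWAS analysis, and feature selection), runs basic integrity checks on every checkpoint, and writes a standalone recovery script (`checkpoint_recovery.py`) that shows how to resume the pipeline from a saved state.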
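
The cell below relies on `CheckpointSystem` from the `pipeline` package, whose internals are not shown in this notebook. As a rough mental model only, a checkpoint store of this kind can be sketched as one pickle file per step plus a JSON sidecar holding a timestamp and a SHA-256 digest. The class `MiniCheckpointStore` and its file layout are hypothetical and are not the project's implementation; only the method names mirror the calls used in this section.

```python
import hashlib
import json
import pickle
import tempfile
from datetime import datetime
from pathlib import Path


class MiniCheckpointStore:
    """Hypothetical stand-in for a checkpoint system (illustration only)."""

    def __init__(self, directory):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)

    def _data_path(self, step_name):
        return self.directory / f"{step_name}.pkl"

    def _meta_path(self, step_name):
        return self.directory / f"{step_name}.json"

    def save_checkpoint(self, step_name, data):
        # Serialize once so the stored bytes and the digest always match
        payload = pickle.dumps(data)
        self._data_path(step_name).write_bytes(payload)
        metadata = {
            "step_name": step_name,
            "timestamp": datetime.now().isoformat(),
            "sha256": hashlib.sha256(payload).hexdigest(),
        }
        self._meta_path(step_name).write_text(json.dumps(metadata))
        return metadata

    def checkpoint_exists(self, step_name):
        return self._data_path(step_name).exists()

    def get_checkpoint_info(self, step_name):
        meta_path = self._meta_path(step_name)
        return json.loads(meta_path.read_text()) if meta_path.exists() else {}

    def load_checkpoint(self, step_name):
        payload = self._data_path(step_name).read_bytes()
        expected = self.get_checkpoint_info(step_name).get("sha256")
        # Refuse to unpickle bytes that differ from what was saved
        if expected is not None and hashlib.sha256(payload).hexdigest() != expected:
            raise ValueError(f"Checksum mismatch for checkpoint '{step_name}'")
        return pickle.loads(payload)

    def list_checkpoints(self):
        return [
            {"step_name": p.stem, "file_path": str(p)}
            for p in sorted(self.directory.glob("*.pkl"))
        ]


# Round trip in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    store = MiniCheckpointStore(tmp)
    store.save_checkpoint("01_data_loading", {"n_samples": 3})
    restored = store.load_checkpoint("01_data_loading")
    listed = [c["step_name"] for c in store.list_checkpoints()]
```

Storing a digest next to the data lets the loader detect truncated or modified files before unpickling them. Because `pickle.loads` can execute arbitrary code, a store like this should only read files it wrote itself.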

In [16]:
print("💾 Checkpoint Management and Recovery Demo")
print("="*50)

# List all available checkpoints
available_checkpoints = checkpoint_system.list_checkpoints()
print(f"\n📋 Available checkpoints ({len(available_checkpoints)}):")
for checkpoint in available_checkpoints:
    step_name = checkpoint['step_name']
    metadata = checkpoint_system.get_checkpoint_info(step_name)
    print(f"  ✅ {step_name}: {metadata.get('timestamp', 'Unknown time')}")

# Demonstrate checkpoint loading
print(f"\n🔄 Demonstrating checkpoint recovery...")

# Example: Load data loading checkpoint
if checkpoint_system.checkpoint_exists('01_data_loading'):
    print(f"\n📥 Loading data checkpoint...")
    data_checkpoint = checkpoint_system.load_checkpoint('01_data_loading')
    print(f"  ✅ Recovered data shapes:")
    print(f"    Training: {data_checkpoint['X_train'].shape}")
    print(f"    Validation: {data_checkpoint['X_val'].shape}")
    print(f"    Target distribution: {data_checkpoint['y_train'].value_counts().to_dict()}")

# Example: Load PWAS results
if checkpoint_system.checkpoint_exists('02_pwas_analysis'):
    print(f"\n📥 Loading PWAS checkpoint...")
    pwas_checkpoint_loaded = checkpoint_system.load_checkpoint('02_pwas_analysis')
    print(f"  ✅ Recovered PWAS results:")
    print(f"    Selected proteins: {len(pwas_checkpoint_loaded['selected_proteins'])}")
    print(f"    Feature matrix shape: {pwas_checkpoint_loaded['X_train_selected'].shape}")

# Example: Load feature selection results  
if checkpoint_system.checkpoint_exists('04_feature_selection'):
    print(f"\n📥 Loading feature selection checkpoint...")
    rfe_checkpoint_loaded = checkpoint_system.load_checkpoint('04_feature_selection')
    print(f"  ✅ Recovered RFE results:")
    print(f"    Final features: {len(rfe_checkpoint_loaded['selected_features_rfe'])}")
    print(f"    Final matrix shape: {rfe_checkpoint_loaded['X_train_final'].shape}")

# Validate checkpoint integrity
print(f"\n🔍 Checkpoint integrity validation:")

integrity_results = {}
for checkpoint in available_checkpoints:
    step_name = checkpoint['step_name']
    try:
        checkpoint_data = checkpoint_system.load_checkpoint(step_name)
        metadata = checkpoint_system.get_checkpoint_info(step_name)
        # Basic validation checks
        is_valid = True
        validation_messages = []
        if checkpoint_data is None:
            is_valid = False
            validation_messages.append("Checkpoint data is None")
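        if not metadata:
            # Missing metadata is reported but does not invalidate the checkpoint
            validation_messages.append("Warning: no metadata found")
        checkpoint_path = Path(checkpoint['file_path'])
        if not checkpoint_path.exists():
            # Previously a missing file was silently ignored
            is_valid = False
            validation_messages.append(f"Checkpoint file not found: {checkpoint_path}")
        else:
            file_size = checkpoint_path.stat().st_size
            if file_size == 0:
                is_valid = False
                validation_messages.append("Checkpoint file is empty")
            else:
                validation_messages.append(f"File size: {file_size/1024:.1f} KB")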
        integrity_results[step_name] = {
            'valid': is_valid,
            'messages': validation_messages
        }
        status_icon = "✅" if is_valid else "❌"
        print(f"  {status_icon} {step_name}: {'Valid' if is_valid else 'Invalid'}")
        for msg in validation_messages:
            print(f"    - {msg}")
    except Exception as e:
        integrity_results[step_name] = {
            'valid': False,
            'messages': [f"Error loading: {str(e)}"]
        }
        print(f"  ❌ {step_name}: Error - {str(e)}")

# Demonstrate pipeline restart from checkpoint
print(f"\n🔄 Pipeline restart demonstration:")
print(f"To restart the pipeline from any checkpoint, use:")
print(f"")
print(f"```python")
print(f"# Load checkpoint")
print(f"checkpoint_data = checkpoint_system.load_checkpoint('02_pwas_analysis')")
print(f"")
print(f"# Resume from that point")
print(f"X_train_selected = checkpoint_data['X_train_selected']")
print(f"selected_proteins = checkpoint_data['selected_proteins']")
print(f"")
print(f"# Continue with feature importance analysis...")
print(f"importance_analyzer = FeatureImportanceAnalyzer()")
print(f"# ... rest of pipeline")
print(f"```")

# Create checkpoint recovery script
recovery_script_path = Path(CONFIG['output_paths']['base_dir']) / 'checkpoint_recovery.py'
recovery_script = f'''#!/usr/bin/env python3
"""
Checkpoint Recovery Script
==========================

This script demonstrates how to recover and resume the pipeline from any checkpoint.
Generated automatically by the pipeline orchestrator.
"""

import sys
sys.path.append('/home/itg/oleg.vlasovets/projects/protein-benchmark')

from pipeline import CheckpointSystem

# Initialize checkpoint system
checkpoint_system = CheckpointSystem({str(CONFIG['output_paths']['checkpoints'])!r})

# List available checkpoints
print("Available checkpoints:")
for checkpoint in checkpoint_system.list_checkpoints():
    step_name = checkpoint['step_name']
    metadata = checkpoint_system.get_checkpoint_info(step_name)
    print(f"  - {{step_name}}: {{metadata.get('timestamp', 'Unknown')}}")

# Example: Resume from PWAS analysis
if checkpoint_system.checkpoint_exists('02_pwas_analysis'):
    print("\\nResuming from PWAS analysis checkpoint...")
    pwas_data = checkpoint_system.load_checkpoint('02_pwas_analysis')
    # Extract recovered data
    selected_proteins = pwas_data['selected_proteins']
    X_train_selected = pwas_data['X_train_selected']
    print(f"Recovered {{len(selected_proteins)}} selected proteins")
    print(f"Feature matrix shape: {{X_train_selected.shape}}")
    # Continue with next step...
    # (Add your continuation logic here)
else:
    print("PWAS checkpoint not found. Run the full pipeline first.")
'''

with open(recovery_script_path, 'w') as f:
    f.write(recovery_script)

print(f"\n📜 Checkpoint recovery script created: {recovery_script_path}")

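# Final checkpoint summary
print("\n📊 Final Checkpoint System Summary:")
checkpoint_system.print_summary()

print("\n🎯 Checkpoint management demonstration completed!")
# Report readiness based on the integrity checks above instead of assuming success
failed_steps = [name for name, result in integrity_results.items() if not result['valid']]
if failed_steps:
    print(f"⚠️  Checkpoints that failed validation: {', '.join(failed_steps)}")
else:
    print("💡 All checkpoints are ready for pipeline recovery and resumption.")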

💾 Checkpoint Management and Recovery Demo

📋 Available checkpoints (5):
  ✅ 01_data_loading: 2025-10-17T20:37:52.076664
  ✅ 02_pwas_analysis: 2025-10-17T20:38:07.888061
  ✅ 03_feature_importance: 2025-10-17T20:39:59.376388
  ✅ 04_feature_selection: 2025-10-17T20:57:57.811384
  ✅ 05_model_training: 2025-10-17T21:04:57.974893

🔄 Demonstrating checkpoint recovery...

📥 Loading data checkpoint...
✅ Checkpoint loaded: 01_data_loading
   📅 Saved: 2025-10-17T20:37:52.076664
  ✅ Recovered data shapes:
    Training: (2963, 2131)
    Validation: (328, 2131)
    Target distribution: {0: 2803, 1: 160}

📥 Loading PWAS checkpoint...
✅ Checkpoint loaded: 02_pwas_analysis
   📅 Saved: 2025-10-17T20:38:07.888061
  ✅ Recovered PWAS results:
    Selected proteins: 200
    Feature matrix shape: (2963, 200)

📥 Loading feature selection checkpoint...
✅ Checkpoint loaded: 04_feature_selection
   📅 Saved: 2025-10-17T20:57:57.811384
  ✅ Recovered RFE results:
    Final features: 126
    Final matrix shape: (296

## Conclusion

This notebook successfully demonstrates the complete protein analysis pipeline, reproducing the R methodology with modular Python components. The pipeline includes:

### ✅ **Completed Steps:**
1. **Data Loading**: Loaded protein expression and phenotype data with validation
2. **PWAS Analysis**: Performed protein-wide association study with R fallback strategy
3. **Feature Importance**: Ranked proteins using Random Forest importance
4. **Feature Selection**: Applied RFE to select optimal feature set
5. **Model Training**: Trained and compared multiple ML models
6. **Results Analysis**: Generated comprehensive performance metrics
7. **Checkpoint System**: Implemented robust checkpoint management for reproducibility
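
One common way to use a checkpoint system like the one in step 7 is a run-or-load wrapper: a step is computed only when no saved result exists for it. The sketch below uses a plain dictionary in place of the project's `CheckpointSystem`. The names `run_or_load`, `cache`, and the two toy step functions are assumptions made for illustration.

```python
from typing import Any, Callable, Dict


def run_or_load(cache: Dict[str, Any], step_name: str, compute: Callable[[], Any]) -> Any:
    """Return the cached result for step_name, computing and storing it if absent."""
    if step_name in cache:
        return cache[step_name]
    result = compute()
    cache[step_name] = result
    return result


calls = []  # records which steps were actually executed


def load_data():
    calls.append("01_data_loading")
    return [0.2, 0.9, 0.4, 0.7]


def select_features(values):
    calls.append("02_pwas_analysis")
    return [v for v in values if v > 0.5]


cache = {}

# First run: both steps execute and their results are stored
data = run_or_load(cache, "01_data_loading", load_data)
selected = run_or_load(cache, "02_pwas_analysis", lambda: select_features(data))

# Second run: both steps are served from the cache and nothing is recomputed
data_again = run_or_load(cache, "01_data_loading", load_data)
selected_again = run_or_load(cache, "02_pwas_analysis", lambda: select_features(data_again))
```

In a real run the dictionary lookup would be replaced by a check such as `checkpoint_system.checkpoint_exists(step_name)` followed by `load_checkpoint(step_name)`, as in Section 9.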