# 🔋 Battery Capacity Prediction - Clean PyCaret AutoML Analysis

## 📋 Methodology Overview
- **Training Set (80%)**: Used for model training and cross-validation
- **Test Set (20%)**: Completely unseen data for final evaluation
- **Cross-Validation**: 5-fold CV on training data only

## 🎯 Analysis Goals
1. Compare 17 different ML algorithms using cross-validation
2. Evaluate best models on completely unseen test data  
3. Detect overfitting by comparing CV vs test performance
4. Provide clear recommendations for model deployment

## ✅ Data Split Guarantee
- Cross-validation is performed ONLY on training data
- Test set remains completely unseen during model selection
- Final evaluation on the test set gives an estimate of generalization performance
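
The guarantee above can be illustrated outside PyCaret. The following is a minimal sketch using scikit-learn on row indices only (the seed and variable names are assumptions for illustration, not the notebook's actual pipeline): it holds out 20% of 299 rows, builds 5 folds from the remaining 80%, and checks that no held-out row ever appears in a fold.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

# Stand-in for the dataset: 299 row indices, as in this analysis.
n_rows = 299
indices = np.arange(n_rows)

# 80/20 split; the test indices are set aside and not touched again.
train_idx, test_idx = train_test_split(indices, train_size=0.8, random_state=123)

# 5-fold CV is built from the training indices only.
kfold = KFold(n_splits=5, shuffle=True, random_state=123)
folds = [
    (train_idx[fit_pos], train_idx[val_pos])
    for fit_pos, val_pos in kfold.split(train_idx)
]

# Collect any test row that shows up inside a fold (there should be none).
test_set = set(test_idx.tolist())
leaks = [
    idx
    for fit_rows, val_rows in folds
    for idx in np.concatenate([fit_rows, val_rows]).tolist()
    if idx in test_set
]
print(f"train rows: {len(train_idx)}, test rows: {len(test_idx)}, leaked rows: {len(leaks)}")
```

An empty `leaks` list is the property the bullets describe: model ranking by CV only ever touches the training portion.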


In [2]:
# Import required libraries
import pandas as pd
import numpy as np
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Import PyCaret
from pycaret.regression import compare_models, create_model, predict_model, pull, setup

# Import custom modules
from src.data_loader import DataLoader

# Import sklearn for additional metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

print("✅ All libraries imported successfully!")
print("📊 Ready for PyCaret AutoML analysis")


✅ All libraries imported successfully!
📊 Ready for PyCaret AutoML analysis


In [13]:
# Load and prepare data
print("📁 Loading battery capacity dataset...")

current_dir = Path.cwd()
data_path = current_dir / "dataset_299.xlsx"
results_dir = current_dir / "results"
results_dir.mkdir(exist_ok=True)

# Load data using custom data loader
data_loader = DataLoader(data_path)
df = data_loader.load_data()
X, y = data_loader.split_features_target(df)

# Create combined dataset for PyCaret
data = X.copy()
data['capacity'] = y  # Use 'capacity' as target name for clarity

print(f"📊 Dataset shape: {data.shape}")
print(f"🔢 Features: {X.shape[1]}")
print(f"📈 Target statistics:")
print(y.describe().round(4))

print("\n✅ Data loaded and prepared successfully!")

📁 Loading battery capacity dataset...
Successfully loaded data with shape: (299, 63)
Index(['Cell ID', 'Average Capacity'], dtype='object')
📊 Dataset shape: (299, 62)
🔢 Features: 61
📈 Target statistics:
count     299.0000
mean     7758.9939
std       624.9480
min      4106.4765
25%      7617.6001
50%      7879.6715
75%      8123.5052
max      8579.0650
Name: Average Capacity, dtype: float64

✅ Data loaded and prepared successfully!
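
`src.data_loader.DataLoader` is a project module whose source is not shown here. Judging only from the output above (63 loaded columns, 61 features, and the printed `Cell ID` and `Average Capacity` columns), `split_features_target` appears to drop an identifier column and separate the target. The class below is a hypothetical stand-in written under that assumption; the name `SketchDataLoader`, the default column names, and the use of `pd.read_excel` are guesses, not the project's actual implementation.

```python
from pathlib import Path

import pandas as pd


class SketchDataLoader:
    """Hypothetical stand-in for src.data_loader.DataLoader (assumed behavior)."""

    def __init__(self, path, id_col="Cell ID", target_col="Average Capacity"):
        self.path = Path(path)
        self.id_col = id_col
        self.target_col = target_col

    def load_data(self):
        # Assumption: the workbook has a single sheet with one row per cell.
        return pd.read_excel(self.path)

    def split_features_target(self, df):
        # Drop the identifier and the target; everything else is a feature.
        X = df.drop(columns=[self.id_col, self.target_col])
        y = df[self.target_col]
        return X, y


# Usage on a tiny in-memory frame (no Excel file is read in this demo).
demo = pd.DataFrame({
    "Cell ID": ["a", "b", "c"],
    "feature_1": [0.1, 0.2, 0.3],
    "feature_2": [1.0, 2.0, 3.0],
    "Average Capacity": [7600.0, 7900.0, 8100.0],
})
X_demo, y_demo = SketchDataLoader("unused.xlsx").split_features_target(demo)
print(X_demo.shape, y_demo.name)
```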


In [4]:
# Setup PyCaret environment with proper train/test split
print("⚙️ Setting up PyCaret environment...")
print("📊 Data split: 80% train (for CV) + 20% test (unseen)")

# Setup PyCaret with explicit parameters
reg = setup(
    data=data,
    target='capacity',
    session_id=123,           # For reproducibility
    train_size=0.8,          # 80% for training (CV will be done on this)
    fold=5,                  # 5-fold cross-validation on training data
    verbose=False,           # Minimize output
    use_gpu=False,           # Set to True if GPU available
    normalize=True,          # Normalize features
    transformation=True,     # Apply transformations
    remove_multicollinearity=True,  # Remove highly correlated features
    multicollinearity_threshold=0.9
)

print("✅ PyCaret environment configured successfully!")
print("🔄 Cross-validation will be performed on training data only")
print("🔒 Test data is held out and completely unseen")


⚙️ Setting up PyCaret environment...
📊 Data split: 80% train (for CV) + 20% test (unseen)
✅ PyCaret environment configured successfully!
🔄 Cross-validation will be performed on training data only
🔒 Test data is held out and completely unseen
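
The `setup` flags above request feature scaling, a power transformation, and removal of highly correlated features. As a rough illustration of what such steps do, here is a scikit-learn sketch on synthetic data. It is an approximation under stated assumptions: the helper `drop_correlated`, the synthetic columns, and the choice of z-score scaling with a Yeo-Johnson transform are illustrative, and PyCaret's internal rule for which member of a correlated group to keep may differ.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer, StandardScaler


def drop_correlated(df, threshold=0.9):
    """Drop the later column of each pair whose |Pearson r| exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is considered once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)


rng = np.random.default_rng(123)
a = rng.normal(size=200)
features = pd.DataFrame({
    "a": a,
    "a_copy": a * 2.0 + rng.normal(scale=0.01, size=200),  # nearly collinear with "a"
    "b": rng.exponential(size=200),                          # skewed, independent of "a"
})

reduced = drop_correlated(features, threshold=0.9)

# Power transform to reduce skew, then z-score scaling.
# In a real workflow these would be fitted on training rows only.
transformed = PowerTransformer(method="yeo-johnson", standardize=False).fit_transform(reduced)
scaled = StandardScaler().fit_transform(transformed)

print(list(reduced.columns))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```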


In [5]:
# Compare all available models using cross-validation
print("🤖 Comparing 17 different machine learning algorithms...")
print("⏱️ This may take several minutes...")
print("\n📊 Models being tested:")
models_list = [
    ('lr', 'Linear Regression'),
    ('lasso', 'Lasso Regression'),
    ('ridge', 'Ridge Regression'),
    ('en', 'Elastic Net'),
    ('huber', 'Huber Regressor'),
    ('rf', 'Random Forest'),
    ('et', 'Extra Trees'),
    ('gbr', 'Gradient Boosting'),
    ('lightgbm', 'LightGBM'),
    ('xgboost', 'XGBoost'),
    ('catboost', 'CatBoost'),
    ('knn', 'K-Nearest Neighbors'),
    ('mlp', 'Multi-Layer Perceptron'),
    ('svm', 'Support Vector Machine'),
    ('dt', 'Decision Tree'),
    ('ada', 'AdaBoost'),
    ('br', 'Bayesian Ridge')
]

for code, name in models_list:
    print(f"   • {name} ({code})")

print("\n🚀 Starting model comparison...")


🤖 Comparing 17 different machine learning algorithms...
⏱️ This may take several minutes...

📊 Models being tested:
   • Linear Regression (lr)
   • Lasso Regression (lasso)
   • Ridge Regression (ridge)
   • Elastic Net (en)
   • Huber Regressor (huber)
   • Random Forest (rf)
   • Extra Trees (et)
   • Gradient Boosting (gbr)
   • LightGBM (lightgbm)
   • XGBoost (xgboost)
   • CatBoost (catboost)
   • K-Nearest Neighbors (knn)
   • Multi-Layer Perceptron (mlp)
   • Support Vector Machine (svm)
   • Decision Tree (dt)
   • AdaBoost (ada)
   • Bayesian Ridge (br)

🚀 Starting model comparison...


In [6]:
# Run the model comparison
best_model = compare_models(
    include=[code for code, _ in models_list],
    sort='RMSE',      # Sort by RMSE (lower is better)
    verbose=False,    # Reduce output
    fold=5           # Ensure 5-fold CV
)

print("✅ Model comparison completed!")
print("\n📊 CROSS-VALIDATION RESULTS (on training data):")
print("=" * 70)

# Get the comparison results
cv_results = pull()
print("Note: These results are from 5-fold cross-validation on TRAINING data only")
print("The test set remains completely unseen at this point.\n")
display(cv_results.round(6))

# Save CV results
cv_results.to_csv(results_dir / 'cross_validation_results.csv')
print(f"💾 Cross-validation results saved to: {results_dir / 'cross_validation_results.csv'}")


✅ Model comparison completed!

📊 CROSS-VALIDATION RESULTS (on training data):
Note: These results are from 5-fold cross-validation on TRAINING data only
The test set remains completely unseen at this point.



|  | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) |
|---|---|---|---|---|---|---|---|---|
| gbr | Gradient Boosting Regressor | 295.2745 | 176588.0 | 410.5731 | 0.378 | 0.0576 | 0.0402 | 0.016 |
| catboost | CatBoost Regressor | 299.3522 | 190130.8 | 424.1653 | 0.3167 | 0.0594 | 0.0407 | 0.16 |
| rf | Random Forest Regressor | 314.751 | 199611.4 | 438.2125 | 0.3312 | 0.0619 | 0.043 | 0.034 |
| et | Extra Trees Regressor | 319.5912 | 210257.0 | 448.2403 | 0.2617 | 0.0626 | 0.0433 | 0.026 |
| ada | AdaBoost Regressor | 327.0233 | 216012.4 | 458.0947 | 0.2845 | 0.0644 | 0.0444 | 0.02 |
| xgboost | Extreme Gradient Boosting | 331.8213 | 231124.4 | 468.6088 | 0.1939 | 0.0651 | 0.0453 | 0.018 |
| knn | K Neighbors Regressor | 333.5187 | 233254.5 | 476.6887 | 0.2415 | 0.0689 | 0.0462 | 0.016 |
| lightgbm | Light Gradient Boosting Machine | 327.0036 | 238876.9 | 483.9251 | 0.2147 | 0.0714 | 0.0457 | 0.086 |
| dt | Decision Tree Regressor | 377.1489 | 264871.7 | 509.255 | 0.0903 | 0.0699 | 0.0508 | 0.012 |
| en | Elastic Net | 397.0344 | 361287.8 | 583.5852 | 0.083 | 0.0862 | 0.0571 | 0.014 |


💾 Cross-validation results saved to: /Users/amirbabamahmoudi/Documents/Battery-Capacity/results/cross_validation_results.csv
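
Each RMSE in the table is an average over five validation folds. A minimal scikit-learn sketch of that computation follows, on synthetic data with a gradient boosting model (the data, seed, and model settings are assumptions, not the notebook's configuration). It also reports the fold-to-fold spread, which helps judge how much weight a small difference between two rows of the table deserves.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem, roughly the size of the training portion
# (about 80% of 299 rows).
X, y = make_regression(n_samples=239, n_features=10, noise=20.0, random_state=123)

cv = KFold(n_splits=5, shuffle=True, random_state=123)
model = GradientBoostingRegressor(n_estimators=50, random_state=123)

# scikit-learn reports negated RMSE so that larger is always better; flip the sign.
fold_rmse = -cross_val_score(model, X, y, cv=cv, scoring="neg_root_mean_squared_error")

print("per-fold RMSE:", fold_rmse.round(2))
print(f"mean = {fold_rmse.mean():.2f}, std = {fold_rmse.std(ddof=1):.2f}")
```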


In [8]:
# Analyze top 5 models on the unseen test set
print("\n" + "=" * 80)
print("🎯 EVALUATING TOP 5 MODELS ON UNSEEN TEST SET")
print("=" * 80)
print("Now we evaluate the top performers on completely unseen test data")
print("This will reveal any overfitting issues.\n")

# Get top 5 models
top_5_models = cv_results.head(5).index.tolist()
print(f"🏆 Top 5 models based on CV RMSE:")
for i, model_name in enumerate(top_5_models, 1):
    cv_rmse = cv_results.loc[model_name, 'RMSE']
    cv_mae = cv_results.loc[model_name, 'MAE']
    cv_r2 = cv_results.loc[model_name, 'R2']
    print(f"   {i}. {model_name}: CV RMSE = {cv_rmse:.6f}, CV R² = {cv_r2:.6f}, CV MAE = {cv_mae:.6f}")

print("\n🔍 Now testing these models on unseen test data...")



🎯 EVALUATING TOP 5 MODELS ON UNSEEN TEST SET
Now we evaluate the top performers on completely unseen test data
This will reveal any overfitting issues.

🏆 Top 5 models based on CV RMSE:
   1. gbr: CV RMSE = 410.573100, CV R² = 0.378000, CV MAE = 295.274500
   2. catboost: CV RMSE = 424.165300, CV R² = 0.316700, CV MAE = 299.352200
   3. rf: CV RMSE = 438.212500, CV R² = 0.331200, CV MAE = 314.751000
   4. et: CV RMSE = 448.240300, CV R² = 0.261700, CV MAE = 319.591200
   5. ada: CV RMSE = 458.094700, CV R² = 0.284500, CV MAE = 327.023300

🔍 Now testing these models on unseen test data...
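
For reference, the test-set metrics computed in the next cell follow the standard definitions, with $y_i$ the measured capacity, $\hat{y}_i$ the prediction, $\bar{y}$ the mean of the measured values, and $n$ the number of evaluated samples. MAPE is written here in percent, matching the `* 100` in the cell's own calculation.

$$
\begin{aligned}
\mathrm{RMSE} &= \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \\
\mathrm{MAE} &= \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \\
R^2 &= 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2} \\
\mathrm{MAPE} &= \frac{100}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|
\end{aligned}
$$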


In [9]:
# Detailed evaluation of top models
test_results = []

for i, model_name in enumerate(top_5_models, 1):
    print(f"\n{'='*60}")
    print(f"📊 EVALUATING MODEL {i}: {model_name.upper()}")
    print(f"{'='*60}")
    
    # Create the model
    print(f"🔧 Creating {model_name} model...")
    model = create_model(model_name, verbose=False)
    
    # Get predictions on test set
    print(f"🔮 Making predictions on test set...")
    test_predictions = predict_model(model, verbose=False)
    
    # Calculate test metrics
    y_test_true = test_predictions['capacity']
    y_test_pred = test_predictions['prediction_label']
    
    test_rmse = np.sqrt(mean_squared_error(y_test_true, y_test_pred))
    test_mae = mean_absolute_error(y_test_true, y_test_pred)
    test_r2 = r2_score(y_test_true, y_test_pred)
    # MAPE in percent; the CV table above reports MAPE as a fraction (e.g. 0.0402)
    test_mape = np.mean(np.abs((y_test_true - y_test_pred) / y_test_true)) * 100
    
    # Get CV metrics
    cv_rmse = cv_results.loc[model_name, 'RMSE']
    cv_mae = cv_results.loc[model_name, 'MAE']
    cv_r2 = cv_results.loc[model_name, 'R2']
    
    # Calculate performance differences
    rmse_diff = test_rmse - cv_rmse
    rmse_diff_pct = (rmse_diff / cv_rmse) * 100
    r2_diff = cv_r2 - test_r2  # Positive means CV is better (potential overfitting)
    
    # Display results
    comparison_df = pd.DataFrame({
        'Metric': ['RMSE', 'MAE', 'R²', 'MAPE'],
        'Cross-Validation': [cv_rmse, cv_mae, cv_r2, np.nan],
        'Test Set': [test_rmse, test_mae, test_r2, test_mape],
        'Difference': [rmse_diff, test_mae - cv_mae, r2_diff, np.nan],
        'Diff %': [
            rmse_diff_pct,
            ((test_mae - cv_mae) / cv_mae) * 100,
            (r2_diff / cv_r2) * 100 if cv_r2 != 0 else 0,
            np.nan,
        ]
    })
    
    print("\n📊 Performance Comparison:")
    display(comparison_df.round(6))
    
    # Overfitting assessment
    print("\n🔍 Overfitting Assessment:")
    if rmse_diff_pct > 20 or r2_diff > 0.15:
        overfitting_status = "🔴 HIGH OVERFITTING"
        recommendation = "❌ Not recommended for deployment"
    elif rmse_diff_pct > 10 or r2_diff > 0.1:
        overfitting_status = "🟡 MODERATE OVERFITTING"
        recommendation = "⚠️ Use with caution, monitor performance"
    else:
        overfitting_status = "🟢 GOOD GENERALIZATION"
        recommendation = "✅ Good candidate for deployment"
    
    print(f"   Status: {overfitting_status}")
    print(f"   RMSE increase: {rmse_diff_pct:.2f}%")
    print(f"   R² drop: {r2_diff:.4f}")
    print(f"   Recommendation: {recommendation}")
    
    # Store results
    test_results.append({
        'Model': model_name,
        'CV_RMSE': cv_rmse,
        'Test_RMSE': test_rmse,
        'CV_R2': cv_r2,
        'Test_R2': test_r2,
        'RMSE_Diff_%': rmse_diff_pct,
        'R2_Drop': r2_diff,
        'Status': overfitting_status.split()[1],  # Status word (HIGH / MODERATE / GOOD), without the emoji
        'Recommended': '✅' if 'GOOD' in overfitting_status else ('⚠️' if 'MODERATE' in overfitting_status else '❌')
    })

print(f"\n{'='*80}")
print("✅ Individual model evaluation completed!")



📊 EVALUATING MODEL 1: GBR
🔧 Creating gbr model...
🔮 Making predictions on test set...

📊 Performance Comparison:


|  | Metric | Cross-Validation | Test Set | Difference | Diff % |
|---|---|---|---|---|---|
| 0 | RMSE | 410.5731 | 455.740176 | 45.167076 | 11.000983 |
| 1 | MAE | 295.2745 | 326.720067 | 31.445567 | 10.649605 |
| 2 | R² | 0.378 | 0.293498 | 0.084502 | 22.355062 |
| 3 | MAPE | NaN | 4.398652 | NaN | NaN |



🔍 Overfitting Assessment:
   Status: 🟡 MODERATE OVERFITTING
   RMSE increase: 11.00%
   R² drop: 0.0845
   Recommendation: ⚠️ Use with caution, monitor performance

📊 EVALUATING MODEL 2: CATBOOST
🔧 Creating catboost model...
🔮 Making predictions on test set...

📊 Performance Comparison:


|  | Metric | Cross-Validation | Test Set | Difference | Diff % |
|---|---|---|---|---|---|
| 0 | RMSE | 424.1653 | 365.194383 | -58.970917 | -13.902815 |
| 1 | MAE | 299.3522 | 284.415047 | -14.937153 | -4.989826 |
| 2 | R² | 0.3167 | 0.546344 | -0.229644 | -72.511405 |
| 3 | MAPE | NaN | 3.773877 | NaN | NaN |



🔍 Overfitting Assessment:
   Status: 🟢 GOOD GENERALIZATION
   RMSE increase: -13.90%
   R² drop: -0.2296
   Recommendation: ✅ Good candidate for deployment

📊 EVALUATING MODEL 3: RF
🔧 Creating rf model...
🔮 Making predictions on test set...

📊 Performance Comparison:


|  | Metric | Cross-Validation | Test Set | Difference | Diff % |
|---|---|---|---|---|---|
| 0 | RMSE | 438.2125 | 397.23172 | -40.98078 | -9.351805 |
| 1 | MAE | 314.751 | 302.120719 | -12.630281 | -4.012785 |
| 2 | R² | 0.3312 | 0.463257 | -0.132057 | -39.872172 |
| 3 | MAPE | NaN | 4.042727 | NaN | NaN |



🔍 Overfitting Assessment:
   Status: 🟢 GOOD GENERALIZATION
   RMSE increase: -9.35%
   R² drop: -0.1321
   Recommendation: ✅ Good candidate for deployment

📊 EVALUATING MODEL 4: ET
🔧 Creating et model...
🔮 Making predictions on test set...

📊 Performance Comparison:


|  | Metric | Cross-Validation | Test Set | Difference | Diff % |
|---|---|---|---|---|---|
| 0 | RMSE | 448.2403 | 402.901633 | -45.338667 | -10.114813 |
| 1 | MAE | 319.5912 | 311.848087 | -7.743113 | -2.422818 |
| 2 | R² | 0.2617 | 0.447825 | -0.186125 | -71.121434 |
| 3 | MAPE | NaN | 4.175674 | NaN | NaN |



🔍 Overfitting Assessment:
   Status: 🟢 GOOD GENERALIZATION
   RMSE increase: -10.11%
   R² drop: -0.1861
   Recommendation: ✅ Good candidate for deployment

📊 EVALUATING MODEL 5: ADA
🔧 Creating ada model...
🔮 Making predictions on test set...

📊 Performance Comparison:


|  | Metric | Cross-Validation | Test Set | Difference | Diff % |
|---|---|---|---|---|---|
| 0 | RMSE | 458.0947 | 452.646331 | -5.448369 | -1.189354 |
| 1 | MAE | 327.0233 | 303.957283 | -23.066017 | -7.053325 |
| 2 | R² | 0.2845 | 0.303058 | -0.018558 | -6.5229 |
| 3 | MAPE | NaN | 4.186918 | NaN | NaN |



🔍 Overfitting Assessment:
   Status: 🟢 GOOD GENERALIZATION
   RMSE increase: -1.19%
   R² drop: -0.0186
   Recommendation: ✅ Good candidate for deployment

✅ Individual model evaluation completed!
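
The status labels above come from two quantities: the relative RMSE increase from CV to test, and the absolute drop in R². The sketch below isolates that rule as two small functions and replays it on values copied from the tables above; the function names `generalization_gap` and `classify` are illustrative and do not appear in the notebook.

```python
def generalization_gap(cv_rmse, test_rmse, cv_r2, test_r2):
    """Return (relative RMSE increase in %, absolute R² drop) from CV to test."""
    rmse_increase_pct = (test_rmse - cv_rmse) / cv_rmse * 100
    r2_drop = cv_r2 - test_r2
    return rmse_increase_pct, r2_drop


def classify(rmse_increase_pct, r2_drop):
    """Apply the same thresholds as the evaluation loop above."""
    if rmse_increase_pct > 20 or r2_drop > 0.15:
        return "HIGH"
    if rmse_increase_pct > 10 or r2_drop > 0.1:
        return "MODERATE"
    return "GOOD"


# Values copied from the comparison tables above.
gbr = generalization_gap(410.5731, 455.740176, 0.378, 0.293498)
catboost = generalization_gap(424.1653, 365.194383, 0.3167, 0.546344)

print(f"gbr: {gbr[0]:.2f}% RMSE increase, R² drop {gbr[1]:.4f} -> {classify(*gbr)}")
print(f"catboost: {catboost[0]:.2f}% RMSE increase, R² drop {catboost[1]:.4f} -> {classify(*catboost)}")
# gbr: 11.00% RMSE increase, R² drop 0.0845 -> MODERATE
# catboost: -13.90% RMSE increase, R² drop -0.2296 -> GOOD
```

The thresholds only check one direction, so any model whose test error is lower than its CV error is labeled GOOD. With a test set of roughly 60 rows, a large negative gap may reflect sampling noise as much as generalization, so the label is best read as "no sign of overfitting" rather than as a guarantee.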


In [None]:
# Comprehensive summary
print("\n" + "=" * 80)
print("📋 COMPREHENSIVE MODEL EVALUATION SUMMARY")
print("=" * 80)

# Create summary DataFrame
summary_df = pd.DataFrame(test_results)
summary_df = summary_df.sort_values('Test_RMSE')  # Sort by test RMSE

print("\n📊 Complete Performance Summary:")
display(summary_df.round(6))

# Save summary
summary_df.to_csv(results_dir / 'model_evaluation_summary.csv', index=False)
print(f"💾 Summary saved to: {results_dir / 'model_evaluation_summary.csv'}")

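# Best model selection
# Note: ranking by Test_RMSE lets the test set influence the final choice,
# so the selected model's test score is an optimistic estimate.
good_models = summary_df[summary_df['Status'] == 'GOOD']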
if len(good_models) > 0:
    best_model_name = good_models.iloc[0]['Model']
    best_test_rmse = good_models.iloc[0]['Test_RMSE']
    best_test_r2 = good_models.iloc[0]['Test_R2']
    
    print(f"\n🏆 RECOMMENDED MODEL: {best_model_name.upper()}")
    print(f"   ✅ Shows good generalization")
    print(f"   📊 Test RMSE: {best_test_rmse:.6f}")
    print(f"   📈 Test R²: {best_test_r2:.6f}")
    print(f"   🎯 Ready for deployment")
else:
    print(f"\n⚠️ No models show ideal generalization")
    best_compromise = summary_df.iloc[0]
    print(f"   🔄 Best compromise: {best_compromise['Model']}")
    print(f"   📊 Test RMSE: {best_compromise['Test_RMSE']:.6f}")
    print(f"   ⚠️ Monitor performance carefully")

# Model distribution
status_counts = summary_df['Status'].value_counts()
print(f"\n📊 Model Performance Distribution:")
for status, count in status_counts.items():
    emoji = '🟢' if status == 'GOOD' else ('🟡' if status == 'MODERATE' else '🔴')
    print(f"   {emoji} {status}: {count} models")
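
The summary cell was not executed in this export, so it has no recorded output. As a rough check of what its selection logic would pick, the sketch below replays that logic on the test metrics printed earlier (values copied by hand; the frame name `recorded` is illustrative, and a rerun is assumed to reproduce the same numbers).

```python
import pandas as pd

# Test-set RMSE, R² and status copied from the evaluation output above.
recorded = pd.DataFrame({
    "Model": ["gbr", "catboost", "rf", "et", "ada"],
    "Test_RMSE": [455.740176, 365.194383, 397.231720, 402.901633, 452.646331],
    "Test_R2": [0.293498, 0.546344, 0.463257, 0.447825, 0.303058],
    "Status": ["MODERATE", "GOOD", "GOOD", "GOOD", "GOOD"],
})

ranked = recorded.sort_values("Test_RMSE").reset_index(drop=True)
good = ranked[ranked["Status"] == "GOOD"]
# Fall back to the overall lowest test RMSE if nothing is flagged GOOD.
chosen = good.iloc[0] if len(good) > 0 else ranked.iloc[0]

print(chosen["Model"], chosen["Test_RMSE"], chosen["Test_R2"])
print(ranked["Status"].value_counts().to_dict())
```

With these numbers the choice is `catboost` (test RMSE 365.19, test R² 0.546), with four models flagged GOOD and one MODERATE.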


In [None]:
# Final summary and recommendations
print("\n" + "=" * 80)
print("🎉 PYCARET AUTOML ANALYSIS COMPLETE")
print("=" * 80)

# Select best model for final summary
if len(good_models) > 0:
    final_model_name = good_models.iloc[0]['Model']
    final_rmse = good_models.iloc[0]['Test_RMSE']
    final_r2 = good_models.iloc[0]['Test_R2']
else:
    final_model_name = summary_df.iloc[0]['Model']
    final_rmse = summary_df.iloc[0]['Test_RMSE']
    final_r2 = summary_df.iloc[0]['Test_R2']

print(f"\n📊 ANALYSIS SUMMARY:")
print(f"   🔢 Models requested: {len(models_list)}")
print(f"   🔢 Models with CV results: {len(cv_results)}")
print(f"   🏆 Best model: {final_model_name}")
print(f"   📏 Best test RMSE: {final_rmse:.6f}")
print(f"   📈 Best test R²: {final_r2:.6f}")

print(f"\n📁 FILES GENERATED:")
print(f"   📊 cross_validation_results.csv - CV performance of all models")
print(f"   📋 model_evaluation_summary.csv - Detailed test set evaluation")

print(f"\n💡 KEY INSIGHTS:")
print(f"   ✅ Cross-validation was performed only on training data")
print(f"   🔒 Test set was held out during cross-validation and top-5 shortlisting")
print(f"   📊 CV-vs-test comparison flagged models whose test error exceeded CV error")
print(f"   🎯 Test metrics estimate generalization; with about 60 test rows they are noisy")

print(f"\n🚀 NEXT STEPS:")
print(f"   1. Review the model evaluation summary")
print(f"   2. Select the recommended model for deployment")
print(f"   3. Consider hyperparameter tuning of the best model")
print(f"   4. Monitor performance on new data")

print("\n" + "=" * 80)
print("✨ Clean analysis completed successfully! ✨")
print("=" * 80)
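
Step 3 of the next steps mentions hyperparameter tuning. Within PyCaret this would typically go through `tune_model`; the sketch below uses scikit-learn's `RandomizedSearchCV` instead so that it runs on its own. The synthetic data, the search space, and the choice of a gradient boosting model are assumptions for illustration, not tuned recommendations for the battery dataset.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in with the same row count as the dataset.
X, y = make_regression(n_samples=299, n_features=10, noise=20.0, random_state=123)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=123
)

# Randomized search with 5-fold CV, fitted on the training portion only.
search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=123),
    param_distributions={
        "n_estimators": [50, 100, 200],
        "max_depth": [2, 3, 4],
        "learning_rate": [0.01, 0.05, 0.1],
        "subsample": [0.7, 0.85, 1.0],
    },
    n_iter=8,
    cv=5,
    scoring="neg_root_mean_squared_error",
    random_state=123,
)
search.fit(X_train, y_train)

print("best params:", search.best_params_)
print(f"best CV RMSE: {-search.best_score_:.2f}")
```

Keeping the search inside the training portion preserves the role of the held-out rows: `X_test` and `y_test` are only scored once, after the configuration is fixed.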
