# PyCaret AutoML Analysis for Battery Capacity Prediction

This notebook uses PyCaret's regression module to automatically test and compare multiple machine learning models on the battery capacity dataset.

## ⚠️ Important: SciPy Compatibility Fix

If you get an error like: `ImportError: cannot import name 'interp' from 'scipy'`

**Quick Fix:** Run this in a terminal:
```bash
pip install scipy==1.11.4
pip install pycaret==3.3.2
```

Or run the automated fix script:
```bash
python fix_pycaret_install.py
```

## Models that will be tested:
- **Linear Models**: Linear Regression, Lasso, Ridge, Elastic Net, Huber, Bayesian Ridge
- **Tree-based**: Random Forest, Extra Trees, Decision Tree, Gradient Boosting, AdaBoost
- **Advanced Boosting**: LightGBM, XGBoost, CatBoost
- **Other**: SVM, K-Nearest Neighbors, Multi-Layer Perceptron (Neural Network)


In [7]:
# Fix SciPy compatibility issue if needed
import subprocess
import sys

def fix_scipy_compatibility():
    """Fix the scipy compatibility issue with PyCaret"""
    try:
        # Try importing PyCaret first
        from pycaret.regression import setup
        print("✅ PyCaret is working correctly!")
        return True
    except ImportError as e:
        if "interp" in str(e).lower():
            print("🔧 Fixing SciPy compatibility issue...")
            print("Installing compatible SciPy version...")
            subprocess.run([sys.executable, "-m", "pip", "install", "scipy==1.11.4"], check=True)
            subprocess.run([sys.executable, "-m", "pip", "install", "pycaret==3.3.2"], check=True)
            print("✅ Fix applied! Please restart the kernel and try again.")
            return False
        else:
            print(f"❌ Different import error: {e}")
            return False

# Run the compatibility check
if not fix_scipy_compatibility():
    print("🔄 Please restart the kernel and run this cell again.")


✅ PyCaret is working correctly!


In [8]:
# Import required packages
import os
import pandas as pd
import numpy as np
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# PyCaret imports (should work now after fixing scipy)
from pycaret.regression import *

# Custom modules
from src.data_loader import DataLoader

print("✅ All imports successful!")


✅ All imports successful!


# PyCaret AutoML Analysis for Battery Capacity Prediction

This notebook uses PyCaret's regression module to automatically test and compare multiple machine learning models on the battery capacity dataset.

## Models that will be tested:
- **Linear Models**: Linear Regression, Lasso, Ridge, Elastic Net, Huber, Bayesian Ridge
- **Tree-based**: Random Forest, Extra Trees, Decision Tree, Gradient Boosting, AdaBoost
- **Advanced Boosting**: LightGBM, XGBoost, CatBoost
- **Other**: SVM, K-Nearest Neighbors, Multi-Layer Perceptron (Neural Network)


In [1]:
# Install PyCaret if not already installed
!pip install pycaret==3.3.2


Collecting pycaret==3.3.2
  Obtaining dependency information for pycaret==3.3.2 from https://files.pythonhosted.org/packages/3e/6f/b3d59fac3869a7685e68aecdd35c336800bce8c8d3b45687bb82cf9a2848/pycaret-3.3.2-py3-none-any.whl.metadata
  Downloading pycaret-3.3.2-py3-none-any.whl.metadata (17 kB)
Collecting ipywidgets>=7.6.5 (from pycaret==3.3.2)
  Obtaining dependency information for ipywidgets>=7.6.5 from https://files.pythonhosted.org/packages/58/6a/9166369a2f092bd286d24e6307de555d63616e8ddb373ebad2b5635ca4cd/ipywidgets-8.1.7-py3-none-any.whl.metadata
  Downloading ipywidgets-8.1.7-py3-none-any.whl.metadata (2.4 kB)
Collecting tqdm>=4.62.0 (from pycaret==3.3.2)
  Obtaining dependency information for tqdm>=4.62.0 from https://files.pythonhosted.org/packages/d0/30/dc54f88dd4a2b5dc8a0279bdd7270e735851848b762aeb1c1184ed1f6b14/tqdm-4.67.1-py3-none-any.whl.metadata
  Using cached tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)
Collecting scipy<=1.11.4,>=1.6.1 (from pycaret==3.3.2)
  O

In [9]:
import os
import pandas as pd
import numpy as np
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# PyCaret imports
from pycaret.regression import *

# Custom modules
from src.data_loader import DataLoader

print("✅ All imports successful!")


✅ All imports successful!


In [10]:
# Initialize paths and load data
current_dir = Path.cwd()
data_path = current_dir / "dataset_299.xlsx"
results_dir = current_dir / "results"
results_dir.mkdir(exist_ok=True)

print(f"📁 Data path: {data_path}")
print(f"📁 Results directory: {results_dir}")
print(f"✅ Data file exists: {data_path.exists()}")


📁 Data path: /Users/amirbabamahmoudi/Documents/Battery-Capacity/dataset_299.xlsx
📁 Results directory: /Users/amirbabamahmoudi/Documents/Battery-Capacity/results
✅ Data file exists: True


In [11]:
# Load and prepare data
print("Loading data...")
data_loader = DataLoader(data_path)
df = data_loader.load_data()
X, y = data_loader.split_features_target(df)

# Create a combined dataset for PyCaret (it expects target column in the dataframe)
data = X.copy()
data['target'] = y

print(f"📊 Dataset shape: {data.shape}")
print(f"🎯 Target variable: 'target' (Average Capacity)")
print(f"🔢 Number of features: {X.shape[1]}")
print(f"📈 Target statistics:")
print(y.describe())


Loading data...
Successfully loaded data with shape: (299, 63)
Index(['Cell ID', 'Average Capacity'], dtype='object')
📊 Dataset shape: (299, 62)
🎯 Target variable: 'target' (Average Capacity)
🔢 Number of features: 61
📈 Target statistics:
count     299.000000
mean     7758.993925
std       624.947969
min      4106.476500
25%      7617.600125
50%      7879.671500
75%      8123.505250
max      8579.065042
Name: Average Capacity, dtype: float64


In [14]:
# Setup PyCaret environment
print("Setting up PyCaret environment...")

reg = setup(
    data=data,
    target='target',
    session_id=123,
    train_size=0.8,
    fold=5,  # 5-fold cross-validation
    verbose=False,
    use_gpu=False  # Set to True if you have GPU support
)

print("✅ PyCaret environment successfully set up!")
print(f"📊 Training set size: {int(0.8 * len(data))} samples")
print(f"📊 Test set size: {int(0.2 * len(data))} samples")
print("🔄 Cross-validation: 5-fold")


Setting up PyCaret environment...
✅ PyCaret environment successfully set up!
📊 Training set size: 239 samples
📊 Test set size: 59 samples
🔄 Cross-validation: 5-fold


In [16]:
print("🤖 Training and comparing multiple regression models...")
print("This may take a few minutes...")

# First, compare all models without n_select to get the DataFrame
model_comparison = compare_models(
    include=[
        'lr',      # Linear Regression
        'lasso',   # Lasso Regression
        'ridge',   # Ridge Regression
        'en',      # Elastic Net
        'huber',   # Huber Regressor
        'rf',      # Random Forest
        'et',      # Extra Trees
        'gbr',     # Gradient Boosting
        'lightgbm', # LightGBM
        'xgboost', # XGBoost
        'catboost', # CatBoost
        'knn',     # K-Nearest Neighbors
        'mlp',     # Multi-Layer Perceptron
        'svm',     # Support Vector Machine
        'dt',      # Decision Tree
        'ada',     # AdaBoost
        'br'       # Bayesian Ridge
    ],
    sort='RMSE',  # Sort by RMSE (lower is better)
    verbose=False
)

print("✅ Model comparison completed!")
print("\n📊 ALL MODELS COMPARISON RESULTS:")
print("="*80)

# Get the comparison results DataFrame
comparison_results = pull()
display(comparison_results.round(4))

# The best model is already returned by compare_models()
print(f"\n🏆 Best model type: {type(model_comparison).__name__}")

🤖 Training and comparing multiple regression models...
This may take a few minutes...
✅ Model comparison completed!

📊 ALL MODELS COMPARISON RESULTS:


AttributeError: 'Index' object has no attribute '_format_flat'

                                    Model        MAE           MSE       RMSE  \
catboost               CatBoost Regressor   241.2144  1.026951e+05   318.1994   
ada                    AdaBoost Regressor   250.5960  1.098178e+05   330.1690   
rf                Random Forest Regressor   251.3261  1.135517e+05   334.6180   
gbr           Gradient Boosting Regressor   256.6278  1.143671e+05   335.9524   
et                  Extra Trees Regressor   256.2150  1.197696e+05   344.2187   
xgboost         Extreme Gradient Boosting   271.5624  1.378923e+05   367.5547   
dt                Decision Tree Regressor   283.0405  1.521431e+05   386.5226   
lightgbm  Light Gradient Boosting Machine   301.9886  1.878845e+05   422.9651   
lr                      Linear Regression   354.5392  2.693535e+05   508.4264   
knn                 K Neighbors Regressor   364.7195  3.162262e+05   555.0810   
huber                     Huber Regressor   367.7854  3.997024e+05   607.2249   
en                          


🏆 Best model type: CatBoostRegressor


In [17]:
# ========================================
# OVERFITTING ANALYSIS FOR PYCARET MODELS
# ========================================

print("\n" + "="*80)
print("🔍 COMPREHENSIVE OVERFITTING ANALYSIS")
print("="*80)

# Get the comparison results (cross-validation scores)
model_comparison = pull()
print("📊 Cross-validation results:")
display(model_comparison.round(4))

# Get top 5 models for detailed analysis
top_models = model_comparison.head(5).index.tolist()
print(f"\n🎯 Analyzing top {len(top_models)} models for overfitting...")

# Store detailed results
detailed_results = []

for i, model_name in enumerate(top_models, 1):
    print(f"\n{'='*60}")
    print(f"📊 MODEL {i}: {model_name.upper()}")
    print(f"{'='*60}")
    
    # Create the model
    model = create_model(model_name, verbose=False)
    
    # Get training and test data from PyCaret setup
    X_train = get_config('X_train')
    y_train = get_config('y_train')
    X_test = get_config('X_test') 
    y_test = get_config('y_test')
    
    # Make predictions on both sets
    train_pred = model.predict(X_train)
    test_pred = model.predict(X_test)
    
    # Calculate metrics
    from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
    import numpy as np
    
    # Training metrics
    train_rmse = np.sqrt(mean_squared_error(y_train, train_pred))
    train_r2 = r2_score(y_train, train_pred)
    
    # Test metrics
    test_rmse = np.sqrt(mean_squared_error(y_test, test_pred))
    test_r2 = r2_score(y_test, test_pred)
    
    # CV metrics (from comparison)
    cv_rmse = model_comparison.loc[model_name, 'RMSE']
    cv_r2 = model_comparison.loc[model_name, 'R2']
    
    # Create comparison
    performance_df = pd.DataFrame({
        'Dataset': ['Cross-Validation', 'Training Set', 'Test Set'],
        'RMSE': [cv_rmse, train_rmse, test_rmse],
        'R²': [cv_r2, train_r2, test_r2]
    })
    
    print("📊 Performance Comparison:")
    display(performance_df.round(6))
    
    # Overfitting assessment
    r2_drop = train_r2 - test_r2
    rmse_increase_pct = ((test_rmse - train_rmse) / train_rmse) * 100
    
    print(f"\n🔍 Overfitting Indicators:")
    print(f"   R² drop (train→test): {r2_drop:.4f}")
    print(f"   RMSE increase: {rmse_increase_pct:.2f}%")
    
    # Assessment
    if r2_drop > 0.15 or rmse_increase_pct > 25:
        assessment = "🔴 HIGH OVERFITTING"
    elif r2_drop > 0.1 or rmse_increase_pct > 15:
        assessment = "🟡 MODERATE OVERFITTING"
    else:
        assessment = "🟢 GOOD GENERALIZATION"
    
    print(f"   Overall: {assessment}")
    
    # Store results
    detailed_results.append({
        'Model': model_name,
        'CV_RMSE': cv_rmse,
        'Train_RMSE': train_rmse,
        'Test_RMSE': test_rmse,
        'CV_R2': cv_r2,
        'Train_R2': train_r2,
        'Test_R2': test_r2,
        'R2_Drop': r2_drop,
        'RMSE_Increase_%': rmse_increase_pct,
        'Assessment': assessment.split()[-1]
    })

# Summary table
print(f"\n{'='*80}")
print("📋 OVERFITTING SUMMARY")
print(f"{'='*80}")

summary_df = pd.DataFrame(detailed_results)
display(summary_df.round(4))

# Recommendations
best_model = summary_df[summary_df['Assessment'] == 'GENERALIZATION']
if len(best_model) > 0:
    recommended = best_model.loc[best_model['Test_RMSE'].idxmin(), 'Model']
    print(f"\n🏆 RECOMMENDED MODEL: {recommended}")
    print("   ✅ Shows good generalization with lowest test error")
else:
    print(f"\n⚠️  All models show some overfitting. Choose the least problematic one.")


🔍 COMPREHENSIVE OVERFITTING ANALYSIS
📊 Cross-validation results:


AttributeError: 'Index' object has no attribute '_format_flat'

                                    Model        MAE           MSE       RMSE  \
catboost               CatBoost Regressor   241.2144  1.026951e+05   318.1994   
ada                    AdaBoost Regressor   250.5960  1.098178e+05   330.1690   
rf                Random Forest Regressor   251.3261  1.135517e+05   334.6180   
gbr           Gradient Boosting Regressor   256.6278  1.143671e+05   335.9524   
et                  Extra Trees Regressor   256.2150  1.197696e+05   344.2187   
xgboost         Extreme Gradient Boosting   271.5624  1.378923e+05   367.5547   
dt                Decision Tree Regressor   283.0405  1.521431e+05   386.5226   
lightgbm  Light Gradient Boosting Machine   301.9886  1.878845e+05   422.9651   
lr                      Linear Regression   354.5392  2.693535e+05   508.4264   
knn                 K Neighbors Regressor   364.7195  3.162262e+05   555.0810   
huber                     Huber Regressor   367.7854  3.997024e+05   607.2249   
en                          


🎯 Analyzing top 5 models for overfitting...

📊 MODEL 1: CATBOOST
📊 Performance Comparison:


AttributeError: 'Index' object has no attribute '_format_flat'

            Dataset        RMSE        R²
0  Cross-Validation  318.199400  0.653400
1      Training Set  278.583539  0.812148
2          Test Set  447.435174  0.319013


🔍 Overfitting Indicators:
   R² drop (train→test): 0.4931
   RMSE increase: 60.61%
   Overall: 🔴 HIGH OVERFITTING

📊 MODEL 2: ADA


ValueError: Input X contains NaN.
AdaBoostRegressor does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

In [None]:
# Save comparison results and display top performers
comparison_file = results_dir / "pycaret_model_comparison.csv"
model_comparison.to_csv(comparison_file)
print(f"💾 Results saved to: {comparison_file}")

# Display top 5 models
print("\n🏆 TOP 5 BEST PERFORMING MODELS:")
print("="*50)
top_5 = model_comparison.head(5)
for i, (idx, row) in enumerate(top_5.iterrows(), 1):
    print(f"{i}. {idx}: RMSE={row['RMSE']:.6f}, R²={row['R2']:.6f}, MAE={row['MAE']:.6f}")


In [None]:
# Get the best model and create it
best_model_name = model_comparison.index[0]
print(f"🏆 Best performing model: {best_model_name}")
print(f"📊 Best RMSE: {model_comparison.iloc[0]['RMSE']:.6f}")
print(f"📈 Best R²: {model_comparison.iloc[0]['R2']:.6f}")

# Create the best model
print("\n🔧 Creating and training the best model...")
best_model = create_model(best_model_name, verbose=False)

print(f"✅ Best model created: {type(best_model).__name__}")


In [None]:
# Tune hyperparameters of the best model
print("🔧 Tuning hyperparameters of the best model...")
tuned_model = tune_model(best_model, verbose=False)
print("✅ Hyperparameter tuning completed!")


In [None]:
# Evaluate the tuned model with plots
print("📊 Evaluating the tuned model...")
evaluate_model(tuned_model)


In [None]:
# Finalize the model and make predictions
print("🎯 Finalizing model on full dataset...")
final_model = finalize_model(tuned_model)

# Make predictions on test set
print("🔮 Making predictions on test set...")
predictions = predict_model(final_model)

print("\n📈 PREDICTION RESULTS (First 10 samples):")
print("="*50)
display(predictions[['target', 'prediction_label', 'prediction_residuals']].head(10).round(4))


In [None]:
# Calculate final performance metrics
from sklearn.metrics import mean_absolute_error, r2_score

y_true = predictions['target']
y_pred = predictions['prediction_label']

mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

print("\n📊 FINAL MODEL PERFORMANCE ON TEST SET:")
print("="*50)
print(f"🎯 Mean Absolute Error (MAE): {mae:.6f}")
print(f"📏 Root Mean Square Error (RMSE): {rmse:.6f}")
print(f"📈 R² Score: {r2:.6f}")
print(f"📊 Model Type: {type(final_model).__name__}")

# Performance interpretation
print("\n🔍 PERFORMANCE INTERPRETATION:")
print("="*40)
if r2 > 0.9:
    print("🌟 Excellent performance! The model explains >90% of variance.")
elif r2 > 0.8:
    print("✅ Very good performance! The model explains >80% of variance.")
elif r2 > 0.7:
    print("👍 Good performance! The model explains >70% of variance.")
else:
    print("⚠️  Moderate performance. Consider feature engineering or more data.")

avg_capacity = y_true.mean()
mae_percentage = (mae / avg_capacity) * 100
print(f"📊 Average prediction error: {mae_percentage:.2f}% of average capacity")


In [None]:
# Save the final model and predictions
model_file = results_dir / "pycaret_best_model"
save_model(final_model, str(model_file))
print(f"💾 Best model saved to: {model_file}.pkl")

predictions_file = results_dir / "pycaret_predictions.csv"
predictions.to_csv(predictions_file, index=False)
print(f"💾 Predictions saved to: {predictions_file}")


In [None]:
# Generate visualization plots
print("📊 Generating visualization plots...")

try:
    # Residuals plot
    plot_model(final_model, plot='residuals', display_format='streamlit')
    
    # Prediction error plot  
    plot_model(final_model, plot='error', display_format='streamlit')
    
    # Feature importance (if available)
    if hasattr(final_model, 'feature_importances_') or hasattr(final_model, 'coef_'):
        plot_model(final_model, plot='feature', display_format='streamlit')
        
    print("✅ All plots generated successfully!")
    
except Exception as e:
    print(f"⚠️  Some plots could not be generated: {e}")


In [None]:
# Final Summary
print("\n" + "="*80)
print("🎉 PYCARET ANALYSIS SUMMARY")
print("="*80)
print(f"📊 Total models tested: {len(model_comparison)}")
print(f"🏆 Best model: {type(final_model).__name__}")
print(f"🎯 Final RMSE: {rmse:.6f}")
print(f"📈 Final R² Score: {r2:.6f}")
print(f"🎯 Final MAE: {mae:.6f}")
print(f"📊 Average prediction error: {mae_percentage:.2f}%")
print(f"📁 All results saved in: {results_dir}")

print("\n🔝 TOP 3 MODELS:")
for i in range(min(3, len(model_comparison))):
    model_name = model_comparison.index[i]
    rmse_val = model_comparison.iloc[i]['RMSE']
    r2_val = model_comparison.iloc[i]['R2']
    print(f"  {i+1}. {model_name}: RMSE={rmse_val:.6f}, R²={r2_val:.6f}")

print("="*80)
