<a href="https://colab.research.google.com/github/amkayhani/FAIDM/blob/main/Random_Forest_Battery_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Regression: Random Forests for Battery Cycle-Life Prediction**

## **Module Context**

This notebook is part of the Regression: Random Forests teaching module delivered at WMG, University of Warwick (January 2026).

## **Overview**

The aim of this notebook is to develop a Random Forest regression model to predict the Equivalent Full Cycles (EFCs) at a specified State of Health threshold (SOH = 0.9) for lithium-ion battery cells.

The model leverages charge–discharge profile characteristics extracted from early-life cycling data to estimate the cycle life of cells operating under diverse conditions. These include both laboratory-controlled test protocols and real-world electric vehicle (EV) usage profiles, highlighting the robustness of ensemble regression methods in practical energy-system applications.

## **Learning Objectives**

By the end of this notebook, students will be able to:
- understand the use of Random Forests for regression in battery health prediction,
- train and configure a Random Forest regressor,
- evaluate regression performance using appropriate metrics, and
- interpret model behaviour across different operating conditions.

## **Notebook Scope**

This notebook presents a complete and reproducible regression pipeline, including:
- loading of features pre-extracted from charge–discharge profiles,
- definition of input variables and target EFC values,
- Random Forest model training and hyperparameter selection,
- performance evaluation and result interpretation.

## **Module Delivery**

Dr **Mona Faraji Niri** — Associate Professor of Energy Systems

Dr **Hamidreza Farhadi Tolie** — Research Fellow

In [None]:
import numpy as np
import pandas as pd

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from scipy.stats import skew, kurtosis
import os
import matplotlib.pyplot as plt

## 1. Data loading

In [None]:
# Note: metadata_url is defined for reference but is not loaded in this notebook
metadata_url = "https://drive.google.com/uc?id=10plwp_22mc2E5MnKhUSi1ZJBviWb-7E6"
features_url = "https://drive.google.com/uc?id=1_s0oQFlBhiSHRWs60iVX2RH5TulM533t"
target_url   = "https://drive.google.com/uc?id=15q2jV1B6no7a16tafTYt779z5GPM98FU"

# Load CSVs into DataFrames
df_features = pd.read_csv(features_url, index_col=0)
df_soh90    = pd.read_csv(target_url, index_col=0)

# Preview
df_features.head()

## 2. Train–Test Split

In [None]:
X = df_features.values
y = np.array(df_soh90['EFCs (with Diagnostic)'])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

## 3. Random Forest Model
**Hyperparameters of the model:**
- **n_estimators (default=100)** — Number of trees. More trees improve stability but increase training time. Example: 100, 500, 1000.
- **max_depth (default=None)** — Maximum depth of trees. Use None for fully grown trees, or smaller values to reduce overfitting. Example: None, 5, 10.
- **min_samples_split (default=2)** — Minimum samples to split a node. Larger values → more conservative trees. Example: 2, 5, 10.
- **min_samples_leaf (default=1)** — Minimum samples at a leaf. Larger values smooth predictions. Example: 1, 2, 4.
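- **max_features (default=1.0)** — Features considered at each split. The default of 1.0 uses all features (the older "auto" value has been removed from recent scikit-learn releases). Other options: "sqrt", "log2", None, an integer number of features, or a float fraction.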
- **random_state** — Ensures reproducible results.
- **n_jobs** — CPU cores to use. -1 = all cores.
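
As a rough illustration of how these settings are passed to scikit-learn, the sketch below fits two forests on synthetic data and compares how deep the trees grow. The data come from `make_regression`, not from the battery dataset, and the sample size and parameter values are assumptions chosen only to keep the demo small.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumption: NOT the battery features used in this notebook)
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# Fully grown trees: max_depth=None lets each tree split until its leaves are pure
rf_deep = RandomForestRegressor(n_estimators=50, max_depth=None, random_state=0, n_jobs=-1)
rf_deep.fit(X_demo, y_demo)

# Constrained trees: shallower, larger leaves, fewer candidate features per split
rf_shallow = RandomForestRegressor(
    n_estimators=50,
    max_depth=3,
    min_samples_split=10,
    min_samples_leaf=4,
    max_features="sqrt",
    random_state=0,
    n_jobs=-1,
)
rf_shallow.fit(X_demo, y_demo)

# Each fitted tree is available in estimators_ and reports its own depth
deep_depths = [tree.get_depth() for tree in rf_deep.estimators_]
shallow_depths = [tree.get_depth() for tree in rf_shallow.estimators_]
print("Deepest tree, unconstrained forest:", max(deep_depths))
print("Deepest tree, constrained forest:  ", max(shallow_depths))
```

The constrained forest can never exceed the requested depth of 3, whereas the unconstrained one grows considerably deeper on the same data.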

## Hyperparameter Optimisation with Random Forests and Validation Metrics

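In this step, we optimise the hyperparameters of the Random Forest regressor using **Optuna**. Each candidate is scored on the held-out split (X_test, y_test), which here doubles as the validation set. Because the same split is used both to select the hyperparameters and to report the final metrics, the reported scores are likely to be optimistic estimates of performance on unseen cells.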

The goal is to find the combination of hyperparameters that maximises predictive performance (here measured by R²) for battery lifetime prediction, while also monitoring MAE and RMSE on the validation set.
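
For reference, the three metrics can be written out by hand. The sketch below uses made-up EFC values (illustrative only, not results from this dataset) and plain NumPy, so each step of the arithmetic is visible. The `_demo` names are hypothetical and are chosen so that they do not clash with variables used elsewhere in the notebook.

```python
import numpy as np

# Made-up values for illustration only (not taken from the battery dataset)
y_true_demo = np.array([400.0, 500.0, 600.0, 700.0])
y_hat_demo = np.array([420.0, 480.0, 630.0, 690.0])

errors = y_true_demo - y_hat_demo            # [-20, 20, -30, 10]

# MAE: average absolute error, in the same units as the target (EFCs)
mae_demo = np.mean(np.abs(errors))           # (20 + 20 + 30 + 10) / 4 = 20

# RMSE: square root of the mean squared error; penalises large errors more
rmse_demo = np.sqrt(np.mean(errors ** 2))    # sqrt(1800 / 4) = sqrt(450)

# R²: 1 minus (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum(errors ** 2)                                # 1800
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)    # 50000
r2_demo = 1.0 - ss_res / ss_tot                             # 1 - 0.036

print(f"MAE: {mae_demo:.3f}, RMSE: {rmse_demo:.3f}, R²: {r2_demo:.3f}")
# MAE: 20.000, RMSE: 21.213, R²: 0.964
```

These definitions match what `mean_absolute_error`, `mean_squared_error` (followed by a square root) and `r2_score` compute for a single target.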

**How it works:**

1. Optuna iteratively suggests hyperparameter combinations.
2. For each trial, the Random Forest is trained on X_train, y_train.
3. Predictions are made on X_test, and MAE, RMSE, and R² are calculated.
4. The progress bar shows optimisation progress, and trial results are printed in real time.
5. At the end, the parameter set with the highest R² is selected, the model is refit with it, and the final metrics are reported.
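
The loop described above can also be run with a separate validation split, so that the test split is scored only once at the end. The sketch below is one possible variant, not the notebook's own procedure. It uses synthetic data from `make_regression`, and a plain random search stands in for Optuna's sampler. All variable names and the reduced `n_estimators` range are assumptions made for the demo.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: NOT the battery dataset)
X_all, y_all = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Roughly 60 % fit / 20 % validation (used for tuning) / 20 % hold-out test
X_dev, X_hold, y_dev, y_hold = train_test_split(X_all, y_all, test_size=0.2, random_state=0)
X_fit, X_val, y_fit, y_val = train_test_split(X_dev, y_dev, test_size=0.25, random_state=0)

rng = np.random.default_rng(0)
feature_options = ["sqrt", "log2", None]
best_val_r2, best_params = -np.inf, None

for trial_id in range(10):
    # Random draws; the ranges mirror the notebook's search space, except for
    # a smaller n_estimators range that keeps the demo fast
    params = {
        "n_estimators": int(rng.integers(50, 151)),
        "max_depth": int(rng.integers(3, 11)),
        "min_samples_split": int(rng.integers(2, 11)),
        "min_samples_leaf": int(rng.integers(1, 5)),
        "max_features": feature_options[int(rng.integers(len(feature_options)))],
    }
    model = RandomForestRegressor(**params, random_state=42).fit(X_fit, y_fit)
    val_r2 = r2_score(y_val, model.predict(X_val))
    if val_r2 > best_val_r2:
        best_val_r2, best_params = val_r2, params

# Refit on fit + validation data, then score the hold-out test split exactly once
final_model = RandomForestRegressor(**best_params, random_state=42).fit(X_dev, y_dev)
test_r2 = r2_score(y_hold, final_model.predict(X_hold))

print(f"Best validation R²: {best_val_r2:.3f} | Test R²: {test_r2:.3f}")
print("Best params:", best_params)
```

Keeping the test split out of the search means the final R² is not influenced by the hyperparameter selection itself. The cost is that less data is available for fitting during tuning.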

In [None]:
!pip install optuna

In [None]:
import optuna
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
import numpy as np

# Objective function for Optuna
def objective(trial):
    rf = RandomForestRegressor(
        n_estimators=trial.suggest_int('n_estimators', 100, 500),
        max_depth=trial.suggest_int('max_depth', 3, 10),
        min_samples_split=trial.suggest_int('min_samples_split', 2, 10),
        min_samples_leaf=trial.suggest_int('min_samples_leaf', 1, 4),
        max_features=trial.suggest_categorical('max_features', ['sqrt', 'log2', None]),
        random_state=42,
        n_jobs=-1
    )

    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)

    mae = mean_absolute_error(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)

    print(f"Trial {trial.number:03d} | R²: {r2:.3f} | MAE: {mae:.3f} | RMSE: {rmse:.3f} | Params: {trial.params}")
    return r2

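# Seed the sampler so that the sequence of suggested trials is repeatable
study = optuna.create_study(
    direction='maximize',
    study_name="RF_Validation_Optimisation",
    sampler=optuna.samplers.TPESampler(seed=42),
)
study.optimize(objective, n_trials=30, show_progress_bar=True)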

# Refit a model with the best hyperparameters found by the study
best_rf = RandomForestRegressor(**study.best_params, random_state=42, n_jobs=-1)
best_rf.fit(X_train, y_train)
y_pred_best = best_rf.predict(X_test)

best_mae = mean_absolute_error(y_test, y_pred_best)
best_rmse = np.sqrt(mean_squared_error(y_test, y_pred_best))
best_r2 = r2_score(y_test, y_pred_best)

print("\n✅ Best Parameters Found:")
print(study.best_params)
print(f"MAE: {best_mae:.3f}, RMSE: {best_rmse:.3f}, R²: {best_r2:.3f}")


In [None]:
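# Train a model with fixed hyperparameters.
# To use the values found by Optuna instead, pass **study.best_params.
rf = RandomForestRegressor(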
    n_estimators=354,
    max_depth=9,
    min_samples_split=2,
    min_samples_leaf=1,
    max_features="sqrt",
    random_state=42,
    n_jobs=-1
)

rf.fit(X_train, y_train)

## 4. Prediction and Evaluation

In [None]:
y_pred = rf.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

mae, rmse, r2

## 5. Visualisations

In [None]:
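plt.figure()
plt.scatter(y_test, y_pred, label="Test samples")
# Diagonal reference line: points lying on it are predicted perfectly
plt.plot([y_test.min(), y_test.max()],
         [y_test.min(), y_test.max()],
         linestyle="--", color="k", label="Ideal (predicted = computed)")
plt.xlabel("Computed EFC")
plt.ylabel("Predicted EFC")
plt.title("Predicted vs Computed EFC (test set)")
plt.legend()
plt.show()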