# Ensemble with Node Similarity Forecasting

This notebook demonstrates a forecasting approach called **“Ensemble with Node Similarity”**, using a pre-trained foundation time-series model (*TimesFM*) and multivariate sensor/node data. The goal is to improve forecast accuracy of a target node by blending its own forecast with those of *similar neighbor nodes*.

---

## 🧰 Setup

- **Libraries**:  
  `pandas`, `numpy`, `matplotlib` for data handling/plots;  
  `timesfm` for the foundation model;  
  `scikit-learn` for error metrics and optional regression if needed.

- **Data**:  
  CSVs named `temperature_data_{samp}min_{num_nodes}.csv` inside the `Dataset_perSampling_pernodeConfig/` folder, where:  
  • `samp` = sampling interval in minutes (5, 15, 30, 45, or 60),  
  • `num_nodes` = number of sensor nodes, i.e. time series (8, 16, or 25).  
  Each file contains a datetime column (`ds`) and one temperature column per node.

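- **Train/Test Splits**:  
  Train = `2018-11-01` through `2018-11-06`.  
  Test = `2018-11-08` through `2018-11-10`; `2018-11-07` falls in neither split, and only the first `p_steps` points of the test window are scored.  
  Forecast horizon (`p_steps`) = 4 hours' worth of points, i.e., `4 * 60 // sampling_min` (integer division).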
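
To make the horizon arithmetic concrete, the short sketch below evaluates the same integer division for each sampling interval listed above. The helper name `horizon_steps` is an illustrative assumption and does not appear in the notebook.

```python
# Forecast horizon in steps for each sampling interval, using the same
# integer division as the notebook: p_steps = 4 * 60 // samp.
HORIZON_HOURS = 4

def horizon_steps(sampling_min: int, hours: int = HORIZON_HOURS) -> int:
    """Number of forecast points that fit in `hours` at the given sampling interval."""
    return hours * 60 // sampling_min

for samp in [5, 15, 30, 45, 60]:
    steps = horizon_steps(samp)
    covered_min = steps * samp  # minutes actually covered by the horizon
    print(f"{samp:>2} min sampling -> {steps:>2} steps ({covered_min} min covered)")
```

At 45-minute sampling the division is not exact, so the horizon is 5 steps and covers 225 minutes rather than the full 240.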

---

## 🔍 Method: `forecast_with_ensemble(...)`

### Goal

For each target node:

1. Compute its own forecast using TimesFM.  
2. Compute forecasts for a few other “neighbor” nodes that are *most similar* (by correlation) to the target.  
3. Blend the target’s forecast (with fixed weight) with scaled neighbor forecasts, using weights derived from similarity.
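
The blending in step 3 is a plain weighted average, which a few lines of NumPy can illustrate. This is a minimal sketch under stated assumptions: `blend_forecasts` is a hypothetical helper, the neighbor forecasts are assumed to be already rescaled to the target's level, and all numbers are made up for illustration.

```python
import numpy as np

def blend_forecasts(target_fc, neighbor_fcs, sim_weights, target_weight=0.6):
    """Blend a target forecast with level-adjusted neighbor forecasts.

    target_fc:    shape [p_steps]
    neighbor_fcs: shape [n_neighbors, p_steps], already level-adjusted
    sim_weights:  shape [n_neighbors], similarity weights that sum to 1
    """
    target_fc = np.asarray(target_fc, dtype=float)
    neighbor_fcs = np.asarray(neighbor_fcs, dtype=float)
    sim_weights = np.asarray(sim_weights, dtype=float)
    neighbor_mix = sim_weights @ neighbor_fcs  # weighted sum over neighbors
    return target_weight * target_fc + (1.0 - target_weight) * neighbor_mix

# Hypothetical two-step forecasts, for illustration only.
target_fc = [20.0, 21.0]
neighbor_fcs = [[22.0, 23.0], [18.0, 19.0]]
sim_weights = [0.75, 0.25]
print(blend_forecasts(target_fc, neighbor_fcs, sim_weights))
```

Because the similarity weights sum to 1 and the two outer weights sum to 1, the blend stays on the same scale as the individual forecasts.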

### How it works (step-by-step)

| Step | Operation |
|---|---|
| **1. Correlation ranking** | Compute the lag-0 Pearson correlation between the target and every other node on the training window. Rank by absolute value, so a strongly anti-correlated node ranks as high as a strongly correlated one. |
| **2. Neighbor selection** | Pick the top `n_similar` nodes (excluding the target) as neighbors. If fewer than `n_similar` other nodes exist, use all of them. |
| **3. Weight construction** | Normalize the absolute correlations of the selected neighbors so they sum to 1. These are the *similarity weights*. |
| **4. Base forecast** | Forecast the target node alone with TimesFM over `p_steps`. |
| **5. Neighbor forecasts + level adjustment** | For each selected neighbor: forecast with TimesFM, then multiply by the ratio of training means (target mean / neighbor mean) to align the neighbor's level with the target's. |
| **6. Blending** | `ensemble_forecast = 0.6 * target_forecast + 0.4 * (similarity-weighted sum of adjusted neighbor forecasts)` |
| **7. Fallback** | If no neighbors exist, or any step raises an exception, return the target-only forecast. |
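
Steps 1 to 3 and the mean-ratio factor from step 5 can be tried in isolation on synthetic data. The sketch below is an illustration rather than the notebook's exact code: `select_neighbors` and `level_ratio` are hypothetical helper names, and the four columns are generated sine and cosine curves, not sensor readings.

```python
import numpy as np
import pandas as pd

def select_neighbors(train: pd.DataFrame, target_col: str, n_similar: int = 3):
    """Return (neighbor_names, similarity_weights) ranked by |Pearson r| at lag 0."""
    corr = train.corr(method="pearson")[target_col].abs()
    # Drop the target itself and any undefined correlations before ranking.
    corr = corr.drop(target_col).dropna().sort_values(ascending=False)
    top = corr.head(n_similar)  # fewer entries if fewer nodes exist
    if top.empty or top.sum() == 0:
        return [], np.array([])
    return top.index.tolist(), (top / top.sum()).to_numpy()

def level_ratio(train: pd.DataFrame, target_col: str, node: str) -> float:
    """Mean-ratio factor that rescales a neighbor's forecast to the target's level."""
    node_mean = train[node].mean()
    return train[target_col].mean() / node_mean if node_mean != 0 else 1.0

# Synthetic data, for illustration only.
t = np.arange(200)
train = pd.DataFrame({
    "n1": 20 + np.sin(t / 10),
    "n2": 40 + 2 * np.sin(t / 10),  # same shape as n1, twice the level
    "n3": 20 - np.sin(t / 10),      # anti-correlated with n1
    "n4": 20 + np.cos(t / 3),       # weakly related to n1
})
names, weights = select_neighbors(train, "n1", n_similar=2)
print(names, weights.round(3))
print({n: round(level_ratio(train, "n1", n), 3) for n in names})
```

In this toy setup the anti-correlated column `n3` is selected alongside `n2`, because ranking uses the absolute correlation. The mean-ratio adjustment changes only the level, not the direction of movement, which is worth keeping in mind when neighbors with negative correlation are present.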

---

## 📊 Evaluation Metrics

- **MAE** (Mean Absolute Error)  
- **RMSE** (Root Mean Squared Error)  
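- **MAPE** (Mean Absolute Percentage Error); scikit-learn's `mean_absolute_percentage_error` returns a fraction (0.01 = 1%) and the notebook does not multiply it by 100, so the logged values are fractions despite the `%` suffix.  

These are computed jointly over all nodes and all `p_steps` forecast points for each configuration `(num_nodes, sampling_interval)`.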
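
For reference, the three metrics can be written directly in NumPy. This is a sketch of the standard definitions pooled over a `[p_steps, n_nodes]` matrix; `summary_metrics` is a hypothetical helper, the numbers are made up, and the MAPE line assumes no zeros in `y_true`.

```python
import numpy as np

def summary_metrics(y_true, y_pred):
    """MAE, RMSE and MAPE (as a fraction) pooled over all steps and nodes."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    mape = np.mean(np.abs(err) / np.abs(y_true))  # fraction; multiply by 100 for percent
    return {"MAE": mae, "RMSE": rmse, "MAPE": mape}

# Hypothetical values: 2 forecast steps (rows) x 2 nodes (columns).
y_true = [[20.0, 10.0], [22.0, 10.0]]
y_pred = [[21.0, 9.0], [20.0, 10.0]]
print(summary_metrics(y_true, y_pred))
```

With an equal number of steps per node, pooling all entries this way gives the same MAE and RMSE as averaging per-node errors uniformly.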

---

## 🧮 Code Walkthrough

- `init_timesfm(p_steps)`: initializes the TimesFM model with the context and horizon lengths and preset hyperparameters, loading the `google/timesfm-1.0-200m-pytorch` checkpoint.  
- `forecast_with_ensemble(...)`: implements the ensembling logic described above.  

---

In [1]:
# Install dependencies and unpack the dataset before importing the model library
!pip install jax
!unzip 'Dataset_perSampling_pernodeConfig.zip'
import timesfm

 See https://github.com/google-research/timesfm/blob/master/README.md for updated APIs.


Loaded PyTorch TimesFM, likely because python version is 3.10.18 (main, Jun  5 2025, 08:37:47) [Clang 14.0.6 ].


In [2]:
import os
from pathlib import Path

# --- NEW: where to save
OUT_DIR = Path("./outputs")
OUT_DIR.mkdir(parents=True, exist_ok=True)

# --- NEW: tidy row builder
def build_forecast_rows(test_index, approach, nodes, sampling, target_col, y_true_1d, y_pred_1d):
    """
    Returns a list of dict rows for one target series (length = p_steps).
    """
    rows = []
    for step, (ts, yt, yp) in enumerate(zip(test_index, y_true_1d, y_pred_1d), start=1):
        rows.append({
            "ds": ts,                         # timestamp for that forecast step
            "approach": approach,             # e.g., "Baseline (No Covariates)"
            "nodes": nodes,                   # e.g., 8, 16, 25
            "sampling": sampling,             # minutes (5/15/30/45/60)
            "target": target_col,             # node/column name
            "step": step,                     # 1..p_steps
            "y_true": float(yt),
            "y_pred": float(yp),
        })
    return rows
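
The long ("tidy") layout produced by this helper makes per-step analysis a one-line group-by. The sketch below uses hand-written rows with the same keys; the node names and values are hypothetical and only illustrate the shape of the data.

```python
import pandas as pd

# Hypothetical rows with the same keys that build_forecast_rows emits.
approach = "Baseline (No Covariates)"
rows = [
    {"ds": pd.Timestamp("2018-11-08 00:00"), "approach": approach, "nodes": 8,
     "sampling": 60, "target": "node_a", "step": 1, "y_true": 20.0, "y_pred": 21.0},
    {"ds": pd.Timestamp("2018-11-08 01:00"), "approach": approach, "nodes": 8,
     "sampling": 60, "target": "node_a", "step": 2, "y_true": 20.5, "y_pred": 22.5},
    {"ds": pd.Timestamp("2018-11-08 00:00"), "approach": approach, "nodes": 8,
     "sampling": 60, "target": "node_b", "step": 1, "y_true": 18.0, "y_pred": 17.0},
    {"ds": pd.Timestamp("2018-11-08 01:00"), "approach": approach, "nodes": 8,
     "sampling": 60, "target": "node_b", "step": 2, "y_true": 18.5, "y_pred": 19.0},
]
df = pd.DataFrame(rows)

# Mean absolute error per forecast step, averaged over targets.
df["abs_err"] = (df["y_pred"] - df["y_true"]).abs()
per_step = df.groupby(["approach", "step"])["abs_err"].mean()
print(per_step)
```

Grouping by `step` in this way shows how the error grows with lead time, which the aggregate MAE per configuration hides.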

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import timesfm
from sklearn.metrics import mean_absolute_error, mean_squared_error, mean_absolute_percentage_error
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor

# Setup TimesFM pretrained model
def init_timesfm(p_steps):
    return timesfm.TimesFm(
        timesfm.TimesFmHparams(
            backend="pt",
            context_len=32,
            horizon_len=p_steps,
            input_patch_len=32,
            output_patch_len=128,
            use_positional_embedding=False
        ),
        checkpoint=timesfm.TimesFmCheckpoint(
            huggingface_repo_id="google/timesfm-1.0-200m-pytorch"
        )
    )

# APPROACH: Ensemble with Node Similarity (Fixed)
def forecast_with_ensemble(train, test, target_col, p_steps, tfm, n_similar=3):
    """
    1) Pearson correlation to find similar nodes
    2) Forecast target + neighbors
    3) Blend (60/40) with correlation weights
    """
    try:
        # --- explicit Pearson (lag-0), ranked by absolute value ---
        correlations = (
            train.select_dtypes(include=[np.number])        # numeric cols only
                 .corr(method="pearson", min_periods=1)[target_col]
                 .abs()
                 .drop(target_col)                           # exclude the target explicitly
                 .dropna()                                   # constant columns yield NaN
                 .sort_values(ascending=False)
        )

        # use as many neighbors as exist, up to n_similar
        n_use = min(n_similar, len(correlations))

        if n_use == 0:
            target_hist = train[target_col].values.astype(float)
            forecast, _ = tfm.forecast([target_hist], freq=[0])
            return forecast[0][:p_steps]

        similar_nodes = correlations.index[:n_use].tolist()

        # weights from |Pearson r|, normalized to sum to 1
        weights = correlations[similar_nodes].values
        weights = weights / weights.sum()

        # target forecast
        target_hist = train[target_col].values.astype(float)
        target_forecast, _ = tfm.forecast([target_hist], freq=[0])
        target_forecast = target_forecast[0][:p_steps]

        ensemble_forecast = target_forecast * 0.6

        # neighbor forecasts (level-adjusted by mean ratio)
        for node, w in zip(similar_nodes, weights):
            node_hist = train[node].values.astype(float)
            node_forecast, _ = tfm.forecast([node_hist], freq=[0])

            node_mean = train[node].mean()
            tgt_mean  = train[target_col].mean()
            hist_ratio = (tgt_mean / node_mean) if node_mean != 0 else 1.0

            adjusted = node_forecast[0][:p_steps] * hist_ratio
            ensemble_forecast += adjusted * w * 0.4

        return ensemble_forecast

    except Exception as e:
        print(f"Ensemble failed for {target_col}: {e}")
        target_hist = train[target_col].values.astype(float)
        forecast, _ = tfm.forecast([target_hist], freq=[0])
        return forecast[0][:p_steps]

print("Starting multi-node forecasting comparison...\n")

results_comparison = []
all_rows = []  # NEW: collect tidy rows for all forecasts

for approach_name, approach_func in [
    ("Baseline (No Covariates)", None),
    ("Ensemble Similarity", forecast_with_ensemble),
]:
    print(f"\n🔄 Testing Approach: {approach_name}")
    approach_results = []

    for num_nodes in [8, 16, 25]:
        for samp in [5, 15, 30, 45, 60]:
            try:
                df = pd.read_csv(f"Dataset_perSampling_pernodeConfig/temperature_data_{samp}min_{num_nodes}.csv")
                df['ds'] = pd.to_datetime(df['ds'])
                df = df.set_index('ds').sort_index()
                train = df.loc["2018-11-01":"2018-11-06"]
                test = df.loc["2018-11-08":"2018-11-10"]

                p_steps = 4 * 60 // samp
                tfm = init_timesfm(p_steps)

                y_true_mat = []
                y_pred_mat = []

                # We'll also collect tidy rows per (approach, nodes, sampling) to a file
                per_config_rows = []  # NEW

                for target_col in train.columns:
                    actual = test[target_col].values[:p_steps]

                    if approach_func is None:
                        # Baseline: standard TimesFM without covariates
                        hist = train[target_col].values.astype(float)
                        pred, _ = tfm.forecast([hist], freq=[0])
                        pred = pred[0][:p_steps]
                    else:
                        # Use the specified approach
                        pred = approach_func(train, test, target_col, p_steps, tfm)

                    y_pred_mat.append(pred)
                    y_true_mat.append(actual)

                    # --- NEW: append tidy rows for this target
                    per_config_rows.extend(
                        build_forecast_rows(
                            test_index=test.index[:p_steps],
                            approach=approach_name,
                            nodes=num_nodes,
                            sampling=samp,
                            target_col=target_col,
                            y_true_1d=actual,
                            y_pred_1d=pred
                        )
                    )

                y_true = np.array(y_true_mat).T  # shape [p_steps, n_nodes]
                y_pred = np.array(y_pred_mat).T

                mae = mean_absolute_error(y_true, y_pred)
                rmse = np.sqrt(mean_squared_error(y_true, y_pred))
                # NOTE: scikit-learn returns MAPE as a fraction (0.01 == 1%), not a percentage
                mape = mean_absolute_percentage_error(y_true, y_pred)

                approach_results.append({
                    'approach': approach_name,
                    'nodes': num_nodes,
                    'sampling': samp,
                    'MAE': mae,
                    'RMSE': rmse,
                    'MAPE': mape
                })

                print(f"  nodes={num_nodes}, samp={samp} → MAE={mae:.5f}, RMSE={rmse:.5f}, MAPE={mape:.3f}%")

                # --- NEW: save per-config tidy CSV
                per_config_df = pd.DataFrame(per_config_rows)
                per_config_path = OUT_DIR / f"ts_{approach_name.replace(' ', '_')}_nodes{num_nodes}_samp{samp}.csv"
                per_config_df.to_csv(per_config_path, index=False)

                # --- also keep for global concatenation
                all_rows.extend(per_config_rows)

            except Exception as e:
                print(f"  Error for nodes={num_nodes}, samp={samp}: {e}")
                continue

    results_comparison.extend(approach_results)

# Create comparison dataframe
df_comparison = pd.DataFrame(results_comparison)

# Pivot for easy comparison
pivot_mae = df_comparison.pivot_table(
    values='MAE',
    index=['nodes', 'sampling'],
    columns='approach'
)

pivot_rmse = df_comparison.pivot_table(
    values='RMSE',
    index=['nodes', 'sampling'],
    columns='approach'
)

print("\n" + "="*80)
print("📊 MAE Comparison Across Approaches:")
print("="*80)
print(pivot_mae.round(5))

print("\n" + "="*80)
print("📊 RMSE Comparison Across Approaches:")
print("="*80)
print(pivot_rmse.round(5))

# Calculate improvements
print("\n" + "="*80)
print("📈 Performance Improvements vs Baseline:")
print("="*80)

baseline_col = "Baseline (No Covariates)"
if baseline_col in pivot_mae.columns:
    for col in pivot_mae.columns:
        if col != baseline_col:
            mae_improvement = ((pivot_mae[baseline_col] - pivot_mae[col]) / pivot_mae[baseline_col] * 100).mean()
            rmse_improvement = ((pivot_rmse[baseline_col] - pivot_rmse[col]) / pivot_rmse[baseline_col] * 100).mean()

            print(f"\n{col}:")
            print(f"  Average MAE Improvement:  {mae_improvement:+.2f}%")
            print(f"  Average RMSE Improvement: {rmse_improvement:+.2f}%")

# Best configuration per approach
print("\n" + "="*80)
print("🏆 Best Configuration for Each Approach:")
print("="*80)

for approach in df_comparison['approach'].unique():
    approach_data = df_comparison[df_comparison['approach'] == approach]
    best_config = approach_data.loc[approach_data['MAE'].idxmin()]
    print(f"\n{approach}:")
    print(f"  Best config: {best_config['nodes']} nodes, {best_config['sampling']}min sampling")
    print(f"  MAE: {best_config['MAE']:.5f}, RMSE: {best_config['RMSE']:.5f}")

# --- NEW: save global tidy CSV + metrics CSV and print heads for notebook
all_forecasts_df = pd.DataFrame(all_rows)
all_forecasts_csv = OUT_DIR / "all_forecasts_long.csv"
all_forecasts_df.to_csv(all_forecasts_csv, index=False)

metrics_csv = OUT_DIR / "TimesFM_NoCovariates_Ensemble_Similarity.csv"
df_comparison.to_csv(metrics_csv, index=False)

print("\n✅ Saved:")
print(f" - Tidy per-config CSVs in: {OUT_DIR.resolve()}")
print(f" - Global forecasts: {all_forecasts_csv.resolve()}")
print(f" - Summary metrics:  {metrics_csv.resolve()}")

print("\n🔎 Sample rows (forecasts long):")
print(all_forecasts_df.head(12))

print("\n🔎 Sample rows (summary metrics):")
print(df_comparison.head(12))

Starting multi-node forecasting comparison...

🔄 Testing Approach: Baseline (No Covariates)
  nodes=8, samp=5 → MAE=6.38853, RMSE=8.66285, MAPE=0.994%
  nodes=8, samp=15 → MAE=9.34823, RMSE=11.27606, MAPE=1.192%
  nodes=8, samp=30 → MAE=22.93064, RMSE=24.41010, MAPE=2.966%
  nodes=8, samp=45 → MAE=8.22409, RMSE=10.20686, MAPE=0.975%
  nodes=8, samp=60 → MAE=22.38165, RMSE=25.82326, MAPE=2.973%
  nodes=16, samp=5 → MAE=6.85909, RMSE=9.27070, MAPE=1.008%
  nodes=16, samp=15 → MAE=11.03376, RMSE=13.27248, MAPE=1.483%
  nodes=16, samp=30 → MAE=22.44459, RMSE=23.77225, MAPE=2.845%
  nodes=16, samp=45 → MAE=7.78839, RMSE=9.47345, MAPE=0.921%
  nodes=16, samp=60 → MAE=21.46845, RMSE=25.04462, MAPE=2.822%
  nodes=25, samp=5 → MAE=6.95748, RMSE=9.44145, MAPE=1.025%
  nodes=25, samp=15 → MAE=9.81552, RMSE=12.11797, MAPE=1.272%
  nodes=25, samp=30 → MAE=21.03804, RMSE=22.96143, MAPE=2.652%
  nodes=25, samp=45 → MAE=7.30123, RMSE=8.99792, MAPE=0.860%
  nodes=25, samp=60 → MAE=21.58727, RMSE=25.14624, MAPE=2.787%

🔄 Testing Approach: Ensemble Similarity
  nodes=8, samp=5 → MAE=3.75346, RMSE=5.05354, MAPE=0.584%
  nodes=8, samp=15 → MAE=6.77623, RMSE=8.59362, MAPE=0.843%
  nodes=8, samp=30 → MAE=17.76173, RMSE=20.02225, MAPE=2.240%
  nodes=8, samp=45 → MAE=5.99140, RMSE=7.06580, MAPE=0.711%
  nodes=8, samp=60 → MAE=22.05090, RMSE=25.51288, MAPE=2.935%
  nodes=16, samp=5 → MAE=4.30093, RMSE=5.73457, MAPE=0.624%
  nodes=16, samp=15 → MAE=9.69369, RMSE=11.56990, MAPE=1.258%
  nodes=16, samp=30 → MAE=20.00945, RMSE=21.62207, MAPE=2.492%
  nodes=16, samp=45 → MAE=6.66894, RMSE=7.60345, MAPE=0.785%
  nodes=16, samp=60 → MAE=21.11083, RMSE=24.59183, MAPE=2.777%
  nodes=25, samp=5 → MAE=5.14311, RMSE=6.50636, MAPE=0.748%
  nodes=25, samp=15 → MAE=8.81374, RMSE=10.66176, MAPE=1.126%
  nodes=25, samp=30 → MAE=19.13417, RMSE=20.97663, MAPE=2.369%
  nodes=25, samp=45 → MAE=6.45546, RMSE=7.51748, MAPE=0.759%
  nodes=25, samp=60 → MAE=21.35399, RMSE=24.80217, MAPE=2.763%

📊 MAE Comparison Across Approaches:
approach        Baseline (No Covariates)  Ensemble Similarity
nodes sampling                                               
8     5                          6.38853              3.75346
      15                         9.34823              6.77623
      30                        22.93064             17.76173
      45                         8.22409              5.99140
      60                        22.38165             22.05090
16    5                          6.85909              4.30093
      15                        11.03376              9.69369
      30                        22.44459             20.00945
      45                         7.78839              6.66894
      60                        21.46845             21.11083
25    5                          6.95748              5.14311
      15                         9.81552              8.81374
      30                        