# 📘 LSTM for Biogas Prediction (Production Ready)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/benmola/OpenAD-lib/blob/main/notebooks/03_LSTM_Prediction.ipynb)

This notebook demonstrates **LSTM-based biogas prediction** using temporal feature engineering.

**⚠️ This notebook matches `examples/04_lstm_prediction.py` exactly**

---

## 📚 References
- **LSTM for AD**: [Murali et al. (2025) - LAPSE](https://psecommunity.org/LAPSE:2025.0213)

## 🔬 LSTM Background

### Why LSTM for Time-Series?

Biogas production depends on **past substrate loading**, making it a time-series problem:
- **Input at t-1** affects output at **t**
- LSTM's internal memory captures these temporal dependencies

### LSTM Cell Equations

**Forget Gate** (what to forget from memory):
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$

**Input Gate** (what new info to store):
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$
$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$

**Cell State Update**:
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$

**Output Gate**:
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$
$$h_t = o_t \odot \tanh(C_t)$$
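
The gate equations above can be traced step by step in plain NumPy. This is a minimal sketch for intuition only, not the implementation inside `LSTMModel`; the `params` dictionary, the helper names, and the toy dimensions are assumptions made for illustration.

```python
import numpy as np


def sigmoid(z):
    """Logistic function used by the three gates."""
    return 1.0 / (1.0 + np.exp(-z))


def lstm_cell_step(x_t, h_prev, c_prev, params):
    """One forward step of an LSTM cell, following the gate equations.

    params is a hypothetical dict with weights W_f, W_i, W_C, W_o of shape
    (hidden, hidden + inputs) and biases b_f, b_i, b_C, b_o of shape (hidden,).
    """
    concat = np.concatenate([h_prev, x_t])                     # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ concat + params["b_f"])      # forget gate
    i_t = sigmoid(params["W_i"] @ concat + params["b_i"])      # input gate
    c_tilde = np.tanh(params["W_C"] @ concat + params["b_C"])  # candidate memory
    c_t = f_t * c_prev + i_t * c_tilde                         # cell state update
    o_t = sigmoid(params["W_o"] @ concat + params["b_o"])      # output gate
    h_t = o_t * np.tanh(c_t)                                   # hidden state
    return h_t, c_t


# Toy dimensions (assumed): 6 inputs, one per feedstock, and 4 hidden units
rng = np.random.default_rng(0)
n_in, n_hidden = 6, 4
params = {}
for gate in ("f", "i", "C", "o"):
    params[f"W_{gate}"] = rng.normal(scale=0.1, size=(n_hidden, n_hidden + n_in))
    params[f"b_{gate}"] = np.zeros(n_hidden)

# Carry the hidden and cell state across three time steps
h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x_t in rng.normal(size=(3, n_in)):
    h, c = lstm_cell_step(x_t, h, c, params)

print(h.shape, c.shape)
```

Because the cell state `c` is carried from one step to the next, information from earlier feeding days can still influence the output several steps later.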

### Key Preprocessing: Time-Lagged Features

We use `series_to_supervised()` to create features like:
- `Maize(t-1)` → predicts `Biogas(t)`
- `Wholecrop(t-1)` → predicts `Biogas(t)`

This captures the **lag** between feeding and biogas production.

## 1️⃣ Setup (Google Colab)

In [None]:
# Install OpenAD-lib with ML dependencies (PyTorch, etc.)
# !pip install git+https://github.com/benmola/OpenAD-lib.git

import sys
import os

IN_COLAB = 'google.colab' in sys.modules

if not IN_COLAB:
    sys.path.append(os.path.join(os.getcwd(), '..', 'src'))

print(f"Running in Colab: {IN_COLAB}")

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

from openad_lib.models.ml import LSTMModel

print("✅ All imports successful!")

## 2️⃣ Time-Series Preprocessing Function

**`series_to_supervised()`** transforms time-series into supervised learning format:

**Example:**
```
Original:          Transformed:
t   Maize  Biogas  →  Maize(t-1)  Biogas(t)
0   10     100         NaN         100
1   12     120         10          120
2   15     150         12          150
```

This creates the critical **lag relationship** for prediction.
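
The toy table above can be reproduced with `DataFrame.shift`, which is the same pandas call the function in the next cell relies on. This is a small sketch using the three made-up rows from the example, not the plant data.

```python
import pandas as pd

# Toy data copied from the example table
df = pd.DataFrame({"Maize": [10, 12, 15], "Biogas": [100, 120, 150]})

# shift(1) moves each value down one row, so row t holds the value from t-1
lagged = pd.DataFrame({
    "Maize(t-1)": df["Maize"].shift(1),
    "Biogas(t)": df["Biogas"],
})
print(lagged)

# The first row has no predecessor (NaN), so it is dropped before training
supervised = lagged.dropna()
print(supervised)
```

Dropping the NaN row is why a one-step lag always costs one sample.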

In [None]:
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    """
    Convert time series to supervised learning format.
    
    Args:
        data: ndarray or DataFrame - Input time series
        n_in: int - Number of lag observations (default=1)
        n_out: int - Number of future observations (default=1)  
        dropnan: bool - Remove rows with NaN values
    
    Returns:
        DataFrame with lagged features
    """
    # Build the DataFrame first so lists, 1-D arrays, and DataFrames all report a column count
    df = pd.DataFrame(data)
    n_vars = df.shape[1]
    cols, names = [], []
    
    # Input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j+1, i)) for j in range(n_vars)]
    
    # Forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j+1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j+1, i)) for j in range(n_vars)]
    
    # Concatenate and remove NaN
    agg = pd.concat(cols, axis=1)
    agg.columns = names
    if dropnan:
        agg.dropna(inplace=True)
    return agg

print("✅ series_to_supervised() defined")

## 3️⃣ Load Time-Series Data

**Dataset:** `sample_LSTM_timeseries.csv`
- **424 daily samples** from a real biogas plant
- **Features:** Feedstock composition (Maize, Chicken Litter, etc.) in tonnes/day
- **Target:** Total biogas production (m³/day)

In [None]:
# Download data for Colab
if IN_COLAB:
    !wget -q https://raw.githubusercontent.com/benmola/OpenAD-lib/main/src/openad_lib/data/sample_LSTM_timeseries.csv
    data_path = 'sample_LSTM_timeseries.csv'
else:
    base_path = os.path.dirname(os.getcwd())
    data_path = os.path.join(base_path, 'src', 'openad_lib', 'data', 'sample_LSTM_timeseries.csv')

# Load and inspect
data = pd.read_csv(data_path).dropna()
print(f"📊 Loaded {len(data)} samples")
print(f"\nColumns: {list(data.columns)}")
data.head()

In [None]:
# Define features and target (MUST match example script)
features = ['Maize', 'Wholecrop', 'Chicken Litter', 'Lactose', 'Apple Pomace', 'Rice bran']
target = 'Total_Biogas'

print(f"Features (6): {features}")
print(f"Target: {target}")
print(f"\nFeature Matrix shape: {data[features].shape}")
print(f"Target shape: {data[[target]].shape}")

## 4️⃣ Data Preprocessing Pipeline

**Critical Steps (MUST be done in this order):**

1. **Normalize Features** → StandardScaler on X  
   *Why?* Features have different scales (Maize: 0-50 tonnes, Lactose: 0-5 tonnes)

2. **Create Lag Features** → `series_to_supervised()`  
   *Why?* Biogas at day t depends on feeding at day t-1

3. **Normalize Target** → StandardScaler on y  
   *Why?* Helps LSTM training convergence

4. **80/20 Split** → Chronological (not random!)  
   *Why?* Time-series must preserve temporal order
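
The chronological split in step 4 can be illustrated on toy data. This is a sketch under assumptions: `chronological_split` is a hypothetical helper written for this illustration, not part of OpenAD-lib, and the ten "days" are made up.

```python
import numpy as np


def chronological_split(X, y, train_frac=0.8):
    """Split time-ordered arrays without shuffling (hypothetical helper)."""
    split_idx = int(len(X) * train_frac)
    return X[:split_idx], X[split_idx:], y[:split_idx], y[split_idx:]


# Ten "days" of toy data; the day index doubles as the feature value
days = np.arange(10)
X = days.reshape(-1, 1).astype(float)
y = 2.0 * days

X_train, X_test, y_train, y_test = chronological_split(X, y)

# Every training day precedes every test day, so no future data reaches training
print("train days:", X_train.ravel())
print("test days: ", X_test.ravel())
```

A shuffled split would scatter later days into the training set, letting the model see the period it is later evaluated on.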

In [None]:
# Step 1: Normalize input features
print("Step 1: Normalizing features...")
values = data[features].values.astype('float32')
scaler_X = StandardScaler()
scaled_X = scaler_X.fit_transform(values)

print(f"  Original range: [{values.min():.2f}, {values.max():.2f}]")
print(f"  Scaled range: [{scaled_X.min():.2f}, {scaled_X.max():.2f}]")

In [None]:
# Step 2: Create time-lagged features
print("\nStep 2: Creating lag features...")
reframed = series_to_supervised(scaled_X, n_in=1, n_out=1)

print(f"  Before: {scaled_X.shape} (424 rows × 6 features)")
print(f"  After: {reframed.shape} (423 rows × 12 features)")
print(f"\n  New columns: {list(reframed.columns[:6])} ... (t-1 features)")
print(f"  Plus: {list(reframed.columns[6:])} ... (t features)")

In [None]:
# Step 3: Normalize target variable
print("\nStep 3: Normalizing target...")
y = data[[target]].values
scaler_y = StandardScaler()
y_scaled = scaler_y.fit_transform(y)

print(f"  Original biogas range: [{y.min():.2f}, {y.max():.2f}] m³/day")
print(f"  Scaled range: [{y_scaled.min():.2f}, {y_scaled.max():.2f}]")

In [None]:
# Step 4: 80/20 chronological split
print("\nStep 4: Train/test split (80/20)...")

# reframed holds only the six scaled feedstock columns (var1-var6 at t-1 and t),
# so the biogas target is taken from y_scaled. reframed.index records which
# rows survived dropna(); the first row is lost to the one-step lag.
X_all = reframed.values
y_all = y_scaled[reframed.index.to_numpy()].ravel().astype(X_all.dtype)

split_idx = int(len(X_all) * 0.8)
train_X, test_X = X_all[:split_idx], X_all[split_idx:]
train_y, test_y = y_all[:split_idx], y_all[split_idx:]

print(f"  Training samples: {len(train_X)}")
print(f"  Testing samples: {len(test_X)}")
print(f"  Train X shape: {train_X.shape} (feedstock columns at t-1 and t)")
print(f"  Train y shape: {train_y.shape}")

## 5️⃣ Build and Train LSTM Model

**Architecture:**
- **Input:** 12 features (the 6 feedstock columns at t-1 and at t)
- **Hidden:** 24 LSTM units (chosen via hyperparameter tuning)
- **Output:** 1 value (biogas prediction)

**Training:**
- 50 epochs
- Adam optimizer (lr=0.001)
- MSE loss

In [None]:
# Initialize LSTM (input_dim MUST match train_X.shape[1])
print("🚀 Training LSTM model...\n")
lstm = LSTMModel(
    input_dim=train_X.shape[1],  # 12 features
    hidden_dim=24,
    output_dim=1,
    num_layers=1,
    dropout=0.1,
    learning_rate=0.001
)

# Train (data already scaled)
lstm.fit(train_X, train_y, epochs=50, batch_size=4, verbose=True)

## 6️⃣ Evaluate Model Performance

**Metrics:**
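- **RMSE** (Root Mean Squared Error): Square root of the mean squared error, in m³/day; large errors are weighted more heavily
- **MAE** (Mean Absolute Error): Mean magnitude of the errors, in m³/day
- **R²** (Coefficient of Determination): Fraction of the target's variance explained by the model (1.0 = perfect; values below 0 mean worse than always predicting the mean)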
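
The three metrics can be computed by hand to make their definitions concrete. This is a sketch on made-up numbers; the notebook itself uses the scikit-learn functions imported earlier.

```python
import numpy as np

# Toy measurements and predictions in m³/day (made-up numbers)
y_true = np.array([100.0, 120.0, 150.0, 130.0])
y_pred = np.array([110.0, 115.0, 140.0, 135.0])

errors = y_true - y_pred

rmse = np.sqrt(np.mean(errors ** 2))             # squaring weights large errors more
mae = np.mean(np.abs(errors))                    # mean error magnitude
ss_res = np.sum(errors ** 2)                     # variation the model fails to explain
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
r2 = 1.0 - ss_res / ss_tot

print(f"RMSE = {rmse:.2f}, MAE = {mae:.2f}, R² = {r2:.3f}")
```

RMSE is never smaller than MAE; a large gap between the two points to a few large misses rather than uniformly small errors.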

In [None]:
# Make predictions (still in scaled space)
trainPredict = lstm.predict(train_X)
testPredict = lstm.predict(test_X)

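# CRITICAL: Inverse transform to original scale for interpretation
# reshape(-1, 1) gives the scaler the 2D column it expects, whether predict() returns shape (n,) or (n, 1)
trainPredict_inv = scaler_y.inverse_transform(np.asarray(trainPredict).reshape(-1, 1))
testPredict_inv = scaler_y.inverse_transform(np.asarray(testPredict).reshape(-1, 1))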
train_y_inv = scaler_y.inverse_transform(train_y.reshape(-1, 1))
test_y_inv = scaler_y.inverse_transform(test_y.reshape(-1, 1))

print("✅ Predictions generated and inverse-transformed")
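
The inverse transform above undoes the standardization applied to the target. The sketch below reproduces that arithmetic in plain NumPy on made-up values, assuming the default `StandardScaler` behaviour of subtracting the mean and dividing by the population standard deviation.

```python
import numpy as np

# Toy biogas values in m³/day (made-up numbers)
y = np.array([[100.0], [120.0], [150.0], [130.0]])

# Standardize: subtract the mean, divide by the population std (ddof=0)
mean, std = y.mean(axis=0), y.std(axis=0)
y_scaled = (y - mean) / std

# A model trained on y_scaled predicts in scaled units ...
pred_scaled = np.array([[0.0], [1.0]])

# ... so predictions are mapped back before reporting errors in m³/day
pred_original = pred_scaled * std + mean
print(pred_original.ravel())
```

A scaled prediction of 0 maps back to the mean of the data the scaler was fitted on, which is why metrics computed in scaled space cannot be read in m³/day.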

In [None]:
# Calculate metrics in original scale
train_rmse = np.sqrt(mean_squared_error(train_y_inv, trainPredict_inv))
test_rmse = np.sqrt(mean_squared_error(test_y_inv, testPredict_inv))
train_mae = mean_absolute_error(train_y_inv, trainPredict_inv)
test_mae = mean_absolute_error(test_y_inv, testPredict_inv)
train_r2 = r2_score(train_y_inv, trainPredict_inv)
test_r2 = r2_score(test_y_inv, testPredict_inv)

print("📊 LSTM Performance Metrics:")
print("=" * 50)
print(f"Train RMSE: {train_rmse:.2f} m³/day, Test RMSE: {test_rmse:.2f} m³/day")
print(f"Train MAE:  {train_mae:.2f} m³/day, Test MAE:  {test_mae:.2f} m³/day")
print(f"Train R²:   {train_r2:.3f},        Test R²:   {test_r2:.3f}")
print("\n✅ These metrics should match examples/04_lstm_prediction.py")

## 7️⃣ Visualize Results

**Side-by-side comparison:**
- **Left:** Training set predictions (usually the tighter fit, since the model has already seen these samples)
- **Right:** Testing set predictions (shows generalization ability)

In [None]:
plt.style.use('bmh')
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

# Training set
ax1.plot(train_y_inv, label='Actual', color='#2E86C1', alpha=0.7, linewidth=2)
ax1.plot(trainPredict_inv, label='LSTM Prediction', color='#E67E22', linestyle='--', linewidth=2)
ax1.set_title(f"Training Set (R² = {train_r2:.3f})", fontsize=14, fontweight='bold')
ax1.set_xlabel("Sample Index", fontsize=12)
ax1.set_ylabel("Biogas Production (m³/day)", fontsize=12)
ax1.legend(fontsize=11)
ax1.grid(True, alpha=0.3)

# Testing set  
ax2.plot(test_y_inv, label='Actual', color='#2E86C1', alpha=0.7, linewidth=2)
ax2.plot(testPredict_inv, label='LSTM Prediction', color='#E67E22', linestyle='--', linewidth=2)
ax2.set_title(f"Testing Set (R² = {test_r2:.3f})", fontsize=14, fontweight='bold')
ax2.set_xlabel("Sample Index", fontsize=12)
ax2.set_ylabel("Biogas Production (m³/day)", fontsize=12)
ax2.legend(fontsize=11)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 📝 Summary

This notebook demonstrated:

1. **Temporal Feature Engineering** - `series_to_supervised()` for lag features
2. **Proper Scaling** - StandardScaler on both X and y
3. **LSTM Training** - 50 epochs with 24 hidden units
4. **Evaluation** - RMSE, MAE, R² metrics in original scale

### 🎯 Key Takeaway

**Time-lagged features are critical** for biogas prediction because:
- Substrate fed at day t-1 → digested → biogas at day t
- Without lag features, model can't learn this temporal dependency
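
The effect of the lag can be seen on synthetic data. This is an illustrative sketch only: the "plant" below is an assumption in which biogas on day t is driven entirely by the feed on day t-1, which is far simpler than a real digester.

```python
import numpy as np

rng = np.random.default_rng(42)
n_days = 200

# Synthetic plant (assumed): today's biogas follows yesterday's feed plus noise
feed = rng.uniform(0.0, 50.0, size=n_days)
biogas = np.empty(n_days)
biogas[0] = 0.0
biogas[1:] = 3.0 * feed[:-1] + rng.normal(scale=1.0, size=n_days - 1)

# Correlation of biogas(t) with same-day feed(t) versus lagged feed(t-1)
corr_same_day = np.corrcoef(feed[1:], biogas[1:])[0, 1]
corr_lagged = np.corrcoef(feed[:-1], biogas[1:])[0, 1]

print(f"corr(feed(t),   biogas(t)) = {corr_same_day:.3f}")
print(f"corr(feed(t-1), biogas(t)) = {corr_lagged:.3f}")
```

In this toy setting the same-day feed carries almost no signal, while the lagged feed explains nearly all of the output.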

### Next Steps

- Compare with [Multi-Task GP](04_MTGP_Prediction.ipynb) for uncertainty quantification
- Try [ADM1 mechanistic model](01_ADM1_Tutorial.ipynb) for process understanding
- Explore [MPC Control](05_MPC_Control.ipynb) for optimization