# ISP Project: Waste Collection Prediction — LSTM & BiLSTM Baselines

This notebook implements the baseline LSTM and BiLSTM models using the individual-bin methodology. The pipeline is: per-bin normalization, 30-day input sequences with next-day prediction, a chronological 80/20 train/test split, and visualizations of the results.

## Problems the Original Paper Leaves Ambiguous

- **Dataset preprocessing:** how gaps are handled and how readings are resampled to a daily cadence.
- **Individual vs collective prediction:** how per-bin scaling and splitting are managed.
- **Temporal splitting:** which strategy is used to avoid leakage and preserve chronology.

### My Solution (LSTM/BiLSTM version)
1. Deep dataset audit: read the cleaned dataset and confirm its coverage.
2. Daily continuity: fill gaps so each bin has a continuous daily series, then build 30-day windows.
3. Per-bin normalization: MinMax scaling per bin.
4. Temporal 80/20 split: chronological split per bin for train/test.
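
The daily-continuity step can be sketched as follows. This is a minimal illustration, not the notebook's actual preprocessing: it assumes the `timestamp`, `latestFullness`, and `serialNumber` columns used later in this notebook, and it assumes that several readings on one day are averaged and that missing days are filled by time interpolation. The helper name `to_daily_series` and the toy bin `BIN-A` are hypothetical.

```python
import pandas as pd

def to_daily_series(bin_df: pd.DataFrame) -> pd.Series:
    """Resample one bin's readings to a gap-free daily series."""
    daily = (
        bin_df.set_index("timestamp")["latestFullness"]
        .sort_index()
        .resample("D")
        .mean()  # assumption: several readings in one day collapse to their mean
    )
    # Days without readings are NaN after resampling. Assumption: fill them by
    # time interpolation, then pad any gaps left at the edges.
    return daily.interpolate(method="time").ffill().bfill()

# Toy data for one hypothetical bin with no readings on Jan 3 and Jan 4.
toy = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-01 08:00", "2024-01-02 09:00",
        "2024-01-02 18:00", "2024-01-05 10:00",
    ]),
    "latestFullness": [2.0, 3.0, 5.0, 7.0],
    "serialNumber": ["BIN-A"] * 4,
})

# Apply the helper to each bin separately so no bin borrows values from another.
daily_by_bin = {sn: to_daily_series(g) for sn, g in toy.groupby("serialNumber")}
daily = daily_by_bin["BIN-A"]
print(daily)
```

Other fill strategies (forward fill only, or dropping bins with long gaps) would fit the same structure; the choice affects the metrics and should be reported alongside them.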

In [None]:
# Imports & paths
import os, json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

DATA_CSV = "../../data/wyndham_waste_data_cleaned.csv"
RESULTS_JSON = "../../outputs/lstm_bilstm_comparison.json"
LSTM_PNG = "../../outputs/lstm_training_results.png"
BILSTM_PNG = "../../outputs/bilstm_training_results.png"

# Pretty printing helpers
def hr(title):
    print("\n" + title)
    print("=" * len(title))


## Implementation Details (recap)

- Input → per-bin daily timeline → per-bin MinMax → 30-day sliding windows → chronological 80/20 split.
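
For a single bin, the windowing and split can be sketched with NumPy as below. The sketch follows the order stated above (scale, window, split), which means the MinMax bounds come from the bin's full series; that is an assumption read off the stated order, and fitting the bounds on the first 80% only would be the stricter leakage-free variant. The function name `make_windows` is hypothetical.

```python
import numpy as np

SEQ_LEN = 30      # 30-day input window
TRAIN_FRAC = 0.8  # chronological 80/20 split

def make_windows(series, seq_len=SEQ_LEN, train_frac=TRAIN_FRAC):
    """Scale one bin's daily series to [0, 1], window it, and split by time."""
    values = np.asarray(series, dtype=float)
    if len(values) <= seq_len:
        raise ValueError("series must be longer than seq_len")

    lo, hi = values.min(), values.max()
    span = hi - lo if hi > lo else 1.0  # guard against a constant series
    scaled = (values - lo) / span

    n = len(scaled) - seq_len
    # Window i covers days i .. i+seq_len-1 and predicts day i+seq_len.
    X = np.stack([scaled[i:i + seq_len] for i in range(n)])[..., None]
    y = scaled[seq_len:]

    cut = int(n * train_frac)  # earliest windows train, latest windows test
    return (X[:cut], y[:cut]), (X[cut:], y[cut:]), (lo, hi)

# Toy series of 130 days: 100 windows, 80 for training and 20 for testing.
(X_tr, y_tr), (X_te, y_te), (lo, hi) = make_windows(np.arange(130))
print(X_tr.shape, X_te.shape)
```

The returned `(lo, hi)` pair is what later allows predictions to be mapped back to the original scale.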

### Baseline Architectures (paper-faithful)
- LSTM: Input (30, 1) → LSTM(100) → Dropout(0.2) → Dense(1)
- BiLSTM: Input (30, 1) → Bidirectional(LSTM(100)) → Dropout(0.2) → Dense(1)
- Training: Adam(lr=5e-4), MSE, epochs=20, batch=70
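
A minimal Keras sketch of the two architectures listed above is shown here. It assumes TensorFlow/Keras as the framework, which this notebook does not state; `CONFIG`, `build_model`, and `expected_params` are hypothetical names. The parameter-count helper uses the standard LSTM formula (four gates, each with input weights, recurrent weights, and a bias) and can serve as a sanity check against `model.count_params()`.

```python
CONFIG = {
    "seq_len": 30, "units": 100, "dropout": 0.2,
    "lr": 5e-4, "epochs": 20, "batch_size": 70,
}

def build_model(bidirectional=False, cfg=CONFIG):
    """Build and compile the LSTM or BiLSTM baseline (assumes TensorFlow)."""
    # Imported lazily so the rest of the notebook runs without TensorFlow.
    import tensorflow as tf

    recurrent = tf.keras.layers.LSTM(cfg["units"])
    if bidirectional:
        recurrent = tf.keras.layers.Bidirectional(recurrent)

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(cfg["seq_len"], 1)),
        recurrent,
        tf.keras.layers.Dropout(cfg["dropout"]),
        tf.keras.layers.Dense(1),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=cfg["lr"]),
        loss="mse",
    )
    return model

def expected_params(bidirectional=False, units=100, input_dim=1):
    """Trainable parameters implied by the listed layer sizes."""
    lstm = 4 * ((input_dim + units) * units + units)
    if bidirectional:
        # Two LSTMs; the Dense layer then sees 2 * units features.
        return 2 * lstm + (2 * units + 1)
    return lstm + (units + 1)

print(expected_params(False), expected_params(True))
```

Training would then be a call such as `build_model().fit(X_train, y_train, epochs=CONFIG["epochs"], batch_size=CONFIG["batch_size"])`, where `X_train` and `y_train` are the windowed arrays.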

In [None]:
# Load model results (generated by the training script)
if os.path.exists(RESULTS_JSON):
    with open(RESULTS_JSON, "r") as f:
        results = json.load(f)
else:
    # Fallback: populate with console metrics if JSON isn't present
    results = {
        "LSTM": {
            "train_metrics": {"MAE": 1.836, "MAPE": np.nan, "RMSE": 2.375, "R2": 0.480},
            "test_metrics":  {"MAE": 1.987, "MAPE": np.nan, "RMSE": 2.575, "R2": 0.418},
        },
        "BiLSTM": {
            "train_metrics": {"MAE": 1.816, "MAPE": np.nan, "RMSE": 2.361, "R2": 0.486},
            "test_metrics":  {"MAE": 1.965, "MAPE": np.nan, "RMSE": 2.563, "R2": 0.423},
        },
    }

hr("Loaded Results (from JSON or fallback)")
print(json.dumps(results, indent=2))

In [None]:
# Optional: read dataset stats (if CSV available)
dataset_info = {}
if os.path.exists(DATA_CSV):
    df = pd.read_csv(DATA_CSV)
    # Expect columns: timestamp, latestFullness, serialNumber, ...
    try:
        df["timestamp"] = pd.to_datetime(df["timestamp"], errors="coerce")
        df = df.dropna(subset=["timestamp", "latestFullness", "serialNumber"])
        bins = df["serialNumber"].nunique()
        date_min, date_max = df["timestamp"].min(), df["timestamp"].max()
        dataset_info = {
            "Total Records": f"{len(df):,}",
            "Unique Bins": f"{bins}",
            "Date Range": f"{date_min.date()} to {date_max.date()}",
        }
    except Exception as e:
        dataset_info = {"Note": f"Could not parse dataset due to: {e}"}
else:
    dataset_info = {"Note": "CSV not found. Skipping dataset stats."}

hr("Dataset Snapshot")
print(dataset_info)

## Training Configuration (Applied)
- Epochs: 20
- Batch size: 70
- Sequence length: 30
- Optimizer: Adam (lr = 5e-4)
- Loss: MSE
- Normalization: Per-bin MinMax (0–1); predictions are inverse-transformed so that metrics are reported on the original 0–10 scale
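
The inverse transform and the four reported metrics can be sketched as below. This is an illustration under assumptions, not the training script's code: `inverse_minmax` and `regression_metrics` are hypothetical names, MAPE is returned as a fraction, and targets equal to zero are excluded from MAPE because the ratio is undefined there. The notebook does not say how its NaN MAPE values arose, so the masking is only one possible way to handle it.

```python
import numpy as np

def inverse_minmax(scaled, lo, hi):
    """Map values from [0, 1] back to the bin's original scale."""
    return np.asarray(scaled, dtype=float) * (hi - lo) + lo

def regression_metrics(y_true, y_pred):
    """MAE, MAPE (as a fraction), RMSE, and R² on the original scale."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true

    ss_res = float(np.sum(err ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))

    # Assumption: skip zero targets, where the percentage error is undefined.
    nonzero = y_true != 0
    if nonzero.any():
        mape = float(np.mean(np.abs(err[nonzero] / y_true[nonzero])))
    else:
        mape = float("nan")

    return {
        "MAE": float(np.mean(np.abs(err))),
        "MAPE": mape,
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "R2": 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan"),
    }

# Toy example for one bin whose raw values span 0 to 10.
y_true = inverse_minmax([0.0, 0.5, 1.0], lo=0.0, hi=10.0)
y_pred = inverse_minmax([0.1, 0.4, 1.0], lo=0.0, hi=10.0)
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```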

In [None]:
# Build a tidy comparison table (Our LSTM vs Our BiLSTM vs Paper)
paper_ref = {
    "LSTM":   {"RMSE": 1.579, "MAE": 0.602, "MAPE(%)": 1.86, "R2": 0.93},
    "BiLSTM": {"RMSE": 1.543, "MAE": 0.638, "MAPE(%)": 7.95, "R2": 0.90},
}

def safe_pct(x):
    return None if (x is None or (isinstance(x, float) and np.isnan(x))) else 100.0 * x

rows = []
for model_name in ["LSTM", "BiLSTM"]:
    ours_tr = results[model_name]["train_metrics"]
    ours_te = results[model_name]["test_metrics"]
    rows.append({
        "Model": model_name,
        "Split": "Train",
        "RMSE": ours_tr["RMSE"],
        "MAE":  ours_tr["MAE"],
        "MAPE(%)": safe_pct(ours_tr.get("MAPE")),
        "R²":   ours_tr["R2"],
    })
    rows.append({
        "Model": model_name,
        "Split": "Test",
        "RMSE": ours_te["RMSE"],
        "MAE":  ours_te["MAE"],
        "MAPE(%)": safe_pct(ours_te.get("MAPE")),
        "R²":   ours_te["R2"],
    })
    rows.append({
        "Model": model_name,
        "Split": "Paper (Ref.)",
        "RMSE": paper_ref[model_name]["RMSE"],
        "MAE":  paper_ref[model_name]["MAE"],
        "MAPE(%)": paper_ref[model_name]["MAPE(%)"],
        "R²":   paper_ref[model_name]["R2"],
    })

comp_df = pd.DataFrame(rows)

hr("Performance Comparison Table")
# Pretty display if in notebook environment
try:
    display(comp_df.style.format({"RMSE": "{:.3f}", "MAE": "{:.3f}",
                                  # Missing MAPE may arrive as None or NaN; pd.isna covers both
                                  "MAPE(%)": lambda v: "-" if pd.isna(v) else f"{v:.2f}",
                                  "R²": "{:.3f}"}))
except Exception:
    print(comp_df.to_string(index=False))

In [None]:
# Bar charts: R² and RMSE on Test split (Our vs Paper)
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
test_slice = comp_df[comp_df["Split"].isin(["Test", "Paper (Ref.)"])]

# RMSE
for i, model_name in enumerate(["LSTM", "BiLSTM"]):
    subset = test_slice[test_slice["Model"] == model_name]
    axes[0].bar([i*2, i*2+1], subset["RMSE"], width=0.8, label=model_name if i==0 else "")
axes[0].set_xticks([0,1,2,3])
axes[0].set_xticklabels(["LSTM (Ours)", "LSTM (Paper)", "BiLSTM (Ours)", "BiLSTM (Paper)"], rotation=15)
axes[0].set_ylabel("RMSE (0–10 scale)")
axes[0].set_title("RMSE — Test")
axes[0].grid(axis="y", alpha=0.3)

# R²
for i, model_name in enumerate(["LSTM", "BiLSTM"]):
    subset = test_slice[test_slice["Model"] == model_name]
    axes[1].bar([i*2, i*2+1], subset["R²"], width=0.8, label=model_name if i==0 else "")
axes[1].set_xticks([0,1,2,3])
axes[1].set_xticklabels(["LSTM (Ours)", "LSTM (Paper)", "BiLSTM (Ours)", "BiLSTM (Paper)"], rotation=15)
axes[1].set_ylabel("R²")
axes[1].set_title("R² — Test")
axes[1].grid(axis="y", alpha=0.3)

plt.tight_layout()
plt.show()

## Training Curves (from Saved Runs)
The cell below displays the plots saved during training (`lstm_training_results.png` and `bilstm_training_results.png`) if the files exist; otherwise it prints a notice.

In [None]:
def show_image_if_exists(path, title):
    if os.path.exists(path):
        img = plt.imread(path)
        plt.figure(figsize=(10,6))
        plt.imshow(img)
        plt.axis("off")
        plt.title(title)
        plt.show()
    else:
        print(f"[info] Plot not found at: {path}")

show_image_if_exists(LSTM_PNG,  "LSTM — Training Results")
show_image_if_exists(BILSTM_PNG, "BiLSTM — Training Results")

## Key Findings (LSTM & BiLSTM)

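1. Consistent convergence: both models show steadily decreasing loss.
2. BiLSTM marginally outperforms LSTM on the test split: RMSE 2.563 vs. 2.575 and R² 0.423 vs. 0.418.
3. A test R² of about 0.42 suggests moderate predictability: the models explain less than half of the variance in the target.
4. Absolute errors are larger than the paper's (test RMSE ≈ 2.57 vs. 1.54–1.58; test MAE ≈ 1.97–1.99 vs. 0.60–0.64), likely due to preprocessing differences.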

## Next Steps
- Add exogenous features (holidays, weather).
- Evaluate multivariate models with more features.
- Calibrate collection thresholds via cost-sensitive evaluation.
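
As a starting point for the threshold-calibration step, the sketch below scores candidate collection thresholds by a simple asymmetric cost. Every number in it is a placeholder: the fullness level at which a bin counts as needing collection (`full_level`), the two cost weights, and the toy data are assumptions for illustration, and the function names are hypothetical.

```python
import numpy as np

def total_cost(y_true, y_pred, threshold,
               full_level=8.0, cost_missed=5.0, cost_unneeded=1.0):
    """Cost of collecting whenever the predicted fullness reaches the threshold."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    collect = y_pred >= threshold   # decision taken from the forecast
    needed = y_true >= full_level   # what the actual fullness required

    missed = int(np.sum(needed & ~collect))    # full bin left uncollected
    unneeded = int(np.sum(collect & ~needed))  # truck sent to a bin that was not full
    return float(cost_missed * missed + cost_unneeded * unneeded)

def best_threshold(y_true, y_pred, candidates):
    """Return the lowest-cost candidate threshold and all candidate costs."""
    costs = [total_cost(y_true, y_pred, t) for t in candidates]
    return candidates[int(np.argmin(costs))], costs

# Toy actual vs. predicted fullness on the 0-10 scale (made-up values).
actual = [9.0, 8.5, 4.0, 7.0, 2.0]
predicted = [7.5, 8.2, 5.0, 7.8, 3.0]
best, costs = best_threshold(actual, predicted, candidates=[6.0, 7.0, 8.0])
print(best, costs)
```

With real cost estimates, the same loop could be run on the test-split predictions of each model to compare them by operational cost rather than by RMSE alone.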