<a href="https://colab.research.google.com/github/Yusbad09/Autonomous-Vehicle-GPS-Spoofing-detection/blob/main/merged_spoofing_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Unified GPS & IMU Spoofing Detection Pipeline

**Author:** Badrudeen Yusuf Akinkunmi (adapted)
**Date:** 2025-08-11

This notebook merges two previously separate evaluation pipelines into a single reproducible research-style experiment:

- **A:** 80:20 train/test split evaluation (RandomForest, XGBoost) + optional LSTM Autoencoder (thresholded labels)
- **B:** Stratified K-Fold cross-validation (k=5) evaluation for RandomForest & XGBoost (averaged metrics). Only the **last fold's** visualizations are included in the final PDF (per instruction).
- **C:** LSTM Autoencoder for unsupervised detection (trained on clean data, threshold = 95th percentile on clean errors) — thresholded labels used to compute comparable metrics.
- **D:** A single consolidated PDF report with both Split and CV comparisons, AE results, and plots.

All sections are heavily documented. Run cells top-to-bottom. Plots are shown inline and also saved to `outputs_combined/`.

## 0. Install dependencies (run if needed)

Uncomment the `!pip` lines if your environment doesn't already have these packages installed (Colab, fresh VM, etc.).

In [7]:
# Uncomment to install packages in a fresh environment
!pip install reportlab xgboost torch --quiet

print('Skip install step if packages already present')

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.0/2.0 MB[0m [31m17.3 MB/s[0m eta [36m0:00:00[0m
[?25hSkip install step if packages already present


## 1. Imports
Standard scientific stack, ML models, PyTorch for AE, and ReportLab for PDF generation.

In [8]:
import os
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime

import xgboost as xgb
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (confusion_matrix, roc_auc_score, roc_curve,
                             precision_recall_fscore_support, accuracy_score)

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, Image
from reportlab.lib.styles import getSampleStyleSheet

plt.rcParams['figure.figsize'] = (8,4)
print('Imports OK')

Imports OK


## 2. Configuration & Utility helpers
Set seeds for reproducibility and directories for outputs. `ensure_dir` makes sure directories exist.

In [9]:
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)

OUT_ROOT = 'outputs_combined'
FIGS_DIR = os.path.join(OUT_ROOT, 'figures')
AE_DIR = os.path.join(OUT_ROOT, 'autoencoder')
N_SPLITS = 5
DO_AE = True
AE_SEQ_LEN = 20
AE_EPOCHS = 15
AE_BATCH = 32

def ensure_dir(path):
    if not os.path.exists(path):
        os.makedirs(path, exist_ok=True)

ensure_dir(OUT_ROOT); ensure_dir(FIGS_DIR); ensure_dir(AE_DIR)
print('Configuration set. Outputs ->', OUT_ROOT)

Configuration set. Outputs -> outputs_combined


## 3. Synthetic GNSS + IMU dataset generator
Generates a plausible random-walk trajectory and injects spoof windows (abrupt or gradual).

In [10]:
def generate_synthetic_gnss_imu(n_points=2000,
                                spoof_windows=[(400,480),(1100,1180),(1600,1680)],
                                abrupt_prob=0.6):
    deg_to_m = 111320
    lat = 51.75 + np.cumsum(np.random.normal(0, 0.00002, n_points))
    lon = -1.25 + np.cumsum(np.random.normal(0, 0.00002, n_points))

    dlat = np.concatenate([[0], np.diff(lat)])
    dlon = np.concatenate([[0], np.diff(lon)])
    disp_m = np.sqrt((dlat*deg_to_m)**2 + (dlon*deg_to_m*np.cos(np.deg2rad(lat)))**2)
    speed = np.clip(disp_m, 0, None)
    heading = (np.degrees(np.arctan2(dlon, np.where(dlat==0, 1e-9, dlat))) + 360) % 360

    accel = np.concatenate([[0], np.diff(speed)]) + np.random.normal(0,0.02,n_points)
    gyro = np.concatenate([[0], np.diff(heading)]) + np.random.normal(0,0.1,n_points)

    gnss_lat = lat + np.random.normal(0, 1e-5, n_points)
    gnss_lon = lon + np.random.normal(0, 1e-5, n_points)

    spoofed = np.zeros(n_points, dtype=int)
    for (s,e) in spoof_windows:
        length = e - s
        if np.random.rand() < abrupt_prob:
            gnss_lat[s:e] += np.random.normal(0.0009, 0.0003)
            gnss_lon[s:e] += np.random.normal(-0.0009, 0.0003)
        else:
            gnss_lat[s:e] += np.linspace(0, 0.0012, length) + np.random.normal(0,2e-5,length)
            gnss_lon[s:e] += np.linspace(0, -0.0012, length) + np.random.normal(0,2e-5,length)
        spoofed[s:e] = 1

    snr = 30 + np.random.normal(0,1,n_points)
    snr[spoofed==1] += np.random.normal(2.0,0.8, spoofed.sum())
    sat_count = np.random.randint(6,12,n_points)

    df = pd.DataFrame({
        'time': np.arange(n_points),
        'true_lat': lat, 'true_lon': lon,
        'gnss_lat': gnss_lat, 'gnss_lon': gnss_lon,
        'speed': speed, 'heading': heading,
        'imu_ax': accel, 'imu_gyro': gyro,
        'snr': snr, 'sat_count': sat_count,
        'spoofed': spoofed
    })
    return df

def load_or_generate_dataset():
    return generate_synthetic_gnss_imu()

print('Data generator ready')

Data generator ready


## 4. Feature engineering
Physics-informed features used by all models: GNSS displacement (m), INS proxy displacement, residuals, rolling stats, SNR, satellite count, heading change.

In [11]:
def extract_features(df):
    d = df.copy()
    deg_to_m = 111320

    d['dlat_g'] = d['gnss_lat'].diff().fillna(0)
    d['dlon_g'] = d['gnss_lon'].diff().fillna(0)
    d['dlat_m_g'] = d['dlat_g'] * deg_to_m
    d['dlon_m_g'] = d['dlon_g'] * deg_to_m * np.cos(np.deg2rad(d['gnss_lat']))
    d['disp_m_g'] = np.sqrt(d['dlat_m_g']**2 + d['dlon_m_g']**2)
    d['disp_m_ins'] = d['speed']
    d['disp_residual'] = d['disp_m_g'] - d['disp_m_ins']

    for w in [3,5,11]:
        d[f'disp_mean_{w}'] = d['disp_m_g'].rolling(w, min_periods=1).mean()
        d[f'disp_std_{w}'] = d['disp_m_g'].rolling(w, min_periods=1).std().fillna(0)

    d['dheading'] = np.abs(np.diff(np.pad(d['heading'].values,(1,0),'edge')))

    features = ['gnss_lat','gnss_lon','speed','heading','disp_m_g','disp_m_ins',
                'disp_residual','disp_mean_3','disp_std_3','disp_mean_5','disp_std_5',
                'snr','sat_count','dheading']
    features = [f for f in features if f in d.columns]
    X = d[features].fillna(0)
    y = d['spoofed'].astype(int).values
    return X, y

print('Feature extractor ready')

Feature extractor ready


## 5. Metrics & plotting helpers
Functions for common metrics and saved plots (confusion, ROC, metric bars, probability curves).

In [12]:
def safe_roc_auc(y_true, scores):
    try:
        return float(roc_auc_score(y_true, scores))
    except Exception:
        return float('nan')

def compute_metrics(y_true, y_pred, y_score):
    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred, average='binary', zero_division=0)
    auc = safe_roc_auc(y_true, y_score)
    return {"accuracy": acc, "precision": prec, "recall": rec, "f1": f1, "roc_auc": auc}

def plot_confusion(y_true, y_pred, title, path):
    cm = confusion_matrix(y_true, y_pred)
    fig, ax = plt.subplots(figsize=(4,3))
    ax.imshow(cm, cmap='Blues')
    for (i,j), v in np.ndenumerate(cm):
        ax.text(j, i, int(v), ha='center', va='center')
    ax.set_xticks([0,1]); ax.set_yticks([0,1])
    ax.set_xticklabels(['Clean','Spoof']); ax.set_yticklabels(['Clean','Spoof'])
    ax.set_xlabel('Predicted'); ax.set_ylabel('True'); ax.set_title(title)
    fig.tight_layout(); fig.savefig(path, dpi=150); plt.close(fig)

def plot_roc(y_true, y_prob, title, path):
    try:
        fpr, tpr, _ = roc_curve(y_true, y_prob)
        auc = safe_roc_auc(y_true, y_prob)
    except Exception:
        fpr, tpr, auc = [0,1], [0,1], float('nan')
    fig, ax = plt.subplots(figsize=(4.5,3.5))
    ax.plot(fpr, tpr, label=f'AUC={auc:.3f}')
    ax.plot([0,1],[0,1],'--', color='gray')
    ax.set_xlabel('FPR'); ax.set_ylabel('TPR'); ax.set_title(title); ax.legend()
    fig.tight_layout(); fig.savefig(path, dpi=150); plt.close(fig)

def plot_metrics_bar(metrics_dict, title, path):
    names = list(metrics_dict.keys())
    vals = list(metrics_dict.values())
    fig, ax = plt.subplots(figsize=(5,3))
    ax.bar(names, vals)
    ax.set_ylim(0,1.05)
    ax.set_title(title)
    for i,v in enumerate(vals):
        if isinstance(v, float) and not np.isnan(v):
            ax.text(i, v+0.02, f'{v:.3f}', ha='center')
        else:
            ax.text(i, 0.5, 'n/a', ha='center')
    fig.tight_layout(); fig.savefig(path, dpi=150); plt.close(fig)

def plot_probabilities(proba_rf, proba_xgb, path):
    fig, ax = plt.subplots(figsize=(10,2.8))
    ax.plot(proba_xgb, label='XGBoost p(spoof)', alpha=0.9)
    ax.plot(proba_rf, label='RandomForest p(spoof)', alpha=0.7)
    ax.set_title('Predicted spoof probabilities (test set order)')
    ax.set_xlabel('test sample index'); ax.set_ylabel('probability'); ax.legend()
    fig.tight_layout(); fig.savefig(path, dpi=150); plt.close(fig)

print('Metric & plotting helpers ready')

Metric & plotting helpers ready


## 6. LSTM Autoencoder implementation (sequence AE)
Small, explainable LSTM AE. It trains on clean windows only and returns reconstruction error per sample by averaging windows covering each sample.

In [13]:
class LSTMAE(nn.Module):
    def __init__(self, n_features, emb_size=16):
        super().__init__()
        self.encoder = nn.LSTM(n_features, emb_size, batch_first=True)
        self.decoder = nn.LSTM(emb_size, n_features, batch_first=True)

    def forward(self, x):
        # x: (B, T, F)
        _, (h, _) = self.encoder(x)
        z = h.permute(1, 0, 2)
        z_rep = z.repeat(1, x.size(1), 1)
        out, _ = self.decoder(z_rep)
        return out

def train_lstm_autoencoder(X, y, save_dir, seq_len=AE_SEQ_LEN, epochs=AE_EPOCHS, batch_size=AE_BATCH):
    ensure_dir(save_dir)
    scaler = StandardScaler()
    Xs = scaler.fit_transform(X)

    clean_mask = (y == 0)
    arr = Xs[clean_mask]
    if arr.shape[0] < seq_len + 1:
        arr = Xs
    seqs = []
    for i in range(len(arr) - seq_len + 1):
        seqs.append(arr[i:i+seq_len])
    seqs = np.stack(seqs) if len(seqs) > 0 else np.empty((0, seq_len, Xs.shape[1]))
    if seqs.shape[0] == 0:
        raise RuntimeError('Not enough data to build AE sequences. Decrease seq_len or use more data.')

    train_tensor = torch.tensor(seqs, dtype=torch.float32)
    loader = DataLoader(TensorDataset(train_tensor), batch_size=batch_size, shuffle=True)

    model = LSTMAE(X.shape[1], emb_size=16)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    lossfn = nn.MSELoss()
    losses = []

    for ep in range(1, epochs+1):
        model.train()
        running = 0.0
        count = 0
        for (batch,) in loader:
            opt.zero_grad()
            rec = model(batch)
            loss = lossfn(rec, batch)
            loss.backward(); opt.step()
            running += loss.item() * batch.size(0)
            count += batch.size(0)
        epoch_loss = running / max(count, 1)
        losses.append(epoch_loss)
        print(f"[AE] Epoch {ep}/{epochs} | loss={epoch_loss:.6f}")
    # Save loss curve
    loss_path = os.path.join(save_dir, 'ae_loss.png')
    fig, ax = plt.subplots(figsize=(5,3))
    ax.plot(range(1, len(losses)+1), losses, marker='o')
    ax.set_title('Autoencoder Training Loss'); ax.set_xlabel('Epoch'); ax.set_ylabel('Loss')
    fig.tight_layout(); fig.savefig(loss_path, dpi=150); plt.close(fig)

    # compute per-window error across whole dataset
    all_seqs = []
    for i in range(len(Xs) - seq_len + 1):
        all_seqs.append(Xs[i:i+seq_len])
    all_seqs = np.stack(all_seqs)
    model.eval()
    with torch.no_grad():
        rec_all = model(torch.tensor(all_seqs, dtype=torch.float32)).numpy()
    rec_err = np.mean((rec_all - all_seqs)**2, axis=(1,2))

    # map window errors to sample-level by averaging overlapping windows
    err_per_sample = np.zeros(len(Xs))
    counts = np.zeros(len(Xs))
    for i, e in enumerate(rec_err):
        err_per_sample[i:i+seq_len] += e
        counts[i:i+seq_len] += 1
    counts[counts == 0] = 1
    err_per_sample = err_per_sample / counts

    thresh = np.percentile(err_per_sample[clean_mask] if np.any(clean_mask) else err_per_sample, 95)
    preds = (err_per_sample > thresh).astype(int)

    ae_metrics = compute_metrics(y, preds, err_per_sample)

    err_plot_path = os.path.join(save_dir, 'ae_error_plot.png')
    fig, ax = plt.subplots(figsize=(8,3))
    ax.plot(err_per_sample, label='Reconstruction error')
    ax.axhline(thresh, color='r', ls='--', label='95% threshold')
    ax.scatter(np.where(y==1)[0], err_per_sample[y==1], c='orange', s=10, label='true spoof')
    ax.set_title('Autoencoder Reconstruction Errors per sample')
    ax.set_xlabel('sample index'); ax.set_ylabel('MSE'); ax.legend()
    fig.tight_layout(); fig.savefig(err_plot_path, dpi=150); plt.close(fig)

    return {
        'model': model,
        'scaler': scaler,
        'errors': err_per_sample,
        'threshold': thresh,
        'losses': losses,
        'metrics': ae_metrics,
        'loss_path': loss_path,
        'err_plot_path': err_plot_path
    }

print('AE implementation ready')

AE implementation ready


## 7. Cross-validation routine (RF & XGB)
- Runs Stratified K-Fold CV.
- Saves per-fold metrics CSVs and saves visuals **only for the final fold** (fold `N_SPLITS`) per your instruction.

In [14]:
def run_cross_validation(X, y, out_root, n_splits=5):
    ensure_dir(out_root)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=RANDOM_SEED)
    rows = []
    last_fold_visuals = {}
    for fold_idx, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
        print(f'\n--- Fold {fold_idx}/{n_splits} ---')
        Xtr, Xte = X.iloc[train_idx], X.iloc[test_idx]
        ytr, yte = y[train_idx], y[test_idx]

        rf = RandomForestClassifier(n_estimators=200, random_state=RANDOM_SEED)
        rf.fit(Xtr.values, ytr)
        rf_pred = rf.predict(Xte.values)
        rf_proba = rf.predict_proba(Xte.values)[:,1]
        rf_m = compute_metrics(yte, rf_pred, rf_proba)
        rows.append({'model':'RandomForest','fold':fold_idx, **rf_m})

        xg = xgb.XGBClassifier(n_estimators=200, use_label_encoder=False, eval_metric='logloss', random_state=RANDOM_SEED)
        xg.fit(Xtr.values, ytr)
        xg_pred = xg.predict(Xte.values)
        xg_proba = xg.predict_proba(Xte.values)[:,1]
        xg_m = compute_metrics(yte, xg_pred, xg_proba)
        rows.append({'model':'XGBoost','fold':fold_idx, **xg_m})

        if fold_idx == n_splits:
            fold_dir = os.path.join(out_root, f'cv_fold_{fold_idx}')
            ensure_dir(fold_dir)
            print('Saving visuals for final fold to:', fold_dir)
            plot_confusion(yte, rf_pred, f'RF Confusion (fold {fold_idx})', os.path.join(fold_dir, 'rf_conf_lastfold.png'))
            plot_roc(yte, rf_proba, f'RF ROC (fold {fold_idx})', os.path.join(fold_dir, 'rf_roc_lastfold.png'))
            plot_metrics_bar(rf_m, f'RF Metrics (fold {fold_idx})', os.path.join(fold_dir, 'rf_metrics_lastfold.png'))

            plot_confusion(yte, xg_pred, f'XGB Confusion (fold {fold_idx})', os.path.join(fold_dir, 'xgb_conf_lastfold.png'))
            plot_roc(yte, xg_proba, f'XGB ROC (fold {fold_idx})', os.path.join(fold_dir, 'xgb_roc_lastfold.png'))
            plot_metrics_bar(xg_m, f'XGB Metrics (fold {fold_idx})', os.path.join(fold_dir, 'xgb_metrics_lastfold.png'))
            last_fold_visuals = {
                'rf_conf': os.path.join(fold_dir, 'rf_conf_lastfold.png'),
                'rf_roc': os.path.join(fold_dir, 'rf_roc_lastfold.png'),
                'rf_metrics': os.path.join(fold_dir, 'rf_metrics_lastfold.png'),
                'xgb_conf': os.path.join(fold_dir, 'xgb_conf_lastfold.png'),
                'xgb_roc': os.path.join(fold_dir, 'xgb_roc_lastfold.png'),
                'xgb_metrics': os.path.join(fold_dir, 'xgb_metrics_lastfold.png'),
                'fold_idx': fold_idx
            }

    df = pd.DataFrame(rows)
    df.to_csv(os.path.join(out_root, 'cv_per_fold_metrics.csv'), index=False)
    avg = df.groupby('model')[['accuracy','precision','recall','f1','roc_auc']].mean().reset_index()
    avg.to_csv(os.path.join(out_root, 'cv_average_metrics.csv'), index=False)
    return avg, df, last_fold_visuals

print('Cross-validation routine ready')

Cross-validation routine ready


## 8. Unified pipeline (run everything) — documentation inside
This main function runs the dataset generation, feature extraction, split evaluation (80:20), CV (k-fold), AE training, and generates a single PDF that contains:
- Split results (tables and plots)
- CV average metrics and last-fold visuals
- AE results (loss + errors + threshold + AE metrics)
- A combined comparison table and summary chart

In [15]:
def run_unified_pipeline(n_points=2000):
    # 1) Data
    print('1) Generating synthetic dataset...')
    df = generate_synthetic_gnss_imu(n_points=n_points)
    print('   dataset shape:', df.shape)

    # 2) Features
    print('2) Extracting features...')
    X, y = extract_features(df)
    print('   features shape:', X.shape, '| labels shape:', y.shape)

    # A: 80:20 train/test
    print('\n=== A: 80:20 train/test evaluation ===')
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=RANDOM_SEED, stratify=y)

    rf = RandomForestClassifier(n_estimators=200, random_state=RANDOM_SEED)
    rf.fit(Xtr.values, ytr)
    rf_pred = rf.predict(Xte.values)
    rf_proba = rf.predict_proba(Xte.values)[:,1]
    rf_metrics = compute_metrics(yte, rf_pred, rf_proba)
    print('RF (80:20) metrics:', rf_metrics)

    xg = xgb.XGBClassifier(n_estimators=200, use_label_encoder=False, eval_metric='logloss', random_state=RANDOM_SEED)
    xg.fit(Xtr.values, ytr)
    xg_pred = xg.predict(Xte.values)
    xg_proba = xg.predict_proba(Xte.values)[:,1]
    xg_metrics = compute_metrics(yte, xg_pred, xg_proba)
    print('XGB (80:20) metrics:', xg_metrics)

    # Save split visuals
    ensure_dir(FIGS_DIR)
    plot_confusion(yte, rf_pred, 'RF Confusion (80:20)', os.path.join(FIGS_DIR, 'rf_conf_80_20.png'))
    plot_confusion(yte, xg_pred, 'XGB Confusion (80:20)', os.path.join(FIGS_DIR, 'xgb_conf_80_20.png'))
    plot_roc(yte, rf_proba, 'RF ROC (80:20)', os.path.join(FIGS_DIR, 'rf_roc_80_20.png'))
    plot_roc(yte, xg_proba, 'XGB ROC (80:20)', os.path.join(FIGS_DIR, 'xgb_roc_80_20.png'))
    plot_metrics_bar(rf_metrics, 'RF Metrics (80:20)', os.path.join(FIGS_DIR, 'rf_metrics_80_20.png'))
    plot_metrics_bar(xg_metrics, 'XGB Metrics (80:20)', os.path.join(FIGS_DIR, 'xgb_metrics_80_20.png'))
    plot_probabilities(rf_proba, xg_proba, os.path.join(FIGS_DIR, 'probabilities_80_20.png'))

    # B: Cross-validation
    print('\n=== B: Stratified K-Fold Cross-Validation (k={}) ==='.format(N_SPLITS))
    avg_cv, df_cv, last_fold_visuals = run_cross_validation(X, y, FIGS_DIR, N_SPLITS)
    print('\nAverage CV metrics:\n', avg_cv)

    # C: Autoencoder
    ae_info = None
    if DO_AE:
        print('\n=== C: LSTM Autoencoder (unsupervised) ===')
        ae_info = train_lstm_autoencoder(X.values, y, AE_DIR, seq_len=AE_SEQ_LEN, epochs=AE_EPOCHS, batch_size=AE_BATCH)
        print('AE threshold (95th percentile on clean):', ae_info['threshold'])
        print('AE metrics (thresholded labels):', ae_info['metrics'])
    else:
        print('Skipping AE (DO_AE=False)')

    # D: Build comparison table and charts
    print('\n=== D: Building comparison tables and summary charts ===')
    split_rows = [
        {'model': 'RandomForest (80:20)', **rf_metrics},
        {'model': 'XGBoost (80:20)', **xg_metrics}
    ]
    if ae_info:
        split_rows.append({'model': 'LSTM-AE (thresholded)', **ae_info['metrics']})
    else:
        split_rows.append({'model': 'LSTM-AE (thresholded)', 'accuracy':np.nan, 'precision':np.nan, 'recall':np.nan, 'f1':np.nan, 'roc_auc':np.nan})

    split_df = pd.DataFrame(split_rows)
    split_df.to_csv(os.path.join(OUT_ROOT, 'split_results.csv'), index=False)

    # compose comparison DataFrame that includes CV averages
    comp_df = split_df.copy()
    for _, row in avg_cv.iterrows():
        comp_df = pd.concat([
            comp_df,
            pd.DataFrame([{ 'model': f"{row['model']} (CV avg)", 'accuracy': row['accuracy'], 'precision': row['precision'], 'recall': row['recall'], 'f1': row['f1'], 'roc_auc': row['roc_auc'] }])
        ], ignore_index=True)

    # Save comp chart
    summary_chart_path = os.path.join(FIGS_DIR, 'summary_comparison.png')
    metrics = ['accuracy','precision','recall','f1','roc_auc']
    fig, ax = plt.subplots(figsize=(10,3.5))
    n = len(comp_df)
    x = np.arange(n)
    width = 0.12
    for i,m in enumerate(metrics):
        vals = comp_df[m].astype(float).values
        ax.bar(x + i*width, vals, width, label=m)
    ax.set_xticks(x + width*2)
    ax.set_xticklabels(comp_df['model'], rotation=30, ha='right', fontsize=8)
    ax.set_ylim(0,1.05)
    ax.legend(ncol=3, fontsize=8)
    ax.set_title('Model Comparison: Split vs CV vs AE (thresholded)')
    fig.tight_layout(); fig.savefig(summary_chart_path, dpi=150); plt.close(fig)

    # Save comparison table image
    def save_table_image(df_table, path, title=''):
        fig, ax = plt.subplots(figsize=(10, 1.2 + 0.35*len(df_table)))
        ax.axis('off')
        tbl = ax.table(cellText=df_table.round(3).values, colLabels=df_table.columns, loc='center')
        tbl.auto_set_font_size(False); tbl.set_fontsize(9); tbl.scale(1,1.2)
        if title:
            ax.set_title(title)
        fig.tight_layout(); fig.savefig(path, dpi=150); plt.close(fig)

    table_img_path = os.path.join(FIGS_DIR, 'comparison_table.png')
    save_table_image(comp_df[['model'] + metrics], table_img_path, title='Split & CV Comparison (selected metrics)')

    # E: Build unified PDF
    print('\n=== E: Generating unified PDF report ===')
    pdf_path = os.path.join(OUT_ROOT, 'unified_report.pdf')
    doc = SimpleDocTemplate(pdf_path, pagesize=letter)
    styles = getSampleStyleSheet()
    elems = []
    elems.append(Paragraph('Unified GPS & IMU Spoofing Detection Report', styles['Title']))
    elems.append(Spacer(1, 8))
    elems.append(Paragraph(f'Generated: {datetime.utcnow().strftime("%Y-%m-%d %H:%M UTC")}', styles['Normal']))
    elems.append(Spacer(1, 12))
    # A: split section
    elems.append(Paragraph('A. Train/Test Split (80:20) Results', styles['Heading2']))
    elems.append(Spacer(1,6))
    elems.append(Paragraph('Metrics (80:20): RandomForest & XGBoost. LSTM Autoencoder thresholded included for comparison.', styles['Normal']))
    elems.append(Spacer(1,6))
    if os.path.exists(table_img_path):
        elems.append(Image(table_img_path, width=500, height=140)); elems.append(Spacer(1,8))
    for img in ['rf_conf_80_20.png','xgb_conf_80_20.png','rf_roc_80_20.png','xgb_roc_80_20.png','rf_metrics_80_20.png','xgb_metrics_80_20.png','probabilities_80_20.png']:
        ip = os.path.join(FIGS_DIR, img)
        if os.path.exists(ip):
            elems.append(Image(ip, width=420, height=200)); elems.append(Spacer(1,6))

    # B: CV
    elems.append(Paragraph(f'B. Stratified K-Fold Cross-Validation (k={N_SPLITS})', styles['Heading2']))
    elems.append(Spacer(1,6))
    elems.append(Paragraph('Average cross-validation metrics (grouped by model). Only the final fold visuals are included below.', styles['Normal']))
    elems.append(Spacer(1,6))
    try:
        elems.append(Paragraph(avg_cv.to_html(index=False), styles['Normal']))
    except Exception:
        elems.append(Paragraph(str(avg_cv), styles['Normal']))
    elems.append(Spacer(1,8))
    if last_fold_visuals:
        elems.append(Paragraph(f'Visuals from last fold (fold {last_fold_visuals.get("fold_idx", N_SPLITS)})', styles['Heading3']))
        for key in ['rf_conf','rf_roc','rf_metrics','xgb_conf','xgb_roc','xgb_metrics']:
            ip = last_fold_visuals.get(key)
            if ip and os.path.exists(ip):
                elems.append(Image(ip, width=420, height=200)); elems.append(Spacer(1,6))

    # C: AE
    if ae_info:
        elems.append(Paragraph('C. LSTM Autoencoder (unsupervised) Results', styles['Heading2']))
        elems.append(Spacer(1,6))
        elems.append(Paragraph(f'Threshold (95th percentile on clean): {ae_info["threshold"]:.6f}', styles['Normal']))
        elems.append(Spacer(1,6))
        try:
            elems.append(Paragraph(pd.DataFrame([ae_info['metrics']], index=['LSTM-AE']).to_html(), styles['Normal']))
        except Exception:
            elems.append(Paragraph(str(ae_info['metrics']), styles['Normal']))
        elems.append(Spacer(1,6))
        if os.path.exists(ae_info.get('loss_path','')):
            elems.append(Image(ae_info['loss_path'], width=420, height=200)); elems.append(Spacer(1,6))
        if os.path.exists(ae_info.get('err_plot_path','')):
            elems.append(Image(ae_info['err_plot_path'], width=420, height=140)); elems.append(Spacer(1,6))

    # D: final comparison
    elems.append(Paragraph('D. Final Comparison & Summary', styles['Heading2']))
    elems.append(Spacer(1,6))
    if os.path.exists(summary_chart_path):
        elems.append(Image(summary_chart_path, width=500, height=220)); elems.append(Spacer(1,6))

    doc.build(elems)
    print('\n✅ Unified PDF report written to:', pdf_path)
    print('All figures saved to:', FIGS_DIR, '; AE outputs saved to:', AE_DIR)

if __name__ == '__main__':
    run_unified_pipeline(n_points=2000)


1) Generating synthetic dataset...
   dataset shape: (2000, 12)
2) Extracting features...
   features shape: (2000, 14) | labels shape: (2000,)

=== A: 80:20 train/test evaluation ===
RF (80:20) metrics: {'accuracy': 0.99, 'precision': 1.0, 'recall': 0.9166666666666666, 'f1': 0.9565217391304348, 'roc_auc': 0.9995857007575757}
XGB (80:20) metrics: {'accuracy': 1.0, 'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'roc_auc': 1.0}


Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)



=== B: Stratified K-Fold Cross-Validation (k=5) ===

--- Fold 1/5 ---


Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)



--- Fold 2/5 ---


Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)



--- Fold 3/5 ---


Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)



--- Fold 4/5 ---


Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)



--- Fold 5/5 ---


Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


Saving visuals for final fold to: outputs_combined/figures/cv_fold_5

Average CV metrics:
           model  accuracy  precision    recall        f1   roc_auc
0  RandomForest    0.9950   0.991667  0.966667  0.978899  0.999657
1       XGBoost    0.9985   1.000000  0.987500  0.993684  0.998923

=== C: LSTM Autoencoder (unsupervised) ===
[AE] Epoch 1/15 | loss=0.739879
[AE] Epoch 2/15 | loss=0.698526
[AE] Epoch 3/15 | loss=0.671854
[AE] Epoch 4/15 | loss=0.661221
[AE] Epoch 5/15 | loss=0.651356
[AE] Epoch 6/15 | loss=0.639789
[AE] Epoch 7/15 | loss=0.632340
[AE] Epoch 8/15 | loss=0.628165
[AE] Epoch 9/15 | loss=0.624925
[AE] Epoch 10/15 | loss=0.621833
[AE] Epoch 11/15 | loss=0.619374
[AE] Epoch 12/15 | loss=0.617776
[AE] Epoch 13/15 | loss=0.616567
[AE] Epoch 14/15 | loss=0.615708
[AE] Epoch 15/15 | loss=0.614489
AE threshold (95th percentile on clean): 1.1607283979561749
AE metrics (thresholded labels): {'accuracy': 0.9145, 'precision': 0.6408163265306123, 'recall': 0.6541666666666667, '

  elems.append(Paragraph(f'Generated: {datetime.utcnow().strftime("%Y-%m-%d %H:%M UTC")}', styles['Normal']))



✅ Unified PDF report written to: outputs_combined/unified_report.pdf
All figures saved to: outputs_combined/figures ; AE outputs saved to: outputs_combined/autoencoder


## 9. Notes (Implementation details & reproducibility)

- **Data preprocessing:** feature extraction uses GNSS diff -> meters conversion and rolling windows (3,5,11). Missing values are filled with 0 for simplicity; for real data consider interpolation and more careful handling.
- **Hyperparameters:** default tree counts: `n_estimators=200`. AE: `seq_len=20`, `epochs=15`. These were chosen for speed. For publication use randomized/grid search and report ranges.
- **AE threshold:** 95th percentile on clean reconstruction errors (common heuristic). You may calibrate threshold using validation sets.
- **What the PDF contains:** Split results (80:20), CV average metrics (and last-fold visuals only), AE results, final comparison chart and comparison table.
- **Run-time:** AE training is CPU-friendly but faster on GPU.

If you'd like, I can also produce:
- a) a ZIP with `merged_spoofing_pipeline.ipynb` and `outputs_combined/` (if you want me to run and upload outputs), or
- b) a reduced notebook without AE for faster runs.