# 08 · FFT & Autocorrelation Analysis (SpectraMind V50)

Mission‑grade notebook to analyze **frequency structure** and **temporal/spectral self‑similarity** of spectra using **FFT** and **autocorrelation**. This is a *thin orchestration* notebook:

- **No ad‑hoc pipeline code** — only reads artifacts produced by the CLI (calibrated spectra, predictions, residuals) and writes diagnostics to `outputs/`.
- Supports overlays for **symbolic regions** (e.g., water bands) and saves a compact JSON/CSV **summary bundle** for dashboards.
- Optional: invoke your CLI tool (e.g., `analyze_fft_autocorr_mu.py`) if available to regenerate artifacts.

**What you'll do**
1) Environment & repo sanity checks (best‑effort).
2) Locate inputs: calibrated spectra or prediction files under `outputs/`.
3) Compute per‑spectrum **FFT power** (and band energies) and **autocorrelation** (normalized).
4) Plot example spectra with **band overlays**, FFT power curves, and autocorr functions.
5) Export `fft_autocorr_summary.json` + `fft_autocorr_detail.csv` + PNGs to `outputs/notebooks/08_fft_autocorr/`.
6) (Optional) Run the CLI analyzer to (re)generate standardized diagnostics.

## 0) Setup & paths
Create a deterministic output directory for this notebook and capture minimal environment info.

In [None]:
import os, sys, json, subprocess, platform, shutil, textwrap
from pathlib import Path
from datetime import datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_context('notebook'); sns.set_style('whitegrid')

ROOT = Path.cwd()
NB_OUT = ROOT / 'outputs' / 'notebooks' / '08_fft_autocorr'
NB_OUT.mkdir(parents=True, exist_ok=True)

print('ROOT:', ROOT)
print('NB_OUT:', NB_OUT)

env = {
    'python': platform.python_version(),
    'platform': platform.platform(),
}
with open(NB_OUT/'env_snapshot.json', 'w') as f:
    json.dump(env, f, indent=2)
print('Saved env snapshot.')

## 1) Locate inputs (calibrated / predictions / residuals)
We prefer **calibrated spectra** and fall back to predictions. The search covers common CSV and NPY file names under `outputs/`; the loader further down also reads Parquet if you point `PICK` at such a file manually.

In [None]:
def find_candidates():
    out = []
    roots = [ROOT/'outputs']
    patterns = [
        '**/calibrated_spectra.csv',
        '**/predictions.csv',
        '**/mu.csv',
        '**/spectra.npy',
        '**/mu.npy',
        '**/residuals.csv',
        '**/residuals.npy'
    ]
    for r in roots:
        for pat in patterns:
            out.extend(r.glob(pat))
    # Unique by path
    seen, uniq = set(), []
    for p in sorted(out):
        if p not in seen:
            uniq.append(p); seen.add(p)
    return uniq

CANDIDATES = find_candidates()
print('Found candidates (up to 10):')
for p in CANDIDATES[:10]:
    print(' -', p.relative_to(ROOT))

# Heuristic pick: prefer calibrated or mu predictions
PICK = None
for key in ['calibrated_spectra.csv','mu.csv','predictions.csv','spectra.npy','mu.npy']:
    for p in CANDIDATES:
        if p.name == key:  # exact name match, so residuals.npy is never auto-selected
            PICK = p; break
    if PICK is not None:
        break
print('\nSelected input:', PICK)

### Load spectra table
We normalize to a simple in‑memory frame with columns:
- `planet_id` (or `id`) if available; else synthetic index
- `wavelength_index` (0..W-1)
- `wavelength_um` (if available; else synthetic)
- `mu` (spectrum value used for FFT/autocorr)
- `sigma` (optional)

In [None]:
def load_spectra_table(path: Path) -> pd.DataFrame:
    if path is None or not path.exists():
        raise FileNotFoundError('No spectra file selected.')
    if path.suffix == '.npy':
        arr = np.load(path)
        # Expect shape [N, W] (N spectra)
        if arr.ndim == 1:
            arr = arr[None, :]
        n, w = arr.shape
        rows = []
        for i in range(n):
            mu = arr[i]
            for j in range(w):
                rows.append({'planet_id': i, 'wavelength_index': j, 'wavelength_um': np.nan, 'mu': float(mu[j])})
        return pd.DataFrame(rows)

    # CSV/Parquet style (wide or long)
    if path.suffix == '.csv':
        df = pd.read_csv(path)
    elif path.suffix == '.parquet':
        df = pd.read_parquet(path)
    else:
        raise ValueError(f'Unsupported file type: {path.suffix}')

    cols = {c.lower(): c for c in df.columns}
    # Long format already?
    if {'planet_id','wavelength_index','mu'}.issubset(set(cols)):
        out = df.rename(columns={cols['planet_id']:'planet_id', cols['wavelength_index']:'wavelength_index', cols['mu']:'mu'})
        if 'wavelength_um' in cols:
            out = out.rename(columns={cols['wavelength_um']:'wavelength_um'})
        elif 'wavelength' in cols:
            out = out.rename(columns={cols['wavelength']:'wavelength_um'})
        if 'sigma' in cols: out = out.rename(columns={cols['sigma']:'sigma'})
        return out

    # Otherwise assume wide: columns like mu_000..mu_XXX or generic numeric columns
    mu_cols = [c for c in df.columns if str(c).startswith('mu_')]
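    if not mu_cols:
        # Use all numeric cols as wavelengths, skipping identifier columns
        skip = {'planet_id', 'id'}
        mu_cols = [c for c in df.columns
                   if str(c).lower() not in skip and pd.api.types.is_numeric_dtype(df[c])]
    id_col = cols.get('planet_id', cols.get('id'))  # accept `id` as a fallback identifier
    rows = []
    for i, row in df.iterrows():
        pid = row[id_col] if id_col is not None else i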
        for j, mc in enumerate(mu_cols):
            rows.append({
                'planet_id': pid,
                'wavelength_index': j,
                'wavelength_um': np.nan,
                'mu': float(row[mc])
            })
    return pd.DataFrame(rows)

spec_df = load_spectra_table(PICK)
print('Spectra table shape:', spec_df.shape)
spec_df.head()
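
The `.npy` branch above builds the long table one element at a time, which can be slow for large `[N, W]` arrays. As a minimal sketch (the helper name `array_to_long` is hypothetical and not part of the repository), the same layout can be produced with vectorized NumPy calls:

```python
import numpy as np
import pandas as pd

def array_to_long(arr: np.ndarray) -> pd.DataFrame:
    """Convert an [N, W] array of spectra into the long table layout."""
    arr = np.atleast_2d(np.asarray(arr, dtype=float))  # a 1-D input becomes one spectrum
    n, w = arr.shape
    return pd.DataFrame({
        'planet_id': np.repeat(np.arange(n), w),       # 0,0,...,1,1,...
        'wavelength_index': np.tile(np.arange(w), n),  # 0..W-1 repeated N times
        'wavelength_um': np.full(n * w, np.nan),       # unknown for bare arrays
        'mu': arr.ravel(),                             # row-major order matches the index columns
    })

demo = array_to_long(np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]))
print(demo)
```

The column names match those returned by `load_spectra_table()`, so the result should be usable as a drop-in for the loop-based branch.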

## 2) FFT & autocorrelation functions
We compute **single‑sided FFT power** and **normalized autocorrelation** for each spectrum. Both operate over the discrete channel index, so frequencies are in cycles per channel (0 to 0.5) and lags are in channels.

In [None]:
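def fft_power_onesided(y: np.ndarray):
    """One-sided FFT power of the mean-removed spectrum (frequencies in cycles per channel)."""
    y = np.asarray(y, float)
    y = y - np.nanmean(y)
    y = np.where(np.isnan(y), 0.0, y)  # NaN samples sit at the mean so rfft stays finite
    fy = np.fft.rfft(y)
    power = np.abs(fy)**2
    freqs = np.fft.rfftfreq(len(y), d=1.0)  # index space
    return freqs, power

def autocorr_norm(y: np.ndarray):
    """Autocorrelation at lags 0..N-1, normalized so that r[0] == 1."""
    y = np.asarray(y, float)
    N = len(y)
    if N == 0:
        return np.arange(1), np.zeros(1)
    y = y - np.nanmean(y)
    y = np.where(np.isnan(y), 0.0, y)  # same NaN handling as the FFT path
    r = np.correlate(y, y, mode='full')
    r = r[r.size//2:]  # keep non-negative lags only
    if r[0] != 0:
        r = r / r[0]
    lags = np.arange(r.size)
    return lags, r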

def band_energy(power: np.ndarray, freqs: np.ndarray, f_lo: float, f_hi: float):
    m = (freqs >= f_lo) & (freqs <= f_hi)
    return float(np.nansum(power[m]))

def analyze_spectrum(y: np.ndarray, bands_f=None):
    f, p = fft_power_onesided(y)
    lags, r = autocorr_norm(y)
    be = {}
    if bands_f:
        for name,(flo,fhi) in bands_f.items():
            be[name] = band_energy(p, f, flo, fhi)
    summary = {
        'fft_peak_freq': float(f[np.nanargmax(p)]) if p.size else np.nan,
        'fft_peak_power': float(np.nanmax(p)) if p.size else np.nan,
        'autocorr_first_zero_lag': int(np.argmax(r<=0)) if np.any(r<=0) else int(len(r)-1)
    }
    summary.update({f'band_energy_{k}': v for k,v in be.items()})
    return f,p,lags,r,summary
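
A quick way to sanity-check the two transforms is to feed them a signal with a known period. The sketch below is standalone and repeats the same steps with plain NumPy rather than calling the notebook functions; the length `n` and `period` are arbitrary demo values. For a pure cosine, the FFT peak should sit at `1 / period` cycles per channel and the autocorrelation should first reach zero near a quarter period.

```python
import numpy as np

n = 256
period = 16                        # channels per cycle (demo value)
x = np.arange(n)
y = np.cos(2 * np.pi * x / period)

# One-sided power spectrum after mean removal, as in fft_power_onesided().
yc = y - y.mean()
power = np.abs(np.fft.rfft(yc)) ** 2
freqs = np.fft.rfftfreq(n, d=1.0)  # cycles per channel, from 0 to 0.5
peak_freq = freqs[np.argmax(power)]

# Normalized autocorrelation at non-negative lags, as in autocorr_norm().
r = np.correlate(yc, yc, mode='full')[n - 1:]
r = r / r[0]
first_zero_lag = int(np.argmax(r <= 0))

print('peak frequency:', peak_freq, 'expected:', 1 / period)
print('first non-positive autocorr lag:', first_zero_lag)
```

If real spectra show a peak frequency or zero-crossing lag far from what a visual inspection suggests, this kind of synthetic check helps separate a loading problem from a genuine feature.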

### Define frequency and symbolic bands (demo)
- **FFT bands**: frequency ranges (index‑space) to summarize periodic components (e.g., low/medium/high).
- **Molecular bands** (symbolic overlays): wavelength‑index ranges that are shaded on the plots (visualization only). The demo values are placeholders; adjust them for your grid.

In [None]:
# Frequency bands in index space (heuristic demo)
FFT_BANDS = {
    'low': (0.0, 0.05),
    'mid': (0.05, 0.15),
    'high': (0.15, 0.50)
}

# Symbolic wavelength index bands (adjust to your actual grid)
SYM_BANDS = {
    'H2O_1': (120, 150),
    'H2O_2': (180, 220)
}

print('FFT_BANDS:', FFT_BANDS)
print('SYM_BANDS:', SYM_BANDS)
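
When a wavelength axis in microns is available, index bands like those in `SYM_BANDS` can be derived from wavelength intervals instead of being typed by hand. This is a minimal sketch: the helper `um_band_to_index`, the uniform grid, and the interval are all illustrative assumptions, not values taken from the pipeline.

```python
import numpy as np

def um_band_to_index(wavelength_um, lo_um, hi_um):
    """Map a wavelength interval [lo_um, hi_um] to a half-open index range (a, b)."""
    wl = np.asarray(wavelength_um, dtype=float)
    if np.any(np.diff(wl) <= 0):
        raise ValueError('wavelength grid must be strictly increasing')
    a = int(np.searchsorted(wl, lo_um, side='left'))   # first channel with wl >= lo_um
    b = int(np.searchsorted(wl, hi_um, side='right'))  # one past the last channel with wl <= hi_um
    return a, b

# Hypothetical uniform grid; replace with your instrument's wavelength axis.
grid_um = np.linspace(0.5, 7.5, 300)
# Illustrative interval only, not an authoritative band definition.
a, b = um_band_to_index(grid_um, 1.3, 1.5)
print((a, b), grid_um[a], grid_um[b - 1])
```

The returned pair follows the same convention the plotting code uses for `SYM_BANDS`, where a band `(a, b)` is shaded from `x[a]` to `x[b-1]`.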

## 3) Run analysis for a small set and plot
We sample a handful of spectra, compute FFT/autocorr, and generate PNGs with symbolic band overlays.

In [None]:
def plot_spectrum_with_bands(wl_idx, mu, title, out_png=None, bands=None):
    x = wl_idx
    plt.figure(figsize=(10,3.2))
    plt.plot(x, mu, lw=1.2)
    if bands:
        for name,(a,b) in bands.items():
            a,b = max(0,a), min(len(mu),b)
            if b>a:
                plt.axvspan(x[a], x[b-1], color='gray', alpha=0.15)
                plt.text((x[a]+x[b-1])/2, np.nanmax(mu)*0.98, name, ha='center', va='top', fontsize=8, alpha=0.8)
    plt.xlabel('wavelength index'); plt.ylabel('value (mu)')
    plt.title(title)
    plt.tight_layout()
    if out_png:
        plt.savefig(out_png, dpi=150)
        plt.close()
    else:
        plt.show()

def plot_fft_autocorr(freqs, power, lags, acorr, base_name):
    fig,ax = plt.subplots(1,2,figsize=(12,3.2))
    ax[0].plot(freqs[1:], power[1:], lw=1.2)
    ax[0].set_xlabel('frequency (index space)'); ax[0].set_ylabel('power'); ax[0].set_title('FFT power (one-sided)')
    ax[1].plot(lags, acorr, lw=1.2)
    ax[1].set_xlabel('lag (index)'); ax[1].set_ylabel('autocorr (norm)'); ax[1].set_title('Autocorrelation')
    fig.tight_layout()
    out = NB_OUT/f'{base_name}_fft_autocorr.png'
    fig.savefig(out, dpi=150)
    plt.close(fig)
    return out

# Choose a small set of planets
planets = spec_df['planet_id'].drop_duplicates().tolist()
sel_planets = planets[: min(5, len(planets))]

detail_rows = []
for pid in sel_planets:
    sdf = spec_df[spec_df['planet_id']==pid].sort_values('wavelength_index')
    wl_idx = sdf['wavelength_index'].to_numpy()
    mu = sdf['mu'].to_numpy()

    # Plot spectrum with symbolic bands
    png_spec = NB_OUT/f'planet_{pid}_spectrum.png'
    plot_spectrum_with_bands(wl_idx, mu, f'Planet {pid} — spectrum', out_png=png_spec, bands=SYM_BANDS)

    # FFT/AC
    f, p, lags, r, summ = analyze_spectrum(mu, bands_f=FFT_BANDS)
    png_fft = plot_fft_autocorr(f, p, lags, r, f'planet_{pid}')

    row = {'planet_id': pid}
    row.update(summ)
    detail_rows.append(row)

detail_df = pd.DataFrame(detail_rows)
detail_path = NB_OUT/'fft_autocorr_detail.csv'
detail_df.to_csv(detail_path, index=False)
print('Wrote:', detail_path)
detail_df.head()

## 4) Summary JSON
Persist a compact summary (means over the analyzed set) for dashboards or future notebooks.

In [None]:
summary = {
    'n_planets': int(detail_df.shape[0]),
    'fft_peak_freq_mean': float(detail_df['fft_peak_freq'].mean()) if 'fft_peak_freq' in detail_df else None,
    'fft_peak_power_mean': float(detail_df['fft_peak_power'].mean()) if 'fft_peak_power' in detail_df else None,
    'autocorr_first_zero_lag_mean': float(detail_df['autocorr_first_zero_lag'].mean()) if 'autocorr_first_zero_lag' in detail_df else None,
}
for k in [c for c in detail_df.columns if c.startswith('band_energy_')]:
    summary[f'{k}_mean'] = float(detail_df[k].mean())
with open(NB_OUT/'fft_autocorr_summary.json', 'w') as f:
    json.dump(summary, f, indent=2)
print('Wrote:', NB_OUT/'fft_autocorr_summary.json')
summary
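
One caveat for dashboard consumers: `json.dump` writes `NaN` for non-finite floats by default, which strict JSON parsers reject. If a mean can come out as NaN (for example, over an empty selection), a small cleaning step avoids that. The helper name `nan_to_none` and the demo dictionary are hypothetical.

```python
import json
import math

def nan_to_none(obj):
    """Recursively replace non-finite floats with None so the output is strict JSON."""
    if isinstance(obj, float) and not math.isfinite(obj):
        return None
    if isinstance(obj, dict):
        return {k: nan_to_none(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [nan_to_none(v) for v in obj]
    return obj

# Demo values only; in the notebook you would pass the `summary` dict instead.
demo_summary = {'n_planets': 0, 'fft_peak_freq_mean': float('nan'), 'bands': [1.0, float('inf')]}
# allow_nan=False makes json raise if any non-finite value slipped through.
text = json.dumps(nan_to_none(demo_summary), allow_nan=False)
print(text)
```

In the notebook this would amount to calling `json.dump(nan_to_none(summary), f, indent=2, allow_nan=False)` when writing the summary file.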

## 5) (Optional) Call the CLI analyzer to (re)generate standardized diagnostics
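If your repository ships `analyze_fft_autocorr_mu.py` as a CLI tool, you can refresh artifacts here. The cell assumes the tool is exposed as the `spectramind diagnose fft-autocorr` subcommand; adjust `cmd` if your repository invokes it differently. This preserves the **CLI‑first** contract (the notebook remains a viewer/driver only).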

In [None]:
RUN_CLI = False  # toggle to True to enable
if RUN_CLI:
    cmd = [
        'spectramind', 'diagnose', 'fft-autocorr',
        # Example overrides (adjust to your config tree):
        # 'diagnostics.fft.n_freq=128', 'diagnostics.autocorr.max_lag=150'
    ]
    print('Running:', ' '.join(cmd))
    try:
        subprocess.run(cmd, check=True)
    except Exception as e:
        print('CLI diagnostics failed:', e)
else:
    print('CLI step disabled. Set RUN_CLI=True to enable.')
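
If the executable may be missing from the environment, it can help to check for it before launching the subprocess and to capture its output for logging. The following is a sketch under that assumption; `run_cli_if_available` is a hypothetical helper, and the command in the demo is deliberately a non-existent tool.

```python
import shutil
import subprocess

def run_cli_if_available(cmd):
    """Run cmd when its executable is on PATH; return the CompletedProcess or None."""
    exe = shutil.which(cmd[0])
    if exe is None:
        print(f'Executable not found on PATH: {cmd[0]}')
        return None
    # capture_output keeps stdout/stderr so they can be logged next to the notebook outputs.
    return subprocess.run(cmd, check=False, capture_output=True, text=True)

# Demo with a command that should not exist, so nothing is executed.
result = run_cli_if_available(['definitely-not-a-real-tool-xyz', '--help'])
print('result:', result)
```

With the notebook's own `cmd` list, the call would be `run_cli_if_available(cmd)`, after which `result.returncode` and `result.stderr` can be inspected.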

---
### Notes
- To compare **raw vs calibrated** FFT power, load the raw file as well, call `analyze_spectrum()` on each, and plot both power curves on shared log‑scale axes to show how much calibration suppresses systematics.
- For **wavelength‑aware** band overlays, convert each band's wavelength range to channel indices on your instrument grid and use those in `SYM_BANDS`.
- Consider summarizing FFT power in **physically‑motivated** bands (e.g., those known to accumulate jitter/systematics) to track calibration efficacy over runs.
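
The raw-versus-calibrated comparison in the notes above can also be reduced to one number per frequency band. The sketch below uses synthetic stand-in signals (not real data) and a hypothetical helper `band_energies`. Unlike the notebook's inclusive `band_energy()`, it uses half-open bands so that a frequency bin falling exactly on a shared edge is counted once.

```python
import numpy as np

def band_energies(y, bands):
    """One-sided FFT power summed over half-open frequency bands [lo, hi)."""
    y = np.asarray(y, dtype=float)
    power = np.abs(np.fft.rfft(y - y.mean())) ** 2
    freqs = np.fft.rfftfreq(len(y), d=1.0)
    return {name: float(power[(freqs >= lo) & (freqs < hi)].sum())
            for name, (lo, hi) in bands.items()}

# Upper edge slightly above 0.5 so the Nyquist bin lands in 'high'.
bands = {'low': (0.0, 0.05), 'mid': (0.05, 0.15), 'high': (0.15, 0.5 + 1e-9)}

# Synthetic stand-ins: a smooth signal, plus channel-to-channel noise for the "raw" version.
rng = np.random.default_rng(0)
x = np.arange(256)
calibrated = np.sin(2 * np.pi * x / 128)
raw = calibrated + 0.2 * rng.standard_normal(x.size)

e_raw = band_energies(raw, bands)
e_cal = band_energies(calibrated, bands)
for name in bands:
    # Small floor avoids log10(0) when a band holds no power.
    ratio_db = 10 * np.log10((e_cal[name] + 1e-30) / (e_raw[name] + 1e-30))
    print(f'{name}: calibrated/raw = {ratio_db:.1f} dB')
```

A strongly negative ratio in a band means the calibrated spectrum carries much less power there than the raw one; tracking that value over runs is one possible way to follow calibration efficacy.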

**All artifacts written to**: `outputs/notebooks/08_fft_autocorr/`