# 🌍 World Discovery Engine (WDE) — Kaggle Notebook (Ready-to-Upload)

Minimal, self-contained scaffold that mirrors the WDE pipeline stages and produces real artifacts on Kaggle using only packages that Kaggle kernels normally preinstall (NumPy, OpenCV, matplotlib; PyTorch is optional):

Stages: data → preprocess → detect → verify (demo) → dossier → manifests

What you get
- Deterministic seeds + environment printout
- Synthetic AOI tiling + demo raster tile with a planted anomaly
- Coarse anomaly detection (Canny edges) + scoring
- Candidate artifacts (JSON + PNG) and a simple dossier per candidate
- Final run manifest & short file tree recap

> Replace the demo data generation with real fetchers and models as you develop. This notebook intentionally avoids private packages so it will run on fresh Kaggle kernels out of the box.

In [None]:
# --- Environment & Reproducibility -----------------------------------------
import os, sys, json, math, random, pathlib, time, datetime as dt
from typing import Dict, Any

print('Python:', sys.version)
print('CWD:', pathlib.Path.cwd())

# Set standard roots (overridable via env)
os.environ.setdefault('WDE_DATA_ROOT', './data')
os.environ.setdefault('WDE_ARTIFACTS_ROOT', './artifacts')
DATA_ROOT = pathlib.Path(os.environ['WDE_DATA_ROOT']).resolve()
ART_ROOT = pathlib.Path(os.environ['WDE_ARTIFACTS_ROOT']).resolve()
print('WDE_DATA_ROOT =', DATA_ROOT)
print('WDE_ARTIFACTS_ROOT =', ART_ROOT)

# Deterministic seeds (for numpy/random and torch when available)
SEED = int(os.environ.get('WDE_SEED', '2025'))
random.seed(SEED)
try:
    import numpy as np
except Exception as e:
    raise RuntimeError('NumPy is required in Kaggle runtime.') from e
np.random.seed(SEED)

try:
    import torch
    torch.manual_seed(SEED)
    TORCH_OK = True
    USE_GPU = torch.cuda.is_available()
except Exception:
    TORCH_OK = False
    USE_GPU = False
print('Torch available:', TORCH_OK, '| GPU:', USE_GPU)

# CV/plot deps: usually preinstalled on Kaggle; fall back to None if an import fails.
try:
    import cv2
except Exception:
    cv2 = None
try:
    import matplotlib.pyplot as plt
    from matplotlib import cm
except Exception:
    plt = None

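# datetime.utcnow() is deprecated since Python 3.12; build an aware UTC timestamp instead
RUN_TS = dt.datetime.now(dt.timezone.utc).replace(microsecond=0).isoformat().replace('+00:00', 'Z')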
print('Run timestamp (UTC):', RUN_TS)

In [None]:
# --- Small IO helpers (self-contained) --------------------------------------
def ensure_dir(p: os.PathLike) -> pathlib.Path:
    p = pathlib.Path(p)
    p.mkdir(parents=True, exist_ok=True)
    return p

def write_json(path: os.PathLike, obj: Dict[str, Any]):
    path = pathlib.Path(path)
    ensure_dir(path.parent)
    with path.open('w', encoding='utf-8') as f:
        json.dump(obj, f, indent=2, ensure_ascii=False)
    return path

def save_png(path: os.PathLike, img_arr):
    path = pathlib.Path(path)
    ensure_dir(path.parent)
    if cv2 is None:
        raise RuntimeError('OpenCV not available: cannot save PNG in this minimal demo.')
    # 3-channel input is assumed to be RGB (uint8, 0..255); OpenCV writes BGR
    if img_arr.ndim == 3 and img_arr.shape[2] == 3:
        # Reverse the channel order and make the result contiguous for OpenCV
        bgr = np.ascontiguousarray(img_arr[..., ::-1])
        cv2.imwrite(str(path), bgr)
    else:
        cv2.imwrite(str(path), img_arr)
    return path

def limited_tree(root: os.PathLike, max_files: int = 64):
    root = pathlib.Path(root)
    out = []
    count = 0
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # walk subdirectories in a deterministic order
        dp = pathlib.Path(dirpath)
        for fn in sorted(filenames):
            out.append(str((dp / fn).relative_to(root)))
            count += 1
            if count >= max_files:
                out.append('… (truncated)')
                return out
    return out

In [None]:
# --- Config (edit me) -------------------------------------------------------
CONFIG = {
    "run": {
        "seed": SEED,
        "timestamp_utc": RUN_TS
    },
    "aoi": {
        # (min_lat, min_lon, max_lat, max_lon) — tiny box for demo
        "bbox": (-3.500, -60.500, -3.450, -60.450)
    },
    "tiling": {
        "tile_size_deg": 0.05,   # ~5.5 km in latitude; the demo AOI fits in a single tile
        "img_size": 256          # demo raster size per tile (pixels)
    },
    "detect": {
        "canny_low": 50,
        "canny_high": 150,
        "min_edge_score": 100    # edge-pixel count needed to flag a candidate; the demo disk (radius 32 px) yields about 200, so 2000 would never fire
    },
    "paths": {
        "data_raw": str(DATA_ROOT / 'raw'),
        "data_processed": str(DATA_ROOT / 'processed'),
        "candidates": str(ART_ROOT / 'candidates'),
        "dossiers": str(ART_ROOT / 'dossiers')
    }
}

print(json.dumps(CONFIG, indent=2))

# Ensure directories
ensure_dir(CONFIG['paths']['data_raw'])
ensure_dir(CONFIG['paths']['data_processed'])
ensure_dir(CONFIG['paths']['candidates'])
ensure_dir(CONFIG['paths']['dossiers'])
print('Verified minimal directory structure.')

## 1) Data (demo) → Synthetic AOI Tile
For a minimal, offline-safe run, we generate a single RGB tile with an injected circular anomaly. Replace this cell with your real data fetchers (Sentinel/Landsat/SRTM/NICFI).

In [None]:
img_size = CONFIG['tiling']['img_size']

# Synthetic RGB background
rgb = np.zeros((img_size, img_size, 3), dtype=np.uint8)
rgb[..., 1] = 35  # a faint greenish baseline to mimic vegetation tint

# Inject a bright circular anomaly at the center (white disk)
cx = cy = img_size // 2
rr = img_size // 8
yy, xx = np.ogrid[:img_size, :img_size]
mask = (xx - cx)**2 + (yy - cy)**2 <= rr**2
rgb[mask] = [255, 255, 255]

# Save demo tile
tile_png = pathlib.Path(CONFIG['paths']['data_processed']) / 'tile_000_rgb.png'
save_png(tile_png, rgb)

# Create a minimal fetch manifest
fetch_manifest = {
    "created": RUN_TS,
    "seed": SEED,
    "aoi_bbox": CONFIG['aoi']['bbox'],
    "tiles": [{"id": 0, "path": str(tile_png), "shape": list(rgb.shape)}]
}
write_json(pathlib.Path(CONFIG['paths']['data_raw']) / 'fetch_manifest.json', fetch_manifest)
print('Wrote synthetic tile and fetch_manifest.json')
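
The demo cell writes a single tile and never reads `tiling.tile_size_deg`. When real fetchers replace it, the AOI has to be split into tiles first. The following is a minimal sketch of that step, assuming the `(min_lat, min_lon, max_lat, max_lon)` bbox order from `CONFIG`; the helper name `tile_bbox` and the tile record layout are hypothetical, not part of WDE.

```python
import math
from typing import Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (min_lat, min_lon, max_lat, max_lon)

def tile_bbox(bbox: BBox, tile_size_deg: float) -> List[Dict]:
    """Split a lat/lon bbox into a row-major grid of tiles clipped to the bbox."""
    min_lat, min_lon, max_lat, max_lon = bbox
    if tile_size_deg <= 0 or max_lat <= min_lat or max_lon <= min_lon:
        raise ValueError("bbox needs positive extent and tile_size_deg must be > 0")
    # Round before ceil so float noise (e.g. 1.0000000001) does not add an empty tile
    n_rows = max(1, math.ceil(round((max_lat - min_lat) / tile_size_deg, 9)))
    n_cols = max(1, math.ceil(round((max_lon - min_lon) / tile_size_deg, 9)))
    tiles = []
    for r in range(n_rows):
        for c in range(n_cols):
            lat0 = min_lat + r * tile_size_deg
            lon0 = min_lon + c * tile_size_deg
            tiles.append({
                "id": r * n_cols + c,
                # Clip the last row/column so tiles never extend past the AOI
                "bbox": (lat0, lon0,
                         min(lat0 + tile_size_deg, max_lat),
                         min(lon0 + tile_size_deg, max_lon)),
            })
    return tiles

# The demo AOI is 0.05 x 0.05 degrees, so it maps to one tile
demo_tiles = tile_bbox((-3.500, -60.500, -3.450, -60.450), 0.05)
print(len(demo_tiles), "tile(s)")
```

Each tile record could then be passed to a fetcher and listed in `fetch_manifest.json` in place of the single hard-coded entry.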

## 2) Preprocess → (placeholder)
In a real run, you would reproject, resample, normalize, and stack multi-sensor layers (optical / SAR / DEM). This demo just writes a placeholder preprocess manifest.

In [None]:
preprocess_manifest = {
    "created": RUN_TS,
    "layers": [
        {"name": "optical_rgb", "path": str(tile_png), "normalized": False}
    ]
}
write_json(pathlib.Path(CONFIG['paths']['data_processed']) / 'preprocess_manifest.json', preprocess_manifest)
print('Wrote preprocess_manifest.json')
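
The placeholder manifest records `"normalized": False`. One common way to normalize optical bands is a per-band percentile stretch; the sketch below shows that technique under the assumption of an `(H, W, C)` array. The function name `percentile_stretch` and the 2/98 percentile defaults are illustrative choices, not something the WDE pipeline prescribes.

```python
import numpy as np

def percentile_stretch(img: np.ndarray, lo: float = 2.0, hi: float = 98.0) -> np.ndarray:
    """Rescale each band of an (H, W, C) array to [0, 1] between its lo/hi percentiles."""
    img = img.astype(np.float32)
    out = np.zeros_like(img)
    for b in range(img.shape[2]):
        p_lo, p_hi = np.percentile(img[..., b], [lo, hi])
        if p_hi > p_lo:  # a constant band has no range to stretch and stays all zeros
            out[..., b] = np.clip((img[..., b] - p_lo) / (p_hi - p_lo), 0.0, 1.0)
    return out

# Random stand-in tile; in the notebook this would be the `rgb` array
rng = np.random.default_rng(0)
tile = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
norm = percentile_stretch(tile)
print(norm.shape, norm.dtype, float(norm.min()), float(norm.max()))
```

Clipping at the percentiles keeps a few saturated pixels from compressing the rest of the band, which matters once real imagery replaces the synthetic tile.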

## 3) Detect → Canny Edges + Score
We use a basic edge-density score as a stand-in for a fast coarse anomaly detector. If the score exceeds a threshold, we produce a candidate with a preview PNG and JSON entry.

> Replace with your CV/VLM/DEM model inference (e.g., CLIP caption similarity, hillshade features, RF/UNet, etc.).

In [None]:
if cv2 is None:
    raise RuntimeError('OpenCV not available; cannot run demo detector.')
if plt is None:
    raise RuntimeError('matplotlib not available; cannot render previews.')

# Load back the synthetic tile
img_bgr = cv2.imread(str(tile_png))
if img_bgr is None:
    raise RuntimeError(f'Failed to read {tile_png}')
img_gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)

# Canny
low, high = CONFIG['detect']['canny_low'], CONFIG['detect']['canny_high']
edges = cv2.Canny(img_gray, low, high)
edge_score = int(edges.sum() // 255)  # count of edge pixels
print('Edge score:', edge_score)

# Candidate decision
cand_list = []
if edge_score >= CONFIG['detect']['min_edge_score']:
    cand_id = 0
    cand_png = pathlib.Path(CONFIG['paths']['candidates']) / f'candidate_{cand_id:03d}.png'
    cand_json = pathlib.Path(CONFIG['paths']['candidates']) / f'candidate_{cand_id:03d}.json'

    # Make a side-by-side preview
    fig, ax = plt.subplots(1, 2, figsize=(8, 4), dpi=120)
    ax[0].set_title('optical (demo)')
    ax[0].imshow(cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB))
    ax[0].axis('off')
    ax[1].set_title(f'edges (score={edge_score})')
    ax[1].imshow(edges, cmap='gray')
    ax[1].axis('off')
    fig.tight_layout()
    fig.canvas.draw()
    ensure_dir(cand_png.parent)
    fig.savefig(str(cand_png), bbox_inches='tight')
    plt.close(fig)

    # Save candidate JSON
    cand_rec = {
        "id": cand_id,
        "score": edge_score,
        "threshold": CONFIG['detect']['min_edge_score'],
        "preview": str(cand_png),
        "source_tile": str(tile_png),
        "created": RUN_TS
    }
    write_json(cand_json, cand_rec)
    cand_list.append(cand_rec)
else:
    print('No candidates: score below threshold.')

# Write global candidates list
candidates_manifest = {
    "created": RUN_TS,
    "seed": SEED,
    "detector": "canny_edge_density",
    "params": {"low": low, "high": high, "min_edge_score": CONFIG['detect']['min_edge_score']},
    "candidates": cand_list
}
write_json(pathlib.Path(CONFIG['paths']['candidates']) / 'candidates.json', candidates_manifest)
print(f'Wrote candidates.json with {len(cand_list)} candidate(s).')
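
Because the score is a raw count of edge pixels, `min_edge_score` only makes sense relative to how much edge the tile can contain. For the synthetic disk, the edge is roughly its outline, so the count should be on the order of the circumference, 2πr ≈ 201 for r = 32. The sketch below estimates this without OpenCV by counting mask pixels that touch the background; it is a rough cross-check, not a reimplementation of Canny, and `boundary_pixel_count` is a hypothetical helper.

```python
import math
import numpy as np

def disk_mask(img_size: int, radius: int) -> np.ndarray:
    """Boolean mask of a centered disk, built the same way as the demo tile."""
    cy = cx = img_size // 2
    yy, xx = np.ogrid[:img_size, :img_size]
    return (xx - cx) ** 2 + (yy - cy) ** 2 <= radius ** 2

def boundary_pixel_count(mask: np.ndarray) -> int:
    """Count mask pixels that have at least one 4-neighbor outside the mask."""
    padded = np.pad(mask, 1, mode="constant", constant_values=False)
    # A pixel is interior when its up, down, left, and right neighbors are all set
    all_neighbors_set = (padded[:-2, 1:-1] & padded[2:, 1:-1]
                         & padded[1:-1, :-2] & padded[1:-1, 2:])
    return int((mask & ~all_neighbors_set).sum())

img_size = 256
radius = img_size // 8
outline = boundary_pixel_count(disk_mask(img_size, radius))
print("outline pixels:", outline, "| 2*pi*r =", round(2 * math.pi * radius, 1))
```

Canny's own count can differ somewhat from this estimate, but a threshold in the thousands is far out of reach for this tile, so the threshold has to sit well below the circumference for the demo candidate to be flagged.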

## 4) Verify (demo) → Simple Gate
For the minimal demo we treat the detector’s threshold as the entire verification step. In a real system, this stage should combine ≥2 modalities (e.g., DEM microrelief + vegetation time-series, or SAR + historical text) and produce calibrated uncertainty.

We still emit a verification manifest to wire the stage, making it easy to swap in your real logic later.

In [None]:
verified = []
for c in cand_list:
    c2 = dict(c)
    c2["verified"] = True
    c2["uncertainty_demo"] = 0.25  # placeholder
    verified.append(c2)

verify_manifest = {
    "created": RUN_TS,
    "rule": "score>=min_edge_score",
    "verified": verified
}
write_json(pathlib.Path(CONFIG['paths']['candidates']) / 'verify_manifest.json', verify_manifest)
print('Wrote verify_manifest.json')
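
A real verification stage would require agreement between at least two modalities rather than reusing the detector threshold. The sketch below shows one simple form of such a gate, assuming each modality reports a score in [0, 1]. The function `multi_proof_gate`, the modality names, and the 0.5 pass threshold are all hypothetical, and the returned `uncertainty_demo` is a placeholder, not a calibrated estimate.

```python
from typing import Dict

def multi_proof_gate(evidence: Dict[str, float],
                     pass_threshold: float = 0.5,
                     min_modalities: int = 2) -> Dict:
    """Verify only when at least `min_modalities` independent scores pass the threshold."""
    passed = sorted(name for name, score in evidence.items() if score >= pass_threshold)
    mean_score = sum(evidence.values()) / len(evidence) if evidence else 0.0
    return {
        "verified": len(passed) >= min_modalities,
        "passed": passed,
        # Placeholder only: 1 - mean score is NOT a calibrated uncertainty
        "uncertainty_demo": round(1.0 - mean_score, 3),
    }

# Hypothetical per-modality scores for one candidate
result = multi_proof_gate({
    "dem_microrelief": 0.8,
    "ndvi_seasonal_anomaly": 0.6,
    "sar_backscatter": 0.2,
})
print(result)
```

The returned dict has the same keys the demo cell adds to each candidate (`verified`, `uncertainty_demo`), so it could be merged into `c2` without changing the manifest layout.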

## 5) Dossier → per-candidate bundle
Create a simple dossier for each verified candidate: a Markdown file that embeds the candidate's preview PNG. In the real pipeline this includes maps, overlays, time-series plots, uncertainty panels, and an ethics note/sovereignty banner where applicable.

In [None]:
dossiers = []
for c in verified:
    did = c['id']
    md_path = pathlib.Path(CONFIG['paths']['dossiers']) / f'dossier_{did:03d}.md'
    png_src = pathlib.Path(c['preview'])
    record = {
        "id": did,
        "figure": str(png_src),
        "summary": {
            "score": c['score'],
            "uncertainty_demo": c['uncertainty_demo'],
            "source_tile": c['source_tile']
        }
    }
    md = []
    md.append(f"# Dossier — Candidate {did:03d}\n")
    md.append(f"- Created: {RUN_TS}\n")
    md.append(f"- Source tile: {c['source_tile']}\n")
    md.append(f"- Detector score: {c['score']} (min={CONFIG['detect']['min_edge_score']})\n")
    md.append(f"- Uncertainty (demo): {c['uncertainty_demo']}\n")
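    # Link the figure relative to the dossier so the Markdown still renders after download
    md.append(f"\n![]({os.path.relpath(png_src, md_path.parent)})\n")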
    md.append("\n> NOTE: This is a minimal scaffold dossier. Replace with real maps, DEM hillshades, SAR overlays, NDVI time-series, and sovereignty notices as appropriate.\n")
    ensure_dir(md_path.parent)
    md_path.write_text("".join(md), encoding='utf-8')
    dossiers.append({**record, "path": str(md_path)})  # keep the summary record built above

dossiers_manifest = {
    "created": RUN_TS,
    "count": len(dossiers),
    "items": dossiers
}
write_json(pathlib.Path(CONFIG['paths']['dossiers']) / 'dossiers_manifest.json', dossiers_manifest)
print(f"Wrote dossiers_manifest.json with {len(dossiers)} dossier(s).")
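
The dossier cell links to the preview PNG where it already lives in the candidates directory. If each dossier should be a self-contained bundle, the figure can be copied next to the Markdown file instead. This is a minimal sketch using only the standard library; `bundle_figure` and the `dossier_NNN.png` naming are assumptions chosen to match the existing `dossier_NNN.md` pattern.

```python
import pathlib
import shutil
import tempfile

def bundle_figure(png_src, dossier_dir, did: int) -> str:
    """Copy a candidate preview next to its dossier and return the file name to link."""
    dossier_dir = pathlib.Path(dossier_dir)
    dossier_dir.mkdir(parents=True, exist_ok=True)
    dst = dossier_dir / f"dossier_{did:03d}.png"
    shutil.copy2(png_src, dst)
    return dst.name  # relative to the dossier's own folder

# Demo with a throwaway directory and a stand-in file instead of a real preview
with tempfile.TemporaryDirectory() as tmp:
    src = pathlib.Path(tmp) / "candidates" / "candidate_000.png"
    src.parent.mkdir(parents=True)
    src.write_bytes(b"not a real png")
    link = bundle_figure(src, pathlib.Path(tmp) / "dossiers", 0)
    copied_ok = (pathlib.Path(tmp) / "dossiers" / link).exists()
print(link, copied_ok)
```

The returned name can be used directly in the Markdown image link, which keeps the dossier readable after the folder is downloaded or moved.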

## 6) Final Recap → Manifests & File Tree
We summarize run metadata and print a limited file tree so you can quickly confirm artifacts exist. Use the Kaggle sidebar or the output browser to download your results.

In [None]:
final_manifest = {
    "run": CONFIG['run'],
    "aoi": CONFIG['aoi'],
    "tiling": CONFIG['tiling'],
    "detect": CONFIG['detect'],
    "paths": CONFIG['paths'],
    "artifacts": {
        "candidates_json": str(pathlib.Path(CONFIG['paths']['candidates']) / 'candidates.json'),
        "verify_manifest": str(pathlib.Path(CONFIG['paths']['candidates']) / 'verify_manifest.json'),
        "dossiers_manifest": str(pathlib.Path(CONFIG['paths']['dossiers']) / 'dossiers_manifest.json')
    }
}
write_json(ART_ROOT / 'run_manifest.json', final_manifest)
print('Saved', ART_ROOT / 'run_manifest.json')

print('\n# artifacts/')
for p in limited_tree(ART_ROOT, max_files=200):
    print(' -', p)

print('\n# data/')
for p in limited_tree(DATA_ROOT, max_files=200):
    print(' -', p)

print('\nDone.')
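
The run manifest lists artifact paths but nothing that lets a later run confirm the files are unchanged. One option is to record a SHA-256 digest per file. The sketch below assumes that approach; `sha256_file` and `checksum_tree` are hypothetical helpers, and the demo runs against a temporary directory rather than `ART_ROOT`.

```python
import hashlib
import pathlib
import tempfile

def sha256_file(path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large rasters do not have to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def checksum_tree(root) -> dict:
    """Map each file under root (as a POSIX-style relative path) to its SHA-256 digest."""
    root = pathlib.Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_file(p)
        for p in sorted(root.rglob("*")) if p.is_file()
    }

# Demo on a throwaway tree
with tempfile.TemporaryDirectory() as tmp:
    (pathlib.Path(tmp) / "candidates").mkdir()
    (pathlib.Path(tmp) / "candidates" / "candidates.json").write_bytes(b"abc")
    checksums = checksum_tree(tmp)
print(checksums)
```

In the notebook, `checksum_tree(ART_ROOT)` would need to run before `run_manifest.json` is written, with the result stored under a new key, so that the manifest does not try to hash itself.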

## Next Steps (Swap in your real pipeline)
1. Data Fetchers → Replace the synthetic tile with real data loaders (Sentinel/Landsat/SRTM/NICFI). Keep small AOIs for Kaggle limits.
2. Preprocess → Tile/reproject/normalize stacks; compute DEM hillshades & local relief; clip to AOI.
3. Detect → Add your CV/VLM/DEM detection: edges+Hough, CLIP similarity, UNet segmentation, random forest on environmental layers, etc.
4. Verify → Enforce multi-proof rules (e.g., DEM microrelief + vegetation seasonal anomaly), causal plausibility graph, and calibrated uncertainty.
5. Dossier → Render maps/overlays/plots (NDVI time-series, SAR backscatter, DEM), add sovereignty notes (CARE) when applicable.
6. Configs → Keep all knobs in configs/*.yaml; pin seeds and dataset versions for reproducibility; wire to your CLI.
7. CI → If running in your repo via Actions, run lint/tests/repro checks on each PR; for Kaggle, prefer pinned images + Save & Run All.
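
As an illustration of the DEM hillshade mentioned in steps 2 and 5, the sketch below computes a Lambertian hillshade as the dot product between the surface normal and the light direction. It assumes a north-up grid (row index increasing southward), square cells, and azimuth measured clockwise from north. GIS tools use several slightly different hillshade conventions, so treat this as one reasonable formulation rather than the WDE method.

```python
import numpy as np

def hillshade(dem: np.ndarray, cellsize: float = 1.0,
              azimuth_deg: float = 315.0, altitude_deg: float = 45.0) -> np.ndarray:
    """Lambertian hillshade in [0, 1] for a north-up DEM with square cells."""
    d_row, d_col = np.gradient(dem.astype(np.float64), cellsize)
    dz_dx = d_col    # slope toward east
    dz_dy = -d_row   # slope toward north (rows grow southward)
    az, alt = np.deg2rad(azimuth_deg), np.deg2rad(altitude_deg)
    # Unit vector pointing toward the light, in (east, north, up) coordinates
    lx, ly, lz = np.sin(az) * np.cos(alt), np.cos(az) * np.cos(alt), np.sin(alt)
    # The surface normal is (-dz_dx, -dz_dy, 1) divided by its length
    norm = np.sqrt(dz_dx ** 2 + dz_dy ** 2 + 1.0)
    shade = (-dz_dx * lx - dz_dy * ly + lz) / norm
    return np.clip(shade, 0.0, 1.0)  # faces turned away from the light are black

flat = hillshade(np.zeros((8, 8)))
ramp = np.tile(np.arange(8, dtype=float), (8, 1))   # rises toward the east, faces west
lit_from_west = hillshade(ramp, azimuth_deg=270.0)
lit_from_east = hillshade(ramp, azimuth_deg=90.0)
print(flat[0, 0], lit_from_west[0, 0], lit_from_east[0, 0])
```

Flat terrain shades to sin(altitude), and a 45° slope facing the light reaches full brightness, which gives two quick checks when adapting this to real SRTM tiles (where `cellsize` must be in the same units as elevation).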