Binder-compatible docs / example notebooks by espg · Pull Request #105 · englacial/zagg

espg · 2026-06-25T19:00:45Z

Closes #104

What this does

Makes the example notebooks Binder-launchable and refreshes them against the current API, following the decisions in this issue comment:

Binder env via conda/mamba — a .binder/ repo2docker config (conda environment.yml + postBuild) so Binder builds the geospatial stack cleanly.
[docs] stays mkdocs-only; the [analysis] + [catalog] extras are the notebook runtime — Binder installs zagg[analysis,catalog], no new dependency group (both extras already exist in pyproject.toml).
Three Binder-runnable notebooks (custom_aggregations, rasterized_zarr, jupyterhub_example) plus the cryocloud notebook refreshed to the current API (cryo stays the only Lambda demo and is excluded from Binder).
Lambda-invoke removed from jupyterhub_example (was "section 6"); Lambda lives only in the cryo notebook.
All public-data references point at source.coop — s3://us-west-2.opendata.source.coop/englacial/zagg/benchmarks/ (anonymous read verified this run, see "How tested").

Approach / data rule

Per CLAUDE.md §4, every Binder-runnable cell must read data that is either anonymously reachable on the web or in the git tree. Two anonymous spines:

Read the prebuilt public store at englacial/zagg/benchmarks/atl06_cycle22_fullsphere.zarr (HEALPix order 12, consolidated; variables count/h_mean/h_sigma/h_min/h_max/h_variance/h_q25..75/cell_ids/morton, matching src/zagg/configs/atl06.yaml). custom_aggregations uses only synthetic in-notebook data, so it is anonymous by construction.
Build a catalog from CMR-STAC — this is a pure metadata query and is anonymous (no Earthdata login needed to find granules; only fetching their pixels via agg() needs auth). The default HEALPix mortie coverage backend means the non-PyPI spherely fork is not required either. jupyterhub_example section 3 now builds a small ATL06 catalog this way (guarded with try/except so it degrades cleanly on egress-restricted Binders); only the agg() data-fetch cells stay credential-gated.

Files added / changed (and why)

New .binder/environment.yml — conda-forge Python 3.12 + (trimmed) system geo libs; the Python layer is installed in postBuild so the dep set lives once in pyproject.toml.
New .binder/postBuild — pip install ".[analysis,catalog]" with a hatch-vcs pretend-version guard for tag-less repo2docker checkouts; deliberately does NOT pull the spherely fork or icechunk/hvplot (cryo-only). [catalog] adds stac-geoparquet, which the anonymous catalog-build cell imports.
notebooks/custom_aggregations.ipynb, rasterized_zarr.ipynb, jupyterhub_example.ipynb — refreshed to source.coop data, Binder badge/intro cells, anonymous catalog-build cell, only the data-fetch (agg) cells gated, Lambda section removed from jupyterhub, outputs cleared.
notebooks/cryocloud_example.ipynb — refreshed to the current API (real python -m zagg.catalog CLI, source.coop output store, dropped the git+...xagg.git install), catalog-build comment corrected to note it is anonymous, kept as the sole Lambda demo and marked explicitly not-Binder.
pyproject.toml — unchanged: both [analysis] and [catalog] already exist; the Binder install just uses both.
README.md — Binder badge + a notebook table with per-notebook launch links and the explicit cryo "not Binder-runnable" row.

Phases

Phase 1 — Binder infra. .binder/ conda env + postBuild; PyPI/conda-forge-resolvable only, no git+/fork wheels.
Phase 2 — refresh the three Binder notebooks. Repoint data to the source.coop atl06_cycle22_fullsphere.zarr; fix stale xagg/bucket/ARN naming; remove the Lambda section from jupyterhub_example; gate EDL-only cells; clear outputs; add a Binder intro/badge cell.
Phase 3 — refresh cryocloud_example + extras reconcile. Update cryo to the current API, keep it the sole Lambda demo, mark it explicitly not-Binder.
Phase 4 — launch links. Binder badge in README.md; per-notebook Binder note; explicit "not Binder-able" note in cryo.
Phase 5 — address review feedback. Make the catalog-build step a runnable anonymous cell (catalog building is anonymous; only data fetch needs creds); install [catalog] on Binder; fix parent_order missing from the inline custom configs in custom_aggregations.

How tested

ruff check src tests — the only finding is a pre-existing N818 in src/zagg/registry.py (untouched by this branch; CI lint runs --select=E,F,W,I so it doesn't gate). No notebook/Binder change introduces a lint.
All four notebooks validate as nbformat v4 and ship with outputs cleared.
custom_aggregations.ipynb executed end-to-end clean via jupyter nbconvert --execute (synthetic data, fully offline) — this is what caught the missing-parent_order bug, now fixed.
source.coop anonymous read verified this run: opened atl06_cycle22_fullsphere.zarr group 12 over an unsigned S3 request (consolidated metadata fetched in <1s; chunk reads + decode confirmed). The whitelist for source.coop egress is in effect.
Catalog build is anonymous: confirmed CMRSource().fetch issues an unauthenticated requests.get to cmr.earthdata.nasa.gov/stac (no auth header) and ShardMap.build(..., backend="mortie") needs no spherely fork. End-to-end execution of the CMR query could not be run here because cmr.earthdata.nasa.gov is not on this environment's egress allowlist (only source.coop is) — the build cell is guarded to degrade cleanly in that case.

Questions for review

jupyterhub_example catalog spine — UPDATED. Per your correction, the catalog-build step is now a runnable anonymous cell (CMR-STAC metadata query, backend="mortie", no spherely fork, no Earthdata login). Only the agg() data-fetch cells stay credential-gated. The cell is wrapped in try/except so it degrades cleanly where outbound access to CMR is blocked. OK as-is?
Repo2docker build: the first real repo2docker . build + anonymous-read run is still best validated by you in a real Binder. I can add a CI smoke-build of the Binder image if you want it, but that would touch .github/workflows/, which I won't do without you naming it.
[analysis]/[catalog] — RESOLVED. Both are existing extras; Binder now installs zagg[analysis,catalog] (catalog adds stac-geoparquet, which the anonymous catalog-build cell needs). No new dependency, pyproject.toml untouched.

Generated by Claude Code

espg

🤖 from Claude (review)

Phase 1 adversarial review of the .binder/ repo2docker config. Scope checks against the issue #104 decisions pass: conda/mamba env, [docs] left mkdocs-only, [analysis] extended (not a new group), and no spherely fork or git+ URL on the binder path. No scope creep in the diff.

Two potential build blockers to clear before this phase is green (both inline):
(1) hatch-vcs derives zagg's version from git in pip install ".[analysis]" — confirm the repo2docker checkout exposes a reachable tag, else the build either errors or ships a dev version that diverges from pip install zagg[analysis].
(2) postBuild must be committed with the executable bit (100755), or repo2docker skips it and the image ships without zagg/notebook deps.

Plus three quality findings (inline): the conda geos/proj/libgdal-core layer is probably redundant (pyproj/shapely use vendored wheels; nothing imports GDAL) and the "GEOTIFF" comment doesn't match the libgdal-core dep; consider nodefaults for a reproducible conda-forge-only solve; and the python=3.12-only pin (vs 3.12/3.13) is fine but should be a conscious Lambda-parity choice. The biggest single ask: paste a real repo2docker/jupyter-repo2docker . build log for this branch — most of the above resolves to yes/no from one log. Comment-only; not blocking the notebook phases.

Generated by Claude Code

espg · 2026-06-25T19:02:02Z

+#     those, and it is excluded from Binder because it dispatches to AWS Lambda).
+set -euo pipefail
+
+python -m pip install --no-cache-dir ".[analysis]"


🤖 from Claude (review)

Build blocker: hatch-vcs needs a reachable git tag, and repo2docker's checkout may not provide one. pip install ".[analysis]" builds zagg through hatchling + hatch-vcs (pyproject.toml:1-3, 87-91), which derives the version from git. Two failure modes on the Binder path: (1) if .git is present but no tag is reachable, hatch-vcs emits a 0.1.dev…+g<sha> version — non-fatal but it diverges from a tagged pip install zagg[analysis]; (2) if the build env lacks the tag and setuptools_scm/hatch-vcs can't find git metadata at all, the build errors out (LookupError: setuptools-scm was unable to detect version). repo2docker checks out the repo with .git, but tag availability depends on how Binder cloned it. Two concrete options: (a) add a SETUPTOOLS_SCM_PRETEND_VERSION / HATCH_VCS_PRETEND_VERSION export in this script as a guard, or (b) confirm the Binder clone includes tags and document it. Please verify an actual repo2docker build of this branch reaches green before clearing the phase.

Generated by Claude Code

espg · 2026-06-25T19:02:05Z

@@ -0,0 +1,16 @@
+#!/bin/bash


🤖 from Claude (review)

Executable bit: repo2docker requires postBuild to be executable — it runs the file directly and will skip it (or error) if the mode isn't 0755. The diff can't show the file mode, so please confirm the committed blob has the +x bit set (git ls-files -s .binder/postBuild should show 100755). If it's 100644 the whole pip step silently never runs and the image ships without zagg or the notebook deps.

Generated by Claude Code

espg · 2026-06-25T19:02:19Z

+  - geos
+  - proj
+  - libgdal-core


🤖 from Claude (review)

Conda system layer is likely both redundant and slightly mis-targeted. Reasoning:

(1) pyproj and shapely ship self-contained manylinux wheels that vendor their own PROJ/GEOS — pip installs of those do not link the conda geos/proj here. So for pyproj/shapely these conda packages are inert.

(2) The one dep that genuinely benefits is cartopy>=0.25.0 (analysis extra, pyproject.toml:70): older cartopy built from sdist and needed GEOS/PROJ headers, but recent cartopy publishes wheels too. If the wheel is used, geos/proj are again unused; if a source build is forced, you'd need the dev headers (the conda geos/proj runtime packages do provide headers, so that's fine) — but you'd also need a compiler, which isn't requested here.

(3) libgdal-core (line 27): nothing in the [analysis] set imports GDAL — no rasterio/fiona/gdal. geopandas uses pyogrio/shapely wheels, not GDAL via conda. This line looks dead. The header comment (lines 4-5, 23-24) also says "GEOTIFF" but you install libgdal-core, not libgeotiff; the comment and the dep don't match.

Recommendation: either (a) drop the conda geo layer entirely and rely on PyPI wheels (simplest, matches a local pip install zagg[analysis] most closely), or (b) if you keep it because a real repo2docker build shows cartopy compiling from source, trim to exactly what that build needs and fix the comment. Ground the choice in an actual build log rather than assumption.

Generated by Claude Code

espg · 2026-06-25T19:02:22Z

+channels:
+  - conda-forge


🤖 from Claude (review)

Channel pinning: only conda-forge is listed but nodefaults is not. Depending on the solver config baked into the repo2docker base image, mamba can still append the defaults channel, which mixes ABIs and occasionally breaks the solve. For a reproducible conda-forge-only solve add an explicit nodefaults:

channels: - conda-forge - nodefaults

Minor/defensive, but cheap insurance for "solves the same on every Binder rebuild."

Generated by Claude Code

espg · 2026-06-25T19:02:31Z

+  - conda-forge
+dependencies:
+  # zagg requires Python >=3.12,<3.14; the AWS Lambda runtime is 3.12.
+  - python=3.12


🤖 from Claude (review)

python=3.12 is consistent with requires-python = ">=3.12,<3.14" (pyproject.toml:13) and matches the Lambda runtime — good. Just noting it pins to 3.12 only (not 3.13); that's a defensible choice given Lambda parity, so no change requested, but flagging it explicitly so the divergence from "any of 3.12/3.13" is a conscious decision and not an accident.

Generated by Claude Code

espg · 2026-06-25T19:02:38Z

+# Only PyPI/conda-forge resolvable packages appear here: no ``git+`` URLs and no
+# spherely fork. The three Binder-runnable notebooks
+# (custom_aggregations, rasterized_zarr, jupyterhub_example) read the public,
+# anonymous source.coop store or synthetic data, so the exact-S2 spherely
+# SpatialIndex backend is never on their import path.


🤖 from Claude (review)

Confirmed clean against the issue #104 decisions and the §4 data rule for the parts visible in this phase: no git+ URL and no spherely fork anywhere on the binder path (environment.yml and postBuild:9-13 both explicitly exclude it); [docs] is untouched and stays mkdocs-only (pyproject.toml:128-133, a dependency-group not an extra); [analysis] is reused rather than a new group being added. Two caveats this diff can't yet verify and that later phases must close: (1) the claim here that the three notebooks read "the public, anonymous source.coop store or synthetic data" is unverifiable until the notebooks land — hold the data-reachability check to that phase; (2) [analysis] transitively pulls cartopy/geopandas/xdggs etc. but not spherely, so the "read a prebuilt public store" path must genuinely avoid any SpatialIndex/catalog code that imports spherely — verify in the notebook phase.

Generated by Claude Code

espg

🤖 from Claude (review)

Adversarial review of Phase 2 (the three binder notebooks). Inline JSON-line anchoring on .ipynb is unreliable, so findings are in this body, labeled by notebook + cell index. Verified against current src/zagg/ on the branch and the mortie._healpix source.

Good / confirmed:

Lambda fan-out is fully removed from jupyterhub_example — no backend="lambda", no process-morton-cell ARN, no morton_aggregation_sub store, no xagg leftovers. The HAVE_CATALOG guards are sound and the public source.coop section is self-contained.
All data references are anonymous source.coop (us-west-2.opendata.source.coop/englacial/zagg/benchmarks/atl06_cycle22_fullsphere.zarr, skip_signature=True). No git+...xagg.git, no spherely-fork on a binder path.
Read schema matches the store: group str(child_order)/"12", consolidated=True, zarr_format=3; vars referenced (count/h_mean/h_sigma/h_min/h_max/h_variance/h_q25/h_q50/h_q75/cell_ids) all exist in the store.
Catalog CLI is current: --config / --short-name / --cycle / --polygon, output shardmap_*.json (matches catalog/__init__.py:262). No stale --parent-order / catalog_*.json.
ShardMap keys read correctly: metadata.total_shards / metadata.total_pairs (catalog/shardmap.py:470-471).
agg(...) kwargs (catalog, store, driver, dry_run, max_cells, max_workers, overwrite) and result keys (cells_with_data, total_obs, wall_time_s) all match runner.py:71-140. default_config, get_child_order, load_config, resolve_function, evaluate_expression, load_config_from_dict, validate_config, get_agg_fields, xdggs_spec(parent_order=, child_order=, config=) all exist with the signatures used.
Outputs cleared, execution_count: null, kernelspec python3 across all three. Binder badges present in every notebook.

BLOCKER (1) — pix2ang called with nside instead of depth; breaks the headline viz in two notebooks.

mortie._healpix.pix2ang(depth, pixel) takes depth = HEALPix order (here 12), not nside — confirmed in mortie/_healpix.py:36-47 and its module docstring: "The Rust layer is depth-native (it indexes by HEALPix order/depth, not nside) … take depth directly." nside = 2**depth.

Both notebooks pass 2**CHILD_ORDER (= 4096) as the first arg:

jupyterhub_example.ipynb cell 14: lon, lat = pix2ang(2 ** CHILD_ORDER, cell_ids)
rasterized_zarr.ipynb cell 6: lon, lat = pix2ang(NSIDE, cell_ids.astype(np.int64)) where NSIDE = 2 ** CHILD_ORDER (cell 3)

Passing 4096 where order is expected will error or yield wrong coordinates — this is the cell that produces the actual map in two of the three binder-runnable notebooks. Fix: pass the order, pix2ang(CHILD_ORDER, cell_ids) (i.e. pix2ang(12, ...)). Drop the NSIDE-as-arg usage in rasterized_zarr (keep NSIDE only for the display string in cell 13's attrs).

BLOCKER (2) — calculate_cell_statistics is passed a pd.DataFrame; it expects a dict of numpy arrays. custom_aggregations will crash on Binder.

calculate_cell_statistics(cell_data, ...) treats cell_data as a dict eval-namespace (processing/aggregate.py:290-344): it calls cell_data.values() at line 344 and unpacks **cell_data into the expression namespace. Every test passes a dict (tests/test_processing.py:98-1368); none pass a DataFrame.

custom_aggregations.ipynb builds pd.DataFrame(...) and passes it directly:

cell 14: stats = calculate_cell_statistics(df)
cell 18: calculate_cell_statistics(df, config=cfg_simple)
cell 25: calculate_cell_statistics(df_with_slope, config=cfg_extended)

On a DataFrame, df.values is an ndarray attribute, so df.values() at line 344 raises TypeError: 'numpy.ndarray' object is not callable — the notebook can't run past Part 1. Fix: pass dicts of arrays, e.g. calculate_cell_statistics({"h_li": h_li_arr, "s_li": s_li_arr}), matching the test contract. (The df/pd import in cell 14 becomes unnecessary once converted.)

Minor:

(3) custom_aggregations.ipynb cell 29 and rasterized_zarr.ipynb cell 22 are empty trailing code cells — harmless, but worth deleting for a clean render.

(4) custom_aggregations.ipynb cell 2: print(f"Sections: data_source, aggregation, output") is an f-string with no interpolation (would trip ruff F541 if it were linted source; in a notebook it's cosmetic). Drop the f.

(5) Markdown vs agg() default: jupyterhub_example cell 7 says driver="s3" is faster but only from us-west-2 and the agg() docstring says driver defaults to "s3" — consistent. No action; noting I checked.

Net: two functional blockers (1) and (2) each break a binder-runnable notebook's main output. (3)/(4) are cleanup. Recommend fixing (1) and (2) before clearing draft.

Generated by Claude Code

espg

🤖 from Claude (review)

Fresh-context adversarial review of phase 2 (commit d564bbd) — the three Binder notebooks. Static inspection only: this environment has no geo stack (cartopy/pyproj/obstore) and cannot reach source.coop (egress is blocked, 403 — not retried), so I could not execute the notebooks. Where I could isolate pure-Python logic I ran it directly; those cases are noted. Inline anchoring on the .ipynb JSON didn't land in the diff, so findings are consolidated here with file + cell index + snippet.

I verified API symbols against src/zagg/config.py, src/zagg/processing/aggregate.py, src/zagg/schema.py, src/zagg/configs/atl06.yaml, and mortie 0.8.1 (the pinned version, mortie>=0.8.1) at mortie/_healpix.py.

Blocker 1 — `calculate_cell_statistics(df)` is called with a DataFrame but expects a dict; it raises before computing

custom_aggregations.ipynb, cells 14, 18, 25:

stats = calculate_cell_statistics(df)
stats_simple = calculate_cell_statistics(df, config=cfg_simple)
stats_ext = calculate_cell_statistics(df_with_slope, config=cfg_extended)

The signature is calculate_cell_statistics(cell_data: dict[str, Any], ...) and the body does
n_obs = next((len(v) for v in cell_data.values() if np.ndim(v) != 0), 0) (src/zagg/processing/aggregate.py:344).
On a pandas DataFrame, .values is the ndarray property, so cell_data.values() is a call on an ndarray and raises TypeError: 'numpy.ndarray' object is not callable. I reproduced this in isolation with pandas — it fails on the very first line of the function. These are the only Binder-runnable stats cells, so the cell-0 promise "Runs end-to-end on Binder ... no cloud data, no credentials" does not hold as written.
Fix: pass a column→array dict, e.g. calculate_cell_statistics({c: df[c].values for c in df}), or build the synthetic inputs as dicts directly (the markdown in cell 13/contract even says "dict ... keys are column names; values are numpy arrays"). Add a quick local run of these cells before clearing outputs to catch this.

Blocker 2 — `pix2ang` is passed `nside` where it expects `depth` (HEALPix order)

mortie._healpix.pix2ang(depth, pixel) — "depth is the HEALPix order (nside = 2**depth)" (mortie/_healpix.py:36-39, mortie 0.8.1). The notebooks pass nside (4096), not the order (12):

rasterized_zarr.ipynb cell 6: lon, lat = pix2ang(NSIDE, cell_ids.astype(np.int64)), with NSIDE = 2 ** CHILD_ORDER # 4096 (cell 3).
jupyterhub_example.ipynb cell 14: lon, lat = pix2ang(2 ** CHILD_ORDER, cell_ids).

Both should pass CHILD_ORDER (= 12), i.e. pix2ang(CHILD_ORDER, cell_ids). As written, depth=4096 is far outside the valid HEALPix order range; the Rust healpix crate will error or return garbage coordinates, so the polar-stereo rasterization and the scatter map would be wrong (or raise) on the data cell that actually runs on Binder. I could not execute mortie here to capture the exact failure mode, so please confirm — but the argument is clearly order, not nside. (Flagging, not blocking on the mortie-symbol-existence rule: pix2ang exists and takes depth.)

Items checked and clean

Outputs cleared: all three notebooks — every code cell has execution_count: null and empty outputs. Confirmed via nbformat.
Binder badge + intro cell: present in all three (cell 0 each, mybinder.org/badge_logo.svg + a "what runs where" blurb).
Lambda removed from jupyterhub: no agg(backend="lambda") / ARN invoke cell. The only "lambda" mentions are markdown — cell 0 links to cryocloud_example.ipynb for fan-out, cell 1's credentials table notes AWS is optional "for ... invoking Lambda", cell 7 says "no Lambda needed". All allowed.
Earthdata/credentialed gating: jupyterhub_example.ipynb gates the catalog/processing cells behind HAVE_CATALOG = os.path.exists(catalog_path) (cells 6, 8, 9, 11), each printing a clean "Skipped" message; section headers are marked (requires NASA Earthdata login). The public section (cells 12–14) uses skip_signature=True anonymous reads. Good.
Data source: the anonymous store us-west-2.opendata.source.coop / englacial/zagg/benchmarks/atl06_cycle22_fullsphere.zarr at order 12 is used in both data-reading notebooks; custom_aggregations uses only in-notebook synthetic arrays. No stale buckets, xagg, or ARNs found.
Store variable names match atl06.yaml exactly: cell_ids, morton, count, h_min, h_max, h_mean, h_sigma, h_variance, h_q25, h_q50, h_q75 (rasterized_zarr cell 4; jupyterhub cell 13).
API symbols exist: default_config, load_config, load_config_from_dict, validate_config, resolve_function, evaluate_expression, get_agg_fields, get_child_order (config.py); calculate_cell_statistics (processing/aggregate.py — modulo Blocker 1's input type); xdggs_spec(parent_order, child_order, config=None) (schema.py) — all call sites use 2 positional + config= kw, matches. agg(config, *, catalog=, store=, driver=, dry_run=, max_cells=, max_workers=, overwrite=) — jupyterhub passes config positional + the rest as kw, matches the keyword-only signature. config.catalog / config.output["store"] (jupyterhub cell 16) are valid PipelineConfig fields.
Intra-notebook links all resolve: cryocloud_example.ipynb and custom_aggregations.ipynb both exist in notebooks/.

Minor (non-blocking)

custom_aggregations.ipynb cell 2: print(f"Sections: data_source, aggregation, output") is an f-string with no placeholders — harmless, but ruff/lint may flag F541. Worth dropping the f.

Net: two correctness blockers (both in cells the PR advertises as Binder-runnable), one trivial lint nit; everything else against espg's #104 rulings checks out.

Generated by Claude Code

…dict input

espg

🤖 from Claude (review)

Fresh-context adversarial review of commits since d564bbd: 45f0a5c (pix2ang/cell-statistics fold), b74ce27 (cryocloud refresh), 6fcc2cb (README badges). Static inspection only — no geo stack installed, source.coop egress is blocked (403), and mortie is not importable, so pix2ang semantics were verified against mortie/_healpix.py source rather than by running. Notebooks were not executed.

pix2ang fix — CORRECT. Ground truth: mortie._healpix.pix2ang(depth, pixel) is depth-native, docstring "depth is the HEALPix order (nside = 2**depth)". All three call sites now pass the order (CHILD_ORDER = 12), not NSIDE and not 2**CHILD_ORDER:

rasterized_zarr.ipynb cell 6: pix2ang(CHILD_ORDER, cell_ids.astype(np.int64))
jupyterhub_example.ipynb cell 14: pix2ang(CHILD_ORDER, cell_ids)
cryocloud_example.ipynb cell 21: pix2ang(CHILD_ORDER, cell_ids.astype(np.int64))

NSIDE = 2 ** CHILD_ORDER survives only as a derived label in attrs/comments, never passed to pix2ang. Clean.

calculate_cell_statistics fix — CORRECT. Source src/zagg/processing/aggregate.py:290 takes cell_data: dict[str, Any]. All three custom_aggregations.ipynb call sites (cells 14, 18, 25) now pass dicts (cell, cell, cell_with_slope). import pandas as pd was removed and no live pd./df reference remains — the only surviving DataFrame token is in an explanatory comment ("NOT a DataFrame"), which is fine.

cryo API correctness — CORRECT. Verified against current source:

Catalog CLI (src/zagg/catalog/__init__.py main()): cell 14 uses --config (required, defines the grid), --short-name (required), --cycle, --polygon, --output — all real flags; no --parent-order flag is passed (parent order lives in the config). Matches the source.
agg(...) kwargs (cells 16, 18) all exist in src/zagg/runner.py def agg() (catalog, store, backend, driver, max_cells, max_workers, overwrite). Result keys used (cells_with_data, total_obs, wall_time_s, lambda_time_s, estimated_cost_usd) all exist in the local/lambda return dicts.
default_config, load_config_from_dict, get_agg_fields all exist in src/zagg/config.py; default_config('atl06') resolves (src/zagg/configs/atl06.yaml is present).
--output shardmap_atl06_cycle22.json (cell 14) matches catalog='shardmap_atl06_cycle22.json' in both cells 16 and 18. Consistent.

Not Binder-runnable marking — CORRECT. cryocloud_example carries a clear "Not Binder-runnable" callout in cell 0 and has no Binder badge/link. README table row says "not Binder-runnable".

README badges — CORRECT. All three Binder links are https://mybinder.org/v2/gh/englacial/zagg/main?urlpath=lab/tree/notebooks/<nb>.ipynb; cryo row marked not-runnable.

Data references / stale naming — CLEAN. All stores point to s3://us-west-2.opendata.source.coop/englacial/zagg/benchmarks/atl06_cycle22_fullsphere.zarr. No xagg, no git+...xagg.git, no old-bucket naming. The 429435741471 / 574251165169 account IDs and the process-morton-cell ARN are intentional cross-account context in the (non-Binder) cryo notebook, not stale references.

[analysis] extra reconcile — CLEAN, no missing dep. Every import in the three Binder notebooks is covered: numpy/zarr/obstore/pyproj/shapely/mortie (core), yaml←pyyaml (core), xarray←xarray[io], cartopy/matplotlib (the [analysis] extra); importlib/json/datetime/warnings/logging/os are stdlib. pyproject.toml correctly was not modified — no new dependency is needed.

Outputs cleared — CORRECT. All four notebooks: 0 outputs, 0 non-null execution counts.

Two non-blocking notes for author judgment (neither gates the phase):

cryocloud_example cell 14 references --polygon antarctica.geojson, but no antarctica.geojson exists in the repo tree. Since this notebook is explicitly not Binder-runnable and already needs live AWS + Earthdata, this is cosmetic — but a reader copy-pasting the cell will hit a missing-file error. Consider a one-line note that the user supplies the AOI, or use --bbox.
Same cell uses --config src/zagg/configs/atl06.yaml (a repo-root-relative path) while the notebook lives in notebooks/. Works only if cwd is the repo root. Worth a note, given (1) above.

Net: the two fixes folded from the prior review are correct and complete, the cryo refresh matches the current CLI/runner/config API, and the README wiring is right. Nothing blocking.

Generated by Claude Code

espg · 2026-06-25T19:28:58Z

🤖 from Claude

All four phases landed (cryo refresh + [analysis] reconcile + README launch links pushed as b74ce27/6fcc2cb, plus self-review fixes 45f0a5c/9f9ebde); CI green on 3.12/3.13 + ruff. Folded two correctness bugs the Phase-2 self-review caught — calculate_cell_statistics now gets a column→array dict (not a DataFrame) and pix2ang is called with the HEALPix order, not nside. No new dependency added: [analysis] already covers every Binder import, so pyproject.toml is untouched (Question (3) resolved). Applying waiting on two standing questions for you: (1) is the credential-gated EDL spine in jupyterhub_example OK as-is, and (2) the first real repo2docker build + anonymous source.coop read is unverified here (egress to data.source.coop was policy-blocked this run, not retried) — happy to add a CI smoke-build if you name the workflow.

Generated by Claude Code

espg · 2026-06-26T23:05:59Z

and cannot reach source.coop (egress is blocked, 403 — not retried), so I could not execute the notebooks

the first real repo2docker build + anonymous source.coop read is unverified here (egress to data.source.coop was policy-blocked this run, not retried) — happy to add a CI smoke-build if you name the workflow.

source.coop endpoints have been whitelisted for egress, you should be able to access now

and mortie is not importable

what, why?

Binder spine for jupyterhub_example: catalog-build + agg-from-NSIDC need Earthdata login (and catalog needs the spherely fork), so they can't execute anonymously on Binder. Implemented as: those stay credential-gated cells (skip cleanly when HAVE_CATALOG is false) and the anonymous public-store read/viz is the part that runs on Binder. OK as-is, or would you rather those cells be markdown-only illustrations?

this isn't true -- for catalog building, anonymous access is allowed. Fetching data does require creds though...

[analysis] extension — RESOLVED. Every Binder-notebook import is covered by core deps + [analysis] (cartopy/matplotlib/xarray/notebook in [analysis]; numpy/obstore/zarr/mortie/pyproj/pyyaml core), so no new dep was added and pyproject.toml is untouched.

Great!

…m configs

espg

🤖 from Claude (review)

Adversarial review of just 31787cc ("fold #105 review: anonymous catalog-build cell, parent_order in custom configs"). I verified the changes against the repo source and found no correctness bugs; all four files check out. Inline notes below are confirmations plus one minor doc-consistency nit.

What I verified:

The new catalog cell's API usage is correct: ShardMap.build(catalog, grid, *, backend=...) signature matches (positional cat, grid, keyword backend="mortie"); Query fields (short_name/version/start_date/end_date/region) and query.collection exist; Catalog.__len__, sm.shard_keys, and sm.metadata["total_pairs"] all exist; from_config(config) takes the config object. HAVE_CATALOG is set on both the try and except paths, so the downstream if HAVE_CATALOG cells are always defined.
"Anonymous / no spherely on Binder" is accurate: CMRSource._search uses plain requests.get/post with no auth headers; shardmap.py has no top-level spherely import (it's lazy via _spherely_available()), and backend="mortie" bypasses it entirely.
stac-geoparquet is genuinely needed (CMRSource.fetch imports stac_geoparquet.arrow) and is NOT in the analysis extra, so adding catalog to postBuild is correct and necessary. catalog = ["stac-geoparquet>=0.7.0"] is an existing declared extra, not a new dependency.
The parent_order=6 additions are consistent with the xdggs_spec(parent_order=6, ...) calls — no mismatch.
broken_yaml now fails for the intended reason — see inline note.

One non-blocking nit (line not in this diff, so noting here): the Part-section "Summary" table still lists Output grid settings as type:, indexing_scheme:, child_order: and omits parent_order:, which is now required for healpix. Worth adding for consistency with the configs this commit just fixed.

Generated by Claude Code

espg

🤖 from Claude (review)

Adversarial review of just 31787cc ("fold #105 review: anonymous catalog-build cell, parent_order in custom configs"), checked against the repo source. No correctness bugs found; all four files check out. Inline notes are confirmations plus one minor doc nit below.

Verified: the new catalog cell's API usage (ShardMap.build/Query/Catalog.__len__/shard_keys/metadata["total_pairs"]/from_config) and its HAVE_CATALOG wiring on both branches; the anonymity + no-spherely-on-Binder claims (bare requests in CMRSource._search, lazy spherely, mortie backend); that stac-geoparquet is genuinely needed and is an existing catalog extra (not a new dep); that parent_order=6 is consistent with the xdggs_spec(parent_order=6, ...) calls; and that broken_yaml now fails for the intended missing-dh_fit_dx reason rather than missing parent_order.

Non-blocking nit (line not in this diff, so noted here not inline): the "Summary" table near the end of custom_aggregations.ipynb still lists Output grid settings as type:, indexing_scheme:, child_order: and omits parent_order:, which this commit just established as required for healpix. Worth adding for consistency.

Generated by Claude Code

espg · 2026-06-27T02:10:30Z

+   "id": "b964db81",
+   "metadata": {},
+   "outputs": [],
+   "source": "# Build a small ATL06 catalog anonymously (CMR-STAC metadata only -- no login).\n# This is the same work `python -m zagg.catalog ...` does, via the Python API.\nfrom zagg.catalog.shardmap import ShardMap\nfrom zagg.catalog.sources import CMRSource, Query\nfrom zagg.grids import from_config\n\ncatalog_path = \"shardmap_ATL06_2025-04-01_2025-04-03.json\"\ngrid = from_config(config)\n\n# Small bbox + a few days so the build is quick on Binder.\nquery = Query(\n    short_name=\"ATL06\",\n    version=\"006\",\n    start_date=\"2025-04-01\",\n    end_date=\"2025-04-03\",\n    region=(-70.0, -78.0, -65.0, -75.0),  # lon_min, lat_min, lon_max, lat_max\n)\n\ntry:\n    cat = CMRSource().fetch(query)\n    print(f\"Fetched {len(cat)} granules ({query.collection})\")\n    # backend=\"mortie\": HEALPix coverage, so no spherely fork needed.\n    sm = ShardMap.build(cat, grid, backend=\"mortie\")\n    sm.to_json(catalog_path)\n    HAVE_CATALOG = True\n    print(f\"ShardMap: {len(sm.shard_keys)} shards, \"\n          f\"{sm.metadata['total_pairs']} pairs -> {catalog_path}\")\nexcept Exception as exc:\n    # CMR-STAC is anonymous but needs outbound network; degrade cleanly on an\n    # offline / egress-restricted Binder so the rest of the notebook still runs.\n    HAVE_CATALOG = False\n    print(f\"Catalog build skipped ({type(exc).__name__}: {exc}).\\n\"\n          \"If this environment can't reach cmr.earthdata.nasa.gov, jump to the \"\n          \"public-result section, which reads source.coop directly.\")"


🤖 from Claude (review)

Catalog cell checks out against source. ShardMap.build(cls, catalog, grid, *, backend=...) matches the positional cat, grid + keyword backend="mortie"; Query fields and query.collection, Catalog.__len__, sm.shard_keys, and sm.metadata["total_pairs"] all exist; from_config(config) takes the config object. HAVE_CATALOG is assigned on both the try and except branches, so the downstream if HAVE_CATALOG cells never hit an undefined name. The anonymity claim is grounded: CMRSource._search uses bare requests.get/post with no auth headers.

Generated by Claude Code

espg · 2026-06-27T02:10:36Z

-    "    print(f\"Validation error: {e}\")"
-   ]
+   "outputs": [],
+   "source": "# Same config but without dh_fit_dx in data_source.variables\nbroken_yaml = \"\"\"\ndata_source:\n  reader: h5coro\n  groups: [gt1l, gt1r, gt2l, gt2r, gt3l, gt3r]\n  coordinates:\n    latitude: \"/{group}/land_ice_segments/latitude\"\n    longitude: \"/{group}/land_ice_segments/longitude\"\n  variables:\n    h_li: \"/{group}/land_ice_segments/h_li\"\n    s_li: \"/{group}/land_ice_segments/h_li_sigma\"\n    # dh_fit_dx is missing!\n\naggregation:\n  coordinates:\n    cell_ids: {dtype: uint64, fill_value: 0}\n    morton:   {dtype: int64,  fill_value: 0}\n  variables:\n    count: {function: len, source: h_li, dtype: int32, fill_value: 0}\n    slope_mean: {function: mean, source: dh_fit_dx, dtype: float32}\n\noutput:\n  grid:\n    type: healpix\n    indexing_scheme: nested\n    parent_order: 6\n    child_order: 12\n\"\"\"\n\ntry:\n    validate_config(load_config_from_dict(yaml.safe_load(broken_yaml)))\nexcept ValueError as e:\n    print(f\"Validation error: {e}\")"


🤖 from Claude (review)

Verified broken_yaml now fails for the intended reason. In validate_config, the output.grid.parent_order is required for healpix grid check runs near the top, before the per-variable source-reference check (source '...' not in data_source.variables). So pre-fix, the missing parent_order would have short-circuited and raised the grid error, masking the missing-dh_fit_dx error the surrounding markdown advertises. Adding parent_order: 6 makes the cell raise the intended source 'dh_fit_dx' not in data_source.variables, matching cell 23's "What if we forget to add dh_fit_dx... validate_config() catches it." Same fix correctly unblocks simple_yaml/extended_yaml, and the parent_order=6 value is consistent with the xdggs_spec(parent_order=6, ...) calls later in the notebook.

Generated by Claude Code

espg · 2026-06-27T02:10:51Z

+# irrelevant for running the example notebooks).
+git fetch --tags --unshallow 2>/dev/null || git fetch --tags 2>/dev/null || true
+if git describe --tags --abbrev=0 >/dev/null 2>&1; then
+    python -m pip install --no-cache-dir ".[analysis,catalog]"


🤖 from Claude (review)

Confirmed correct and necessary: the new catalog cell calls CMRSource.fetch, which does import stac_geoparquet.arrow. stac-geoparquet is in the catalog extra only (not analysis), so without adding catalog here the build cell would ImportError on Binder. catalog = ["stac-geoparquet>=0.7.0"] is a pre-existing declared extra, so this is not a new dependency. The "no spherely" comment is also accurate — backend="mortie" plus no top-level spherely import in shardmap.py.

Generated by Claude Code

espg · 2026-06-27T02:13:08Z

🤖 from Claude

Addressed your three review points (pushed 31787cc + 6e45d6b; CI green):

1. mortie not importable — root cause. It is a not-installed problem, not a repo/declaration bug. mortie>=0.8.1 is a core dependencies entry (not behind an extra) and ships a cp310-abi3 manylinux wheel that imports cleanly: in a synced env import mortie gives 0.8.2 and from mortie._healpix import pix2ang resolves, which is all the notebooks use. My earlier "not importable" came from a run whose kernel did not have zagg[analysis] actually installed (so mortie was simply absent) — not from anything wrong on the Binder path, where postBuild's pip install pulls mortie via core deps. So no code fix was needed for the import itself; the real gap was (3) below.

2. source.coop egress — verified anonymous read. With the whitelist in place I can now reach it. I opened s3://us-west-2.opendata.source.coop/englacial/zagg/benchmarks/atl06_cycle22_fullsphere.zarr group 12 over an unsigned S3Store(skip_signature=True) request: consolidated metadata fetched in <1s (11 data vars, 201,326,592 cells), and bounded array-chunk reads (millions of elements) decoded and fed through pix2ang(12, …) cleanly. The custom_aggregations notebook (synthetic, fully offline) executed end-to-end clean under jupyter nbconvert --execute — that run is what surfaced the parent_order bug now fixed. (A full-store nbconvert of rasterized_zarr reads all ten 201M-cell arrays; the bytes are flowing fine over the whitelisted link, it's just a large sustained download.)

3. catalog building is anonymous — fixed. You're right. CMRSource._search issues a plain unauthenticated requests.get/post to cmr.earthdata.nasa.gov/stac (no auth header), and ShardMap.build(..., backend="mortie") uses the HEALPix MOC backend, so the spherely fork isn't needed either. jupyterhub_example section 3 is now a runnable anonymous catalog-build cell (small ATL06 bbox + a few days, backend="mortie"), guarded with try/except so it degrades cleanly where outbound CMR access is blocked. Only the agg() data-fetch cells stay credential-gated; the "What runs where" table and section 4/6 prose are updated to match, and the same (incorrect) "needs Earthdata" note in the cryo catalog comment is corrected. This needs stac-geoparquet (imported by CMRSource.fetch), which lives in the existing [catalog] extra — so .binder/postBuild now installs zagg[analysis,catalog] (both pre-existing extras; no new dependency, pyproject.toml untouched). The parent_order fix in custom_aggregations' inline configs was a separate correctness bug validate_config would have raised on Binder.

One note I could not close here: the CMR-STAC build can't be executed end-to-end in this environment because cmr.earthdata.nasa.gov is not on this run's egress allowlist (only source.coop is) — proxy returns 403 on CONNECT. I confirmed the code path is anonymous (it reaches the unauthenticated request before failing purely on egress), and the cell is written to degrade cleanly, but the first real CMR build is best validated by you in a real Binder. Questions for review updated accordingly; left as draft with implement for your look.

Generated by Claude Code

…dict input

…m configs

phase 1 of issue #104

619e84b

espg added the implement label Jun 25, 2026

espg commented Jun 25, 2026

View reviewed changes

espg added 2 commits June 25, 2026 19:03

fold #105 review: hatch-vcs version guard, trim binder conda geo layer

a7e46be

phase 2 of issue #104

d564bbd

espg commented Jun 25, 2026

View reviewed changes

claude added 3 commits June 25, 2026 19:24

fold #105 review: pix2ang order not nside, calculate_cell_statistics …

45f0a5c

…dict input

phase 3 of issue #104

b74ce27

phase 4 of issue #104

6fcc2cb

espg commented Jun 25, 2026

View reviewed changes

fold #105 review: clarify cryo catalog AOI / output filename

9f9ebde

espg added the waiting label Jun 25, 2026

espg mentioned this pull request Jun 25, 2026

Claude Summary Tracking (June 2026) #50

Open

espg removed the waiting label Jun 26, 2026

This was referenced Jun 26, 2026

Example notebook for the t-digest read helpers #106

Merged

Binder compatible docs / example notebooks #104

Closed

espg pushed a commit that referenced this pull request Jun 27, 2026

phase 3 of issue #79: align notebook with #105 .binder env

bb206d3

fold #105 review: anonymous catalog-build cell, parent_order in custo…

31787cc

…m configs

espg commented Jun 27, 2026

View reviewed changes

fold #105 review: note parent_order in custom_aggregations summary table

6e45d6b

espg added the waiting label Jun 27, 2026

espg mentioned this pull request Jun 27, 2026

Optional strict-AOI cell mask (issue #101) #111

Draft

7 tasks

espg marked this pull request as ready for review June 27, 2026 05:42

espg merged commit 046342f into main Jun 27, 2026
9 checks passed

espg added a commit that referenced this pull request Jun 27, 2026

fold #105 review: hatch-vcs version guard, trim binder conda geo layer

7031645

espg pushed a commit that referenced this pull request Jun 27, 2026

fold #105 review: pix2ang order not nside, calculate_cell_statistics …

810e690

…dict input

espg pushed a commit that referenced this pull request Jun 27, 2026

fold #105 review: clarify cryo catalog AOI / output filename

959a194

espg pushed a commit that referenced this pull request Jun 27, 2026

fold #105 review: anonymous catalog-build cell, parent_order in custo…

c5d49b6

…m configs

espg deleted the claude/104-binder-docs branch June 27, 2026 05:42

espg pushed a commit that referenced this pull request Jun 27, 2026

phase 3 of issue #79: align notebook with #105 .binder env

69df560

espg mentioned this pull request Jun 27, 2026

shardmap viewer: headless render core + ipyleaflet wrapper #44

Open

1 task

espg pushed a commit that referenced this pull request Jun 27, 2026

align shardmap viewer notebook with #105 binder env (#38)

fc50ea4

espg mentioned this pull request Jun 27, 2026

Follow up items for the week of June 27, 2026 #114

Open

Uh oh!

Conversation

espg commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this does

Approach / data rule

Files added / changed (and why)

Phases

How tested

Questions for review

Uh oh!

espg left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

espg left a comment

Choose a reason for hiding this comment

Uh oh!

espg left a comment

Choose a reason for hiding this comment

Blocker 1 — calculate_cell_statistics(df) is called with a DataFrame but expects a dict; it raises before computing

Blocker 2 — pix2ang is passed nside where it expects depth (HEALPix order)

Items checked and clean

Minor (non-blocking)

Uh oh!

espg left a comment

Choose a reason for hiding this comment

Uh oh!

espg commented Jun 25, 2026

Uh oh!

espg commented Jun 26, 2026

Uh oh!

espg left a comment

Choose a reason for hiding this comment

Uh oh!

espg left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

espg commented Jun 27, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

espg commented Jun 25, 2026 •

edited

Loading

Blocker 1 — `calculate_cell_statistics(df)` is called with a DataFrame but expects a dict; it raises before computing

Blocker 2 — `pix2ang` is passed `nside` where it expects `depth` (HEALPix order)