Binder-compatible docs / example notebooks#105
Conversation
espg
left a comment
There was a problem hiding this comment.
🤖 from Claude (review)
Phase 1 adversarial review of the .binder/ repo2docker config. Scope checks against the issue #104 decisions pass: conda/mamba env, [docs] left mkdocs-only, [analysis] extended (not a new group), and no spherely fork or git+ URL on the binder path. No scope creep in the diff.
Two potential build blockers to clear before this phase is green (both inline):
(1) hatch-vcs derives zagg's version from git in pip install ".[analysis]" — confirm the repo2docker checkout exposes a reachable tag, else the build either errors or ships a dev version that diverges from pip install zagg[analysis].
(2) postBuild must be committed with the executable bit (100755), or repo2docker skips it and the image ships without zagg/notebook deps.
Plus three quality findings (inline): the conda geos/proj/libgdal-core layer is probably redundant (pyproj/shapely use vendored wheels; nothing imports GDAL) and the "GEOTIFF" comment doesn't match the libgdal-core dep; consider nodefaults for a reproducible conda-forge-only solve; and the python=3.12-only pin (vs 3.12/3.13) is fine but should be a conscious Lambda-parity choice. The biggest single ask: paste a real repo2docker/jupyter-repo2docker . build log for this branch — most of the above resolves to yes/no from one log. Comment-only; not blocking the notebook phases.
Generated by Claude Code
| # those, and it is excluded from Binder because it dispatches to AWS Lambda). | ||
| set -euo pipefail | ||
|
|
||
| python -m pip install --no-cache-dir ".[analysis]" |
There was a problem hiding this comment.
🤖 from Claude (review)
Build blocker: hatch-vcs needs a reachable git tag, and repo2docker's checkout may not provide one. pip install ".[analysis]" builds zagg through hatchling + hatch-vcs (pyproject.toml:1-3, 87-91), which derives the version from git. Two failure modes on the Binder path: (1) if .git is present but no tag is reachable, hatch-vcs emits a 0.1.dev…+g<sha> version — non-fatal but it diverges from a tagged pip install zagg[analysis]; (2) if the build env lacks the tag and setuptools_scm/hatch-vcs can't find git metadata at all, the build errors out (LookupError: setuptools-scm was unable to detect version). repo2docker checks out the repo with .git, but tag availability depends on how Binder cloned it. Two concrete options: (a) add a SETUPTOOLS_SCM_PRETEND_VERSION / HATCH_VCS_PRETEND_VERSION export in this script as a guard, or (b) confirm the Binder clone includes tags and document it. Please verify an actual repo2docker build of this branch reaches green before clearing the phase.
Generated by Claude Code
| @@ -0,0 +1,16 @@ | |||
| #!/bin/bash | |||
There was a problem hiding this comment.
🤖 from Claude (review)
Executable bit: repo2docker requires postBuild to be executable — it runs the file directly and will skip it (or error) if the mode isn't 0755. The diff can't show the file mode, so please confirm the committed blob has the +x bit set (git ls-files -s .binder/postBuild should show 100755). If it's 100644 the whole pip step silently never runs and the image ships without zagg or the notebook deps.
Generated by Claude Code
| - geos | ||
| - proj | ||
| - libgdal-core |
There was a problem hiding this comment.
🤖 from Claude (review)
Conda system layer is likely both redundant and slightly mis-targeted. Reasoning:
(1) pyproj and shapely ship self-contained manylinux wheels that vendor their own PROJ/GEOS — pip installs of those do not link the conda geos/proj here. So for pyproj/shapely these conda packages are inert.
(2) The one dep that genuinely benefits is cartopy>=0.25.0 (analysis extra, pyproject.toml:70): older cartopy built from sdist and needed GEOS/PROJ headers, but recent cartopy publishes wheels too. If the wheel is used, geos/proj are again unused; if a source build is forced, you'd need the dev headers (the conda geos/proj runtime packages do provide headers, so that's fine) — but you'd also need a compiler, which isn't requested here.
(3) libgdal-core (line 27): nothing in the [analysis] set imports GDAL — no rasterio/fiona/gdal. geopandas uses pyogrio/shapely wheels, not GDAL via conda. This line looks dead. The header comment (lines 4-5, 23-24) also says "GEOTIFF" but you install libgdal-core, not libgeotiff; the comment and the dep don't match.
Recommendation: either (a) drop the conda geo layer entirely and rely on PyPI wheels (simplest, matches a local pip install zagg[analysis] most closely), or (b) if you keep it because a real repo2docker build shows cartopy compiling from source, trim to exactly what that build needs and fix the comment. Ground the choice in an actual build log rather than assumption.
Generated by Claude Code
| channels: | ||
| - conda-forge |
There was a problem hiding this comment.
🤖 from Claude (review)
Channel pinning: only conda-forge is listed but nodefaults is not. Depending on the solver config baked into the repo2docker base image, mamba can still append the defaults channel, which mixes ABIs and occasionally breaks the solve. For a reproducible conda-forge-only solve add an explicit nodefaults:
channels:
- conda-forge
- nodefaultsMinor/defensive, but cheap insurance for "solves the same on every Binder rebuild."
Generated by Claude Code
| - conda-forge | ||
| dependencies: | ||
| # zagg requires Python >=3.12,<3.14; the AWS Lambda runtime is 3.12. | ||
| - python=3.12 |
There was a problem hiding this comment.
🤖 from Claude (review)
python=3.12 is consistent with requires-python = ">=3.12,<3.14" (pyproject.toml:13) and matches the Lambda runtime — good. Just noting it pins to 3.12 only (not 3.13); that's a defensible choice given Lambda parity, so no change requested, but flagging it explicitly so the divergence from "any of 3.12/3.13" is a conscious decision and not an accident.
Generated by Claude Code
| # Only PyPI/conda-forge resolvable packages appear here: no ``git+`` URLs and no | ||
| # spherely fork. The three Binder-runnable notebooks | ||
| # (custom_aggregations, rasterized_zarr, jupyterhub_example) read the public, | ||
| # anonymous source.coop store or synthetic data, so the exact-S2 spherely | ||
| # SpatialIndex backend is never on their import path. |
There was a problem hiding this comment.
🤖 from Claude (review)
Confirmed clean against the issue #104 decisions and the §4 data rule for the parts visible in this phase: no git+ URL and no spherely fork anywhere on the binder path (environment.yml and postBuild:9-13 both explicitly exclude it); [docs] is untouched and stays mkdocs-only (pyproject.toml:128-133, a dependency-group not an extra); [analysis] is reused rather than a new group being added. Two caveats this diff can't yet verify and that later phases must close: (1) the claim here that the three notebooks read "the public, anonymous source.coop store or synthetic data" is unverifiable until the notebooks land — hold the data-reachability check to that phase; (2) [analysis] transitively pulls cartopy/geopandas/xdggs etc. but not spherely, so the "read a prebuilt public store" path must genuinely avoid any SpatialIndex/catalog code that imports spherely — verify in the notebook phase.
Generated by Claude Code
espg
left a comment
There was a problem hiding this comment.
🤖 from Claude (review)
Adversarial review of Phase 2 (the three binder notebooks). Inline JSON-line anchoring on .ipynb is unreliable, so findings are in this body, labeled by notebook + cell index. Verified against current src/zagg/ on the branch and the mortie._healpix source.
Good / confirmed:
- Lambda fan-out is fully removed from
jupyterhub_example— nobackend="lambda", noprocess-morton-cellARN, nomorton_aggregation_substore, noxaggleftovers. The HAVE_CATALOG guards are sound and the public source.coop section is self-contained. - All data references are anonymous source.coop (
us-west-2.opendata.source.coop/englacial/zagg/benchmarks/atl06_cycle22_fullsphere.zarr,skip_signature=True). Nogit+...xagg.git, no spherely-fork on a binder path. - Read schema matches the store: group
str(child_order)/"12",consolidated=True,zarr_format=3; vars referenced (count/h_mean/h_sigma/h_min/h_max/h_variance/h_q25/h_q50/h_q75/cell_ids) all exist in the store. - Catalog CLI is current:
--config/--short-name/--cycle/--polygon, outputshardmap_*.json(matchescatalog/__init__.py:262). No stale--parent-order/catalog_*.json. - ShardMap keys read correctly:
metadata.total_shards/metadata.total_pairs(catalog/shardmap.py:470-471). agg(...)kwargs (catalog,store,driver,dry_run,max_cells,max_workers,overwrite) and result keys (cells_with_data,total_obs,wall_time_s) all matchrunner.py:71-140.default_config,get_child_order,load_config,resolve_function,evaluate_expression,load_config_from_dict,validate_config,get_agg_fields,xdggs_spec(parent_order=, child_order=, config=)all exist with the signatures used.- Outputs cleared,
execution_count: null, kernelspecpython3across all three. Binder badges present in every notebook.
BLOCKER (1) — pix2ang called with nside instead of depth; breaks the headline viz in two notebooks.
mortie._healpix.pix2ang(depth, pixel) takes depth = HEALPix order (here 12), not nside — confirmed in mortie/_healpix.py:36-47 and its module docstring: "The Rust layer is depth-native (it indexes by HEALPix order/depth, not nside) … take depth directly." nside = 2**depth.
Both notebooks pass 2**CHILD_ORDER (= 4096) as the first arg:
jupyterhub_example.ipynbcell 14:lon, lat = pix2ang(2 ** CHILD_ORDER, cell_ids)rasterized_zarr.ipynbcell 6:lon, lat = pix2ang(NSIDE, cell_ids.astype(np.int64))whereNSIDE = 2 ** CHILD_ORDER(cell 3)
Passing 4096 where order is expected will error or yield wrong coordinates — this is the cell that produces the actual map in two of the three binder-runnable notebooks. Fix: pass the order, pix2ang(CHILD_ORDER, cell_ids) (i.e. pix2ang(12, ...)). Drop the NSIDE-as-arg usage in rasterized_zarr (keep NSIDE only for the display string in cell 13's attrs).
BLOCKER (2) — calculate_cell_statistics is passed a pd.DataFrame; it expects a dict of numpy arrays. custom_aggregations will crash on Binder.
calculate_cell_statistics(cell_data, ...) treats cell_data as a dict eval-namespace (processing/aggregate.py:290-344): it calls cell_data.values() at line 344 and unpacks **cell_data into the expression namespace. Every test passes a dict (tests/test_processing.py:98-1368); none pass a DataFrame.
custom_aggregations.ipynb builds pd.DataFrame(...) and passes it directly:
- cell 14:
stats = calculate_cell_statistics(df) - cell 18:
calculate_cell_statistics(df, config=cfg_simple) - cell 25:
calculate_cell_statistics(df_with_slope, config=cfg_extended)
On a DataFrame, df.values is an ndarray attribute, so df.values() at line 344 raises TypeError: 'numpy.ndarray' object is not callable — the notebook can't run past Part 1. Fix: pass dicts of arrays, e.g. calculate_cell_statistics({"h_li": h_li_arr, "s_li": s_li_arr}), matching the test contract. (The df/pd import in cell 14 becomes unnecessary once converted.)
Minor:
(3) custom_aggregations.ipynb cell 29 and rasterized_zarr.ipynb cell 22 are empty trailing code cells — harmless, but worth deleting for a clean render.
(4) custom_aggregations.ipynb cell 2: print(f"Sections: data_source, aggregation, output") is an f-string with no interpolation (would trip ruff F541 if it were linted source; in a notebook it's cosmetic). Drop the f.
(5) Markdown vs agg() default: jupyterhub_example cell 7 says driver="s3" is faster but only from us-west-2 and the agg() docstring says driver defaults to "s3" — consistent. No action; noting I checked.
Net: two functional blockers (1) and (2) each break a binder-runnable notebook's main output. (3)/(4) are cleanup. Recommend fixing (1) and (2) before clearing draft.
Generated by Claude Code
espg
left a comment
There was a problem hiding this comment.
🤖 from Claude (review)
Fresh-context adversarial review of phase 2 (commit d564bbd) — the three Binder notebooks. Static inspection only: this environment has no geo stack (cartopy/pyproj/obstore) and cannot reach source.coop (egress is blocked, 403 — not retried), so I could not execute the notebooks. Where I could isolate pure-Python logic I ran it directly; those cases are noted. Inline anchoring on the .ipynb JSON didn't land in the diff, so findings are consolidated here with file + cell index + snippet.
I verified API symbols against src/zagg/config.py, src/zagg/processing/aggregate.py, src/zagg/schema.py, src/zagg/configs/atl06.yaml, and mortie 0.8.1 (the pinned version, mortie>=0.8.1) at mortie/_healpix.py.
Blocker 1 — calculate_cell_statistics(df) is called with a DataFrame but expects a dict; it raises before computing
custom_aggregations.ipynb, cells 14, 18, 25:
stats = calculate_cell_statistics(df)
stats_simple = calculate_cell_statistics(df, config=cfg_simple)
stats_ext = calculate_cell_statistics(df_with_slope, config=cfg_extended)
The signature is calculate_cell_statistics(cell_data: dict[str, Any], ...) and the body does
n_obs = next((len(v) for v in cell_data.values() if np.ndim(v) != 0), 0) (src/zagg/processing/aggregate.py:344).
On a pandas DataFrame, .values is the ndarray property, so cell_data.values() is a call on an ndarray and raises TypeError: 'numpy.ndarray' object is not callable. I reproduced this in isolation with pandas — it fails on the very first line of the function. These are the only Binder-runnable stats cells, so the cell-0 promise "Runs end-to-end on Binder ... no cloud data, no credentials" does not hold as written.
Fix: pass a column→array dict, e.g. calculate_cell_statistics({c: df[c].values for c in df}), or build the synthetic inputs as dicts directly (the markdown in cell 13/contract even says "dict ... keys are column names; values are numpy arrays"). Add a quick local run of these cells before clearing outputs to catch this.
Blocker 2 — pix2ang is passed nside where it expects depth (HEALPix order)
mortie._healpix.pix2ang(depth, pixel) — "depth is the HEALPix order (nside = 2**depth)" (mortie/_healpix.py:36-39, mortie 0.8.1). The notebooks pass nside (4096), not the order (12):
rasterized_zarr.ipynbcell 6:lon, lat = pix2ang(NSIDE, cell_ids.astype(np.int64)), withNSIDE = 2 ** CHILD_ORDER # 4096(cell 3).jupyterhub_example.ipynbcell 14:lon, lat = pix2ang(2 ** CHILD_ORDER, cell_ids).
Both should pass CHILD_ORDER (= 12), i.e. pix2ang(CHILD_ORDER, cell_ids). As written, depth=4096 is far outside the valid HEALPix order range; the Rust healpix crate will error or return garbage coordinates, so the polar-stereo rasterization and the scatter map would be wrong (or raise) on the data cell that actually runs on Binder. I could not execute mortie here to capture the exact failure mode, so please confirm — but the argument is clearly order, not nside. (Flagging, not blocking on the mortie-symbol-existence rule: pix2ang exists and takes depth.)
Items checked and clean
- Outputs cleared: all three notebooks — every code cell has
execution_count: nulland emptyoutputs. Confirmed via nbformat. - Binder badge + intro cell: present in all three (cell 0 each,
mybinder.org/badge_logo.svg+ a "what runs where" blurb). - Lambda removed from jupyterhub: no
agg(backend="lambda")/ ARN invoke cell. The only "lambda" mentions are markdown — cell 0 links tocryocloud_example.ipynbfor fan-out, cell 1's credentials table notes AWS is optional "for ... invoking Lambda", cell 7 says "no Lambda needed". All allowed. - Earthdata/credentialed gating:
jupyterhub_example.ipynbgates the catalog/processing cells behindHAVE_CATALOG = os.path.exists(catalog_path)(cells 6, 8, 9, 11), each printing a clean "Skipped" message; section headers are marked (requires NASA Earthdata login). The public section (cells 12–14) usesskip_signature=Trueanonymous reads. Good. - Data source: the anonymous store
us-west-2.opendata.source.coop/englacial/zagg/benchmarks/atl06_cycle22_fullsphere.zarrat order 12 is used in both data-reading notebooks; custom_aggregations uses only in-notebook synthetic arrays. No stale buckets,xagg, or ARNs found. - Store variable names match
atl06.yamlexactly:cell_ids, morton, count, h_min, h_max, h_mean, h_sigma, h_variance, h_q25, h_q50, h_q75(rasterized_zarr cell 4; jupyterhub cell 13). - API symbols exist:
default_config,load_config,load_config_from_dict,validate_config,resolve_function,evaluate_expression,get_agg_fields,get_child_order(config.py);calculate_cell_statistics(processing/aggregate.py — modulo Blocker 1's input type);xdggs_spec(parent_order, child_order, config=None)(schema.py) — all call sites use 2 positional +config=kw, matches.agg(config, *, catalog=, store=, driver=, dry_run=, max_cells=, max_workers=, overwrite=)— jupyterhub passesconfigpositional + the rest as kw, matches the keyword-only signature.config.catalog/config.output["store"](jupyterhub cell 16) are validPipelineConfigfields. - Intra-notebook links all resolve:
cryocloud_example.ipynbandcustom_aggregations.ipynbboth exist innotebooks/.
Minor (non-blocking)
custom_aggregations.ipynbcell 2:print(f"Sections: data_source, aggregation, output")is an f-string with no placeholders — harmless, but ruff/lint may flag F541. Worth dropping thef.
Net: two correctness blockers (both in cells the PR advertises as Binder-runnable), one trivial lint nit; everything else against espg's #104 rulings checks out.
Generated by Claude Code
espg
left a comment
There was a problem hiding this comment.
🤖 from Claude (review)
Fresh-context adversarial review of commits since d564bbd: 45f0a5c (pix2ang/cell-statistics fold), b74ce27 (cryocloud refresh), 6fcc2cb (README badges). Static inspection only — no geo stack installed, source.coop egress is blocked (403), and mortie is not importable, so pix2ang semantics were verified against mortie/_healpix.py source rather than by running. Notebooks were not executed.
pix2ang fix — CORRECT. Ground truth: mortie._healpix.pix2ang(depth, pixel) is depth-native, docstring "depth is the HEALPix order (nside = 2**depth)". All three call sites now pass the order (CHILD_ORDER = 12), not NSIDE and not 2**CHILD_ORDER:
rasterized_zarr.ipynbcell 6:pix2ang(CHILD_ORDER, cell_ids.astype(np.int64))jupyterhub_example.ipynbcell 14:pix2ang(CHILD_ORDER, cell_ids)cryocloud_example.ipynbcell 21:pix2ang(CHILD_ORDER, cell_ids.astype(np.int64))
NSIDE = 2 ** CHILD_ORDER survives only as a derived label in attrs/comments, never passed to pix2ang. Clean.
calculate_cell_statistics fix — CORRECT. Source src/zagg/processing/aggregate.py:290 takes cell_data: dict[str, Any]. All three custom_aggregations.ipynb call sites (cells 14, 18, 25) now pass dicts (cell, cell, cell_with_slope). import pandas as pd was removed and no live pd./df reference remains — the only surviving DataFrame token is in an explanatory comment ("NOT a DataFrame"), which is fine.
cryo API correctness — CORRECT. Verified against current source:
- Catalog CLI (
src/zagg/catalog/__init__.pymain()): cell 14 uses--config(required, defines the grid),--short-name(required),--cycle,--polygon,--output— all real flags; no--parent-orderflag is passed (parent order lives in the config). Matches the source. agg(...)kwargs (cells 16, 18) all exist insrc/zagg/runner.pydef agg() (catalog,store,backend,driver,max_cells,max_workers,overwrite). Result keys used (cells_with_data,total_obs,wall_time_s,lambda_time_s,estimated_cost_usd) all exist in the local/lambda return dicts.default_config,load_config_from_dict,get_agg_fieldsall exist insrc/zagg/config.py;default_config('atl06')resolves (src/zagg/configs/atl06.yamlis present).--output shardmap_atl06_cycle22.json(cell 14) matchescatalog='shardmap_atl06_cycle22.json'in both cells 16 and 18. Consistent.
Not Binder-runnable marking — CORRECT. cryocloud_example carries a clear "Not Binder-runnable" callout in cell 0 and has no Binder badge/link. README table row says "not Binder-runnable".
README badges — CORRECT. All three Binder links are https://mybinder.org/v2/gh/englacial/zagg/main?urlpath=lab/tree/notebooks/<nb>.ipynb; cryo row marked not-runnable.
Data references / stale naming — CLEAN. All stores point to s3://us-west-2.opendata.source.coop/englacial/zagg/benchmarks/atl06_cycle22_fullsphere.zarr. No xagg, no git+...xagg.git, no old-bucket naming. The 429435741471 / 574251165169 account IDs and the process-morton-cell ARN are intentional cross-account context in the (non-Binder) cryo notebook, not stale references.
[analysis] extra reconcile — CLEAN, no missing dep. Every import in the three Binder notebooks is covered: numpy/zarr/obstore/pyproj/shapely/mortie (core), yaml←pyyaml (core), xarray←xarray[io], cartopy/matplotlib (the [analysis] extra); importlib/json/datetime/warnings/logging/os are stdlib. pyproject.toml correctly was not modified — no new dependency is needed.
Outputs cleared — CORRECT. All four notebooks: 0 outputs, 0 non-null execution counts.
Two non-blocking notes for author judgment (neither gates the phase):
- cryocloud_example cell 14 references
--polygon antarctica.geojson, but noantarctica.geojsonexists in the repo tree. Since this notebook is explicitly not Binder-runnable and already needs live AWS + Earthdata, this is cosmetic — but a reader copy-pasting the cell will hit a missing-file error. Consider a one-line note that the user supplies the AOI, or use--bbox. - Same cell uses
--config src/zagg/configs/atl06.yaml(a repo-root-relative path) while the notebook lives innotebooks/. Works only if cwd is the repo root. Worth a note, given (1) above.
Net: the two fixes folded from the prior review are correct and complete, the cryo refresh matches the current CLI/runner/config API, and the README wiring is right. Nothing blocking.
Generated by Claude Code
|
🤖 from Claude All four phases landed (cryo refresh + Generated by Claude Code |
source.coop endpoints have been whitelisted for egress, you should be able to access now
what, why?
this isn't true -- for catalog building, anonymous access is allowed. Fetching data does require creds though...
Great! |
espg
left a comment
There was a problem hiding this comment.
🤖 from Claude (review)
Adversarial review of just 31787cc ("fold #105 review: anonymous catalog-build cell, parent_order in custom configs"). I verified the changes against the repo source and found no correctness bugs; all four files check out. Inline notes below are confirmations plus one minor doc-consistency nit.
What I verified:
- The new catalog cell's API usage is correct:
ShardMap.build(catalog, grid, *, backend=...)signature matches (positionalcat, grid, keywordbackend="mortie");Queryfields (short_name/version/start_date/end_date/region) andquery.collectionexist;Catalog.__len__,sm.shard_keys, andsm.metadata["total_pairs"]all exist;from_config(config)takes the config object.HAVE_CATALOGis set on both the try and except paths, so the downstreamif HAVE_CATALOGcells are always defined. - "Anonymous / no spherely on Binder" is accurate:
CMRSource._searchuses plainrequests.get/postwith no auth headers;shardmap.pyhas no top-levelspherelyimport (it's lazy via_spherely_available()), andbackend="mortie"bypasses it entirely. - stac-geoparquet is genuinely needed (
CMRSource.fetchimportsstac_geoparquet.arrow) and is NOT in theanalysisextra, so addingcatalogtopostBuildis correct and necessary.catalog = ["stac-geoparquet>=0.7.0"]is an existing declared extra, not a new dependency. - The
parent_order=6additions are consistent with thexdggs_spec(parent_order=6, ...)calls — no mismatch. broken_yamlnow fails for the intended reason — see inline note.
One non-blocking nit (line not in this diff, so noting here): the Part-section "Summary" table still lists Output grid settings as type:, indexing_scheme:, child_order: and omits parent_order:, which is now required for healpix. Worth adding for consistency with the configs this commit just fixed.
Generated by Claude Code
espg
left a comment
There was a problem hiding this comment.
🤖 from Claude (review)
Adversarial review of just 31787cc ("fold #105 review: anonymous catalog-build cell, parent_order in custom configs"), checked against the repo source. No correctness bugs found; all four files check out. Inline notes are confirmations plus one minor doc nit below.
Verified: the new catalog cell's API usage (ShardMap.build/Query/Catalog.__len__/shard_keys/metadata["total_pairs"]/from_config) and its HAVE_CATALOG wiring on both branches; the anonymity + no-spherely-on-Binder claims (bare requests in CMRSource._search, lazy spherely, mortie backend); that stac-geoparquet is genuinely needed and is an existing catalog extra (not a new dep); that parent_order=6 is consistent with the xdggs_spec(parent_order=6, ...) calls; and that broken_yaml now fails for the intended missing-dh_fit_dx reason rather than missing parent_order.
Non-blocking nit (line not in this diff, so noted here not inline): the "Summary" table near the end of custom_aggregations.ipynb still lists Output grid settings as type:, indexing_scheme:, child_order: and omits parent_order:, which this commit just established as required for healpix. Worth adding for consistency.
Generated by Claude Code
| "id": "b964db81", | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": "# Build a small ATL06 catalog anonymously (CMR-STAC metadata only -- no login).\n# This is the same work `python -m zagg.catalog ...` does, via the Python API.\nfrom zagg.catalog.shardmap import ShardMap\nfrom zagg.catalog.sources import CMRSource, Query\nfrom zagg.grids import from_config\n\ncatalog_path = \"shardmap_ATL06_2025-04-01_2025-04-03.json\"\ngrid = from_config(config)\n\n# Small bbox + a few days so the build is quick on Binder.\nquery = Query(\n short_name=\"ATL06\",\n version=\"006\",\n start_date=\"2025-04-01\",\n end_date=\"2025-04-03\",\n region=(-70.0, -78.0, -65.0, -75.0), # lon_min, lat_min, lon_max, lat_max\n)\n\ntry:\n cat = CMRSource().fetch(query)\n print(f\"Fetched {len(cat)} granules ({query.collection})\")\n # backend=\"mortie\": HEALPix coverage, so no spherely fork needed.\n sm = ShardMap.build(cat, grid, backend=\"mortie\")\n sm.to_json(catalog_path)\n HAVE_CATALOG = True\n print(f\"ShardMap: {len(sm.shard_keys)} shards, \"\n f\"{sm.metadata['total_pairs']} pairs -> {catalog_path}\")\nexcept Exception as exc:\n # CMR-STAC is anonymous but needs outbound network; degrade cleanly on an\n # offline / egress-restricted Binder so the rest of the notebook still runs.\n HAVE_CATALOG = False\n print(f\"Catalog build skipped ({type(exc).__name__}: {exc}).\\n\"\n \"If this environment can't reach cmr.earthdata.nasa.gov, jump to the \"\n \"public-result section, which reads source.coop directly.\")" |
There was a problem hiding this comment.
🤖 from Claude (review)
Catalog cell checks out against source. ShardMap.build(cls, catalog, grid, *, backend=...) matches the positional cat, grid + keyword backend="mortie"; Query fields and query.collection, Catalog.__len__, sm.shard_keys, and sm.metadata["total_pairs"] all exist; from_config(config) takes the config object. HAVE_CATALOG is assigned on both the try and except branches, so the downstream if HAVE_CATALOG cells never hit an undefined name. The anonymity claim is grounded: CMRSource._search uses bare requests.get/post with no auth headers.
Generated by Claude Code
| " print(f\"Validation error: {e}\")" | ||
| ] | ||
| "outputs": [], | ||
| "source": "# Same config but without dh_fit_dx in data_source.variables\nbroken_yaml = \"\"\"\ndata_source:\n reader: h5coro\n groups: [gt1l, gt1r, gt2l, gt2r, gt3l, gt3r]\n coordinates:\n latitude: \"/{group}/land_ice_segments/latitude\"\n longitude: \"/{group}/land_ice_segments/longitude\"\n variables:\n h_li: \"/{group}/land_ice_segments/h_li\"\n s_li: \"/{group}/land_ice_segments/h_li_sigma\"\n # dh_fit_dx is missing!\n\naggregation:\n coordinates:\n cell_ids: {dtype: uint64, fill_value: 0}\n morton: {dtype: int64, fill_value: 0}\n variables:\n count: {function: len, source: h_li, dtype: int32, fill_value: 0}\n slope_mean: {function: mean, source: dh_fit_dx, dtype: float32}\n\noutput:\n grid:\n type: healpix\n indexing_scheme: nested\n parent_order: 6\n child_order: 12\n\"\"\"\n\ntry:\n validate_config(load_config_from_dict(yaml.safe_load(broken_yaml)))\nexcept ValueError as e:\n print(f\"Validation error: {e}\")" |
There was a problem hiding this comment.
🤖 from Claude (review)
Verified broken_yaml now fails for the intended reason. In validate_config, the output.grid.parent_order is required for healpix grid check runs near the top, before the per-variable source-reference check (source '...' not in data_source.variables). So pre-fix, the missing parent_order would have short-circuited and raised the grid error, masking the missing-dh_fit_dx error the surrounding markdown advertises. Adding parent_order: 6 makes the cell raise the intended source 'dh_fit_dx' not in data_source.variables, matching cell 23's "What if we forget to add dh_fit_dx... validate_config() catches it." Same fix correctly unblocks simple_yaml/extended_yaml, and the parent_order=6 value is consistent with the xdggs_spec(parent_order=6, ...) calls later in the notebook.
Generated by Claude Code
| # irrelevant for running the example notebooks). | ||
| git fetch --tags --unshallow 2>/dev/null || git fetch --tags 2>/dev/null || true | ||
| if git describe --tags --abbrev=0 >/dev/null 2>&1; then | ||
| python -m pip install --no-cache-dir ".[analysis,catalog]" |
There was a problem hiding this comment.
🤖 from Claude (review)
Confirmed correct and necessary: the new catalog cell calls CMRSource.fetch, which does import stac_geoparquet.arrow. stac-geoparquet is in the catalog extra only (not analysis), so without adding catalog here the build cell would ImportError on Binder. catalog = ["stac-geoparquet>=0.7.0"] is a pre-existing declared extra, so this is not a new dependency. The "no spherely" comment is also accurate — backend="mortie" plus no top-level spherely import in shardmap.py.
Generated by Claude Code
|
🤖 from Claude Addressed your three review points (pushed 1. mortie not importable — root cause. It is a not-installed problem, not a repo/declaration bug. 2. source.coop egress — verified anonymous read. With the whitelist in place I can now reach it. I opened 3. catalog building is anonymous — fixed. You're right. One note I could not close here: the CMR-STAC build can't be executed end-to-end in this environment because Generated by Claude Code |
Closes #104
What this does
Makes the example notebooks Binder-launchable and refreshes them against the current API, following the decisions in this issue comment:
.binder/repo2docker config (condaenvironment.yml+postBuild) so Binder builds the geospatial stack cleanly.[docs]stays mkdocs-only; the[analysis]+[catalog]extras are the notebook runtime — Binder installszagg[analysis,catalog], no new dependency group (both extras already exist inpyproject.toml).custom_aggregations,rasterized_zarr,jupyterhub_example) plus the cryocloud notebook refreshed to the current API (cryo stays the only Lambda demo and is excluded from Binder).jupyterhub_example(was "section 6"); Lambda lives only in the cryo notebook.s3://us-west-2.opendata.source.coop/englacial/zagg/benchmarks/(anonymous read verified this run, see "How tested").Approach / data rule
Per CLAUDE.md §4, every Binder-runnable cell must read data that is either anonymously reachable on the web or in the git tree. Two anonymous spines:
englacial/zagg/benchmarks/atl06_cycle22_fullsphere.zarr(HEALPix order 12, consolidated; variablescount/h_mean/h_sigma/h_min/h_max/h_variance/h_q25..75/cell_ids/morton, matchingsrc/zagg/configs/atl06.yaml).custom_aggregationsuses only synthetic in-notebook data, so it is anonymous by construction.agg()needs auth). The default HEALPixmortiecoverage backend means the non-PyPI spherely fork is not required either.jupyterhub_examplesection 3 now builds a small ATL06 catalog this way (guarded with try/except so it degrades cleanly on egress-restricted Binders); only theagg()data-fetch cells stay credential-gated.Files added / changed (and why)
.binder/environment.yml— conda-forge Python 3.12 + (trimmed) system geo libs; the Python layer is installed in postBuild so the dep set lives once inpyproject.toml..binder/postBuild—pip install ".[analysis,catalog]"with a hatch-vcs pretend-version guard for tag-less repo2docker checkouts; deliberately does NOT pull the spherely fork or icechunk/hvplot (cryo-only).[catalog]addsstac-geoparquet, which the anonymous catalog-build cell imports.notebooks/custom_aggregations.ipynb,rasterized_zarr.ipynb,jupyterhub_example.ipynb— refreshed to source.coop data, Binder badge/intro cells, anonymous catalog-build cell, only the data-fetch (agg) cells gated, Lambda section removed from jupyterhub, outputs cleared.notebooks/cryocloud_example.ipynb— refreshed to the current API (realpython -m zagg.catalogCLI, source.coop output store, dropped thegit+...xagg.gitinstall), catalog-build comment corrected to note it is anonymous, kept as the sole Lambda demo and marked explicitly not-Binder.pyproject.toml— unchanged: both[analysis]and[catalog]already exist; the Binder install just uses both.README.md— Binder badge + a notebook table with per-notebook launch links and the explicit cryo "not Binder-runnable" row.Phases
.binder/conda env + postBuild; PyPI/conda-forge-resolvable only, nogit+/fork wheels.atl06_cycle22_fullsphere.zarr; fix stalexagg/bucket/ARN naming; remove the Lambda section fromjupyterhub_example; gate EDL-only cells; clear outputs; add a Binder intro/badge cell.cryocloud_example+ extras reconcile. Update cryo to the current API, keep it the sole Lambda demo, mark it explicitly not-Binder.README.md; per-notebook Binder note; explicit "not Binder-able" note in cryo.[catalog]on Binder; fixparent_ordermissing from the inline custom configs incustom_aggregations.How tested
ruff check src tests— the only finding is a pre-existingN818insrc/zagg/registry.py(untouched by this branch; CI lint runs--select=E,F,W,Iso it doesn't gate). No notebook/Binder change introduces a lint.custom_aggregations.ipynbexecuted end-to-end clean viajupyter nbconvert --execute(synthetic data, fully offline) — this is what caught the missing-parent_orderbug, now fixed.atl06_cycle22_fullsphere.zarrgroup12over an unsigned S3 request (consolidated metadata fetched in <1s; chunk reads + decode confirmed). The whitelist for source.coop egress is in effect.CMRSource().fetchissues an unauthenticatedrequests.gettocmr.earthdata.nasa.gov/stac(no auth header) andShardMap.build(..., backend="mortie")needs no spherely fork. End-to-end execution of the CMR query could not be run here becausecmr.earthdata.nasa.govis not on this environment's egress allowlist (only source.coop is) — the build cell is guarded to degrade cleanly in that case.Questions for review
jupyterhub_examplecatalog spine — UPDATED. Per your correction, the catalog-build step is now a runnable anonymous cell (CMR-STAC metadata query,backend="mortie", no spherely fork, no Earthdata login). Only theagg()data-fetch cells stay credential-gated. The cell is wrapped in try/except so it degrades cleanly where outbound access to CMR is blocked. OK as-is?repo2docker .build + anonymous-read run is still best validated by you in a real Binder. I can add a CI smoke-build of the Binder image if you want it, but that would touch.github/workflows/, which I won't do without you naming it.[analysis]/[catalog]— RESOLVED. Both are existing extras; Binder now installszagg[analysis,catalog](catalog addsstac-geoparquet, which the anonymous catalog-build cell needs). No new dependency,pyproject.tomluntouched.Generated by Claude Code