Add GFS virtual icechunk dataset prototype #511

Closed
aldenks wants to merge 4 commits into main from claude/gfs-icechunk-prototype-a1nzU

aldenks (Member) commented Mar 14, 2026

Prototype demonstrating a virtual icechunk dataset backed by NOAA GFS
GRIB2 files on S3. Data variable chunks are virtual references that
GribberishCodec decodes at read time. Coordinates are stored as real data.

Demonstrates three phases: backfill with partial init times, adding a
new init time, and filling missing lead times for an incomplete forecast.

https://claude.ai/code/session_01UsSdT7E8S9bnrabtM5gNvp
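
The virtual-reference idea in the description can be illustrated with a minimal sketch. This is not the prototype's code: `VirtualRef`, `FAKE_BUCKET`, `read_chunk`, and the object key are hypothetical names, the in-memory dict stands in for S3, and the `decode` step unpacks little-endian float32 values as a stand-in for GribberishCodec decoding GRIB2 messages.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class VirtualRef:
    """Pointer to a byte range in a source file (hypothetical structure)."""
    location: str  # object key; an assumption, not a real GFS path
    offset: int    # first byte of the message inside the file
    length: int    # number of bytes in the message


# Stand-in for object storage: key -> file bytes.
FAKE_BUCKET = {}


def pack_message(values):
    """Encode floats as little-endian float32 (stand-in for a GRIB2 message)."""
    return struct.pack(f"<{len(values)}f", *values)


def decode(raw):
    """Decode bytes at read time (stand-in for the codec)."""
    count = len(raw) // 4
    return list(struct.unpack(f"<{count}f", raw))


def read_chunk(manifest, chunk_key):
    """Resolve a chunk through its reference; None means no reference yet."""
    ref = manifest.get(chunk_key)
    if ref is None:
        return None
    blob = FAKE_BUCKET[ref.location]
    return decode(blob[ref.offset:ref.offset + ref.length])


# One source file holding a 3-byte header and two messages.
msg_a = pack_message([1.0, 2.0])
msg_b = pack_message([3.5, 4.5, 5.5])
FAKE_BUCKET["gfs/file0.grib2"] = b"HDR" + msg_a + msg_b

# The dataset stores only references, never the data bytes themselves.
manifest = {
    (0, 0): VirtualRef("gfs/file0.grib2", 3, len(msg_a)),
    (0, 1): VirtualRef("gfs/file0.grib2", 3 + len(msg_a), len(msg_b)),
}
```

Reading `manifest` key `(0, 1)` slices only that message's byte range out of the source file, which is why no data has to be copied when the dataset is built.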

claude added 4 commits March 14, 2026 17:47
Phase flow is now:
1. Backfill 3 init times with all lead times
2. Resize to add 4th init time, fill 3 of 7 lead times
3. Fill 2 more lead times for the 4th init time

This demonstrates actually growing the dataset dimensions via
zarr array resize() rather than pre-allocating all init times.

https://claude.ai/code/session_01UsSdT7E8S9bnrabtM5gNvp
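
The phase flow above can be mimicked on a plain in-memory grid. This is a sketch under assumptions: `GrowableGrid` and its methods are hypothetical and only imitate the effect of growing the `init_time` dimension with a resize; the prototype itself resizes zarr arrays and writes virtual references rather than values.

```python
import math


class GrowableGrid:
    """Toy (init_time, lead_time) grid; unfilled cells hold NaN."""

    def __init__(self, n_init, n_lead):
        self.n_lead = n_lead
        self.rows = [[math.nan] * n_lead for _ in range(n_init)]

    def resize(self, n_init):
        """Grow the init_time dimension; new rows start unfilled."""
        if n_init < len(self.rows):
            raise ValueError("this sketch only supports growing")
        while len(self.rows) < n_init:
            self.rows.append([math.nan] * self.n_lead)

    def fill(self, init_idx, lead_idx, value):
        self.rows[init_idx][lead_idx] = value

    def filled_leads(self, init_idx):
        """Lead-time indices that already have data for one init time."""
        return [i for i, v in enumerate(self.rows[init_idx]) if not math.isnan(v)]


# Phase 1: backfill 3 init times with all 7 lead times.
grid = GrowableGrid(n_init=3, n_lead=7)
for init_idx in range(3):
    for lead_idx in range(7):
        grid.fill(init_idx, lead_idx, float(init_idx * 10 + lead_idx))

# Phase 2: resize to add a 4th init time, fill 3 of 7 lead times.
grid.resize(4)
for lead_idx in (0, 1, 2):
    grid.fill(3, lead_idx, 30.0 + lead_idx)

# Phase 3: fill 2 more lead times for the 4th init time.
for lead_idx in (3, 4):
    grid.fill(3, lead_idx, 30.0 + lead_idx)
```

After phase 3 the 4th init time still has two unfilled lead times, so a reader sees NaN there in this toy model until a later update fills them.
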
Demonstrate opening the repo fresh as a reader, fetching real GFS values
from S3 through virtual references, and update the report with reader
code pattern, global stats, point samples, and partial init time behavior.

https://claude.ai/code/session_01UsSdT7E8S9bnrabtM5gNvp
aldenks linked an issue Mar 16, 2026 that may be closed by this pull request

aldenks (Member, Author) commented Mar 16, 2026

heck yeah. done its job

aldenks closed this Mar 16, 2026
aldenks deleted the claude/gfs-icechunk-prototype-a1nzU branch March 16, 2026 03:00
aldenks pushed a commit that referenced this pull request May 25, 2026
- Drop the Materialized/Virtual DynamicalDataset split — single base.
  Keep sibling MaterializedRegionJob/VirtualRegionJob under RegionJob
  to host their substantial per-variant code.
- Reframe the update process around indexed CronJobs (same scheduling
  pattern as materialized): worker 0 expands dims on main; workers
  fill chunks in parallel; no temp branch for operational updates.
  Avoids concurrent dim-expansion conflicts on init_time coord chunks.
- Promote filtering-already-ingested from open question to the core
  efficiency mechanism for steady-state updates (region jobs span
  shards that are mostly already populated).
- Walk through a concurrent-update scenario explicitly and explain why
  ConflictDetector accepts it. Call out the integration test that PR #2
  needs in order to verify icechunk 2.x rebase semantics.
- Document three options for per-variable serializer (encoding factory,
  metadata-only common config, inherit-and-replace) instead of picking
  one prematurely.
- Minimize __main__.py surface: source virtual chunk containers declared
  on the VirtualRegionJob class; store factory picks them up automatically.
- Address each unresolved review comment from Alden inline.
- Add appendices with concrete code patterns from PR #511, context from
  PR #510, and an inventory of existing infrastructure we reuse.

https://claude.ai/code/session_01YbsupHKaGd11C8RaW6gQVQ
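
The indexed-worker update described in the commit message above might be planned along these lines. This is a hedged sketch: `plan_worker` and `should_expand_dims` are hypothetical helpers, and the round-robin split is an assumption. The commit message only states that worker 0 expands dims, that workers fill chunks in parallel, and that already-ingested chunks are filtered out.

```python
def should_expand_dims(worker_index):
    """Only worker 0 grows the dataset dimensions, per the commit message."""
    return worker_index == 0


def plan_worker(worker_index, workers_total, all_chunks, ingested):
    """Return the chunks one worker should fill.

    Already-ingested chunks are filtered out first, then the remainder
    is split round-robin (an assumed strategy) so workers never overlap.
    """
    pending = sorted(c for c in all_chunks if c not in ingested)
    return pending[worker_index::workers_total]


# Chunk keys are (init_time index, lead_time index) pairs.
all_chunks = [(i, lead) for i in range(4) for lead in range(7)]

# Steady state: three init times fully ingested, the fourth partially.
ingested = {(i, lead) for i in range(3) for lead in range(7)}
ingested |= {(3, 0), (3, 1), (3, 2)}

plans = [plan_worker(w, 2, all_chunks, ingested) for w in range(2)]
```

Here 24 of the 28 chunks are skipped and the two workers split the remaining four, which illustrates why filtering matters when a region job spans mostly populated shards.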

Development

Successfully merging this pull request may close these issues.

Virtual datasets exploration
