Add GFS virtual icechunk dataset prototype #511
Closed
aldenks wants to merge 4 commits into
Prototype demonstrating a virtual icechunk dataset backed by NOAA GFS GRIB2 files on S3. Data variable chunks are virtual references decoded at read time by GribberishCodec. Coordinates are stored as real data. Demonstrates three phases: backfill with partial init times, adding a new init time, and filling missing lead times for an incomplete forecast.
https://claude.ai/code/session_01UsSdT7E8S9bnrabtM5gNvp
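To make the "virtual reference decoded at read time" idea concrete, here is a minimal plain-Python sketch. It does not use icechunk or GribberishCodec: `VirtualRef`, `OBJECTS`, `decode`, `read_chunk`, the bucket URL, and the variable names are all hypothetical stand-ins, and the "codec" simply unpacks float32 values rather than parsing GRIB2.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class VirtualRef:
    """Pointer to a byte range inside a source file (hypothetical stand-in)."""

    location: str  # URL of the source file, e.g. a GRIB2 object
    offset: int    # first byte of the referenced message
    length: int    # number of bytes in the referenced message


# Fake object store: one file holding two packed float32 "messages".
OBJECTS = {
    "s3://example-bucket/gfs.f000": struct.pack("<2f", 1.5, 2.5)
    + struct.pack("<2f", 10.0, 20.0),
}


def decode(raw: bytes) -> list[float]:
    """Stand-in for a GRIB codec: unpack little-endian float32 values."""
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))


def read_chunk(ref: VirtualRef) -> list[float]:
    """Fetch only the referenced byte range, then decode it at read time."""
    raw = OBJECTS[ref.location][ref.offset : ref.offset + ref.length]
    return decode(raw)


# The dataset stores references, not values.
manifest = {
    "temperature/c/0/0": VirtualRef("s3://example-bucket/gfs.f000", 0, 8),
    "wind_u/c/0/0": VirtualRef("s3://example-bucket/gfs.f000", 8, 8),
}

print(read_chunk(manifest["wind_u/c/0/0"]))
```

The point of the sketch is that writing a chunk costs only a small reference, while the bytes stay in the source file and are decoded when a reader asks for them.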
Phase flow is now:
1. Backfill 3 init times with all lead times
2. Resize to add 4th init time, fill 3 of 7 lead times
3. Fill 2 more lead times for the 4th init time

This demonstrates actually growing the dataset dimensions via zarr array resize() rather than pre-allocating all init times.
https://claude.ai/code/session_01UsSdT7E8S9bnrabtM5gNvp
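The three phases can be sketched with a toy model of the (init_time, lead_time) chunk grid. This is plain Python, not the zarr or icechunk API: `ToyDataset` and its methods are hypothetical, and which lead time indices get filled in phases 2 and 3 is an assumption, since the commit message only gives the counts.

```python
from collections.abc import Iterable


class ToyDataset:
    """Plain-Python model of a grid of chunk references (hypothetical)."""

    def __init__(self, n_init: int, n_lead: int) -> None:
        self.shape = (n_init, n_lead)
        self.refs = {}  # (init_idx, lead_idx) -> reference string

    def resize(self, n_init: int) -> None:
        """Grow the init_time dimension; existing references are untouched."""
        self.shape = (n_init, self.shape[1])

    def fill(self, init_idx: int, lead_idxs: Iterable[int]) -> None:
        """Record a reference for each (init_idx, lead_idx) chunk."""
        for lead_idx in lead_idxs:
            if init_idx >= self.shape[0] or lead_idx >= self.shape[1]:
                raise IndexError((init_idx, lead_idx))
            self.refs[(init_idx, lead_idx)] = f"ref-{init_idx}-{lead_idx}"

    def read(self, init_idx: int, lead_idx: int):
        """Return the reference, or None for a chunk that is not filled yet."""
        return self.refs.get((init_idx, lead_idx))

    def filled(self, init_idx: int) -> int:
        """Count the filled lead times for one init time."""
        return sum(1 for (i, _) in self.refs if i == init_idx)


# Phase 1: backfill 3 init times with all 7 lead times.
ds = ToyDataset(n_init=3, n_lead=7)
for init_idx in range(3):
    ds.fill(init_idx, range(7))

# Phase 2: grow to a 4th init time, fill 3 of 7 lead times.
ds.resize(4)
ds.fill(3, [0, 1, 2])  # which indices are filled is an assumption

# Phase 3: fill 2 more lead times for the 4th init time.
ds.fill(3, [3, 4])

print(ds.shape, ds.filled(3), ds.read(3, 6))
```

After phase 3 the 4th init time is still incomplete: reads of its unfilled lead times return the placeholder rather than failing, which mirrors the partial init time behavior described in this PR.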
Demonstrate opening the repo fresh as a reader and fetching real GFS values from S3 through virtual references. Update the report with the reader code pattern, global stats, point samples, and partial init time behavior.
https://claude.ai/code/session_01UsSdT7E8S9bnrabtM5gNvp
Member
Author
heck yeah. done its job
aldenks pushed a commit that referenced this pull request on May 25, 2026
- Drop the Materialized/Virtual DynamicalDataset split: single base. Keep sibling MaterializedRegionJob/VirtualRegionJob under RegionJob to host their substantial per-variant code.
- Reframe the update process around indexed CronJobs (same scheduling pattern as materialized): worker 0 expands dims on main; workers fill chunks in parallel; no temp branch for operational updates. Avoids concurrent dim-expansion conflicts on init_time coord chunks.
- Promote filtering-already-ingested from open question to the core efficiency mechanism for steady-state updates (region jobs span shards that are mostly already populated).
- Walk through a concurrent-update scenario explicitly and explain why ConflictDetector accepts it. Call out the integration test PR #2 needs to verify icechunk 2.x rebase semantics.
- Document three options for per-variable serializer (encoding factory, metadata-only common config, inherit-and-replace) instead of picking one prematurely.
- Minimize __main__.py surface: source virtual chunk containers declared on the VirtualRegionJob class; store factory picks them up automatically.
- Address each unresolved review comment from Alden inline.
- Add appendices with concrete code patterns from PR #511, context from PR #510, and an inventory of existing infrastructure we reuse.

https://claude.ai/code/session_01YbsupHKaGd11C8RaW6gQVQ
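As a rough illustration of the filtering-already-ingested idea combined with indexed workers, the sketch below drops chunk keys that are already populated and then splits the remainder by worker index. It is plain Python under stated assumptions: `pending_chunks`, `worker_share`, and the round-robin split are hypothetical, and how the real jobs discover which chunks are already ingested is not shown in this PR.

```python
from itertools import product

ChunkKey = tuple[int, int]  # (init_time index, lead_time index)


def pending_chunks(
    region: list[ChunkKey], already_ingested: set[ChunkKey]
) -> list[ChunkKey]:
    """Keep only the chunks in this region that still need a reference."""
    return [key for key in region if key not in already_ingested]


def worker_share(
    pending: list[ChunkKey], worker_index: int, workers_total: int
) -> list[ChunkKey]:
    """Round-robin split so each indexed worker fills a disjoint subset."""
    return pending[worker_index::workers_total]


# A region job spanning two init times with 7 lead times each.
region = list(product([2, 3], range(7)))

# Init time 2 is complete; init time 3 has its first 3 lead times.
ingested = set(product([2], range(7))) | {(3, 0), (3, 1), (3, 2)}

todo = pending_chunks(region, ingested)
for worker_index in range(2):
    print(worker_index, worker_share(todo, worker_index, workers_total=2))
```

Because the shares are disjoint, no two workers write the same chunk key; the dimension expansion itself is left to worker 0, as the commit message describes.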