download only specific bands from source gribs #2

Merged: margocrawf merged 39 commits into main from specific-bands on Dec 17, 2024
@aldenks commented Nov 18, 2024

  • Download source gribs band-by-band
  • Use rasterio/gdal instead of cfgrib as grib decoder
  • Parallelize downloads and file reads
  • New structure for data variable encoding and attributes
  • Support full forecast horizon
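
As a rough illustration of the first and third bullets, here is a minimal sketch of how band-by-band downloads could be planned. It assumes the source gribs ship wgrib2-style `.idx` sidecar files (colon-separated lines holding message number, start byte, date, variable, level, and forecast step); the PR description does not say this is the mechanism used. The names `SAMPLE_INDEX`, `ByteRange`, `parse_index`, `select_ranges`, and `fetch_all` are hypothetical, and `fetch` is an injected callable so the sketch makes no network calls.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical index contents in the wgrib2 .idx layout:
# message_number:start_byte:date:variable:level:forecast:extra
SAMPLE_INDEX = """\
1:0:d=2024111800:HGT:500 mb:anl:ENS=+1
2:1000:d=2024111800:TMP:2 m above ground:anl:ENS=+1
3:2500:d=2024111800:UGRD:10 m above ground:anl:ENS=+1
"""


@dataclass(frozen=True)
class ByteRange:
    start: int
    end: Optional[int]  # inclusive; None means "through end of file"

    def header(self) -> str:
        """Value for an HTTP Range request header."""
        end = "" if self.end is None else str(self.end)
        return f"bytes={self.start}-{end}"


def parse_index(index_text: str) -> list:
    """Return (variable, level, ByteRange) for every message in the index."""
    rows = [line.split(":") for line in index_text.splitlines() if line.strip()]
    starts = [int(row[1]) for row in rows]
    parsed = []
    for i, row in enumerate(rows):
        # A message ends one byte before the next message starts.
        end = starts[i + 1] - 1 if i + 1 < len(rows) else None
        parsed.append((row[3], row[4], ByteRange(starts[i], end)))
    return parsed


def select_ranges(index_text: str, wanted: set) -> list:
    """Keep only the byte ranges for the (variable, level) pairs we need."""
    return [br for var, level, br in parse_index(index_text) if (var, level) in wanted]


def fetch_all(ranges: list, fetch: Callable, max_workers: int = 8) -> list:
    """Run fetch over the ranges in a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, ranges))
```

In real use `fetch` would issue a ranged GET with `ByteRange.header()` as the `Range` value, and the downloaded bytes would then be handed to the grib decoder.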

aldenks and others added 30 commits November 18, 2024 17:00
Comment thread noaa/gefs/forecast/read_data.py
@margocrawf margocrawf marked this pull request as ready for review December 17, 2024 18:26
@margocrawf margocrawf merged commit f0c2773 into main Dec 17, 2024
@aldenks aldenks deleted the specific-bands branch February 25, 2026 21:04
aldenks pushed a commit that referenced this pull request May 25, 2026
- Drop the Materialized/Virtual DynamicalDataset split — single base.
  Keep sibling MaterializedRegionJob/VirtualRegionJob under RegionJob
  to host their substantial per-variant code.
- Reframe the update process around indexed CronJobs (same scheduling
  pattern as materialized): worker 0 expands dims on main; workers
  fill chunks in parallel; no temp branch for operational updates.
  Avoids concurrent dim-expansion conflicts on init_time coord chunks.
- Promote filtering-already-ingested from open question to the core
  efficiency mechanism for steady-state updates (region jobs span
  shards that are mostly already populated).
- Walk through a concurrent-update scenario explicitly and explain why
  ConflictDetector accepts it. Call out the integration test PR #2
  needs to verify icechunk 2.x rebase semantics.
- Document three options for per-variable serializer (encoding factory,
  metadata-only common config, inherit-and-replace) instead of picking
  one prematurely.
- Minimize __main__.py surface: source virtual chunk containers declared
  on the VirtualRegionJob class; store factory picks them up automatically.
- Address each unresolved review comment from Alden inline.
- Add appendices with concrete code patterns from PR #511, context from
  PR #510, and an inventory of existing infrastructure we reuse.
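
The indexed-worker and filter-already-ingested points above can be sketched as follows. This is a toy model, not the project's code: `jobs_for_worker`, `pending`, `worker_plan`, and the `(variable, shard)` job tuples are hypothetical, and reading `JOB_COMPLETION_INDEX` assumes the workers run as Kubernetes Indexed Jobs, which set that variable. Partitioning happens on the full, deterministic job list before filtering, so workers agree on ownership even if they observe different ingestion states.

```python
import os


def worker_index_from_env() -> int:
    """Read this pod's index; assumes Kubernetes Indexed Job semantics."""
    return int(os.environ.get("JOB_COMPLETION_INDEX", "0"))


def jobs_for_worker(all_jobs: list, worker_index: int, workers_total: int) -> list:
    """Deterministically assign every workers_total-th job to this worker."""
    if not 0 <= worker_index < workers_total:
        raise ValueError("worker_index must be in [0, workers_total)")
    return list(all_jobs)[worker_index::workers_total]


def pending(jobs: list, already_ingested: set) -> list:
    """Drop jobs whose output is already present in the store."""
    return [job for job in jobs if job not in already_ingested]


def worker_plan(all_jobs: list, already_ingested: set,
                worker_index: int, workers_total: int) -> dict:
    """Describe what one worker should do in a single update run."""
    mine = jobs_for_worker(all_jobs, worker_index, workers_total)
    return {
        # Only worker 0 expands dimensions, avoiding concurrent expansion.
        "expands_dims": worker_index == 0,
        "jobs": pending(mine, already_ingested),
    }


# Hypothetical steady state: six shards, the first three already populated.
ALL_JOBS = [("temperature_2m", shard) for shard in range(6)]
ALREADY_INGESTED = {("temperature_2m", 0), ("temperature_2m", 1), ("temperature_2m", 2)}
```

With two workers, worker 0 owns shards 0, 2, and 4 and worker 1 owns shards 1, 3, and 5; after filtering, each is left with only the shards that still need data.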

https://claude.ai/code/session_01YbsupHKaGd11C8RaW6gQVQ
aldenks pushed a commit that referenced this pull request May 26, 2026
Previous draft routed multi-new-init scenarios through a "catchup
edge case" with two divergent strategies (app-level retry vs backfill
fallback). That's operational complexity for no good reason.

Replace with a single uniform model: every batch commit opens a fresh
session, computes refs against current state (filter-already-ingested
+ lazy index lookup), expands the dim if needed, sets refs, commits.
On ConflictDetector rejection, throw away the session and retry with
a fresh one — the retry's "recompute against current state" step
naturally picks up whatever the other pod committed and recomputes
target indices.

Retries are cheap: byte ranges from parsed index files are already
in hand, set_virtual_ref calls are microseconds, only the chunk-key
indices need recomputation. Steady state sees ~0 conflicts; the rare
multi-new-init scenario pays a few extra retries and converges.
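
A minimal sketch of the uniform model described above, using a toy in-memory store rather than icechunk. Everything here (`ToyStore`, `ToySession`, `ConflictError`, `commit_batch`, the `on_attempt` hook) is hypothetical. The toy store rejects any commit made against a stale version, which is stricter than the ConflictDetector behavior described in the earlier commit, but it is enough to show the open, recompute, commit, retry loop.

```python
class ConflictError(Exception):
    """Raised when the store changed after the session was opened."""


class ToyStore:
    def __init__(self):
        self.version = 0
        self.init_times = []  # stand-in for the init_time coordinate
        self.refs = {}        # (init_index, step) -> byte range

    def open_session(self):
        return ToySession(self)


class ToySession:
    def __init__(self, store):
        self.store = store
        self.base_version = store.version
        self.init_times = list(store.init_times)  # snapshot of current state
        self.committed = dict(store.refs)         # snapshot of current state
        self.staged = {}

    def commit(self):
        if self.store.version != self.base_version:
            raise ConflictError("store changed since session opened")
        self.store.init_times = self.init_times
        self.store.refs.update(self.staged)
        self.store.version += 1


def commit_batch(store, init_time, byte_ranges, max_retries=5, on_attempt=None):
    """Commit one batch, retrying with a fresh session on conflict.

    Returns the number of attempts it took.
    """
    for attempt in range(1, max_retries + 1):
        session = store.open_session()              # fresh session every time
        if init_time not in session.init_times:
            session.init_times.append(init_time)    # expand the dim if needed
        index = session.init_times.index(init_time) # recomputed target index
        for step, byte_range in byte_ranges.items():
            if (index, step) not in session.committed:  # filter already ingested
                session.staged[(index, step)] = byte_range
        if on_attempt is not None:
            on_attempt(attempt)  # test hook: lets another "pod" commit first
        try:
            session.commit()
            return attempt
        except ConflictError:
            continue  # discard the session; byte_ranges are reused as-is
    raise RuntimeError("did not converge within max_retries")


# Simulate another pod adding a different init time during our first attempt.
store = ToyStore()


def other_pod(attempt):
    if attempt == 1:
        session = store.open_session()
        session.init_times.append("2024-11-18T00")
        session.staged[(0, 0)] = (0, 99)
        session.commit()


attempts = commit_batch(store, "2024-11-18T06", {0: (100, 199)}, on_attempt=other_pod)
```

The first attempt targets index 0 and is rejected; the retry sees the other pod's init time, recomputes the target index as 1, and commits. Only the index lookup is repeated, while the byte ranges are carried over unchanged.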

Update PR #2 integration tests to verify this uniform model converges
in both the disjoint-write and concurrent-expansion scenarios.

https://claude.ai/code/session_01YbsupHKaGd11C8RaW6gQVQ
2 participants