
Cleaner concurrency model #31

@kylebarron

Description


Related to #26, #29, #30

If I understand correctly:

  • open is fully synchronous because it's entirely built on DuckDB, which only exposes a synchronous API. Ideally this part would also be async, since loading the STAC GeoParquet file is network I/O, but this matters less because open is relatively fast.
  • Chunk reads currently expose only a synchronous API because only the synchronous Xarray API is implemented (see Implement Xarray async integration #29). Under the hood, though, a large part of this process is already async via async-geotiff.

My opinion is that as a general principle, it's good to compose async and sync functionality as separate API calls so that the downstream user can decide when and how to manage concurrency between IO-bound calls and CPU-bound calls. This is why async-tiff has a separation between the tile fetch (async network bound) and the decompression (CPU bound).
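A minimal sketch of that separation, with hypothetical `fetch_tile`/`decode_tile` names standing in for async-tiff's actual fetch and decompression steps (this is not async-tiff's API, just the shape of the split):

```python
import asyncio
import zlib
from concurrent.futures import ThreadPoolExecutor

async def fetch_tile(url: str) -> bytes:
    # Network-bound half: stands in for the async tile fetch.
    await asyncio.sleep(0)  # simulate awaiting a range request
    return zlib.compress(b"pixels:" + url.encode())

def decode_tile(raw: bytes) -> bytes:
    # CPU-bound half: stands in for tile decompression.
    return zlib.decompress(raw)

async def read_tile(url: str, pool: ThreadPoolExecutor) -> bytes:
    # The caller composes the two halves and decides where CPU work runs.
    raw = await fetch_tile(url)  # I/O awaited on the event loop
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(pool, decode_tile, raw)  # CPU off the loop

async def main() -> list:
    with ThreadPoolExecutor() as pool:
        return await asyncio.gather(read_tile("a", pool), read_tile("b", pool))

tiles = asyncio.run(main())
```

Because the two halves are separate calls, a downstream user can overlap many fetches while bounding CPU concurrency with their own executor.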

Claude suggests there is a multitude of sync/async boundary crossings:

```
MultiBandStacBackendArray.__getitem__       SYNC   ← xarray BackendArray protocol
└── _raw_getitem                            SYNC
    ├── (resolve indexers, build _ChunkReadPlan)            sync, pure CPU
    └── _run_coroutine(_run_mosaic_all_dates(...))          sync→async bridge
        └── _run_mosaic_all_dates             ASYNC
            └── asyncio.gather(_run_one_date per time step)
                └── _run_one_date             ASYNC
                    ├── duckdb_client.search                SYNC, blocking — guarded by threading.Lock
                    └── async_mosaic_chunk    ASYNC
                        ├── per-item: async-geotiff open    ASYNC (network I/O)
                        ├── per-item: async-geotiff read    ASYNC (network I/O)
                        ├── per-item: reproject+mosaic      SYNC, dispatched via run_in_executor
                        └── pixel-selection mosaic logic    SYNC, in-line
```
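For context on the sync→async bridge at the top of that tree, here is a minimal sketch of the common pattern (a background event loop on a dedicated thread); the names `run_coroutine` and `mosaic_all_dates` are stand-ins, not the actual `_run_coroutine`/`_run_mosaic_all_dates` implementations:

```python
import asyncio
import threading

# Background event loop on a daemon thread, so sync callers (e.g. xarray's
# BackendArray protocol) can block on async work without owning a loop.
_loop = asyncio.new_event_loop()
threading.Thread(target=_loop.run_forever, daemon=True).start()

def run_coroutine(coro):
    # Submit to the background loop and block the calling thread on the result.
    return asyncio.run_coroutine_threadsafe(coro, _loop).result()

async def mosaic_all_dates(dates):
    # Stand-in for the async fan-out: one task per time step.
    return await asyncio.gather(*(asyncio.sleep(0, result=d) for d in dates))

# Sync entry point, as in __getitem__:
result = run_coroutine(mosaic_all_dates(["2024-01-01", "2024-02-01"]))
```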

Claude suggests:

Infra — DuckDB-inside-async: even if you exposed async def read_chunk(...), it would still block the event loop during DuckDB calls. A truly async API should at minimum await loop.run_in_executor(None, duckdb_client.search, ...) so the loop yields. (Trivial change, big honesty improvement.)
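A sketch of that fix, with `blocking_search` as a stand-in for the real `duckdb_client.search` (the actual client and signature differ):

```python
import asyncio
import time

def blocking_search(query: str) -> list:
    # Stand-in for duckdb_client.search: synchronous and blocking.
    time.sleep(0.01)
    return [f"item for {query}"]

async def search_async(query: str) -> list:
    # The suggested fix: push the blocking call onto the default executor so
    # the event loop keeps scheduling other coroutines in the meantime.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, blocking_search, query)

async def main():
    # Other coroutines (COG opens/reads) progress while the search runs.
    items, _ = await asyncio.gather(search_async("chunk-0"), asyncio.sleep(0.01))
    return items

items = asyncio.run(main())
```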

Is there prior work on DuckDB threading? Can we put each DuckDB chunk query into a separate thread? Claude says:

DuckDB threading

DuckDB's official thread-safety model:

  • Connections are thread-safe with internal serialization — operations on one connection block each other (single mutex), but the connection won't crash under concurrent access. DuckDB has documented this since 0.5.x.
  • Multiple connections to the same DB file run in parallel — that's the official pattern for parallel Python workloads. (duckdb.connect(...) once per thread, or connection.cursor() for cheap parallel transaction contexts on the same DB.)
  • A single query uses multiple threads internally via DuckDB's parallel executor (SET threads = N).
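The per-thread-connection pattern looks like the following sketch. Stdlib sqlite3 stands in for DuckDB here so the example is self-contained; with DuckDB it would be `duckdb.connect(path)` once per thread, or `connection.cursor()` for a cheap per-thread context on a shared connection:

```python
import sqlite3
import threading

# One connection per worker thread, opened lazily and reused. (sqlite3 is a
# stand-in; each :memory: connection here is an independent database, which
# is fine for illustrating the threading pattern.)
local = threading.local()
results, lock = [], threading.Lock()

def get_conn() -> sqlite3.Connection:
    if not hasattr(local, "conn"):
        local.conn = sqlite3.connect(":memory:")
    return local.conn

def worker(n: int) -> None:
    row = get_conn().execute("SELECT ? + 1", (n,)).fetchone()
    with lock:
        results.append(row[0])

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```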

So the answer to "can we put each duckdb chunk query into a separate thread" is yes, and there are two flavors:

  1. Cheap fix: wrap the existing single client's .search() call in await loop.run_in_executor(None, lambda: client.search(...)). This doesn't get parallelism — the existing threading.Lock (or DuckDB's internal mutex) still serializes the queries — but it does release the event loop so other coroutines (COG opens, reads, reprojections for already-resolved time steps) can progress. Honesty improvement, no parallelism gain.

  2. Real parallelism: a thread-local or per-time-step DuckdbClient pool. Each thread calls DuckdbClient() once and reuses it; queries from different worker threads run truly in parallel (DuckDB's process-level engine handles concurrent connections fine). This requires checking what rustac.DuckdbClient actually does internally — dir() shows only sync methods (search, search_to_arrow, execute, query_to_table, get_collections), no cursor() or per-connection clone. Worth a quick question to the rustac maintainers: does each DuckdbClient instance hold its own connection, and is it safe to construct N of them for the same parquet file? If yes, the path is clear.
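Option 2 could be sketched as below. `FakeDuckdbClient` is hypothetical and stands in for `rustac.DuckdbClient`; the whole sketch assumes the open question above is answered "yes" (one client per thread is safe):

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor

class FakeDuckdbClient:
    # Stand-in for rustac.DuckdbClient; each instance pretends to own its
    # own connection, so instances on different threads query in parallel.
    def search(self, query: str) -> str:
        return f"{query} via thread {threading.get_ident()}"

_local = threading.local()

def _search(query: str) -> str:
    if not hasattr(_local, "client"):
        _local.client = FakeDuckdbClient()  # one client per worker thread
    return _local.client.search(query)

async def main():
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=4) as pool:
        # One per-time-step query per worker thread, all awaited together.
        return await asyncio.gather(
            *(loop.run_in_executor(pool, _search, f"date-{i}") for i in range(4))
        )

results = asyncio.run(main())
```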

A useful data point for the issue: it's worth measuring how much wall-time the per-chunk DuckDB queries actually account for. There's already a logger.debug trace for this in _backend.py:319-325. If DuckDB is <5% of chunk time on typical workloads, fix #1 alone is the right call (cheap, honest, no infra). If it's 20%+, fix #2 starts paying for itself.

One thing to verify before posting: _search_items calls client.search(...) directly inside the coroutine (_backend.py:357-368), under a threading.Lock. The lock is held across the sync call, so even loop.run_in_executor wouldn't gain parallelism without removing it. That doesn't change the conclusion (loop is blocked → fix #1 is right) but it makes the issue tighter to cite the line numbers.

Ideally, I feel there should be an API split between the sync parts and the async parts, but the problem is that shoehorning this into xarray's API doesn't allow for that split: the xarray API is either sync or async.

Perhaps this suggests that lazycogs should have a high-level xarray API plus a separate low-level API with more control? I'm not sure.
