
Prototype: delegation to Dataframe library #8

Closed
alxmrs opened this issue Feb 9, 2024 · 0 comments · Fixed by #32

alxmrs commented Feb 9, 2024

Is there a way to quickly get feature complete with xql without implementing a SQL engine? Beyond using just DataFusion, can we be lazier?

Here’s my pitch: we provide a core operation to translate a chunked xr.Dataset to a dd.DataFrame. Then, we let dask-sql handle the heavy lifting of performing SQL queries.
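
To make the pitch concrete, here is a minimal sketch of what the end-to-end flow could look like. The store path, table name, and column names are illustrative, and `dataset_to_dataframe` is a hypothetical helper (sketched under "Why would this work?" below); the `Context` API is dask-sql's documented entry point.

```python
import xarray as xr
from dask_sql import Context

ds = xr.open_zarr("gs://bucket/era5.zarr")  # hypothetical Zarr store
df = dataset_to_dataframe(ds)  # hypothetical helper; sketched below

c = Context()
c.create_table("era5", df)
result = c.sql("SELECT lat, AVG(t2m) AS mean_t2m FROM era5 GROUP BY lat")
print(result.compute())
```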

Why would this work?

Xarray chunks map directly onto Dask's chunked arrays. Getting that mapping right is the core issue to solve for decent performance, not necessarily the SQL engine. If we use dask.dataframe.io.from_map to translate each xr.Dataset chunk into a deferred DataFrame partition, and the Dask DataFrame has sufficient metadata to compute indexes such that values can be lazily evaluated, then it should work and be performant.
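
A minimal sketch of that conversion, assuming `ds` is chunked and every dimension has a coordinate variable; `iter_chunks`, `chunk_to_frame`, and `dataset_to_dataframe` are hypothetical names, not qarray's actual API:

```python
import itertools

import dask.dataframe as dd
import pandas as pd
import xarray as xr


def iter_chunks(ds: xr.Dataset):
    """Yield one small xr.Dataset per Dask chunk of `ds`."""
    # ds.chunksizes maps each dim to its per-chunk sizes; turn those
    # into slices and take the cartesian product over dims.
    slices = {}
    for dim, sizes in ds.chunksizes.items():
        edges, start = [], 0
        for size in sizes:
            edges.append(slice(start, start + size))
            start += size
        slices[dim] = edges
    for combo in itertools.product(*slices.values()):
        yield ds.isel(dict(zip(slices, combo)))


def chunk_to_frame(chunk: xr.Dataset) -> pd.DataFrame:
    # xarray already knows how to unravel itself into a table;
    # reset_index() turns the dim coordinates into ordinary columns.
    return chunk.to_dataframe().reset_index()


def dataset_to_dataframe(ds: xr.Dataset) -> dd.DataFrame:
    # Each Xarray chunk becomes one deferred DataFrame partition.
    return dd.from_map(chunk_to_frame, list(iter_chunks(ds)))
```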

Why is this preferred?

Let's let the dask-sql project be on the hook for maintenance and for implementing the SQL spec (including hairy cases like nested queries and joins).
Plus, we may get the ability to join raster and tabular data together "for free." This would be a killer use case for xql.

Why is this not preferred?

This ties our implementation to Dask. (Mitigation: maybe we could make the DataFrame transformation use a "portable" dataframe library, like Fugue.) Does coupling the implementation to a dataframe library prevent future plans for distributed Xarray backends (Cubed, xbeam with Stephan's DataFrame refactor)?

Unknowns

  • Would we lose out on efficiencies from a compact data representation (contiguous arrays)?
  • Does the math check out? Specifically, the chunking math: are there obvious corner cases when we perform standard queries (SELECT … WHERE, joins), given that we're translating raster chunks into table chunks? We should be able to model this mathematically (or simulate it in a program); see the worked example after this list.
  • Can we get lazy evaluation when it's needed? It's critical that we avoid loading the whole dataset to perform operations that only use a subset.
  • Will we always fit within memory, per the guarantees of Dask and Xarray? (This is a must for me.) If not, in which cases won't we, and how do we mitigate that?
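
As a first pass at that chunking math, here is a back-of-the-envelope sketch; the chunk shape and dtypes are illustrative assumptions:

```python
import math

# One Dask chunk of shape (time=24, lat=180, lon=360) holding a single
# float64 variable becomes one DataFrame partition.
chunk_shape = {"time": 24, "lat": 180, "lon": 360}
rows = math.prod(chunk_shape.values())  # 1,555,200 rows per partition

# Raster form: only the data values; coordinates stay implicit.
raster_mib = rows * 8 / 2**20  # ~11.9 MiB

# Table form: three 8-byte index columns plus the 8-byte value column.
table_mib = rows * 4 * 8 / 2**20  # ~47.5 MiB, a ~4x inflation

print(f"{rows:,} rows: {raster_mib:.1f} MiB raster -> {table_mib:.1f} MiB table")
```

The partition still fits in memory (it's bounded by the chunk size), but materializing the coordinates as columns inflates it roughly fourfold, which is exactly the compact-representation concern in the first bullet.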

Implementation notes

Assuming we use the implementation defined here in qarray: we need to update the core translate operation to work on a range of data rather than a single index (i.e., vectorized; maybe operate on a whole Dataset at once? These would need to be chunked, of course). Then we would be able to pass the function to `from_map` or `from_delayed`, applied to an iterable of the xr.Dataset chunks. (This should also help provide clarity on the chunking math described above.) A sketch of what the vectorized translate could look like follows.
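
This is a hedged sketch, not qarray's current implementation; it assumes the chunk is already in memory, every dim has a coordinate variable, and every data variable is defined over all dims:

```python
import numpy as np
import pandas as pd
import xarray as xr


def translate(chunk: xr.Dataset) -> pd.DataFrame:
    # Build the full coordinate grid in one shot with meshgrid instead
    # of indexing point by point; memory is bounded by the chunk size.
    dims = list(chunk.dims)
    grids = np.meshgrid(*(chunk[d].values for d in dims), indexing="ij")
    columns = {d: g.ravel() for d, g in zip(dims, grids)}
    for name, var in chunk.data_vars.items():
        # Fix the dim order so ravel() lines up with the grid above.
        columns[name] = var.transpose(*dims).values.ravel()
    return pd.DataFrame(columns)
```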

alxmrs added a commit that referenced this issue Feb 11, 2024
Warning: this is an untested sketch!

This change provides a few humble functions to try to adapt the Xarray model to Dask's dataframe model. The conversion is more or less an itertools.product and index operation. The translation to dataframes honors Xarray's chunks. I've copied weather-tools' `ichunked` function just in case we need that layer of chunking of iterables (it's not used now).

There are a few next steps. From here, we can write unit tests to prove out the conversion to a Dask DataFrame. Further, we can then add dask-sql to this module and see how it works on real SQL queries. I'm pretty sure that before applying `unravel` to `from_map`, we'll need to convert the output to a Pandas dataframe.
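
For reference, the "itertools.product and index operation" formulation is roughly this shape (a hedged sketch, not the committed code; it assumes every dim has a coordinate variable):

```python
import itertools

import xarray as xr


def unravel(chunk: xr.Dataset):
    # Scalar formulation: visit every point in the cartesian product of
    # the dim indexes and pull out its coordinates and values, one row
    # per point. Simple, but slow compared to the vectorized version.
    dims = list(chunk.dims)
    for idx in itertools.product(*(range(chunk.sizes[d]) for d in dims)):
        point = chunk.isel(dict(zip(dims, idx)))
        yield (
            tuple(point.coords[d].item() for d in dims)
            + tuple(point[v].item() for v in chunk.data_vars)
        )
```
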
alxmrs changed the title from "qarray/xql prototype idea." to "Prototype: delegation to Dataframe library" on Feb 17, 2024
alxmrs added a commit that referenced this issue Feb 17, 2024
Here, I put forth a significantly faster implementation of unravel that uses an unbounded amount of memory. Since we use NumPy operations, everything is much faster (due to vectorization).

We won't have to worry about unbounded memory when converting to Dask dataframes, since we default to using Xarray chunks.

This answers some questions in #8.
alxmrs added this to the MVP milestone on Feb 17, 2024
alxmrs closed this as completed in #32 on Mar 3, 2024
alxmrs added a commit that referenced this issue Mar 3, 2024
Fixes #8. Updates project structure and README.