
Prototype: delegation to Dataframe library #8

Closed
alxmrs opened this issue Feb 9, 2024 · 0 comments · Fixed by #32

alxmrs commented Feb 9, 2024

Is there a way to quickly get feature complete with xql without implementing a SQL engine? Beyond using just DataFusion, can we be lazier?

Here’s my pitch: we provide a core operation to translate a chunked xr.Dataset to a dd.DataFrame. Then, we let dask-sql handle the heavy lifting of performing SQL queries.
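
To make the pitch concrete, here is a minimal sketch of what the end-to-end flow could look like. The store path, table name, and column names are illustrative, and `dataset_to_dataframe` is a hypothetical helper (sketched under "Why would this work?" below); the `Context` API is dask-sql's documented entry point.

```python
import xarray as xr
from dask_sql import Context

ds = xr.open_zarr("gs://bucket/era5.zarr")  # hypothetical Zarr store
df = dataset_to_dataframe(ds)  # hypothetical helper; sketched below

c = Context()
c.create_table("era5", df)
result = c.sql("SELECT lat, AVG(t2m) AS mean_t2m FROM era5 GROUP BY lat")
print(result.compute())
```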

Why would this work?

Xarray chunks map directly onto Dask's chunked arrays. Getting that mapping right is the core issue to solve for decent performance, not necessarily the SQL engine. If we use dask.dataframe.io.from_map to translate each xr.Dataset chunk into a deferred DataFrame partition, and the Dask DataFrame has sufficient metadata to compute indexes such that values can be lazily evaluated, then it should work and be performant.
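
A minimal sketch of that conversion, assuming `ds` is chunked and every dimension has a coordinate variable; `iter_chunks`, `chunk_to_frame`, and `dataset_to_dataframe` are hypothetical names, not qarray's actual API:

```python
import itertools

import dask.dataframe as dd
import pandas as pd
import xarray as xr


def iter_chunks(ds: xr.Dataset):
    """Yield one small xr.Dataset per Dask chunk of `ds`."""
    # ds.chunksizes maps each dim to its per-chunk sizes; turn those
    # into slices and take the cartesian product over dims.
    slices = {}
    for dim, sizes in ds.chunksizes.items():
        edges, start = [], 0
        for size in sizes:
            edges.append(slice(start, start + size))
            start += size
        slices[dim] = edges
    for combo in itertools.product(*slices.values()):
        yield ds.isel(dict(zip(slices, combo)))


def chunk_to_frame(chunk: xr.Dataset) -> pd.DataFrame:
    # xarray already knows how to unravel itself into a table;
    # reset_index() turns the dim coordinates into ordinary columns.
    return chunk.to_dataframe().reset_index()


def dataset_to_dataframe(ds: xr.Dataset) -> dd.DataFrame:
    # Each Xarray chunk becomes one deferred DataFrame partition.
    return dd.from_map(chunk_to_frame, list(iter_chunks(ds)))
```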

Why is this preferred?

Let's let the dask-sql project be on the hook for maintenance and for implementing the SQL spec (including hairy cases like nested queries and joins).
Plus, we may get the ability to join raster and tabular data together "for free." This would be a killer use case for xql.

Why is this not preferred?

This ties our implementation to Dask. (Mitigation: maybe we could make the DataFrame transformation use a "portable" dataframe library, like Fugue.) Does coupling the implementation to a dataframe library prevent future plans for distributed Xarray backends (Cubed, xbeam with Stephan's DataFrame refactor)?

Unknowns

  • Would we lose out on efficiencies from a compact data representation (contiguous arrays)?
  • Does the math check out? Specifically, the chunking math: are there obvious corner cases when we perform standard queries (SELECT … WHERE, joins), given that we're translating raster chunks into table chunks? We should be able to model this mathematically (or simulate it in a program); see the worked example after this list.
  • Can we get lazy evaluation when it's needed? It's critical that we avoid loading the whole dataset to perform operations that only use a subset.
  • Will we always fit within memory, per the guarantees of Dask and Xarray? (This is a must for me.) If not, in which cases won't we, and how do we mitigate that?
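
As a first pass at that chunking math, here is a back-of-the-envelope sketch; the chunk shape and dtypes are illustrative assumptions:

```python
import math

# One Dask chunk of shape (time=24, lat=180, lon=360) holding a single
# float64 variable becomes one DataFrame partition.
chunk_shape = {"time": 24, "lat": 180, "lon": 360}
rows = math.prod(chunk_shape.values())  # 1,555,200 rows per partition

# Raster form: only the data values; coordinates stay implicit.
raster_mib = rows * 8 / 2**20  # ~11.9 MiB

# Table form: three 8-byte index columns plus the 8-byte value column.
table_mib = rows * 4 * 8 / 2**20  # ~47.5 MiB, a ~4x inflation

print(f"{rows:,} rows: {raster_mib:.1f} MiB raster -> {table_mib:.1f} MiB table")
```

The partition still fits in memory (it's bounded by the chunk size), but materializing the coordinates as columns inflates it roughly fourfold, which is exactly the compact-representation concern in the first bullet.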

Implementation notes

Assuming we use the implementation defined here in qarray: we need to update the core translate operation to work on a range of data rather than a single index (i.e., vectorized; maybe operate on a whole Dataset at once? These would need to be chunked, of course). Then we would be able to pass the function to `from_map` or `from_delayed`, applied to an iterable of the xr.Dataset chunks. (This should also help provide clarity on the chunking math described above.) A sketch of what the vectorized translate could look like follows.
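
This is a hedged sketch, not qarray's current implementation; it assumes the chunk is already in memory, every dim has a coordinate variable, and every data variable is defined over all dims:

```python
import numpy as np
import pandas as pd
import xarray as xr


def translate(chunk: xr.Dataset) -> pd.DataFrame:
    # Build the full coordinate grid in one shot with meshgrid instead
    # of indexing point by point; memory is bounded by the chunk size.
    dims = list(chunk.dims)
    grids = np.meshgrid(*(chunk[d].values for d in dims), indexing="ij")
    columns = {d: g.ravel() for d, g in zip(dims, grids)}
    for name, var in chunk.data_vars.items():
        # Fix the dim order so ravel() lines up with the grid above.
        columns[name] = var.transpose(*dims).values.ravel()
    return pd.DataFrame(columns)
```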

alxmrs added a commit that referenced this issue Feb 11, 2024
Warning: this is an untested sketch!

This change provides a few humble functions to try to adapt the Xarray model to Dask's dataframe model. The conversion is more or less an itertools.product and index operation. The translation to dataframes honors Xarray's chunks. I've copied weather-tools' `ichunked` function just in case we need that layer of chunking of iterables (it's not used now).

There are a few next steps. From here, we can write unit tests to prove out the conversion to a Dask DataFrame. Further, we can then add dask-sql to this module and see how it works on real SQL queries. I'm pretty sure that before applying `unravel` to `from_map`, we'll need to convert the output to a Pandas dataframe.
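
For reference, the "itertools.product and index operation" formulation is roughly this shape (a hedged sketch, not the committed code; it assumes every dim has a coordinate variable):

```python
import itertools

import xarray as xr


def unravel(chunk: xr.Dataset):
    # Scalar formulation: visit every point in the cartesian product of
    # the dim indexes and pull out its coordinates and values, one row
    # per point. Simple, but slow compared to the vectorized version.
    dims = list(chunk.dims)
    for idx in itertools.product(*(range(chunk.sizes[d]) for d in dims)):
        point = chunk.isel(dict(zip(dims, idx)))
        yield (
            tuple(point.coords[d].item() for d in dims)
            + tuple(point[v].item() for v in chunk.data_vars)
        )
```
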
alxmrs changed the title from "qarray/xql prototype idea." to "Prototype: delegation to Dataframe library" on Feb 17, 2024
alxmrs added a commit that referenced this issue Feb 17, 2024
Here, I put forth a significantly faster implementation of unravel that uses an unbounded amount of memory. Since we use NumPy operations, everything is much faster (due to vectorization).

We won't have to worry about unbounded memory when converting to Dask dataframes, since we default to using Xarray chunks.

This answers some questions in #8.
alxmrs added this to the MVP milestone on Feb 17, 2024
alxmrs closed this as completed in #32 on Mar 3, 2024
alxmrs added a commit that referenced this issue Mar 3, 2024
Fixes #8. Updates project structure and README.