Prototype: delegation to Dataframe library #8
alxmrs added a commit that referenced this issue on Feb 11, 2024:
Warning: this is an untested sketch! This change provides a few humble functions to try to adapt the Xarray model to Dask's dataframe model. The conversion is more or less an itertools.product and index operation. The translation to dataframes honors Xarray's chunks. I've copied weather-tools' `ichunked` function just in case we need that layer of chunking of iterables (it's not used now). There are a few next steps. From here, we can write unit tests to prove out the conversion to Dask Dataframe. Further, we can then add dask-sql to this module and see how it works on real SQL queries. I'm pretty sure before applying `unravel` to `from_map`, we'll need to convert the output to a Pandas dataframe.
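The "itertools.product and index operation" described above can be sketched roughly like this. The name `unravel` is taken from the commit message, but the body below is a hypothetical reconstruction, not the actual code:

```python
import itertools

import pandas as pd
import xarray as xr


def unravel(ds: xr.Dataset) -> pd.DataFrame:
    """Flatten a Dataset: one row per point in the product of dimension indexes."""
    dims = list(ds.dims)
    rows = []
    # itertools.product over every combination of dimension indexes,
    # then an index (isel) operation to pull the values at that point.
    for idxs in itertools.product(*(range(ds.sizes[d]) for d in dims)):
        point = ds.isel(dict(zip(dims, idxs)))
        row = {d: point.coords[d].item() for d in dims if d in point.coords}
        row.update({name: point[name].item() for name in ds.data_vars})
        rows.append(row)
    return pd.DataFrame(rows)
```

The per-point Python loop is the obvious bottleneck, which is what the faster vectorized follow-up below addresses.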
This was referenced Feb 17, 2024
alxmrs added a commit that referenced this issue on Feb 17, 2024:
Here, I put forth a significantly faster implementation of unravel that uses an unbounded amount of memory. Since we use NumPy operations, everything is much faster (due to vectorization). We won't have to worry about unbounded memory when converting to Dask dataframes, since we default to use Xarray chunks. This answers some questions in #8.
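A vectorized version of the same idea might look like the sketch below (an illustration of the trade-off described above, not the actual commit). It materializes the full coordinate product with NumPy, so memory use grows with the whole chunk, and it assumes every dimension has a coordinate variable:

```python
import numpy as np
import pandas as pd
import xarray as xr


def unravel_vectorized(ds: xr.Dataset) -> pd.DataFrame:
    """Flatten a Dataset with NumPy ops instead of a per-point Python loop."""
    dims = list(ds.dims)
    # Build the full coordinate product in one shot: fast (vectorized), but the
    # intermediate grids use memory proportional to the whole chunk.
    grids = np.meshgrid(*(ds[d].values for d in dims), indexing="ij")
    columns = {d: g.ravel() for d, g in zip(dims, grids)}
    for name, var in ds.data_vars.items():
        columns[name] = var.transpose(*dims).values.ravel()
    return pd.DataFrame(columns)
```

Since the conversion to Dask dataframes defaults to Xarray chunks, the "unbounded" memory is in practice bounded by the chunk size, as the comment notes.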
alxmrs added a commit that referenced this issue on Mar 3, 2024:
Fixes #8. Updates project structure and README.
Is there a way to quickly get feature complete with xql without implementing a SQL engine? Beyond using just DataFusion, can we be lazier?
Here’s my pitch: we provide a core operation to translate a chunked xr.Dataset to a dd.DataFrame. Then, we let dask-sql handle the heavy lifting of performing SQL queries.
Why would this work?
Xarray chunks have an analog in Dask's chunked arrays. Chunking is the core issue to solve to get decent performance, not necessarily the SQL engine. If we use dask.dataframe.io.from_map to translate each xr.Dataset chunk into a deferred dataframe partition, and the dask DataFrame has sufficient metadata to compute indexes s.t. values can be lazily evaluated, then it should work and be performant.
Why is this preferred?
Let’s let the dask-sql project be on the hook for maintenance, for implementing the SQL spec (including hairy cases like nested queries and joins).
Plus, we may get the ability to join raster and tabular data together “for free.” This would be a killer use case for xql.
Why is this not preferred?
This ties our implementation to Dask. (Mitigation: maybe we could make the Dataframe transformation use a “portable” dataframe library, like Fugue.) Does coupling the implementation to a dataframe prevent future plans for distributed Xarray backends (Cubed, xbeam with Stephan’s Dataframe refactor)?
Unknowns
Implementation notes
Assuming we use the implementation defined here in qarray: we need to update the core translate operation to work on a range of data rather than a single index (i.e., vectorized; maybe operate on a whole Dataset at once? These need to be chunked, of course). Then, we would be able to pass the function to “from_map” or “from_delayed”, applied to an iterable of the xr.Dataset chunks. (This should also help provide clarity on the chunking math described above.)
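One way the range-based, vectorized translate could look (a hypothetical helper, using np.unravel_index to map flat row numbers back to per-dimension indexes):

```python
import numpy as np
import pandas as pd
import xarray as xr


def translate_range(ds: xr.Dataset, start: int, stop: int) -> pd.DataFrame:
    """Translate a contiguous range of flat row indexes, vectorized with NumPy."""
    dims = list(ds.dims)
    shape = tuple(ds.sizes[d] for d in dims)
    # Map flat row numbers back to per-dimension indexes in one shot,
    # instead of translating one index at a time.
    idxs = np.unravel_index(np.arange(start, stop), shape)
    cols = {d: ds[d].values[i] for d, i in zip(dims, idxs)}
    for name, var in ds.data_vars.items():
        cols[name] = var.transpose(*dims).values[idxs]
    return pd.DataFrame(cols)
```

A per-chunk `(start, stop)` pair derived from Xarray's chunk sizes would then be the natural argument to feed each partition in “from_map”.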