Polars DataFrame Support #7228
@ritchie46 and I spent a couple of days and put together a very brief POC on this over at https://github.com/pola-rs/dask-polars We paused due to lack of users asking for it. It's nice to see users ask for it. Can I ask your motivation here? Is it mostly data access / serialization pain, or are you mostly looking for performance?
Performance is the main motivation.
What Polars lacks is distributed execution, out-of-core/memory support, and the ability to work with other Python objects, which are areas where Dask excels.
I would be surprised by this. Pandas is also column-major. Also, in-memory flips like this are rarely performance bottlenecks. Reading from disk/S3 is likely to be slower.
Makes sense. This should end up working in pandas + dask soon regardless.
Dask compresses all data sent across the network with fast compression, if available.
Yeah, this would be good. Dask should also get something like this sometime next year, but it'll never do the hardcore multiple-operations-in-single-pass-over-memory stuff that I suspect Polars does. Dask+Polars is doable and would be fun; I'm game to support something like this. My guess, though, is that it'll be faster to get Dask+Pandas to a point that meets most users' needs. The things you bring up are valid, but also mostly handled or being handled actively (with the exception of query planning).
Out-of-core/memory support for most queries is only a month or two away 🙂: pola-rs/polars#5339, pola-rs/polars#5139. Distributed, not so much. I think Dask+Polars has the potential to really save cloud costs, as Polars really is able to utilize a node's potential with regard to query time, with low overhead.
Several of these concepts are implemented in Dask-SQL, but plan optimization will be a work in progress for some time. I wonder if being able to swap Pandas for Polars in Dask would yield some improvements for Dask-SQL queries.
I am working a lot with fairly large Delta Tables (from 50GB to 40TB). I recently tried to read and print a sample table (90GB) with dask, polars and delta-rs on a single machine (32 Cores). These were my findings:
I tried many different variations, but Polars seems to be doing something better here.
Just throwing my use case out there: I am processing a lot of modest-sized parquet files (5-100MB) in parallel due to I/O blocking, and would love to distribute that work across a Dask cluster while benefiting from Polars' robust performance and memory characteristics.
Polars is a single-machine columnar dataframe library built on top of Arrow, with both an eager and a lazy API. While the default pickle serialization works fine for eagerly created DataFrames, performance could be improved by leveraging the same IPC serialization that pyarrow Tables currently have. Additionally, LazyFrames are not currently serializable, but they contain an underlying JSON plan that could be passed over the wire.