Polars DataFrame Support #7228

th3ed opened this issue Oct 30, 2022 · 7 comments · May be fixed by #7229
@th3ed

th3ed commented Oct 30, 2022

Polars is a single-machine columnar dataframe library built on top of pyarrow Tables, with both an eager and a lazy API. While the default pickle serialization works fine for eagerly-created dataframes, performance could be improved by leveraging the same IPC serialization that pyarrow Tables currently have. Additionally, LazyFrames are not currently serializable, but they contain an underlying JSON plan that could be passed over the wire.
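As a rough illustration of the idea above, here is a minimal sketch (not from the issue, and not necessarily how the linked PR does it) of registering Arrow-IPC-based serialization for Polars DataFrames, assuming distributed's dask_serialize/dask_deserialize dispatch and Polars' write_ipc/read_ipc:

```python
import io

import polars as pl
from distributed.protocol import dask_serialize, dask_deserialize


@dask_serialize.register(pl.DataFrame)
def serialize_polars(df):
    # Serialize via Arrow IPC instead of pickle.
    buf = io.BytesIO()
    df.write_ipc(buf)
    header = {}                # no extra metadata needed for this sketch
    frames = [buf.getvalue()]
    return header, frames


@dask_deserialize.register(pl.DataFrame)
def deserialize_polars(header, frames):
    # Reconstruct the DataFrame from the IPC bytes.
    return pl.read_ipc(io.BytesIO(frames[0]))
```

A LazyFrame could in principle be handled the same way by shipping its JSON plan, but that part is left out of this sketch.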

@th3ed th3ed linked a pull request Oct 30, 2022 that will close this issue
@mrocklin
Member

@ritchie46 and I spent a couple days and put together a very brief POC on this over at https://github.com/pola-rs/dask-polars

We paused due to lack of users asking for it. It's nice to see users ask for it.

Can I ask your motivation here? Is it mostly data access / serialization pain, or are you mostly looking for performance?

@th3ed
Author

th3ed commented Oct 31, 2022

Performance is the main motivation:

  • most of the data we work with is in parquet format where conversion from columnar to row-major pandas format tends to be a major time and memory bottleneck both in reading and writing these files
  • native support for the nullable spark dtypes present in our parquet files
  • better compression for columnar data means less network transfer in the cluster
  • Polars has a nice lazy API that includes query-plan optimization and robust column and row pushdown filtering. While dask currently supports column pushdowns, there might be an opportunity to leverage polars optimized query plans here (kind of like how dask-sql uses datafusion for the same purpose)

What polars lacks are distributed execution, out-of-core/out-of-memory support, and the ability to work with other Python objects, which are areas where dask excels.
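For context on the lazy API referenced above, here is a small illustrative sketch (not from the thread; the file and column names are assumptions) of a pushdown-aware Polars query:

```python
import polars as pl

# Build a lazy query; nothing is read yet. Polars' optimizer can push the
# filter (row pushdown) and the select (column pushdown) into the parquet scan.
lazy = (
    pl.scan_parquet("data.parquet")   # "data.parquet" is illustrative
    .filter(pl.col("value") > 0)
    .select(["id", "value"])
)
df = lazy.collect()  # execute the optimized plan
```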

@mrocklin
Member

most of the data we work with is in parquet format where conversion from columnar to row-major pandas format tends to be a major time and memory bottleneck both in reading and writing these files

I would be surprised by this. Pandas is also column-major. Also, in-memory flips like this are rarely performance bottlenecks. Reading from disk/S3 is likely to be slower.

native support for the nullable spark dtypes present in our parquet files

Makes sense. This should end up working in pandas + dask soon regardless.

better compression for columnar data means less network transfer in the cluster

Dask compresses all data sent across the network with fast compression, if available.
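For reference, a minimal sketch (not from the thread) of the relevant configuration knob; the "distributed.comm.compression" key controls wire compression, and "lz4" is only an example value that requires the lz4 package:

```python
import dask

# Inspect the current wire-compression setting (defaults to automatic selection).
print(dask.config.get("distributed.comm.compression"))

# Pin a specific compressor, e.g. lz4, assuming it is installed.
dask.config.set({"distributed.comm.compression": "lz4"})
```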

Polars has a nice lazy API that includes query-plan optimization and robust column and row pushdown filtering. While dask currently supports column pushdowns, there might be an opportunity to leverage polars optimized query plans here (kind of like how dask-sql uses datafusion for the same purpose)

Yeah, this would be good. Dask should also get something like this sometime next year, but it'll never do the hardcore multiple-operations-in-single-pass-over-memory stuff that I suspect Polars does.

Dask+Polars is doable and would be fun. I'm game to support something like this. My guess, though, is that it'll be faster to get Dask+Pandas to a point that meets most users' needs. The things that you bring up are valid, but also mostly handled or being handled actively (with the exception of query planning).

@ritchie46

ritchie46 commented Oct 31, 2022

What polars lacks are distributed execution, out-of-core/out-of-memory support, and the ability to work with other Python objects, which are areas where dask excels.

Out-of-core/out-of-memory execution of most queries is only a month or two away 🙂: pola-rs/polars#5339, pola-rs/polars#5139.

Distributed execution, not so much. I think dask+polars has the potential to really save cloud costs, as polars is really able to utilize a node's potential with regard to query time, with low overhead.
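A hedged sketch of what the out-of-core work referenced above looks like from Python; collect(streaming=True) is how later Polars releases expose the streaming engine, so the exact flag is an assumption relative to the versions discussed here, and the file and column names are illustrative:

```python
import polars as pl

# Lazily scan a file larger than RAM and aggregate it in streaming batches.
result = (
    pl.scan_parquet("large_dataset.parquet")
    .filter(pl.col("amount") > 0)
    .group_by("customer_id")          # named groupby() in older Polars releases
    .agg(pl.col("amount").sum())
    .collect(streaming=True)          # out-of-core execution
)
```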

@randerzander

Polars has a nice lazy API that includes query-plan optimization and robust column and row pushdown filtering. While dask currently supports column pushdowns, there might be an opportunity to leverage polars optimized query plans here (kind of like how dask-sql uses datafusion for the same purpose)

Several of these concepts are implemented in Dask-SQL, but plan optimization will be a work in progress for some time. I wonder if being able to swap Pandas for Polars in Dask would yield some improvements for Dask-SQL queries.

@glxplz

glxplz commented Nov 3, 2022

I am working a lot with fairly large Delta Tables (from 50GB to 40TB). I recently tried to read and print a sample table (90GB) with dask, polars and delta-rs on a single machine (32 Cores).

These were my findings:

| Version                      | Time in s | Max Mem. in GiB |
|------------------------------|-----------|-----------------|
| dask                         | 234.23    | ~105            |
| dask (scheduler="processes") | 134.3     | ~102            |
| polars                       | 21.23     | ~95             |
| polars (+to_pandas)          | 40.11     | ~110            |
| delta-rs (to_pandas)         | 235.28    | ~100            |

Tried many different variations but polars seems to be doing something better here.
At least on single machines these results convinced me to switch a number of my processes from pandas to polars.
I would love to see how this would perform in a distributed setup... which would be much more beneficial for my use cases.

@eddie-atkinson

Just throwing my use case out there: I am processing a lot of modest-sized parquet files (5-100 MB) in parallel because of IO blocking, and would love to distribute that work across a Dask cluster whilst benefiting from polars' robust performance and memory characteristics.
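As an illustration of that pattern, here is a minimal sketch (not from the thread) of fanning per-file Polars work out over a Dask cluster with dask.delayed; the paths, column name, and the body of process_file are assumptions:

```python
import dask
import polars as pl
from dask.distributed import Client


@dask.delayed
def process_file(path):
    # Read one modest-sized parquet file with Polars and do some work on it.
    df = pl.read_parquet(path)
    return df.filter(pl.col("value") > 0).height  # e.g. count matching rows


if __name__ == "__main__":
    client = Client()  # connect to (or start) a Dask cluster
    paths = [f"data/part-{i}.parquet" for i in range(100)]
    counts = dask.compute(*[process_file(p) for p in paths])
    print(sum(counts))
```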
