Polars DataFrame Support #7228

th3ed opened this issue Oct 30, 2022 · 7 comments · May be fixed by #7229
@th3ed

th3ed commented Oct 30, 2022

Polars is a single-machine columnar dataframe library built on top of pyarrow Tables, with both an eager and a lazy API. While the default pickle serialization works fine for eagerly-created dataframes, performance could be improved by leveraging the same IPC serialization that pyarrow Tables currently have. Additionally, LazyFrames are not currently serializable, but they contain an underlying JSON plan that could be passed over the wire.
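As a rough illustration of the idea above, here is a minimal sketch (not from the issue, and not necessarily how the linked PR does it) of registering Arrow-IPC-based serialization for Polars DataFrames, assuming distributed's dask_serialize/dask_deserialize dispatch and Polars' write_ipc/read_ipc:

```python
import io

import polars as pl
from distributed.protocol import dask_serialize, dask_deserialize


@dask_serialize.register(pl.DataFrame)
def serialize_polars(df):
    # Serialize via Arrow IPC instead of pickle.
    buf = io.BytesIO()
    df.write_ipc(buf)
    header = {}                # no extra metadata needed for this sketch
    frames = [buf.getvalue()]
    return header, frames


@dask_deserialize.register(pl.DataFrame)
def deserialize_polars(header, frames):
    # Reconstruct the DataFrame from the IPC bytes.
    return pl.read_ipc(io.BytesIO(frames[0]))
```

A LazyFrame could in principle be handled the same way by shipping its JSON plan, but that part is left out of this sketch.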

@th3ed th3ed linked a pull request Oct 30, 2022 that will close this issue
@mrocklin
Member

@ritchie46 and I spent a couple days and put together a very brief POC on this over at https://github.com/pola-rs/dask-polars

We paused due to lack of users asking for it. It's nice to see users ask for it.

Can I ask your motivation here? Is it mostly data access / serialization pain, or are you mostly looking for performance?

@th3ed
Author

th3ed commented Oct 31, 2022

Performance is the main motivation:

  • most of the data we work with is in parquet format where conversion from columnar to row-major pandas format tends to be a major time and memory bottleneck both in reading and writing these files
  • native support for the nullable spark dtypes present in our parquet files
  • better compression for columnar data means less network transfer in the cluster
  • Polars has a nice lazy API that includes query-plan optimization and robust column and row pushdown filtering. While dask currently supports column pushdowns, there might be an opportunity to leverage polars optimized query plans here (kind of like how dask-sql uses datafusion for the same purpose)

What polars lacks are distributed execution, out-of-core/out-of-memory support, and the ability to work with other Python objects, which are areas where dask excels.
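For context on the lazy API referenced above, here is a small illustrative sketch (not from the thread; the file and column names are assumptions) of a pushdown-aware Polars query:

```python
import polars as pl

# Build a lazy query; nothing is read yet. Polars' optimizer can push the
# filter (row pushdown) and the select (column pushdown) into the parquet scan.
lazy = (
    pl.scan_parquet("data.parquet")   # "data.parquet" is illustrative
    .filter(pl.col("value") > 0)
    .select(["id", "value"])
)
df = lazy.collect()  # execute the optimized plan
```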

@mrocklin
Member

most of the data we work with is in parquet format where conversion from columnar to row-major pandas format tends to be a major time and memory bottleneck both in reading and writing these files

I would be surprised by this. Pandas is also column-major. Also, in-memory flips like this are rarely performance bottlenecks. Reading from disk/S3 is likely to be slower.

native support for the nullable spark dtypes present in our parquet files

Makes sense. This should end up working in pandas + dask soon regardless.

better compression for columnar data means less network transfer in the cluster

Dask compresses all data sent across the network with fast compression, if available.
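For reference, a minimal sketch (not from the thread) of the relevant configuration knob; the "distributed.comm.compression" key controls wire compression, and "lz4" is only an example value that requires the lz4 package:

```python
import dask

# Inspect the current wire-compression setting (defaults to automatic selection).
print(dask.config.get("distributed.comm.compression"))

# Pin a specific compressor, e.g. lz4, assuming it is installed.
dask.config.set({"distributed.comm.compression": "lz4"})
```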

Polars has a nice lazy API that includes query-plan optimization and robust column and row pushdown filtering. While dask currently supports column pushdowns, there might be an opportunity to leverage polars optimized query plans here (kind of like how dask-sql uses datafusion for the same purpose)

Yeah, this would be good. Dask should also get something like this sometime next year, but it'll never do the hardcore multiple-operations-in-single-pass-over-memory stuff that I suspect Polars does.

Dask+Polars is doable and would be fun. I'm game to support something like this. My guess, though, is that it'll be faster to get Dask+Pandas to a point that meets most users' needs. The things that you bring up are valid, but also mostly handled or being handled actively (with the exception of query planning).

@ritchie46

ritchie46 commented Oct 31, 2022

What polars lacks are distributed execution, out-of-core/out-of-memory support, and the ability to work with other Python objects, which are areas where dask excels.

Out-of-core/out-of-memory execution of most queries is only a month or two away 🙂: pola-rs/polars#5339, pola-rs/polars#5139.

Distributed execution, not so much. I think dask+polars has the potential to really save cloud costs, as polars is really able to utilize a node's potential with regard to query time, with low overhead.
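A hedged sketch of what the out-of-core work referenced above looks like from Python; collect(streaming=True) is how later Polars releases expose the streaming engine, so the exact flag is an assumption relative to the versions discussed here, and the file and column names are illustrative:

```python
import polars as pl

# Lazily scan a file larger than RAM and aggregate it in streaming batches.
result = (
    pl.scan_parquet("large_dataset.parquet")
    .filter(pl.col("amount") > 0)
    .group_by("customer_id")          # named groupby() in older Polars releases
    .agg(pl.col("amount").sum())
    .collect(streaming=True)          # out-of-core execution
)
```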

@randerzander

Polars has a nice lazy API that includes query-plan optimization and robust column and row pushdown filtering. While dask currently supports column pushdowns, there might be an opportunity to leverage polars optimized query plans here (kind of like how dask-sql uses datafusion for the same purpose)

Several of these concepts are implemented in Dask-SQL, but plan optimization will be a work in progress for some time. I wonder if being able to swap Pandas for Polars in Dask would yield some improvements for Dask-SQL queries.

@glxplz

glxplz commented Nov 3, 2022

I am working a lot with fairly large Delta Tables (from 50GB to 40TB). I recently tried to read and print a sample table (90GB) with dask, polars and delta-rs on a single machine (32 Cores).

These were my findings:

| Version                      | Time in s | Max Mem. in GiB |
|------------------------------|-----------|-----------------|
| dask                         | 234.23    | ~105            |
| dask (scheduler="processes") | 134.3     | ~102            |
| polars                       | 21.23     | ~95             |
| polars (+to_pandas)          | 40.11     | ~110            |
| delta-rs (to_pandas)         | 235.28    | ~100            |

Tried many different variations but polars seems to be doing something better here.
At least on single machines these results convinced me to switch a number of my processes from pandas to polars.
I would love to see how this would perform in a distributed setup... which would be much more beneficial for my use cases.

@eddie-atkinson

Just throwing my use case out there: I am processing a lot of modest-sized parquet files (5-100 MB) in parallel because of IO blocking, and would love to distribute that work across a Dask cluster whilst benefiting from polars' robust performance and memory characteristics.
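As an illustration of that pattern, here is a minimal sketch (not from the thread) of fanning per-file Polars work out over a Dask cluster with dask.delayed; the paths, column name, and the body of process_file are assumptions:

```python
import dask
import polars as pl
from dask.distributed import Client


@dask.delayed
def process_file(path):
    # Read one modest-sized parquet file with Polars and do some work on it.
    df = pl.read_parquet(path)
    return df.filter(pl.col("value") > 0).height  # e.g. count matching rows


if __name__ == "__main__":
    client = Client()  # connect to (or start) a Dask cluster
    paths = [f"data/part-{i}.parquet" for i in range(100)]
    counts = dask.compute(*[process_file(p) for p in paths])
    print(sum(counts))
```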
