Does pyarrow support analytic (dataframe like) functions? #2189

ophiry · 2018-06-28T17:36:58Z

From what I understand, pyarrow is intended to supply (or at least support) 'inplace' analytical functions, similar to pandas.
But from the documentation, I didn't find something more then direct access to the data, or conversion to pandas.

Is there some support for such functions without using pandas (which requires copying the memory)?
specifically I need group by and indexing capabilities

paddyhoran · 2018-06-28T18:06:39Z

@ophiry Arrow will serve as the computational core of pandas eventually. When you say dataframe like functions, I believe that these will always be in pandas, but pandas users can expect to see many improvements due to moving to an Arrow backend.

What is in-scope for Arrow is more low level in this regard, things like query planning and compute graph optimization operating on Arrow memory. gandiva which was started by Dremio is an example of some early work in this area. Discussion regarding merging Gandiva into Arrow is underway and very early bindings (for python) are here.

Just my understanding, @wesm or @xhochy might correct me :)

xhochy · 2018-06-28T21:14:31Z

At the moment the best reference is my prototype repository https://github.com/xhochy/fletcher where I use Arrow Arrays as the memory backend for Pandas columns. You can use Arrow with that in Pandas without copies while getting most of the functionality. There is also some code that uses Numba to get highly efficient analytical functions on Arrow arrays. Note that the code in this repository is highly experimental at the moment.

More information on possible (and already current uses of Arrow) can be found at https://ursalabs.org/tech/

wesm · 2018-07-05T22:51:29Z

Just to add some context from my perspective: I don't really see pandas moving to an Arrow-based backend. There is too much legacy code. About 2.5 years ago we started discussing the idea of "pandas2", an API-incompatible future iteration of pandas. This was around the time that we were starting Apache Arrow, and to me it makes a lot of sense to base the future of in-memory analytics in Python on an open standard for columnar data (i.e. the Arrow columnar format) than something bespoke and custom to Python like we have now.

In the Arrow project, we have strived to be "front end agnostic", which is to say that we aren't really intending to build a user-facing library in the style of pandas in the Arrow Python bindings. The intent is for developers to be able to craft different styles of front ends to suit the needs of particular use cases. For example, for the last few years we have been developing Ibis, a sort of "deferred pandas" system https://github.com/ibis-project/ibis.

I expect work in analytic functions in the style of pandas to happen in this repository to happen over the coming 2-3 years. We need as much help as possible with this; it is likely to be a mix of precompiled C/C++ functions as well as just-in-time compiled LLVM functions.

Reminder as always that I have funding and job openings available in my organization available to work on this (and maintain the Arrow project more generally). Thanks!

pcmoritz · 2018-07-05T23:29:12Z

This is great, thanks for sharing your thoughts!

Given that there are already many people out there who know the pandas API well, do you think it's a good idea to make a subset of the pandas functionality work backed by arrow (say similar to the fletcher project that xhochy linked) and give errors or performance warnings for pandas functionality that can't be implemented in this way (and say perform the operation on a copy), together with instructions to migrate it?

paddyhoran · 2018-07-06T01:57:50Z

Thanks for correcting me @wesm.

I expect work in analytic functions in the style of pandas to happen in this repository to happen over the coming 2-3 years. We need as much help as possible with this; it is likely to be a mix of precompiled C/C++ functions as well as just-in-time compiled LLVM functions.

this respository being ibis, right?

Wouldn't this mean that the scope of Arrow is different for C++ and Python because there is a desire to move Gandiva into the Arrow project, but on the Python side Ibis would serve this purpose?

Apologies if I am mistaken.... again :)

wesm · 2018-07-06T14:47:12Z

@paddyhoran no, I mean this codebase, Apache Arrow. Ibis is a front end-only system; currently it is a front end for SQL-based systems and pandas (which serve as back ends). Analytic function implementations are "back end" concerns.

paddyhoran · 2018-07-06T15:16:37Z

Right, sorry for the confusion, that was my originally understanding. Thanks

buchanae · 2018-07-06T18:02:37Z

We need as much help as possible with this

I'm interested. Can you point me to a good starting point? What's the lowest hanging fruit? I can poke around the codebase and guess, but some guidance would speed things up.

wesm · 2018-07-06T21:41:31Z

You can look in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/compute-test.cc. Note this is the most unrefined part of the codebase, so very little is set in stone

TomScheffers · 2021-02-14T16:36:50Z

For my own projects, I wrote the pyarrow_ops package which performs pandas like operations on the pyarrow.Table directly. Currently it supports join, groupby, filters, drop_duplicates and head operations, but it can be easily extended. Maybe people are interested in extending on this work or optimizing it (for example, by calling more arrow.compute operations directly)? Let me know!

wesm closed this as completed Jul 6, 2018

lhoestq mentioned this issue Feb 4, 2022

Add a GROUP BY operator huggingface/datasets#3644

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Does pyarrow support analytic (dataframe like) functions? #2189

Does pyarrow support analytic (dataframe like) functions? #2189

ophiry commented Jun 28, 2018

paddyhoran commented Jun 28, 2018

xhochy commented Jun 28, 2018

wesm commented Jul 5, 2018

pcmoritz commented Jul 5, 2018 •

edited

Loading

paddyhoran commented Jul 6, 2018

wesm commented Jul 6, 2018

paddyhoran commented Jul 6, 2018

buchanae commented Jul 6, 2018

wesm commented Jul 6, 2018

TomScheffers commented Feb 14, 2021

Does pyarrow support analytic (dataframe like) functions? #2189

Does pyarrow support analytic (dataframe like) functions? #2189

Comments

ophiry commented Jun 28, 2018

paddyhoran commented Jun 28, 2018

xhochy commented Jun 28, 2018

wesm commented Jul 5, 2018

pcmoritz commented Jul 5, 2018 • edited Loading

paddyhoran commented Jul 6, 2018

wesm commented Jul 6, 2018

paddyhoran commented Jul 6, 2018

buchanae commented Jul 6, 2018

wesm commented Jul 6, 2018

TomScheffers commented Feb 14, 2021

pcmoritz commented Jul 5, 2018 •

edited

Loading