Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Does pyarrow support analytic (dataframe like) functions? #2189

Closed
ophiry opened this issue Jun 28, 2018 · 10 comments
Closed

Does pyarrow support analytic (dataframe like) functions? #2189

ophiry opened this issue Jun 28, 2018 · 10 comments

Comments

@ophiry
Copy link

ophiry commented Jun 28, 2018

From what I understand, pyarrow is intended to supply (or at least support) 'inplace' analytical functions, similar to pandas.
But from the documentation, I didn't find something more then direct access to the data, or conversion to pandas.

Is there some support for such functions without using pandas (which requires copying the memory)?
specifically I need group by and indexing capabilities

@paddyhoran
Copy link
Contributor

@ophiry Arrow will serve as the computational core of pandas eventually. When you say dataframe like functions, I believe that these will always be in pandas, but pandas users can expect to see many improvements due to moving to an Arrow backend.

What is in-scope for Arrow is more low level in this regard, things like query planning and compute graph optimization operating on Arrow memory. gandiva which was started by Dremio is an example of some early work in this area. Discussion regarding merging Gandiva into Arrow is underway and very early bindings (for python) are here.

Just my understanding, @wesm or @xhochy might correct me :)

@xhochy
Copy link
Member

xhochy commented Jun 28, 2018

At the moment the best reference is my prototype repository https://github.com/xhochy/fletcher where I use Arrow Arrays as the memory backend for Pandas columns. You can use Arrow with that in Pandas without copies while getting most of the functionality. There is also some code that uses Numba to get highly efficient analytical functions on Arrow arrays. Note that the code in this repository is highly experimental at the moment.

More information on possible (and already current uses of Arrow) can be found at https://ursalabs.org/tech/

@wesm
Copy link
Member

wesm commented Jul 5, 2018

Just to add some context from my perspective: I don't really see pandas moving to an Arrow-based backend. There is too much legacy code. About 2.5 years ago we started discussing the idea of "pandas2", an API-incompatible future iteration of pandas. This was around the time that we were starting Apache Arrow, and to me it makes a lot of sense to base the future of in-memory analytics in Python on an open standard for columnar data (i.e. the Arrow columnar format) than something bespoke and custom to Python like we have now.

In the Arrow project, we have strived to be "front end agnostic", which is to say that we aren't really intending to build a user-facing library in the style of pandas in the Arrow Python bindings. The intent is for developers to be able to craft different styles of front ends to suit the needs of particular use cases. For example, for the last few years we have been developing Ibis, a sort of "deferred pandas" system https://github.com/ibis-project/ibis.

I expect work in analytic functions in the style of pandas to happen in this repository to happen over the coming 2-3 years. We need as much help as possible with this; it is likely to be a mix of precompiled C/C++ functions as well as just-in-time compiled LLVM functions.

Reminder as always that I have funding and job openings available in my organization available to work on this (and maintain the Arrow project more generally). Thanks!

@pcmoritz
Copy link
Contributor

pcmoritz commented Jul 5, 2018

This is great, thanks for sharing your thoughts!

Given that there are already many people out there who know the pandas API well, do you think it's a good idea to make a subset of the pandas functionality work backed by arrow (say similar to the fletcher project that xhochy linked) and give errors or performance warnings for pandas functionality that can't be implemented in this way (and say perform the operation on a copy), together with instructions to migrate it?

@paddyhoran
Copy link
Contributor

Thanks for correcting me @wesm.

I expect work in analytic functions in the style of pandas to happen in this repository to happen over the coming 2-3 years. We need as much help as possible with this; it is likely to be a mix of precompiled C/C++ functions as well as just-in-time compiled LLVM functions.

this respository being ibis, right?

Wouldn't this mean that the scope of Arrow is different for C++ and Python because there is a desire to move Gandiva into the Arrow project, but on the Python side Ibis would serve this purpose?

Apologies if I am mistaken.... again :)

@wesm
Copy link
Member

wesm commented Jul 6, 2018

@paddyhoran no, I mean this codebase, Apache Arrow. Ibis is a front end-only system; currently it is a front end for SQL-based systems and pandas (which serve as back ends). Analytic function implementations are "back end" concerns.

@paddyhoran
Copy link
Contributor

Right, sorry for the confusion, that was my originally understanding. Thanks

@buchanae
Copy link

buchanae commented Jul 6, 2018

We need as much help as possible with this

I'm interested. Can you point me to a good starting point? What's the lowest hanging fruit? I can poke around the codebase and guess, but some guidance would speed things up.

@wesm
Copy link
Member

wesm commented Jul 6, 2018

You can look in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/compute-test.cc. Note this is the most unrefined part of the codebase, so very little is set in stone

@wesm wesm closed this as completed Jul 6, 2018
@TomScheffers
Copy link

For my own projects, I wrote the pyarrow_ops package which performs pandas like operations on the pyarrow.Table directly. Currently it supports join, groupby, filters, drop_duplicates and head operations, but it can be easily extended. Maybe people are interested in extending on this work or optimizing it (for example, by calling more arrow.compute operations directly)? Let me know!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants