-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Does pyarrow support analytic (dataframe like) functions? #2189
Comments
@ophiry Arrow will serve as the computational core of pandas eventually. When you say dataframe like functions, I believe that these will always be in pandas, but pandas users can expect to see many improvements due to moving to an Arrow backend. What is in-scope for Arrow is more low level in this regard, things like query planning and compute graph optimization operating on Arrow memory. gandiva which was started by Dremio is an example of some early work in this area. Discussion regarding merging Gandiva into Arrow is underway and very early bindings (for python) are here. |
At the moment the best reference is my prototype repository https://github.com/xhochy/fletcher where I use Arrow Arrays as the memory backend for Pandas columns. You can use Arrow with that in Pandas without copies while getting most of the functionality. There is also some code that uses Numba to get highly efficient analytical functions on Arrow arrays. Note that the code in this repository is highly experimental at the moment. More information on possible (and already current uses of Arrow) can be found at https://ursalabs.org/tech/ |
Just to add some context from my perspective: I don't really see pandas moving to an Arrow-based backend. There is too much legacy code. About 2.5 years ago we started discussing the idea of "pandas2", an API-incompatible future iteration of pandas. This was around the time that we were starting Apache Arrow, and to me it makes a lot of sense to base the future of in-memory analytics in Python on an open standard for columnar data (i.e. the Arrow columnar format) than something bespoke and custom to Python like we have now. In the Arrow project, we have strived to be "front end agnostic", which is to say that we aren't really intending to build a user-facing library in the style of pandas in the Arrow Python bindings. The intent is for developers to be able to craft different styles of front ends to suit the needs of particular use cases. For example, for the last few years we have been developing Ibis, a sort of "deferred pandas" system https://github.com/ibis-project/ibis. I expect work in analytic functions in the style of pandas to happen in this repository to happen over the coming 2-3 years. We need as much help as possible with this; it is likely to be a mix of precompiled C/C++ functions as well as just-in-time compiled LLVM functions. Reminder as always that I have funding and job openings available in my organization available to work on this (and maintain the Arrow project more generally). Thanks! |
This is great, thanks for sharing your thoughts! Given that there are already many people out there who know the pandas API well, do you think it's a good idea to make a subset of the pandas functionality work backed by arrow (say similar to the fletcher project that xhochy linked) and give errors or performance warnings for pandas functionality that can't be implemented in this way (and say perform the operation on a copy), together with instructions to migrate it? |
Thanks for correcting me @wesm.
Wouldn't this mean that the scope of Arrow is different for C++ and Python because there is a desire to move Gandiva into the Arrow project, but on the Python side Ibis would serve this purpose? Apologies if I am mistaken.... again :) |
@paddyhoran no, I mean this codebase, Apache Arrow. Ibis is a front end-only system; currently it is a front end for SQL-based systems and pandas (which serve as back ends). Analytic function implementations are "back end" concerns. |
Right, sorry for the confusion, that was my originally understanding. Thanks |
I'm interested. Can you point me to a good starting point? What's the lowest hanging fruit? I can poke around the codebase and guess, but some guidance would speed things up. |
You can look in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/compute-test.cc. Note this is the most unrefined part of the codebase, so very little is set in stone |
For my own projects, I wrote the pyarrow_ops package which performs pandas like operations on the pyarrow.Table directly. Currently it supports join, groupby, filters, drop_duplicates and head operations, but it can be easily extended. Maybe people are interested in extending on this work or optimizing it (for example, by calling more arrow.compute operations directly)? Let me know! |
From what I understand, pyarrow is intended to supply (or at least support) 'inplace' analytical functions, similar to pandas.
But from the documentation, I didn't find something more then direct access to the data, or conversion to pandas.
Is there some support for such functions without using pandas (which requires copying the memory)?
specifically I need group by and indexing capabilities
The text was updated successfully, but these errors were encountered: