Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updating arrow2 branch #2709

Closed
dbr opened this issue Jun 8, 2022 · 23 comments
Closed

Updating arrow2 branch #2709

dbr opened this issue Jun 8, 2022 · 23 comments
Labels
bug Something isn't working

Comments

@dbr
Copy link

dbr commented Jun 8, 2022

The arrow2 branch is currently using v0.10.0 - I'd like to update it to work with v0.12.0. Is there anything I should know before starting this?

I ask as last time I made a PR for the branch, there was an impending rework of that branch, and I'm not sure what the state of things are currently!

@dbr dbr added the bug Something isn't working label Jun 8, 2022
@andygrove
Copy link
Member

andygrove commented Jun 8, 2022

Would it make sense to consider having a separate, optional, datafusion-arrow2 crate containing a physical planner + operators that use arrow2? It seems to me that this would be easier to maintain. Maybe @houqp has an opinion on this?

@alamb
Copy link
Contributor

alamb commented Jun 8, 2022

cc @Igosuki as well who may have some thoughts

@Igosuki
Copy link
Contributor

Igosuki commented Jun 12, 2022

Hello, sorry for my absence. Maintaining full compatibility requires constant updates, and updating the arrow2 branch is often in the hundred files because of the API surface conflict. So adding a new crate where only arrow2 itself needs to be updated is a good idea in my opinion.

@houqp
Copy link
Member

houqp commented Jun 13, 2022

+1 on @andygrove 's idea for isolating changes to a self-contained crate. @dbr if you are interested in that, please feel free to start that project at the datafusion contrib github org. If you are only interested in updating arrow2 version in the current arrow2 branch, then you can just send a PR to the arrow2 branch :)

@andygrove
Copy link
Member

Should we have a discussion/vote about whether this should be in datafusion-contrib rather than under Arrow governance? There are the options of creating a new repo such as arrow-datafusion-arrow2 or a new optional crate within the main repo. The work on this so far has been under Arrow governance so we should make sure that all contributors are in agreement on moving it out and also be aware that moving it back in is a lot of work. In hindsight, I wish we had not moved the DataFusion Python bindings out of Arrow governance after all the work that was done to get them donated in the first place.

@andygrove
Copy link
Member

@alamb @jorgecarleitao Do you have opinions on this?

@alamb
Copy link
Contributor

alamb commented Jun 16, 2022

I do not have any preference, but I also don't really spend any effort on arrow2 or the arrow2 branch in datafusion -- perhaps @Igosuki or others who have expended significant effort would be better to weigh on

@jorgecarleitao
Copy link
Member

@andygrove, thanks for the ping! I do not have a preference - I am very happy to support anyone that wishes to use arrow2/parquet2, and I am very thankful for everyone that has been maintaining the arrow2 branch. I unfortunately do not have the time left to spearhead it.

@dbr I would be very happy to support you in the migration - let me know how I can help and feel free to assign tasks directly. Arrow2's core API is essentially stable and that we we usually have small breaking changes in IO, so most of the changes should be on the datafusion's side.

For full transparency: imo the primary question here is one of control - although we demonstrated that arrow2 had superior support, safety, performance, documentation, and UX, I sensed that the main requirement was that arrow2 was donated to Apache, which for me was unacceptable. This demotivated me from contributing here.

I understand that companies that have a strong dependency on DataFusion need control over its core dependency. For me it demos that we are a bit past the idea that people in Apache are individual contributors.

The core development of Polars and Databend have been more pragmatic and I have thus been more aware of the requests there. However, if there is anything needed, just drop an issue and I will gladly take it, irrespectively of where the development happens (here or in a separate repo / org).

I continue to believe that DataFusion would benefit immensely from arrow2's current and future capabilities - arrow2 is the backend of likely the fastest in-node dataframe APIs out there and just this week we exposed a new API that enable compute to be 2x-5x faster ;)

@alamb
Copy link
Contributor

alamb commented Jun 16, 2022

although we demonstrated that arrow2 had superior support, safety, performance, documentation, and UX

I believe there are differing opinions on some of these items

I do think there is broad agreement that the governance model of arrow2 (benign dictator) is not suitable for all users for a variety of reasons.

@Igosuki
Copy link
Contributor

Igosuki commented Jun 17, 2022

The main issue a few months ago was that the datafusion codebase was not split like it is now, so it created conflicts in the various places where it wasn't possible to confine differences between arrow and arrow2::io to extension traits. So my question is, can a contrib crate exist since arrow2 is essentially required at the core ?

@andygrove
Copy link
Member

I do think there is broad agreement that the governance model of arrow2 (benign dictator) is not suitable for all users for a variety of reasons.

I agree. Although there is some great work happening in arrow2, I am not motivated (or even permitted) to contribute there. I am heavily invested in working with the Apache Arrow community to continue to improve things here together. The Apache governance model is not perfect but it is working well enough from my point of view. I am a big believer in community over code.

@andygrove
Copy link
Member

The main issue a few months ago was that the datafusion codebase was not split like it is now, so it created conflicts in the various places where it wasn't possible to confine differences between arrow and arrow2::io to extension traits. So my question is, can a contrib crate exist since arrow2 is essentially required at the core ?

Regardless of where the code lives, I would view us as having two main options:

  1. Continue to maintain a copy of DataFusion "core" with modifications to allow building with arrow or arrow2. This will be an ongoing maintenance burden and IMO only worth the effort if we are working towards having this merged into DataFusion master at some point and I am not sure how much appetite there is for this.

  2. Fork datafusion-core to create a new query engine based on arrow2 that can leverage the new DataFusion crates for logical plan building, SQL query planning, logical plan optimizations etc. Over time maybe more functionality can be moved out of core and re-used by other query engines. This would decouple the development work and allow those interested in arrow2 to move faster and does not put any burden on the core DataFusiuon development. That would be my choice if I were working on this.

@alamb
Copy link
Contributor

alamb commented Jun 18, 2022

One approach that @jimexist started with, that I thought was promising was to use traits to collect the differences in the two apis -- so that DataFusion used the traits and then there were impls for both arrow and arrow2

For example, from_slice in #1588 and #1589

I would be quite supportive of such a strategy and would be a nice way to let people use which every library they wanted

@andygrove
Copy link
Member

I would be quite supportive of such a strategy and would be a nice way to let people use which every library they wanted

Bonus points if it could one day support a GPU implementation of Arrow as well 😉

@waynexia
Copy link
Member

so that DataFusion used the traits and then there were impls for both arrow and arrow2

This indirection might also help to decouple downstream's arrow version with DF's.

@houqp
Copy link
Member

houqp commented Jun 27, 2022

The current arrow2 branch was not created with the intention to keep it as a long running branch, instead, the original plan was to use it as a tool to validate (#1652) whether we will be able to get a significant performance boost by switching to arrow2, and if so, convince the rest of the community to make that jump :)

Unfortunately, due to personal reasons, I don't have the time to keep pushing this forward anymore.

I agree with everyone that coming up with a zero cost arrow trait abstraction would be the best path forward, especially considering the possibility of adding arrow-gpu in the future!

@jorgecarleitao
Copy link
Member

@alamb AFAIK governance has not been a factor when considering dependencies in DataFusion, but maybe I missed it when that decision taken. Specifically, going through some of its dependencies:

  • sqlparser - unclear governance
  • ahash - unclear governance
  • tokio - Tokio team; could not find governance doc
  • prost - Tokio team; could not find governance doc
  • thrift - clear governance, Apache Arrow
  • parquet-format - unclear governance
  • rayon - unclear governance
  • rand - unclear governance
  • parking_lot - unclear governance
  • smallvec - unclear governance
  • cranelift - clear governance, ByteCodeAlliance
  • avro-rs - unclear governance / donated to Apache but no release yet.
  • parquet - clear governance, Apache Arrow
  • chrono - unclear governance
  • async-trait - unclear governance
  • (under PR) object_store - unclear governance

"arrow2 - unclear governance" is imo a more adequate characterization here - I have been contributing the most to it, so it is natural that I have been taking most of the decisions. Ritchie from Polars, @Dandandan, @alamb and others have full write permissions to the repo; @houqp and Ritchie have permissions to publish to crates.io.

I would be happy to set/modify a governance to arrow2 to accommodate this community's needs, but I would need to know what is the exact requirement, because being part of Apache does not seem to be one today.

Wrt to community over code, I certainly agree with you, @andygrove - I just believe that the Apache Foundation does not own the monopoly of that practice - there are many projects and organizations that fulfill that without forcing projects to a structure like Apache forces, in particular how contributions to some part of the project induce control over the whole project (being PMC).

I agree that going through a trait abstraction is a way. Creating an "arrow2arrow" as suggested in jorgecarleitao/arrow2#629 would also be a way. I would be willing to create and maintain such a crate if that would help this community, but again, we would end up with the same question - it would not be Apache governed, can it be a dependency?

@tustvold
Copy link
Contributor

AFAIK governance has not been a factor when considering dependencies in DataFusion

To conflate arrow which is core to both the in-memory layout and query computation, with something fairly self-contained like smallvec, chrono or ahash I think is a wee bit disingenuous 😅

It is fairly common for work in arrow-datafusion to build upon/require work in arrow-rs, whereas the same is not true of those listed dependencies. Given we are still talking about this I think we have clearly demonstrated arrow is not a trivial dependency, and therefore warrants some additional care and attention...

was to use traits to collect the differences in the two apis -- so that DataFusion used the traits and then there were impls for both arrow and arrow2

If practicable this sounds like a good idea to me. It would also fit with my ongoing efforts, articulated here, to gradually bring incubated ideas from arrow2 to arrow-rs. I'm actually pretty happy with the progress that has been made so far, and having an easy way to compare the two implementations I think would prove insightful.

In particular I have been inspired by the way the arrow-rs community has worked together to drive progress, through code contributions or otherwise. Yes arrow-rs still has many warts, and is often not particularly pleasant to work on, but there is an active community of people working to make it better in whatever ways they can, and that to me is what open source is all about. For some the governance question is one of legal practicalities, but at least for me it is about having a friendly community working together to deliver great software.

@alamb
Copy link
Contributor

alamb commented Jun 28, 2022

We have proposed donating object_store to Arrow on influxdata/object_store_rs#41

@andygrove
Copy link
Member

I have filed an issue for adding GPU support: apache/arrow-rs#1966

The proposed arrow-gpu crate differs from arrow and arrow2 because it is necessary to transfer the array data to the GPU before invoking GPU kernels and then the results can be transferred back. It does not make sense to support accessing individual elements in arrays stored on the GPU. However, in other ways they look similar to arrays on the host in that they have data types and you can query the length and whether they have nulls.

As we think about abstractions it would be good to take this into account.

@v0y4g3r
Copy link

v0y4g3r commented Jul 7, 2022

Lets move back to upgrading arrow2 version in arrow2 branch. I'm quite interested in that, but I found that some unit tests are currently failing:

  • datasource::file_format::json::tests::read_limit
  • datasource::file_format::json::tests::read_small_batches
  • execution::dataframe_impl::tests::register_table
  • optimizer::simplify_expressions::tests::cast_expr_wrong_arg
  • optimizer::simplify_expressions::tests::multiple_now_expr
  • optimizer::simplify_expressions::tests::now_less_than_timestamp
  • optimizer::simplify_expressions::tests::test_const_evaluator_now
  • physical_plan::file_format::json::tests::nd_json_exec_file_projection
  • physical_plan::file_format::json::tests::nd_json_exec_file_with_missing_column
  • physical_plan::file_format::json::tests::nd_json_exec_file_without_projection
  • physical_plan::file_format::parquet::tests::parquet_exec_with_error
  • physical_plan::file_format::parquet::tests::write_parquet_results
  • physical_plan::planner::tests::bad_extension_planner
  • physical_plan::sorts::sort::tests::test_sort_spill

I'd like to first upgrade arrow2 version and then fix these failing tests.
Also, upgrading arrow2 to latest version v0.12 means upgrading some dependencies like tonic/parquet-format-async-temp etc, still don't how much work to do. I will open a PR as soon as I finish.

@alamb
Copy link
Contributor

alamb commented Mar 2, 2023

Given the current discussion about arrow/arrow2 unification (see apache/arrow-rs#1176 (comment)) I think this issue may no longer be relevant.

If no one objects, I'll plan to close it in the next few days

@alamb
Copy link
Contributor

alamb commented Mar 6, 2023

Hearing no other comments here, I am closing this one. Feel free to chime in if you disagree or have other comments to add

@alamb alamb closed this as completed Mar 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

9 participants