Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Officially maintained Arrow2 branch #1556

Merged
merged 46 commits into from
Jan 20, 2022
Merged

Officially maintained Arrow2 branch #1556

merged 46 commits into from
Jan 20, 2022

Conversation

houqp
Copy link
Member

@houqp houqp commented Jan 14, 2022

Which issue does this PR close?

Close arrow2 milestone https://github.com/apache/arrow-datafusion/milestone/3

Rationale for this change

Provide a complete arrow2 based datafusion implementation for full evaluation of the migration. This should give us a good feeling of the arrow2 API UX as well as a starting point for performance benchmarks within datafusion and downstream projects.

The goal is to merge this code into an official arrow2 branch in the short run, until we are comfortable doing the switch in master.

What changes are included in this PR?

Here is a TPCH benchmark I ran on my Linux laptop (baseline 2008b1d):

Screenshot_20220113_174918

On avg, we are getting around 5% speed up across the board, with q5 at 11% and q12 at only 1% as the two outliners. If this performance gain can also be replicated in downstream projects, then I think it would be a strong case for us to do the arrow2 swtich. On top of this, we end up with a nice 1000+ lines of code reduction ;)

Are there any user-facing changes?

Yes, downstream consumer of datafusion will need to switch to arrow2 as well.

jorgecarleitao and others added 30 commits July 5, 2021 17:05
Co-authored-by: Yijie Shen <henry.yijieshen@gmail.com>
* wip

* more

* Make scalar.rs compile

* Fix various compilation error due to API difference

* Make datafusion core compile

* fmt

* wip

* wip: compile ballista

* Pass all datafusion tests

* Compile ballista
Co-authored-by: Yijie Shen <henry.yijieshen@gmail.com>
* WIP: on making cargo test compile

* make cargo test compile

* fix
…ts (#10)

* Fix tests

* Ignore last test, fix clippy, fmt and enable integration

* more clippy fix
Co-authored-by:  Yijie Shen <henry.yijieshen@gmail.com>
@houqp
Copy link
Member Author

houqp commented Jan 16, 2022

Thank you everyone for all the reviews and comments so far. @Igosuki and I have addressed most of them. Here are the two remaining todo items:

  • Get the parquet row group filter test to pass
  • Restore sql integration test migration. All those sql tests were migrated and passing previously, but those changes got lost when we merged the sql test refactoring from master.

I will keep working on this tomorrow. In the mean time, feel free to send PRs to my fork if you are interested in helping. After these two items are fixed, I will run another round of benchmark to double check the performance fix. It's quite interesting that I got the opposite performance test result initial even without that file buf fix :P I will dig into what's causing that as well.

@Igosuki
Copy link
Contributor

Igosuki commented Jan 16, 2022 via email

@houqp
Copy link
Member Author

houqp commented Jan 17, 2022

The parquet row group test failure turned out to be a red herring. The asserted expected result is actually not correct. I have filed a follow up issue at #1591. I changed the expected result in this branch to fix the test failure for now. What the predicate pruning logic returns in this branch is more correct than what we have in master, but still wrong. The proper fix is out of scope of arrow2 migration and tracked in #1591.

We are now passing all 856 unit tests. 2 more integration tests to fix, which are caused by difference in how arrow2 formats binary array.

@houqp
Copy link
Member Author

houqp commented Jan 17, 2022

I also noticed my benchmarks were ran with data generated from tpch-gen.sh, which only produces single partition CSV files. @andygrove could you share with me how you generated your sf100 dataset?

@yjshen
Copy link
Member

yjshen commented Jan 17, 2022

The parquet row group test failure turned out to be a red herring. The asserted expected result is actually not correct. I have filed a follow up issue at #1591.

I shared the same observation in houqp#16, but ignored the test at the time.

@houqp houqp force-pushed the arrow2_merge branch 2 times, most recently from 5ad816a to 61ace8f Compare January 18, 2022 07:00
@houqp
Copy link
Member Author

houqp commented Jan 18, 2022

update: all datafusion unit and integration tests are passing now, down to a single test failure in datafusion-cli related to json display format.

@alamb
Copy link
Contributor

alamb commented Jan 18, 2022

I think we should merge it into the arrow2 branch and keep iterating from there. I suspect the next big chunk of work is the RecordBatch removal / adaptation in jorgecarleitao/arrow2#717

@houqp
Copy link
Member Author

houqp commented Jan 19, 2022

oops, looks like the arrow2 branch got updated with latest commits from master, anyone mind if I revert it back to 2008b1d and handle master catch up in a follow up PR?

@houqp houqp merged commit c0c9c72 into apache:arrow2 Jan 20, 2022
@houqp
Copy link
Member Author

houqp commented Jan 20, 2022

All integration and unit tests are passing now, the MIRI check is failing due to an upstream tokio issue I believe. I will file some follow up issues tomorrow to track the remaining work needed for us to make the final call on master merge.

@houqp
Copy link
Member Author

houqp commented Jan 20, 2022

Thank you @jorgecarleitao @yjshen and @Igosuki for your hard work on the migration thus far :)

@yjshen
Copy link
Member

yjshen commented Jan 20, 2022

Wow! Milestone reached! Thanks for driving on this and making it happen @houqp 👍

@Igosuki
Copy link
Contributor

Igosuki commented Jan 20, 2022 via email

@xudong963
Copy link
Member

Thanks @houqp , epic work !

@Igosuki
Copy link
Contributor

Igosuki commented Jan 23, 2022

arrow2

@andygrove What tool did you use to get such a smooth CPU chart ?

@houqp
Copy link
Member Author

houqp commented Jan 24, 2022

Quick update on this, I have cleaned up the issues in the arrow2 milestone: https://github.com/apache/arrow-datafusion/milestone/3. The main remaining items are:

I will keep work on issues in the arrow2 milestone whenever I have capacity. If anyone of you are interested in helping, please feel free to comment on those issues or send PRs to the official arrow2 branch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ballista datafusion Changes in the datafusion crate documentation Improvements or additions to documentation sql
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants