Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

build: Switch back to official DataFusion repo and arrow-rs after Arrow Java 16 is released #403

Merged
merged 12 commits into from
Jun 7, 2024

Conversation

viirya
Copy link
Member

@viirya viirya commented May 8, 2024

Which issue does this PR close?

Closes #248.

Rationale for this change

What changes are included in this PR?

How are these changes tested?

@viirya
Copy link
Member Author

viirya commented May 8, 2024

Note that this is blocked by #250. We need to verify Java Arrow after #250 is merged.

@viirya viirya force-pushed the use_official_releases branch 2 times, most recently from 3909ccf to 29c89bf Compare May 8, 2024 19:39
@viirya
Copy link
Member Author

viirya commented May 9, 2024

The fix (apache/datafusion#10094) is not released yet. So the following error still happens in CI:

CometAggregateSuite:
- count with aggregation filter (371 milliseconds)
- lead/lag should return the default value if the offset row does not exist *** FAILED *** (133 milliseconds)
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 32.0 failed 1 times, most recent failure: Lost task 0.0 in stage 32.0 (TID 45) (74d216948a65 executor driver): org.apache.comet.CometNativeException: Arrow error: Invalid argument error: must either specify a row count or at least one column

Need to wait for next release.

@viirya
Copy link
Member Author

viirya commented May 9, 2024

Besides, the Java Arrow bug is supported to be fixed in Arrow Java 16. But I still see it happens in a few pipelines:

- multiple column distinct count *** FAILED *** (392 milliseconds)
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 59.0 failed 1 times, most recent failure: Lost task 5.0 in stage 59.0 (TID 117) (74d216948a65 executor driver): org.apache.comet.CometNativeException: Fail to process Arrow array with reason C Data interface error: The external buffer at position 2 is null..

And more interesting is, the test doesn't always have the failure. I also cannot reproduce it locally.
Wondering if there are more than one Java Arrow jar is included to randomly it picks up an older version. 🤔

BTW, it also only failed on ubuntu pipelines, mac os-14 pipelines are all okay without the C Data interface error.

@viirya viirya force-pushed the use_official_releases branch 3 times, most recently from 225a9f3 to cb2275b Compare May 9, 2024 18:58
@viirya
Copy link
Member Author

viirya commented May 9, 2024

Hmm, the error is actually different:

  Cause: org.apache.comet.CometNativeException: Fail to process Arrow array with reason C Data interface error: The external buffer at position 2 is null..

Previously, the fixed one is the buffer at position 1 (offset buffer):

General execution error with reason org.apache.comet.CometNativeException: Fail to process Arrow array with reason C Data interface error: The external buffer at position 1 is null...

I need to look at what the buffer is at position 2 and why Java Arrow passes a null buffer again...

@viirya viirya marked this pull request as draft May 9, 2024 20:03
@viirya
Copy link
Member Author

viirya commented May 9, 2024

Ah, I found that I made a mistake in the Java Arrow PR that it doesn't initiate the offset buffer well. Proposed another issue at Java Arrow apache/arrow#41609

I found it by trying to dump the buffer value:

- multiple column distinct count *** FAILED *** (391 milliseconds)
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 71.0 failed 1 times, most recent failure: Lost task 5.0 in stage 71.0 (TID 125) (ed8fe556741f executor driver): org.apache.comet.CometNativeException: Fail to process Arrow array with reason C Data interface error: The external buffer with len = 18446744071618653625 at position 2 is null..

@viirya
Copy link
Member Author

viirya commented May 13, 2024

I fixed the issue at arrow-rs in the PR apache/arrow-rs#5756.

We can continue this PR after the fix is released. Keep this as draft for now.

@viirya
Copy link
Member Author

viirya commented May 14, 2024

I have verified the fix apache/arrow-rs#5756 can work by cherry-picking it in my forked branch of arrow-rs.

We only need to wait for new release of arrow-rs and DataFusion.

@advancedxy
Copy link
Contributor

We only need to wait for new release of arrow-rs and DataFusion.

Do we have an estimate on when new releases of arrow-rs and DataFusion will be available? I'm asking because I'm wondering whether to include this in the first release of Comet. It seems correct to use the released version of arrow-rs and DataFusion in the official release of Comet.

@viirya
Copy link
Member Author

viirya commented May 16, 2024

arrow-rs and DataFusion have fast release cycle. We can hold Comet release after the new releases of arrow-rs and DataFusion.

@advancedxy
Copy link
Contributor

We can hold Comet release after the new releases of arrow-rs and DataFusion.

Thanks for the clarification. I second this.

@viirya viirya changed the title build: Switch back to released version of DataFusion and arrow-rs after Arrow Java 16 is released build: Switch back to official DataFusion repo and arrow-rs after Arrow Java 16 is released May 30, 2024
@viirya
Copy link
Member Author

viirya commented May 30, 2024

I changed from my forked of DataFusion repo to official DataFusion repo in Cargo.toml.

@viirya
Copy link
Member Author

viirya commented May 31, 2024

@andygrove I can changed to use official DataFusion repo, but cannot do same for arrow-rs repo. The official DataFusion repo uses arrow-rs release 51.0.0. If I point to arrow-rs repo, rust compiler will complain that two different versions of crate of arrow are used.

But we need one merged patch in arrow-rs which is not released yet.

Thus we still need to wait for next arrow-rs and DataFusion releases.

@viirya
Copy link
Member Author

viirya commented Jun 4, 2024

Switching back to latest DataFusion release/repo causes two issues right now:

  1. C Data interface issue: fixed in arrow-rs. Wait for arrow-rs 52.0.0 and new DataFusion release which uses it.
  2. Aggregate functions (like first/last) issue: Incorrect return type on aggregate functions implemented by AggregateUDF when upgrading to latest DataFusion #511

@viirya
Copy link
Member Author

viirya commented Jun 6, 2024

Okay. After #511 is fixed, there is only one failures:

  Cause: org.apache.comet.CometNativeException: Fail to process Arrow array with reason C Data interface error: The external buffer at position 2 is null..

which is known and we already fixed it at the upstream. We only need to wait for next DataFusion and arrow-rs releases that should be happened soon.

@viirya
Copy link
Member Author

viirya commented Jun 7, 2024

Using 39.0.0-rc1 tag now.

@viirya viirya marked this pull request as ready for review June 7, 2024 17:34
Copy link
Contributor

@kazuyukitanimura kazuyukitanimura left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM pending ci

@viirya
Copy link
Member Author

viirya commented Jun 7, 2024

All tests are passed!

@viirya
Copy link
Member Author

viirya commented Jun 7, 2024

cc @andygrove

Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

@andygrove andygrove merged commit fd596ed into apache:main Jun 7, 2024
43 checks passed
@viirya
Copy link
Member Author

viirya commented Jun 7, 2024

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Switch back to released version of DataFusion and arrow-rs after Arrow Java 16 is released
4 participants