
Spark 3.4: Initial support #7378

Merged: 3 commits into apache:master on Apr 19, 2023

Conversation

aokolnychyi
Contributor

@aokolnychyi aokolnychyi commented Apr 19, 2023

This PR adds initial support for Spark 3.4 and consists of 3 commits that must be preserved while merging.

The last change is the most important to review.

Note that this approach preserves the commit history only for 3.4 (our new default version). There are tricks to keep history both for 3.3 and 3.4 but they may cause issues while rebasing. That's why I followed the exact approach we used in 3.3.

It is worth mentioning that the DROP table behavior in the Spark session catalog is broken. That's why some tests had to be adapted. We are exploring a Spark fix at the moment.

Resolves #7174.
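The "3 commits that must be preserved while merging" requirement means landing the PR with a merge commit rather than a squash. A minimal sketch of that mechanic in a throwaway repo (branch and commit names below are illustrative, not the actual apache/iceberg workflow):

```shell
# Sketch only: a --no-ff merge creates a merge commit and keeps all three
# branch commits in history; a squash merge would collapse them into one.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q -b master repo && cd repo
g() { git -c user.email=ci@example.com -c user.name=ci "$@"; }
g commit -q --allow-empty -m "base"
g checkout -q -b spark-3.4-initial           # hypothetical PR branch name
for i in 1 2 3; do
  g commit -q --allow-empty -m "commit $i"   # the 3 commits to preserve
done
g checkout -q master
g merge -q --no-ff -m "Spark 3.4: Initial support (#7378)" spark-3.4-initial
g log --oneline                              # all 3 commits plus the merge commit
```

With `--no-ff`, each of the three commits stays individually addressable on master, which is what allows cherry-picking pieces back to the 3.3 tree later.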

@aokolnychyi
Contributor Author

Note to reviewers, Russell left comments directly on 723487b. I'll create a JIRA for the Spark issue so that we can reference it in currently ignored tests.

Contributor

@Fokko Fokko left a comment


Thanks for adding this @aokolnychyi! Should we do a performance benchmark at some point between 3.3 and 3.4?

@aokolnychyi
Contributor Author

@Fokko, absolutely! I remember @bryanck was mentioning some benchmarking framework. Is there any chance we can use it for this? We also have some benchmarks internally, I'll ask @szehon-ho once 3.4 work is complete.

@bryanck
Contributor

bryanck commented Apr 19, 2023

The TPC-DS benchmarking tool we currently use is EMR-specific unfortunately.

@aokolnychyi aokolnychyi merged commit a880794 into apache:master Apr 19, 2023
34 checks passed
@aokolnychyi
Contributor Author

I merged this PR as it is really hard to keep up with changes in master; I had to start over from scratch multiple times. We will need to cherry-pick #6480 back to 3.3, as after this PR it is only in 3.4. I'll do that next; it is easier than redoing the copy.
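Mechanically, the cherry-pick described above replays a single commit from one version branch onto another. A hedged sketch in a throwaway repo (branch names and the change itself are made up; the real port would also need paths adjusted between the 3.4 and 3.3 source trees by hand):

```shell
# Illustrative only: pick one commit (standing in for #6480) from a "3.4"
# branch back onto a "3.3" branch. Names are hypothetical.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q -b master repo && cd repo
g() { git -c user.email=ci@example.com -c user.name=ci "$@"; }
g commit -q --allow-empty -m "shared base"
g branch 3.3                    # 3.3 tree stays at the shared base
g checkout -q -b 3.4
echo "fix" > change.txt
g add change.txt
g commit -q -m "the #6480 change"
sha=$(git rev-parse HEAD)
g checkout -q 3.3
g cherry-pick "$sha"            # replays the commit onto the 3.3 branch
```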

Contributor

@singhpk234 singhpk234 left a comment


LGTM as well, thanks a ton for adding this @aokolnychyi!

Just some minor comments on 723487b and the 3.4 directory.

* A benchmark that evaluates the performance of writing Parquet data with a flat schema using
* Iceberg and Spark Parquet writers.
*
* <p>To run this benchmark for spark-3.3: <code>
Contributor


[minor] Do we need to update this to 3.4 as well?

Contributor Author


Good catch! I'll follow up.
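For reference, the follow-up would presumably switch the Javadoc snippet to point at the 3.4 module. A hedged sketch of the invocation, modeled on Iceberg's usual JMH Gradle convention (the exact task path, benchmark class name, and property names here are assumptions, not confirmed by this thread):

```shell
./gradlew -DsparkVersions=3.4 :iceberg-spark:iceberg-spark-3.4_2.12:jmh \
    -PjmhIncludeRegex=SparkParquetWritersFlatDataBenchmark \
    -PjmhOutputPath=benchmark/results.txt
```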

@aokolnychyi
Contributor Author

@singhpk234, I replied to comments on the change. Let me know if that makes sense, I'll follow up to fix JMH Javadoc.

@singhpk234
Contributor

> @singhpk234, I replied to comments on the change. Let me know if that makes sense, I'll follow up to fix JMH Javadoc.

Makes sense to me. Thanks @aokolnychyi !

@mgorsk1

mgorsk1 commented May 7, 2023

Thanks for this @aokolnychyi! Any ETA on when we can expect iceberg-spark-runtime-3.4_2.12:1.2.0 in Maven?

@vakarisbk

I'm also interested in spark-runtime for 3.4.0

@aokolnychyi
Contributor Author

@mgorsk1 @vakarisbk, the plan is to get a public release out in mid-May.
