Spark 3.4: Initial support #7378
Conversation
Force-pushed from 723487b to 089dc4d.

Note to reviewers: Russell left comments directly on 723487b. I'll create a JIRA for the Spark issue so that we can reference it in currently ignored tests.
Force-pushed from 089dc4d to 59571fa.
Thanks for adding this @aokolnychyi! Should we do a performance benchmark at some point between 3.3 and 3.4?
@Fokko, absolutely! I remember @bryanck mentioned a benchmarking framework. Is there any chance we can use it for this? We also have some benchmarks internally; I'll ask @szehon-ho once the 3.4 work is complete.
Force-pushed from 59571fa to 1d8481e.
The TPC-DS benchmarking tool we currently use is EMR-specific, unfortunately.
I merged this PR as it is really hard to keep up with changes in master; I had started from scratch multiple times. We will need to cherry-pick #6480 back to 3.3, as it is only in 3.4 after this PR. I'll do that next; it is easier than doing the copy again.
LGTM as well. Thanks a ton for adding this, @aokolnychyi!
Just some minor comments on 723487b and the 3.4 directory.
```java
 * A benchmark that evaluates the performance of writing Parquet data with a flat schema using
 * Iceberg and Spark Parquet writers.
 *
 * <p>To run this benchmark for spark-3.3: <code>
```
[minor] Do we need to update this to 3.4 as well?
Good catch! I'll follow up.
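For reference, the 3.4 variant of the Javadoc invocation would presumably look like the sketch below. It assumes the Gradle task and property naming mirrors the existing 3.3 module; the exact regex and output path for a given benchmark class may differ.

```shell
# Hypothetical 3.4 equivalent of the spark-3.3 JMH invocation quoted above;
# module and property names are assumed to follow the 3.3 convention.
./gradlew -DsparkVersions=3.4 \
  :iceberg-spark:iceberg-spark-3.4_2.12:jmh \
  -PjmhIncludeRegex=SparkParquetWritersFlatDataBenchmark \
  -PjmhOutputPath=benchmark/spark-parquet-writers-flat-data-benchmark-result.txt
```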
@singhpk234, I replied to the comments on the change. Let me know if that makes sense; I'll follow up to fix the JMH Javadoc.
Makes sense to me. Thanks, @aokolnychyi!
Thanks for this, @aokolnychyi. Any ETA on when we can expect iceberg-spark-runtime-3.14_2.12:1.2.0 in Maven?
I'm also interested in spark-runtime for 3.4.0.
@mgorsk1 @vakarisbk, the plan is to get a public release out mid-May.
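Once a release ships, consuming the new runtime should look like the usual Spark runtime dependency. The coordinates below are an assumption based on the existing 3.3 artifact naming (`iceberg-spark-runtime-3.3_2.12`); the version is a placeholder until the release is out.

```groovy
// build.gradle sketch; artifact name assumed from the 3.3 naming pattern,
// version left as a placeholder pending the public release.
dependencies {
    implementation("org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:<release-version>")
}
```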
This PR adds initial support for Spark 3.4 and consists of 3 commits that must be preserved while merging.
The last change is the most important to review.
Note that this approach preserves the commit history only for 3.4 (our new default version). There are tricks to keep history both for 3.3 and 3.4 but they may cause issues while rebasing. That's why I followed the exact approach we used in 3.3.
It is worth mentioning that the DROP TABLE behavior in the Spark session catalog is broken, which is why some tests had to be adapted. We are exploring a Spark fix at the moment.
Resolves #7174.