Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HUDI-5868] Make hudi-spark compatible against Spark 3.3.2 #8082

Merged
merged 2 commits into from
May 11, 2023

Conversation

rahil-c
Copy link
Contributor

@rahil-c rahil-c commented Mar 1, 2023

Change Logs

Minor changes to pom

Impact

Upgrade spark to 3.3.2

Risk level

Low, Minor upgade

Documentation Update

Will need doc update

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@rahil-c
Copy link
Contributor Author

rahil-c commented Mar 3, 2023

@hudi-bot run azure

@rahil-c
Copy link
Contributor Author

rahil-c commented Mar 8, 2023

@hudi-bot run azure

@wolf8334
Copy link

I am new to both github and hudi,but I want to help.
so plz tell me what can I do for you?

@nikoshet
Copy link

Any info on when will this be merged?

@rahil-c
Copy link
Contributor Author

rahil-c commented Apr 11, 2023

Any info on when will this be merged?

@nikoshet Will try to get this in ideally sometime this week.

@yihua If you can review this when you get a chance would be great.

I am new to both github and hudi,but I want to help. so plz tell me what can I do for you?
Thanks @wolf8334 for the most part I think this is a small pr so I think I can do this one, but if you would like to help out on other prs feel free to message me at Rahil C via the hudi slack.

@rahil-c
Copy link
Contributor Author

rahil-c commented Apr 11, 2023

@hudi-bot run azure

import org.apache.spark.sql.sources.Filter
import org.apache.spark.sql.types.StructType

class Spark332PlusHoodieParquetFileFormat(override protected val shouldAppendPartitionValues: Boolean) extends Spark32PlusHoodieParquetFileFormat(shouldAppendPartitionValues) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this class under hudi-spark3.3.x, Hudi won't be able to compile with Spark 3.3.1 anymore

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this is true since, but my understanding is currently that hudi supports only minor version of spark. https://github.com/apache/hudi/blob/master/pom.xml#L159 im not sure if users should be changing the version in the poms and trying to compile with a different version. cc @yihua

Currently when im taking the open source jars from Hudi which are compiled via spark 3.3.1 and try running them on spark 3.3.2 they hit failures due to the new config causing issues, so we eventually I think have to support this.

However I think the main thing we need to check is runtime support meaning if this change of upgrading to spark 3.3.2 atleast can run on a Spark 3.3.1 runtime so we dont break users.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update on this when compiling with spark 3.3.1 things are working fine even with these 3.3.2 changes .
mvn clean install -Dspark3 -Dscala-2.12 -DskipTests -pl packaging/hudi-spark-bundle -am -Dspark.version=3.3.1

[INFO] Hudi ............................................... SUCCESS [  2.629 s]
[INFO] hudi-tests-common .................................. SUCCESS [  1.752 s]
[INFO] hudi-common ........................................ SUCCESS [ 29.852 s]
[INFO] hudi-hadoop-mr ..................................... SUCCESS [  7.301 s]
[INFO] hudi-sync-common ................................... SUCCESS [  2.429 s]
[INFO] hudi-hive-sync ..................................... SUCCESS [  6.965 s]
[INFO] hudi-aws ........................................... SUCCESS [  4.255 s]
[INFO] hudi-timeline-service .............................. SUCCESS [  2.550 s]
[INFO] hudi-client ........................................ SUCCESS [  0.082 s]
[INFO] hudi-client-common ................................. SUCCESS [ 13.594 s]
[INFO] hudi-spark-client .................................. SUCCESS [ 36.826 s]
[INFO] hudi-spark-datasource .............................. SUCCESS [  0.079 s]
[INFO] hudi-spark-common_2.12 ............................. SUCCESS [ 28.910 s]
[INFO] hudi-spark3-common ................................. SUCCESS [  9.304 s]
[INFO] hudi-spark3.2plus-common ........................... SUCCESS [ 12.118 s]
[INFO] hudi-spark3.3.x_2.12 ............................... SUCCESS [ 23.851 s]
[INFO] hudi-java-client ................................... SUCCESS [  4.792 s]
[INFO] hudi-spark_2.12 .................................... SUCCESS [ 53.643 s]
[INFO] hudi-spark3.3-bundle_2.12 .......................... SUCCESS [ 29.585 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  04:31 min
[INFO] Finished at: 2023-04-19T11:56:41-07:00
[INFO] ---------------------------------------------

confirmed via mvn logs that it was using 3.3.1

[INFO] Copying derby-10.14.2.0.jar to /Users/rchertar/hudi/hudi-spark-datasource/hudi-spark/target/lib/derby-10.14.2.0.jar
[INFO] Copying spark-sql_2.12-3.3.1-tests.jar to /Users/rchertar/hudi/hudi-spark-datasource/hudi-spark/target/lib/spark-sql_2.12-3.3.1-tests.jar
[INFO] Copying spark-core_2.12-3.3.1-tests.jar to /Users/rchertar/hudi/hudi-spark-datasource/hudi-spark/target/lib/spark-core_2.12-3.3.1-tests.jar
[INFO] Copying spark-catalyst_2.12-3.3.1-tests.jar to /Users/rchertar/hudi/hudi-spark-datasource/hudi-spark/target/lib/spark-catalyst_2.12-3.3.1-tests.jar
[INFO] Copying scala-parser-combinators_2.12-1.1.2.jar to /Users/rchertar/hudi/hudi-spark-datasource/hudi-spark/target/lib/scala-parser-combinators_2.12-1.1.2.jar

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm good if we are not running into any compilation issues

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The log you show is different; it copies the jars to target/lib. Could you try modifying the Spark version directly in POM and see if the changes can be compiled with Spark 3.3.1, to be safe?

@@ -18,7 +18,7 @@
package org.apache.spark.sql.adapter

import org.apache.avro.Schema
import org.apache.hudi.Spark33HoodieFileScanRDD
import org.apache.hudi.{HoodieSparkUtils, Spark33HoodieFileScanRDD}
import org.apache.spark.sql._
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnecessary change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes will remove thanks

@danny0405
Copy link
Contributor

Thanks for the fix @rahil-c , you may need to rebase with the latest master to get rid of the test failures.

@bvaradar
Copy link
Contributor

bvaradar commented May 2, 2023

@yihua : FYI

@danny0405 danny0405 added priority:blocker release-0.13.1 and removed priority:major degraded perf; unable to move forward; potential bugs labels May 2, 2023
Copy link
Contributor

@yihua yihua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rahil-c overall LGTM. Could you run some tests on Spark 3.3.1 as well (either using Java CI or manually), which should execute the changes, to make sure it's backward compatible?

@@ -58,6 +58,7 @@ private[hudi] trait SparkVersionsSupport {
def gteqSpark3_2_1: Boolean = getSparkVersion >= "3.2.1"
def gteqSpark3_2_2: Boolean = getSparkVersion >= "3.2.2"
def gteqSpark3_3: Boolean = getSparkVersion >= "3.3"
def gteqSpark3_3_2: Boolean = getSparkVersion >= "3.3.2"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this based on compile-time Spark version or runtime Spark version?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be runtime, I think.

@danny0405
Copy link
Contributor

There was one failure in the CI: TestAvroSchemaResolutionSupport.testDataTypePromotions

@rahil-c
Copy link
Contributor Author

rahil-c commented May 9, 2023

There was one failure in the CI: TestAvroSchemaResolutionSupport.testDataTypePromotions

Had offline conversation, will disable this test for spark 332 for now.

@danny0405
Copy link
Contributor

Let's disable the vec reader for spark 3.3.2 first, will ping @xiarixiaoyao to dig into the format issue.

@rahil-c rahil-c changed the title [HUDI-5868] Upgrade Spark to 3.3.2 [HUDI-5868] Make hudi-spark compatible against Spark 3.3.2 May 10, 2023
Copy link
Contributor

@danny0405 danny0405 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, nice job @rahil-c ~

@yihua
Copy link
Contributor

yihua commented May 10, 2023

Agree that at this point let's disable vectorized reader for spark 3.3.2 only without affecting the Hudi integration with existing Spark versions, and keep compile version to be Spark 3.3.1.

@hudi-bot
Copy link

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@danny0405 danny0405 merged commit 65172d3 into apache:master May 11, 2023
kazdy added a commit to kazdy/hudi that referenced this pull request May 12, 2023
mansipp pushed a commit to mansipp/hudi that referenced this pull request May 12, 2023
* Disable vectorized reader for spark 3.3.2 only
* Keep compile version to be Spark 3.3.1

---------

Co-authored-by: Rahil Chertara <rchertar@amazon.com>
yihua added a commit that referenced this pull request May 13, 2023
This commit adds the bundle validation on Spark 3.3.2 in Github Java CI to ensure compatibility after we fixed the compatibility issue in #8082.
yihua pushed a commit to yihua/hudi that referenced this pull request May 15, 2023
* Disable vectorized reader for spark 3.3.2 only
* Keep compile version to be Spark 3.3.1

---------

Co-authored-by: Rahil Chertara <rchertar@amazon.com>
yihua added a commit to yihua/hudi that referenced this pull request May 15, 2023
This commit adds the bundle validation on Spark 3.3.2 in Github Java CI to ensure compatibility after we fixed the compatibility issue in apache#8082.
yihua pushed a commit to yihua/hudi that referenced this pull request May 15, 2023
* Disable vectorized reader for spark 3.3.2 only
* Keep compile version to be Spark 3.3.1

---------

Co-authored-by: Rahil Chertara <rchertar@amazon.com>
yihua added a commit to yihua/hudi that referenced this pull request May 15, 2023
This commit adds the bundle validation on Spark 3.3.2 in Github Java CI to ensure compatibility after we fixed the compatibility issue in apache#8082.
mansipp pushed a commit to mansipp/hudi that referenced this pull request May 16, 2023
* Disable vectorized reader for spark 3.3.2 only
* Keep compile version to be Spark 3.3.1

---------

Co-authored-by: Rahil Chertara <rchertar@amazon.com>
mansipp added a commit to mansipp/hudi that referenced this pull request May 16, 2023
yihua pushed a commit to yihua/hudi that referenced this pull request May 17, 2023
* Disable vectorized reader for spark 3.3.2 only
* Keep compile version to be Spark 3.3.1

---------

Co-authored-by: Rahil Chertara <rchertar@amazon.com>
yihua added a commit to yihua/hudi that referenced this pull request May 17, 2023
This commit adds the bundle validation on Spark 3.3.2 in Github Java CI to ensure compatibility after we fixed the compatibility issue in apache#8082.
rahil-c added a commit to rahil-c/hudi that referenced this pull request May 26, 2023
* Disable vectorized reader for spark 3.3.2 only
* Keep compile version to be Spark 3.3.1

---------

Co-authored-by: Rahil Chertara <rchertar@amazon.com>
rahil-c added a commit to rahil-c/hudi that referenced this pull request May 26, 2023
* Disable vectorized reader for spark 3.3.2 only
* Keep compile version to be Spark 3.3.1

---------

Co-authored-by: Rahil Chertara <rchertar@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

10 participants