[HUDI-5868] Make hudi-spark compatible against Spark 3.3.2 #8082
Conversation
@hudi-bot run azure
Force-pushed f795825 to f5b3f09
@hudi-bot run azure
I am new to both GitHub and Hudi, but I want to help.
Any info on when this will be merged?
@nikoshet Will try to get this in, ideally sometime this week. @yihua If you can review this when you get a chance, that would be great.
@hudi-bot run azure
import org.apache.spark.sql.sources.Filter
import org.apache.spark.sql.types.StructType

class Spark332PlusHoodieParquetFileFormat(override protected val shouldAppendPartitionValues: Boolean) extends Spark32PlusHoodieParquetFileFormat(shouldAppendPartitionValues) {
With this class under hudi-spark3.3.x, Hudi won't be able to compile with Spark 3.3.1 anymore.
Yes, this is true, but my understanding is that Hudi currently supports only one minor version of Spark: https://github.com/apache/hudi/blob/master/pom.xml#L159. I'm not sure users should be changing the version in the POMs and trying to compile with a different version. cc @yihua
Currently, when I take the open-source Hudi jars compiled against Spark 3.3.1 and run them on Spark 3.3.2, they hit failures because of the new config, so I think we eventually have to support this.
However, the main thing we need to check is runtime support, meaning whether this upgrade to Spark 3.3.2 can at least run on a Spark 3.3.1 runtime, so we don't break users.
Update on this: when compiling with Spark 3.3.1, things are working fine even with these 3.3.2 changes.
mvn clean install -Dspark3 -Dscala-2.12 -DskipTests -pl packaging/hudi-spark-bundle -am -Dspark.version=3.3.1
[INFO] Hudi ............................................... SUCCESS [ 2.629 s]
[INFO] hudi-tests-common .................................. SUCCESS [ 1.752 s]
[INFO] hudi-common ........................................ SUCCESS [ 29.852 s]
[INFO] hudi-hadoop-mr ..................................... SUCCESS [ 7.301 s]
[INFO] hudi-sync-common ................................... SUCCESS [ 2.429 s]
[INFO] hudi-hive-sync ..................................... SUCCESS [ 6.965 s]
[INFO] hudi-aws ........................................... SUCCESS [ 4.255 s]
[INFO] hudi-timeline-service .............................. SUCCESS [ 2.550 s]
[INFO] hudi-client ........................................ SUCCESS [ 0.082 s]
[INFO] hudi-client-common ................................. SUCCESS [ 13.594 s]
[INFO] hudi-spark-client .................................. SUCCESS [ 36.826 s]
[INFO] hudi-spark-datasource .............................. SUCCESS [ 0.079 s]
[INFO] hudi-spark-common_2.12 ............................. SUCCESS [ 28.910 s]
[INFO] hudi-spark3-common ................................. SUCCESS [ 9.304 s]
[INFO] hudi-spark3.2plus-common ........................... SUCCESS [ 12.118 s]
[INFO] hudi-spark3.3.x_2.12 ............................... SUCCESS [ 23.851 s]
[INFO] hudi-java-client ................................... SUCCESS [ 4.792 s]
[INFO] hudi-spark_2.12 .................................... SUCCESS [ 53.643 s]
[INFO] hudi-spark3.3-bundle_2.12 .......................... SUCCESS [ 29.585 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 04:31 min
[INFO] Finished at: 2023-04-19T11:56:41-07:00
[INFO] ---------------------------------------------
Confirmed via the mvn logs that it was using 3.3.1:
[INFO] Copying derby-10.14.2.0.jar to /Users/rchertar/hudi/hudi-spark-datasource/hudi-spark/target/lib/derby-10.14.2.0.jar
[INFO] Copying spark-sql_2.12-3.3.1-tests.jar to /Users/rchertar/hudi/hudi-spark-datasource/hudi-spark/target/lib/spark-sql_2.12-3.3.1-tests.jar
[INFO] Copying spark-core_2.12-3.3.1-tests.jar to /Users/rchertar/hudi/hudi-spark-datasource/hudi-spark/target/lib/spark-core_2.12-3.3.1-tests.jar
[INFO] Copying spark-catalyst_2.12-3.3.1-tests.jar to /Users/rchertar/hudi/hudi-spark-datasource/hudi-spark/target/lib/spark-catalyst_2.12-3.3.1-tests.jar
[INFO] Copying scala-parser-combinators_2.12-1.1.2.jar to /Users/rchertar/hudi/hudi-spark-datasource/hudi-spark/target/lib/scala-parser-combinators_2.12-1.1.2.jar
I'm good if we are not running into any compilation issues
The log you show is different; it copies the jars to target/lib. Could you try modifying the Spark version directly in the POM and see if the changes can be compiled with Spark 3.3.1, to be safe?
@@ -18,7 +18,7 @@
 package org.apache.spark.sql.adapter

 import org.apache.avro.Schema
-import org.apache.hudi.Spark33HoodieFileScanRDD
+import org.apache.hudi.{HoodieSparkUtils, Spark33HoodieFileScanRDD}
 import org.apache.spark.sql._
Unnecessary change?
Yes, will remove, thanks.
Thanks for the fix @rahil-c. You may need to rebase on the latest master to get rid of the test failures.
@yihua: FYI
@rahil-c overall LGTM. Could you run some tests on Spark 3.3.1 as well (either using Java CI or manually), which should execute the changes, to make sure it's backward compatible?
@@ -58,6 +58,7 @@ private[hudi] trait SparkVersionsSupport {
   def gteqSpark3_2_1: Boolean = getSparkVersion >= "3.2.1"
   def gteqSpark3_2_2: Boolean = getSparkVersion >= "3.2.2"
   def gteqSpark3_3: Boolean = getSparkVersion >= "3.3"
+  def gteqSpark3_3_2: Boolean = getSparkVersion >= "3.3.2"
Is this based on compile-time Spark version or runtime Spark version?
Should be runtime, I think.
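For context, the `gteqSpark3_3_2` guard above does a string comparison of the runtime Spark version. Below is a minimal sketch of that idea in Java; `numericGteq` is a hypothetical helper, not Hudi's actual implementation, and it compares dotted versions numerically component by component (a plain lexicographic compare would mis-order versions like "3.10" vs "3.3").

```java
public class SparkVersionCheck {
    // Extract the leading numeric part of a version component,
    // e.g. "2" from "2-SNAPSHOT"; non-numeric components count as 0.
    static int leadingNumber(String part) {
        int i = 0;
        while (i < part.length() && Character.isDigit(part.charAt(i))) {
            i++;
        }
        return i == 0 ? 0 : Integer.parseInt(part.substring(0, i));
    }

    // True if the runtime version is >= the target version, comparing
    // dotted components numerically; missing components are treated as 0.
    static boolean numericGteq(String runtime, String target) {
        String[] a = runtime.split("\\.");
        String[] b = target.split("\\.");
        int len = Math.max(a.length, b.length);
        for (int i = 0; i < len; i++) {
            int x = i < a.length ? leadingNumber(a[i]) : 0;
            int y = i < b.length ? leadingNumber(b[i]) : 0;
            if (x != y) {
                return x > y;
            }
        }
        return true; // all components equal
    }
}
```

In Hudi the version string comes from the runtime (the active Spark session), which is exactly why the guard works even when the jars were compiled against 3.3.1.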
There was one failure in the CI: TestAvroSchemaResolutionSupport.testDataTypePromotions
Had an offline conversation; will disable this test for Spark 3.3.2 for now.
Let's disable the vectorized reader for Spark 3.3.2 first; will ping @xiarixiaoyao to dig into the format issue.
+1, nice job @rahil-c ~
Agree that at this point let's disable the vectorized reader for Spark 3.3.2 only, without affecting the Hudi integration with existing Spark versions, and keep the compile version at Spark 3.3.1.
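As a sketch of what "disable the vectorized reader for Spark 3.3.2 only" could look like, here is a hypothetical version-gated switch (not the actual Hudi patch, which applies this inside its Parquet file format): the user's setting is honored everywhere except on the affected runtime version.

```java
public class VectorizedReaderGate {
    // Hypothetical sketch: decide whether the Parquet vectorized reader may be
    // used, given the runtime Spark version and the user's configuration.
    // On 3.3.2 the reader is force-disabled until the format issue is fixed.
    static boolean vectorizedReaderEnabled(String sparkVersion, boolean userRequested) {
        boolean isSpark332 = sparkVersion.startsWith("3.3.2");
        return userRequested && !isSpark332;
    }
}
```

The design point is that only the 3.3.2 code path changes behavior; users on 3.3.1 and earlier keep the vectorized reader exactly as before.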
…pache#8082)" This reverts commit 65172d3.
* Disable vectorized reader for spark 3.3.2 only
* Keep compile version to be Spark 3.3.1
Co-authored-by: Rahil Chertara <rchertar@amazon.com>
This commit adds the bundle validation on Spark 3.3.2 in Github Java CI to ensure compatibility after we fixed the compatibility issue in apache#8082.
…pache#8082)" This reverts commit 7db547f.
Change Logs
Minor changes to pom
Impact
Upgrade spark to 3.3.2
Risk level
Low, minor upgrade
Documentation Update
Will need doc update
Contributor's checklist