[HUDI-6198] Support Hudi on Spark 3.4.0 #8885
Conversation
Spark 2.4 and Spark 3.0 have recently been dropped by Iceberg. Does Hudi consider reducing the number of maintained versions?
Maintaining compatibility across multiple Spark versions is indeed challenging and time-consuming. However, there are still users running Hudi on Spark 2.4 and 3.0, so we cannot drop support for Spark 2.4 and 3.0 at least for the next two releases.
Resolved review threads:
- hudi-common/src/main/java/org/apache/hudi/common/util/JsonUtils.java (outdated)
- ...rce/hudi-spark3.2.x/src/main/scala/org/apache/spark/sql/HoodieSpark32CatalystPlanUtils.scala
- ...cala/org/apache/spark/sql/execution/datasources/parquet/Spark34HoodieParquetFileFormat.scala (outdated)
@@ -34,6 +34,15 @@ class HoodieParquetFileFormat extends ParquetFileFormat with SparkAdapterSupport

  override def toString: String = "Hoodie-Parquet"

  override def supportBatch(sparkSession: SparkSession, schema: StructType): Boolean = {
FR: Spark 3.4 now supports the vectorized reader on nested fields. However, Hudi does not support this yet due to custom schema evolution logic, so we add logic to override `supportBatch` in `HoodieParquetFileFormat` for Spark 3.4.
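As a rough self-contained sketch of that gating idea (the `DataType`/`StructType` model below is a simplified stand-in, not the real Spark classes):

```scala
// Simplified model of the decision described above: on Spark 3.4 the stock
// ParquetFileFormat would vectorize nested fields too, but Hudi's custom
// schema-evolution path only handles flat schemas, so the override keeps
// batch (vectorized) reads to atomic-typed top-level fields only.
sealed trait DataType
case object AtomicType extends DataType                      // int, string, ...
final case class StructType(fields: Seq[DataType]) extends DataType
final case class ArrayType(element: DataType) extends DataType

object SupportBatchSketch {
  val gteqSpark3_4: Boolean = true // assumption: running on Spark 3.4+

  def supportBatch(schema: StructType): Boolean =
    if (gteqSpark3_4) schema.fields.forall(_ == AtomicType) // no nested fields
    else true // pre-3.4: defer to the parent's decision (simplified here)
}
```

The real override additionally consults the parent `ParquetFileFormat` and the session configuration; this sketch only shows the nested-field restriction.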
Can we add the JIRA for this as a comment here? https://issues.apache.org/jira/browse/HUDI-6262 explains why we have this override, and that we should remove it once we support complex types in Hudi.
JIRA to track: HUDI-6347
@@ -56,30 +58,37 @@ class TestIndexSyntax extends HoodieSparkSqlTestBase {

  var logicalPlan = sqlParser.parsePlan(s"show indexes from default.$tableName")
  var resolvedLogicalPlan = analyzer.execute(logicalPlan)
  assertResult(s"`default`.`$tableName`")(resolvedLogicalPlan.asInstanceOf[ShowIndexesCommand].table.identifier.quotedString)
FR: `table.identifier.quotedString` now also has the catalog name as the prefix.
@@ -733,8 +734,8 @@ object HoodieBaseRelation extends SparkAdapterSupport {

  partitionedFile => {
    val hadoopConf = hadoopConfBroadcast.value.get()
    val reader = new HoodieAvroHFileReader(hadoopConf, new Path(partitionedFile.filePath),
      new CacheConfig(hadoopConf))
    val filePath = sparkAdapter.getSparkPartitionedFileUtils.getPathFromPartitionedFile(partitionedFile)
For Reviewer (FR): all the changes in the common module introducing new adapter support are because of Spark 3.4 class and API changes.
sg
Hi @zhangyue19921010 @xiarixiaoyao @nsivabalan @xushiyan @danny0405, could you also review this PR?
…on and add checkColumnNameDuplication to fix AlterHoodieTableAddColumnsCommand
…w more compatibility issues
// dialect: JdbcDialect,
// alwaysNullable: Boolean = false,
// isTimestampNTZ: Boolean = false): StructType
JdbcUtils.getSchema(resultSet, dialect, alwaysNullable)
For my own understanding, where does this get used?
This is used by `UtilHelpers.getJDBCSchema` for the JDBC-based schema provider.
 * notMatchedActions: Seq[MergeAction],
 * notMatchedBySourceActions: Seq[MergeAction]) extends BinaryCommand with SupportsSubquery
 */
def unapplyMergeIntoTable(plan: LogicalPlan): Option[(LogicalPlan, LogicalPlan, Expression)]
For my own understanding, where does this get used?
This is ultimately used in `HoodieAnalysis` to match the plan.
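To make the indirection concrete, here is a rough self-contained sketch (stand-in types, not the actual Spark or Hudi classes) of how such a version-specific extractor hides the constructor-arity change:

```scala
// Spark 3.4 added a notMatchedBySourceActions argument to MergeIntoTable,
// so pattern matching on the case class directly would not compile across
// Spark versions. The adapter's extractor exposes only the stable parts.
sealed trait LogicalPlan
final case class Table(name: String) extends LogicalPlan
final case class Expression(sql: String)

// Stand-in for the Spark 3.4 node shape (six arguments; pre-3.4 had five).
final case class MergeIntoTable(
    targetTable: LogicalPlan,
    sourceTable: LogicalPlan,
    mergeCondition: Expression,
    matchedActions: Seq[String],
    notMatchedActions: Seq[String],
    notMatchedBySourceActions: Seq[String]) extends LogicalPlan

// Callers (in Hudi, HoodieAnalysis) match plans through this function and
// never reference the version-dependent constructor arity.
def unapplyMergeIntoTable(plan: LogicalPlan): Option[(LogicalPlan, LogicalPlan, Expression)] =
  plan match {
    case MergeIntoTable(target, source, cond, _, _, _) => Some((target, source, cond))
    case _ => None
  }
```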
override def updatePrunedDataSchema(prunedSchema: StructType): Relation =
  this.copy(prunedDataSchema = Some(prunedSchema))

override def imbueConfigs(sqlContext: SQLContext): Unit = {
  super.imbueConfigs(sqlContext)
  // TODO Issue with setting this to true in spark 332
-  if (!HoodieSparkUtils.gteqSpark3_3_2) {
+  if (HoodieSparkUtils.gteqSpark3_4 || !HoodieSparkUtils.gteqSpark3_3_2) {
@yihua Now when I think about it, I think the changes we made in 3.4 (specifically around `supportBatch` and `supportsColumnar` in ParquetFileFormat) should be brought to our 3.3.2 impl. This would then allow us to remove this if check altogether, and allow the vectorized reader to be enabled regardless of Spark version.
I would say let's keep the logic for other spark versions as is and not balloon the scope of this PR.
I think if we sync I can try to help you get past this error for the other spark version.
JIRA to track: HUDI-6347
@@ -34,6 +34,15 @@ class HoodieParquetFileFormat extends ParquetFileFormat with SparkAdapterSupport

  override def toString: String = "Hoodie-Parquet"

  override def supportBatch(sparkSession: SparkSession, schema: StructType): Boolean = {
    if (HoodieSparkUtils.gteqSpark3_4) {
I'm wondering if this should be for greater than or equal to Spark 3.3.2, so that we can have this fix there as well.
Let's take this separately since this PR is not related to Spark 3.3.2. The code here is to make sure all other versions maintain the same logic.
The tests fail for other spark versions if I don't add this check.

Merge Hudi to Hudi *** FAILED ***
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3194.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3194.0 (TID 3768) (fv-az1128-658 executor driver): java.lang.ClassCastException: org.apache.spark.sql.vectorized.ColumnarBatchRow cannot be cast to org.apache.spark.sql.vectorized.ColumnarBatch
at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.next(DataSourceScanExec.scala:560)
at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.next(DataSourceScanExec.scala:549)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
Resolved review threads:
- ...mon/src/main/scala/org/apache/spark/sql/hudi/command/AlterHoodieTableAddColumnsCommand.scala (outdated)
- ...mon/src/main/scala/org/apache/spark/sql/hudi/command/AlterHoodieTableAddColumnsCommand.scala
 * <li>Schema on-read</li>
 * </ol>
 */
class Spark33HoodieParquetFileFormat(private val shouldAppendPartitionValues: Boolean) extends ParquetFileFormat {
Just to confirm, for this class we migrated the logic as what spark32plus originally had? Are there any other changes we have added in this, and is this what Spark 3.3.2 will use?
I'm thinking if we should make a ParquetFileFormat for Spark 3.3.2 and add a similar `supportBatch` and `supportsColumnar` so we can avoid cast issues instead of disabling the vectorized reader for Spark 3.3.2.
This is copied from `Spark32PlusHoodieParquetFileFormat`. Let's take the Spark 3.3.2 improvement in a separate PR and not put everything in the same one.
LGTM. A few minor comments. Good job folks on the 3.4.0 support 👏
 * @param partitionedFile Spark [[PartitionedFile]] instance.
 * @return Hadoop [[Path]] instance.
 */
def getPathFromPartitionedFile(partitionedFile: PartitionedFile): Path
May I know why we need two methods, one to fetch the string version and another to fetch the Path? We can return a string and the caller can construct the Path if need be.
`SparkPath` has a `toPath` function to efficiently return the `Path` instance in Spark 3.4, so I opted to have two separate APIs here.
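A rough self-contained sketch of that two-method design (all types below are simplified stand-ins for Hadoop's `Path`, Spark 3.4's `SparkPath`, and `PartitionedFile`, not the real classes):

```scala
// The point of two accessors: Spark 3.4's PartitionedFile already carries a
// SparkPath that can produce a Path directly, so each Spark version can
// return the representation it already has without an extra conversion.
final case class Path(value: String)                  // stand-in for Hadoop Path
final case class SparkPath(urlEncoded: String) {      // stand-in for Spark 3.4's SparkPath
  def toPath: Path = Path(urlEncoded)                 // 3.4 yields a Path without re-parsing
}
final case class PartitionedFile(filePath: SparkPath)

trait HoodieSparkPartitionedFileUtils {
  def getStringPathFromPartitionedFile(pf: PartitionedFile): String
  def getPathFromPartitionedFile(pf: PartitionedFile): Path
}

// Spark 3.4-flavored implementation: reuse SparkPath.toPath instead of
// rebuilding a Path from the string form.
object Spark34LikeUtils extends HoodieSparkPartitionedFileUtils {
  def getStringPathFromPartitionedFile(pf: PartitionedFile): String = pf.filePath.urlEncoded
  def getPathFromPartitionedFile(pf: PartitionedFile): Path = pf.filePath.toPath
}
```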
override def updatePrunedDataSchema(prunedSchema: StructType): Relation =
  this.copy(prunedDataSchema = Some(prunedSchema))

override def imbueConfigs(sqlContext: SQLContext): Unit = {
  super.imbueConfigs(sqlContext)
  // TODO Issue with setting this to true in spark 332
-  if (!HoodieSparkUtils.gteqSpark3_3_2) {
+  if (HoodieSparkUtils.gteqSpark3_4 || !HoodieSparkUtils.gteqSpark3_3_2) {
Should we introduce `HoodieSparkUtils.ltSpark3_3_2` instead of `!HoodieSparkUtils.gteqSpark3_3_2`?
We can do that. I kept the existing code the same in this PR.
@@ -733,8 +734,8 @@ object HoodieBaseRelation extends SparkAdapterSupport {

  partitionedFile => {
    val hadoopConf = hadoopConfBroadcast.value.get()
    val reader = new HoodieAvroHFileReader(hadoopConf, new Path(partitionedFile.filePath),
      new CacheConfig(hadoopConf))
    val filePath = sparkAdapter.getSparkPartitionedFileUtils.getPathFromPartitionedFile(partitionedFile)
sg
Thank you folks for the review. I'll merge this PR now. Feel free to add more comments if there are more, given we're having an ongoing discussion on the Hudi Spark integration.
Change Logs
This PR adds support for Apache Hudi on Spark 3.4.0.
There are a few changes in Spark 3.4.0 that the Hudi Spark integration needs to adjust to, particularly:

- `SQLConf.PARQUET_INFER_TIMESTAMP_NTZ_ENABLED` (`spark.sql.parquet.inferTimestampNTZ.enabled`) is required for parquet reading
- Constructor change of `MergeIntoTable` (five to six arguments)
- API change of `SchemaUtils.checkColumnNameDuplication`
- `PartitionedFile` change

To support Apache Hudi on Spark 3.4.0, here are the changes made:

- `SparkAdapterSupport` and `HoodieSparkUtils` are updated to recognize Spark 3.4.0.
- To cope with `SQLConf.PARQUET_INFER_TIMESTAMP_NTZ_ENABLED` (`spark.sql.parquet.inferTimestampNTZ.enabled`) required for parquet reading in Spark 3.4.0, the config is now set in `HoodieSparkFileReaderFactory`.
- To cope with the constructor change of `MergeIntoTable` (five to six arguments), a new function `def unapplyMergeIntoTable(plan: LogicalPlan): Option[(LogicalPlan, LogicalPlan, Expression)]` in `HoodieCatalystPlansUtils` and corresponding implementations in each Spark version are added in the Spark adapter framework.
- To cope with the API change of `SchemaUtils.checkColumnNameDuplication`, `HoodieSchemaUtils` with a `checkColumnNameDuplication` function and corresponding implementations in each Spark version are added to the Spark adapter framework.
- To cope with the `PartitionedFile` change where the filePath is switched to `SparkPath` for type safety, `HoodieSparkPartitionedFileUtils` and corresponding implementations in each Spark version are added to the Spark adapter framework.
- Hudi defines its own `BASE_PATH_PARAM = "basePath"` as Spark 3.4.0 uses `FileIndexOptions.BASE_PATH_PARAM` instead of `PartitioningAwareFileIndex.BASE_PATH_PARAM`.
- Refactoring of `BaseFileOnlyRelation`, `HoodieBaseRelation`, `HoodieBootstrapRelation`, `HoodieDataSourceHelper`, `Iterators`, `MergeOnReadSnapshotRelation`, `HoodieCDCRDD`, `AlterHoodieTableAddColumnsCommand`, and `HoodieAnalysis`.
- `supportBatch` is overridden in `HoodieParquetFileFormat` for Spark 3.4.
- `ColumnVector` is used instead of `WritableColumnVector` to be general in `Spark32PlusHoodieVectorizedParquetRecordReader`.
- A new `hudi-spark3.4.x` module is created by mainly copying the code from the `hudi-spark3.3.x` module and making the following adjustments (more details below in the code comparison):
  - `Spark32PlusHoodieParquetFileFormat` is removed. Each Spark version module has its own file format class. `Spark34HoodieParquetFileFormat` adds custom logic on when to enable the vectorized reader based on various conditions, compared to `Spark33HoodieParquetFileFormat`.
  - `HoodieSpark34PartitionedFileUtils` is different from `HoodieSpark33PartitionedFileUtils` because of `PartitionedFile` changes.
  - `Spark34ResolveHudiAlterTableCommand` adjusts logic based on `AlterColumn` class changes.
  - `HoodieSpark3_4ExtendedSqlAstBuilder` adjusts logic based on Spark 3.4.
  - `HoodieSpark34CatalystExpressionUtils` adjusts logic based on the `ParseToTimestamp` change and the removal of `AnsiCast`, compared to `HoodieSpark33CatalystExpressionUtils`.
  - `HoodieSpark34CatalystPlanUtils` adjusts logic based on the `MergeIntoTable` change.
  - `HoodieSpark34SchemaUtils` is different from `HoodieSpark33SchemaUtils` because of the API change of `SchemaUtils.checkColumnNameDuplication`.
- Tests are fixed in `TestInsertTable`, `TestMergeIntoTable`, `TestIndexSyntax` (`table.identifier.quotedString` now also has the catalog name as the prefix), and `TestCopyToTempViewProcedure`. New error messages in Spark SQL are presented in https://github.com/apache/spark/blob/master/core/src/main/resources/error/error-classes.json.

This PR addresses all compatibility issues in the Hudi Spark integration across Spark versions on top of #8826, which adds support of Hudi for Spark 3.4.0 on the Hudi 0.13.1 release only. @mansipp, @rahil-c, @CTTY, and @umehrot2 have significantly contributed to #8826 to make the Hudi 0.13.1 release work on Spark 3.4.0 for the EMR release. This PR makes sure all recent changes in the Hudi Spark integration since Hudi 0.13.1 are considered and taken care of, including:

- New `Rule`s (#7871)

More details
Major changes from the `hudi-spark3.3.x` module to the `hudi-spark3.4.x` module:

- `Spark33HoodieParquetFileFormat` vs `Spark34HoodieParquetFileFormat`
- `HoodieSpark33PartitionedFileUtils` vs `HoodieSpark34PartitionedFileUtils`
- `Spark33ResolveHudiAlterTableCommand` vs `Spark34ResolveHudiAlterTableCommand`
- `HoodieSpark3_3ExtendedSqlAstBuilder` vs `HoodieSpark3_4ExtendedSqlAstBuilder`
- `HoodieSpark33CatalystExpressionUtils` vs `HoodieSpark34CatalystExpressionUtils`
- `HoodieSpark33CatalystPlanUtils` vs `HoodieSpark34CatalystPlanUtils`
- `HoodieSpark33SchemaUtils` vs `HoodieSpark34SchemaUtils`
Impact
This enables Hudi workloads to run on Spark 3.4.0.
Risk level
Medium
The PR is thoroughly tested by the Java CI, Azure CI, and integration tests on EMR and S3.
Documentation Update
We'll update the Hudi website on the Spark 3.4.0 support: HUDI-6341.
Contributor's checklist