
[HUDI-6963] Fix class conflict of CreateIndex from Spark3.3 #9895

Merged
merged 2 commits into from
Oct 27, 2023

Conversation

boneanxs
Contributor

@boneanxs boneanxs commented Oct 20, 2023

Change Logs

CreateIndex was added in HUDI-4165, and Spark 3.3 also includes it via SPARK-36895. Since CreateIndex lives in the same package, org.apache.spark.sql.catalyst.plans.logical, in both Hudi and Spark 3.3 but with different parameters, using it can introduce class conflict issues.

One simple way to fix this issue is to move CreateIndex to a different package path directly, but I don't think that's the right approach, given:

  1. We would lose the benefit of Spark applying its other analysis rules to CreateIndex
  2. It would cost us extra effort when moving to DSV2
  3. Hudi has always used org.apache.spark.sql.catalyst.plans.logical for logical plans (like MergeInto)

So here I keep the same package path as Spark, with the following changes:

  1. Use the same parameters as Spark, so there should be no class conflict
  2. Since Spark 2 doesn't have org.apache.spark.sql.catalyst.analysis.FieldName, which CreateIndex requires, index-related commands are only supported from Spark 3.2 onward (I could also add support for Spark 2, Spark 3.0, and Spark 3.1, but I'm not sure whether that's still needed)
  3. Address the TODO: resolve columns for CreateIndex during the analysis stage
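The analysis-stage resolution in point 3 can be sketched roughly as follows. This is a minimal, Spark-free sketch; `ResolvedField`, `resolveFieldName`, and the schema set are hypothetical stand-ins for Hudi's actual resolution of field names against the table schema:

```scala
// Hypothetical sketch: during analysis, an unresolved dotted column path
// (Seq[String]) is checked against the table schema; an unknown column fails
// analysis up front instead of surfacing later at execution time.
case class ResolvedField(path: Seq[String])

def resolveFieldName(schemaFields: Set[String], name: Seq[String]): ResolvedField = {
  val dotted = name.mkString(".")
  if (schemaFields.contains(dotted)) ResolvedField(name)
  else throw new IllegalArgumentException(s"Field $dotted not found in table schema")
}

val schemaFields = Set("id", "ts", "address.city")
val resolved = resolveFieldName(schemaFields, Seq("address", "city"))
```

The point of doing this during analysis is that a bad column name in CREATE INDEX fails with a clear analysis error rather than a runtime failure.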

Impact

Describe any public API or user-facing feature change or any performance impact.

Index-related commands are no longer supported below Spark 3.2.

Risk level (write none, low, medium, or high below)

If medium or high, explain what verification was done to mitigate the risks.
low

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@yihua
Contributor

yihua commented Oct 24, 2023

cc @codope

| IN
| INDEX
| INDEXES
| IF
Contributor

Do we still need some of these tokens for other SQL statements?

Contributor Author

No, I checked and only removed the unneeded ones.


  override def run(sparkSession: SparkSession): Seq[Row] = {
    val tableId = table.identifier
    val metaClient = createHoodieTableMetaClient(tableId, sparkSession)
    val columnsMap: java.util.LinkedHashMap[String, java.util.Map[String, String]] =
      new util.LinkedHashMap[String, java.util.Map[String, String]]()
-   columns.map(c => columnsMap.put(c._1.name, c._2.asJava))
+   columns.map(c => columnsMap.put(c._1.mkString("."), c._2.asJava))
Contributor

Why change this? for nested fields?

Contributor Author

Now the column name is a Seq[String] instead of an Attribute (an UnresolvedAttribute, to be more specific), since this PR resolves column names:

case (u: UnresolvedFieldName, prop) => resolveFieldNames(cmd.table, u.name, u) -> prop

UnresolvedAttribute is no longer needed, so I directly use Seq[String] to represent the column name here.
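The Seq[String] representation also handles nested fields naturally; a small illustration (the column name here is hypothetical):

```scala
// A nested column name carried as Seq[String] (the shape Spark's
// UnresolvedFieldName.name uses) joins into the dotted key that gets
// stored in columnsMap via mkString("."):
val nested: Seq[String] = Seq("address", "city")
val key: String = nested.mkString(".")
// key == "address.city"
```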

Contributor

Got it.

@@ -149,144 +149,4 @@ class HoodieSqlCommonAstBuilder(session: SparkSession, delegate: ParserInterface
private def typedVisit[T](ctx: ParseTree): T = {
ctx.accept(this).asInstanceOf[T]
}

/**
Contributor

Is all of this code copied from Spark 3.3+? Wondering if the original PR introducing this intended to support CreateIndex for Spark 3.2 and below. cc @huberylee

Contributor Author

This code was introduced in #5761; maybe it was only for supporting index commands.

| )
| partitioned by(ts)
| location '$basePath'
if (HoodieSparkUtils.gteqSpark3_2) {
Contributor

Looks like TestSecondaryIndex should also have a precondition on the Spark version.

Contributor Author

yea, will fix it as well
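Such a precondition amounts to a "major.minor" version comparison. A minimal sketch, assuming a version string of that form (the parsing logic here is hypothetical; HoodieSparkUtils.gteqSpark3_2 is the actual guard used in the diff above):

```scala
// Hypothetical version check mirroring what a guard like
// HoodieSparkUtils.gteqSpark3_2 decides: parse "major.minor"
// from the Spark version string and compare against 3.2.
def gteqSpark(version: String, major: Int, minor: Int): Boolean = {
  val parts = version.split("\\.").take(2).map(_.toInt)
  parts(0) > major || (parts(0) == major && parts(1) >= minor)
}

// Index-related tests would only run when the guard passes:
val runIndexTests = gteqSpark("3.3.1", 3, 2)  // true
```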

@@ -3317,6 +3317,145 @@ class HoodieSpark3_2ExtendedSqlAstBuilder(conf: SQLConf, delegate: ParserInterfa
position = Option(ctx.colPosition).map(pos =>
UnresolvedFieldPosition(typedVisit[ColumnPosition](pos))))
}

/**
Contributor

Got it. So at least CreateIndex is still supported in Spark 3.2.

Contributor Author

Yes

@@ -29,5 +29,12 @@ statement
| createTableHeader ('(' colTypeList ')')? tableProvider?
createTableClauses
(AS? query)? #createTable
| CREATE INDEX (IF NOT EXISTS)? identifier ON TABLE?
Contributor

Could we still maintain the grammar in a single place for all Spark versions, but fail the logical plan of the INDEX SQL statement in Spark 3.1 and below, so the grammar can be easily maintained?

Contributor Author

We can't put the grammar in HoodieSqlCommon.g4, since we'd have to parse it in HoodieSqlCommonAstBuilder to build CreateIndex, but one parameter type, FieldName, doesn't exist below Spark 3.

We already do this for time travel, and the grammar is rarely modified, so maybe changing it like this is acceptable?

Contributor

Got it. @codope is working on functional indexes in Hudi and may extend the grammar here. So I'm thinking if it's not too hard to put the grammar in a common place, we should do that.

@@ -3327,6 +3327,145 @@ class HoodieSpark3_3ExtendedSqlAstBuilder(conf: SQLConf, delegate: ParserInterfa
position = Option(ctx.colPosition).map(pos =>
UnresolvedFieldPosition(typedVisit[ColumnPosition](pos))))
}

/**
Contributor

I assume the SQL parsing of the INDEX SQL statement should not differ across Spark versions.

Contributor Author

Yea, it should be the same.

@hudi-bot

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Contributor

@yihua yihua left a comment

LGTM

@yihua yihua merged commit 4f723fb into apache:master Oct 27, 2023
29 checks passed

3 participants