
[HUDI-6963] Fix class conflict of CreateIndex from Spark3.3 #9895

Merged
merged 2 commits into from
Oct 27, 2023

Conversation

boneanxs
Contributor

@boneanxs boneanxs commented Oct 20, 2023

Change Logs

CreateIndex was added in HUDI-4165, and Spark 3.3 also includes it via SPARK-36895. Since CreateIndex lives in the same package, org.apache.spark.sql.catalyst.plans.logical, in both Hudi and Spark 3.3 but with different parameters, using it can introduce class conflict issues.

One simple way to fix this issue is to move CreateIndex to a different package path directly, but I don't think that's the right approach, given:

  1. We would lose the benefit of Spark applying its other analysis rules to CreateIndex
  2. It would cost us extra effort when moving to DSV2
  3. Hudi has always used org.apache.spark.sql.catalyst.plans.logical for logical plans (like MergeInto)

So here I keep the same package path as Spark, with the following changes:

  1. Use the same parameters as Spark, so there should be no class conflict
  2. Since Spark 2 doesn't have org.apache.spark.sql.catalyst.analysis.FieldName, which CreateIndex requires, index-related commands are only supported from Spark 3.2 onward (I could also add support for Spark 2, Spark 3.0, and Spark 3.1, but I'm not sure whether that's still needed)
  3. Address the TODO: resolve columns for CreateIndex during the analysis stage
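The analysis-stage resolution in point 3 can be sketched roughly as follows. This is a minimal, Spark-free sketch; `ResolvedField`, `resolveFieldName`, and the schema set are hypothetical stand-ins for Hudi's actual resolution of field names against the table schema:

```scala
// Hypothetical sketch: during analysis, an unresolved dotted column path
// (Seq[String]) is checked against the table schema; an unknown column fails
// analysis up front instead of surfacing later at execution time.
case class ResolvedField(path: Seq[String])

def resolveFieldName(schemaFields: Set[String], name: Seq[String]): ResolvedField = {
  val dotted = name.mkString(".")
  if (schemaFields.contains(dotted)) ResolvedField(name)
  else throw new IllegalArgumentException(s"Field $dotted not found in table schema")
}

val schemaFields = Set("id", "ts", "address.city")
val resolved = resolveFieldName(schemaFields, Seq("address", "city"))
```

The point of doing this during analysis is that a bad column name in CREATE INDEX fails with a clear analysis error rather than a runtime failure.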

Impact

Describe any public API or user-facing feature change or any performance impact.

Index-related commands are no longer supported below Spark 3.2.

Risk level (write none, low, medium, or high below)

If medium or high, explain what verification was done to mitigate the risks.
low

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@yihua
Contributor

yihua commented Oct 24, 2023

cc @codope

| IN
| INDEX
| INDEXES
| IF
Contributor

Do we still need some of these tokens for other SQL statements?

Contributor Author

No, I checked and only removed the unneeded ones.


  override def run(sparkSession: SparkSession): Seq[Row] = {
    val tableId = table.identifier
    val metaClient = createHoodieTableMetaClient(tableId, sparkSession)
    val columnsMap: java.util.LinkedHashMap[String, java.util.Map[String, String]] =
      new util.LinkedHashMap[String, java.util.Map[String, String]]()
-   columns.map(c => columnsMap.put(c._1.name, c._2.asJava))
+   columns.map(c => columnsMap.put(c._1.mkString("."), c._2.asJava))
Contributor

Why change this? for nested fields?

Contributor Author

Now the column name is a Seq[String] instead of an Attribute (an UnresolvedAttribute, to be more specific), since this PR resolves column names:

case (u: UnresolvedFieldName, prop) => resolveFieldNames(cmd.table, u.name, u) -> prop

UnresolvedAttribute is no longer needed, so I directly use Seq[String] to represent the column name here.
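The Seq[String] representation also handles nested fields naturally; a small illustration (the column name here is hypothetical):

```scala
// A nested column name carried as Seq[String] (the shape Spark's
// UnresolvedFieldName.name uses) joins into the dotted key that gets
// stored in columnsMap via mkString("."):
val nested: Seq[String] = Seq("address", "city")
val key: String = nested.mkString(".")
// key == "address.city"
```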

Contributor

Got it.

@@ -149,144 +149,4 @@ class HoodieSqlCommonAstBuilder(session: SparkSession, delegate: ParserInterface
private def typedVisit[T](ctx: ParseTree): T = {
ctx.accept(this).asInstanceOf[T]
}

/**
Contributor

Is all of this code copied from Spark 3.3+? Wondering if the original PR introducing this intended to support CreateIndex for Spark 3.2 and below. cc @huberylee

Contributor Author

This code was introduced in #5761; maybe it was only for supporting index commands.

| )
| partitioned by(ts)
| location '$basePath'
if (HoodieSparkUtils.gteqSpark3_2) {
Contributor

Looks like TestSecondaryIndex should also have a precondition on the Spark version.

Contributor Author

yea, will fix it as well
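Such a precondition amounts to a "major.minor" version comparison. A minimal sketch, assuming a version string of that form (the parsing logic here is hypothetical; HoodieSparkUtils.gteqSpark3_2 is the actual guard used in the diff above):

```scala
// Hypothetical version check mirroring what a guard like
// HoodieSparkUtils.gteqSpark3_2 decides: parse "major.minor"
// from the Spark version string and compare against 3.2.
def gteqSpark(version: String, major: Int, minor: Int): Boolean = {
  val parts = version.split("\\.").take(2).map(_.toInt)
  parts(0) > major || (parts(0) == major && parts(1) >= minor)
}

// Index-related tests would only run when the guard passes:
val runIndexTests = gteqSpark("3.3.1", 3, 2)  // true
```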

@@ -3317,6 +3317,145 @@ class HoodieSpark3_2ExtendedSqlAstBuilder(conf: SQLConf, delegate: ParserInterfa
position = Option(ctx.colPosition).map(pos =>
UnresolvedFieldPosition(typedVisit[ColumnPosition](pos))))
}

/**
Contributor

Got it. So at least CreateIndex is still supported in Spark 3.2.

Contributor Author

Yes

@@ -29,5 +29,12 @@ statement
| createTableHeader ('(' colTypeList ')')? tableProvider?
createTableClauses
(AS? query)? #createTable
| CREATE INDEX (IF NOT EXISTS)? identifier ON TABLE?
Contributor

Could we still maintain the grammar in a single place for all Spark versions, but fail the logical plan of the INDEX SQL statement in Spark 3.1 and below, so the grammar can be easily maintained?

Contributor Author

We can't put the grammar in HoodieSqlCommon.g4, since we'd have to parse it in HoodieSqlCommonAstBuilder to build CreateIndex, but one parameter type, FieldName, doesn't exist below Spark 3.

We already do this for time travel, and the grammar is rarely modified, so maybe changing it like this is acceptable?

Contributor

Got it. @codope is working on functional indexes in Hudi and may extend the grammar here. So I'm thinking if it's not too hard to put the grammar in a common place, we should do that.

@@ -3327,6 +3327,145 @@ class HoodieSpark3_3ExtendedSqlAstBuilder(conf: SQLConf, delegate: ParserInterfa
position = Option(ctx.colPosition).map(pos =>
UnresolvedFieldPosition(typedVisit[ColumnPosition](pos))))
}

/**
Contributor

I assume the SQL parsing of the INDEX SQL statement should not differ across Spark versions.

Contributor Author

Yea, it should be the same.

@hudi-bot

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Contributor

@yihua yihua left a comment

LGTM

@yihua yihua merged commit 4f723fb into apache:master Oct 27, 2023
29 checks passed

3 participants