
[HUDI-6464] Spark SQL Merge Into for pkless tables#9083

Merged
nsivabalan merged 20 commits into apache:master from jonvex:mit_add_filters
Jul 7, 2023

Conversation

Contributor

@jonvex jonvex commented Jun 29, 2023

Change Logs

Tables with a primary key must now join on all primary key columns. They can additionally join on the partition path columns, which is recommended if the table is not using a global index.

Tables without a primary key can join on any columns in the table. If multiple source table rows match a single target table row, the precombine field is used to pick the winner if it is set; otherwise, the behavior is nondeterministic. To improve performance, the Hudi meta columns are retained after the join, so that the index lookup and key generation can be skipped.
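To make the matching rules concrete, here is a minimal, hypothetical Python sketch (not Hudi code) of the semantics described above: source and target rows join on arbitrary columns, and when several source rows match the same target row, the one with the highest precombine value wins:

```python
# Hypothetical sketch of pkless MERGE INTO semantics; all names are illustrative.
def merge_into(target, source, join_cols, precombine="ts"):
    """Return the merged table as a list of dicts."""
    key = lambda row: tuple(row[c] for c in join_cols)
    # Dedupe source rows per join key, keeping the highest precombine value.
    best = {}
    for row in source:
        k = key(row)
        if k not in best or row[precombine] > best[k][precombine]:
            best[k] = row
    merged = []
    matched_keys = set()
    for row in target:
        k = key(row)
        if k in best:            # WHEN MATCHED THEN UPDATE SET *
            merged.append(dict(best[k]))
            matched_keys.add(k)
        else:
            merged.append(dict(row))
    for k, row in best.items():  # WHEN NOT MATCHED THEN INSERT *
        if k not in matched_keys:
            merged.append(dict(row))
    return merged

target = [{"id": 1, "val": "a", "ts": 1}, {"id": 2, "val": "b", "ts": 1}]
source = [{"id": 1, "val": "x", "ts": 2},
          {"id": 1, "val": "y", "ts": 3},   # same join key, higher ts: wins
          {"id": 3, "val": "c", "ts": 1}]
result = merge_into(target, source, join_cols=["id"])
# id 1 is updated to the ts=3 source row, id 2 is untouched, id 3 is inserted
```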

NOTE: case-insensitive column name recognition no longer works.

Impact

Allows usage of the MERGE INTO SQL feature for pkless tables.
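For illustration, a hedged sketch of the usage this enables (table and column names here are hypothetical, not taken from this PR):

```sql
-- Hypothetical pkless Hudi table: created without a primaryKey tblproperty,
-- so MERGE INTO may join on any columns of the table.
MERGE INTO target_pkless AS t
USING source_changes AS s
ON t.order_id = s.order_id AND t.region = s.region
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
```

For a table that does have a primary key, the ON clause must instead cover all primary key columns, ideally plus the partition path columns when a global index is not in use.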

Risk level (write none, low medium or high below)

Medium. Additional tests have been written.

Documentation Update

Release notes

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@jonvex jonvex changed the title Mit add filters PKLess Merge Into Jun 29, 2023
@jonvex jonvex changed the title PKLess Merge Into [HUDI-6464] Spark SQL Merge Into for pkless tables Jul 1, 2023
@jonvex jonvex marked this pull request as ready for review July 1, 2023 02:33
@nsivabalan nsivabalan added release-0.14.0 priority:blocker Production down; release blocker labels Jul 2, 2023
@nsivabalan
Contributor

Yet to review tests, but you can start addressing the source code comments in the meantime.

@jonvex jonvex requested a review from nsivabalan July 2, 2023 16:38

@Override
public boolean isGlobal() {
return false;
Contributor

If the global version is to be implemented, does the user simply need to set a config so that we return true here for the global index? Since the record location is already known from the meta column, how does the global/non-global part come into play here?

Contributor Author

I believe it comes into play when changing the partition: https://issues.apache.org/jira/browse/HUDI-6471

Comment on lines +207 to +213
/**
* Calls fail analysis on
*
*/
def failAnalysisForMIT(a: Attribute, cols: String): Unit = {}

def createMITJoin(left: LogicalPlan, right: LogicalPlan, joinType: JoinType, condition: Option[Expression], hint: String): LogicalPlan
Contributor

nit: Put these into HoodieCatalystPlansUtils?


val hoodieKey = new HoodieKey(recordKey, partitionPath)
val instantTime: Option[String] = if (isPrepped) {
val instantTime: Option[String] = if (isPrepped || mergeIntoWrites) {
Contributor

Is the commit time pre-populated based on the current commit (not the commit time from the meta columns in the existing files) for MIT records?

Contributor Author

I think it might be changed at write time?



/**
* NOTE TO USERS: YOU SHOULD NOT SET THIS AS YOUR KEYGENERATOR
Contributor

Should we add validation on the key generator in a follow-up PR, so that SqlKeyGenerator and MergeIntoKeyGenerator cannot be set by the user?

Contributor Author

We could

// NOTE: This rule adjusts [[LogicalRelation]]s resolving into Hudi tables such that
// meta-fields are not affecting the resolution of the target columns to be updated by Spark.
// meta-fields are not affecting the resolution of the target columns to be updated by Spark (Except in the
// case of MergeInto. We leave the meta columns on the target table, and use other means to ensure resolution)
Contributor

Minor: my understanding is that for a pkless table, we need the record key, partition path, and filename from the meta columns. The other meta columns can be pruned. I assume that if we only keep those three, Spark does not read the others, which is a small improvement we could make.

Contributor Author

It makes aligning the schema more difficult, so it could be done in a follow-up.


import org.apache.hudi.{HoodieSparkUtils, ScalaAssertionSupport}

class TestMergeIntoTable3 extends HoodieSparkSqlTestBase with ScalaAssertionSupport {
Contributor

It's OK to add a new test class. I think Siva's point is that, instead of naming it with numbers, we should name it with readability in mind, something like TestMergeIntoWithNonRecordKeyField.

Comment on lines +142 to +143
//
////
Contributor

Docs to add? Could you check all the places where the comment is empty?

Comment on lines +151 to +202
case _ =>
val newMatchedActions = m.matchedActions.map {
case DeleteAction(deleteCondition) =>
val resolvedDeleteCondition = deleteCondition.map(
resolveExpressionByPlanChildren(_, m))
DeleteAction(resolvedDeleteCondition)
case UpdateAction(updateCondition, assignments) =>
val resolvedUpdateCondition = updateCondition.map(
resolveExpressionByPlanChildren(_, m))
UpdateAction(
resolvedUpdateCondition,
// The update value can access columns from both target and source tables.
resolveAssignments(assignments, m, resolveValuesWithSourceOnly = false))
case UpdateStarAction(updateCondition) =>
////Hudi change: filter out meta fields
//
val assignments = targetTable.output.filter(a => !isMetaField(a.name)).map { attr =>
Assignment(attr, UnresolvedAttribute(Seq(attr.name)))
}
//
////
UpdateAction(
updateCondition.map(resolveExpressionByPlanChildren(_, m)),
// For UPDATE *, the value must from source table.
resolveAssignments(assignments, m, resolveValuesWithSourceOnly = true))
case o => o
}
val newNotMatchedActions = m.notMatchedActions.map {
case InsertAction(insertCondition, assignments) =>
// The insert action is used when not matched, so its condition and value can only
// access columns from the source table.
val resolvedInsertCondition = insertCondition.map(
resolveExpressionByPlanChildren(_, Project(Nil, m.sourceTable)))
InsertAction(
resolvedInsertCondition,
resolveAssignments(assignments, m, resolveValuesWithSourceOnly = true))
case InsertStarAction(insertCondition) =>
// The insert action is used when not matched, so its condition and value can only
// access columns from the source table.
val resolvedInsertCondition = insertCondition.map(
resolveExpressionByPlanChildren(_, Project(Nil, m.sourceTable)))
////Hudi change: filter out meta fields
//
val assignments = targetTable.output.filter(a => !isMetaField(a.name)).map { attr =>
Assignment(attr, UnresolvedAttribute(Seq(attr.name)))
}
//
////
InsertAction(
resolvedInsertCondition,
resolveAssignments(assignments, m, resolveValuesWithSourceOnly = true))
case o => o
Contributor

Most of the code here is similar to Spark's ResolveReferences. Could you make a note of this and add docs before case mO@MatchMergeIntoTable(targetTableO, sourceTableO, _) to summarize the custom changes?

Contributor

Also, is the code copied from Spark 3.2? Any difference among Spark 3.2, 3.3, and 3.4?

Contributor Author

I marked the custom changes by surrounding them in
////
//
changes
//
////

Contributor Author

Spark 3.2 and 3.3 are the same. In a follow-up we may want to use the Spark 3.4 code.

*
* PLEASE REFRAIN MAKING ANY CHANGES TO THIS CODE UNLESS ABSOLUTELY NECESSARY
*/
object HoodieSpark30Analysis {
Contributor

@yihua yihua Jul 3, 2023

For Spark 3.0 and 3.1, have you checked whether the code here is different from Spark's ResolveReferences? Given that we introduce the custom rule here, we should still match the implementation of ResolveReferences in the corresponding Spark version, except for the custom logic you added.


public class HoodieInternalProxyIndex extends HoodieIndex<Object, Object> {

/**
Contributor

Can we move the docs to L29?

@nsivabalan
Contributor

Let's wait for all GitHub Actions to succeed before we land.

Contributor

@yihua yihua left a comment

LGTM on the core functionality. Let's address comments in a follow-up PR.

@hudi-bot
Collaborator

hudi-bot commented Jul 7, 2023

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build


Labels

priority:blocker Production down; release blocker release-0.14.0

Projects

Status: ✅ Done

Development

Successfully merging this pull request may close these issues.

4 participants