[HUDI-6464] Spark SQL Merge Into for pkless tables #9083
nsivabalan merged 20 commits into apache:master
Conversation
Yet to review tests, but you can start addressing the source code comments in the meantime.
  @Override
  public boolean isGlobal() {
    return false;
If the global version is to be implemented, does the user need to simply set a config and we return true here for the global index? Since the record location is already known from the meta column, how does the global/non-global part come into play here?

I believe it comes into play when changing the partition: https://issues.apache.org/jira/browse/HUDI-6471
  /**
   * Calls fail analysis on
   *
   */
  def failAnalysisForMIT(a: Attribute, cols: String): Unit = {}

  def createMITJoin(left: LogicalPlan, right: LogicalPlan, joinType: JoinType, condition: Option[Expression], hint: String): LogicalPlan
nit: Put these into HoodieCatalystPlansUtils?
    val hoodieKey = new HoodieKey(recordKey, partitionPath)
-   val instantTime: Option[String] = if (isPrepped) {
+   val instantTime: Option[String] = if (isPrepped || mergeIntoWrites) {
Is the commit time pre-populated based on the current commit (not the commit time from the meta columns in the existing files) for MIT records?
I think it might be changed at write time?
  /**
   * NOTE TO USERS: YOU SHOULD NOT SET THIS AS YOUR KEYGENERATOR
Should we add validation in a follow-up PR on the key generator so that SqlKeyGenerator and MergeIntoKeyGenerator cannot be set by the user?
    // NOTE: This rule adjusts [[LogicalRelation]]s resolving into Hudi tables such that
-   // meta-fields are not affecting the resolution of the target columns to be updated by Spark.
+   // meta-fields are not affecting the resolution of the target columns to be updated by Spark (Except in the
+   // case of MergeInto. We leave the meta columns on the target table, and use other means to ensure resolution)
Minor: my understanding is that for a pkless table, we need the record key, partition path, and filename from the meta columns. The other meta columns can be pruned. I assume that if we only keep those three, Spark does not read the others, which is a small improvement we can make.

It makes aligning the schema more difficult, so it could be done in a followup.
  import org.apache.hudi.{HoodieSparkUtils, ScalaAssertionSupport}

  class TestMergeIntoTable3 extends HoodieSparkSqlTestBase with ScalaAssertionSupport {
It's OK to add a new test class. I think Siva's point is that, instead of naming it with numbers, we should name it with readability in mind, something like TestMergeIntoWithNonRecordKeyField.
  //
  ////
docs to add? Could you check all places where the comment is empty?
  case _ =>
    val newMatchedActions = m.matchedActions.map {
      case DeleteAction(deleteCondition) =>
        val resolvedDeleteCondition = deleteCondition.map(
          resolveExpressionByPlanChildren(_, m))
        DeleteAction(resolvedDeleteCondition)
      case UpdateAction(updateCondition, assignments) =>
        val resolvedUpdateCondition = updateCondition.map(
          resolveExpressionByPlanChildren(_, m))
        UpdateAction(
          resolvedUpdateCondition,
          // The update value can access columns from both target and source tables.
          resolveAssignments(assignments, m, resolveValuesWithSourceOnly = false))
      case UpdateStarAction(updateCondition) =>
        //// Hudi change: filter out meta fields
        //
        val assignments = targetTable.output.filter(a => !isMetaField(a.name)).map { attr =>
          Assignment(attr, UnresolvedAttribute(Seq(attr.name)))
        }
        //
        ////
        UpdateAction(
          updateCondition.map(resolveExpressionByPlanChildren(_, m)),
          // For UPDATE *, the value must from source table.
          resolveAssignments(assignments, m, resolveValuesWithSourceOnly = true))
      case o => o
    }
    val newNotMatchedActions = m.notMatchedActions.map {
      case InsertAction(insertCondition, assignments) =>
        // The insert action is used when not matched, so its condition and value can only
        // access columns from the source table.
        val resolvedInsertCondition = insertCondition.map(
          resolveExpressionByPlanChildren(_, Project(Nil, m.sourceTable)))
        InsertAction(
          resolvedInsertCondition,
          resolveAssignments(assignments, m, resolveValuesWithSourceOnly = true))
      case InsertStarAction(insertCondition) =>
        // The insert action is used when not matched, so its condition and value can only
        // access columns from the source table.
        val resolvedInsertCondition = insertCondition.map(
          resolveExpressionByPlanChildren(_, Project(Nil, m.sourceTable)))
        //// Hudi change: filter out meta fields
        //
        val assignments = targetTable.output.filter(a => !isMetaField(a.name)).map { attr =>
          Assignment(attr, UnresolvedAttribute(Seq(attr.name)))
        }
        //
        ////
        InsertAction(
          resolvedInsertCondition,
          resolveAssignments(assignments, m, resolveValuesWithSourceOnly = true))
      case o => o
Most of the code here is similar to Spark's ResolveReferences. Could you make a note of this and add docs before case mO@MatchMergeIntoTable(targetTableO, sourceTableO, _) to summarize the custom changes?

Also, is the code copied from Spark 3.2? Any difference among Spark 3.2, 3.3, and 3.4?

I marked the custom changes by surrounding them in
////
//
changes
//
////

Spark 3.2 and 3.3 are the same. In a followup we may want to use the 3.4 code.
   *
   * PLEASE REFRAIN MAKING ANY CHANGES TO THIS CODE UNLESS ABSOLUTELY NECESSARY
   */
  object HoodieSpark30Analysis {
For Spark 3.0 and 3.1, have you checked if the code here is different from Spark's ResolveReferences? Given we introduce the custom rule here, we should still match the implementation of ResolveReferences in the corresponding Spark version, except for the custom logic you added.
  public class HoodieInternalProxyIndex extends HoodieIndex<Object, Object> {

  /**
Can we move the docs to L29?
Let's wait for all GH actions to succeed before we can land.
yihua left a comment:
LGTM on the core functionality. Let's address comments in a follow-up PR.
Change Logs
Tables with a primary key must now join on all primary key columns. Additionally, they can also join on the partition path columns, which is recommended if the table is not using a global index.
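As an illustrative sketch (the table and column names below are hypothetical, not taken from this PR), a merge against a primary-keyed table would look like:

```sql
-- Hypothetical target: a Hudi table with primaryKey 'id', partitioned by 'dt'.
-- The ON clause must cover every primary key column; also joining on the
-- partition path column is recommended when the table has no global index.
MERGE INTO target t
USING source s
ON t.id = s.id AND t.dt = s.dt
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
```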
Tables without a primary key can join on any columns in the table. If multiple source table columns match a single target table column, the precombine field will be used if it is set; otherwise, the behavior is nondeterministic. To improve performance, the Hudi meta columns are retained after the join so that the index lookup and key generation can be skipped.
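For a pkless table, a sketch of the now-supported arbitrary join condition (again with hypothetical names):

```sql
-- Hypothetical pkless Hudi table: any columns may appear in the ON clause,
-- since the record location is taken from the retained meta columns
-- rather than from an index lookup on a record key.
MERGE INTO target t
USING source s
ON t.name = s.name AND t.ts = s.ts
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
```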
NOTE: case-insensitive column name recognition no longer works.
Impact
Allows use of the SQL MERGE INTO feature for pkless tables.
Risk level (write none, low medium or high below)
medium
Additional tests have been written
Documentation Update
Release notes
Contributor's checklist