
HIVE-26385: Iceberg integration: Implement merge into iceberg table #3430

Merged (9 commits) on Jul 20, 2022

Conversation

kasakrisz (Contributor)

What changes were proposed in this pull request?

  1. Extract the common parts that collect and append the sort and delete columns for split update and merge.
  2. Merge the IcebergWriter data and delete file collections.
  3. Append the number of writers in the job to the operationId to generate a unique file name for each writer.

Why are the changes needed?

  1. Implement merge into Iceberg tables by reusing the split update code paths.
  2. A merge with an insert and an update branch has at least two insert IcebergWriters. If the target table is not partitioned, no sorting or exchange is required, and both FileSink operators wrapping the IcebergWriters are located in the same Reducer and Job.

Does this PR introduce any user-facing change?

Yes. Merge statements should no longer throw exceptions when the target is an Iceberg table.

How was this patch tested?

mvn test -Dtest.output.overwrite -DskipSparkTests -Dtest=TestIcebergLlapLocalCliDriver -Dqfile=merge_iceberg_orc.q,merge_iceberg_partitioned_orc.q -pl itests/qtest-iceberg -Piceberg -Pitests

@@ -297,15 +309,16 @@ private boolean handleCardinalityViolation(StringBuilder rewrittenQueryStr, ASTN
//this is a tmp table and thus Session scoped and acid requires SQL statement to be serial in a
// given session, i.e. the name can be fixed across all invocations
String tableName = "merge_tmp_table";
List<String> sortKeys = columnAppender.getSortKeys();
Contributor
Please talk with @lcspinter about the sorted Iceberg tables and how the two PRs will work together.

Thanks,
Peter

Contributor Author

I discussed this with @lcspinter; his patch makes independent changes.
I also applied both his patch and this one to the same branch locally and wrote a test case for updating a sorted table. It works only in non-vectorized mode.
The reason it does not work in vectorized mode is not related to this patch: it depends on the iceberg.mr.in.memory.data.model setting, which is set to HIVE when vectorization is enabled:

job.setEnum(InputFormatConfig.IN_MEMORY_DATA_MODEL, InputFormatConfig.InMemoryDataModel.HIVE);

Delete delta files are not read with this setting.

Comment on lines 280 to 282
boolean nonNativeAcid = AcidUtils.isNonNativeAcidTable(mTable);
int columnOffset;
List<String> deleteValues;
if (nonNativeAcid) {
  List<FieldSchema> acidSelectColumns = mTable.getStorageHandler().acidSelectColumns(mTable, operation);
  deleteValues = new ArrayList<>(acidSelectColumns.size());
  for (FieldSchema fieldSchema : acidSelectColumns) {
    String identifier = HiveUtils.unparseIdentifier(fieldSchema.getName(), this.conf);
    rewrittenQueryStr.append(identifier).append(" AS ");
    String prefixedIdentifier = HiveUtils.unparseIdentifier(DELETE_PREFIX + fieldSchema.getName(), this.conf);
    rewrittenQueryStr.append(prefixedIdentifier);
    rewrittenQueryStr.append(",");
    deleteValues.add(String.format("%s.%s", SUB_QUERY_ALIAS, prefixedIdentifier));
  }

  columnOffset = acidSelectColumns.size();
} else {
  rewrittenQueryStr.append("ROW__ID,");
  deleteValues = new ArrayList<>(1 + mTable.getPartCols().size());
  deleteValues.add(SUB_QUERY_ALIAS + ".ROW__ID");
  for (FieldSchema fieldSchema : mTable.getPartCols()) {
    deleteValues.add(SUB_QUERY_ALIAS + "." + HiveUtils.unparseIdentifier(fieldSchema.getName(), conf));
  }
  columnOffset = 1;
}
ColumnAppender columnAppender = nonNativeAcid ? new NonNativeAcidColumnAppender(mTable, conf, SUB_QUERY_ALIAS) :
    new NativeAcidColumnAppender(mTable, conf, SUB_QUERY_ALIAS);
Contributor
Would it be worth creating a util method for this?

public static ColumnAppender AcidUtils.getAppender(mTable);

Contributor Author

I prefer keeping this in RewriteSemanticAnalyzer because these classes are used for generating query text when rewriting updates and merges.
Added a factory method:

protected ColumnAppender getColumnAppender(String subQueryAlias)
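The dispatch behind that factory method can be sketched as follows. This is a hypothetical, self-contained illustration, not the actual Hive code: the stub appender classes and the `describe` method stand in for the real `NonNativeAcidColumnAppender`/`NativeAcidColumnAppender` constructors shown in the diff, which take the table, the conf, and the subquery alias.

```java
// Hypothetical sketch of the ColumnAppender factory dispatch.
// Non-native ACID tables (e.g. Iceberg) get their own appender implementation.
public class ColumnAppenderSketch {
  interface ColumnAppender {
    String describe();
  }

  static class NativeAcidColumnAppender implements ColumnAppender {
    public String describe() { return "native"; }
  }

  static class NonNativeAcidColumnAppender implements ColumnAppender {
    public String describe() { return "non-native"; }
  }

  // Mirrors the ternary in the diff: the nonNativeAcid flag selects the
  // implementation used to generate the rewritten query text.
  static ColumnAppender getColumnAppender(boolean nonNativeAcid) {
    return nonNativeAcid ? new NonNativeAcidColumnAppender()
                         : new NativeAcidColumnAppender();
  }

  public static void main(String[] args) {
    System.out.println(getColumnAppender(true).describe());
    System.out.println(getColumnAppender(false).describe());
  }
}
```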

@@ -2519,23 +2519,23 @@ STAGE PLANS:
Map 7
Map Operator Tree:
TableScan
alias: t
alias: tmerge
Contributor
How sure are we about these changes?
My first guess is that this is an independent change/fix where the alias usage is corrected, but I might be wrong.

Contributor Author

In the past the statement

explain MERGE INTO tmerge as t using nonacid as s ON t.key = s.key
WHEN MATCHED AND s.key < 5 THEN DELETE
WHEN MATCHED AND s.key < 3 THEN UPDATE set a1 = '1'
WHEN NOT MATCHED THEN INSERT VALUES (s.key, s.a1, s.value)

was rewritten to

FROM
`tmerge` `t`
  RIGHT OUTER JOIN
  `default`.`nonacid` `s`
  ON `t`.`key` = `s`.`key`
<insert branches>

In this patch I introduced a subquery because some of the target table columns are needed twice when the target is an Iceberg table, and the subquery is generated the same way as for updates.

FROM
(SELECT ROW__ID, `key`, `a1`, `value` FROM `default`.`tmerge`) `t`
  RIGHT OUTER JOIN
  `default`.`nonacid` `s`
  ON `t`.`key` = `s`.`key`
<insert branches>

In both the new and the old plan, the TableScan you highlighted scans the same table: tmerge.
But in the old plan, tmerge and t refer to the same object. In the new plan the table has no alias, so it is referenced by its name; the alias t refers to the subquery.

Before:

String operationId = queryId + "-" + attemptID.getJobID();

After:

Map<String, List<HiveIcebergWriter>> writers = WriterRegistry.writers(attemptID);
int writerCount = writers == null ? 0 : writers.size();
String operationId = queryId + "-" + attemptID.getJobID() + "-" + writerCount;
Contributor
What about adding a random UUID to this, or using an AtomicInteger?

Contributor Author

Added AtomicInteger instead of getting the number of writers.
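The AtomicInteger approach can be sketched as below. This is a minimal illustration under stated assumptions, not the actual patch: the class name `OperationIdSketch` and the static counter are hypothetical, but the idea matches the review outcome, where each call draws a fresh, monotonically increasing suffix instead of reading the current size of the writer registry (which can race or repeat).

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: derive a unique operationId suffix per writer
// from an AtomicInteger rather than the registry's current writer count.
public class OperationIdSketch {
  private static final AtomicInteger WRITER_COUNTER = new AtomicInteger(0);

  static String operationId(String queryId, String jobId) {
    // getAndIncrement() is atomic, so concurrent writers in the same JVM
    // can never observe the same suffix.
    return queryId + "-" + jobId + "-" + WRITER_COUNTER.getAndIncrement();
  }

  public static void main(String[] args) {
    System.out.println(operationId("query1", "job_1")); // query1-job_1-0
    System.out.println(operationId("query1", "job_1")); // query1-job_1-1
  }
}
```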

@kasakrisz kasakrisz merged commit 8fdf802 into apache:master Jul 20, 2022
@kasakrisz kasakrisz deleted the HIVE-26385-master-iceberg-merge branch July 20, 2022 12:46