HIVE-28047: Iceberg: Major QB Compaction with a single commit #5052

Open · wants to merge 5 commits into master from hive_27980_iceberg_compaction_one_commit
Conversation

@difin (Contributor) commented Jan 30, 2024

What changes were proposed in this pull request?

Improves Hive Iceberg QB Major Compaction to perform the compaction in a single commit instead of the two commits used until now.

Why are the changes needed?

The existing implementation of compaction creates two commits, which produce two snapshots: a first snapshot with all the files deleted and a second snapshot with the compacted files. If a user queries the table by the snapshot id of the first snapshot, the result is invalid, because the table contains no data in that snapshot. This PR is proposed to avoid that problem.
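For illustration, a minimal sketch of the single-commit idea using Iceberg's RewriteFiles API; the variable names are assumptions and the patch's actual commit path may differ:

    // Sketch only: remove the pre-compaction files and add the compacted
    // files in one atomic rewrite, so only one snapshot is created and no
    // reader can ever observe an empty table.
    RewriteFiles rewrite = table.newRewrite();
    rewrite.rewriteFiles(
        Sets.newHashSet(existingDataFiles),    // files replaced by compaction
        Sets.newHashSet(compactedDataFiles));  // files produced by compaction
    rewrite.commit();                          // one commit -> one snapshot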

Does this PR introduce any user-facing change?

No

Is the change a dependency upgrade?

No

How was this patch tested?

Hive contains four q-tests covering Hive Iceberg QB Major Compaction; their expected outputs were updated as part of this PR.

@difin difin force-pushed the hive_27980_iceberg_compaction_one_commit branch from 7e4a2de to db45274 on January 30, 2024 at 23:54

List<DataFile> existingDataFiles = Lists.newArrayList();
List<DeleteFile> existingDeleteFiles = Lists.newArrayList();
IcebergTableUtil.getFiles(table, existingDataFiles, existingDeleteFiles);
Contributor:

Can we reuse FilesForCommit results to get the dataFiles and deleteFiles?

Contributor Author (difin):

I don't think so, because FilesForCommit contains only the new, compacted files; it doesn't contain the existing data and delete files.

Contributor:

Since Major QB Compaction is essentially an IOW (insert overwrite) operation, why don't we follow the regular IOW flow? Why do we need the extra operation to find the existing data and delete files?
Just reuse the ReplacePartitions API:

ReplacePartitions overwrite = transaction.newReplacePartitions();
results.dataFiles().forEach(overwrite::addFile);
if (StringUtils.isNotEmpty(branchName)) {
  overwrite.toBranch(HiveUtils.getTableSnapshotRef(branchName));
}
overwrite.commit();

Contributor Author (difin):

Because IOW isn't supported on tables that have undergone schema evolution (per https://issues.apache.org/jira/browse/HIVE-26133), but we want to support compaction even for tables that have had schema evolution.

Contributor:

Got it, thanks.

It seems we have a tradeoff with this change: two commits vs. a single commit that loops over and stores all existing data and delete files.

I know two commits generate two snapshots, and users reading the table may see inconsistent data. But with this change, if the original Iceberg table has very many small data and delete files, and we have to loop over them and store them in a List, could that put memory pressure on HS2? I am not sure which way is the best one. ;(

Can other folks give some thoughts? @deniskuzZ @ayushtkn @SourabhBadhya

Member:

@difin, do we know if the DeleteFiles API does the same full table listing when the Expressions.alwaysTrue() filter is provided? If not, maybe we could extend the OverwriteFiles API to support delete filters?
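For reference, the DeleteFiles call in question looks like this (standard Iceberg API usage, not code from this PR):

    // Deletes every file matching the row filter in a single operation;
    // the caller does not enumerate the files itself.
    table.newDelete()
        .deleteFromRowFilter(Expressions.alwaysTrue())
        .commit();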

…ve/IcebergTableUtil.java
Code cleaning in getting Iceberg tables' data and delete files
Co-authored-by: Butao Zhang <butaozhang1@163.com>

difin and others added 2 commits January 31, 2024 12:33

…ve/HiveIcebergOutputCommitter.java
Code cleaning, removed unneeded call to stream()
Co-authored-by: Butao Zhang <butaozhang1@163.com>
@deniskuzZ (Member) left a comment:

In my opinion this approach looks more expensive; maybe we should reach out to the Iceberg community with a proposal to extend the current API with atomic IOW semantics?

…d transaction API usage from IOW commit method.
@difin (Contributor Author) commented Feb 9, 2024

> In my opinion this approach looks more expensive; maybe we should reach out to the Iceberg community with a proposal to extend the current API with atomic IOW semantics?

I agree it is a slightly more expensive approach, but it doesn't have the correctness issue of the existing one, and this approach is already used in other engines such as Amoro and Trino; Impala is also in the process of switching to it.

sonarcloud bot commented Feb 9, 2024

Quality Gate passed

Issues: 0 new issues

Measures: 0 security hotspots, no data about coverage, no data about duplication

See analysis details on SonarCloud

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Feel free to reach out on the dev@hive.apache.org list if the patch is in need of reviews.

@github-actions github-actions bot added the stale label Apr 10, 2024
@deniskuzZ deniskuzZ removed the stale label Apr 10, 2024
@@ -328,4 +336,26 @@ public static PartitionData toPartitionData(StructLike key, Types.StructType key
}
return data;
}

public static Pair<List<DataFile>, List<DeleteFile>> getDataAndDeleteFiles(Table table) {
Member:

you should be doing:

    long startingSnapshotId = table.currentSnapshot().snapshotId();

    StructLikeMap<List<List<FileScanTask>>> fileGroupsByPartition =
        planFileGroups(startingSnapshotId);

    CloseableIterable<FileScanTask> fileScanTasks =
        table
            .newScan()
            .useSnapshot(startingSnapshotId)
            .filter(filter)
            .ignoreResiduals()
            .planFiles();

see RewriteDataFilesSparkAction
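For context, a self-contained sketch of how a helper like getDataAndDeleteFiles could collect both file kinds from a scan pinned to the current snapshot, along the lines of the suggestion above (the method body is elided from this diff, so this is an assumption, not the PR's actual code):

    long startingSnapshotId = table.currentSnapshot().snapshotId();
    List<DataFile> dataFiles = Lists.newArrayList();
    List<DeleteFile> deleteFiles = Lists.newArrayList();
    try (CloseableIterable<FileScanTask> tasks =
        table.newScan().useSnapshot(startingSnapshotId).ignoreResiduals().planFiles()) {
      for (FileScanTask task : tasks) {
        dataFiles.add(task.file());           // the task's data file
        deleteFiles.addAll(task.deletes());   // may repeat across tasks; dedupe if needed
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return Pair.of(dataFiles, deleteFiles);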

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Feel free to reach out on the dev@hive.apache.org list if the patch is in need of reviews.

@github-actions github-actions bot added the stale label Jun 12, 2024