
[HUDI-6404] Implement ParquetToolsExecutionStrategy for clustering#9006

Open
suryaprasanna wants to merge 7 commits into apache:master from suryaprasanna:parquet-tools

Conversation

@suryaprasanna
Contributor

Change Logs

This change implements a new clustering strategy that executes parquet-tools commands for file-rewriting use cases.
When pruning unused columns to reduce storage, the current clustering approach iterates over every record and strips the column out, which is very time-consuming. By invoking parquet-tools directly, the same rewrite can be achieved by running a single command within the clustering strategy.
Here, the logic goes through the usual process of creating marker files, so that on any failure the rollback's MarkerBasedRollbackStrategy can remove the inflight and parquet files.
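To make the idea concrete, here is a minimal, self-contained sketch of how a strategy could assemble a parquet-tools style prune invocation and hand it to a process runner. The binary name, subcommand, and flag names here are illustrative assumptions, not the exact tooling used in this patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PruneCommandSketch {
  // Builds an argument list for a hypothetical parquet-tools "prune" invocation.
  // The binary name ("parquet-tools") and flags ("--drop-column") are assumptions.
  static List<String> buildPruneCommand(String srcPath, String destPath, List<String> dropColumns) {
    List<String> cmd = new ArrayList<>(Arrays.asList("parquet-tools", "prune"));
    for (String col : dropColumns) {
      cmd.add("--drop-column");
      cmd.add(col);
    }
    cmd.add(srcPath);
    cmd.add(destPath);
    return cmd;
  }

  public static void main(String[] args) {
    List<String> cmd = buildPruneCommand("/data/in.parquet", "/data/out.parquet",
        Arrays.asList("unused_col"));
    // The strategy could then run it with e.g.:
    //   new ProcessBuilder(cmd).inheritIO().start().waitFor();
    System.out.println(String.join(" ", cmd));
  }
}
```

The point of the sketch is that the rewrite happens file-to-file via one external command, instead of deserializing and re-serializing every record.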

Impact

No impact; this is the addition of a new feature.

Risk level (write none, low, medium, or high below)

None.

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default value of a config is changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@suryaprasanna changed the title from "HUDI-6404 Implement ParquetToolsExecutionStrategy for clustering" to "[HUDI-6404] Implement ParquetToolsExecutionStrategy for clustering" on Jun 17, 2023

private static final Logger LOG = LoggerFactory.getLogger(ParquetFileMetaToWriteStatusConvertor.class);
private final HoodieTable<T,I,K,O> hoodieTable;
private final HoodieWriteConfig writeConfig;
Contributor

HoodieTable<T, I, K, O> hoodieTable

Contributor Author

Fixed indentation.

stat.setPartitionPath(writeStatus.getPartitionPath());
stat.setPath(new Path(writeConfig.getBasePath()), parquetFilePath);
stat.setTotalWriteErrors(writeStatus.getTotalErrorRecords());
stat.setPrevCommit(String.valueOf(executionConfigs.get("prevCommit")));
Contributor

It would be good if we could avoid hardcoding these keys.

Contributor Author

Moved to static final variables.

* Write handle that is used to work on top of files rather than on individual records.
*/
public class HoodieFileWriteHandler<T extends HoodieRecordPayload, I, K, O> extends HoodieWriteHandle<T, I, K, O> {

Contributor

HoodieFileWriteHandler -> HoodieFileWriteHandle

Contributor

I am also thinking we could name this HoodieFileRewriteHandle.


public HoodieFileWriteHandler(HoodieWriteConfig config, String instantTime, HoodieTable<T, I, K, O> hoodieTable,
String partitionPath, String fileId, TaskContextSupplier taskContextSupplier,
Path srcPath) {
Contributor

Can we add some documentation for these parameters, especially srcPath?

Contributor Author

I think using srcPath is confusing; changed it to oldFilePath to keep it consistent with the other handles.

this.writeStatus = generateWriteStatus(path.toString(), partitionPath, executionConfigs);

// TODO: Create completed marker file here once the marker PR is landed.
// createCompleteMarkerFile throws hoodieException, if marker directory is not present.
Contributor

Not sure what these TODOs really mean; the marker file should be created anyway.

Contributor

Please add the JIRA ID here.

Contributor Author

This PR depends on the marker changes PR from Balajee. For now I have added these comments and will revert them as soon as those changes land.

Contributor Author

Added it as part of the TODO statement.

* that use parquet-tools commands.
*/
public abstract class ParquetToolsExecutionStrategy<T extends HoodieRecordPayload<T>>
extends SingleSparkJobExecutionStrategy<T> {
Contributor

Do we have any impl class that is not exclusively for testing?

Contributor Author

We have column-pruning and other encryption-related implementation classes that use parquet-tools. We will need to check with other teams before pushing them to OSS.

@danny0405 self-assigned this Jun 19, 2023
@danny0405 added the area:table-service (Table services) and engine:spark (Spark integration) labels Jun 19, 2023
private final HoodieWriteConfig writeConfig;
private final FileSystem fs;

public ParquetFileMetaToWriteStatusConvertor(HoodieTable<T, I, K, O> hoodieTable, HoodieWriteConfig writeConfig) {
Contributor

Awesome. We might need something like this to support insert_overwrite and delete_partition with RLI.

stat.setPrevCommit(String.valueOf(executionConfigs.get("prevCommit")));

writeStatus.setStat(stat);
}
Contributor

This is not populating any succeeded records. Anyway, we can build something on top of this for insert_overwrite support.

Contributor Author

We can improve this to include record keys later; for now we can keep it simple and just convert the parquet metadata to a write status.
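As an illustration of that conversion, here is a minimal sketch with simplified stand-in types; the field names mirror the stats set in this PR, but this is not the actual Hudi HoodieWriteStat/WriteStatus API.

```java
public class WriteStatusSketch {
  // Simplified stand-in for a Hudi write stat; illustrative only.
  static class Stat {
    String partitionPath;
    String path;
    long totalWriteErrors;
    String prevCommit;
  }

  // Populates a stat from file-level metadata instead of per-record bookkeeping.
  static Stat convert(String basePath, String partitionPath, String fileName,
                      long errorRecords, String prevCommit) {
    Stat stat = new Stat();
    stat.partitionPath = partitionPath;
    stat.path = basePath + "/" + partitionPath + "/" + fileName;
    stat.totalWriteErrors = errorRecords;
    stat.prevCommit = prevCommit;
    return stat;
  }

  public static void main(String[] args) {
    Stat s = convert("/tmp/base", "2023/06/17", "fg-1.parquet", 0L, "001");
    System.out.println(s.path);
  }
}
```

Because the conversion works purely off file metadata, it cannot report per-record success keys; that is exactly the limitation discussed above.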

/**
* Write handle that is used to work on top of files rather than on individual records.
*/
public class HoodieFileWriteHandler<T extends HoodieRecordPayload, I, K, O> extends HoodieWriteHandle<T, I, K, O> {
Contributor

Rename to "HoodieFileWriteHandle" (no "r" at the end). All of your other handles are named this way; let's not add an "r" at the end.

Contributor Author

Done.


// Create inProgress marker file
createMarkerFile(partitionPath, FSUtils.makeBaseFileName(this.instantTime, this.writeToken, this.fileId, hoodieTable.getBaseFileExtension()));
// TODO: Create inprogress marker here and remove above marker file creation, once the marker PR is landed.
Contributor

Can you add the JIRA number here, please?

Contributor Author

Added it.
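For context, the marker a handle creates is keyed by the base file name. A simplified sketch of how a Hudi-style base file name is composed is below; the exact format produced by FSUtils.makeBaseFileName is treated as an assumption here, for illustration only.

```java
public class BaseFileNameSketch {
  // Composes "<fileId>_<writeToken>_<instantTime><extension>", mirroring what
  // FSUtils.makeBaseFileName is assumed to produce for a base file.
  static String makeBaseFileName(String instantTime, String writeToken,
                                 String fileId, String extension) {
    return String.format("%s_%s_%s%s", fileId, writeToken, instantTime, extension);
  }

  public static void main(String[] args) {
    // A marker with this name lets MarkerBasedRollbackStrategy locate and
    // delete the matching data file on rollback.
    System.out.println(makeBaseFileName("20230617010203", "0-1-2", "fg-1", ".parquet"));
  }
}
```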

createMarkerFile(partitionPath, FSUtils.makeBaseFileName(this.instantTime, this.writeToken, this.fileId, hoodieTable.getBaseFileExtension()));
// TODO: Create inprogress marker here and remove above marker file creation, once the marker PR is landed.
// createInProgressMarkerFile(partitionPath,FSUtils.makeDataFileName(this.instantTime, this.writeToken, this.fileId, hoodieTable.getBaseFileExtension()));
LOG.info("New CreateHandle for partition :" + partitionPath + " with fileId " + fileId);
Contributor

Fix the logging; this is not a CreateHandle.

Contributor Author

Done.

* This class gives a skeleton implementation for the set of clustering execution strategies
* that use parquet-tools commands.
*/
public abstract class ParquetToolsExecutionStrategy<T extends HoodieRecordPayload<T>>
Contributor

Should we name this class EfficientParquetReWriteExecutionStrategy?

Embedding parquet-tools in the class name somehow does not sit well.

Contributor Author

Since the class runs parquet-tools commands, I thought ParquetToolsExecutionStrategy might be the better name. Renaming it to EfficientParquetReWriteExecutionStrategy would reduce the emphasis on parquet-tools. Let me know what you think.

Contributor

Taking a closer look at the patch, I don't see anything specific to parquet-tools here. Can you help me understand?
We could also name this BaseFileRewriteStrategy (or BaseFileTransformStrategy), since it takes in an old file and returns a new file; any base file format should be supported.
On top of BaseFileRewriteStrategy, we can introduce ParquetFileRewriteStrategy if need be.

Contributor

You are right; this patch only adds an interface, with no specific impl for parquet-tools.

LOG.info("Starting clustering operation on input file ids.");
List<ClusteringOperation> clusteringOperations = clusteringOps.getOperations();
if (clusteringOperations.size() > 1) {
throw new HoodieClusteringException("Expect only one clustering operation during rewrite: " + getClass().getName());
Contributor

Why throw if we have more than one clustering operation?

Contributor Author

Parquet-tools operates at the file level, so the HoodieClusteringGroups created during clustering plan creation contain one group per file group. That way the SingleSparkJobExecutionStrategy class can execute on all of these file groups in parallel.
Later, when including other tools like merge, we can relax this condition.
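As a minimal illustration of this one-file-per-group contract and the resulting parallelism (using simplified stand-in types, not the actual HoodieClusteringGroup/ClusteringOperation classes):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ClusteringGroupSketch {
  // Each "group" is just a list of file names here; the rewrite contract
  // requires exactly one file per group, mirroring the guard in the PR.
  static String rewriteGroup(List<String> filesInGroup) {
    if (filesInGroup.size() > 1) {
      throw new IllegalStateException("Expect only one clustering operation during rewrite");
    }
    return filesInGroup.get(0) + ".rewritten";
  }

  public static void main(String[] args) {
    List<List<String>> groups = Arrays.asList(
        Arrays.asList("fg-1.parquet"),
        Arrays.asList("fg-2.parquet"));
    // Because each group holds one file, all groups can be rewritten in parallel.
    List<String> out = groups.parallelStream()
        .map(ClusteringGroupSketch::rewriteGroup)
        .collect(Collectors.toList());
    System.out.println(out);
  }
}
```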

* In this method the parquet-tools command can be created and executed.
* Assuming that parquet-tools commands operate on a per-file basis, this interface allows the command to run once per file.
*/
protected abstract void executeTools(Path srcFilePath, Path destFilePath);
Contributor

executeRewrite or executeConvert

Contributor Author

Since the class is ParquetToolsExecutionStrategy, I thought executeTools might be better, since it actually runs parquet-tools commands here. Let me know what you think.

final Iterator<HoodieRecord<T>> records, final int numOutputGroups, final String instantTime,
final Map<String, String> strategyParams, final Schema schema, final List<HoodieFileGroupId> fileGroupIdList,
final boolean preserveHoodieMetadata, final TaskContextSupplier taskContextSupplier) {
return null;
Contributor

Should we throw here, then?

Contributor Author

Yeah, corrected it.

@nsivabalan added the priority:critical (Production degraded; pipelines stalled) label Jun 21, 2023
@danny0405
Contributor

If there is a use case of pruning some columns to save storage memory, current approach of clustering will iterate over every record and remove the unused column, this is so much time consuming.

Thanks @suryaprasanna, can you clarify the relationship between column pruning and clustering? In the regular notion of Hudi clustering, it only merges small file groups into larger ones, with optional sorting on columns; no pruning happens there. How does the user expect to improve efficiency with this patch overall?

@suryaprasanna
Contributor Author

suryaprasanna commented Jun 29, 2023

If there is a use case of pruning some columns to save storage memory, current approach of clustering will iterate over every record and remove the unused column, this is so much time consuming.

Thanks @suryaprasanna, can you clarify the relationship between column pruning and clustering? In the regular notion of Hudi clustering, it only merges small file groups into larger ones, with optional sorting on columns; no pruning happens there. How does the user expect to improve efficiency with this patch overall?

Clustering was initially added to do sorting and stitching, but its framework is flexible enough to accommodate a wide variety of rewriter use cases. The following are other rewriter use cases that can be done using the clustering framework:

  1. Encryption. Async encryption of data files can be done on demand by restricting the clustering group to 1, which then becomes an update of the file.
  2. Column pruning. This change can be used to run the parquet-tools prune command on unused columns to reduce the storage footprint.

@danny0405
Contributor

2. Column pruning. This change can be used to run the parquet-tools prune command on unused columns to reduce the storage footprint.

So you mean a user action like alter table drop column a, b, c may utilize this new strategy. Makes sense to me.

Contributor

@danny0405 left a comment

+1, I'm fine with the change.

@danny0405
Contributor

danny0405 commented Jul 3, 2023

Summary: Create a new ParquetToolsExecutionStrategy within Hudi clustering to support parquet-tools commands like column prune.

Reviewers: O955 Project Hoodie Project Reviewer: Add blocking reviewers!, PHID-PROJ-pxfpotkfgkanblb3detq!, #ldap_hudi

JIRA Issues: HUDI-1011

Differential Revision: https://code.uberinternal.com/D7271275
@hudi-bot
Collaborator

hudi-bot commented Jul 5, 2023

CI report:

@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

partitionPath, fileId, taskContextSupplier, oldFilePath);

// Executes the parquet-tools command.
executeTools(oldFilePath, writeHandler.getPath());
Contributor

Can we move executeTools within HoodieFileWriteHandle?


Labels

area:table-service (Table services), engine:spark (Spark integration), priority:critical (Production degraded; pipelines stalled), size:L (PR with lines of changes in (300, 1000])

Projects

Status: 🏗 Under discussion

Development

Successfully merging this pull request may close these issues.

5 participants