[HUDI-6404] Implement ParquetToolsExecutionStrategy for clustering#9006
suryaprasanna wants to merge 7 commits into apache:master
Conversation
Force-pushed 2a3a12b to ab3ebc5
private static final Logger LOG = LoggerFactory.getLogger(ParquetFileMetaToWriteStatusConvertor.class);
private final HoodieTable<T,I,K,O> hoodieTable;
private final HoodieWriteConfig writeConfig;
HoodieTable<T, I, K, O> hoodieTable
Fixed indentation.
stat.setPartitionPath(writeStatus.getPartitionPath());
stat.setPath(new Path(writeConfig.getBasePath()), parquetFilePath);
stat.setTotalWriteErrors(writeStatus.getTotalErrorRecords());
stat.setPrevCommit(String.valueOf(executionConfigs.get("prevCommit")));
It would be good if we can avoid hardcoding these keys.
Moved to static final variables.
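The fix discussed here follows a common pattern: hoist the map keys into named constants so a typo becomes a compile error rather than a silent `null` lookup. A minimal sketch of that pattern (the class name and all keys other than `prevCommit` are hypothetical, not the PR's actual code):

```java
import java.util.HashMap;
import java.util.Map;

public class ExecutionConfigKeys {
  // "prevCommit" appears in the diff; the second key is a made-up example.
  public static final String PREV_COMMIT = "prevCommit";
  public static final String TOTAL_WRITE_ERRORS = "totalWriteErrors";

  /** Look the value up via the constant instead of repeating the string literal. */
  public static String prevCommit(Map<String, Object> executionConfigs) {
    return String.valueOf(executionConfigs.get(PREV_COMMIT));
  }

  public static void main(String[] args) {
    Map<String, Object> configs = new HashMap<>();
    configs.put(PREV_COMMIT, "20230620101530");
    System.out.println(prevCommit(configs));
  }
}
```

With the constants in place, `stat.setPrevCommit(String.valueOf(executionConfigs.get(PREV_COMMIT)))` reads the same in every call site, which is what "moved to static final variables" refers to.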
 * Write handle that is used to work on top of files rather than on individual records.
 */
public class HoodieFileWriteHandler<T extends HoodieRecordPayload, I, K, O> extends HoodieWriteHandle<T, I, K, O> {
HoodieFileWriteHandler -> HoodieFileWriteHandle
I am also thinking we could name this HoodieFileRewriteHandle.
public HoodieFileWriteHandler(HoodieWriteConfig config, String instantTime, HoodieTable<T, I, K, O> hoodieTable,
                              String partitionPath, String fileId, TaskContextSupplier taskContextSupplier,
                              Path srcPath) {
Can we add some documentation for these parameters, especially srcPath?
I think using srcPath is confusing; changed it to oldFilePath to keep it consistent with the other handles.
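For illustration, a documented constructor along the lines the reviewer asked for could look roughly like this. This is a hedged sketch with simplified types (the real handle takes Hudi-specific parameters such as HoodieWriteConfig and TaskContextSupplier, omitted here); the class name is invented:

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class FileRewriteHandleSketch {
  private final String partitionPath;
  private final String fileId;
  private final Path oldFilePath;

  /**
   * @param partitionPath partition the rewritten file belongs to
   * @param fileId        file group id the new base file is written under
   * @param oldFilePath   path of the existing base file to be rewritten
   *                      (renamed from srcPath for consistency with other handles)
   */
  public FileRewriteHandleSketch(String partitionPath, String fileId, Path oldFilePath) {
    this.partitionPath = partitionPath;
    this.fileId = fileId;
    this.oldFilePath = oldFilePath;
  }

  public Path getOldFilePath() {
    return oldFilePath;
  }

  public static void main(String[] args) {
    FileRewriteHandleSketch handle =
        new FileRewriteHandleSketch("2023/06/20", "file-001", Paths.get("/tmp/old.parquet"));
    System.out.println(handle.getOldFilePath());
  }
}
```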
this.writeStatus = generateWriteStatus(path.toString(), partitionPath, executionConfigs);

// TODO: Create completed marker file here once the marker PR is landed.
// createCompleteMarkerFile throws hoodieException, if marker directory is not present.
Not sure what these TODOs really mean; the marker file should be created anyway.
Please do add the JIRA ID here.
This PR is dependent on the marker changes PR from Balajee. For now I added these comments; will revert as soon as those changes land.
Added it as part of the TODO statement.
 * that use parquet-tools commands.
 */
public abstract class ParquetToolsExecutionStrategy<T extends HoodieRecordPayload<T>>
    extends SingleSparkJobExecutionStrategy<T> {
Do we have any impl class not exclusively for testing?
We have column-prune and other encryption-related implementation classes that use parquet-tools. We will need to check with other teams before pushing them to OSS.
private final HoodieWriteConfig writeConfig;
private final FileSystem fs;

public ParquetFileMetaToWriteStatusConvertor(HoodieTable<T, I, K, O> hoodieTable, HoodieWriteConfig writeConfig) {
awesome. we might need something like this to support insert_overwrite, delete_partition w/ RLI.
stat.setPrevCommit(String.valueOf(executionConfigs.get("prevCommit")));

writeStatus.setStat(stat);
}
Oh, this is not populating any success records. Anyway, we can build something on top of this for insert_overwrite support.
We can improve this to include record keys later; for now we can keep it simple and just convert the parquet metadata to a write status.
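The conversion the thread discusses — building a write status from file-level metadata instead of per-record tracking — can be sketched in plain Java. This is a simplified stand-in: the real ParquetFileMetaToWriteStatusConvertor populates Hudi's HoodieWriteStat, and the classes and field set below are assumptions for illustration only:

```java
import java.util.HashMap;
import java.util.Map;

public class MetaToStatusSketch {
  /** Simplified stand-in for HoodieWriteStat. */
  static class WriteStat {
    String partitionPath;
    String path;
    long totalWriteErrors;
    String prevCommit;
  }

  /** Build a stat from file-level metadata instead of iterating over records. */
  static WriteStat convert(String partitionPath, String filePath,
                           Map<String, Object> executionConfigs) {
    WriteStat stat = new WriteStat();
    stat.partitionPath = partitionPath;
    stat.path = filePath;
    stat.totalWriteErrors = 0L; // file-level rewrite: no per-record errors tracked
    stat.prevCommit = String.valueOf(executionConfigs.get("prevCommit"));
    return stat;
  }

  public static void main(String[] args) {
    Map<String, Object> cfg = new HashMap<>();
    cfg.put("prevCommit", "20230620101530");
    WriteStat stat = convert("2023/06/20", "2023/06/20/file-001.parquet", cfg);
    System.out.println(stat.prevCommit);
  }
}
```

Note that, as the reviewer points out, nothing here records the individual successful record keys; that is what would need to be added for insert_overwrite or RLI support.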
/**
 * Write handle that is used to work on top of files rather than on individual records.
 */
public class HoodieFileWriteHandler<T extends HoodieRecordPayload, I, K, O> extends HoodieWriteHandle<T, I, K, O> {
Rename to "HoodieFileWriteHandle" — no "r" at the end. All of our handles are named this way; let's not add an "r" at the end.
// Create inProgress marker file
createMarkerFile(partitionPath, FSUtils.makeBaseFileName(this.instantTime, this.writeToken, this.fileId, hoodieTable.getBaseFileExtension()));
// TODO: Create inprogress marker here and remove above marker file creation, once the marker PR is landed.
Can you add the JIRA number here, please?
createMarkerFile(partitionPath, FSUtils.makeBaseFileName(this.instantTime, this.writeToken, this.fileId, hoodieTable.getBaseFileExtension()));
// TODO: Create inprogress marker here and remove above marker file creation, once the marker PR is landed.
// createInProgressMarkerFile(partitionPath, FSUtils.makeDataFileName(this.instantTime, this.writeToken, this.fileId, hoodieTable.getBaseFileExtension()));
LOG.info("New CreateHandle for partition :" + partitionPath + " with fileId " + fileId);
Fix the logging; this is not a CreateHandle.
 * This class gives skeleton implementation for set of clustering execution strategy
 * that use parquet-tools commands.
 */
public abstract class ParquetToolsExecutionStrategy<T extends HoodieRecordPayload<T>>
Should we name this class EfficientParquetReWriteExecutionStrategy? Embedding "parquet-tools" in the class name somehow does not sit well.
Since the class runs parquet-tools commands, I thought ParquetToolsExecutionStrategy might be the better name. By renaming it to EfficientParquetReWriteExecutionStrategy, we reduce the emphasis on parquet-tools. Let me know what you think.
Taking a closer look at the patch, I don't see anything specific to parquet-tools here. Can you help me understand?
We could also name this BaseFileRewriteStrategy (or BaseFileTransformStrategy), since it takes in an old file and returns a new file; any base file format should be supported.
On top of BaseFileRewriteStrategy, we can introduce ParquetFileRewriteStrategy if need be.
You are right; this patch only adds an interface but no specific impl for parquet-tools.
LOG.info("Starting clustering operation on input file ids.");
List<ClusteringOperation> clusteringOperations = clusteringOps.getOperations();
if (clusteringOperations.size() > 1) {
  throw new HoodieClusteringException("Expect only one clustering operation during rewrite: " + getClass().getName());
Why throw if we have more than one clustering operation?
parquet-tools operates at the file level, so the HoodieClusteringGroups created during clustering plan creation will contain one group per file group. That way SingleSparkJobExecutionStrategy can execute on all these file groups in parallel.
Later, when including other tools like merge, we can relax this condition.
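Read in isolation, the guard in the diff is a plain precondition check: the plan is expected to place exactly one file group in each clustering group, so more than one operation signals a malformed plan. A self-contained version of the same check, using a generic exception and plain strings in place of HoodieClusteringException and ClusteringOperation:

```java
import java.util.Arrays;
import java.util.List;

public class SingleOperationGuard {
  /** Returns the single operation, or throws if the plan put more than one in the group. */
  static String validate(List<String> clusteringOperations) {
    if (clusteringOperations.size() > 1) {
      throw new IllegalStateException(
          "Expect only one clustering operation during rewrite, got " + clusteringOperations.size());
    }
    return clusteringOperations.get(0);
  }

  public static void main(String[] args) {
    // One operation per group is the expected shape.
    System.out.println(validate(Arrays.asList("fileGroup-1")));
  }
}
```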
 * Write handle that is used to work on top of files rather than on individual records.
 */
public class HoodieFileWriteHandler<T extends HoodieRecordPayload, I, K, O> extends HoodieWriteHandle<T, I, K, O> {
I am also thinking we could name this HoodieFileRewriteHandle.
 * In this method parquet-tools command can be created and executed.
 * Assuming that the parquet-tools command operate per file basis this interface allows command to run once per file.
 */
protected abstract void executeTools(Path srcFilePath, Path destFilePath);
executeRewrite or executeConvert
Since the class is ParquetToolsExecutionStrategy, I thought executeTools might be better here, since it will actually run parquet-tools commands. Let me know what you think.
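Whatever the method ends up being called, the shape under discussion is a template method: the abstract base drives the per-file rewrite while each subclass supplies the tool invocation. A hedged sketch with simplified types (the class name is invented; a real implementation would shell out to a tool rather than log):

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public abstract class RewriteStrategySketch {
  /** Subclasses run their tool (e.g. a parquet-tools command) once per file. */
  protected abstract void executeTools(Path srcFilePath, Path destFilePath);

  /** Template method: drives the rewrite of a single file. */
  public final Path rewrite(Path srcFilePath, Path destFilePath) {
    executeTools(srcFilePath, destFilePath);
    return destFilePath;
  }

  public static void main(String[] args) {
    RewriteStrategySketch strategy = new RewriteStrategySketch() {
      @Override
      protected void executeTools(Path src, Path dest) {
        // Placeholder for the actual tool invocation.
        System.out.println("rewriting " + src + " -> " + dest);
      }
    };
    strategy.rewrite(Paths.get("old.parquet"), Paths.get("new.parquet"));
  }
}
```

This also shows why the naming debate matters: nothing in the base class is parquet-specific, which is the reviewer's argument for a name like BaseFileRewriteStrategy with executeRewrite.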
final Iterator<HoodieRecord<T>> records, final int numOutputGroups, final String instantTime,
final Map<String, String> strategyParams, final Schema schema, final List<HoodieFileGroupId> fileGroupIdList,
final boolean preserveHoodieMetadata, final TaskContextSupplier taskContextSupplier) {
return null;
Then should we throw here?
Yeah, corrected it.
Force-pushed ab3ebc5 to 775343a
Thanks @suryaprasanna. Can you clarify the relationship between column pruning and clustering? In the regular notion of Hudi clustering, it only merges small file groups into larger ones, with optional sorting on columns; no pruning happens there. How does the user expect to improve efficiency with this patch overall?
Clustering was initially added to do sorting and stitching, but its framework is flexible enough to accommodate a wide variety of rewriter use cases. Other rewriter use cases can also be done using the clustering framework.
So you mean, a user action like
danny0405
left a comment
+1, I'm fine with the change.
Summary: Create new ParquetToolsExecutionStrategy within Hudi clustering to support parquet-tools commands like column prune. JIRA Issues: HUDI-1011. Differential Revision: https://code.uberinternal.com/D7271275
Force-pushed 775343a to b385ea4
partitionPath, fileId, taskContextSupplier, oldFilePath);

// Executes the parquet-tools command.
executeTools(oldFilePath, writeHandler.getPath());
Can we move executeTools inside HoodieFileWriteHandle?
Change Logs
This change implements a new clustering strategy to execute parquet-tools commands for rewriting use cases.
If there is a use case of pruning some columns to save storage, the current clustering approach iterates over every record and removes the unused column, which is very time-consuming. By using parquet-tools directly, we can achieve this by running a command within the clustering strategy.
Here, the logic goes through the process of creating marker files, so that on any failure it can rely on rollback's MarkerBasedRollbackStrategy to remove the in-flight files and parquet files.
Impact
No impact; this is the addition of a new feature.
Risk level (write none, low medium or high below)
None.
Documentation Update
Describe any necessary documentation update if there is any new feature, config, or user-facing change
ticket number here and follow the instruction to make changes to the website.
Contributor's checklist