
[SUPPORT] Incremental cleaning broken in 0.11.0 #5835

Closed
parisni opened this issue Jun 10, 2022 · 1 comment

parisni (Contributor) commented Jun 10, 2022

hudi 0.11.0 / spark 3.2.1
I am experiencing a major problem with cleaning: its duration increases linearly with the number of partitions.

See the logs below, from a batch loop of inserts into a Hudi table with a growing number of partitions:

Incremental Cleaning mode is enabled
Total Partitions to clean : 6960, with policy KEEP_LATEST_COMMITS
Total Partitions to clean : 7080, with policy KEEP_LATEST_COMMITS
Total Partitions to clean : 7200, with policy KEEP_LATEST_COMMITS
Total Partitions to clean : 7320, with policy KEEP_LATEST_COMMITS
Total Partitions to clean : 7440, with policy KEEP_LATEST_COMMITS
Total Partitions to clean : 7560, with policy KEEP_LATEST_COMMITS
Total Partitions to clean : 7680, with policy KEEP_LATEST_COMMITS
Total Partitions to clean : 7800, with policy KEEP_LATEST_COMMITS

I debugged this: the cleaner always falls back to brute-force partition cleaning on the file system, because the lastClean commit is always empty:

private List<String> getPartitionPathsForFullCleaning() {
  // Go to brute force mode of scanning all partitions
  try {
    // Because the partition of BaseTableMetadata has been deleted,
    // all partition information can only be obtained from FileSystemBackedTableMetadata.
    FileSystemBackedTableMetadata fsBackedTableMetadata = new FileSystemBackedTableMetadata(context,
        context.getHadoopConf(), config.getBasePath(), config.shouldAssumeDatePartitioning());
    return fsBackedTableMetadata.getAllPartitionPaths();
    // ... (exception handling and closing braces elided in this excerpt)

This makes the cleaner unusable on tables with a large number of partitions.
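For anyone hitting this, a quick way to confirm the fallback is to check whether the timeline under <basePath>/.hoodie contains any completed clean instant (a file ending in ".clean"). Below is a minimal diagnostic sketch, not part of the original report; the base path is hypothetical and it assumes plain Hadoop FileSystem access to the table:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object ListCleanInstants {
  def main(args: Array[String]): Unit = {
    val basePath = "/data/my_table"                 // hypothetical table location
    val timeline = new Path(s"$basePath/.hoodie")
    val fs = FileSystem.get(timeline.toUri, new Configuration())

    // Completed clean actions are stored as "<instant>.clean" files; if none exist,
    // getPartitionPathsForFullCleaning() ends up scanning every partition.
    val cleans = fs.listStatus(timeline)
      .map(_.getPath.getName)
      .filter(_.endsWith(".clean"))

    if (cleans.isEmpty) println("No completed clean instant found in the timeline")
    else cleans.sorted.foreach(println)
  }
}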

parisni (Contributor, Author) commented Jun 10, 2022

Apparently this behavior is expected when using the bulk_insert operation, which never triggers cleaning, so no clean commit ever lands in the timeline.

When using bulk_insert, automatic cleaning should therefore be disabled (see the sketch below).
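A minimal write sketch along those lines, assuming hypothetical table, path, and column names: it issues a bulk_insert with hoodie.clean.automatic set to false so the cleaner (and its full partition scan) is not triggered inline on these batches; cleaning would then run separately, e.g. from a regular upsert writer or a standalone cleaner job.

import org.apache.spark.sql.{SaveMode, SparkSession}

object BulkInsertWithoutAutoClean {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hudi-bulk-insert").getOrCreate()
    val df = spark.read.parquet("/staging/events")   // hypothetical input

    df.write.format("hudi")
      .option("hoodie.table.name", "events")                              // hypothetical
      .option("hoodie.datasource.write.operation", "bulk_insert")
      .option("hoodie.datasource.write.recordkey.field", "event_id")      // hypothetical
      .option("hoodie.datasource.write.partitionpath.field", "event_date")// hypothetical
      .option("hoodie.clean.automatic", "false")       // skip inline cleaning for bulk_insert
      .mode(SaveMode.Append)
      .save("/data/my_table")                          // hypothetical table location
  }
}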
