HBASE-29209: Implement Backup Cleanup Command to Remove Older WALs Not Required for PITR of Any Table #6847

vinayakphegde · 2025-03-24T05:09:13Z

This PR introduces the CleanupCommand, which is used in the removal of obsolete Write-Ahead Log (WAL) and bulk-loaded files to optimize storage usage in HBase's continuous backup system. The cleanup process identifies the oldest full backup timestamp and deletes WAL files older than this cutoff, ensuring Point-in-Time Recovery (PITR) is not compromised.

Apache-HBase · 2025-03-26T11:06:55Z

🎊 +1 overall

Vote	Subsystem	Runtime	Logfile	Comment
+0 🆗	reexec	0m 31s		Docker mode activated.
			_ Prechecks _
+1 💚	dupname	0m 0s		No case conflicting files found.
+0 🆗	codespell	0m 1s		codespell was not available.
+0 🆗	detsecrets	0m 1s		detect-secrets was not available.
+1 💚	@author	0m 0s		The patch does not contain any @author tags.
+1 💚	hbaseanti	0m 0s		Patch does not have any anti-patterns.
			_ HBASE-28957 Compile Tests _
+1 💚	mvninstall	4m 5s		HBASE-28957 passed
+1 💚	compile	1m 2s		HBASE-28957 passed
+1 💚	checkstyle	0m 15s		HBASE-28957 passed
+1 💚	spotbugs	0m 55s		HBASE-28957 passed
+1 💚	spotless	1m 9s		branch has no errors when running spotless:check.
			_ Patch Compile Tests _
+1 💚	mvninstall	4m 3s		the patch passed
+1 💚	compile	0m 29s		the patch passed
-0 ⚠️	javac	0m 29s	/results-compile-javac-hbase-backup.txt	hbase-backup generated 5 new + 108 unchanged - 0 fixed = 113 total (was 108)
+1 💚	blanks	0m 0s		The patch has no blanks issues.
+1 💚	checkstyle	0m 9s		the patch passed
+1 💚	spotbugs	0m 35s		the patch passed
+1 💚	hadoopcheck	12m 23s		Patch does not cause any errors with Hadoop 3.3.6 3.4.0.
+1 💚	spotless	0m 58s		patch has no errors when running spotless:check.
			_ Other Tests _
+1 💚	asflicense	0m 12s		The patch does not generate ASF License warnings.
		35m 58s

Subsystem	Report/Notes
Docker	ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6847/4/artifact/yetus-general-check/output/Dockerfile
GITHUB PR	#6847
JIRA Issue	HBASE-29209
Optional Tests	dupname asflicense javac spotbugs checkstyle codespell detsecrets compile hadoopcheck hbaseanti spotless
uname	Linux dd9c547cafc1 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool	maven
Personality	dev-support/hbase-personality.sh
git revision	HBASE-28957 / `8887d7e`
Default Java	Eclipse Adoptium-17.0.11+9
Max. process+thread count	84 (vs. ulimit of 30000)
modules	C: hbase-backup U: hbase-backup
Console output	https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6847/4/console
versions	git=2.34.1 maven=3.9.8 spotbugs=4.7.3
Powered by	Apache Yetus 0.15.0 https://yetus.apache.org

This message was automatically generated.

Apache-HBase · 2025-03-26T11:12:13Z

🎊 +1 overall

Vote	Subsystem	Runtime	Logfile	Comment
+0 🆗	reexec	0m 29s		Docker mode activated.
-0 ⚠️	yetus	0m 2s		Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --author-ignore-list --blanks-eol-ignore-file --blanks-tabs-ignore-file --quick-hadoopcheck
			_ Prechecks _
			_ HBASE-28957 Compile Tests _
+1 💚	mvninstall	3m 16s		HBASE-28957 passed
+1 💚	compile	0m 18s		HBASE-28957 passed
+1 💚	javadoc	0m 15s		HBASE-28957 passed
+1 💚	shadedjars	6m 2s		branch has no errors when building our shaded downstream artifacts.
			_ Patch Compile Tests _
+1 💚	mvninstall	3m 2s		the patch passed
+1 💚	compile	0m 19s		the patch passed
+1 💚	javac	0m 19s		the patch passed
+1 💚	javadoc	0m 14s		the patch passed
+1 💚	shadedjars	5m 56s		patch has no errors when building our shaded downstream artifacts.
			_ Other Tests _
+1 💚	unit	20m 31s		hbase-backup in the patch passed.
		41m 21s

Subsystem	Report/Notes
Docker	ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6847/4/artifact/yetus-jdk17-hadoop3-check/output/Dockerfile
GITHUB PR	#6847
JIRA Issue	HBASE-29209
Optional Tests	javac javadoc unit compile shadedjars
uname	Linux 60314a5d5273 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool	maven
Personality	dev-support/hbase-personality.sh
git revision	HBASE-28957 / `8887d7e`
Default Java	Eclipse Adoptium-17.0.11+9
Test Results	https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6847/4/testReport/
Max. process+thread count	3754 (vs. ulimit of 30000)
modules	C: hbase-backup U: hbase-backup
Console output	https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6847/4/console
versions	git=2.34.1 maven=3.9.8
Powered by	Apache Yetus 0.15.0 https://yetus.apache.org

This message was automatically generated.

rmdmattingly · 2025-03-26T11:19:35Z

hbase-backup/src/main/java/org/apache/hadoop/hbase/backup/BackupDriver.java

+    } else if (BackupCommand.CLEANUP.name().equalsIgnoreCase(cmd)) {
+      type = BackupCommand.CLEANUP;
    } else {


I would name this something more specific, unless this command intends to clean up entries that may be left behind for full and incremental backups as well

rmdmattingly · 2025-03-26T11:24:40Z

hbase-backup/src/main/java/org/apache/hadoop/hbase/backup/impl/BackupCommands.java

+   * The {@code CleanupCommand} class is responsible for removing Write-Ahead Log (WAL) and
+   * bulk-loaded files that are no longer needed for Point-in-Time Recovery (PITR).
+   * <p>
+   * The cleanup process follows these steps:
+   * <ol>
+   * <li>Identify the oldest full backup and its start timestamp.</li>
+   * <li>Delete WAL files older than this timestamp, as they are no longer usable for PITR with any
+   * backup.</li>
+   * </ol>


The standard approach in HBase is to delete old files via extensions of the BaseFileCleanerDelegate. For example, the BackupLogCleaner which already handles cleaning up WALs as they relate to backups.

These cleaners should be run by the HMaster's CleanerChore, which will ensure that we only delete files which live in the intersection of all cleaners' outputs. On top of that critical safety guarantee, this also has the advantage of being run periodically, automatically — for more sophisticated HBase operators, this is a critical advantage because manual operations for textbook operations like backups do not scale well.

Thanks, @rmdmattingly, for the review comments!

I’d like to clarify that we are specifically cleaning up WALs in the backup location (e.g., S3, where they are continuously replicated), not the cluster’s WALs. If we were dealing with cluster WALs, your point would certainly apply—does that sound correct?

Regarding keeping this command manual:

All other backup-related commands are currently manual.

This command depends on the delete command. What we are doing here is identifying the first full backup in the system and deleting all WALs before that point.

These WALs are needed for point-in-time recovery (PITR), where we replay WALs from the full backup up to the specified recovery point. If the full backup itself no longer exists, keeping those WALs serves no purpose.

Since deleting full backups is already a manual operation, there is little benefit in automating this cleanup.

That said, we could explore the possibility of running this command periodically and automatically in future iterations.
Let me know your thoughts!

Ah okay, thanks for the clarifications here. Maybe we could bake this clarification into the JavaDocs

You make some good points here, but I don't think they full take into account the variety of ways in which people deploy HBase.

All other backup-related commands are currently manual.

This is true to a large extent, but the non-emergency commands have at least been exposed in the Admin interface to make programmatic backups easily achievable. Maybe wiring up through the Admin is a fair compromise?

This command depends on the delete command. What we are doing here is identifying the first full backup in the system and deleting all WALs before that point.

If this operation can only follow a delete, and WALs are made useless by said delete, then should this operation just be a part of the backup deletion process?

Since deleting full backups is already a manual operation, there is little benefit in automating this cleanup.

I don't think it's true that backup deletions are necessarily manual from an operator's perspective. For example, a company backing up their data in S3 could be making use of bucket TTLs to clean up their old backups. In that case, it would be nice for unusable WALs to clean themselves up organically too.

If this operation can only follow a delete, and WALs are made useless by said delete, then should this operation just be a part of the backup deletion process?

That's my point as well. We could do this cleanup as part of the backup delete command, in which case we don't need to deal with whether this should be automatic or manual.

Thanks, @rmdmattingly and @anmolnar.

Okay, we can incorporate this cleanup process into the delete command itself.

Currently, the delete command is used to remove both full and incremental backups. We have now introduced a new validation for PITR-Critical Backup Deletion. Please check the PR here: #6848 and review it.

I will also add this cleanup logic at the end of the delete process to remove any WALs that can be deleted (which were previously retained due to this backup). How does that sound?

rmdmattingly · 2025-03-26T11:25:59Z

I also want to say thank you for all of your work on the continuous backup idea! We have a homegrown solution at my company to achieve something similar, and I would love if we could stop maintaining that system upon adoption of your work here.

vinayakphegde · 2025-03-26T13:39:38Z

I also want to say thank you for all of your work on the continuous backup idea! We have a homegrown solution at my company to achieve something similar, and I would love if we could stop maintaining that system upon adoption of your work here.

Sure, @rmdmattingly! Thanks for the review. We'll keep you involved and look forward to your input. :)

anmolnar · 2025-03-28T22:54:20Z

ping @taklwu @ss77892

kgeisz

LGTM, although @rmdmattingly's change request has still not been addressed.

kgeisz · 2025-05-23T01:31:18Z

hbase-backup/src/main/java/org/apache/hadoop/hbase/backup/impl/BackupCommands.java

+     */
+    private void cleanupOldWALFiles(Configuration conf, String backupWalDir, long cutoffTime)
+      throws IOException {
+      System.out.println("Starting WAL cleanup in backup directory: " + backupWalDir


@vinayakphegde, are you using System.out.println() in this file so output can be displayed to the user when they use hbase command line commands?

kgeisz · 2025-05-23T01:32:41Z

hbase-backup/src/main/java/org/apache/hadoop/hbase/backup/impl/BackupCommands.java

+            .println("Checking WAL directory: " + dirName + " (Start Time: " + dayStart + ")");
+
+          // If WAL files of that day are older than cutoff time, delete them
+          if (dayStart + ONE_DAY_IN_MILLISECONDS - 1 < cutoffTime) {


@vinayakphegde, just curious, what is the purpose of adding ONE_DAY_IN_MILLISECONDS - 1? Is dayStart always at the beginning of the day and ONE_DAY_IN_MILLISECONDS - 1 moves the timestamp to the end of the day?

vinayakphegde · 2025-06-13T09:25:06Z

Cleanup logic is integrated with Delete Command: https://issues.apache.org/jira/browse/HBASE-29255

anmolnar force-pushed the HBASE-28957 branch from fcee6f7 to 5c9bcbd Compare March 25, 2025 17:43

vinayakphegde added 3 commits March 26, 2025 13:35

Store bulkload files in daywise bucket as well

20b1f60

add cleanup command for wal cleanup

b8ee340

Update the system table new cutoff times

8a117d8

vinayakphegde force-pushed the HBASE-28992 branch from 146668b to 8a117d8 Compare March 26, 2025 08:13

This comment has been minimized.

Sign in to view

fix minor test error

8887d7e

rmdmattingly requested changes Mar 26, 2025

View reviewed changes

kgeisz approved these changes May 23, 2025

View reviewed changes

kgeisz reviewed May 23, 2025

View reviewed changes

vinayakphegde closed this Jun 13, 2025

HBASE-29209: Implement Backup Cleanup Command to Remove Older WALs Not Required for PITR of Any Table #6847

HBASE-29209: Implement Backup Cleanup Command to Remove Older WALs Not Required for PITR of Any Table #6847

Uh oh!

Conversation

vinayakphegde commented Mar 24, 2025

Uh oh!

This comment has been minimized.

This comment has been minimized.

Apache-HBase commented Mar 26, 2025

Uh oh!

Apache-HBase commented Mar 26, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rmdmattingly commented Mar 26, 2025

Uh oh!

vinayakphegde commented Mar 26, 2025

Uh oh!

anmolnar commented Mar 28, 2025

Uh oh!

kgeisz left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

vinayakphegde commented Jun 13, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants