Skip to content

Add option to only skip trash for Import files#3140

Closed
alerman wants to merge 8 commits intoapache:mainfrom
alerman:skipTrashImportsOnly
Closed

Add option to only skip trash for Import files#3140
alerman wants to merge 8 commits intoapache:mainfrom
alerman:skipTrashImportsOnly

Conversation

@alerman
Copy link
Copy Markdown
Contributor

@alerman alerman commented Dec 23, 2022

This MR adds an option to optionally skip trash for Import files only.
On our clusters, we find that we have to skip trash due to pressure on the namenodes and the scale that we deal with.

We have seen a few instances of files going missing. We are still in the process of investigating (we thought it would be fixed by #2792 but was not.). When this happens because of our use case we can typically track down the data that contributed to Import files. However, data that has been compacted by one or more tablet servers is more difficult to tack down. By enabling this new option, we will have time to catch this issue and recover the data before the trash is emptied while we determine the cause of the files going missing.

Intention is to backport this to 1.10 once the approach is agreed upon

@matthpeterson
Copy link
Copy Markdown
Contributor

matthpeterson commented Dec 23, 2022

Recommendation: add a test for an I-File using a path containing directories. Otherwise it looks good.

@alerman
Copy link
Copy Markdown
Contributor Author

alerman commented Dec 23, 2022

Recommendation: add a test for an I-File using a path containing directories. Otherwise it looks good.

Done. Nice catch.

@alerman alerman force-pushed the skipTrashImportsOnly branch from 231955f to 8181c9e Compare December 23, 2022 14:52
@alerman alerman force-pushed the skipTrashImportsOnly branch from 8181c9e to dbbba74 Compare December 23, 2022 14:54
GC_TRASH_IGNORE("gc.trash.ignore", "false", PropertyType.BOOLEAN,
"Do not use the Trash, even if it is configured.", "1.5.0"),
GC_TRASH_IGNORE_IMPORTS_ONLY("gc.trash.ignore.imports.only", "false", PropertyType.BOOLEAN,
"Skip trash only for Import files when gc.trash.ignore is true.", "2.1.0"),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will be since 3.0.0 (or maybe 2.1.1 if backported).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to show 3.0.0

@ivakegg
Copy link
Copy Markdown
Contributor

ivakegg commented Dec 30, 2022

Does a skipTrashRegex make sense? Just wondering if something a little more generic would be useful. The other thought I had was whether this could be made table based (may not be possible). I am fine with this as is, just a few thoughts to consider.

@alerman
Copy link
Copy Markdown
Contributor Author

alerman commented Jan 4, 2023

I don't want to go through the effort for a regex for this, as that brings up some other questions in my opinion. For now, we take the simple approach and we can expand later if we deem it necessary.

public void testMoveToTrash_NotUsingTrash_importsOnlyEnabled() throws Exception {
systemConfig.set(Property.GC_TRASH_IGNORE.getKey(), "true");
systemConfig.set(Property.GC_TRASH_IGNORE_IMPORTS_ONLY.getKey(), "true");
Path iFilePath = new Path("Ifile");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please take a look at TabletFileTest.testValidPaths for some more realistic file and path names.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the PR with paths similar to those in testValidPaths

@dlmarion dlmarion requested review from EdColeman and ctubbsii January 6, 2023 14:57
GC_TRASH_IGNORE("gc.trash.ignore", "false", PropertyType.BOOLEAN,
"Do not use the Trash, even if it is configured.", "1.5.0"),
GC_TRASH_IGNORE_IMPORTS_ONLY("gc.trash.ignore.imports.only", "false", PropertyType.BOOLEAN,
"Skip trash only for Import files when gc.trash.ignore is true.", "3.0.0"),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this confusing, Just from the descriptions, if the system is configured to use trash - then setting GC_TRASH_IGNORE = true overrides the system setting and files are not moved to trash and are immediately deleted.

With GC_TRASH_IGNORE=true, setting GC_TRASH_IGNORE_IMPORTS_ONLY=true - then bulk import files skip the trash, but all other files will use the trash?

Would it be easier for users if these were more independent?

  • GC_TRASH_IGNORE true - skip trash, false - use trash if configured for the system.
  • GC_TRASH_IGNORE_IMPORTS_ONLY - true - skip trash for bulk import files, overrides any GC_TRASH_IGNORE setting for bulk import files. false - use GC_TRASH_IGNORE or the system setting.

My thinking is along the lines of:

system GC_TRASH_IGNORE GC_TRASH_IGNORE_IMPORTS_ONLY impact
trash disable n/a n/a skip trash, all files
trash enabled N N ues trash, all files
trash enabled N Y skip trash, bulk only
trash enabled Y N skip trash, all files
trash enabled Y Y skip trash, all files

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alerman and I had a discussion about this. He suggested renaming the properties to GC_USE_TRASH. It was very confusing to me working on this because things were not named in the affirmative, there were a bunch of negations in the code.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what I ran into too - but renaming the GC_TRASH_IGNORE property seemed something to be avoided. Maybe if the new property was GC_TRASH_IGNORE_IMPORTS - that would make having the usage I described above easier to communicate?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  GC_USE_TRASH("gc.trash.ignore", "true", PropertyType.BOOLEAN,
      "If trash is configured, move deleted non-bulk import files there instead of deleting them", "1.5.0"),
  GC_BULK_IMPORT_USE_TRASH("gc.trash.use.bulk.imports", "true", PropertyType.BOOLEAN,
      "If trash is configured, move deleted bulk import files there instead of deleting them", "3.0.0"),
Trash Enabled GC_USE_TRASH GC_BULK_IMPORT_USE_TRASH Result
N N/A N/A all files deleted
Y true true all files moved to trash
Y false true only bulk import files moved to trash
Y true false bulk import files deleted , others moved to trash
Y false false all files deleted

Copy link
Copy Markdown
Contributor

@EdColeman EdColeman Jan 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine too - but that requires property upgrade code to translate the settings - and may have issues because we may need to deprecate, but keep GC_TRASH_IGNORE around for a while? Not sure...

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The table helps, but it also highlights how unintuitive having three separate properties are to control a desired outcome. I would much rather have a simple single property that had multiple options to select the specific policy, rather than leave it as an exercise for the user to try to determine the current policy by looking it up in a table like this.

Copy link
Copy Markdown
Member

@ctubbsii ctubbsii left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had one comment about how the new property is a bit confusing and unintuitive for how it's supposed to interact with the other properties.

However, I'm also thinking that we probably don't need this feature at all. I'm thinking it probably makes more sense for Accumulo to just have the simple trash-or-no-trash option, and users can just adjust their own workflows to remove stuff from the HDFS trash immediately if they don't need it there. They don't need Accumulo to do fine-grain management of their HDFS trash utilization.

Comment on lines +749 to +750
GC_TRASH_IGNORE_IMPORTS_ONLY("gc.trash.ignore.imports.only", "false", PropertyType.BOOLEAN,
"Skip Trash for Import files when the trash is in use.", "3.0.0"),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In spite of the description, it's not intuitive how this new property will interact with the previous gc.trash.ignore property. Rather than have a new BOOLEAN that modifies the behavior of the previous BOOLEAN only when the previous one was "true", how about changing gc.trash.ignore to accept a third possible value?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ctubbsii, I was actually thinking you were going to take the discussion in a different direction and also suggest that we might consider doing this on a per table basis instead of as an all or nothing approach (all tables or no tables) as is now. That it would be better to make this easier and less error prone for users instead of having them manage and configure something like regexs over Accumulo files in HDFS trash. From a durability standpoint, we allow walogs and replication to be configured on a per table basis. Trash is just another safety net that a user may wish to user. Some tables can be more important than others and allowing that decision to be made on a per table basis to include bulk input files, seems like a useful feature.

I don't see this as Accumulo managing the trash. I see this as more of giving the user explicit control over what Accumulo puts in the trash. Once it is in the trash Accumulo isn't doing anything with it.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having a per-table property might be useful, but the unintuitiveness of two separate properties to control three modes of operation would still be an issue that needs addressing. The per-table option wasn't what was being proposed, and it didn't occur to me. I can see the merits of it, but it's not so simple to implement, since the accumulo-gc operates asynchronously, and doesn't know the table from which a file came... it can only make educated guesses based on the file's path, but that's not reliable, since it could have been used by multiple tables, or manually placed through some import process that didn't include a table id in the path... which is only done by convention, not a guarantee.

My view that Accumulo is managing the trash is based on our long-held desire to try to decouple ourselves from HDFS, and I'm recognizing the trash as an HDFS-specific feature. The more that filesystem-related behavior is handled within the filesystem config, and Accumulo-related behavior is handled within Accumulo's config, the better, I think. While this may be less convenient initially, I think it makes greater sense as the overall system architecture grows in complexity. Replication (number of block replicas) and WALOG durability (hsync, hflush) are also filesystem-specific behaviors, but those knobs already exist. I'm more inclined to scrutinize the knobs that further tightly couple us to HDFS when they are new, so we're not digging ourselves a deeper hole that is harder to get out of (to decouple). But this is just a heuristic I apply to help me think about the consequences of a change... it's not a fixed prerequisite for me to support a change or anything like that.

I was also thinking about an "on-delete" hook or similar... but I don't think I'd be in favor of that, because I think if a user wanted that, they could just subclass an existing FileSystem and add that feature to its delete method. If every user did that, it would be a hassle... but making a new FineGrainedTrashPolicyFileSystem only needs to be done once. I'm not suggesting this is the best option... just one possible alternative solution to the property changes being proposed here.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I like the suggestion of one setting. Something like gc_skip_trash = { ALL_FILES, BULK_FILES, NO_FILES}. Is that more inline with what you were thinking?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's what I was thinking when I said "third possible value" above. But, that solution is only good if that is the end of the requirements to expand this behavior and the number of configurable values stops at three. If the requirements expand to support additional use cases, then it would be better to make this behavior a pluggable policy than to enumerate a long list of possible values.

@dlmarion
Copy link
Copy Markdown
Contributor

I created #3428 against 2.1 as a replacement for this PR. It incorporates some of the suggestions here.

@ctubbsii
Copy link
Copy Markdown
Member

Closing this as superseded by #3428 . Ongoing design discussion in the code review there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants