Add option to only skip trash for Import files#3140
Add option to only skip trash for Import files#3140alerman wants to merge 8 commits intoapache:mainfrom
Conversation
|
Recommendation: add a test for an I-File using a path containing directories. Otherwise it looks good. |
Done. Nice catch. |
231955f to
8181c9e
Compare
8181c9e to
dbbba74
Compare
| GC_TRASH_IGNORE("gc.trash.ignore", "false", PropertyType.BOOLEAN, | ||
| "Do not use the Trash, even if it is configured.", "1.5.0"), | ||
| GC_TRASH_IGNORE_IMPORTS_ONLY("gc.trash.ignore.imports.only", "false", PropertyType.BOOLEAN, | ||
| "Skip trash only for Import files when gc.trash.ignore is true.", "2.1.0"), |
There was a problem hiding this comment.
I think this will be since 3.0.0 (or maybe 2.1.1 if backported).
There was a problem hiding this comment.
Updated to show 3.0.0
|
Does a skipTrashRegex make sense? Just wondering if something a little more generic would be useful. The other thought I had was whether this could be made table based (may not be possible). I am fine with this as is, just a few thoughts to consider. |
|
I don't want to go through the effort for a regex for this, as that brings up some other questions in my opinion. For now, we take the simple approach and we can expand later if we deem it necessary. |
| public void testMoveToTrash_NotUsingTrash_importsOnlyEnabled() throws Exception { | ||
| systemConfig.set(Property.GC_TRASH_IGNORE.getKey(), "true"); | ||
| systemConfig.set(Property.GC_TRASH_IGNORE_IMPORTS_ONLY.getKey(), "true"); | ||
| Path iFilePath = new Path("Ifile"); |
There was a problem hiding this comment.
Please take a look at TabletFileTest.testValidPaths for some more realistic file and path names.
There was a problem hiding this comment.
Updated the PR with paths similar to those in testValidPaths
| GC_TRASH_IGNORE("gc.trash.ignore", "false", PropertyType.BOOLEAN, | ||
| "Do not use the Trash, even if it is configured.", "1.5.0"), | ||
| GC_TRASH_IGNORE_IMPORTS_ONLY("gc.trash.ignore.imports.only", "false", PropertyType.BOOLEAN, | ||
| "Skip trash only for Import files when gc.trash.ignore is true.", "3.0.0"), |
There was a problem hiding this comment.
I find this confusing, Just from the descriptions, if the system is configured to use trash - then setting GC_TRASH_IGNORE = true overrides the system setting and files are not moved to trash and are immediately deleted.
With GC_TRASH_IGNORE=true, setting GC_TRASH_IGNORE_IMPORTS_ONLY=true - then bulk import files skip the trash, but all other files will use the trash?
Would it be easier for users if these were more independent?
- GC_TRASH_IGNORE true - skip trash, false - use trash if configured for the system.
- GC_TRASH_IGNORE_IMPORTS_ONLY - true - skip trash for bulk import files, overrides any GC_TRASH_IGNORE setting for bulk import files. false - use GC_TRASH_IGNORE or the system setting.
My thinking is along the lines of:
| system | GC_TRASH_IGNORE | GC_TRASH_IGNORE_IMPORTS_ONLY | impact |
|---|---|---|---|
| trash disable | n/a | n/a | skip trash, all files |
| trash enabled | N | N | ues trash, all files |
| trash enabled | N | Y | skip trash, bulk only |
| trash enabled | Y | N | skip trash, all files |
| trash enabled | Y | Y | skip trash, all files |
There was a problem hiding this comment.
@alerman and I had a discussion about this. He suggested renaming the properties to GC_USE_TRASH. It was very confusing to me working on this because things were not named in the affirmative, there were a bunch of negations in the code.
There was a problem hiding this comment.
That's what I ran into too - but renaming the GC_TRASH_IGNORE property seemed something to be avoided. Maybe if the new property was GC_TRASH_IGNORE_IMPORTS - that would make having the usage I described above easier to communicate?
There was a problem hiding this comment.
GC_USE_TRASH("gc.trash.ignore", "true", PropertyType.BOOLEAN,
"If trash is configured, move deleted non-bulk import files there instead of deleting them", "1.5.0"),
GC_BULK_IMPORT_USE_TRASH("gc.trash.use.bulk.imports", "true", PropertyType.BOOLEAN,
"If trash is configured, move deleted bulk import files there instead of deleting them", "3.0.0"),
| Trash Enabled | GC_USE_TRASH | GC_BULK_IMPORT_USE_TRASH | Result |
|---|---|---|---|
| N | N/A | N/A | all files deleted |
| Y | true | true | all files moved to trash |
| Y | false | true | only bulk import files moved to trash |
| Y | true | false | bulk import files deleted , others moved to trash |
| Y | false | false | all files deleted |
There was a problem hiding this comment.
That's fine too - but that requires property upgrade code to translate the settings - and may have issues because we may need to deprecate, but keep GC_TRASH_IGNORE around for a while? Not sure...
There was a problem hiding this comment.
The table helps, but it also highlights how unintuitive having three separate properties are to control a desired outcome. I would much rather have a simple single property that had multiple options to select the specific policy, rather than leave it as an exercise for the user to try to determine the current policy by looking it up in a table like this.
ctubbsii
left a comment
There was a problem hiding this comment.
I had one comment about how the new property is a bit confusing and unintuitive for how it's supposed to interact with the other properties.
However, I'm also thinking that we probably don't need this feature at all. I'm thinking it probably makes more sense for Accumulo to just have the simple trash-or-no-trash option, and users can just adjust their own workflows to remove stuff from the HDFS trash immediately if they don't need it there. They don't need Accumulo to do fine-grain management of their HDFS trash utilization.
| GC_TRASH_IGNORE_IMPORTS_ONLY("gc.trash.ignore.imports.only", "false", PropertyType.BOOLEAN, | ||
| "Skip Trash for Import files when the trash is in use.", "3.0.0"), |
There was a problem hiding this comment.
In spite of the description, it's not intuitive how this new property will interact with the previous gc.trash.ignore property. Rather than have a new BOOLEAN that modifies the behavior of the previous BOOLEAN only when the previous one was "true", how about changing gc.trash.ignore to accept a third possible value?
There was a problem hiding this comment.
@ctubbsii, I was actually thinking you were going to take the discussion in a different direction and also suggest that we might consider doing this on a per table basis instead of as an all or nothing approach (all tables or no tables) as is now. That it would be better to make this easier and less error prone for users instead of having them manage and configure something like regexs over Accumulo files in HDFS trash. From a durability standpoint, we allow walogs and replication to be configured on a per table basis. Trash is just another safety net that a user may wish to user. Some tables can be more important than others and allowing that decision to be made on a per table basis to include bulk input files, seems like a useful feature.
I don't see this as Accumulo managing the trash. I see this as more of giving the user explicit control over what Accumulo puts in the trash. Once it is in the trash Accumulo isn't doing anything with it.
There was a problem hiding this comment.
Having a per-table property might be useful, but the unintuitiveness of two separate properties to control three modes of operation would still be an issue that needs addressing. The per-table option wasn't what was being proposed, and it didn't occur to me. I can see the merits of it, but it's not so simple to implement, since the accumulo-gc operates asynchronously, and doesn't know the table from which a file came... it can only make educated guesses based on the file's path, but that's not reliable, since it could have been used by multiple tables, or manually placed through some import process that didn't include a table id in the path... which is only done by convention, not a guarantee.
My view that Accumulo is managing the trash is based on our long-held desire to try to decouple ourselves from HDFS, and I'm recognizing the trash as an HDFS-specific feature. The more that filesystem-related behavior is handled within the filesystem config, and Accumulo-related behavior is handled within Accumulo's config, the better, I think. While this may be less convenient initially, I think it makes greater sense as the overall system architecture grows in complexity. Replication (number of block replicas) and WALOG durability (hsync, hflush) are also filesystem-specific behaviors, but those knobs already exist. I'm more inclined to scrutinize the knobs that further tightly couple us to HDFS when they are new, so we're not digging ourselves a deeper hole that is harder to get out of (to decouple). But this is just a heuristic I apply to help me think about the consequences of a change... it's not a fixed prerequisite for me to support a change or anything like that.
I was also thinking about an "on-delete" hook or similar... but I don't think I'd be in favor of that, because I think if a user wanted that, they could just subclass an existing FileSystem and add that feature to its delete method. If every user did that, it would be a hassle... but making a new FineGrainedTrashPolicyFileSystem only needs to be done once. I'm not suggesting this is the best option... just one possible alternative solution to the property changes being proposed here.
There was a problem hiding this comment.
Oh, I like the suggestion of one setting. Something like gc_skip_trash = { ALL_FILES, BULK_FILES, NO_FILES}. Is that more inline with what you were thinking?
There was a problem hiding this comment.
Yes, that's what I was thinking when I said "third possible value" above. But, that solution is only good if that is the end of the requirements to expand this behavior and the number of configurable values stops at three. If the requirements expand to support additional use cases, then it would be better to make this behavior a pluggable policy than to enumerate a long list of possible values.
|
I created #3428 against 2.1 as a replacement for this PR. It incorporates some of the suggestions here. |
|
Closing this as superseded by #3428 . Ongoing design discussion in the code review there. |
This MR adds an option to optionally skip trash for Import files only.
On our clusters, we find that we have to skip trash due to pressure on the namenodes and the scale that we deal with.
We have seen a few instances of files going missing. We are still in the process of investigating (we thought it would be fixed by #2792 but was not.). When this happens because of our use case we can typically track down the data that contributed to Import files. However, data that has been compacted by one or more tablet servers is more difficult to tack down. By enabling this new option, we will have time to catch this issue and recover the data before the trash is emptied while we determine the cause of the files going missing.
Intention is to backport this to 1.10 once the approach is agreed upon