Conversation
This commit disables chop compactions on merge and modifies the tablet metadata update in TabletGroupWatcher to fence off each tablet file to its tablet range as it's copied to the highest tablet on merge. This prevents data from coming back on merge, as scans will be fenced to the provided range on the metadata updates.
| // TODO If there is an existing range we likely need to clip the range with something
| // like the following which is commented out for now. We also may need to handle
| // disjoint ranges
| // fenced = existing.hasRange() ? fenced.clip(existing.getRange()) : fenced;
Wondering if the following would work here. If the existing range is infinite, then fenced should pass through. Depending on how the split code works, it's possible a file could have a range that extends outside of a tablet; I think the following code handles that case, along with the case where the range is completely inside the tablet.
| // TODO If there is an existing range we likely need to clip the range with something
| // like the following which is commented out for now. We also may need to handle
| // disjoint ranges
| // fenced = existing.hasRange() ? fenced.clip(existing.getRange()) : fenced;
| // If the ranges are disjoint this will throw an exception. It's not expected that a
| // tablet would contain a file with a range that was completely outside the tablet.
| // Neither split nor merge should create this situation. Split should only assign
| // files to child tablets that overlap the files range.
| fenced = existing.getRange().clip(fenced);
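For readers less familiar with the Range API referenced in the suggestion, here is a small standalone sketch (not part of the PR; row values are made up) of how clip behaves, including the clip(range, true) overload that a later revision of this code uses to get null instead of an exception for disjoint ranges:

```java
import org.apache.accumulo.core.data.Range;

public class ClipSketch {
  public static void main(String[] args) {
    Range tabletRange = new Range("row_250", "row_750");   // made-up merged tablet range
    Range existingFence = new Range("row_500", "row_900"); // made-up existing file fence

    // Overlapping ranges: clip returns the intersection, here roughly [row_500, row_750].
    System.out.println(existingFence.clip(tabletRange));

    // An infinite existing range lets the tablet range pass straight through.
    System.out.println(new Range().clip(tabletRange));

    // Disjoint ranges: the one-argument clip throws IllegalArgumentException,
    // while clip(range, true) returns null instead of throwing.
    Range disjoint = new Range("row_900", "row_999");
    System.out.println(tabletRange.clip(disjoint, true)); // prints: null
  }
}
```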
I added the range clipping in the latest update, and it also handles the disjoint case.
| && key.getColumnFamily().equals(DataFileColumnFamily.NAME)) {
|
| // Delete existing metadata as we are now adding a range
| m.putDelete(key.getColumnFamily(), key.getColumnQualifier());
May not want to always do this delete. If the existing file's range falls completely within the old tablet's range then its range will not change. If its range does not change, will its new and old metadata entries be identical? If so, maybe we only bother with the delete and add when the computed fenced range is not equal to the file's existing range.
I will check on that, but yeah, in theory we don't need to re-write or change things if the range isn't different or is still valid. I made a comment about optimizations, and I figured there would certainly be things to optimize or improve; this might be one of them.
If a mutation has an add and a delete for the same key, the key may ultimately be deleted. So it could be more than an optimization; it may be needed for correctness, but I'm not completely sure. It depends on what timestamp each entry from the mutation gets on the server side. When two keys have the same timestamp and one is a delete, the delete will sort first, hiding the non-delete.
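Not from the PR, but a quick sketch (assuming the standard Key API) of the ordering described above, where a delete with the same timestamp sorts ahead of, and therefore hides, the put:

```java
import org.apache.accumulo.core.data.Key;
import org.apache.hadoop.io.Text;

public class DeleteSortSketch {
  public static void main(String[] args) {
    long ts = 42L; // same server-assigned timestamp for both entries
    Key put = new Key(new Text("row"), new Text("fam"), new Text("qual"), ts);
    Key del = new Key(new Text("row"), new Text("fam"), new Text("qual"), ts);
    del.setDeleted(true);

    // With identical row/family/qualifier/visibility and timestamp, the delete
    // compares as smaller, so it sorts first and masks the non-delete entry.
    System.out.println(del.compareTo(put) < 0); // expected: true
  }
}
```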
I changed this so it will only delete if the new metadata is different than before
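A rough sketch of that guard, purely illustrative (computeNewFence and its parameters are made-up names, not the actual TabletGroupWatcher change): only produce a new fenced range when it differs from what is already stored, so the mutation never carries both a delete and an add for the same key.

```java
import org.apache.accumulo.core.data.Range;

public class FenceGuardSketch {

  /**
   * Returns the new fenced range for a file being moved to the merged tablet, or null
   * when the existing metadata entry can be left alone because the fence is unchanged
   * (or the ranges turned out to be disjoint, which is handled separately).
   */
  static Range computeNewFence(Range existingFence, Range tabletRange) {
    Range fenced = existingFence == null ? tabletRange : existingFence.clip(tabletRange, true);
    if (fenced == null || fenced.equals(existingFence)) {
      return null; // nothing to rewrite for this file
    }
    return fenced;
  }
}
```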
| // } else {
| // return this.extent.overlaps(otherExtent);
| // }
| return false;
test/src/main/java/org/apache/accumulo/test/functional/MergeIT.java
As noted here, something else I need to add to this is handling range deletion.
@keith-turner - I spent some time today just looking at how to update metadata to avoid chop compactions for deletions of ranges in tables (I'll look at getting rid of splits after that). I have some stuff locally where I was just doing a lot of testing to understand how things work. I ran into one thing that I got stuck on and haven't had time to figure out, so I was wondering if you might have some insight to save me some time. There is an IT for row deletion already called DeleteRowsIT, and it has a test called testManyRows that hangs when chop compactions are disabled. If you run that test against this branch it hangs when scanning inside the testSplit() method, when it is verifying the deleted rows and gets to the metadata. Here is the thread dump from the client where it's stuck: What's interesting is there are no errors that I can see anywhere. I checked all the logs in the mini accumulo directory and there are no exceptions or anything in the tablet server, Manager, etc. logs, so I'm not sure what's up. If you re-enable chop compactions by uncommenting out the code inside of
Also, after looking at things in detail, it seems like the changes to avoid chop compactions for deletions are not trivial (delete doesn't already update metadata the way merge does), so code has to be written to add ranges for RFiles on delete, and we also need to update ranges to avoid splits. So instead of trying to do everything at once, I am thinking it's best to keep the scope here to just changing merges to avoid chop compactions and leave deletions alone (they would still do chops and splits). Then we can have a follow-on issue to tackle the deletion case next, so that we can hopefully get rid of the chop compaction code entirely and also remove the need to do splits on deletion for elasticity.
Pushing the delete range work off to its own PR seems like a good plan to me. For something like this I usually open a follow-on issue before closing the PR so that I don't forget about it.
Created: #3679
Force-pushed from f6a1ec2 to 204010e
Force-pushed from 42b54dd to b8ac631
Latest commit adds some more tests to demonstrate split, delete, merge and then split and merge a second time, to verify that files that already have ranges can be merged. I had to make some tweaks to make sure the proper ranges were carried forward when they already existed. Also, I had to force push because I had to fix things after merging in the wrong branch, but it should be good now.
@keith-turner - This is probably ready for another review. It still needs more tests and some stuff probably needs to be cleaned up, but I am curious what you think after the latest tweaks I made on Friday. I have been testing things against this PR today and on Friday, and so far things are looking pretty good. I ran all the ITs that do merges and everything passes. I also modified ScanConsistencyIT (which does random flushes, compactions, merges, etc.) to run for 15 minutes and there were no errors. I've also been running some tests using accumulo-testing and Uno. I just ran a test with continuous ingest where I created a table with 20 tablets and ingested about 50 million entries with no deletes (about 1.8 GB of data). I then created 200 splits and ran a merge on the table. Using main (so chop compactions were needed) it took ~50 seconds to complete the merge. Running the same test with my no-chop merge prototype, it took ~3 seconds. Afterwards I ran verify, which completed successfully. So it's pretty cool to see the time drop so drastically for a merge, and that difference will be even greater with more servers and data. Merge should also be even faster with elasticity and offline tablets when not having to unload things. I just started a full IT run to see what the results look like and will report back when it is finished.
Full IT build ran and passed. |
server/base/src/test/java/org/apache/accumulo/server/manager/state/MergeInfoTest.java
server/base/src/main/java/org/apache/accumulo/server/manager/state/MergeInfo.java
server/base/src/test/java/org/apache/accumulo/server/manager/state/MergeInfoTest.java
@keith-turner - I had 2 questions about this that I figured you might be able to help answer when reviewing, since the TabletGroupWatcher class and Manager code in general are pretty new to me still.
That probably would be cleaner. That is what happens when using Ample: it buffers the entire row and uses that column to construct the extent for the tablet. Ample can also optionally check the linking of tablets when reading them.
There should not be anything else writing to the tablet's metadata because of the following combination of factors.
These are some basics of Accumulo's current concurrency model. It's very different in the elasticity branch.
keith-turner left a comment
Did not get a chance to look at the test changes; I still want to circle back to that.
| // merged to so we need to handle the deletes on update here as it won't be handled later
| // TODO: Can this be re-written to not have a special edge case and make it simpler?
| if (keyExtent.equals(stopExtent)) {
| if (previousKeyExtent != null
It seems like previousKeyExtent being null would only happen if merging a single tablet? Does the merge code not have special handling for that case and exit much earlier?
Good question. I just put the null check in there, but you are right that it would only happen in the single-tablet case. You are right; I think the merge should just do nothing if it detects it's only one tablet.
| fenced = existing.hasRange() ? existing.getRange().clip(fenced, true) : fenced;
|
| // If null the range is disjoint with the new tablet range so need to handle
| if (fenced == null) {
For this case to happen, would a tablet have to contain a file with a disjoint range? I'm thinking Accumulo should never create this situation in the first place. It feels like this case could hide bugs; I'm wondering if it should throw an exception or log something instead.
The root cause of this is that the second split I'm doing in my test produces a disjoint file in the tablet, so based on your comment I think you are right that we should probably just prevent this. The following scenario shows it happening:
This case happens when you do multiple merges/splits, as shown in the noChopMergeTestMultipleMerges() test in MergeIT, which demonstrates the issue.
In the test, the initial split/merge is done. After the first merge the following is in metadata with one tablet and ranged files:
Extent: 1<<; File Name: F0000001.rf; Range: [row_0000000500%00; : [] 9223372036854775807 false,row_0000000750%00; : [] 9223372036854775807 false); Entries: 222, Size: 114
Extent: 1<<; File Name: F0000001.rf; Range: [row_0000000750%00; : [] 9223372036854775807 false,+inf); Entries: 445, Size: 229
Extent: 1<<; File Name: A0000007.rf; Range: [row_0000000250%00; : [] 9223372036854775807 false,row_0000000500%00; : [] 9223372036854775807 false); Entries: 150, Size: 354
Extent: 1<<; File Name: A0000008.rf; Range: (-inf,row_0000000250%00; : [] 9223372036854775807 false); Entries: 150, Size: 343
I then add 3 splits at 250, 500, 750 and we end up with the following fenced files and their tablets:
Extent: 1;row_0000000250<; File Name: A0000008.rf; Range: (-inf,row_0000000250%00; : [] 9223372036854775807 false); Entries: 150, Size: 343
Extent: 1;row_0000000500;row_0000000250; File Name: A0000007.rf; Range: [row_0000000250%00; : [] 9223372036854775807 false,row_0000000500%00; : [] 9223372036854775807 false); Entries: 150, Size: 354
Extent: 1;row_0000000750;row_0000000500; File Name: F0000001.rf; Range: [row_0000000500%00; : [] 9223372036854775807 false,row_0000000750%00; : [] 9223372036854775807 false); Entries: 74, Size: 38
Extent: 1<;row_0000000750; File Name: F0000001.rf; Range: [row_0000000500%00; : [] 9223372036854775807 false,row_0000000750%00; : [] 9223372036854775807 false); Entries: 149, Size: 77
Extent: 1<;row_0000000750; File Name: F0000001.rf; Range: [row_0000000750%00; : [] 9223372036854775807 false,+inf); Entries: 445, Size: 229
And here lies the problem: when the split ran it kept an extra file around that is disjoint, so when we perform a merge again we end up with the disjoint issue.
The following file is disjoint:
Extent: 1<;row_0000000750; File Name: F0000001.rf; Range: [row_0000000500%00; : [] 9223372036854775807 false,row_0000000750%00; : [] 9223372036854775807 false); Entries: 149, Size: 77
As you can see, that tablet's range is 750 (exclusive) to infinity, but the range of the file that was added during the split is 500 (exclusive) to 750 (inclusive). So we either need to remove it during the split or handle it later during the merge.
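To make the disjoint case concrete, here is a hedged standalone illustration (simplified, not the real metadata code): the %00 in the dumps above is a trailing zero byte, so a fence ending at row_0000000750%00 (exclusive) covers all of row 750 but shares no keys with a tablet whose range starts after row 750.

```java
import org.apache.accumulo.core.data.Range;
import org.apache.hadoop.io.Text;

public class DisjointFenceSketch {
  public static void main(String[] args) {
    // Tablet 1<;row_0000000750 holds rows strictly after row_0000000750.
    Range tabletRange = new Range(new Text("row_0000000750"), false, null, false);

    // The stray file's fence covers (row_0000000500, row_0000000750], i.e. it ends
    // exactly where the tablet begins, so the two ranges share no keys.
    Range fileFence =
        new Range(new Text("row_0000000500"), false, new Text("row_0000000750"), true);

    // clip with returnNullIfDisjoint=true reports the problem as null rather than throwing.
    System.out.println(fileFence.clip(tabletRange, true)); // expected: null
  }
}
```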
Ok. Ideally the split code would only move files to a new child tablet if the file had data in the range. The split code currently looks at each file's first and last key to make this determination, so it seems like it should work. Need to figure out why it's not.
| StoredTabletFile newFile = StoredTabletFile.of(existing.getPath(), fenced);
| m.put(key.getColumnFamily(), newFile.getMetadataText(), value);
| movedMetadataEntries.add(newFile.getMetadata());
| }
Maybe there should be an else that logs something or throws an exception.
This bug with the split is due to getLastKey() returning a key outside the range (the key after, since the last key is exclusive), so the split incorrectly includes the file in the wrong tablet. The simplest thing was to just require that the end key of the range contain the 0x00 byte (which is generated automatically in some cases) so we can figure out the last real key in the range. This fix is part of #3710
| // Process deletes for the highest Tablet for any files that are not part of
| // movedMetadataEntries
| Sets.difference(possibleMetadataDeletes, movedMetadataEntries).forEach(delete -> {
What input data would cause possibleMetadataDeletes and movedMetadataEntries to not be disjoint sets?
This is another side effect from the disjoint issue as explained in #3640 (comment)
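For reference, Sets.difference here is the Guava view of everything in the first set that is not in the second; a tiny sketch with made-up file names (the real sets hold StoredTabletFile metadata entries, not bare names):

```java
import com.google.common.collect.Sets;
import java.util.Set;

public class MetadataDeleteSketch {
  public static void main(String[] args) {
    Set<String> possibleMetadataDeletes = Set.of("F0000001.rf", "A0000007.rf", "A0000008.rf");
    Set<String> movedMetadataEntries = Set.of("A0000007.rf", "A0000008.rf");

    // Everything in the first set that is not in the second; in the merge code these
    // are the old file entries that were not re-added with a new fenced range and so
    // still need a delete mutation.
    Sets.difference(possibleMetadataDeletes, movedMetadataEntries)
        .forEach(System.out::println); // prints F0000001.rf
  }
}
```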
Thanks for explaining that concurrency model better. I assumed it was safe (otherwise merge wouldn't work at all) but was trying to figure out why, because I kept thinking "what happens if a tablet server mutates the metadata for this tablet while the Manager is running merge", and obviously the fact that the tablets are unassigned is the key part. I knew they were unassigned (I saw it happening in the code and saw it when testing) but for some reason my brain wasn't connecting that part when thinking about race conditions and concurrency issues. Obviously if the tablet is not hosted and only the Manager is doing things, it is safe, since there's a write lock and it's the only process doing things. For elasticity that will be interesting, since we should be able to merge offline tablets that are on demand (currently you can't merge offline tables). And there will be multiple managers, etc., so we will have to see how it works out, I guess.
That makes sense. Maybe it would be better to re-write the mergeMetadataRecords() method to use Ample to do the updates? I wonder if that should be done first in another PR into main. There are other places in TabletGroupWatcher, like deleteTables(), that could also use Ample, but I'm not sure how big a scope to make the change.
In the elasticity branch I would like to eventually move all this merge code from TGW into fate steps. Also will need to make the code use conditional mutations. Would probably use Ample for these changes. I am avoiding doing those changes until this work is done.
I think we can avoid using Ample for now, since I can get the last end row by iterating like I'm doing now; we can target that change for elasticity since it's much more work.
@keith-turner - You can ignore that original comment about the DeleteRowsIT hanging. It turns out the client was getting exceptions trying to load tablets that were offline. Things got screwed up because metadata was being updated for online tablets, but that has been fixed. I have deletes (seemingly) working without chop compactions in a branch that is based off of this PR: It loads the tablets that overlap the start/end of the delete range, and then if the tablets share a boundary with the delete range (end row matches the prev end row) the files are fenced. This is the same condition used to detect when to run a chop compaction. This isn't final because the condition on when metadata is updated is likely going to change when I look at getting rid of splits, which I'll do next. Currently this works because splits run first and create the new tablets on the deletion boundaries. When splits go away in favor of fencing the files, I'll need to update the condition on when fencing happens, but I at least wanted to get this working for now. At some point (when I work on splits) I guess I'll want to create a PR for the combination of no-chop deletes and fencing files instead of splits, but that runs into the chained-PR issue, so we will see how that goes; until then I'll keep it just in a branch.
I am going to go ahead and merge this into the no-chop-merge branch as I am going to create another PR for removing chops and splits for deletion that is based on this. The entire no-chop-merge branch will go through another final review before going back into main so it will be reviewed again.
This PR is to switch merge and delete to no-chop merge and no-chop and no-split deletions. There are some follow-on tasks to do still, but the main functionality is complete. The goal here is to prevent having to run chop compactions when merging (or deleting) tablets, which can be quite slow. The other goal is to get rid of having to split tablets during deletions, to simplify the manager code. Instead, the changes here allow RFiles to be fenced off by valid ranges so that we don't need to compact/split anymore, and when scanning only the valid range is visible. Files can then be cleaned up later with the normal compaction process. This commit contains changes from the following PRs: apache#3418 apache#3480 apache#3614 apache#3621 apache#3640 apache#3710 apache#3728
This commit implements no-chop merge and no-chop and no-split deletions. There are some follow-on tasks to do still, but the main functionality is complete. This change updates merge so chop compactions are no longer used when merging (or deleting) tablets, as that can be quite slow. The other change is that splitting tablets during deletions is no longer necessary or used, which simplifies the manager code. Instead, the changes here allow RFiles to be fenced off by valid ranges so that we don't need to compact/split anymore, and when scanning only the valid range is visible. Files can then be cleaned up later with the normal compaction process. This commit contains changes from the following PRs: #3418 #3480 #3614 #3621 #3640 #3710 #3728 This closes #1327 --------- Co-authored-by: Keith Turner <kturner@apache.org> Co-authored-by: Christopher Tubbs <ctubbsii@apache.org>
This commit disables chop compactions on merge and modifies the tablet metadata update in TabletGroupWatcher to fence off each tablet file to its tablet range as it's copied to the highest tablet on merge. This prevents data from coming back on merge, as scans will be fenced to the provided range on the metadata updates.
There are some updated tests that I added to show that data doesn't come back after merge. It's a very simple test for now, but it at least shows the fencing working after updating metadata. Running MergeIT will print out metadata for the new tests and also for the existing mergeTest() to help show the changes being made to the ranges in the metadata table.
This is part of #1327