-
Notifications
You must be signed in to change notification settings - Fork 3.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HBASE-22627 Port HBASE-22617 (Recovered WAL directories not getting cleaned up) to branch-1 #339
Conversation
9ed92f3
to
ced4487
Compare
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
ae1d198
to
fc7dc76
Compare
Updated patch fixes javadoc and checkstyle warn from precommit, also a compile problem not caught by my IDE (unclean workspace I guess) |
I missed the original patch, but let me review this here (and try to remember why this was done originally). |
@z-york This fixes an issue where a path component to the wal dir is missing after HBASE-20734, so we get files under /<root>/<namespace> instead of /<root>/data/<namespace> with the default config, and the erroneous location is not expected by the hfile archiver or table delete procedure, so they are never cleaned up. We had a production incident where we hit the HDFS directory 1 million entry limit and couldn't online regions. |
@@ -640,19 +640,26 @@ Path commitDaughterRegion(final HRegionInfo regionInfo) | |||
/** | |||
* Create the region splits directory. | |||
*/ | |||
void createSplitsDir() throws IOException { | |||
void createSplitsDir(HRegionInfo daughterA, HRegionInfo daughterB) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is unclear to me how this is related to the rest of the changes. Could you provide some insight?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Apache9 included this in the branch-2/master patch, including it here in the backport too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this fixed an issue with a hbck test so precommit for the branch-2 version could be green. The daughter directories would not always be created. While technically an unrelated change we allowed it there and should allow it here. Or do an addendum to break it out into a separate commit (here and in branch-2 and up).
// Delete the directory on wal filesystem | ||
FileSystem walFs = mfs.getWALFileSystem(); | ||
Path tableWALDir = FSUtils.getWALTableDir(env.getMasterConfiguration(), tableName); | ||
if (walFs.exists(tableWALDir) && !walFs.delete(tableWALDir, true)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this have any unintended side-effects(rollbacks?)? Will this eventually be cleaned up by the cleaners?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are no rollback provisions for an IO exception late in the delete procedure, that is true. It's specific to branch-1 because branch-2 and up uses the procedure state machine to manage this better. Perhaps we could look at this in a follow up issue.
I added this at the very tail of the procedure so if there is an IO exception here at least it did not interfere with the deletion of the table directories. However if there is an IO exception deleting the table directories, we don't reach this code. We have this concern with the master procedures in branch-1 in a number of places.
💔 -1 overall
This message was automatically generated. |
+1 from me. It looks like the test failure was due to timeout. Can you run it locally to verify before committing? |
💔 -1 overall
This message was automatically generated. |
fc7dc76
to
cd946e8
Compare
TestReplicationDroppedTables timeout could be related, but it is a known flake and doesn't seem healthy in latest branch-1 when I run it locally. I think it should be removed. I filed HBASE-22629 for followup. Pushed an update to fix the reported checkstyle and javadoc issues. |
cd946e8
to
0c297b2
Compare
…leaned up) to branch-1 HBASE-22617 Recovered WAL directories not getting cleaned up (Duo Zhang)
Latest patch drops the parts of the branch-2+ patch which automatically clean up the WAL dirs. To be conservative I worry about the potential impact to replication. We will still delete the wrong region dir upon open after replay. Only recovered edits files should have been written there. |
💔 -1 overall
This message was automatically generated. |
TestVisibilityLabelsWithACL failure is unrelated. |
Thanks @z-york , merging now |
💔 -1 overall
This message was automatically generated. |
It should be safe as you said, only recovered edits will be rewritten there, and replication will not read recovered edits. |
…leaned up) to branch-1 (#339) HBASE-22617 Recovered WAL directories not getting cleaned up (Duo Zhang) Signed-off-by: Zach York <zyork@apache.org>
…leaned up) to branch-1 (#339) HBASE-22617 Recovered WAL directories not getting cleaned up (Duo Zhang) Signed-off-by: Zach York <zyork@apache.org>
HBASE-22617 Recovered WAL directories not getting cleaned up (Duo Zhang)