HBASE-25297 [HBCK2] Regenerate missing table descriptors by hbck2 #79
We recently ran into a situation where many ServerCrashProcedures were blocked because their AssignProcedures were blocked on missing .tabledesc files: java.io.IOException: Missing table descriptor for dd0676f57bdbff5d04ab735b7daf5e9b. In our case it was OK to get rid of these tables, so we used setRegionState to move all their regions to the FAILED_OPEN state, then disabled and dropped the tables. But this took a lot of time, and we might not always have the option to drop the tables. HBCK 1 had a feature (fixTableOrphans) that regenerated the table descriptors from the master's in-memory cache or from the table directory structure on HDFS. In this patch I implemented the same logic for HBCK 2. I created tests for the new feature and also tested it on a real HBase 2.6.1 cluster.
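When the master has no cached copy, the fallback has to infer the descriptor from what is on disk. The following is a minimal, stdlib-only sketch of that directory-scanning step, not the patch's code. It assumes the layout tableDir/<region>/<family>/, and it assumes that hidden entries (.tabledesc, .tmp, .regioninfo) and recovered.edits are not column families. FamilyDiscoverySketch is a hypothetical name, and it uses java.nio.file instead of Hadoop's FileSystem API only to stay dependency-free.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.stream.Stream;

/**
 * Hypothetical helper (not part of HBCK2): infers column family names from a
 * table directory assumed to be laid out as tableDir/<region>/<family>/.
 */
final class FamilyDiscoverySketch {
  private FamilyDiscoverySketch() {}

  static SortedSet<String> discoverFamilies(Path tableDir) {
    SortedSet<String> families = new TreeSet<>();
    for (Path regionDir : visibleSubdirectories(tableDir)) {
      for (Path familyDir : visibleSubdirectories(regionDir)) {
        String name = familyDir.getFileName().toString();
        // Assumption: recovered.edits holds edits to replay, not a column family.
        if (!name.equals("recovered.edits")) {
          families.add(name);
        }
      }
    }
    return families;
  }

  /** Lists child directories whose names do not start with '.' (skips .tabledesc, .tmp). */
  private static List<Path> visibleSubdirectories(Path dir) {
    List<Path> result = new ArrayList<>();
    try (Stream<Path> children = Files.list(dir)) {
      children.filter(Files::isDirectory)
          .filter(p -> !p.getFileName().toString().startsWith("."))
          .forEach(result::add);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return result;
  }
}
```

A descriptor built from these names alone carries only default family settings, which is why the master's cached copy, when present, is the better source.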
💔 -1 overall
This message was automatically generated.
hbase-hbck2/README.md
@@ -200,6 +200,26 @@ Command:
for how to generate new report.
SEE ALSO: reportMissingRegionsInMeta

generateMissingTableDescriptorFile <TABLENAME>
Trying to fix an orphan table by generating a missing table descriptor
file. This command will have no affect if the table folder is missing
nit: "effect", not "affect"
thanks
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.FSTableDescriptors;
import org.apache.hadoop.hbase.util.FSUtils;
FSTableDescriptors and FSUtils are IA.Private. This has frequently caused problems keeping operator tools compiling, or even compatible at runtime. To solve that, we have been duplicating these utility classes in the operator-tools project. See HBCKFsUtils and HBCKMetaTableAccessor for reference.
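To illustrate the duplication approach, here is a hypothetical, self-contained helper that resolves a table directory without depending on IA.Private classes. TableDirSketch is not the real HBCKFsUtils. It assumes the HBase 2 on-disk layout <hbase.rootdir>/data/<namespace>/<qualifier> and the "namespace:qualifier" naming convention, with "default" as the implicit namespace. It uses java.nio.file.Path instead of Hadoop's Path only to stay dependency-free.

```java
import java.nio.file.Path;

/**
 * Hypothetical local copy of a tiny path helper, so the tool does not link
 * against IA.Private HBase classes that may change between releases.
 */
final class TableDirSketch {
  // Assumed to match the HBase 2 layout: <rootdir>/data/<namespace>/<qualifier>.
  static final String BASE_NAMESPACE_DIR = "data";
  static final String DEFAULT_NAMESPACE = "default";

  private TableDirSketch() {}

  /** Accepts "qualifier" or "namespace:qualifier". */
  static Path getTableDir(Path rootDir, String tableName) {
    int sep = tableName.indexOf(':');
    String namespace = sep < 0 ? DEFAULT_NAMESPACE : tableName.substring(0, sep);
    String qualifier = sep < 0 ? tableName : tableName.substring(sep + 1);
    return rootDir.resolve(BASE_NAMESPACE_DIR).resolve(namespace).resolve(qualifier);
  }
}
```

The trade-off is duplicated code that must be kept in sync by hand, in exchange for the tool compiling and running against several HBase releases.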
thanks, I'll duplicate these as well
try {
  if (!fs.exists(tableDir)) {
    throw new IllegalStateException("Exiting without changing anything. " +
        "Table folder not exists: " + tableDir);
nit: "Table folder does not exist"
thanks
    fstd.createTableDescriptor(tableDescriptorFromMaster.get(), false);
    LOG.info("Table descriptor written successfully. Orphan table {} fixed.", tableName);
  } else {
    generateDefaultTableInfo(fstd, tableName);
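The branch above prefers the descriptor cached by the master and only falls back to a generated default. Here is a stdlib-only sketch of that choice, under the assumption that a hypothetical Descriptor class standing in for TableDescriptor needs to model only the table name and its family names. DescriptorChoiceSketch and choose are illustrative names, not HBCK2 APIs.

```java
import java.util.Collections;
import java.util.Optional;
import java.util.Set;
import java.util.TreeSet;

final class DescriptorChoiceSketch {
  private DescriptorChoiceSketch() {}

  /** Hypothetical stand-in for TableDescriptor: table name plus family names only. */
  static final class Descriptor {
    final String table;
    final Set<String> families;
    final boolean fromMasterCache;

    Descriptor(String table, Set<String> families, boolean fromMasterCache) {
      this.table = table;
      this.families = Collections.unmodifiableSet(new TreeSet<>(families));
      this.fromMasterCache = fromMasterCache;
    }
  }

  /**
   * Returns the master's cached descriptor when present; otherwise builds a
   * default descriptor from the family names found on disk.
   */
  static Descriptor choose(String table, Optional<Descriptor> fromMasterCache,
      Set<String> familiesOnDisk) {
    return fromMasterCache.orElseGet(() -> new Descriptor(table, familiesOnDisk, false));
  }
}
```

Preferring the cached copy matters because a descriptor regenerated from the directory layout can only recover family names; any non-default family settings (for example TTL, max versions, or compression) would revert to defaults.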
Do we need to refresh the master's cache with the table descriptor? I had a quick look at the master RPC interface and didn't find any available method; maybe it's something we could add in a follow-up Jira.
This is a good idea; I'll create a follow-up Jira. I'll also mention in the usage text that, currently, a master restart might be required to force the cache to reinitialize.
This can definitely be improved further, but it might need some more investigation. During my manual tests I saw the table reappear (shown by the list command) quickly after the missing tableinfo file was generated, so something must have been trying to open the table periodically. However, scan operations on the table failed until I did a rolling restart. (I didn't check the procedures before restarting the cluster; I guess something was still stuck on the RegionServer.)
During my manual tests I saw the table to reappear (shown by the list command) quickly after the missing tableinfo file got generated.
Interesting. Have you tried disabling and re-enabling the table after re-creating the table info?
@wchevreuil thanks for the review and for the comments! I tried to address all the issues. In the end I had to copy quite a few InterfaceAudience.Private functions/classes. (I wonder if we should move these to a different package...) I also fixed all existing checkstyle issues in hbase-hbck2, so …
💔 -1 overall
This message was automatically generated.
🎊 +1 overall
This message was automatically generated.
lgtm