HBASE-24420 Avoid Meaningless Retry Attempts in Unrecoverable Failure #1764
base: master
 * But if we find that the region has not changed, then maxRetries should still
 * be BULKLOAD_MAX_RETRIES_NUMBER, to avoid meaningless retry attempts
 */
if (count != 0 && previousRegionNum != startEndKeys.size())
This is a nice find.
Well, this will help your particular case, but I don't think this is a generic solution.
What if a cluster has an issue like yours (a config issue) and a split also happens during the configured retries? That would make the max retries very large (with as many regions as in your case).
What if we just reset the 'count' within the while loop only when we notice a split happened between the current run and the previous one? If splits happen at regular intervals this could still cause a never-ending loop, so we would still need an upper bound. But blindly making the retry count #regions+1 just after seeing a split is concerning to me.
cc @saintstack - Seems you reviewed the original jira so might remember that context. Any pointers sir?
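A hedged sketch of the idea discussed in this thread (the class and names are hypothetical, not HBase's actual code): reset the attempt counter only when a split is observed between runs, while keeping a hard cap so splits at regular intervals cannot retry forever.

```java
// Illustrative sketch only (not HBase code): reset the attempt counter
// when a split is observed between runs, but enforce a hard upper bound
// on total attempts so continuous splits cannot cause an endless loop.
public final class BoundedRetryPolicy {
  private final int maxRetries;   // e.g. BULKLOAD_MAX_RETRIES_NUMBER
  private final int hardLimit;    // absolute cap on total attempts
  private int count;
  private int totalAttempts;
  private int previousRegionNum = -1;

  public BoundedRetryPolicy(int maxRetries, int hardLimit) {
    this.maxRetries = maxRetries;
    this.hardLimit = hardLimit;
  }

  /** Returns true if one more attempt is allowed, given the current region count. */
  public boolean shouldRetry(int currentRegionNum) {
    if (++totalAttempts > hardLimit) {
      return false; // hard cap: even regular splits cannot loop forever
    }
    if (previousRegionNum != -1 && currentRegionNum != previousRegionNum) {
      count = 0; // a split happened since the last run: reset, don't enlarge
    }
    previousRegionNum = currentRegionNum;
    return ++count <= maxRetries;
  }
}
```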
@anoopsjohn
cc @saintstack
Yes, I agree with you; my above solution is not generic.
After fully reviewing the whole calling path, from 1) my business code, to 2) the HBase client retry logic, and finally to 3) the server processing logic, I think things are clear now and the fix becomes easier.
Firstly, on the RegionServer side, a critical unrecoverable exception is swallowed, which I think is incorrect:
SecureBulkLoadManager.java#297 eats all exceptions thrown from region.bulkLoadHFiles, which is obviously wrong. I checked the commit history of this catch clause and it is a long-standing commit. Whether or not exceptions are thrown impacts the client-side retry logic: only when an exception is thrown can the client get the error information (refer to BulkLoadHFilesTool.java#396) and set up the retry policy.
Secondly, the exception in my incident is not processed correctly: SecureBulkLoadManager.java#L395
The exception information is
java.lang.IllegalArgumentException: Wrong FS: hdfs://warehousestore/tmp/mdm_user_segments_2798fd21-0961-43ff-8bd2-dcf0180c6918/h/d7a380a0cf1f4adbb459e9f0d0cf66c
So I think failures from all HDFS-related code in the method should be caught and rethrown as IOException, instead of only the rename() failure.
After the above 2 issues are handled correctly, the client side will get the error information at BulkLoadHFilesTool.java#396, and from the code we can see that the user can control the retry behavior via the configurations hbase.client.retries.number and hbase.bulkload.retries.retryOnIOException, instead of falling into a retry disaster.
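As a sketch of the client-side decision described above: the two key names are real HBase settings, but the class and the Map-based config below stand in for HBase's actual Configuration API and are purely illustrative (defaults shown are assumptions for the sketch).

```java
import java.io.IOException;
import java.util.Map;

// Hedged sketch: once the server surfaces failures as IOExceptions, the
// client can decide whether and how many times to retry purely from
// configuration. This is NOT HBase's actual client code.
public final class BulkLoadRetryDecision {
  private final Map<String, String> conf;

  public BulkLoadRetryDecision(Map<String, String> conf) {
    this.conf = conf;
  }

  private int maxRetries() {
    return Integer.parseInt(conf.getOrDefault("hbase.client.retries.number", "15"));
  }

  private boolean retryOnIOException() {
    return Boolean.parseBoolean(
        conf.getOrDefault("hbase.bulkload.retries.retryOnIOException", "false"));
  }

  /** Should attempt number {@code attempt} (1-based) be retried after {@code cause}? */
  public boolean shouldRetry(int attempt, Exception cause) {
    if (cause instanceof IOException && !retryOnIOException()) {
      return false; // fail fast: an IOE from the server is not retried by default
    }
    return attempt < maxRetries();
  }
}
```

The point of the sketch: an unrecoverable server-side error (like the Wrong FS misconfiguration) becomes a fast, bounded failure instead of a blind retry loop.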
hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/SecureBulkLoadManager.java
    clusterIds, request.getReplicate());
} catch (Exception e) {
  LOG.error("Failed to complete bulk load", e);
  throw new IOException(e);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exceptions cannot simply be swallowed; that could make the client fall into a potentially infinite retry loop.
Force-pushed from 305ae7b to bcb9620
Force-pushed from 83491d5 to 6517449
@anoopsjohn Would you please take another look at this fix?
I would say you should include the initial change also. That was a nice fix which increased the retry count exactly when needed.
I didn't see what the original change was that @anoopsjohn refers to. This PR looks good to me though. Are we good to commit @anoopsjohn?
Yes, it seems the current PR code is totally different from what we had initially. That was dealing with retry-counter resets and all. So here we are changing the server end to create an IOE and throw it back to the client layer when region#bulkload lands in any Exception. And the client layer is already doing the right thing, deciding whether and how many times to retry. Correct @VicoWu? In that case +1
In https://issues.apache.org/jira/browse/HBASE-14541, the retry logic was changed from a configurable retry count (configuration item hbase.bulkload.retries.number) to the retry logic below, to handle region splits that happen during bulk load:
That change caused another issue in our cluster:
Our table has 2000 regions, and our bulk load failed due to a configuration issue:
This failure was caused by a client misconfiguration and made the RegionServer fail to load the HFile, but the failure was not thrown back to the client as an exception; it was only marked via the loaded field in the BulkLoadHFileResponse.
After this failure, the bulk load fell into a retry disaster, and after roughly 200 retries our HDFS crashed with an OOM.
But in fact, no table split ever happened during the retries.
I think the patch in HBASE-14541 didn't handle the unrecoverable-failure case. In that case (and many problems unrelated to region splits can trigger it), the meaningless retry attempts become a disaster, and the bound is effectively un-configurable because we cannot change the number of regions in our table.
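To make the scale concrete, here is a small illustrative sketch (not HBase code) of a retry bound that is effectively tied to the region count, as described above: with 2000 regions, an unrecoverable failure is attempted about 2000 times before the client gives up.

```java
// Illustrative arithmetic only (hypothetical names, not HBase code):
// a retry bound tied to the region count makes an unrecoverable failure
// on a 2000-region table retry ~2000 times.
public final class RegionCountRetryMath {
  /** A "#regions + 1"-style bound, as discussed in the review above. */
  static int maxRetries(int regionCount) {
    return regionCount + 1;
  }

  /** Attempts an always-failing bulk load makes before giving up. */
  static int attemptsBeforeGivingUp(int regionCount) {
    int attempts = 0;
    while (attempts < maxRetries(regionCount)) {
      attempts++; // every attempt fails for a reason unrelated to splits
    }
    return attempts;
  }
}
```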