HDDS-3234. Fix retry interval default in Ozone client. #698
@bharatviswa504, I would prefer not to change it right away, because it might not hold for all cases. Let's implement a smarter retry policy instead.
@bshashikant I agree with you. This is not a permanent solution; it is a temporary fix until exponential backoff with an exception-based retry policy is implemented. With the current default of 1s, we see that the system does a lot of retries and the queue quickly reaches its maximum size. After changing it to 15s, we observed that the queue size stays under control, peaking at around 200. Do you see any issues with changing to 15s?
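For readers unfamiliar with the exponential backoff mentioned above, here is a minimal sketch of how such a delay could be computed. It is not the Ozone or Ratis implementation; the class name `ExponentialBackoffSketch`, the method `delayFor`, and the choice of 1s as the base and 15s as the cap are assumptions made purely for illustration, borrowing the two interval values discussed in this thread.

```java
import java.time.Duration;

// Illustrative only: a hypothetical helper, not part of Ozone or Ratis.
final class ExponentialBackoffSketch {

  // Returns base * 2^attempt, capped at `cap`. Attempt numbering starts at 0.
  static Duration delayFor(int attempt, Duration base, Duration cap) {
    if (attempt < 0) {
      throw new IllegalArgumentException("attempt must be >= 0");
    }
    long delayMs = base.toMillis();
    long capMs = cap.toMillis();
    // Double once per attempt, stopping early once the cap is reached
    // so the value cannot grow without bound.
    for (int i = 0; i < attempt && delayMs < capMs; i++) {
      delayMs *= 2;
    }
    return Duration.ofMillis(Math.min(delayMs, capMs));
  }

  public static void main(String[] args) {
    Duration base = Duration.ofSeconds(1);   // assumed: the old 1s default
    Duration cap = Duration.ofSeconds(15);   // assumed: the proposed 15s value
    for (int attempt = 0; attempt <= 5; attempt++) {
      System.out.println("attempt " + attempt + " -> "
          + delayFor(attempt, base, cap).toMillis() + " ms");
    }
  }
}
```

With these assumed values the first retries stay short (1s, 2s, 4s, 8s) and only repeated failures reach the 15s ceiling, which is the trade-off the fixed 15s default cannot offer.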
In various other components outside of Ozone, I have seen a retry policy of 60s. Considering that, 15s is still reasonable for now.
@bharatviswa504, the default retry policy would make it sleep for 15s even when a request fails with NotLeader or LeaderNotReady, or in general with any intermittent I/O exception from the network. What we can do instead for now is enforce ExceptionBasedRetryPolicy for Ratis in Ozone and make it 15s for ResourceUnavailable, which can be changed to an exponential backoff retry policy in Ozone later, and make it 3s or so for other exceptions. What do you think?
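The idea in the comment above can be sketched as a lookup from exception type to sleep interval. This is a self-contained illustration, not the Ratis API: the class `ExceptionBasedRetrySketch`, its methods, and the nested exception classes are hypothetical stand-ins for the exceptions named in the discussion. The 15s and 3s values are the ones suggested in the comment.

```java
import java.io.IOException;
import java.time.Duration;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: hypothetical names, not the real Ratis or Ozone classes.
final class ExceptionBasedRetrySketch {

  // Stand-ins for the failure types mentioned in the discussion.
  static class ResourceUnavailableException extends IOException {}
  static class NotLeaderException extends IOException {}
  static class LeaderNotReadyException extends IOException {}

  private final Map<Class<? extends Throwable>, Duration> sleepByException =
      new LinkedHashMap<>();
  private final Duration defaultSleep;

  ExceptionBasedRetrySketch(Duration defaultSleep) {
    this.defaultSleep = defaultSleep;
  }

  // Registers a sleep interval for one exception type (and its subclasses).
  ExceptionBasedRetrySketch on(Class<? extends Throwable> type, Duration sleep) {
    sleepByException.put(type, sleep);
    return this;
  }

  // Picks the first registered interval that matches, else the default.
  Duration sleepFor(Throwable failure) {
    for (Map.Entry<Class<? extends Throwable>, Duration> e
        : sleepByException.entrySet()) {
      if (e.getKey().isInstance(failure)) {
        return e.getValue();
      }
    }
    return defaultSleep;
  }

  // The split proposed in the comment: 15s for ResourceUnavailable, 3s otherwise.
  static ExceptionBasedRetrySketch proposed() {
    return new ExceptionBasedRetrySketch(Duration.ofSeconds(3))
        .on(ResourceUnavailableException.class, Duration.ofSeconds(15));
  }

  public static void main(String[] args) {
    ExceptionBasedRetrySketch policy = proposed();
    System.out.println(policy.sleepFor(new ResourceUnavailableException()));
    System.out.println(policy.sleepFor(new NotLeaderException()));
  }
}
```

Under this arrangement a leader change or a transient network error is retried after the short default, while only back-pressure from the server triggers the long wait.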
Thank you @bshashikant for the offline discussion. I will change this to use a request-based retry policy combined with an exception-based retry policy.
/pending "to address comments from @bshashikant"
Marking this issue as un-mergeable as requested.
Please use a /ready comment when it's resolved.
"to address comments from @bshashikant"
Let's get in the simple fix to increase the retry interval and keep working on the more sophisticated retry policy. The retry policy may need more extensive testing and may take more time to stabilize.
apache#698)"" This reverts commit f5fa408.
What changes were proposed in this pull request?
Change the default retry interval from 1s to 15s.
What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-3234
How was this patch tested?
Tested with this value on a cluster running a billion-object test.