HDDS-3234. Fix retry interval default in Ozone client. #698

Merged
merged 1 commit into from
Mar 23, 2020

Conversation

bharatviswa504
Contributor

What changes were proposed in this pull request?

Change the default retry interval from 1s to 15s.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-3234

How was this patch tested?

Tested with this value on a cluster where we are running a billion-object test.

@bshashikant
Contributor

@bharatviswa504, I would prefer not to change it right away, because it might not hold for all cases. Let's implement a smarter retry policy instead.

@bharatviswa504
Contributor Author

bharatviswa504 commented Mar 19, 2020

@bshashikant I agree with you. This is not a permanent solution; it is a temporary fix until exponential backoff with an exception-based retry policy is implemented. With the current default of 1s, we see that the system is doing a lot of retries, and the queue reaches its maximum size very quickly. After changing it to 15s, we observed that the queue size stays under control, peaking at around 200.

Do you see any issues with changing to 15s?
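
As a rough illustration of the exponential backoff mentioned above as the longer-term direction, the sketch below doubles the sleep time on each retry and caps it at a maximum. The class and method names are hypothetical and are not part of Ozone or Ratis; the 1s base and 15s cap are borrowed from the values discussed in this PR purely as example inputs.

```java
import java.time.Duration;

// Hypothetical helper, not an Ozone or Ratis class.
final class ExponentialBackoffSketch {

    // Returns base * 2^attempt, capped at max. attempt is 0 for the first retry.
    // Assumes max is small enough that doubling it does not overflow a long.
    static Duration delay(int attempt, Duration base, Duration max) {
        if (attempt < 0) {
            throw new IllegalArgumentException("attempt must be >= 0");
        }
        long maxMs = max.toMillis();
        long delayMs = Math.min(base.toMillis(), maxMs);
        // Stop doubling once the cap is reached.
        for (int i = 0; i < attempt && delayMs < maxMs; i++) {
            delayMs = Math.min(delayMs * 2, maxMs);
        }
        return Duration.ofMillis(delayMs);
    }
}
```

With a 1s base and a 15s cap, successive retries would sleep 1s, 2s, 4s, 8s and then 15s, so short-lived failures are retried quickly while sustained overload still backs off to the interval proposed here.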

@dineshchitlangia
Contributor

dineshchitlangia commented Mar 19, 2020

In various other components outside of Ozone, I have seen a retry interval of 60s. Considering that, 15s is still reasonable for now.

@bshashikant
Contributor

bshashikant commented Mar 19, 2020

@bharatviswa504, the default retry policy would make it sleep for 15 sec even when a request fails with NotLeader or LeaderNotReady, or in general on any intermittent IOException from the network. What we can do instead for now is enforce ExceptionBasedRetryPolicy for Ratis in Ozone: make it 15s for ResourceUnavailable (which can be changed to an exponential backoff retry policy in Ozone later) and 3s or so for other exceptions. What do you think?
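
The idea in this comment, picking the sleep time from the type of failure, can be sketched in plain Java as below. This is only an illustration under stated assumptions: it does not use the Ratis API, the exception classes are hypothetical stand-ins for the Ratis exceptions named above, and the 15s and 3s values are the ones suggested in the comment.

```java
import java.io.IOException;
import java.time.Duration;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not the Ratis ExceptionBasedRetryPolicy implementation.
final class ExceptionBasedRetrySketch {

    // Stand-ins for the Ratis exception types mentioned in the discussion.
    static class ResourceUnavailableException extends IOException {}
    static class NotLeaderException extends IOException {}

    private final Map<Class<? extends Throwable>, Duration> intervals = new LinkedHashMap<>();
    private final Duration defaultInterval;

    ExceptionBasedRetrySketch(Duration defaultInterval) {
        this.defaultInterval = defaultInterval;
    }

    // Registers a sleep interval for one exception type (and its subclasses).
    ExceptionBasedRetrySketch on(Class<? extends Throwable> type, Duration interval) {
        intervals.put(type, interval);
        return this;
    }

    // Returns the interval of the first registered type that matches the cause,
    // or the default interval when nothing matches.
    Duration sleepFor(Throwable cause) {
        for (Map.Entry<Class<? extends Throwable>, Duration> e : intervals.entrySet()) {
            if (e.getKey().isInstance(cause)) {
                return e.getValue();
            }
        }
        return defaultInterval;
    }

    // The split suggested above: 15s when the server is overloaded, 3s otherwise.
    static ExceptionBasedRetrySketch proposed() {
        return new ExceptionBasedRetrySketch(Duration.ofSeconds(3))
            .on(ResourceUnavailableException.class, Duration.ofSeconds(15));
    }
}
```

With this split, a leader change or a transient network error is retried after 3s, while only the overload case pays the 15s wait that keeps the request queue from filling up.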

@bharatviswa504
Contributor Author

Thank you @bshashikant for the offline discussion. I will change this to use a request-based retry policy combined with an exception-based retry policy.

@bharatviswa504
Contributor Author

/pending "to address comments from @bshashikant"

@github-actions github-actions bot left a comment

Marking this issue as un-mergeable as requested.

Please use /ready comment when it's resolved.

"to address comments from @bshashikant"

@arp7
Contributor

arp7 commented Mar 19, 2020

Let's get in the simple fix to increase the retry interval and keep working on the more sophisticated retry policy. The retry policy may need more extensive testing and may take more time to stabilize.

@bshashikant bshashikant merged commit c64d86f into apache:master Mar 23, 2020
isahkemat pushed a commit to isahkemat/hadoop-ozone that referenced this pull request Mar 29, 2020
elek added a commit that referenced this pull request Mar 30, 2020
elek added a commit that referenced this pull request Mar 30, 2020
elek added a commit to elek/ozone that referenced this pull request Mar 30, 2020
elek added a commit to elek/ozone that referenced this pull request Apr 8, 2020