HDDS-3234. Fix retry interval default in Ozone client. #698

bharatviswa504 · 2020-03-19T04:48:08Z

What changes were proposed in this pull request?

Change the default retry interval from 1s to 15s.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-3234

How was this patch tested?

Tested with this value on a cluster where we are running a billion-object test.

bshashikant · 2020-03-19T04:55:19Z

@bharatviswa504, I would prefer not to change it right away, because it might not hold good for all cases. Let's implement a smarter retry policy instead.

bharatviswa504 · 2020-03-19T05:02:03Z

@bshashikant I agree with you. This is not a permanent solution; it is a temporary fix until exponential backoff with an exception-based retry policy is implemented. With the current default of 1s, we see the system doing a lot of retries, and the queue reaches its maximum size very quickly. After changing the interval to 15s, we observed that the queue size stays under control, peaking at around 200.

Do you see any issues with changing to 15s?
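
For illustration only (this is not Ozone or Ratis code): a minimal Python sketch of how many retries a single failing request generates under a fixed interval versus a capped exponential backoff. The function names, the 60-second window, and the 15-second cap are assumptions chosen to mirror the values discussed in this thread.

```python
def fixed_schedule(interval_s, window_s):
    """Times (in seconds) at which retries fire within window_s for a fixed interval."""
    return list(range(interval_s, window_s + 1, interval_s))


def exponential_schedule(base_s, cap_s, window_s):
    """Retry times when the sleep doubles after each attempt, capped at cap_s."""
    times, now, sleep = [], 0, base_s
    while now + sleep <= window_s:
        now += sleep
        times.append(now)
        # Double the sleep for the next attempt, but never exceed the cap.
        sleep = min(sleep * 2, cap_s)
    return times
```

Over a 60-second window, `fixed_schedule(1, 60)` yields 60 retries and `fixed_schedule(15, 60)` yields 4, while `exponential_schedule(1, 15, 60)` yields 7 retries that start quickly and then settle at the 15-second spacing.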

dineshchitlangia · 2020-03-19T05:25:12Z

In various other components outside of Ozone, I have seen a retry policy of 60s. Considering that, 15s is still reasonable for now.

bshashikant · 2020-03-19T05:34:38Z

@bharatviswa504, the default retry policy would make the client sleep for 15s even when a request fails with NotLeader or LeaderNotReady or, more generally, with any intermittent IO exception from the network. What we can do for now instead is enforce ExceptionBasedRetryPolicy for Ratis in Ozone: use 15s for ResourceUnavailable (which can later be changed to an exponential backoff retry policy in Ozone) and 3s or so for other exceptions. What do you think?
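
For illustration only: a minimal Python sketch of the idea proposed above, where the sleep time depends on the exception that failed the request. It does not use the Ratis API; the class names `ExceptionDependentRetry`, `ResourceUnavailable`, and `NotLeader` are hypothetical stand-ins, and the 15s and 3s values are the ones suggested in the comment.

```python
class ResourceUnavailable(Exception):
    """Hypothetical stand-in for a 'resource unavailable' failure."""


class NotLeader(Exception):
    """Hypothetical stand-in for a request sent to a non-leader node."""


class ExceptionDependentRetry:
    """Chooses a sleep time from the type of the exception that failed the request."""

    def __init__(self, default_sleep_s, per_exception):
        self.default_sleep_s = default_sleep_s
        self.per_exception = dict(per_exception)

    def sleep_for(self, exc):
        # Walk the exception's class hierarchy so subclasses inherit a mapping.
        for cls in type(exc).__mro__:
            if cls in self.per_exception:
                return self.per_exception[cls]
        return self.default_sleep_s


# 15s when the server is overloaded, 3s for everything else (assumed values).
policy = ExceptionDependentRetry(
    default_sleep_s=3,
    per_exception={ResourceUnavailable: 15},
)
```

With this mapping, `policy.sleep_for(ResourceUnavailable())` returns 15, while a leader-change or network error such as `NotLeader()` falls through to the 3-second default, so transient failures are not penalized with the long sleep.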

bharatviswa504 · 2020-03-19T06:04:21Z

Thank you @bshashikant for the offline discussion. I will change this to use a request-based retry policy combined with an exception-based retry policy.

bharatviswa504 · 2020-03-19T06:04:45Z

/pending "to address comments from @bshashikant"

github-actions

Marking this issue as un-mergeable as requested.

Please use /ready comment when it's resolved.

"to address comments from @bshashikant"

arp7 · 2020-03-19T20:12:37Z

Let's get in the simple fix to increase the retry interval and keep working on the more sophisticated retry policy. The retry policy may need more extensive testing and may take more time to stabilize.

HDDS-3234. Fix retry interval default in Ozone client.

a49848f

bharatviswa504 requested review from arp7 and bshashikant March 19, 2020 04:48

github-actions bot requested changes Mar 19, 2020

bshashikant approved these changes Mar 23, 2020

bshashikant merged commit c64d86f into apache:master Mar 23, 2020

isahkemat pushed a commit to isahkemat/hadoop-ozone that referenced this pull request Mar 29, 2020

HDDS-3234. Fix retry interval default in Ozone client. (apache#698)

9c829fb

elek added a commit that referenced this pull request Mar 30, 2020

Revert "HDDS-3234. Fix retry interval default in Ozone client. (#698)"

7e93d34

This reverts commit c64d86f.

elek added a commit that referenced this pull request Mar 30, 2020

Revert "HDDS-3234. Fix retry interval default in Ozone client. (#698)"

f5fa408

This reverts commit c64d86f.

elek added a commit to elek/ozone that referenced this pull request Mar 30, 2020

Revert "Revert "HDDS-3234. Fix retry interval default in Ozone client. (apache#698)""

3ca9817

This reverts commit f5fa408.

elek added a commit to elek/ozone that referenced this pull request Apr 8, 2020

Revert "Revert "HDDS-3234. Fix retry interval default in Ozone client. (apache#698)""

0db7a91

This reverts commit f5fa408.

elek mentioned this pull request Apr 8, 2020

HDDS-3234. Fix retry interval default in Ozone client #785

Closed
