Improve fabric_util get_db timeout logic #3734

Merged: 1 commit merged into 3.x from improve-get-db-timeouts on Sep 9, 2021

Conversation

@nickva (Contributor) commented on Sep 8, 2021

Previously, users with low {Q, N} dbs often got the "No DB shards could be opened." error when the cluster was overloaded. The hard-coded 100 msec timeout was too low to open the few available shards, so the whole request would crash with a 500 error.

Attempt to calculate an optimal timeout value based on the number of shards and the max fabric request timeout limit.

The sequence of doubling (by default) timeouts forms a geometric progression. Use the well-known closed-form formula for its sum [0], together with the maximum request timeout, to calculate the initial timeout. The test case illustrates a few examples with some default Q and N values; see also the sketch below.

Because opening shards takes time, the initial timeout should not be too low; otherwise we would quickly cycle through the first few shards and discard their results. The minimum initial timeout is therefore clipped at the previously hard-coded 100 msec value. Unlike before, however, this minimum can now also be configured.

[0] https://en.wikipedia.org/wiki/Geometric_series
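
To make the derivation concrete, here is a minimal, illustrative Erlang sketch of the calculation described above. The module and function names (get_db_timeout_sketch, initial_timeout/4) and the use of the raw shard count as the series length are assumptions for illustration only; the actual implementation lives in src/fabric/src/fabric_util.erl.

```erlang
%% Illustrative sketch only; names and defaults here are assumptions, not the
%% code in this PR.
-module(get_db_timeout_sketch).
-export([initial_timeout/4]).

%% With Shards attempts and a per-attempt timeout that grows by Factor
%% (doubling by default), the attempt timeouts form a geometric series whose
%% sum is Init * (Factor^Shards - 1) / (Factor - 1). Setting that sum equal
%% to the maximum request timeout and solving for Init gives the initial
%% timeout, clipped to the configurable minimum (previously a hard-coded
%% 100 msec).
initial_timeout(_Shards, _Factor, MinTimeout, infinity) ->
    %% No overall request deadline: fall back to the configured minimum.
    MinTimeout;
initial_timeout(Shards, Factor, MinTimeout, MaxTimeout) when Factor > 1 ->
    SeriesSum = (math:pow(Factor, Shards) - 1) / (Factor - 1),
    max(MinTimeout, trunc(MaxTimeout / SeriesSum)).
```

For example, with a single shard copy (Q=1, N=1) and a 60000 msec request timeout, initial_timeout(1, 2, 100, 60000) returns 60000, while with 24 shard copies (Q=8, N=3) the computed value drops below the floor and the 100 msec minimum applies.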

Fixes: #3733

@rnewson (Member) left a comment


The basic idea of allowing the hard-coded 100 msec initial timeout to be changed is definitely useful, but it should be presented in isolation. The other changes can then be debated in other PRs.

src/fabric/src/fabric_util.erl: review thread (outdated, resolved)
src/fabric/src/fabric_util.erl: review thread (resolved)
@nickva merged commit 4ea9f1e into 3.x on Sep 9, 2021
@nickva deleted the improve-get-db-timeouts branch on September 9, 2021 at 14:44