Improve fabric_util get_db timeout logic #3734

Merged: 1 commit merged into 3.x from improve-get-db-timeouts on Sep 9, 2021

Conversation

@nickva (Contributor) commented on Sep 8, 2021

Previously, users with low {Q, N} dbs often got the "No DB shards could be opened." error when the cluster was overloaded. The hard-coded 100 msec timeout was too low to open the few available shards, so the whole request would crash with a 500 error.

Attempt to calculate an optimal timeout value based on the number of shards and the max fabric request timeout limit.

The sequence of doubling (by default) timeouts forms a geometric progression. Use the well-known closed-form formula for its sum [0], together with the maximum request timeout, to calculate the initial timeout. The test case illustrates a few examples with some default Q and N values; see also the sketch below.

Because opening shards takes time, the initial timeout should not be too low; otherwise we would quickly cycle through the first few shards and discard their results. The minimum initial timeout is therefore clipped at the previously hard-coded 100 msec value. Unlike before, however, this minimum can now also be configured.

[0] https://en.wikipedia.org/wiki/Geometric_series
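
To make the derivation concrete, here is a minimal, illustrative Erlang sketch of the calculation described above. The module and function names (get_db_timeout_sketch, initial_timeout/4) and the use of the raw shard count as the series length are assumptions for illustration only; the actual implementation lives in src/fabric/src/fabric_util.erl.

```erlang
%% Illustrative sketch only; names and defaults here are assumptions, not the
%% code in this PR.
-module(get_db_timeout_sketch).
-export([initial_timeout/4]).

%% With Shards attempts and a per-attempt timeout that grows by Factor
%% (doubling by default), the attempt timeouts form a geometric series whose
%% sum is Init * (Factor^Shards - 1) / (Factor - 1). Setting that sum equal
%% to the maximum request timeout and solving for Init gives the initial
%% timeout, clipped to the configurable minimum (previously a hard-coded
%% 100 msec).
initial_timeout(_Shards, _Factor, MinTimeout, infinity) ->
    %% No overall request deadline: fall back to the configured minimum.
    MinTimeout;
initial_timeout(Shards, Factor, MinTimeout, MaxTimeout) when Factor > 1 ->
    SeriesSum = (math:pow(Factor, Shards) - 1) / (Factor - 1),
    max(MinTimeout, trunc(MaxTimeout / SeriesSum)).
```

For example, with a single shard copy (Q=1, N=1) and a 60000 msec request timeout, initial_timeout(1, 2, 100, 60000) returns 60000, while with 24 shard copies (Q=8, N=3) the computed value drops below the floor and the 100 msec minimum applies.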

Fixes: #3733

@rnewson (Member) left a comment


The basic idea of allowing the hard-coded 100 msec initial timeout to be changed is definitely useful, but it should be presented in isolation. The other changes can then be debated in other PRs.

src/fabric/src/fabric_util.erl: review thread (outdated, resolved)
src/fabric/src/fabric_util.erl: review thread (resolved)
@nickva merged commit 4ea9f1e into 3.x on Sep 9, 2021
@nickva deleted the improve-get-db-timeouts branch on September 9, 2021 at 14:44