When a CouchDB node is overloaded by a high number of concurrent write and read requests (exact numbers vary), debug logs fill with `Failed to open shard after 100` messages, immediately followed by `No DB shards could be opened` errors. As a consequence, write requests fail. These tests were conducted in single-node mode (one replica) with the default sharding of 8 shards (`n=1, q=8`). The test machine had 8 cores and 16 GB of RAM.
After checking `fabric_rpc:open_shard`, I noticed that all of its traffic goes to `couch_server:open`, which caches references to open database handles in an ETS table with `read_concurrency`, so this part seems fine. The only relatively heavy code there is `update_lru`, but we have the `update_lru_on_read` option set to `false` (the default anyway), so all is fine there as well. I haven't found any stats reported for `couch_dbs` cache misses, so I can't claim with 100% confidence that there are none in this case. Summarising, I can only assume that general overload of the `couch_server` process on a single node caused longer response times from `fabric_rpc:open` -> `couch_db:open` on particular shards. As we can see here, the timeout for an open-shard response is hardcoded to 100 ms, which makes it impossible to increase.
UPD: Load tests demonstrated that providing a bigger timeout value (200-300 ms) does indeed help avoid `No DB shards could be opened` errors under heavy load. Of course, I observed higher latency when CouchDB processes read/write operations under heavy constant load of a heterogeneous nature (bursty concurrent writes combined with a constant 200 writes/sec plus a concurrent read of all docs). Nevertheless, this is much better than outright failed requests.
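To make the effect of the hardcoded deadline concrete, here is a toy model (not CouchDB code; the latency values below are invented purely for illustration) that counts how many shard-open responses would miss a 100 ms deadline versus a 300 ms one:

```python
# Toy model: count shard-open responses that would exceed a given timeout.
# The latency values are hypothetical, not measured from CouchDB.
def count_timeouts(latencies_ms, timeout_ms):
    """Return how many responses arrive later than timeout_ms."""
    return sum(1 for latency in latencies_ms if latency > timeout_ms)

# Hypothetical per-shard open latencies under heavy load (milliseconds).
latencies = [20, 45, 150, 90, 260, 110, 70, 180]

print(count_timeouts(latencies, 100))  # 4 opens miss the hardcoded 100 ms
print(count_timeouts(latencies, 300))  # 0 opens miss a 300 ms deadline
```

With the same slow-but-succeeding shards, the larger deadline turns failed requests into merely slower ones, which matches what the load tests showed.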
Desired Behaviour
There is a `shard_timeout_factor` setting to configure open-shard backoff behaviour, but from what I saw, increasing it didn't help: I still saw only `Failed to open shard after 100` messages instead of the expected exponential increase. It would be more flexible to also make configurable the `Timeout` value that is passed to `get_shard`.
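For clarity, the behaviour I expected from `shard_timeout_factor` is a per-retry exponential increase of the open-shard timeout, roughly as in the sketch below. This is my reading of the intended semantics, not the actual fabric implementation:

```python
# Sketch of the exponential backoff I expected shard_timeout_factor to
# produce: each retry multiplies the base timeout by the factor.
# This models my assumption about the semantics, not fabric's real code.
def backoff_timeouts(base_ms, factor, retries):
    """Return the timeout used on each successive retry attempt."""
    return [base_ms * factor ** attempt for attempt in range(retries)]

print(backoff_timeouts(100, 2, 4))  # [100, 200, 400, 800]
```

In the logs, however, the deadline appeared to stay pinned at 100 ms across attempts, which is why a directly configurable base timeout seems necessary.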
Possible Solution
Add an `open_shard_timeout` option to the fabric config so that this timeout value can be changed.
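Assuming the option lands in the `[fabric]` section, the configuration might look like this (both the option name and its placement are part of the proposal, not an existing CouchDB setting):

```ini
; Proposed (hypothetical) setting: timeout in ms for opening a shard.
; 100 is the value currently hardcoded; load tests suggest 200-300 helps.
[fabric]
open_shard_timeout = 300
```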
cc @janl @rnewson. If this sounds reasonable, I can submit a PR. Perhaps you could also suggest other improvements based on my description of the problem? The number of shards does not really affect the upper limit of writes/reads; as mentioned above, the bottleneck seems to be purely on the `couch_server` side and is most probably caused by some hard limit or congestion there.
@AlexanderKaraberov If you'd like to submit a PR for this for 3.x, please do so - be sure to target the 3.x branch, not master, as obviously this change wouldn't make sense for CouchDB 4.x.