
Make open_shard timeout value configurable #1964

Closed

AlexanderKaraberov opened this issue Mar 4, 2019 · 2 comments

AlexanderKaraberov commented Mar 4, 2019

Summary

When a CouchDB node is overloaded by a high number of concurrent read and write requests (exact numbers vary), the debug log fills with `Failed to open shard ... after 100` messages, immediately followed by `No DB shards could be opened.` errors, which in turn cause write requests to fail. These tests were conducted in single-node mode (no clustering, a single copy of each shard) with the default 8 shards (n=1, q=8); the test machine had 8 cores and 16 GB of RAM.

After checking `fabric_rpc:open_shard` I noticed that all of its traffic goes through `couch_server:open`, which caches references to open database handles in an ETS table with `read_concurrency`, so that part seems fine. The only moderately heavy code path there is `update_lru`, but we have the `update_lru_on_read` option set to `false` (the default anyway), so that is fine as well. I haven't found any stats report for `couch_dbs` cache misses, so I can't claim with 100% confidence that there are no cache misses in this case. Summarising, I can only assume that general overload of the `couch_server` process on a single node caused longer response times from `fabric_rpc:open_shard` -> `couch_db:open` on particular shards. As we can see here, the expected timeout for an open_shard response is hardcoded to 100 ms, which makes it impossible to increase this value.
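For reference, here is a minimal sketch of what that call site looks like (condensed from `fabric_util`; the real function bodies differ in detail, but the 100 is the literal in question):

```erlang
%% Condensed sketch of fabric_util's shard-opening entry point.
%% The backoff factor is configurable, but the initial timeout is a
%% hardcoded 100 ms literal.
get_db(DbName, Options) ->
    Shards = mem3:shards(DbName),
    Factor = list_to_integer(
        config:get("fabric", "shard_timeout_factor", "2")),
    get_shard(Shards, Options, 100, Factor).  % <- hardcoded 100 ms
```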

UPD: Load tests demonstrated that a bigger timeout value (200-300 ms) does indeed help avoid the `No DB shards could be opened.` errors under heavy load. I did, of course, observe higher latency when CouchDB processes read/write operations under heavy, constant load of a heterogeneous nature (bursty concurrent writes combined with a constant 200 writes/sec plus concurrent read-all-docs requests), but this is still much better than outright failed requests.

Desired Behaviour

There is a `shard_timeout_factor` setting to configure the open_shard backoff behaviour, but from what I saw, increasing it didn't help: the log showed only `Failed to open shard ... after 100` messages instead of the expected exponential increase. It would be more flexible to also make configurable the initial `Timeout` value that is passed to `get_shard`; see the sketch below.
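To illustrate the backoff behaviour I expected, here is a condensed sketch of the `get_shard` retry loop (simplified: the real code also monitors the remote node and handles rexi error messages):

```erlang
-include_lib("mem3/include/mem3.hrl").  % for the #shard{} record

get_shard([], _Opts, _Timeout, _Factor) ->
    erlang:error({internal_server_error, "No DB shards could be opened."});
get_shard([#shard{node = Node, name = Name} | Rest], Opts, Timeout, Factor) ->
    MFA = {fabric_rpc, open_shard, [Name, [{timeout, Timeout} | Opts]]},
    Ref = rexi:cast(Node, self(), MFA, [sync]),
    receive
        {Ref, {ok, Db}} ->
            {ok, Db}
    after Timeout ->
        couch_log:debug("Failed to open shard ~p after ~p", [Name, Timeout]),
        %% Each retry multiplies the timeout by the configured factor,
        %% but the series always starts from the hardcoded 100 ms.
        get_shard(Rest, Opts, Factor * Timeout, Factor)
    end.
```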

Possible Solution

Add an `open_shard_timeout` option to the `fabric` config section so that this timeout value can be changed.
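This would be roughly a one-line change at the call site; the `open_shard_timeout` key is my proposed name, not an existing option:

```erlang
%% Proposed: read the initial timeout from the [fabric] config section
%% instead of hardcoding 100 ms. open_shard_timeout is a new key.
Timeout = config:get_integer("fabric", "open_shard_timeout", 100),
get_shard(Shards, Options, Timeout, Factor).
```

Operators could then tune it by setting e.g. `open_shard_timeout = 300` in the `[fabric]` section of local.ini, keeping the current 100 ms as the default.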

cc @janl @rnewson If this sounds reasonable, I can submit a PR. Perhaps you could also suggest other improvements based on my description of the problem? The number of shards does not really affect the upper limit of writes/reads; as mentioned above, the bottleneck seems to be purely on the `couch_server` side and is most probably caused by some hard limit or congestion there.


wohali commented Mar 13, 2020

@AlexanderKaraberov If you'd like to submit a PR for this for 3.x, please do so - be sure to target the 3.x branch, not master, as obviously this change wouldn't make sense for CouchDB 4.x.


nickva commented Nov 4, 2021

This should be fixed with #3734

nickva closed this as completed Nov 4, 2021