When a CouchDB node is overloaded by a high number of concurrent write and read requests (exact numbers vary), debug logs fill with `Failed to open shard after 100` messages, immediately followed by `No DB shards could be opened` errors. As a consequence, write requests fail. These tests were conducted in single-node mode (one replica) with the default sharding of 8 shards (`n=1, q=8`). The test machine had 8 cores and 16 GB of RAM.
After checking `fabric_rpc:open_shard`, I noticed that all of its traffic goes to `couch_server:open`, which caches references to open database handles in an ETS table with `read_concurrency`, so this part seems fine. The only relatively heavy code there is `update_lru`, but we have the `update_lru_on_read` option set to `false` (the default anyway), so all is fine there as well. I haven't found any stats reported for `couch_dbs` cache misses, so I can't claim with 100% confidence that there are none in this case. Summarising, I can only assume that general overload of the `couch_server` process on a single node caused longer response times from `fabric_rpc:open` -> `couch_db:open` on particular shards. As we can see here, the timeout for an open-shard response is hardcoded to 100 ms, which makes it impossible to increase.
UPD: Load tests demonstrated that providing a bigger timeout value (200-300 ms) does indeed help avoid `No DB shards could be opened` errors under heavy load. Of course, I observed higher latency when CouchDB processes read/write operations under heavy constant load of a heterogeneous nature (bursty concurrent writes combined with a constant 200 writes/sec plus a concurrent read of all docs). Nevertheless, this is much better than outright failed requests.
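To make the effect of the hardcoded deadline concrete, here is a toy model (not CouchDB code; the latency values below are invented purely for illustration) that counts how many shard-open responses would miss a 100 ms deadline versus a 300 ms one:

```python
# Toy model: count shard-open responses that would exceed a given timeout.
# The latency values are hypothetical, not measured from CouchDB.
def count_timeouts(latencies_ms, timeout_ms):
    """Return how many responses arrive later than timeout_ms."""
    return sum(1 for latency in latencies_ms if latency > timeout_ms)

# Hypothetical per-shard open latencies under heavy load (milliseconds).
latencies = [20, 45, 150, 90, 260, 110, 70, 180]

print(count_timeouts(latencies, 100))  # 4 opens miss the hardcoded 100 ms
print(count_timeouts(latencies, 300))  # 0 opens miss a 300 ms deadline
```

With the same slow-but-succeeding shards, the larger deadline turns failed requests into merely slower ones, which matches what the load tests showed.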
Desired Behaviour
There is a `shard_timeout_factor` setting to configure open-shard backoff behaviour, but from what I saw, increasing it didn't help: I still saw only `Failed to open shard after 100` messages instead of the expected exponential increase. It would be more flexible to also make configurable the `Timeout` value that is passed to `get_shard`.
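For clarity, the behaviour I expected from `shard_timeout_factor` is a per-retry exponential increase of the open-shard timeout, roughly as in the sketch below. This is my reading of the intended semantics, not the actual fabric implementation:

```python
# Sketch of the exponential backoff I expected shard_timeout_factor to
# produce: each retry multiplies the base timeout by the factor.
# This models my assumption about the semantics, not fabric's real code.
def backoff_timeouts(base_ms, factor, retries):
    """Return the timeout used on each successive retry attempt."""
    return [base_ms * factor ** attempt for attempt in range(retries)]

print(backoff_timeouts(100, 2, 4))  # [100, 200, 400, 800]
```

In the logs, however, the deadline appeared to stay pinned at 100 ms across attempts, which is why a directly configurable base timeout seems necessary.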
Possible Solution
Add an `open_shard_timeout` option to the fabric config so that this timeout value can be changed.
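Assuming the option lands in the `[fabric]` section, the configuration might look like this (both the option name and its placement are part of the proposal, not an existing CouchDB setting):

```ini
; Proposed (hypothetical) setting: timeout in ms for opening a shard.
; 100 is the value currently hardcoded; load tests suggest 200-300 helps.
[fabric]
open_shard_timeout = 300
```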
cc @janl @rnewson. If this sounds reasonable, I can submit a PR. Perhaps you could also suggest other improvements based on my description of the problem? The number of shards does not really affect the upper limit of writes/reads; as mentioned above, the bottleneck seems to be purely on the `couch_server` side and is most probably caused by some hard limit or congestion there.
@AlexanderKaraberov If you'd like to submit a PR for this for 3.x, please do so - be sure to target the 3.x branch, not master, as obviously this change wouldn't make sense for CouchDB 4.x.