Couch server sharding #3366

Conversation
Fun = fun(N, {TimeAcc, OpenAcc}) ->
    {ok, #server{start_time=Time,dbs_open=Open}} =
        gen_server:call(couch_server(N), get_server),
    {max(Time, TimeAcc), Open + OpenAcc} end,
"Time" here is a formatted datetime string, so 'max' is a bit arbitrary. all the couch_server_X's are started at the same time, not sure we care?
        true -> ok
    end;
false ->
    ok
end.

close_lru() ->
No calls to this from outside the module
Huh. Can you explain the failure mode in more detail, @rnewson?
Force-pushed from 79d68cc to 20135cb.
couch_server as a single gen_server has a finite throughput that, in busy clusters, is easily breached, causing a sizeable backlog in the message queue and ultimately leading to failures and errors: timeouts, memory blow-ups, cats and dogs living together.
@@ -343,7 +348,7 @@ all_databases() ->
     {ok, lists:usort(DbList)}.

 all_databases(Fun, Acc0) ->
-    {ok, #server{root_dir=Root}} = gen_server:call(couch_server, get_server),
+    {ok, #server{root_dir=Root}} = gen_server:call(couch_server_1, get_server),
(because root_dir is the same in all of them)
Overall looks good to me. I left a few minor comments, nothing major. We'll want to make sure this is well tested before bringing it in, but overall the structure looks solid. Nice one!
src/couch/src/couch_server.erl (outdated)
handle_config_change("couchdb", "max_dbs_open", _, _, N) ->
    {ok, gen_server:call(couch_server(N), {set_max_dbs_open, ?MAX_DBS_OPEN})};
handle_config_change("couchdb_engines", _, _, _, N) ->
    {ok, gen_server:call(couch_server(N), reload_engines)};
Hmmm... do we want to trigger the config change callbacks N times now? The current approach keeps the code simpler, and I'm being a bit pedantic here, but it might be worth making a dedicated config listener pid that loops over the couch_server pids to trigger updates.
Each of the couch_servers gets the callback and does this once each.
Sure, I just mean it's N config listeners for the same config changes, as each couch_server subscribes to the changes. Like I said, I'm being pedantic here, but we could make a couch_server_config_listener that subscribes once and then updates all the couch_servers in a loop.
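A rough sketch of what that dedicated listener's callback could look like, assuming the couch_server(N) and num_servers() helpers from this PR and the ?MAX_DBS_OPEN default from couch_server.erl (illustrative only, not the code under review):

```erlang
%% Hypothetical couch_server_config_listener callback: a single subscriber
%% fans each relevant config change out to every couch_server shard.
handle_config_change("couchdb", "max_dbs_open", Max, _Persist, State) ->
    Value = try list_to_integer(Max) catch _:_ -> ?MAX_DBS_OPEN end,
    [gen_server:call(couch_server(N), {set_max_dbs_open, Value})
        || N <- lists:seq(1, num_servers())],
    {ok, State};
handle_config_change("couchdb_engines", _Key, _Val, _Persist, State) ->
    [gen_server:call(couch_server(N), reload_engines)
        || N <- lists:seq(1, num_servers())],
    {ok, State};
handle_config_change(_Section, _Key, _Val, _Persist, State) ->
    {ok, State}.
```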
src/couch/src/couch_server.erl (outdated)
    name(BaseName, N);

name(BaseName, #server{} = Srv) ->
    name(BaseName, Srv#server.n);
If we're caching N, should we also just cache the relevant atoms so we're not doing list_to_atom on concatenated strings every time? Probably won't have a major impact given this approach parallelizes the couch_server pids, but it's certainly more work for them to do.
I thought about it but there are a fair few paths where we're not inside the couch_server gen process, but I'll look to see if we can do it tidily.
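One way to do it tidily, as a hypothetical sketch (not part of this PR): build the couch_server_N atoms once at startup and cache them, e.g. in persistent_term, so lookups no longer run list_to_atom on a freshly concatenated string.

```erlang
%% Hypothetical name cache: compute the shard atoms once and store them
%% as a tuple, so couch_server(N) becomes an element/2 lookup.
-define(NAME_CACHE, couch_server_names).

cache_names(NumServers) ->
    Names = [list_to_atom("couch_server_" ++ integer_to_list(N))
             || N <- lists:seq(1, NumServers)],
    persistent_term:put(?NAME_CACHE, list_to_tuple(Names)).

couch_server(N) when is_integer(N) ->
    element(N, persistent_term:get(?NAME_CACHE)).
```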
src/couch/src/couch_server.erl (outdated)
name(BaseName, DbName) when is_binary(DbName) ->
    N = 1 + (erlang:crc32(DbName) rem num_servers()),
1 + erlang:phash2(DbName, num_servers()) might be cleaner here, since we are not really computing a checksum but want to do hashing.
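Spelled out as a sketch, the suggested alternative clause would look like this (phash2/2 already returns a value in [0, Range), so the explicit rem goes away):

```erlang
%% Suggested variant: phash2/2 hashes directly into 0..num_servers()-1.
name(BaseName, DbName) when is_binary(DbName) ->
    N = 1 + erlang:phash2(DbName, num_servers()),
    name(BaseName, N);
```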
May even be a tiny bit faster...
> DbName = crypto:strong_rand_bytes(100).
<<23,241,236,53,92,151,12,217,246,88,65,153,224,214,236,
240,126,243,188,248,53,187,206,165,235,96,24,5,38,...>>
> f(USec), {USec, ok} = timer:tc(fun() -> [erlang:crc32(DbName) rem 10 || I <- lists:seq(1, 1000000)], ok end), USec/1.0e6.
1.499477
> f(USec), {USec, ok} = timer:tc(fun() -> [erlang:phash2(DbName, 10) || I <- lists:seq(1, 1000000)], ok end), USec/1.0e6.
1.331179
Hm, that looks nice, sure. I don't think checksum vs hash is a strong argument though; they're both means to an end.
@nickva I'm having trouble finding a reference, but I've been under the impression that timer:tc is not to be trusted on anonymous functions in the shell and should be tested on a compiled module.
@chewbranca I think they'd both be hit by the same slowness...
I tried a module mostly out of curiosity:
-module(test_hash).

-export([test_crc32/2, test_phash/2]).

test_crc32(N, Val) ->
    T0 = erlang:system_time(),
    do_crc32(N, Val),
    usec_per_cycle(erlang:system_time() - T0, N).

test_phash(N, Val) ->
    T0 = erlang:system_time(),
    do_phash2(N, Val),
    usec_per_cycle(erlang:system_time() - T0, N).

do_crc32(0, _) ->
    ok;
do_crc32(N, Val) ->
    _ = erlang:crc32(Val) rem 10,
    do_crc32(N - 1, Val).

do_phash2(0, _) ->
    ok;
do_phash2(N, Val) ->
    _ = erlang:phash2(Val, 10),
    do_phash2(N - 1, Val).

usec_per_cycle(Native, N) ->
    erlang:convert_time_unit(Native, native, microsecond) / N.
18> test_hash:test_crc32(10000000, DbName).
0.0722587
19> test_hash:test_phash(10000000, DbName).
0.070593
Force-pushed from 8a4c4f0 to b815b4f.
How will this affect the output of the `_system` endpoint? I suspect it'd be useful to be able to generate a (new?) stat that showed the message queues by shard.
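As a hypothetical sketch of the kind of aggregate stat being asked for here, assuming a couch_server:names/0 style helper (one was added later, per the commit notes at the end of this thread):

```erlang
%% Sum the message queue lengths of all couch_server shards into one
%% number, roughly what a combined "couch_server queue" stat would
%% report. Dead or unregistered shards count as zero.
couch_server_queue_len() ->
    lists:sum([queue_len(whereis(Name)) || Name <- couch_server:names()]).

queue_len(undefined) ->
    0;
queue_len(Pid) ->
    case process_info(Pid, message_queue_len) of
        {message_queue_len, Len} -> Len;
        undefined -> 0
    end.
```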
Force-pushed from 31e6379 to 7fe1d1e.
Tests pass locally; getting 1 or 2 random failures each time on our CI chain. I would like to disable them if they aren't going to run reliably.
If the development team is unable to put in the time to maintain test suites properly, then you have my vote. I'd bring this up on dev@, but I have, many times over many years, and nothing ever changes.
Force-pushed from 7bd861e to 7fb0106.
I noticed one of the failed tests in
Force-pushed from 4ac6e7d to c4b254f.
Took me way too long to see my mistake. I called couch_server:couch_server(DbName) in couch_lru.erl instead of couch_server:couch_dbs(DbName). So the new mystery is why these tests pass locally...
@wohali the processes (and ets tables) have different names. I checked the _system output and it's as expected.
Consumers of this will need to adjust expectations for sure, though the _stats endpoint is fine; each couch_server_X process increments the same named stats as before.
The only remaining question before merging is whether we redefine what max_dbs_open means.
Force-pushed from c4b254f to 7de2502.
I've added the division at https://github.com/apache/couchdb/pull/3366/files#diff-2a7cfe9afa75338cd45f2ab010896e7cf24ea59b4355404f5412000cd9c7d934R296. The idea is that instead of 1 gen_server/ets table handling all $max_dbs_open items, we have N gen_servers/ets tables each handling $max_dbs_open/N items. This is the least surprising behaviour. The alternative would be a silent multiplying of max_dbs_open by the number of schedulers, a number that is usually not consciously chosen and varies by host.
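A minimal sketch of that division, assuming the num_servers() helper from this PR and a hypothetical helper name:

```erlang
%% Each of the N couch_server shards gets an equal slice of the
%% configured cap, so the node-wide total stays at max_dbs_open.
max_dbs_open_per_server() ->
    Max = config:get_integer("couchdb", "max_dbs_open", ?MAX_DBS_OPEN),
    max(1, Max div num_servers()).
```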
Force-pushed from 7de2502 to 62b507f.
Turns out they were real bugs in this PR :)
nicely done
Force-pushed from 62b507f to ef1c902.
@rnewson so no way to determine which couch_server is which without a remsh? Boo. How would I go about that anyway?
@wohali What do you mean which is which? Those message queues are the registered names.
@davisp which shard is
You can call
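Presumably something along these lines; a hypothetical remsh session, assuming couch_server:couch_server/1 maps a database name to its shard's registered name (the helper referenced earlier in this thread) — the shard number shown is made up:

```erlang
%% In a remsh on the node: find which couch_server shard owns a given
%% database, then peek at that shard's mailbox.
1> Name = couch_server:couch_server(<<"shards/00000000-1fffffff/mydb.1617382911">>).
couch_server_3
2> process_info(whereis(Name), message_queue_len).
{message_queue_len,0}
```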
@rnewson Talking people through problem solving remotely, people who are not generally facile with (or are frightened by) remsh.
Hm, ok, I'm still lost. A problem that requires looking into the state of couch_server is going to need remsh access anyway. Can you give me an example?
Never mind. Thanks for the good work on this PR.
The `fabric_rpc_tests` pollutes the state of `shards_db`, which causes flakiness in other tests. This PR fixes the problem by configuring a temporary `database_dir`. The important implementation detail is that we need to wait for all `couch_server` processes to restart. Before the introduction of the sharded couch server in apache#3366 this could be done as:

```erlang
test_util:with_process_restart(couch_server, fun() ->
    config:set("couchdb", "database_dir", NewDatabaseDir)
end),
```

This method has to be updated to support the sharded couch_server. The following auxiliary functions were added:

- `couch_server:names/0` - returns the list of registered names of each `couch_server` process
- `test_util:with_processes_restart/{2,4}` - waits for all processes to be restarted and returns `{Pids :: #{} | timeout, Res :: term()}`
- `test_util:with_couch_server_restart/1` - waits for all `couch_server` processes to finish restarting

The new way of configuring `database_dir` in test suites is:

```erlang
test_util:with_couch_server_restart(fun() ->
    config:set("couchdb", "database_dir", NewDatabaseDir)
end),
```
In #3860 and #3366 we added sharding to `couch_index_server` and `couch_server`. The `_system` endpoint surfaces a "fake" message queue for each of these containing the aggregated queue size across all shards. This commit adds the same for the `_prometheus` endpoint. Originally I had thought to just filter out the per-shard queue lengths, as we've not found these to be useful in Cloudant, but I'll leave them in for now for consistency with the `_system` endpoint. Arguably, we should filter in both places if there's agreement that the per-shard queue lengths are just noise.
Overview
Break up the monolithic couch_server to avoid a bottleneck.
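In condensed form, the change in call pattern looks roughly like this sketch (assuming the couch_server/1 routing helper shown in the diffs above; not an exact excerpt of the PR):

```erlang
%% Before this PR every request went through one registered process:
%%   gen_server:call(couch_server, {open, DbName, Options}, infinity)
%%
%% With sharding, the caller routes to one of N couch_server_X processes
%% keyed on the database name, so no single mailbox takes all traffic.
open(DbName, Options) ->
    gen_server:call(couch_server(DbName), {open, DbName, Options}, infinity).
```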
Testing recommendations
Heavy usage.
Related Issues or Pull Requests
N/A
Checklist
rel/overlay/etc/default.ini