
Couch server sharding #3366

Merged: rnewson merged 2 commits into 3.x from couch_server_sharding on Feb 12, 2021

Conversation

@rnewson
Member

rnewson commented Feb 10, 2021

Overview

Break up the monolithic couch_server to avoid a bottleneck.

Testing recommendations

Heavy usage.

Related Issues or Pull Requests

N/A

Checklist

  • Code is written and works correctly
  • Changes are covered by tests
  • Any new configurable parameters are documented in rel/overlay/etc/default.ini
  • A PR for documentation changes has been made in https://github.com/apache/couchdb-documentation

Fun = fun(N, {TimeAcc, OpenAcc}) ->
    {ok, #server{start_time=Time, dbs_open=Open}} =
        gen_server:call(couch_server(N), get_server),
    {max(Time, TimeAcc), Open + OpenAcc}
end,
Member Author

"Time" here is a formatted datetime string, so 'max' is a bit arbitrary. all the couch_server_X's are started at the same time, not sure we care?

true -> ok
end;
false ->
ok
end.

close_lru() ->
Member Author

No calls to this from outside the module

@wohali
Member

wohali commented Feb 10, 2021

Huh. Can you explain the failure mode in more detail, @rnewson ?

@rnewson
Member Author

rnewson commented Feb 10, 2021

couch_server as a single gen_server has a finite throughput that, in busy clusters, is easily breached, causing a sizeable backlog in the message queue, ultimately leading to failure and errors.

@rnewson
Member Author

rnewson commented Feb 10, 2021

Timeouts, memory blow-ups, cats and dogs living together.

@@ -343,7 +348,7 @@ all_databases() ->
{ok, lists:usort(DbList)}.

all_databases(Fun, Acc0) ->
-    {ok, #server{root_dir=Root}} = gen_server:call(couch_server, get_server),
+    {ok, #server{root_dir=Root}} = gen_server:call(couch_server_1, get_server),
Member Author

(because root_dir is the same in all of them)

@chewbranca
Contributor

chewbranca left a comment

Overall looks good to me. I left a few minor comments, nothing major. We'll want to make sure this is well tested before bringing it in, but overall the structure looks solid. Nice one!

handle_config_change("couchdb", "max_dbs_open", _, _, N) ->
{ok, gen_server:call(couch_server(N),{set_max_dbs_open,?MAX_DBS_OPEN})};
handle_config_change("couchdb_engines", _, _, _, N) ->
{ok, gen_server:call(couch_server(N), reload_engines)};
Contributor

Hmmm... do we want to trigger the config change callbacks N times now? The current approach keeps the code simpler, and I'm being a bit pedantic here, but it might be worth making a dedicated config listener pid that loops over the couch_server pids to trigger updates.

Member Author

Each of the couch_servers gets the callback and does this once.

Contributor

Sure, I just mean it's N config listeners for the same config changes, since each couch_server subscribes to the changes. Like I said, I'm being pedantic here, but we could make a couch_server_config_listener that subscribes once and then updates all the couch_servers in a loop.
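
A rough sketch of what such a dedicated listener might look like (a hypothetical module, not part of this PR; it assumes the couch_server_N registered-name convention introduced here and the existing config_listener behaviour):

```erlang
%% Hypothetical sketch: subscribe to config changes once and fan them out
%% to every couch_server shard, instead of one listener per shard.
-module(couch_server_config_listener).
-behaviour(config_listener).

-export([handle_config_change/5, handle_config_terminate/3]).

handle_config_change("couchdb", "max_dbs_open", Value, _Persist, State) ->
    Max = list_to_integer(Value),
    %% Loop over all couch_server shards and update each one.
    [gen_server:call(Name, {set_max_dbs_open, Max}) || Name <- couch_server_names()],
    {ok, State};
handle_config_change(_Section, _Key, _Value, _Persist, State) ->
    {ok, State}.

handle_config_terminate(_Self, _Reason, _State) ->
    ok.

%% Assumes one couch_server per scheduler, registered as couch_server_1..couch_server_N.
couch_server_names() ->
    N = erlang:system_info(schedulers),
    [list_to_atom("couch_server_" ++ integer_to_list(I)) || I <- lists:seq(1, N)].
```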

    name(BaseName, N);

name(BaseName, #server{} = Srv) ->
    name(BaseName, Srv#server.n);
Contributor

If we're caching N, should we also just cache the relevant atoms so we're not doing list_to_atom on concatenated strings every time? It probably won't have a major impact given this approach will have parallel couch_server pids, but it's certainly more work for them to do.

Member Author

I thought about it, but there are a fair few paths where we're not inside the couch_server gen_server process; I'll look to see if we can do it tidily.



name(BaseName, DbName) when is_binary(DbName) ->
    N = 1 + (erlang:crc32(DbName) rem num_servers()),
Contributor

1 + erlang:phash2(DbName, num_servers()) might be cleaner here since we are not really computing a checksum but want to do hashing

Contributor

May even be a tiny bit faster...

> DbName = crypto:strong_rand_bytes(100).
<<23,241,236,53,92,151,12,217,246,88,65,153,224,214,236,
  240,126,243,188,248,53,187,206,165,235,96,24,5,38,...>>

> f(USec), {USec, ok} = timer:tc(fun() -> [erlang:crc32(DbName) rem 10 || I <- lists:seq(1, 1000000)], ok end), USec/1.0e6.
1.499477

> f(USec), {USec, ok} = timer:tc(fun() -> [erlang:phash2(DbName, 10) || I <- lists:seq(1, 1000000)], ok end), USec/1.0e6.
1.331179

Member Author

hm, that looks nice, sure. I don't think checksum vs hash is a strong argument though, they're both means to an end.

Contributor

@nickva I'm having trouble finding a reference, but I've been under the impression that timer:tc is not to be trusted on anonymous functions in the shell and should be tested on a compiled module.

Contributor

@chewbranca I think they'd both be hit by the same slowness...

I tried a module mostly out of curiosity:

-module(test_hash).

-export([test_crc32/2, test_phash/2]).

test_crc32(N, Val) ->
    T0 = erlang:system_time(),
    do_crc32(N, Val),
    usec_per_cycle(erlang:system_time() - T0, N).

test_phash(N, Val) ->
    T0 = erlang:system_time(),
    do_phash2(N, Val),
    usec_per_cycle(erlang:system_time() - T0, N).

do_crc32(0, _) ->
    ok;
do_crc32(N, Val) ->
    _ = erlang:crc32(Val) rem 10,
    do_crc32(N - 1, Val).

do_phash2(0, _) ->
    ok;
do_phash2(N, Val) ->
    _ = erlang:phash2(Val, 10),
    do_phash2(N - 1, Val).

usec_per_cycle(Native, N) ->
    erlang:convert_time_unit(Native, native, microsecond) / N.

18> test_hash:test_crc32(10000000, DbName).
0.0722587
19> test_hash:test_phash(10000000, DbName).
0.070593

@rnewson force-pushed the couch_server_sharding branch 2 times, most recently from 8a4c4f0 to b815b4f on February 10, 2021 at 22:06
@wohali
Member

wohali commented Feb 10, 2021

How will this affect the output of the message_queues key in _system stats?

This line calls registered() and I don't know how that works for multiple gen_servers that all appear to have the same name.

I suspect it'd be useful to be able to generate a (new?) stat that showed message queues by couch_server shard, identifying by db/shard serviced by that gen_server. But I might be assuming something here that isn't real... 🙃

@rnewson force-pushed the couch_server_sharding branch 2 times, most recently from 31e6379 to 7fe1d1e on February 10, 2021 at 23:13
@rnewson
Member Author

rnewson commented Feb 10, 2021

Tests pass locally. I'm getting 1 or 2 random failures each time on our CI chain. I would like to disable them if they aren't going to run reliably.

@wohali
Member

wohali commented Feb 11, 2021

Tests pass locally. I'm getting 1 or 2 random failures each time on our CI chain. I would like to disable them if they aren't going to run reliably.

If the development team is unable to put in the time to maintain test suites properly, then you have my vote.

I'd bring this up on dev@, but I have, many times over many years, and nothing ever changes.

@rnewson force-pushed the couch_server_sharding branch 3 times, most recently from 7bd861e to 7fb0106 on February 11, 2021 at 18:35
@nickva
Contributor

nickva commented Feb 11, 2021

I noticed one of the failed tests in continuous-integration/jenkins/pr-merge had a couch_server failure in it. I ran the test suite locally and it passed, so it could be just a flaky test that the new code highlights, but since couch_server showed up it seemed a bit suspicious, so I thought I'd mention it here:

[2021-02-11T18:42:04.494Z]     Multiple database create/delete tests
[2021-02-11T18:42:04.853Z]       couch_db_tests:111: should_create_multiple_dbs...[0.090 s] ok
[2021-02-11T18:42:04.853Z]       couch_db_tests:122: should_delete_multiple_dbs...*failed*
[2021-02-11T18:42:04.853Z] in function gen_server:call/3 (gen_server.erl, line 223)
[2021-02-11T18:42:04.853Z] in call from couch_server:create_int/2 (src/couch_server.erl, line 143)
[2021-02-11T18:42:04.853Z] in call from couch_server:create/2 (src/couch_server.erl, line 135)
[2021-02-11T18:42:04.853Z] in call from couch_db_tests:create_db/2 (test/eunit/couch_db_tests.erl, line 193)
[2021-02-11T18:42:04.853Z] in call from couch_db_tests:'-should_delete_multiple_dbs/1-fun-1-'/1 (test/eunit/couch_db_tests.erl, line 123)
[2021-02-11T18:42:04.853Z] in call from couch_db_tests:'-should_delete_multiple_dbs/1-lc$^0/1-0-'/1 (test/eunit/couch_db_tests.erl, line 123)
[2021-02-11T18:42:04.853Z] in call from couch_db_tests:'-should_delete_multiple_dbs/1-fun-8-'/1 (test/eunit/couch_db_tests.erl, line 123)
[2021-02-11T18:42:04.853Z] in call from eunit_test:run_testfun/1 (eunit_test.erl, line 71)
[2021-02-11T18:42:04.853Z] **exit:{{badarg,[{ets,update_element,
[2021-02-11T18:42:04.853Z]                [couch_server_4,
[2021-02-11T18:42:04.853Z]                 <<"eunit-test-db-cd448fbae5748e8885319d3a0d"...>>,
[2021-02-11T18:42:04.853Z]                 {5,locked}],
[2021-02-11T18:42:04.853Z]                []},
[2021-02-11T18:42:04.853Z]           {couch_lru,close_int,2,[{file,"src/couch_lru.erl"},{line,49}]},
[2021-02-11T18:42:04.853Z]           {couch_server,maybe_close_lru_db,1,
[2021-02-11T18:42:04.853Z]                         [{file,"src/couch_server.erl"},{line,405}]},
[2021-02-11T18:42:04.853Z]           {couch_server,handle_call,3,
[2021-02-11T18:42:04.853Z]                         [{file,"src/couch_server.erl"},{line,575}]},
[2021-02-11T18:42:04.853Z]           {gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,661}]},
[2021-02-11T18:42:04.853Z]           {gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,690}]},
[2021-02-11T18:42:04.853Z]           {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]},
[2021-02-11T18:42:04.853Z]  {gen_server,call,
[2021-02-11T18:42:04.853Z]              [couch_server_4,
[2021-02-11T18:42:04.853Z]               {create,<<"eunit-test-db-9849e822ba346e3d33d5f388c3"...>>,[]},
[2021-02-11T18:42:04.853Z]               infinity]}}
[2021-02-11T18:42:04.853Z]   output:<<"">>

@rnewson force-pushed the couch_server_sharding branch 4 times, most recently from 4ac6e7d to c4b254f on February 12, 2021 at 13:13
@rnewson
Member Author

rnewson commented Feb 12, 2021

It took me way too long to see my mistake. I called couch_server:couch_server(DbName) in couch_lru.erl instead of couch_server:couch_dbs(DbName). So the new mystery is why these tests pass locally...

@rnewson
Member Author

rnewson commented Feb 12, 2021

@wohali the processes (and ets tables) have different names. I checked the _system output and it's as expected:

   "mem3_reshard_sup": 0,
    "mem3_reshard_job_sup": 0,
    "erl_prim_loader": 0,
    "couch_server_16": 0,
    "mem3_reshard_dbdoc": 0,
    "couch_server_15": 0,
    "couch_epi_sup": 0,
    "couch_server_14": 0,
    "mem3_reshard": 0,
    "couch_server_13": 0,
    "init": 0,
    "couch_server_12": 0,
    "couch_server_11": 0,
    "mem3_nodes": 0,
    "couch_server_10": 0,
    "couch_server_9": 0,
    "couch_server_8": 0,
    "couch_server_7": 0,
    "couch_server_6": 0,
    "code_server": 0,
    "couch_server_5": 0,
    "couch_server_4": 0,
    "application_controller": 0,
    "httpc_sup": 0,
...

Consumers of this will need to adjust expectations, for sure. The _stats endpoint is fine, though: each couch_server_X process increments the same named stats as before.

@rnewson
Member Author

rnewson commented Feb 12, 2021

The only remaining question before merging is whether we redefine max_dbs_open to be per scheduler, or whether I should reduce max_dbs_open by the num_servers() factor internally in each couch_server. My preference is for the latter, but it's an important point to get opinions on.

@rnewson
Member Author

rnewson commented Feb 12, 2021

I've added the division at https://github.com/apache/couchdb/pull/3366/files#diff-2a7cfe9afa75338cd45f2ab010896e7cf24ea59b4355404f5412000cd9c7d934R296.

The idea is that instead of 1 gen_server/ets table handling all $max_dbs_open items, we have N gen_servers/ets tables each handling $max_dbs_open/N items.

This is the least surprising behaviour. The alternative would be a silent multiplication of max_dbs_open by the number of schedulers, a number that is usually not consciously chosen and varies by host.
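
As a minimal sketch of that division (the names here are illustrative approximations rather than the PR's exact code; it assumes num_servers/0 follows the scheduler count and that config:get_integer/3 is available):

```erlang
%% Sketch only: each couch_server_N enforces an equal slice of max_dbs_open,
%% so the cluster-wide total stays close to the configured value.
num_servers() ->
    erlang:system_info(schedulers).

per_server_max_dbs_open() ->
    %% 500 is the default.ini value for max_dbs_open.
    MaxDbsOpen = config:get_integer("couchdb", "max_dbs_open", 500),
    MaxDbsOpen div num_servers().
```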

@janl
Member

janl commented Feb 12, 2021

If the development team is unable to put in the time to maintain test suites properly, then you have my vote.

Turns out they were real bugs in this PR :)

@janl
Member

janl left a comment

nicely done

@rnewson merged commit be2898d into 3.x on Feb 12, 2021
@rnewson deleted the couch_server_sharding branch on February 12, 2021 at 15:40
@wohali
Member

wohali commented Feb 12, 2021

@rnewson so no way to determine which couch_server is which without a remsh? Boo. How would I go about that anyway?

@davisp
Member

davisp commented Feb 12, 2021

@wohali What do you mean which is which? Those message queues are the registered names. Thus, couch_server_1 literally translates to a process registered under that name that can be inspected via process_info/1 and friends.
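
For example, from a remsh (the registered name follows this PR's couch_server_N convention; the output shown is illustrative):

```erlang
> process_info(whereis(couch_server_1), message_queue_len).
{message_queue_len,0}
```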

@wohali
Member

wohali commented Feb 12, 2021

@davisp which shard is couch_server_14 serving out?

@rnewson
Member Author

rnewson commented Feb 12, 2021

You can call couch_server:couch_server(DbName) from a remsh if you need to know which couch_server_X process a specific database is handled by. I'm not sure how useful that is. Typically at Cloudant we'd kill couch_server if it gained such a large message queue that it wasn't going to heal on its own, but I don't recall needing to look inside its state. What's your use case?
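
For reference, a quick remsh example of that lookup (the shard-style database name and the result shown are made up for illustration):

```erlang
> couch_server:couch_server(<<"shards/00000000-1fffffff/mydb.1612345678">>).
couch_server_7
```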

@wohali
Member

wohali commented Feb 12, 2021

@rnewson Talking people through problem solving remotely, people who are not generally facile with (or are frightened by) remsh.

@rnewson
Member Author

rnewson commented Feb 12, 2021

Hm, ok, I'm still lost. A problem that requires looking into the state of couch_server is going to need remsh access anyway. Can you give me an example?

@wohali
Member

wohali commented Feb 13, 2021

Never mind. Thanks for the good work on this PR.

@rnewson mentioned this pull request on Mar 24, 2021
@rnewson mentioned this pull request on Dec 3, 2021
iilyak added a commit to cloudant/couchdb that referenced this pull request Jan 31, 2022
The `fabric_rpc_tests` suite pollutes the state of `shards_db`, which causes flakiness
in other tests. This PR fixes the problem by configuring a temporary `database_dir`.

The important implementation detail is that we need to wait for all `couch_server`
processes to restart. Before the introduction of the sharded couch_server in
apache#3366 this could be done as:

```erlang
test_util:with_process_restart(couch_server, fun() ->
  config:set("couchdb", "database_dir", NewDatabaseDir)
end),
```

This method has to be updated to support the sharded couch_server. The following
auxiliary functions were added:

- `couch_server:names/0` - returns the list of registered names of each
  `couch_server` process
- `test_util:with_processes_restart/{2,4}` - waits for all processes to be restarted and
  returns `{Pids :: #{} | timeout, Res :: term()}`
- `test_util:with_couch_server_restart/1` - waits for all `couch_server` processes
  to finish restarting

The new way of configuring `database_dir` in test suites is:

```erlang
test_util:with_couch_server_restart(fun() ->
  config:set("couchdb", "database_dir", NewDatabaseDir)
end),
```
willholley added a commit that referenced this pull request Apr 13, 2023
In #3860 and #3366 we added sharding to `couch_index_server` and
`couch_server`.

The `_system` endpoint surfaces a "fake" message queue for each of these,
containing the aggregated queue size across all shards. This commit
adds the same for the `_prometheus` endpoint.

Originally I had thought to just filter out the per-shard queue lengths
as we've not found these to be useful in Cloudant, but I'll leave them
in for now for consistency with the `_system` endpoint. Arguably, we
should filter in both places if there's agreement that the per-shard
queue lengths are just noise.