Use configured shards db in custodian instead of "dbs" #3807

Merged
nickva merged 2 commits into 3.x from fix-custodian-hard-coded-dbs on Oct 28, 2021

Conversation

nickva (Contributor) commented on Oct 28, 2021:

The main change makes custodian scan the configured shards database instead of the hard-coded "dbs" name. With custodian now actually working, however, it also started to emit false-positive errors in the logs for dbs with N < the cluster default N. We fix that in a separate commit: instead of the cluster default N value, we use each database's expected N value.

The expected N value is a bit tricky to understand since, with the shard splitting feature, shard ranges are not guaranteed to match exactly across all copies. The N value is then defined as the maximum number of rings which can be completed with the given set of shards: complete the ring once, remove the participating shards, try again, and so on. Luckily for us, that function is already written as mem3_util:calculate_max_n/2, so we are just re-using it.
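
To make the ring-completion definition concrete, here is a minimal Erlang sketch of the counting idea over plain {Begin, End} ranges. It is illustrative only (a hypothetical ring_sketch module); the real mem3_util:calculate_max_n works on shard records and handles overlapping split ranges more carefully than this greedy walk:

-module(ring_sketch).
-export([max_n/1]).

-define(RING_END, 4294967295).  %% 16#ffffffff, the end of the full key range

%% Count how many complete [0, ?RING_END] rings can be formed: complete one
%% ring, remove the participating ranges, then try again.
max_n(Ranges) ->
    max_n(lists:sort(Ranges), 0).

max_n(Ranges, N) ->
    case complete_ring(0, Ranges, []) of
        {ok, Used} -> max_n(Ranges -- Used, N + 1);
        incomplete -> N
    end.

%% Greedily walk from Start, always consuming a range that begins exactly at
%% Start (no backtracking, unlike the real implementation).
complete_ring(Start, Ranges, Used) ->
    case lists:keyfind(Start, 1, Ranges) of
        {Start, ?RING_END} = R ->
            {ok, [R | Used]};
        {Start, End} = R ->
            complete_ring(End + 1, lists:delete(R, Ranges), [R | Used]);
        false ->
            incomplete
    end.

For example, ring_sketch:max_n([{0,16#7fffffff}, {16#80000000,16#ffffffff}, {0,16#ffffffff}]) returns 2: one ring from the two half-range copies, one from the unsplit copy.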

nickva (Contributor, Author) commented on Oct 28, 2021:

How to test:

  • Create n=1 and n=2 dbs
  • Verify that custodian does not emit errors
  • Put one of the nodes into maintenance mode; custodian should emit errors
  • Bring the node back to production and kill one of the nodes; custodian should also emit errors
http put $DB/n1db'?n=1'
http put $DB/n2db'?n=2'
config:set("couchdb", "maintenance_mode", "true", false).

custodian:report().
[{<<"n2db">>,{0,2147483647},{live,1}},
 {<<"n1db">>,{0,2147483647},{live,0}},
 {<<"_users">>,{2147483648,4294967295},{live,2}},
 {<<"_users">>,{0,2147483647},{live,2}},
 {<<"_replicator">>,{2147483648,4294967295},{live,2}},
 {<<"_replicator">>,{0,2147483647},{live,2}}]
[notice] 2021-10-28T20:51:30.985095Z node1@127.0.0.1 <0.109.0> -------- config: [couchdb] maintenance_mode set to true for reason nil
[critical] 2021-10-28T20:51:30.987098Z node1@127.0.0.1 <0.2089.0> -------- 1 shard in cluster with only 1 copy on nodes not in maintenance mode
[critical] 2021-10-28T20:51:30.987134Z node1@127.0.0.1 <0.2089.0> -------- 1 shard in cluster with only 0 copies on nodes not in maintenance mode
[warning] 2021-10-28T20:51:30.987196Z node1@127.0.0.1 <0.2089.0> -------- 4 shards in cluster with only 2 copies on nodes not in maintenance mode
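
For reference, each entry in the custodian:report() output above pairs a database name and shard range with a count of live copies. A tiny illustrative snippet for reading one entry (the entry format is inferred from the output above, and describe/1 is a hypothetical helper, not part of custodian):

%% Inferred format: {DbName, {RangeBegin, RangeEnd}, {live, Count}}
describe({DbName, {B, E}, {live, Live}}) ->
    io:format("~s range ~p-~p: ~p live copies~n", [DbName, B, E, Live]).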

Previously, dbs with N < cluster default N would pollute logs with critical
errors regarding not having enough shards. Instead, use each database's
expected N value to emit custodian reports.

Note: the expected N value is a bit tricky to understand since, with the shard
splitting feature, shard ranges are not guaranteed to match exactly across all
copies. The N value is then defined as the max number of rings which can be
completed with the given set of shards: complete the ring once, remove the
participating shards, try again, and so on. Luckily for us, that function is
already written (`mem3_util:calculate_max_n(Shards)`), so we are just re-using it.
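
As a rough illustration of the per-database expected N described above (a sketch only, assuming the shard list comes from mem3:shards/1; it is not necessarily how custodian wires this up internally):

%% Sketch: per-database expected N derived from the db's own shard map.
%% Previously the cluster-wide default was used for every db, roughly
%% config:get_integer("cluster", "n", 3).
expected_n(DbName) ->
    Shards = mem3:shards(DbName),          %% all shard copies for DbName
    mem3_util:calculate_max_n(Shards).     %% max number of complete rings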
jaydoane (Contributor) left a comment:


Works as advertised:

[notice] 2021-10-28T22:48:16.151309Z node1@127.0.0.1 <0.7556.0> 16807916a8 localhost:15984 127.0.0.1 adm PUT /n1db?n=1 201 ok 69
[notice] 2021-10-28T22:48:25.183759Z node1@127.0.0.1 <0.7739.0> d924989833 localhost:15984 127.0.0.1 adm PUT /n2db?n=2 201 ok 71
[notice] 2021-10-28T22:48:47.861749Z node1@127.0.0.1 <0.132.0> -------- config: [couchdb] maintenance_mode set to true for reason nil
[critical] 2021-10-28T22:48:47.864359Z node1@127.0.0.1 <0.8160.0> -------- 1 shard in cluster with only 1 copy on nodes not in maintenance mode
[critical] 2021-10-28T22:48:47.864401Z node1@127.0.0.1 <0.8160.0> -------- 1 shard in cluster with only 0 copies on nodes not in maintenance mode
[warning] 2021-10-28T22:48:47.864451Z node1@127.0.0.1 <0.8160.0> -------- 4 shards in cluster with only 2 copies on nodes not in maintenance mode
[notice] 2021-10-28T22:49:45.102174Z node1@127.0.0.1 <0.132.0> -------- config: [couchdb] maintenance_mode set to false for reason nil
[notice] 2021-10-28T22:51:05.504482Z node1@127.0.0.1 <0.350.0> -------- rexi_server_mon : cluster unstable
[notice] 2021-10-28T22:51:05.504535Z node1@127.0.0.1 <0.354.0> -------- rexi_server_mon : cluster unstable
[notice] 2021-10-28T22:51:05.504620Z node1@127.0.0.1 <0.349.0> -------- rexi_server : cluster unstable
[notice] 2021-10-28T22:51:05.504672Z node1@127.0.0.1 <0.353.0> -------- rexi_buffer : cluster unstable
[notice] 2021-10-28T22:51:05.505066Z node1@127.0.0.1 <0.442.0> -------- couch_replicator_clustering : cluster unstable
[notice] 2021-10-28T22:51:05.505197Z node1@127.0.0.1 <0.452.0> -------- Stopping replicator db changes listener <0.1061.0>
[notice] 2021-10-28T22:51:05.511643Z node1@127.0.0.1 <0.10406.0> -------- All system databases exist.
[warning] 2021-10-28T22:51:05.513446Z node1@127.0.0.1 <0.10405.0> -------- 2 shards in cluster with only 1 copy on nodes that are currently up
[warning] 2021-10-28T22:51:05.513493Z node1@127.0.0.1 <0.10405.0> -------- 2 shards in cluster with only 1 copy on nodes not in maintenance mode
[warning] 2021-10-28T22:51:05.513539Z node1@127.0.0.1 <0.10405.0> -------- 4 shards in cluster with only 2 copies on nodes that are currently up
[warning] 2021-10-28T22:51:05.513576Z node1@127.0.0.1 <0.10405.0> -------- 4 shards in cluster with only 2 copies on nodes not in maintenance mode
[notice] 2021-10-28T22:51:20.505004Z node1@127.0.0.1 <0.349.0> -------- rexi_server : cluster stable
[notice] 2021-10-28T22:51:20.505575Z node1@127.0.0.1 <0.353.0> -------- rexi_buffer : cluster stable

@@ -1,6 +1,6 @@
 Custodian is responsible for the data stored in CouchDB databases.
 
-Custodian scans the "dbs" database, which details the location of
+Custodian scans the shards database, which details the location of
jaydoane (Contributor) commented:

Nice fix!

couch_event:link_listener(
    ?MODULE, handle_db_event, nil, [{dbname, <<"dbs">>}]
)

jaydoane (Contributor) commented:

Sorry I missed this during import 😞

nickva (Contributor, Author) replied:

No worries! Thanks for taking a look at the PR
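
The hard-coded <<"dbs">> name in the listener above is exactly what this PR replaces. A minimal sketch of the idea, assuming (as elsewhere in mem3) that the name comes from the [mem3] shards_db setting, which defaults to "_dbs"; the PR itself may resolve it through an existing helper instead:

%% Sketch: resolve the configured shards db name instead of hard-coding it.
shards_db() ->
    list_to_binary(config:get("mem3", "shards_db", "_dbs")).

start_listener() ->
    couch_event:link_listener(
        ?MODULE, handle_db_event, nil, [{dbname, shards_db()}]
    ).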

nickva merged commit aa67448 into 3.x on Oct 28, 2021
nickva deleted the fix-custodian-hard-coded-dbs branch on October 28, 2021 at 23:06
Labels: none · Projects: none · Linked issues: none · 2 participants