
Use configured shards db in custodian instead of "dbs"#3807

Merged
nickva merged 2 commits into 3.x from fix-custodian-hard-coded-dbs on Oct 28, 2021

Conversation

Contributor

@nickva nickva commented Oct 28, 2021

With custodian finally working, however, it also started to emit false-positive errors in the logs for dbs with N smaller than the cluster default N. We fix that in a separate commit: instead of the cluster default N value, we use each database's expected N value.

The expected N value is a bit tricky to understand since, with the shard splitting feature, shard ranges are not guaranteed to match exactly across all copies. The N value is therefore defined as the max number of complete rings which can be formed from the given set of shards -- complete the ring once, remove the participating shards, try again, etc. Luckily for us, that function is already written as mem3_util:calculate_max_n/2, so we are just re-using it.
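The ring-counting idea above can be sketched roughly in Python. This is an illustrative model only, not CouchDB's actual Erlang implementation; the names `complete_ring` and `max_n` and the recursive search strategy are made up here for clarity:

```python
# Illustrative sketch of the "max completed rings" definition behind
# mem3_util:calculate_max_n/2. A shard is modeled as an inclusive
# (begin, end) range over the 32-bit keyspace.
RING_END = 2**32 - 1

def complete_ring(ranges, start=0):
    """Try to cover [start, RING_END] with a chain of contiguous ranges,
    each beginning right after the previous one ends. Returns the list
    of ranges used, or None if no complete ring can be formed."""
    if start > RING_END:
        return []  # the whole keyspace is covered
    for r in ranges:
        b, e = r
        if b == start:
            rest = complete_ring([x for x in ranges if x is not r], e + 1)
            if rest is not None:
                return [r] + rest  # backtracks if this branch dead-ends
    return None

def max_n(ranges):
    """Max number of disjoint complete rings buildable from ranges
    (the database's expected N under this model)."""
    ranges = list(ranges)  # avoid mutating the caller's list
    n = 0
    while True:
        ring = complete_ring(ranges)
        if ring is None:
            return n
        for r in ring:
            ranges.remove(r)  # a shard participates in one ring only
        n += 1
```

For example, two half-range shards plus one full-range shard yield N = 2 under this definition, even though the shard ranges do not match across copies.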

Contributor Author

nickva commented Oct 28, 2021

How to test:

  • Create n=1 and n=2 dbs
  • Notice that custodian shouldn't emit errors
  • Set one of the nodes in maintenance mode. Custodian should emit errors.
  • Bring node back to production and kill one of the nodes. Custodian should also emit errors.
```sh
http put $DB/n1db'?n=1'
http put $DB/n2db'?n=2'
```

```erlang
config:set("couchdb", "maintenance_mode", "true", false).
```

```erlang
custodian:report().
[{<<"n2db">>,{0,2147483647},{live,1}},
 {<<"n1db">>,{0,2147483647},{live,0}},
 {<<"_users">>,{2147483648,4294967295},{live,2}},
 {<<"_users">>,{0,2147483647},{live,2}},
 {<<"_replicator">>,{2147483648,4294967295},{live,2}},
 {<<"_replicator">>,{0,2147483647},{live,2}}]
```

```text
[notice] 2021-10-28T20:51:30.985095Z node1@127.0.0.1 <0.109.0> -------- config: [couchdb] maintenance_mode set to true for reason nil
[critical] 2021-10-28T20:51:30.987098Z node1@127.0.0.1 <0.2089.0> -------- 1 shard in cluster with only 1 copy on nodes not in maintenance mode
[critical] 2021-10-28T20:51:30.987134Z node1@127.0.0.1 <0.2089.0> -------- 1 shard in cluster with only 0 copies on nodes not in maintenance mode
[warning] 2021-10-28T20:51:30.987196Z node1@127.0.0.1 <0.2089.0> -------- 4 shards in cluster with only 2 copies on nodes not in maintenance mode
```

Previously, dbs with N < cluster default N would pollute logs with critical
errors regarding not having enough shards. Instead, use each database's
expected N value to emit custodian reports.

Note: the expected N value is a bit tricky to understand since, with the shard
splitting feature, shard ranges are not guaranteed to match exactly across all
copies. The N value is therefore defined as the max number of complete rings
which can be formed from the given set of shards -- complete the ring once,
remove the participating shards, try again, etc. Luckily for us, that function
is already written (`mem3_util:calculate_max_n/2`), so we are just re-using it.
Contributor

@jaydoane jaydoane left a comment


Works as advertised:

```text
[notice] 2021-10-28T22:48:16.151309Z node1@127.0.0.1 <0.7556.0> 16807916a8 localhost:15984 127.0.0.1 adm PUT /n1db?n=1 201 ok 69
[notice] 2021-10-28T22:48:25.183759Z node1@127.0.0.1 <0.7739.0> d924989833 localhost:15984 127.0.0.1 adm PUT /n2db?n=2 201 ok 71
[notice] 2021-10-28T22:48:47.861749Z node1@127.0.0.1 <0.132.0> -------- config: [couchdb] maintenance_mode set to true for reason nil
[critical] 2021-10-28T22:48:47.864359Z node1@127.0.0.1 <0.8160.0> -------- 1 shard in cluster with only 1 copy on nodes not in maintenance mode
[critical] 2021-10-28T22:48:47.864401Z node1@127.0.0.1 <0.8160.0> -------- 1 shard in cluster with only 0 copies on nodes not in maintenance mode
[warning] 2021-10-28T22:48:47.864451Z node1@127.0.0.1 <0.8160.0> -------- 4 shards in cluster with only 2 copies on nodes not in maintenance mode
[notice] 2021-10-28T22:49:45.102174Z node1@127.0.0.1 <0.132.0> -------- config: [couchdb] maintenance_mode set to false for reason nil
[notice] 2021-10-28T22:51:05.504482Z node1@127.0.0.1 <0.350.0> -------- rexi_server_mon : cluster unstable
[notice] 2021-10-28T22:51:05.504535Z node1@127.0.0.1 <0.354.0> -------- rexi_server_mon : cluster unstable
[notice] 2021-10-28T22:51:05.504620Z node1@127.0.0.1 <0.349.0> -------- rexi_server : cluster unstable
[notice] 2021-10-28T22:51:05.504672Z node1@127.0.0.1 <0.353.0> -------- rexi_buffer : cluster unstable
[notice] 2021-10-28T22:51:05.505066Z node1@127.0.0.1 <0.442.0> -------- couch_replicator_clustering : cluster unstable
[notice] 2021-10-28T22:51:05.505197Z node1@127.0.0.1 <0.452.0> -------- Stopping replicator db changes listener <0.1061.0>
[notice] 2021-10-28T22:51:05.511643Z node1@127.0.0.1 <0.10406.0> -------- All system databases exist.
[warning] 2021-10-28T22:51:05.513446Z node1@127.0.0.1 <0.10405.0> -------- 2 shards in cluster with only 1 copy on nodes that are currently up
[warning] 2021-10-28T22:51:05.513493Z node1@127.0.0.1 <0.10405.0> -------- 2 shards in cluster with only 1 copy on nodes not in maintenance mode
[warning] 2021-10-28T22:51:05.513539Z node1@127.0.0.1 <0.10405.0> -------- 4 shards in cluster with only 2 copies on nodes that are currently up
[warning] 2021-10-28T22:51:05.513576Z node1@127.0.0.1 <0.10405.0> -------- 4 shards in cluster with only 2 copies on nodes not in maintenance mode
[notice] 2021-10-28T22:51:20.505004Z node1@127.0.0.1 <0.349.0> -------- rexi_server : cluster stable
[notice] 2021-10-28T22:51:20.505575Z node1@127.0.0.1 <0.353.0> -------- rexi_buffer : cluster stable
```

Comment thread on src/custodian/README:

```diff
 Custodian is responsible for the data stored in CouchDB databases.

-Custodian scans the "dbs" database, which details the location of
+Custodian scans the shards database, which details the location of
```
Contributor


Nice fix!

```diff
 start_event_listener() ->
+    DbName = mem3_sync:shards_db(),
     couch_event:link_listener(
-        ?MODULE, handle_db_event, nil, [{dbname, <<"dbs">>}]
+        ?MODULE, handle_db_event, nil, [{dbname, DbName}]
```
Contributor


Sorry I missed this during import 😞

Contributor Author


No worries! Thanks for taking a look at the PR

@nickva nickva merged commit aa67448 into 3.x Oct 28, 2021
@nickva nickva deleted the fix-custodian-hard-coded-dbs branch October 28, 2021 23:06