Use configured shards db in custodian instead of "dbs" #3807

Merged
nickva merged 2 commits into 3.x from fix-custodian-hard-coded-dbs on Oct 28, 2021

Conversation

nickva (Contributor) commented on Oct 28, 2021:

The main change makes custodian scan the configured shards database instead of the hard-coded "dbs" name. With custodian now actually working, however, it also started to emit false-positive errors in the logs for dbs with N < the cluster default N. We fix that in a separate commit: instead of the cluster default N value, we use each database's expected N value.

The expected N value is a bit tricky to understand since, with the shard splitting feature, shard ranges are not guaranteed to match exactly across all copies. The N value is then defined as the maximum number of rings which can be completed with the given set of shards: complete the ring once, remove the participating shards, try again, and so on. Luckily for us, that function is already written as mem3_util:calculate_max_n/2, so we are just re-using it.
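
To make the ring-completion definition concrete, here is a minimal Erlang sketch of the counting idea over plain {Begin, End} ranges. It is illustrative only (a hypothetical ring_sketch module); the real mem3_util:calculate_max_n works on shard records and handles overlapping split ranges more carefully than this greedy walk:

-module(ring_sketch).
-export([max_n/1]).

-define(RING_END, 4294967295).  %% 16#ffffffff, the end of the full key range

%% Count how many complete [0, ?RING_END] rings can be formed: complete one
%% ring, remove the participating ranges, then try again.
max_n(Ranges) ->
    max_n(lists:sort(Ranges), 0).

max_n(Ranges, N) ->
    case complete_ring(0, Ranges, []) of
        {ok, Used} -> max_n(Ranges -- Used, N + 1);
        incomplete -> N
    end.

%% Greedily walk from Start, always consuming a range that begins exactly at
%% Start (no backtracking, unlike the real implementation).
complete_ring(Start, Ranges, Used) ->
    case lists:keyfind(Start, 1, Ranges) of
        {Start, ?RING_END} = R ->
            {ok, [R | Used]};
        {Start, End} = R ->
            complete_ring(End + 1, lists:delete(R, Ranges), [R | Used]);
        false ->
            incomplete
    end.

For example, ring_sketch:max_n([{0,16#7fffffff}, {16#80000000,16#ffffffff}, {0,16#ffffffff}]) returns 2: one ring from the two half-range copies, one from the unsplit copy.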

nickva (Contributor, Author) commented on Oct 28, 2021:

How to test:

  • Create n=1 and n=2 dbs
  • Verify that custodian does not emit errors
  • Put one of the nodes into maintenance mode; custodian should emit errors
  • Bring the node back to production and kill one of the nodes; custodian should also emit errors
http put $DB/n1db'?n=1'
http put $DB/n2db'?n=2'
config:set("couchdb", "maintenance_mode", "true", false).

custodian:report().
[{<<"n2db">>,{0,2147483647},{live,1}},
 {<<"n1db">>,{0,2147483647},{live,0}},
 {<<"_users">>,{2147483648,4294967295},{live,2}},
 {<<"_users">>,{0,2147483647},{live,2}},
 {<<"_replicator">>,{2147483648,4294967295},{live,2}},
 {<<"_replicator">>,{0,2147483647},{live,2}}]
[notice] 2021-10-28T20:51:30.985095Z node1@127.0.0.1 <0.109.0> -------- config: [couchdb] maintenance_mode set to true for reason nil
[critical] 2021-10-28T20:51:30.987098Z node1@127.0.0.1 <0.2089.0> -------- 1 shard in cluster with only 1 copy on nodes not in maintenance mode
[critical] 2021-10-28T20:51:30.987134Z node1@127.0.0.1 <0.2089.0> -------- 1 shard in cluster with only 0 copies on nodes not in maintenance mode
[warning] 2021-10-28T20:51:30.987196Z node1@127.0.0.1 <0.2089.0> -------- 4 shards in cluster with only 2 copies on nodes not in maintenance mode
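
For reference, each entry in the custodian:report() output above pairs a database name and shard range with a count of live copies. A tiny illustrative snippet for reading one entry (the entry format is inferred from the output above, and describe/1 is a hypothetical helper, not part of custodian):

%% Inferred format: {DbName, {RangeBegin, RangeEnd}, {live, Count}}
describe({DbName, {B, E}, {live, Live}}) ->
    io:format("~s range ~p-~p: ~p live copies~n", [DbName, B, E, Live]).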

Previously, dbs with N < cluster default N would pollute logs with critical
errors regarding not having enough shards. Instead, use each database's
expected N value to emit custodian reports.

Note: the expected N value is a bit tricky to understand since, with the shard
splitting feature, shard ranges are not guaranteed to match exactly across all
copies. The N value is then defined as the max number of rings which can be
completed with the given set of shards: complete the ring once, remove the
participating shards, try again, and so on. Luckily for us, that function is
already written (`mem3_util:calculate_max_n(Shards)`), so we are just re-using it.
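
As a rough illustration of the per-database expected N described above (a sketch only, assuming the shard list comes from mem3:shards/1; it is not necessarily how custodian wires this up internally):

%% Sketch: per-database expected N derived from the db's own shard map.
%% Previously the cluster-wide default was used for every db, roughly
%% config:get_integer("cluster", "n", 3).
expected_n(DbName) ->
    Shards = mem3:shards(DbName),          %% all shard copies for DbName
    mem3_util:calculate_max_n(Shards).     %% max number of complete rings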
jaydoane (Contributor) left a comment:


Works as advertised:

[notice] 2021-10-28T22:48:16.151309Z node1@127.0.0.1 <0.7556.0> 16807916a8 localhost:15984 127.0.0.1 adm PUT /n1db?n=1 201 ok 69
[notice] 2021-10-28T22:48:25.183759Z node1@127.0.0.1 <0.7739.0> d924989833 localhost:15984 127.0.0.1 adm PUT /n2db?n=2 201 ok 71
[notice] 2021-10-28T22:48:47.861749Z node1@127.0.0.1 <0.132.0> -------- config: [couchdb] maintenance_mode set to true for reason nil
[critical] 2021-10-28T22:48:47.864359Z node1@127.0.0.1 <0.8160.0> -------- 1 shard in cluster with only 1 copy on nodes not in maintenance mode
[critical] 2021-10-28T22:48:47.864401Z node1@127.0.0.1 <0.8160.0> -------- 1 shard in cluster with only 0 copies on nodes not in maintenance mode
[warning] 2021-10-28T22:48:47.864451Z node1@127.0.0.1 <0.8160.0> -------- 4 shards in cluster with only 2 copies on nodes not in maintenance mode
[notice] 2021-10-28T22:49:45.102174Z node1@127.0.0.1 <0.132.0> -------- config: [couchdb] maintenance_mode set to false for reason nil
[notice] 2021-10-28T22:51:05.504482Z node1@127.0.0.1 <0.350.0> -------- rexi_server_mon : cluster unstable
[notice] 2021-10-28T22:51:05.504535Z node1@127.0.0.1 <0.354.0> -------- rexi_server_mon : cluster unstable
[notice] 2021-10-28T22:51:05.504620Z node1@127.0.0.1 <0.349.0> -------- rexi_server : cluster unstable
[notice] 2021-10-28T22:51:05.504672Z node1@127.0.0.1 <0.353.0> -------- rexi_buffer : cluster unstable
[notice] 2021-10-28T22:51:05.505066Z node1@127.0.0.1 <0.442.0> -------- couch_replicator_clustering : cluster unstable
[notice] 2021-10-28T22:51:05.505197Z node1@127.0.0.1 <0.452.0> -------- Stopping replicator db changes listener <0.1061.0>
[notice] 2021-10-28T22:51:05.511643Z node1@127.0.0.1 <0.10406.0> -------- All system databases exist.
[warning] 2021-10-28T22:51:05.513446Z node1@127.0.0.1 <0.10405.0> -------- 2 shards in cluster with only 1 copy on nodes that are currently up
[warning] 2021-10-28T22:51:05.513493Z node1@127.0.0.1 <0.10405.0> -------- 2 shards in cluster with only 1 copy on nodes not in maintenance mode
[warning] 2021-10-28T22:51:05.513539Z node1@127.0.0.1 <0.10405.0> -------- 4 shards in cluster with only 2 copies on nodes that are currently up
[warning] 2021-10-28T22:51:05.513576Z node1@127.0.0.1 <0.10405.0> -------- 4 shards in cluster with only 2 copies on nodes not in maintenance mode
[notice] 2021-10-28T22:51:20.505004Z node1@127.0.0.1 <0.349.0> -------- rexi_server : cluster stable
[notice] 2021-10-28T22:51:20.505575Z node1@127.0.0.1 <0.353.0> -------- rexi_buffer : cluster stable

@@ -1,6 +1,6 @@
 Custodian is responsible for the data stored in CouchDB databases.
 
-Custodian scans the "dbs" database, which details the location of
+Custodian scans the shards database, which details the location of
jaydoane (Contributor) commented:

Nice fix!

couch_event:link_listener(
    ?MODULE, handle_db_event, nil, [{dbname, <<"dbs">>}]
)

jaydoane (Contributor) commented:

Sorry I missed this during import 😞

nickva (Contributor, Author) replied:

No worries! Thanks for taking a look at the PR
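
The hard-coded <<"dbs">> name in the listener above is exactly what this PR replaces. A minimal sketch of the idea, assuming (as elsewhere in mem3) that the name comes from the [mem3] shards_db setting, which defaults to "_dbs"; the PR itself may resolve it through an existing helper instead:

%% Sketch: resolve the configured shards db name instead of hard-coding it.
shards_db() ->
    list_to_binary(config:get("mem3", "shards_db", "_dbs")).

start_listener() ->
    couch_event:link_listener(
        ?MODULE, handle_db_event, nil, [{dbname, shards_db()}]
    ).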

nickva merged commit aa67448 into 3.x on Oct 28, 2021
nickva deleted the fix-custodian-hard-coded-dbs branch on October 28, 2021 at 23:06
Labels: none · Projects: none · Linked issues: none · 2 participants