When partitioned queries were introduced, new database creation options were added for the ?partitioned=true boolean along with the hash function used, and these options are stored in the dbs db doc for the relevant database. The underlying bug here is that there are a number of different code paths that result in the creation of database shards, and not all of them create the shards with the appropriate database options. We fixed a few of these issues in [1], but I've stumbled upon some more scenarios where we encounter these failures. They currently manifest on partitioned databases whose shards are created after initial database creation: those shards are incorrectly created without the partitioned flag.
If individual shards of a partitioned database are incorrectly created as non-partitioned, for the most part things just "work", making this an issue that hides in plain sight. One of the things that does not work is that design documents with partitioned query views have a partitioned boolean metadata value, and there's additional validation logic to prevent partitioned ddocs from being written to unpartitioned database shards. So in the event you get a shard replica incorrectly created as unpartitioned, and you have a ddoc that maps to that shard, the ddoc cannot be written to that shard replica.
In the event two of the three replicas are created correctly as partitioned, only one shard is incorrectly created as unpartitioned, and there's a partitioned=true ddoc on that shard range, the ddoc will fail to write to the unpartitioned shard. That triggers read_repair any time the ddoc is accessed through the quorum system. However, the read_repair logic does not enforce W=N write semantics, so having two out of three shard replicas properly created as partitioned results in a false positive read_repair success, never triggering the failure case where we log the issue. As a result, there will be a false positive "successful" read_repair operation every time the ddoc is accessed.
I've tried to be a bit more meticulous in auditing this issue and trying to fix it once and for all. I think there are only two ways of creating database shards: 1) calling couch_server:open with the option create_if_missing=true, and 2) directly calling couch_server:create. If anyone can think of any code paths I'm skipping, let me know.
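For reference, the two entry points look roughly like the sketch below. This is illustrative only: ShardName and Options stand in for whatever each call site supplies (which is exactly where the bug hides), and the precise shape of the create_if_missing option should be checked against couch_server.erl.

```erlang
%% Sketch of the two shard-creation entry points discussed above.
%% ShardName/Options are placeholders for call-site values; the exact
%% create_if_missing option shape should be verified in couch_server.erl.

%% 1) Open with create_if_missing: couch_server creates the shard file
%%    if it is absent, using only the Options it was handed.
{ok, Db1} = couch_server:open(ShardName, [{create_if_missing, true} | Options]),

%% 2) Create the shard file directly.
{ok, Db2} = couch_server:create(ShardName, Options).
```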
Here are the scenarios where we set create_if_missing:
And here are the scenarios where we call couch_server:create directly:
(! 22220)-> grep -r couch_server:create src/
src//mem3/test/eunit/mem3_seeds_test.erl: {ok, _} = couch_server:create(<<"_users">>, []),
src//mem3/src/mem3_shards.erl: case couch_server:create(Name, [?ADMIN_CTX] ++ Options) of
src//couch/test/eunit/couch_server_tests.erl: Resp = couch_server:create(?tempdb(), [{engine, <<"cowabunga!">>}]),
src//couch/src/couch_httpd_db.erl: case couch_server:create(DbName, [{user_ctx, UserCtx}] ++ Engine) of
src//couch/src/couch_server.erl: couch_server:create(DbName, Options);
src//couch/src/couch_db.erl: couch_server:create(DbName, Options).
src//dreyfus/src/dreyfus_rpc.erl: couch_server:create(DbName, Options);
src//fabric/src/fabric_rpc.erl: rexi:reply(case couch_server:create(DbName, Options) of
Let's audit these to see which are problematic and need to be fixed, starting with the create_if_missing invocations:
src//mem3/src/mem3_shards.erl: [create_if_missing(mem3:name(S), mem3:engine(S)) || S
src//mem3/src/mem3_shards.erl:create_if_missing(Name, Options) ->
These two calls [2][3] are to a function in mem3_shards with the same name as the option we're interested in, which is why they showed up in the grep. The call in [2] to the function defined in [3] ends up calling couch_server:create, so we've got an overlap between the two greps here, and we'll cover the case of [4]:
src//mem3/src/mem3_shards.erl: case couch_server:create(Name, [?ADMIN_CTX] ++ Options) of
here too. You'll notice that when the function in [4] calls couch_server:create it passes in the Options from [2], which is just mem3:engine(S) supplying the engine options. Nothing here checks whether the database should be created as partitioned, and therefore this is one of the problematic shard creation code paths that we'll need to fix.
Next up is the call in mem3_util:ensure_exists from [5][6], which is only used for the creation of sys_dbs, so I think we're ok to leave this as is, as we don't make partitioned system dbs.
This is from mem3_util:get_or_create_db in [7]. This is the improved db creation function we updated in [1], and it properly sets the db properties by way of [8], so we're good to go here. The next three cases are for isolated tests, and after those comes the core mechanism couch_server uses to determine if it should create the database when it's missing [9].
This is another problematic invocation in [10], which is called as part of fabric_util:get_db [11]. You can clearly see in fabric_util:get_db/1 (the arity-one clause) that we're supplying an empty list of options, which will never contain the partition options [12]. We never actually use the arity-one function head, as you can see from the following grep:
Those three invocations come from fabric:get_revs_limit/1 [13], fabric:get_purge_infos_limit/1 [14], and the two-arity fabric:get_security/2 [15]; note that the one-arity fabric:get_security/1 just calls the two-arity version with an empty options list. You can see in the following grep that we never actually supply database creation options to fabric:get_security and only use the options for setting a user_ctx, if at all. As such, all three of these cases are broken, but they all use the same db creation invocation in [10], so we can fix the problem there.
Next up are the direct invocations of couch_server:create. The mem3_seeds_test and couch_server_tests entries are isolated tests, and the mem3_shards call was covered above. That brings us to:
src//couch/src/couch_httpd_db.erl: case couch_server:create(DbName, [{user_ctx, UserCtx}] ++ Engine) of
This interesting case comes from [16]: because it doesn't include db creation options, you can't actually create partitioned dbs against the 5986 endpoint. The internal call in couch_server.erl is the core couch_server logic for creating databases that have had create_if_missing passed through [17].
This is an alias function, couch_db:create/2 --> couch_server:create/2 [18], which opens up more code paths for invoking create without the appropriate options. For the most part, it seems this function is only used in test modules, so the following grep shows the cases that don't involve the word "test":
The first case is for creating an internal users database, and the second case is part of the inline eunit tests of that module, so I don't think either of these scenarios or this code path are problematic at the moment.
This is from dreyfus_rpc:get_or_create_database [19], which is only invoked via {ok, Db} = get_or_create_db(DbName, []) in [20], and you can clearly see that it supplies an empty list for the db creation options, which is not appropriate.
On a related note, it's not in ASF CouchDB, but the Hastings [21] library uses a similar structure to Dreyfus and is susceptible to the same issue in its get_or_create_database code [22]. I've included it here as a reminder that it needs to be fixed too.
And for our final code path we have:
src//fabric/src/fabric_rpc.erl: rexi:reply(case couch_server:create(DbName, Options) of
which is in the fabric_rpc:create_db/2 function [23], used only by way of fabric:create_db and fabric_db_create:go. This is the main code path for creating databases (partitioned or not), and I believe it is working as expected.
How to find problematic shards
I've created a little remsh snippet that can be invoked on any node in the cluster to audit all dbs and shard replicas for issues, returning the problematic shards. I think it would be worthwhile to add this to mem3_util or some such.
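Something along these lines; this is a hedged sketch rather than the exact snippet. It assumes the internal calls fabric:all_dbs/0, mem3:local_shards/1, mem3:name/1, couch_db:open_int/2, couch_db:get_props/1, and couch_db:close/1 behave as I remember them, and it needs a live node's remsh to run:

```erlang
%% Remsh sketch: for every clustered database, open each local shard
%% replica and print the creation props stored in its header, so shards
%% missing the partitioned flag stand out.
%% Internal API names here (e.g. couch_db:get_props/1) are assumptions
%% from memory of the code base; verify them before use.
{ok, DbNames} = fabric:all_dbs(),
lists:foreach(fun(DbName) ->
    lists:foreach(fun(Shard) ->
        ShardName = mem3:name(Shard),
        case couch_db:open_int(ShardName, []) of
            {ok, Db} ->
                Props = couch_db:get_props(Db),
                couch_db:close(Db),
                io:format("~s: ~p~n", [ShardName, Props]);
            Error ->
                %% Surface shards we failed to open as well.
                io:format("~s: ~p~n", [ShardName, Error])
        end
    end, mem3:local_shards(DbName))
end, DbNames).
```

Comparing the printed props across the replicas of each shard range makes mismatched (unpartitioned) replicas easy to spot.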
Problematic shard creation code paths to fix
This leaves us with the following bugs to fix:
couchdb/src/mem3/src/mem3_shards.erl, line 361 in 8ac1978
couchdb/src/fabric/src/fabric_util.erl, line 108 in 8ac1978
couchdb/src/couch/src/couch_httpd_db.erl, line 235 in 8ac1978
couchdb/src/dreyfus/src/dreyfus_rpc.erl, line 108 in 8ac1978
[1] #2690
[2] couchdb/src/mem3/src/mem3_shards.erl, line 361 in 8ac1978
[3] couchdb/src/mem3/src/mem3_shards.erl, lines 411-423 in 8ac1978
[4] couchdb/src/mem3/src/mem3_shards.erl, line 416 in 8ac1978
[5] couchdb/src/mem3/src/mem3_util.erl, line 268 in 8ac1978
[6] couchdb/src/mem3/src/mem3_util.erl, lines 265-274 in 8ac1978
[7] couchdb/src/mem3/src/mem3_util.erl, lines 511-529 in 8ac1978
[8] couchdb/src/mem3/src/mem3_util.erl, line 519 in 8ac1978
[9] couchdb/src/couch/src/couch_server.erl, line 108 in 8ac1978
[10] couchdb/src/fabric/src/fabric_util.erl, line 108 in 8ac1978
[11] couchdb/src/fabric/src/fabric_util.erl, lines 96-132 in 8ac1978
[12] couchdb/src/fabric/src/fabric_util.erl, lines 96-97 in 8ac1978
[13] couchdb/src/fabric/src/fabric.erl, line 152 in 8ac1978
[14] couchdb/src/fabric/src/fabric.erl, lines 173-175 in 8ac1978
[15] couchdb/src/fabric/src/fabric.erl, lines 177-184 in 8ac1978
[16] couchdb/src/couch/src/couch_httpd_db.erl, line 235 in 8ac1978
[17] couchdb/src/couch/src/couch_server.erl, line 115 in 8ac1978
[18] couchdb/src/couch/src/couch_db.erl, lines 150-151 in 8ac1978
[19] couchdb/src/dreyfus/src/dreyfus_rpc.erl, lines 104-111 in 8ac1978
[20] couchdb/src/dreyfus/src/dreyfus_rpc.erl, line 43 in 8ac1978
[21] https://github.com/cloudant-labs/hastings
[22] https://github.com/cloudant-labs/hastings/blob/master/src/hastings_rpc.erl#L103-L111
[23] couchdb/src/fabric/src/fabric_rpc.erl, lines 164-173 in 8ac1978