
Ensure all code paths for creating databases respect db creation options #3631

Closed · 5 tasks

chewbranca opened this issue Jun 16, 2021 · 4 comments

@chewbranca (Contributor)
When partitioned queries were introduced, new database creation options were added for the ?partitioned=true boolean along with the hash function used, and these options are stored in the dbs db doc for the relevant database. The underlying bug here is that there are a number of different code paths that result in the creation of database shards, and not all of them create the shards with the appropriate database options. We fixed a few of these issues in [1], but I've stumbled upon some more scenarios where we encounter these failures. These currently manifest on partitioned databases: shards created after initial database creation are incorrectly created without the partitioned flag.

If individual shards of a partitioned database are incorrectly created as non-partitioned, for the most part things just "work", making this an issue that hides in plain sight. One of the things that does not work: design documents with partitioned query views have a partitioned boolean metadata value, and there's additional validation logic to prevent partitioned ddocs from being written to unpartitioned database shards. So in the event a shard replica is incorrectly created as unpartitioned, and you have a ddoc that maps to that shard, the ddoc cannot be written to that shard replica.

In the event two of the three replicas are created correctly as partitioned, only one shard is incorrectly created as unpartitioned, and there's a partitioned=true ddoc on that shard range, the ddoc will fail to write to the unpartitioned shard. That failure will then trigger read_repair anytime the ddoc is accessed through the quorum system. However, the read_repair logic does not enforce W=N write semantics, so having two out of three shard replicas properly created as partitioned results in a false positive read_repair success, never triggering the failure case where we log the issue. As a result, there will be a false positive successful read_repair operation every time the ddoc is accessed.


I've tried to be a bit more meticulous in terms of auditing this issue and trying to fix it once and for all. I think there are only two ways of creating database shards: 1) Calling couch_server:open with the option create_if_missing=true, and 2) by directly calling couch_server:create. If anyone can think of any code paths I'm skipping, let me know.

Here are the scenarios where we set create_if_missing:

(! 22221)-> grep -r create_if_missing src/
src//mem3/src/mem3_shards.erl:                [create_if_missing(mem3:name(S), mem3:engine(S)) || S
src//mem3/src/mem3_shards.erl:create_if_missing(Name, Options) ->
src//mem3/src/mem3_util.erl:    Options = [nologifmissing, sys_db, {create_if_missing, true}, ?ADMIN_CTX],
src//mem3/src/mem3_util.erl:                Options1 = [{create_if_missing, true} | Options],
src//couch/test/eunit/couch_db_tests.erl:        {ok, Db} = couch_db:open(DbName, [{create_if_missing, true}]),
src//couch/test/eunit/couch_db_tests.erl:        {ok, Db} = couch_db:open(DbName, [{create_if_missing, true}]),
src//couch/test/eunit/couch_db_tests.erl:        {ok, Db1} = couch_db:open(DbName, [{create_if_missing, true}]),
src//couch/src/couch_server.erl:        Create = couch_util:get_value(create_if_missing, Options, false),
src//fabric/src/fabric_util.erl:    get_shard(Live, [{create_if_missing, true} | Options], 100, Factor).

And here's the scenarios where we call couch_server:create directly:

(! 22220)-> grep -r couch_server:create src/
src//mem3/test/eunit/mem3_seeds_test.erl:    {ok, _} = couch_server:create(<<"_users">>, []),
src//mem3/src/mem3_shards.erl:            case couch_server:create(Name, [?ADMIN_CTX] ++ Options) of
src//couch/test/eunit/couch_server_tests.erl:    Resp = couch_server:create(?tempdb(), [{engine, <<"cowabunga!">>}]),
src//couch/src/couch_httpd_db.erl:    case couch_server:create(DbName, [{user_ctx, UserCtx}] ++ Engine) of
src//couch/src/couch_server.erl:            couch_server:create(DbName, Options);
src//couch/src/couch_db.erl:    couch_server:create(DbName, Options).
src//dreyfus/src/dreyfus_rpc.erl:        couch_server:create(DbName, Options);
src//fabric/src/fabric_rpc.erl:    rexi:reply(case couch_server:create(DbName, Options) of

Let's audit these to see which are problematic and need to be fixed, starting with the create_if_missing invocations:

src//mem3/src/mem3_shards.erl: [create_if_missing(mem3:name(S), mem3:engine(S)) || S
src//mem3/src/mem3_shards.erl:create_if_missing(Name, Options) ->

These two lines [2][3] are a function in mem3_shards with the same name as the option we're interested in, which is why they showed up in the grep. The call in [2] to the function defined in [3] ends up calling couch_server:create, so we've got an overlap between the two greps here, and we'll cover the case of [4]:

src//mem3/src/mem3_shards.erl: case couch_server:create(Name, [?ADMIN_CTX] ++ Options) of

here too. You'll notice that when the function in [4] calls couch_server:create it passes in the Options from [2], which is just mem3:engine(S) supplying the engine options. Nothing here looks up the options the database should be created with, and therefore this is one of the problematic shard creation code paths that we'll need to fix.
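For illustration, here's a minimal sketch of one way this path could be fixed (this is a sketch, not the actual patch): look up the parent database's creation options the same way mem3_util:get_or_create_db does via mem3_shards:opts_for_db/1, and fold them into the create call. The use of mem3:dbname/1 to map the shard name back to the clustered db name is my assumption here, and error handling for missing dbs docs is elided.

```erlang
%% Sketch only: pass the parent db's creation options (e.g. the partitioned
%% flag) through to couch_server:create/2. Assumes mem3:dbname/1 and
%% mem3_shards:opts_for_db/1 behave as in mem3_util:get_or_create_db, and
%% that ?ADMIN_CTX is available from the usual include.
create_if_missing(Name, Options) ->
    case couch_server:exists(Name) of
        true ->
            ok;
        false ->
            DbOpts = mem3_shards:opts_for_db(mem3:dbname(Name)),
            case couch_server:create(Name, [?ADMIN_CTX] ++ Options ++ DbOpts) of
                {ok, Db} ->
                    couch_db:close(Db);
                Error ->
                    couch_log:error("~p tried to create ~s, got ~p",
                        [?MODULE, Name, Error])
            end
    end.
```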


Next up is:

src//mem3/src/mem3_util.erl: Options = [nologifmissing, sys_db, {create_if_missing, true}, ?ADMIN_CTX]

from [5]. This line of code is from mem3_util:ensure_exists [6], which is only used for the creation of sys_dbs, so I think we're ok to leave this as is, as we don't make partitioned system dbs.


src//mem3/src/mem3_util.erl: Options1 = [{create_if_missing, true} | Options],

This is from mem3_util:get_or_create_db [7], the improved db creation function we updated in [1]. It properly sets the db properties by way of [8], so we're good to go here.


The following three cases are for isolated tests:

src//couch/test/eunit/couch_db_tests.erl: {ok, Db} = couch_db:open(DbName, [{create_if_missing, true}]),
src//couch/test/eunit/couch_db_tests.erl: {ok, Db} = couch_db:open(DbName, [{create_if_missing, true}]),
src//couch/test/eunit/couch_db_tests.erl: {ok, Db1} = couch_db:open(DbName, [{create_if_missing, true}]),


src//couch/src/couch_server.erl: Create = couch_util:get_value(create_if_missing, Options, false),

This is the core mechanism couch_server uses to determine if it should create the database when it's missing [9].


src//fabric/src/fabric_util.erl: get_shard(Live, [{create_if_missing, true} | Options], 100, Factor).

This is another problematic invocation [10], which is called as part of fabric_util:get_db [11]. You can clearly see in fabric_util:get_db/1 (the arity-one case) that we're supplying an empty list of options, which will never contain the partition options [12]. We never actually use the arity-one function head, as you can see from the following grep:

(! 22224)-> grep -r fabric_util:get_db src/
src//fabric/src/fabric.erl:    {ok, Db} = fabric_util:get_db(dbname(DbName), [?ADMIN_CTX]),
src//fabric/src/fabric.erl:    {ok, Db} = fabric_util:get_db(dbname(DbName), [?ADMIN_CTX]),
src//fabric/src/fabric.erl:    {ok, Db} = fabric_util:get_db(dbname(DbName), opts(Options)),

Those three invocations come from fabric:get_revs_limit/1 [13], fabric:get_purge_infos_limit/1 [14], and the two-arity fabric:get_security/2 [15], although do note that the one-arity fabric:get_security/1 just calls the two-arity version with [?ADMIN_CTX]. You can see in the following grep that we never actually supply database creation options to fabric:get_security and only use the options for setting a user_ctx, if at all:

(! 22227)-> grep -r fabric:get_security src/
src//mem3/test/eunit/mem3_reshard_test.erl:    with_proc(fun() -> fabric:get_security(DbName, [?ADMIN_CTX]) end).
src//chttpd/src/chttpd_auth_request.erl:    {_} = fabric:get_security(DbName, [{user_ctx, Ctx}]),
src//chttpd/src/chttpd_auth_request.erl:    Sec = fabric:get_security(DbName, [{user_ctx, Ctx}]),
src//chttpd/src/chttpd_external.erl:            fabric:get_security(Db);
src//chttpd/src/chttpd_db.erl:    fabric:get_security(DbName, [{user_ctx, Ctx}]),
src//chttpd/src/chttpd_db.erl:    send_json(Req, fabric:get_security(Db));
src//fabric/src/fabric_util.erl:    {SecProps} = fabric:get_security(DbName), % as admin

As such, all three of these cases are broken, but they all use the same db creation invocation in [10], so we can fix the problem there.
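As a rough sketch of what a fix at [10]/[11] might look like (again, a sketch under my own assumptions, not the actual patch): fetch the db's stored creation options with mem3_shards:opts_for_db/1 and merge them into the options passed to get_shard, so that any create_if_missing create on a shard carries the partitioned flag.

```erlang
%% Sketch only: merge the database's stored creation options into the
%% shard-open options before create_if_missing can fire. Assumes
%% mem3_shards:opts_for_db/1 as used in mem3_util:get_or_create_db;
%% error handling for unknown dbs is elided.
get_db(DbName, Options) ->
    {Local, SameZone, DifferentZone} = mem3:group_by_proximity(mem3:shards(DbName)),
    Shards = Local ++ lists:keysort(#shard.name, SameZone)
        ++ lists:keysort(#shard.name, DifferentZone),
    Nodes = [node() | erlang:nodes()],
    Live = [S || #shard{node = N} = S <- Shards, lists:member(N, Nodes)],
    Factor = list_to_integer(config:get("fabric", "shard_timeout_factor", "2")),
    DbOpts = mem3_shards:opts_for_db(DbName),
    get_shard(Live, [{create_if_missing, true} | Options ++ DbOpts], 100, Factor).
```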


Next up are the direct invocations of couch_server:create:

src//mem3/test/eunit/mem3_seeds_test.erl: {ok, _} = couch_server:create(<<"_users">>, []),

This is an isolated test.

src//mem3/src/mem3_shards.erl: case couch_server:create(Name, [?ADMIN_CTX] ++ Options) of

This was covered above.

src//couch/test/eunit/couch_server_tests.erl: Resp = couch_server:create(?tempdb(), [{engine, <<"cowabunga!">>}]),

This is for an isolated test.


src//couch/src/couch_httpd_db.erl: case couch_server:create(DbName, [{user_ctx, UserCtx}] ++ Engine) of

This interesting case comes from [16]; because it doesn't include db creation options, you can't actually create partitioned dbs against the 5986 endpoint.


src//couch/src/couch_server.erl: couch_server:create(DbName, Options);

This is the core couch_server logic for creating databases that have had create_if_missing passed through [17].


src//couch/src/couch_db.erl: couch_server:create(DbName, Options).

This is an alias function, couch_db:create/2 --> couch_server:create/2 [18]. It opens up more code paths for invoking create without the appropriate options. For the most part, it seems this function is only used in test modules, so the following grep shows the cases that don't involve the word "test":

(! 22233)-> grep -r couch_db:create src/ | grep -v test
src//couch/src/couch_auth_cache.erl:        {ok, Db} = couch_db:create(users_db(), Options),
src//couch_replicator/src/couch_replicator_docs.erl:    {ok, Db} = couch_db:create(DbName, [?ADMIN_CTX]),

The first case is for creating an internal users database, and the second case is part of the inline eunit tests of that module, so I don't think either of these scenarios, or this code path, is problematic at the moment.


src//dreyfus/src/dreyfus_rpc.erl: couch_server:create(DbName, Options);

This is from dreyfus_rpc:get_or_create_db [19], which is only invoked by {ok, Db} = get_or_create_db(DbName, []) in [20], and you can clearly see that it supplies an empty list for the db creation options, which is not appropriate.

On a related note, it's not in ASF CouchDB, but the Hastings [21] library uses a similar structure to Dreyfus and is susceptible to the same issue in its get_or_create_db code [22]. I've included it here as a reminder that it needs to be fixed too.


And for our final code path we have:

src//fabric/src/fabric_rpc.erl: rexi:reply(case couch_server:create(DbName, Options) of

which is in the fabric_rpc:create_db/2 function [23]. That function is only used by way of fabric:create_db and fabric_db_create:go, which is the main code path for creating databases (partitioned or not), and I believe this is working as expected.


How to find problematic shards

I've created a little remsh snippet that can be invoked on any node in the cluster; it will audit all dbs and shard replicas for issues, returning the problematic shards. I think it would be worthwhile to add this to mem3_util or some such.

rr(mem3_shards), f(FoldPartitionedShards),
FoldPartitionedShards = fun() ->
    Dbnames = lists:filter(fun fabric_util:is_partitioned/1, element(2, fabric:all_dbs())),
    Shards = lists:flatten([mem3:shards(Dbname) || Dbname <- Dbnames]),
    lists:foldl(fun(Shard, Acc) ->
        case rpc:call(Shard#shard.node, erlang, apply, [fun() ->
                    {ok, Db} = couch_db:open_int(Shard#shard.name, []),
                    couch_db:is_partitioned(Db)
                end, []]) of
            true ->
                Acc;
            false ->
                [{Shard#shard.node, Shard#shard.name} | Acc]
        end
    end, [], Shards)
end.

Problematic shard creation code paths to fix

This leaves us with the following bugs to fix:

[1] #2690
[2]

[create_if_missing(mem3:name(S), mem3:engine(S)) || S

[3]
create_if_missing(Name, Options) ->
    case couch_server:exists(Name) of
        true ->
            ok;
        false ->
            case couch_server:create(Name, [?ADMIN_CTX] ++ Options) of
                {ok, Db} ->
                    couch_db:close(Db);
                Error ->
                    couch_log:error("~p tried to create ~s, got ~p",
                        [?MODULE, Name, Error])
            end
    end.

[4]
case couch_server:create(Name, [?ADMIN_CTX] ++ Options) of

[5]
Options = [nologifmissing, sys_db, {create_if_missing, true}, ?ADMIN_CTX],

[6]
ensure_exists(DbName) when is_list(DbName) ->
    ensure_exists(list_to_binary(DbName));
ensure_exists(DbName) ->
    Options = [nologifmissing, sys_db, {create_if_missing, true}, ?ADMIN_CTX],
    case couch_db:open(DbName, Options) of
        {ok, Db} ->
            {ok, Db};
        file_exists ->
            couch_db:open(DbName, [sys_db, ?ADMIN_CTX])
    end.

[7]
get_or_create_db(DbName, Options) ->
    case couch_db:open_int(DbName, Options) of
        {ok, _} = OkDb ->
            OkDb;
        {not_found, no_db_file} ->
            try
                DbOpts = case mem3:dbname(DbName) of
                    DbName -> [];
                    MDbName -> mem3_shards:opts_for_db(MDbName)
                end,
                Options1 = [{create_if_missing, true} | Options],
                Options2 = merge_opts(DbOpts, Options1),
                couch_db:open_int(DbName, Options2)
            catch error:database_does_not_exist ->
                throw({error, missing_target})
            end;
        Else ->
            Else
    end.

[8]
MDbName -> mem3_shards:opts_for_db(MDbName)

[9]
Create = couch_util:get_value(create_if_missing, Options, false),

[10]
get_shard(Live, [{create_if_missing, true} | Options], 100, Factor).

[11]
get_db(DbName) ->
    get_db(DbName, []).

get_db(DbName, Options) ->
    {Local, SameZone, DifferentZone} = mem3:group_by_proximity(mem3:shards(DbName)),
    % Prefer shards on the same node over other nodes, prefer shards in the same zone over
    % other zones and sort each remote list by name so that we don't repeatedly try the same node.
    Shards = Local ++ lists:keysort(#shard.name, SameZone) ++ lists:keysort(#shard.name, DifferentZone),
    % suppress shards from down nodes
    Nodes = [node()|erlang:nodes()],
    Live = [S || #shard{node = N} = S <- Shards, lists:member(N, Nodes)],
    Factor = list_to_integer(config:get("fabric", "shard_timeout_factor", "2")),
    get_shard(Live, [{create_if_missing, true} | Options], 100, Factor).

get_shard([], _Opts, _Timeout, _Factor) ->
    erlang:error({internal_server_error, "No DB shards could be opened."});
get_shard([#shard{node = Node, name = Name} | Rest], Opts, Timeout, Factor) ->
    Mon = rexi_monitor:start([rexi_utils:server_pid(Node)]),
    MFA = {fabric_rpc, open_shard, [Name, [{timeout, Timeout} | Opts]]},
    Ref = rexi:cast(Node, self(), MFA, [sync]),
    try
        receive
            {Ref, {ok, Db}} ->
                {ok, Db};
            {Ref, {'rexi_EXIT', {{unauthorized, _} = Error, _}}} ->
                throw(Error);
            {Ref, {'rexi_EXIT', {{forbidden, _} = Error, _}}} ->
                throw(Error);
            {Ref, Reason} ->
                couch_log:debug("Failed to open shard ~p because: ~p", [Name, Reason]),
                get_shard(Rest, Opts, Timeout, Factor)
        after Timeout ->
            couch_log:debug("Failed to open shard ~p after: ~p", [Name, Timeout]),
            get_shard(Rest, Opts, Factor * Timeout, Factor)
        end
    after
        rexi_monitor:stop(Mon)
    end.

[12]
get_db(DbName) ->
    get_db(DbName, []).

[13]
{ok, Db} = fabric_util:get_db(dbname(DbName), [?ADMIN_CTX]),

[14]
get_purge_infos_limit(DbName) ->
    {ok, Db} = fabric_util:get_db(dbname(DbName), [?ADMIN_CTX]),
    try couch_db:get_purge_infos_limit(Db) after catch couch_db:close(Db) end.

[15]
get_security(DbName) ->
    get_security(DbName, [?ADMIN_CTX]).

%% @doc retrieve the security object for a database
-spec get_security(dbname()) -> json_obj() | no_return().
get_security(DbName, Options) ->
    {ok, Db} = fabric_util:get_db(dbname(DbName), opts(Options)),
    try couch_db:get_security(Db) after catch couch_db:close(Db) end.

[16]
case couch_server:create(DbName, [{user_ctx, UserCtx}] ++ Engine) of

[17]
couch_server:create(DbName, Options);

[18]
create(DbName, Options) ->
    couch_server:create(DbName, Options).

[19]
get_or_create_db(DbName, Options) ->
    case couch_db:open_int(DbName, Options) of
        {not_found, no_db_file} ->
            couch_log:warning("~p creating ~s", [?MODULE, DbName]),
            couch_server:create(DbName, Options);
        Else ->
            Else
    end.

[20]
{ok, Db} = get_or_create_db(DbName, []),

[21] https://github.com/cloudant-labs/hastings
[22] https://github.com/cloudant-labs/hastings/blob/master/src/hastings_rpc.erl#L103-L111
[23]
create_db(DbName) ->
    create_db(DbName, []).

create_db(DbName, Options) ->
    rexi:reply(case couch_server:create(DbName, Options) of
        {ok, _} ->
            ok;
        Error ->
            Error
    end).

@chewbranca (Contributor Author)
Alright, I've opened up a CouchDB PR that fixes the mem3_shards, fabric_util, and dreyfus_rpc issues in #3633.

I've got a related PR for the Hastings issue in: cloudant-labs/hastings#31.

I've left the 5986 creation of PQ shards issue as is, feedback welcome on what we should do there (if anything).

@janl (Member) commented Jun 21, 2021
@chewbranca exceptional sleuthing, thanks for putting all this together.

@chewbranca (Contributor Author)
I managed to come up with a sneaky test case to induce one of these failures and confirmed it fails in CI: #3633 (comment)

@chewbranca (Contributor Author)
Alright, I've merged #3633, closing this out.
