Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix incorrect index selection when sort specified #866

Merged
merged 2 commits into from
Oct 12, 2017

Conversation

willholley
Copy link
Member

@willholley willholley commented Oct 5, 2017

Overview

Mango has an apparently long-standing bug whereby if an index is deemed valid for sorting a query but is not valid according to mango_idx:is_usable, the query planner would fall back to _all_docs silently, therefore ignoring the sort order specified in the query.

This reverses the index selection logic so that the list of usable indexes is generated prior to filtering them based on the sort order specified in the query.

Similar logic is applied to use_index, allowing us to generate a more specific error message when a user specifies an index which isn't valid for the current
selector.

This fix exposes that the change introduced in #816 may cause existing queries with sort fields to fail (indeed, several queries in our test suite needed fixing), so we need to consider whether this is a severe enough breaking change to warrant a major version bump.

Testing recommendations

Run the test suite. Test that you can run Mango queries with sort criteria.

Related Pull Requests

#816

Checklist

  • Code is written and works correctly;
  • Changes are covered by tests;
  • Documentation reflects the changes;

Copy link
Member

@garrensmith garrensmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tested and it looks good to me

% If a sort was specified we have to find an index that
% can satisfy the request.
has_use_index(Opts) ->
erlang:display(<<"sort opts">>),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

debugging

@@ -222,6 +222,20 @@ def test_sort(self):
docs2 = list(reversed(sorted(docs1, key=lambda d: d["age"])))
assert docs1 is not docs2 and docs1 == docs2

def test_sort_desc_complex(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a test without the manager: $exists: true and make sure it returns an error

@willholley willholley force-pushed the mango_fix_sort_index_selection branch from 891ab6d to 5cf449a Compare October 5, 2017 14:02
Copy link
Member

@garrensmith garrensmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM


UsableIndexes1 = case has_use_index(Opts) of
true ->
UsableFilteredIndexes = mango_cursor:maybe_filter_indexes_by_ddoc(UsableIndexes0, Opts),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe this can be written as:

case mango_cursor:maybe_filter_indexes_by_ddoc(UsableIndexes0, Opts) of
    [] -> ?MANGO_ERROR({no_usable_index, no_usable_index_matching_name});
    UsableFilteredIndexes -> UsableFilteredIndexes
end;

% can satisfy the request.
case has_sort(Opts) of
true ->
SortIndexes = mango_idx:for_sort(UsableIndexes1, Opts),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar stylist change like above:

case mango_idx:for_sort(UsableIndexes1, Opts) of
    [] -> ?MANGO_ERROR({no_usable_index, missing_sort_index}); 
    SortIndexes -> SortIndexes
end;

@tonysun83
Copy link
Contributor

tonysun83 commented Oct 5, 2017

This is a tricky spot. I think the situation where the user would complain about backwards compatibility is as follows:

  1. User had a sort specified on one field a
  2. User had an index created for fields : [a, b]
  3. User Searched on a and incorrectly got back documents which only included docs with both a and b.
  4. User gets a list of sorted docs returned, but missing some documents. User doesn't complain because the docs look sorted.

In the scenario where we ignore sort order and used _all_docs, I would assume the user would have already complained that their documents are not sorted.

Perhaps we should improve the error messages returned?

@willholley
Copy link
Member Author

@tonysun83 this PR should improve the error message in the case where a user specifies an index with use_index that is no longer deemed valid. We should have a corresponding documentation PR which explains the error messages and how to address them.

@willholley
Copy link
Member Author

@tonysun83 If a sort is specified, perhaps it's reasonable to assume that the user expects the field to exist in all documents that are returned. To preserve existing behaviour, we could implicitly add an {"$exists": true} predicate for each sort field to the selector.

This would allow us to safely validate the selector/index combination against is_usable whilst preserving existing behaviour as much as possible.

@tonysun83
Copy link
Contributor

tonysun83 commented Oct 6, 2017

@willholley : I considered appending {"$exists":true} for backwards compatibility. We've done it in the past when we appended {"$gt": null} to do the _all_docs trick for users. If we do this though, users would not see an error message, but they could still be getting incorrect results back as we're forcing the predicate fields to exist. I.e

Index: [a, b, c]
Query: {"c": "foo"}
Sort:[a, b]
New modified becomes: {"c": "foo", {"a":{"$exists": true}}, {"b":{"$exists": true}}}

It's a tradeoff: let their apps continue to run with possible incorrect data or let it break and force them to possibly re-index and modify their app. I'd like more opinions @garrensmith @davisp ?

@willholley willholley force-pushed the mango_fix_sort_index_selection branch from 44c78a9 to 0146ced Compare October 9, 2017 12:13
@garrensmith
Copy link
Member

Adding $exists: true is a nice idea. +1

@willholley willholley force-pushed the mango_fix_sort_index_selection branch from 0146ced to e411886 Compare October 10, 2017 14:01
Copy link
Member

@davisp davisp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall the change looks good. I can't figure out norm_deduplicate though. I assume that's to try and deduplicate the $exists check which a) I'm not sure is a requirement, and b) seems to be unused near as I can tell.

Also there's a bug in has_use_index I noted below.

% can satisfy the request.
has_use_index(Opts) ->
case lists:keyfind(sort, 1, Opts) of
{use_index, _} ->
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check will never return true as sort is not equal to use_index.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which means we're likely missing a test for this since travis returned green.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a test but this situation is handled elsewhere. I'll remove has_use_index and its usage.

@@ -398,6 +400,14 @@ negate({[{Field, Cond}]}) ->
{[{Field, negate(Cond)}]}.


norm_deduplicate({[{Op, [H|T]}]}) ->
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not seeing where this is called.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a step in the selector normalization - (line 40 above). The intention is to deduplicate predicates (particularly "$exists": true) but this isn't necessary for correctness - it just makes the normalized selector cleaner. Thinking about it again, I'll remove it here and will consider improving selector normalization in a different PR.

@@ -398,6 +400,14 @@ negate({[{Field, Cond}]}) ->
{[{Field, negate(Cond)}]}.


norm_deduplicate({[{Op, [H|T]}]}) ->
Arg = [H | [X || X <- norm_deduplicate(T), X /= H]],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This really hurts my brain to read. And not knowing what's supposed to call it I can't figure out if its correct.

@davisp
Copy link
Member

davisp commented Oct 10, 2017

Couple things to unpack in the discussion:

First, for #816 I think that's actually the opposite, its not whether it needs a new version bump but how many versions should be be back porting this fix to. A database that gives bad answers to queries is more a bug that needs to be patched backwards, not something we cut the cord on and just bump version numbers.

There is of course the issue around notifying users that a pending change may break their application (technically fix, but users have a different perspective :), so this is definitely something we should talk about. We'll also want to update the documentation as we've long said that using column prefixes was a suggestion for sharing indexes so that'll need to be fixed.

For @tonysun83's example of this changing a query from something that uses _all_docs to something that uses an existing index I don't think that's an issue. Originally I would have ixnayed it because of the lack of range restriction on A and B but we've already relaxed that by having an _all_docs default. Also the behavior here is different than the issue introduced by #816 in that previously Tony's example would have been an incorrect sort returned where the new version changes that (ie, same set of docs would be returned both places, just now its sorted). Which I think is quite a bit different than subtley different sets of documents returned. Also, if a user specifies a sort and isn't detecting the wrong sort order based on the _all_docs fall back then I'm failing to see how they'd be relying on that sort, and thus fixing it seems like not an issue.

For #816 with the _all_docs fallback is that also not more of a concern about performance rather than an app that stops working? Ie, all queries will continue to work and be "fixed" but some unknown number of queries will suddenly switch to using _all_docs and become much slower? Granted suddenly slower performance that lasts until an index can be built isn't nothing either.

@willholley willholley force-pushed the mango_fix_sort_index_selection branch 2 times, most recently from b494c44 to a06115d Compare October 12, 2017 10:00
@willholley
Copy link
Member Author

I've pushed a new fix which doesn't require rewriting the selector. Instead, we just include sort fields when considering whether a JSON index provides coverage for the selector. This seems a lot simpler and, unlike the previous implementation, doesn't impact the validity of other index types (i.e. text indexes).

@willholley willholley force-pushed the mango_fix_sort_index_selection branch 2 times, most recently from 595a234 to 2acaf40 Compare October 12, 2017 10:29
Copy link
Member

@davisp davisp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good other than I think there's some dead code left over.

@@ -16,7 +16,8 @@
-export([
normalize/1,
match/2,
has_required_fields/2
has_required_fields/2,
append_sort_predicate/2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This appears to be left over from the previous commits?

Mango has an apparently long-standing bug whereby if an
index is deemed valid for sorting a query but is not valid
according to mango_idx:is_usable, the query planner would
fall back to _all_docs silently, therefore ignoring the
sort order specified in the query.

This reverses the index selection logic so that the
list of usable indexes is generated prior to filtering
them based on the sort order specified in the query.

Similar logic is applied to use_index, allowing us
to generate a more specific error message when a user
specifies an index which isn't valid for the current
selector.
It seems safe to assume that if a user specifies that
results should be sorted by a field, that field needs to exist
(but could be null) in the results returned.

In apache#816 we can only use an index if all its columns are
required to exist in the selector, so this improves
compatibility with the old behaviour by allowing sort
fields to be included in the index coverage check for
JSON indexes.
@willholley willholley force-pushed the mango_fix_sort_index_selection branch from 2acaf40 to 235b68a Compare October 12, 2017 15:28
@willholley willholley merged commit d0445f5 into apache:master Oct 12, 2017
wohali pushed a commit that referenced this pull request Oct 19, 2017
It seems safe to assume that if a user specifies that
results should be sorted by a field, that field needs to exist
(but could be null) in the results returned.

In #816 we can only use an index if all its columns are
required to exist in the selector, so this improves
compatibility with the old behaviour by allowing sort
fields to be included in the index coverage check for
JSON indexes.
iilyak added a commit to cloudant/couchdb that referenced this pull request Nov 6, 2017
iilyak added a commit to cloudant/couchdb that referenced this pull request Nov 13, 2017
willholley added a commit to willholley/couchdb that referenced this pull request May 22, 2018
It seems safe to assume that if a user specifies that
results should be sorted by a field, that field needs to exist
(but could be null) in the results returned.

In apache#816 we can only use an index if all its columns are
required to exist in the selector, so this improves
compatibility with the old behaviour by allowing sort
fields to be included in the index coverage check for
JSON indexes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants