New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix incorrect index selection when sort specified #866

Merged
merged 2 commits into from Oct 12, 2017

Conversation

Projects
None yet
4 participants
@willholley
Member

willholley commented Oct 5, 2017

Overview

Mango has an apparently long-standing bug whereby if an index is deemed valid for sorting a query but is not valid according to mango_idx:is_usable, the query planner would fall back to _all_docs silently, therefore ignoring the sort order specified in the query.

This reverses the index selection logic so that the list of usable indexes is generated prior to filtering them based on the sort order specified in the query.

Similar logic is applied to use_index, allowing us to generate a more specific error message when a user specifies an index which isn't valid for the current
selector.

This fix exposes that the change introduced in #816 may cause existing queries with sort fields to fail (indeed, several queries in our test suite needed fixing), so we need to consider whether this is a severe enough breaking change to warrant a major version bump.

Testing recommendations

Run the test suite. Test that you can run Mango queries with sort criteria.

Related Pull Requests

#816

Checklist

  • Code is written and works correctly;
  • Changes are covered by tests;
  • Documentation reflects the changes;
@garrensmith

I've tested and it looks good to me

Show outdated Hide outdated src/mango/src/mango_idx.erl
@@ -222,6 +222,20 @@ def test_sort(self):
docs2 = list(reversed(sorted(docs1, key=lambda d: d["age"])))
assert docs1 is not docs2 and docs1 == docs2
def test_sort_desc_complex(self):

This comment has been minimized.

@garrensmith

garrensmith Oct 5, 2017

Member

Could you add a test without the manager: $exists: true and make sure it returns an error

@garrensmith

garrensmith Oct 5, 2017

Member

Could you add a test without the manager: $exists: true and make sure it returns an error

@garrensmith

LGTM

@tonysun83

This comment has been minimized.

Show comment
Hide comment
@tonysun83

tonysun83 Oct 5, 2017

Contributor

This is a tricky spot. I think the situation where the user would complain about backwards compatibility is as follows:

  1. User had a sort specified on one field a
  2. User had an index created for fields : [a, b]
  3. User Searched on a and incorrectly got back documents which only included docs with both a and b.
  4. User gets a list of sorted docs returned, but missing some documents. User doesn't complain because the docs look sorted.

In the scenario where we ignore sort order and used _all_docs, I would assume the user would have already complained that their documents are not sorted.

Perhaps we should improve the error messages returned?

Contributor

tonysun83 commented Oct 5, 2017

This is a tricky spot. I think the situation where the user would complain about backwards compatibility is as follows:

  1. User had a sort specified on one field a
  2. User had an index created for fields : [a, b]
  3. User Searched on a and incorrectly got back documents which only included docs with both a and b.
  4. User gets a list of sorted docs returned, but missing some documents. User doesn't complain because the docs look sorted.

In the scenario where we ignore sort order and used _all_docs, I would assume the user would have already complained that their documents are not sorted.

Perhaps we should improve the error messages returned?

@willholley

This comment has been minimized.

Show comment
Hide comment
@willholley

willholley Oct 6, 2017

Member

@tonysun83 this PR should improve the error message in the case where a user specifies an index with use_index that is no longer deemed valid. We should have a corresponding documentation PR which explains the error messages and how to address them.

Member

willholley commented Oct 6, 2017

@tonysun83 this PR should improve the error message in the case where a user specifies an index with use_index that is no longer deemed valid. We should have a corresponding documentation PR which explains the error messages and how to address them.

@willholley

This comment has been minimized.

Show comment
Hide comment
@willholley

willholley Oct 6, 2017

Member

@tonysun83 If a sort is specified, perhaps it's reasonable to assume that the user expects the field to exist in all documents that are returned. To preserve existing behaviour, we could implicitly add an {"$exists": true} predicate for each sort field to the selector.

This would allow us to safely validate the selector/index combination against is_usable whilst preserving existing behaviour as much as possible.

Member

willholley commented Oct 6, 2017

@tonysun83 If a sort is specified, perhaps it's reasonable to assume that the user expects the field to exist in all documents that are returned. To preserve existing behaviour, we could implicitly add an {"$exists": true} predicate for each sort field to the selector.

This would allow us to safely validate the selector/index combination against is_usable whilst preserving existing behaviour as much as possible.

@tonysun83

This comment has been minimized.

Show comment
Hide comment
@tonysun83

tonysun83 Oct 6, 2017

Contributor

@willholley : I considered appending {"$exists":true} for backwards compatibility. We've done it in the past when we appended {"$gt": null} to do the _all_docs trick for users. If we do this though, users would not see an error message, but they could still be getting incorrect results back as we're forcing the predicate fields to exist. I.e

Index: [a, b, c]
Query: {"c": "foo"}
Sort:[a, b]
New modified becomes: {"c": "foo", {"a":{"$exists": true}}, {"b":{"$exists": true}}}

It's a tradeoff: let their apps continue to run with possible incorrect data or let it break and force them to possibly re-index and modify their app. I'd like more opinions @garrensmith @davisp ?

Contributor

tonysun83 commented Oct 6, 2017

@willholley : I considered appending {"$exists":true} for backwards compatibility. We've done it in the past when we appended {"$gt": null} to do the _all_docs trick for users. If we do this though, users would not see an error message, but they could still be getting incorrect results back as we're forcing the predicate fields to exist. I.e

Index: [a, b, c]
Query: {"c": "foo"}
Sort:[a, b]
New modified becomes: {"c": "foo", {"a":{"$exists": true}}, {"b":{"$exists": true}}}

It's a tradeoff: let their apps continue to run with possible incorrect data or let it break and force them to possibly re-index and modify their app. I'd like more opinions @garrensmith @davisp ?

@garrensmith

This comment has been minimized.

Show comment
Hide comment
@garrensmith

garrensmith Oct 10, 2017

Member

Adding $exists: true is a nice idea. +1

Member

garrensmith commented Oct 10, 2017

Adding $exists: true is a nice idea. +1

@davisp

Overall the change looks good. I can't figure out norm_deduplicate though. I assume that's to try and deduplicate the $exists check which a) I'm not sure is a requirement, and b) seems to be unused near as I can tell.

Also there's a bug in has_use_index I noted below.

Show outdated Hide outdated src/mango/src/mango_idx.erl
Show outdated Hide outdated src/mango/src/mango_selector.erl
Show outdated Hide outdated src/mango/src/mango_selector.erl
@davisp

This comment has been minimized.

Show comment
Hide comment
@davisp

davisp Oct 10, 2017

Member

Couple things to unpack in the discussion:

First, for #816 I think that's actually the opposite, its not whether it needs a new version bump but how many versions should be be back porting this fix to. A database that gives bad answers to queries is more a bug that needs to be patched backwards, not something we cut the cord on and just bump version numbers.

There is of course the issue around notifying users that a pending change may break their application (technically fix, but users have a different perspective :), so this is definitely something we should talk about. We'll also want to update the documentation as we've long said that using column prefixes was a suggestion for sharing indexes so that'll need to be fixed.

For @tonysun83's example of this changing a query from something that uses _all_docs to something that uses an existing index I don't think that's an issue. Originally I would have ixnayed it because of the lack of range restriction on A and B but we've already relaxed that by having an _all_docs default. Also the behavior here is different than the issue introduced by #816 in that previously Tony's example would have been an incorrect sort returned where the new version changes that (ie, same set of docs would be returned both places, just now its sorted). Which I think is quite a bit different than subtley different sets of documents returned. Also, if a user specifies a sort and isn't detecting the wrong sort order based on the _all_docs fall back then I'm failing to see how they'd be relying on that sort, and thus fixing it seems like not an issue.

For #816 with the _all_docs fallback is that also not more of a concern about performance rather than an app that stops working? Ie, all queries will continue to work and be "fixed" but some unknown number of queries will suddenly switch to using _all_docs and become much slower? Granted suddenly slower performance that lasts until an index can be built isn't nothing either.

Member

davisp commented Oct 10, 2017

Couple things to unpack in the discussion:

First, for #816 I think that's actually the opposite, its not whether it needs a new version bump but how many versions should be be back porting this fix to. A database that gives bad answers to queries is more a bug that needs to be patched backwards, not something we cut the cord on and just bump version numbers.

There is of course the issue around notifying users that a pending change may break their application (technically fix, but users have a different perspective :), so this is definitely something we should talk about. We'll also want to update the documentation as we've long said that using column prefixes was a suggestion for sharing indexes so that'll need to be fixed.

For @tonysun83's example of this changing a query from something that uses _all_docs to something that uses an existing index I don't think that's an issue. Originally I would have ixnayed it because of the lack of range restriction on A and B but we've already relaxed that by having an _all_docs default. Also the behavior here is different than the issue introduced by #816 in that previously Tony's example would have been an incorrect sort returned where the new version changes that (ie, same set of docs would be returned both places, just now its sorted). Which I think is quite a bit different than subtley different sets of documents returned. Also, if a user specifies a sort and isn't detecting the wrong sort order based on the _all_docs fall back then I'm failing to see how they'd be relying on that sort, and thus fixing it seems like not an issue.

For #816 with the _all_docs fallback is that also not more of a concern about performance rather than an app that stops working? Ie, all queries will continue to work and be "fixed" but some unknown number of queries will suddenly switch to using _all_docs and become much slower? Granted suddenly slower performance that lasts until an index can be built isn't nothing either.

@willholley

This comment has been minimized.

Show comment
Hide comment
@willholley

willholley Oct 12, 2017

Member

I've pushed a new fix which doesn't require rewriting the selector. Instead, we just include sort fields when considering whether a JSON index provides coverage for the selector. This seems a lot simpler and, unlike the previous implementation, doesn't impact the validity of other index types (i.e. text indexes).

Member

willholley commented Oct 12, 2017

I've pushed a new fix which doesn't require rewriting the selector. Instead, we just include sort fields when considering whether a JSON index provides coverage for the selector. This seems a lot simpler and, unlike the previous implementation, doesn't impact the validity of other index types (i.e. text indexes).

@davisp

Looks good other than I think there's some dead code left over.

Show outdated Hide outdated src/mango/src/mango_selector.erl

willholley added some commits Oct 5, 2017

Fix incorrect index selection when sort specified
Mango has an apparently long-standing bug whereby if an
index is deemed valid for sorting a query but is not valid
according to mango_idx:is_usable, the query planner would
fall back to _all_docs silently, therefore ignoring the
sort order specified in the query.

This reverses the index selection logic so that the
list of usable indexes is generated prior to filtering
them based on the sort order specified in the query.

Similar logic is applied to use_index, allowing us
to generate a more specific error message when a user
specifies an index which isn't valid for the current
selector.
Make JSON indexes usable if covered by sort fields
It seems safe to assume that if a user specifies that
results should be sorted by a field, that field needs to exist
(but could be null) in the results returned.

In #816 we can only use an index if all its columns are
required to exist in the selector, so this improves
compatibility with the old behaviour by allowing sort
fields to be included in the index coverage check for
JSON indexes.

@willholley willholley merged commit d0445f5 into apache:master Oct 12, 2017

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

wohali added a commit that referenced this pull request Oct 19, 2017

Fix incorrect index selection when sort specified (#866)
It seems safe to assume that if a user specifies that
results should be sorted by a field, that field needs to exist
(but could be null) in the results returned.

In #816 we can only use an index if all its columns are
required to exist in the selector, so this improves
compatibility with the old behaviour by allowing sort
fields to be included in the index coverage check for
JSON indexes.

iilyak pushed a commit to cloudant/couchdb that referenced this pull request Nov 6, 2017

iilyak pushed a commit to cloudant/couchdb that referenced this pull request Nov 13, 2017

willholley added a commit to willholley/couchdb that referenced this pull request May 22, 2018

Fix incorrect index selection when sort specified (#866)
It seems safe to assume that if a user specifies that
results should be sorted by a field, that field needs to exist
(but could be null) in the results returned.

In #816 we can only use an index if all its columns are
required to exist in the selector, so this improves
compatibility with the old behaviour by allowing sort
fields to be included in the index coverage check for
JSON indexes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment