Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(mango): prevent occasional duplication of paginated text results #4782

Merged
merged 3 commits into from
Oct 5, 2023

Conversation

pgj
Copy link
Contributor

@pgj pgj commented Sep 28, 2023

When an interleaved update happens to a text index while it is being queried, the affected documents might appear duplicated in the collected results. That is because search results are paginated where the boundaries might move due to the updates and the document might show up once more on subsequent queries.

Mitigate the situation by tracking and losing duplicated documents while the results are being streamed to the user.

In addition to that, the mango unit test coverage has increased:

  • Test cases: 196 ⟶ 216
  • mango_cursor_text: 0% ⟶ 100%
  • Total: 55% ⟶ 62%

Testing recommendations

make eunit apps=mango
make mango-test MANGO_TEST_OPTS="24-text-paginated-test"

Note that mango-test requires a running Clouseau instance that is hooked up, while eunit does not.

The change would remove duplicated documents from the response, which users will notice but ideally it should not be considered a breaking change but a welcome fix. There shall be no data loss.

Checklist

  • Code is written and works correctly
  • Changes are covered by tests

@pgj pgj force-pushed the fix/mango/duplicated-text-results branch from 7770d75 to 6a6d9c0 Compare September 28, 2023 15:36
@rnewson
Copy link
Member

rnewson commented Sep 29, 2023

The problem with search-backed mango queries is the execute/1 function that slurps up all the matches in a loop before passing them over for processing. We should fix that. If we do, then there will be no need for documents_seen as we can de-dupe as we iterate (and, critically, terminate the loop early once we have accumulated enough hits).

@rnewson
Copy link
Member

rnewson commented Sep 29, 2023

ignore last comment. re-reading this and happy that execute/1, execute/3, handle_hit/3 exit when we've seen Limit number of rows.

@pgj pgj force-pushed the fix/mango/duplicated-text-results branch 5 times, most recently from cc38c8f to c43f45b Compare October 2, 2023 15:27
@pgj pgj marked this pull request as ready for review October 2, 2023 16:56
Copy link
Contributor

@nickva nickva left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding extra tests and types, @pgj!

I did notice however that now dialyzer complains more compared to main:

=> mango (dialyze)
src/mango_crud.erl:103:1: Function explain/3 has no local return
src/mango_cursor.erl:109:1: Function extract_candidate_indexes/1 has no local return
src/mango_cursor.erl:168:9: The call mango_cursor:tag_elems
         (#{'ranking' := _, 'usable' := 'false'},
          NotUsableIndexes :: sets:set(_)) breaks the contract
          (properties(), sets:set(#idx{})) -> [{#idx{}, properties()}]
src/mango_cursor.erl:260:1: Function extract_selector_hints/1 will never be called
src/mango_cursor.erl:304:1: Function explain/1 has no local return
src/mango_cursor_text.erl:79:1: Function explain/1 has no local return
src/mango_cursor_text.erl:86:41: The call mango_cursor_text:get_partition
         (Opts :: any(),
          'null') breaks the contract
          (Options, Partition) -> Partition
             when Partition :: partition(), Options :: cursor_options()
src/mango_cursor_text.erl:93:1: Function execute/3 has no local return
src/mango_cursor_text.erl:106:41: The call mango_cursor_text:get_partition
         (Opts :: any(),
          'nil') breaks the contract
          (Options, Partition) -> Partition
             when Partition :: partition(), Options :: cursor_options()
src/mango_cursor_text.erl:153:1: Function execute/1 will never be called
src/mango_cursor_text.erl:171:1: Function search_docs/1 will never be called
src/mango_cursor_text.erl:187:1: Function handle_hits/2 will never be called
src/mango_cursor_text.erl:195:1: Function handle_hit/3 will never be called
src/mango_cursor_text.erl:242:1: Function apply_user_fun/2 will never be called
src/mango_cursor_text.erl:263:1: Function sort_query/2 will never be called
src/mango_cursor_text.erl:288:2: Invalid type specification for function mango_cursor_text:get_partition/2. The success typing is
          (_, 'nil' | 'null') -> any()
src/mango_cursor_text.erl:300:1: Function get_bookmark/1 will never be called
src/mango_cursor_text.erl:309:1: Function update_bookmark/2 will never be called
src/mango_cursor_text.erl:318:1: Function pack_bookmark/1 will never be called
src/mango_cursor_text.erl:343:1: Function ddocid/1 will never be called
src/mango_cursor_text.erl:352:1: Function update_query_args/1 will never be called
src/mango_cursor_text.erl:363:1: Function get_limit/1 will never be called
src/mango_cursor_text.erl:374:1: Function get_json_docs/2 will never be called
src/mango_idx.erl:56:2: Invalid type specification for function mango_idx:get_usable_indexes/4. The success typing is
          (_, _, maybe_improper_list(), _) ->
             {[any()],
              #{'all_indexes' := sets:set(_),
                'global_indexes' := sets:set(_),
                'partition_indexes' := sets:set(_),
                'usability_map' := [any()],
                'usable_indexes' := sets:set(_)}}

vs

% make dialyze apps=mango
==> mango (build-plt)
WARN:  ''build-plt'' command does not apply to directory /Users/nvatama/asf/rel
WARN:  ''build-plt'' command does not apply to directory /Users/nvatama/asf
==> mango (dialyze)
src/mango_crud.erl:103:1: Function explain/3 has no local return
src/mango_cursor.erl:109:1: Function extract_candidate_indexes/1 has no local return
src/mango_cursor.erl:168:9: The call mango_cursor:tag_elems
         (#{'ranking' := _, 'usable' := 'false'},
          NotUsableIndexes :: sets:set(_)) breaks the contract
          (properties(), sets:set(#idx{})) -> [{#idx{}, properties()}]
src/mango_cursor.erl:260:1: Function extract_selector_hints/1 will never be called
src/mango_cursor.erl:304:1: Function explain/1 has no local return
src/mango_idx.erl:56:2: Invalid type specification for function mango_idx:get_usable_indexes/4. The success typing is
          (_, _, maybe_improper_list(), _) ->
             {[any()],
              #{'all_indexes' := sets:set(_),
                'global_indexes' := sets:set(_),
                'partition_indexes' := sets:set(_),
                'usability_map' := [any()],
                'usable_indexes' := sets:set(_)}}

Maybe some type specs need adjusting and/or it's discovered some genuine bugs. Unless the tweaks to the type specs are minimal to fix to make the dialyzer happy, perhaps the extra types and associated fixes should be in a separate PR.

Also noticed during DocId extraction we were not actually getting a DocId but instead a Doc with a single _id field. See a few extra comments about it inline.

@pgj pgj force-pushed the fix/mango/duplicated-text-results branch from 9ceb84c to ea4343e Compare October 3, 2023 12:17
@pgj
Copy link
Contributor Author

pgj commented Oct 3, 2023

Thanks @nickva for the comment on the type specifications. I forgot to check them with Dialyzer before marking the PR ready for review, although I believe most of them are logically right. Except for maybe the one about the partitions.

The series of "will never be called" and "has no local return" warnings are probably due to the number of no_return() types. I had to add them because of the frequent use of throw() in the control flow, which I do not consider idiomatic. I think I understand the reasons — throw() is a very convenient way to break out from loops, for example — but in Erlang we could do it better.

By taking your advice, I have removed the type specifications from this PR and I will create another one, possibly together with a refactor.

index = Index,
ranges = null,
trace = Trace,
selector = selector,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A selector here obviously doesn't matter and can be anything, however in general in tests, if possible it might be better use proper types even when they don't matter and when it's not too hard. So maybe a selector might be {[]}. skip should be perhaps 0 or some plausible integer like 53 etc. Don't have to change in the current PR just in general I think it's nicer to respect types.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided to go with this approach because the functions that receive these values will be mocked. Even if we would respect the types this way it does not matter much because the values of those types are not used for their original purpose — for example, the 0 integer becomes a simple token only. On the other hand, non-standard, out-of-type values can warrant that the mocked function was called with that specific instance and not by accident through some other functions.

Another and more important reason for using atoms in place of regular values is that if the code blows up somewhere while the test is running it means that given value was attempted for use while it should not have been. I admit that this might be a bit more nit-picky, but hey, why not to establish such guarantees in the tests or catch such potentially unintended changes?

Comment on lines +61 to +81
class Concurrently(object):
def __init__(self, thread, thread_args, start=True):
self.thread = threading.Thread(target=self.wrapper, args=(thread, thread_args))
self.return_value = None
if start:
self.start()

def wrapper(self, body, args):
self.return_value = body(*args)

def start(self):
self.thread.start()

def get_result(self):
self.thread.join()
return self.return_value

def join(self):
self.thread.join()


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A nice abstraction!

src/mango/test/mango.py Outdated Show resolved Hide resolved
Copy link
Contributor

@nickva nickva left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, Just noticed a few minor style issues in tests.

We should also get @rnewson's opinion before merging. He is more familiar with this area of the codebase.

@pgj pgj force-pushed the fix/mango/duplicated-text-results branch from ea4343e to 445a992 Compare October 3, 2023 21:23
@pgj pgj force-pushed the fix/mango/duplicated-text-results branch from 445a992 to 08190b5 Compare October 3, 2023 21:38
Copy link
Member

@rnewson rnewson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The patch is neatly done, bu I have a doubt arond bookmarks.

The Hit at line 176 is used to update the bookmark but it might be ignored at line 186 (as its a duplicate).

Prior to this PR the incoming Hit its always right to update the bookmark, even if the selector doesn't match, as we know we don't want search results earlier than that (they'll fail the selector next time).

With this PR I worry that we'll regress the bookmark if a duplication occurs (a doc that appeared on page 1 that also appears on page 2 due to an update while paginating).

@pgj
Copy link
Contributor Author

pgj commented Oct 4, 2023

@rnewson I have checked the use of mango_cursor_text:update_bookmark/2 and my impression is that the bookmark will be kept updated continuously. The duplication check happens after that. (*) It is not put behind any condition so why would it be called less times? On the other hand, it might be called more times than necessary and the invocation could be pushed down as follows:

--- a/src/mango/src/mango_cursor_text.erl
+++ b/src/mango/src/mango_cursor_text.erl
@@ -184,17 +184,17 @@ handle_hit(CAcc0, Sort, Doc) ->
         execution_stats = Stats,
         documents_seen = Seen
     } = CAcc0,
-    CAcc1 = update_bookmark(CAcc0, Sort),
     Stats1 = mango_execution_stats:incr_docs_examined(Stats),
     couch_stats:increment_counter([mango, docs_examined]),
-    CAcc2 = CAcc1#cacc{execution_stats = Stats1},
-    case mango_selector:match(CAcc2#cacc.selector, Doc) of
+    CAcc1 = CAcc0#cacc{execution_stats = Stats1},
+    case mango_selector:match(CAcc1#cacc.selector, Doc) of
         true ->
             DocId = mango_doc:get_field(Doc, <<"_id">>),
             case sets:is_element(DocId, Seen) of
                 true ->
-                    CAcc2;
+                    CAcc1;
                 false ->
+                    CAcc2 = update_bookmark(CAcc1, Sort),
                     CAcc3 = CAcc2#cacc{
                         documents_seen = sets:add_element(DocId, Seen)
                     },
@@ -216,7 +216,7 @@ handle_hit(CAcc0, Sort, Doc) ->
                     end
             end;
         false ->
-            CAcc2
+            CAcc1
     end.

Did you mean something like this?

(*) Except for Limit == 1 but that is how the code originally worked, and it is documented in the comment.

@rnewson
Copy link
Member

rnewson commented Oct 4, 2023

I was unclear. I am questioning whether we should still update the bookmark when we detect a duplicate or not.

with this PR we will. But I think we shouldn't?

@pgj
Copy link
Contributor Author

pgj commented Oct 4, 2023

Yeah, that sounds like a good idea to me. My diff above does that exactly. Thanks for the clarification.

@pgj pgj force-pushed the fix/mango/duplicated-text-results branch 2 times, most recently from 095451b to b4847f3 Compare October 4, 2023 12:37
@nickva
Copy link
Contributor

nickva commented Oct 4, 2023

It does seem wrong to update the bookmark with the duplicate IDs. We're effectively getting index results from a newer database snapshot than the one we opened in the request (the index got ahead of the database, so to speak). So, we wouldn't want to update the bookmark or do anything with the duplicate IDs except just ignore them.

If we reworked these indexes to store document IDs and update sequences, when we read those documents we would check that IndexSeq =< DbSeq and discard anything else. Then we wouldn't need to accumulate a Seen set. That however, is a larger rewrite with possibly backward compatibility implications.

@pgj
Copy link
Contributor Author

pgj commented Oct 4, 2023

Thanks @nickva for the comment, that is an important piece of information.

@rnewson
Copy link
Member

rnewson commented Oct 4, 2023

perhaps it would be prudent to record seq in nouveau as there are no "real" indexes yet, just in case we might find a use for it.

@pgj
Copy link
Contributor Author

pgj commented Oct 4, 2023

@rnewson in the scope of this PR? I am not that experienced with nouveau and I have no idea how to do that.

@rnewson
Copy link
Member

rnewson commented Oct 4, 2023

no, not in this PR. I will take care of that.

Copy link
Contributor

@nickva nickva left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

I think this may be the simplest way to eliminate duplicates in the result.

However, we may still get index updates from an index that is now ahead of the db snapshot we opened in the current request. In other words, we're only papering over the symptoms here, so to speak. Changing that would require keeping track of sequences which in the context of Clouseau is trickier since it involves backward compatibility clauses.

When an interleaved update happens to a `text` index while it is
being queried, the affected documents might appear duplicated in
the collected results.  That is because `search` results are
paginated where the boundaries might move due to the updates and
the document might show up once more on subsequent queries.

Mitigate the situation by tracking and losing duplicated documents
while the results are being streamed to the user.
This is a port of a fix applied to the `text` cursor where
occasionally duplicated documents were filtered out.  This happens
when moving between pages and an interleaved update is applied.
@pgj pgj force-pushed the fix/mango/duplicated-text-results branch from b4847f3 to 0d128cf Compare October 5, 2023 09:07
@pgj
Copy link
Contributor Author

pgj commented Oct 5, 2023

@rnewson are okay with the PR?

@rnewson
Copy link
Member

rnewson commented Oct 5, 2023

+1

@pgj pgj merged commit ce2607a into apache:main Oct 5, 2023
14 checks passed
@pgj pgj deleted the fix/mango/duplicated-text-results branch October 5, 2023 10:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants