fix(`mango`): prevent occasional duplication of paginated `text` results #4782

pgj · 2023-09-28T12:05:07Z

When an interleaved update happens to a text index while it is being queried, the affected documents might appear duplicated in the collected results. That is because search results are paginated where the boundaries might move due to the updates and the document might show up once more on subsequent queries.

Mitigate the situation by tracking and losing duplicated documents while the results are being streamed to the user.

In addition to that, the mango unit test coverage has increased:

Test cases: 196 ⟶ 216
mango_cursor_text: 0% ⟶ 100%
Total: 55% ⟶ 62%

Testing recommendations

make eunit apps=mango
make mango-test MANGO_TEST_OPTS="24-text-paginated-test"

Note that mango-test requires a running Clouseau instance that is hooked up, while eunit does not.

The change would remove duplicated documents from the response, which users will notice but ideally it should not be considered a breaking change but a welcome fix. There shall be no data loss.

Checklist

Code is written and works correctly
Changes are covered by tests

src/mango/src/mango_cursor_text.erl

rnewson · 2023-09-29T09:28:45Z

The problem with search-backed mango queries is the execute/1 function that slurps up all the matches in a loop before passing them over for processing. We should fix that. If we do, then there will be no need for documents_seen as we can de-dupe as we iterate (and, critically, terminate the loop early once we have accumulated enough hits).

rnewson · 2023-09-29T10:12:31Z

ignore last comment. re-reading this and happy that execute/1, execute/3, handle_hit/3 exit when we've seen Limit number of rows.

src/mango/src/mango_cursor_text.erl

src/mango/src/mango_cursor_nouveau.erl

src/mango/src/mango_cursor_text.erl

nickva

Thanks for adding extra tests and types, @pgj!

I did notice however that now dialyzer complains more compared to main:

=> mango (dialyze)
src/mango_crud.erl:103:1: Function explain/3 has no local return
src/mango_cursor.erl:109:1: Function extract_candidate_indexes/1 has no local return
src/mango_cursor.erl:168:9: The call mango_cursor:tag_elems
         (#{'ranking' := _, 'usable' := 'false'},
          NotUsableIndexes :: sets:set(_)) breaks the contract
          (properties(), sets:set(#idx{})) -> [{#idx{}, properties()}]
src/mango_cursor.erl:260:1: Function extract_selector_hints/1 will never be called
src/mango_cursor.erl:304:1: Function explain/1 has no local return
src/mango_cursor_text.erl:79:1: Function explain/1 has no local return
src/mango_cursor_text.erl:86:41: The call mango_cursor_text:get_partition
         (Opts :: any(),
          'null') breaks the contract
          (Options, Partition) -> Partition
             when Partition :: partition(), Options :: cursor_options()
src/mango_cursor_text.erl:93:1: Function execute/3 has no local return
src/mango_cursor_text.erl:106:41: The call mango_cursor_text:get_partition
         (Opts :: any(),
          'nil') breaks the contract
          (Options, Partition) -> Partition
             when Partition :: partition(), Options :: cursor_options()
src/mango_cursor_text.erl:153:1: Function execute/1 will never be called
src/mango_cursor_text.erl:171:1: Function search_docs/1 will never be called
src/mango_cursor_text.erl:187:1: Function handle_hits/2 will never be called
src/mango_cursor_text.erl:195:1: Function handle_hit/3 will never be called
src/mango_cursor_text.erl:242:1: Function apply_user_fun/2 will never be called
src/mango_cursor_text.erl:263:1: Function sort_query/2 will never be called
src/mango_cursor_text.erl:288:2: Invalid type specification for function mango_cursor_text:get_partition/2. The success typing is
          (_, 'nil' | 'null') -> any()
src/mango_cursor_text.erl:300:1: Function get_bookmark/1 will never be called
src/mango_cursor_text.erl:309:1: Function update_bookmark/2 will never be called
src/mango_cursor_text.erl:318:1: Function pack_bookmark/1 will never be called
src/mango_cursor_text.erl:343:1: Function ddocid/1 will never be called
src/mango_cursor_text.erl:352:1: Function update_query_args/1 will never be called
src/mango_cursor_text.erl:363:1: Function get_limit/1 will never be called
src/mango_cursor_text.erl:374:1: Function get_json_docs/2 will never be called
src/mango_idx.erl:56:2: Invalid type specification for function mango_idx:get_usable_indexes/4. The success typing is
          (_, _, maybe_improper_list(), _) ->
             {[any()],
              #{'all_indexes' := sets:set(_),
                'global_indexes' := sets:set(_),
                'partition_indexes' := sets:set(_),
                'usability_map' := [any()],
                'usable_indexes' := sets:set(_)}}

vs

% make dialyze apps=mango
==> mango (build-plt)
WARN:  ''build-plt'' command does not apply to directory /Users/nvatama/asf/rel
WARN:  ''build-plt'' command does not apply to directory /Users/nvatama/asf
==> mango (dialyze)
src/mango_crud.erl:103:1: Function explain/3 has no local return
src/mango_cursor.erl:109:1: Function extract_candidate_indexes/1 has no local return
src/mango_cursor.erl:168:9: The call mango_cursor:tag_elems
         (#{'ranking' := _, 'usable' := 'false'},
          NotUsableIndexes :: sets:set(_)) breaks the contract
          (properties(), sets:set(#idx{})) -> [{#idx{}, properties()}]
src/mango_cursor.erl:260:1: Function extract_selector_hints/1 will never be called
src/mango_cursor.erl:304:1: Function explain/1 has no local return
src/mango_idx.erl:56:2: Invalid type specification for function mango_idx:get_usable_indexes/4. The success typing is
          (_, _, maybe_improper_list(), _) ->
             {[any()],
              #{'all_indexes' := sets:set(_),
                'global_indexes' := sets:set(_),
                'partition_indexes' := sets:set(_),
                'usability_map' := [any()],
                'usable_indexes' := sets:set(_)}}

Maybe some type specs need adjusting and/or it's discovered some genuine bugs. Unless the tweaks to the type specs are minimal to fix to make the dialyzer happy, perhaps the extra types and associated fixes should be in a separate PR.

Also noticed during DocId extraction we were not actually getting a DocId but instead a Doc with a single _id field. See a few extra comments about it inline.

pgj · 2023-10-03T12:31:20Z

Thanks @nickva for the comment on the type specifications. I forgot to check them with Dialyzer before marking the PR ready for review, although I believe most of them are logically right. Except for maybe the one about the partitions.

The series of "will never be called" and "has no local return" warnings are probably due to the number of no_return() types. I had to add them because of the frequent use of throw() in the control flow, which I do not consider idiomatic. I think I understand the reasons — throw() is a very convenient way to break out from loops, for example — but in Erlang we could do it better.

By taking your advice, I have removed the type specifications from this PR and I will create another one, possibly together with a refactor.

nickva · 2023-10-03T18:43:40Z

src/mango/src/mango_cursor_text.erl

+        index = Index,
+        ranges = null,
+        trace = Trace,
+        selector = selector,


A selector here obviously doesn't matter and can be anything, however in general in tests, if possible it might be better use proper types even when they don't matter and when it's not too hard. So maybe a selector might be {[]}. skip should be perhaps 0 or some plausible integer like 53 etc. Don't have to change in the current PR just in general I think it's nicer to respect types.

I decided to go with this approach because the functions that receive these values will be mocked. Even if we would respect the types this way it does not matter much because the values of those types are not used for their original purpose — for example, the 0 integer becomes a simple token only. On the other hand, non-standard, out-of-type values can warrant that the mocked function was called with that specific instance and not by accident through some other functions.

Another and more important reason for using atoms in place of regular values is that if the code blows up somewhere while the test is running it means that given value was attempted for use while it should not have been. I admit that this might be a bit more nit-picky, but hey, why not to establish such guarantees in the tests or catch such potentially unintended changes?

src/mango/src/mango_cursor_text.erl

nickva · 2023-10-03T18:54:30Z

src/mango/test/mango.py

+class Concurrently(object):
+    def __init__(self, thread, thread_args, start=True):
+        self.thread = threading.Thread(target=self.wrapper, args=(thread, thread_args))
+        self.return_value = None
+        if start:
+            self.start()
+
+    def wrapper(self, body, args):
+        self.return_value = body(*args)
+
+    def start(self):
+        self.thread.start()
+
+    def get_result(self):
+        self.thread.join()
+        return self.return_value
+
+    def join(self):
+        self.thread.join()
+
+


A nice abstraction!

src/mango/test/mango.py

nickva

Looks great, Just noticed a few minor style issues in tests.

We should also get @rnewson's opinion before merging. He is more familiar with this area of the codebase.

rnewson

The patch is neatly done, bu I have a doubt arond bookmarks.

The Hit at line 176 is used to update the bookmark but it might be ignored at line 186 (as its a duplicate).

Prior to this PR the incoming Hit its always right to update the bookmark, even if the selector doesn't match, as we know we don't want search results earlier than that (they'll fail the selector next time).

With this PR I worry that we'll regress the bookmark if a duplication occurs (a doc that appeared on page 1 that also appears on page 2 due to an update while paginating).

pgj · 2023-10-04T10:38:07Z

@rnewson I have checked the use of mango_cursor_text:update_bookmark/2 and my impression is that the bookmark will be kept updated continuously. The duplication check happens after that. (*) It is not put behind any condition so why would it be called less times? On the other hand, it might be called more times than necessary and the invocation could be pushed down as follows:

--- a/src/mango/src/mango_cursor_text.erl
+++ b/src/mango/src/mango_cursor_text.erl
@@ -184,17 +184,17 @@ handle_hit(CAcc0, Sort, Doc) ->
         execution_stats = Stats,
         documents_seen = Seen
     } = CAcc0,
-    CAcc1 = update_bookmark(CAcc0, Sort),
     Stats1 = mango_execution_stats:incr_docs_examined(Stats),
     couch_stats:increment_counter([mango, docs_examined]),
-    CAcc2 = CAcc1#cacc{execution_stats = Stats1},
-    case mango_selector:match(CAcc2#cacc.selector, Doc) of
+    CAcc1 = CAcc0#cacc{execution_stats = Stats1},
+    case mango_selector:match(CAcc1#cacc.selector, Doc) of
         true ->
             DocId = mango_doc:get_field(Doc, <<"_id">>),
             case sets:is_element(DocId, Seen) of
                 true ->
-                    CAcc2;
+                    CAcc1;
                 false ->
+                    CAcc2 = update_bookmark(CAcc1, Sort),
                     CAcc3 = CAcc2#cacc{
                         documents_seen = sets:add_element(DocId, Seen)
                     },
@@ -216,7 +216,7 @@ handle_hit(CAcc0, Sort, Doc) ->
                     end
             end;
         false ->
-            CAcc2
+            CAcc1
     end.

Did you mean something like this?

(*) Except for Limit == 1 but that is how the code originally worked, and it is documented in the comment.

rnewson · 2023-10-04T10:40:28Z

I was unclear. I am questioning whether we should still update the bookmark when we detect a duplicate or not.

with this PR we will. But I think we shouldn't?

pgj · 2023-10-04T10:47:08Z

Yeah, that sounds like a good idea to me. My diff above does that exactly. Thanks for the clarification.

nickva · 2023-10-04T14:20:24Z

It does seem wrong to update the bookmark with the duplicate IDs. We're effectively getting index results from a newer database snapshot than the one we opened in the request (the index got ahead of the database, so to speak). So, we wouldn't want to update the bookmark or do anything with the duplicate IDs except just ignore them.

If we reworked these indexes to store document IDs and update sequences, when we read those documents we would check that IndexSeq =< DbSeq and discard anything else. Then we wouldn't need to accumulate a Seen set. That however, is a larger rewrite with possibly backward compatibility implications.

pgj · 2023-10-04T14:29:09Z

Thanks @nickva for the comment, that is an important piece of information.

rnewson · 2023-10-04T15:03:44Z

perhaps it would be prudent to record seq in nouveau as there are no "real" indexes yet, just in case we might find a use for it.

pgj · 2023-10-04T15:16:50Z

@rnewson in the scope of this PR? I am not that experienced with nouveau and I have no idea how to do that.

rnewson · 2023-10-04T15:36:49Z

no, not in this PR. I will take care of that.

nickva

+1

I think this may be the simplest way to eliminate duplicates in the result.

However, we may still get index updates from an index that is now ahead of the db snapshot we opened in the current request. In other words, we're only papering over the symptoms here, so to speak. Changing that would require keeping track of sequences which in the context of Clouseau is trickier since it involves backward compatibility clauses.

When an interleaved update happens to a `text` index while it is being queried, the affected documents might appear duplicated in the collected results. That is because `search` results are paginated where the boundaries might move due to the updates and the document might show up once more on subsequent queries. Mitigate the situation by tracking and losing duplicated documents while the results are being streamed to the user.

This is a port of a fix applied to the `text` cursor where occasionally duplicated documents were filtered out. This happens when moving between pages and an interleaved update is applied.

pgj · 2023-10-05T09:22:55Z

@rnewson are okay with the PR?

rnewson · 2023-10-05T09:52:55Z

+1

nickva reviewed Sep 28, 2023

View reviewed changes