Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mango: add $beginsWith operator #4810

Merged
merged 11 commits into from Oct 30, 2023
Merged

mango: add $beginsWith operator #4810

merged 11 commits into from Oct 30, 2023

Conversation

willholley
Copy link
Member

Overview

Adds a $beginsWith operator to selectors, with json and text index support. This is a compliment / precursor to optimising $regex support as proposed in #4776.

For json indexes, a $beginsWith operator translates into a key range query, as is common practice for _view queries. For example, to find all rows with a key beginning with "W", we can use a range start_key="W", end_key="W\ufff0". Given Mango uses compound keys, this is slightly more complex in practice, but the idea is the same. As with other range operators ($gt, $gte, etc), $beginsWith can be used in combination with equality operators and result sorting but must result in a contiguous key range. That is, a range of start_key=[10, "W"], end_key=[10, "W\ufff0", {}] would be valid, but start_key=["W", 10], end_key=["W\ufff0", 10, {}] would not, because the second element of the key may result in a non-contiguous range.

For text indexes, $beginsWith translates to a Lucene query on the specified field of W*.

If a non-string operand is provided to $beginsWith, the request will fail with a 400 / invalid_operator error.

Testing recommendations

I've only tested this with json indexes so far because the dev container doesn't have text index support.

A basic eunit test is added for the new selector support. You can run these using make eunit apps=mango.

Integration tests are added to the Mango test suite, which can be run via make mango-test.

Related Issues or Pull Requests

n/a

Checklist

  • Code is written and works correctly
  • Changes are covered by tests
  • Any new configurable parameters are documented in rel/overlay/etc/default.ini
  • Documentation changes were made in the src/docs folder
  • Documentation changes were backported (separated PR) to affected branches

src/mango/src/mango_selector.erl Outdated Show resolved Hide resolved
src/mango/src/mango_selector_text.erl Outdated Show resolved Hide resolved
src/mango/test/25-beginswith-test.py Outdated Show resolved Hide resolved
src/mango/test/25-beginswith-test.py Outdated Show resolved Hide resolved
src/mango/test/25-beginswith-test.py Show resolved Hide resolved
@pgj
Copy link
Contributor

pgj commented Oct 18, 2023

Great work! I think it would be worth to cover the text indexes with tests in this PR otherwise it can easily go forgotten.

@pgj
Copy link
Contributor

pgj commented Oct 21, 2023

I miss the parts related to the documentation.

@pgj
Copy link
Contributor

pgj commented Oct 21, 2023

Please add text tests for the integration test suite. I am fine if they do not fully work (like a draft) I can test them and provide feedback. For your information, I am currently working on introducing support for testing with Clouseau in the PRs -- you can also wait for that.

@willholley
Copy link
Member Author

Please add text tests for the integration test suite. I am fine if they do not fully work (like a draft) I can test them and provide feedback. For your information, I am currently working on introducing support for testing with Clouseau in the PRs -- you can also wait for that.

It would certainly be helpful to have Clouseau in the CI and, presumably, then in the dev container which uses the same CI images.

@willholley
Copy link
Member Author

Please add text tests for the integration test suite. I am fine if they do not fully work (like a draft) I can test them and provide feedback. For your information, I am currently working on introducing support for testing with Clouseau in the PRs -- you can also wait for that.

The 03-operator-test.py updates include integration tests for text indexes that check the result is as expected. I didn't add integration tests that check the args passed to Lucene, but these are covered by eunit tests (it's much simpler than the mrargs parameters which involve a range scan).

@pgj
Copy link
Contributor

pgj commented Oct 25, 2023

Okay. Right now there is a failing test case (for the text indexes):

======================================================================
FAIL: test_beginswith (03-operator-test.OperatorTextTests.test_beginswith)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/gaborpali/projects/github/couchdb/src/mango/test/03-operator-test.py", line 146, in test_beginswith
    self.assertEqual(len(docs), 2)
AssertionError: 0 != 2

----------------------------------------------------------------------
Ran 384 tests in 37.973s

FAILED (failures=1, skipped=1)

Adds a `$beginsWith` operator to selectors, with json and text index
support. This is a compliment / precursor to optimising `$regex`
support as proposed in #4776.

For `json` indexes, a $beginsWith operator translates into a key
range query, as is common practice for _view queries. For example,
to find all rows with a key beginning with "W", we can use a range
`start_key="W", end_key="W\ufff0"`. Given Mango uses compound keys,
this is slightly more complex in practice, but the idea is the same.
As with other range operators (`$gt`, `$gte`, etc), `$beginsWith`
can be used in combination with equality operators and result sorting
but must result in a contiguous key range. That is, a range of
`start_key=[10, "W"], end_key=[10, "W\ufff0", {}]` would be valid,
but `start_key=["W", 10], end_key=["W\ufff0", 10, {}]` would not,
because the second element of the key may result in a non-contiguous
range.

For text indexes, `$beginsWith` translates to a Lucene query on
the specified field of `W*`.

If a non-string operand is provided to `$beginsWith`, the request will
fail with a 400 / `invalid_operator` error.
@willholley willholley force-pushed the mango-beginswith branch 2 times, most recently from 6f7f750 to cdfb26d Compare October 26, 2023 13:30
@willholley
Copy link
Member Author

@pgj I think I've addressed all the comments, though will need to squash the commits

src/docs/src/api/database/find.rst Outdated Show resolved Hide resolved
src/mango/test/03-operator-test.py Outdated Show resolved Hide resolved
@pgj
Copy link
Contributor

pgj commented Oct 27, 2023

In the documentation, 1.3.6.1.1. Selectors Basics should also talk about $beginsWith?

You can create more complex selector expressions by combining operators.
For best performance, it is best to combine 'combination' or
'array logical' operators, such as ``$regex``, with an equality
operators such as ``$eq``, ``$gt``, ``$gte``, ``$lt``, and ``$lte``
(but not ``$ne``). For more information about creating complex
selector expressions, see :ref:`creating selector expressions
<find/expressions>`.

@willholley willholley merged commit 26cfe53 into main Oct 30, 2023
15 checks passed
@willholley willholley deleted the mango-beginswith branch October 30, 2023 18:21
@pgj
Copy link
Contributor

pgj commented Oct 30, 2023

+1

@big-r81
Copy link
Contributor

big-r81 commented Oct 30, 2023

@willholley

Typically, CouchDB uses the Review-Then-Commit (RTC) model of code collaboration, not Commit-Then-Review.

😜

@pgj
Copy link
Contributor

pgj commented Oct 30, 2023

@big-r81 we have been in contact with @willholley about the PR out-of-band. On merging the PR he was aware that I did not have any more pending comments on the contents. I wanted to approve it formally but I was a bit late to the party hence I added the comment with the +1 to signal that I agree with the change.

@big-r81
Copy link
Contributor

big-r81 commented Oct 30, 2023

@pgj It was only a small friendly reminder 😉 btw, this reminds me of #4772...

@pgj
Copy link
Contributor

pgj commented Oct 30, 2023

Hrm... Something is wrong with the merged version. CI says the following:

[2023-10-30T18:57:57.544Z] FAIL: test_basic (25-beginswith-test.BeginsWithOperator)
[2023-10-30T18:57:57.544Z] ----------------------------------------------------------------------
[2023-10-30T18:57:57.544Z] Traceback (most recent call last):
[2023-10-30T18:57:57.544Z]   File "/home/jenkins/workspace/kins-cm1_FullPlatformMatrix_main/centos7/build/src/mango/test/25-beginswith-test.py", line 49, in test_basic
[2023-10-30T18:57:57.544Z]     self.assertEqual(len(docs), 2)
[2023-10-30T18:57:57.544Z] AssertionError: 0 != 2
[2023-10-30T18:57:57.544Z] 
[2023-10-30T18:57:57.544Z] ======================================================================
[2023-10-30T18:57:57.544Z] FAIL: test_compound_key (25-beginswith-test.BeginsWithOperator)
[2023-10-30T18:57:57.544Z] ----------------------------------------------------------------------
[2023-10-30T18:57:57.544Z] Traceback (most recent call last):
[2023-10-30T18:57:57.544Z]   File "/home/jenkins/workspace/kins-cm1_FullPlatformMatrix_main/centos7/build/src/mango/test/25-beginswith-test.py", line 68, in test_compound_key
[2023-10-30T18:57:57.544Z]     self.assertEqual(len(docs), 1)
[2023-10-30T18:57:57.544Z] AssertionError: 0 != 1

@pgj
Copy link
Contributor

pgj commented Oct 30, 2023

Locally (with Dreyfus/Clouseau enabled) make mango-test works for me.

@nickva
Copy link
Contributor

nickva commented Oct 31, 2023

It probably a flaky test. We should investigate as we've been trying to squash some of those lately. Is the race condition fairly obvious? Usually we can add a retry loop.

@pgj
Copy link
Contributor

pgj commented Oct 31, 2023

It is not flaky, these errors happen every time -- but only on CentOS 7. I was able to reproduce the issue locally with the help of a container created from the apache/couchdbci-centos:7-erlang-24.3.4.10 Docker image.

@nickva
Copy link
Contributor

nickva commented Oct 31, 2023

It is not flaky, these errors happen every time -- but only on CentOS 7.

Oh very interesting, definitely worth investigating. One thing that immediately jumps to mind is that CentOS 7 has an older libicu library which may have different collation rules. Any end/start (min/max) marker perhaps might collate differently there.

@willholley
Copy link
Member Author

@big-r81 oops on the premature merge - I'm too used to working with repos with branch protection enabled in GH and didn't think to manually check for the +1.

@willholley
Copy link
Member Author

curious that the failure didn't show in the PR checks - it definitely smells like a collation issue and likely an issue with the special characters used in the tests rather than the implementation of $beginsWith

@big-r81
Copy link
Contributor

big-r81 commented Oct 31, 2023

@willholley np, only a reminder. We will add this review-branch-protection in the future. At the moment, we will get the "green" merge button, if the CI runs through...

The failure didn't show up in the PR checks, because CentOS and other OSes are only called in the full CI run. For PRs, we only test a linux distro (debian) with different erlang version (24,25,26).

@pgj
Copy link
Contributor

pgj commented Oct 31, 2023

it definitely smells like a collation issue and likely an issue with the special characters used in the tests

In my understanding, the two failing test cases do not feature any special characters. Neither the selectors nor the documents. C.f. "begins with A" vs. { "AUS", "AND", "CAN", "DEN", "ETH" }.

@big-r81
Copy link
Contributor

big-r81 commented Oct 31, 2023

@pgj can you test on your local CentOS vm the state before the PR?

@pgj
Copy link
Contributor

pgj commented Oct 31, 2023

It is the derived string upper limit that does not work. Consider the following:

> Lower = <<"A">>.
<<"A">>.
> Upper = <<Lower/binary, 16#10FFFF>>.
<<"A\377">>.

On CentOS 7:

> couch_ejson_compare:get_icu_version().
{50,2,0,0}
> couch_ejson_compare:get_collator_version().
{58,0,6,50}
> mango_json:cmp(Lower, <<"AUS">>).
-1
> mango_json:cmp(<<"AUS">>, Upper).
1

On anywhere else (for example, locally):

> couch_ejson_compare:get_icu_version().
{73,2,0,0}
> couch_ejson_compare:get_collator_version().
{153,120,0,0}
> mango_json:cmp(Lower, <<"AUS">>).
-1
> mango_json:cmp(<<"AUS">>, Upper).
-1

The mango_json:cmp/2 function should return -1 for every case. That signals the "less than" result, which should hold for all. Apparently, icu 50 disagrees (as @nickva suspected) hence no documents are returned for the range.

@pgj
Copy link
Contributor

pgj commented Oct 31, 2023

Yeah, basically the lack of support for 0xFFFF in older ICU versions strikes again here, similar to what happened in #3491. As the standard says:

U+FFFF: This code point is tailored to have a primary weight higher than all other characters. This allows the reliable specification of a range, such as “Sch” ≤ X ≤ “Sch\uFFFF”, to include all strings starting with "sch" or equivalent.

Given that the other issue was fixed by unconditionally applying a shim on top of the ICU string comparison call, we should do the same here? @nickva what do you think?

@nickva
Copy link
Contributor

nickva commented Nov 1, 2023

@pgj excellent find!

Given that the other issue was fixed by unconditionally applying a shim on top of the ICU string comparison call, we should do the same here? @nickva what do you think?

It could work. The previous fix was easy enough as it was an isolated element -- a single marker. This one might involve doing some string wrangling in C, which is always fun, and by fun I mean dangerous, of course ;-)

I guess we could extend the current check to first see if the string ends in \uFFFFF and handle that special case.

@willholley
Copy link
Member Author

I wonder whether there's a pragmatically high code point we could use as the high bound in $beginsWith until we drop support for Centos 7 (it goes out of support next year)?

@pgj
Copy link
Contributor

pgj commented Nov 1, 2023

whether there's a pragmatically high code point we could use as the high bound

I was thinking about the same. Unfortunately I am not that experienced with Unicode collation and ICU and I do not have any idea. I will keep studying the implementation for some inspiration and work on the shim-based approach so we could at least have something as a fix.

@nickva
Copy link
Contributor

nickva commented Nov 1, 2023

Another sentinel value might work, good idea.

CentOS is EOL next year so then we can deprecate support for it and remove a few hacks we have for it.

@pgj
Copy link
Contributor

pgj commented Nov 1, 2023

Another sentinel value might work, good idea.

Sure but what should be that value specifically? My impression is that 0xFFFF was created for that exact purpose because it was missing before. It would be good to know how others have implemented this feature in lack of this symbol.

@nickva
Copy link
Contributor

nickva commented Nov 2, 2023

Sure but what should be that value specifically? My impression is that 0xFFFF was created for that exact purpose because it was missing before. It would be good to know how others have implemented this feature in lack of this symbol.

Looking over at the PR itself, one issue that jumps out is in the line <<Arg/binary, 16#10FFFF>>. We don't actually build a binary with 3 extra bytes at the end: 10, FF and FF; instead we're appending an integer with the size of 8 bits which gets maxed out at FF.

io:format("~w~n", [<<"A",16#10FFFF>>]).
<<65,255>>

At some point the ICU library decided to treat that differently but they could have as well have thrown an error too.

But then, if even if we did 16#10FFFF:24 it would still hit the same special case of the API implementation. What we're passing in is a utf8 encoded binary but then append a raw unicode code point to it.

So then the fix might be to actually encode the U+10FFFF:

io:format("~w~n", [<<16#10FFFF/utf8>>]).
<<244,143,191,191>>

Even better, since unicode standard seems to define a max sortable code point U+FFFF, we should use that instead of U+10FFFF. So we could define a max as unicode:characters_to_binary([16#FFFF]) or the shorter version of <<16#FFFF/utf8>>.

io:format("~w~n", [<<16#FFFF/utf8>>]).
<<239,191,191>>

Give <<Arg/binary, 16#FFFF/utf8>> a try, that should work on all the OS-es. It probably calls for extra unit tests too.

One interesting case to think about is that with /utf8 binary type we'll crash if the string is not a valid utf8 binary (say if somehow jiffy allowed through a surrogate pair prefix or we then building /utf8 binary should blow up).

@pgj
Copy link
Contributor

pgj commented Nov 2, 2023

Thanks @nickva! I will experiment with the options you described above and get back to you with the results.

@willholley
Copy link
Member Author

thanks @nickva - this solution looks good in my tests using the CentoOS container

@pgj
Copy link
Contributor

pgj commented Nov 2, 2023

Even better, since unicode standard seems to define a max sortable code point U+FFFF, we should use that instead of U+10FFFF.

Erlang itself supports U+10FFFF and it is numerically greater than U+FFFF.

Give <<Arg/binary, 16#FFFF/utf8>> a try, that should work on all the OS-es.

It works nicely. Jenkins seems to be happy with that, see #4829 for the details.

@nickva nickva mentioned this pull request Nov 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants