NGRAM_MATCH behavior does not appear to match documentation and makes it less useful for substring search #20118
Comments
Hello.
That's not correct; the longest matching sequence is also 3 here.
In general, you can picture it like this: "obact" -> target ngram array [oba bac act]. We then find the LCS (longest common subsequence), and the score is the LCS length divided by the target array length.
If you have long tokens (which are split into ngrams), it would probably be good to increase the ngram size.
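The scoring described above can be sketched in plain Python. This is only a model of the explanation in this thread, not ArangoSearch code: the helper names (`ngrams`, `lcs_length`, `ngram_match_score`) and the sample attribute values are hypothetical, and details such as tokenization and the threshold comparison are left out.

```python
def ngrams(text, n=3):
    """Split text into overlapping character n-grams (trigrams by default)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def lcs_length(a, b):
    """Length of the longest common subsequence of two lists.

    Matching elements must appear in the same order in both lists,
    but they do not have to be adjacent.
    """
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def ngram_match_score(attribute_value, target, n=3):
    """LCS length of the two n-gram arrays divided by the target array length."""
    target_ngrams = ngrams(target, n)
    if not target_ngrams:
        return 0.0
    return lcs_length(ngrams(attribute_value, n), target_ngrams) / len(target_ngrams)


# Hypothetical attribute values, chosen only to illustrate the point.
print(ngrams("obact"))                            # ['oba', 'bac', 'act']
print(ngram_match_score("tobacto", "obact"))      # 1.0, the trigrams are adjacent
print(ngram_match_score("oba_bac_act", "obact"))  # 1.0, the trigrams are in order but scattered
```

Under this model, a value that never contains "obact" as a substring can still reach a score of 1, because a subsequence only requires the target ngrams to appear in the same order.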
Thanks for clearing up my understanding.
Hmm, that's not what I would think of as a sequence. I would think they'd need to be contiguous. I understand that things are working as intended though, so at most we're talking about a terminology difference. That being said - is there a way to use arangosearch to find an arbitrary length substring in a field without scanning every document value for the field? There's
Yep, you're completely right, but we plan to add a wildcard analyzer in 3.12. That helps with leading-wildcard queries too (internally it uses ngrams and post-filtering).
As another option, you can try creating ngrams of different sizes and searching for your substring with an exact term search. But that can make the inverted index big, of course.
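The multi-size ngram idea can be illustrated with a toy inverted index in plain Python. This is a sketch under assumptions, not how ArangoSearch stores its data: `build_ngram_index` and `find_substring` are hypothetical helpers, the size range 2 to 5 is arbitrary, and the documents are made up.

```python
from collections import defaultdict


def build_ngram_index(docs, min_n=2, max_n=5):
    """Map every n-gram of size min_n..max_n to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for n in range(min_n, max_n + 1):
            for i in range(len(text) - n + 1):
                index[text[i:i + n]].add(doc_id)
    return index


def find_substring(index, query, min_n=2, max_n=5):
    """Exact term lookup; only valid when the query length is an indexed n-gram size."""
    if not min_n <= len(query) <= max_n:
        raise ValueError("query length is outside the indexed ngram sizes")
    return set(index.get(query, set()))


# Hypothetical documents, for illustration only.
docs = {"a": "tobacto", "b": "oba_bac_act"}
index = build_ngram_index(docs)

print(find_substring(index, "obact"))  # {'a'}
print(len(index))                      # number of distinct terms stored for two short strings
```

A single term lookup replaces the scan, but the number of stored terms grows with every additional ngram size, which is the index-size cost mentioned above. Substrings longer than the largest indexed size cannot be answered by one exact lookup in this sketch.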
Oh great, looking forward to 3.12 then. Hurry hurry hurry!
My Environment
arangodb:3.11
Docker image from Dockerhub
Component, Query & Data
Affected feature:
ArangoSearch query using web interface
AQL query (if applicable):
Query 1:
Query 2:
AQL explain and/or profile (if applicable):
N/A
Dataset:
N/A
Size of your Dataset on disk:
N/A
Replication Factor & Number of Shards (Cluster only):
RF: 3
Shards: N/A
Steps to reproduce
Query 1 result: true
Query 2 result: true
Problem:
The documentation describes the scoring of the ngram match as:
Based on that description I would expect query 1 to match (3 trigrams in the query, a sequence of 3 matching trigrams in the target for a score of 1) and query 2 to fail (3 trigrams in the query, the longest matching sequence is 1 trigram in the target for a score of 0.33), but both match.
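The contiguous reading behind this expectation can be written down as a small Python sketch. It models only the interpretation described in this paragraph, not what NGRAM_MATCH does: the function names and the sample values are hypothetical stand-ins, since the original queries are not reproduced here.

```python
def ngrams(text, n=3):
    """Split text into overlapping character n-grams (trigrams by default)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def longest_contiguous_run(a, b):
    """Length of the longest run of adjacent elements shared by both lists."""
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            if x == y:
                # Extend the run that ended at the previous element of both lists.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def contiguous_score(attribute_value, target, n=3):
    """Longest contiguous run of matching n-grams divided by the target n-gram count."""
    target_ngrams = ngrams(target, n)
    if not target_ngrams:
        return 0.0
    run = longest_contiguous_run(ngrams(attribute_value, n), target_ngrams)
    return run / len(target_ngrams)


# Hypothetical values standing in for the two queries.
print(contiguous_score("tobacto", "obact"))      # 1.0
print(contiguous_score("oba_bac_act", "obact"))  # 0.333..., a run of one trigram out of three
```

With a threshold close to 1, this reading would accept the first value and reject the second, which is the behavior a substring search needs.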
I was hoping to use NGRAM_MATCH as a fast (and it's definitely really fast) substring query, but given the current behavior that's not going to work, unless I'm doing something wrong.
Expected result:
Query 2's result should be false if I understand the documentation correctly.
Thanks for any advice you can give me if there's something incorrect with my setup.