
NGRAM_MATCH behavior does not appear to match documentation and makes it less useful for substring search #20118

Open
MrCreosote opened this issue Nov 10, 2023 · 7 comments


@MrCreosote

My Environment

  • ArangoDB Version: 3.11.4
  • Deployment Mode: Single server from arangodb:3.11 Docker image from Dockerhub
  • Deployment Strategy: whatever the docker image does
  • Configuration: whatever the docker image does
  • Infrastructure: laptop
  • Operating System: Ubuntu 18.04
  • Total RAM in your machine: 32GB, 20GB for VirtualBox VM
  • Disks in use: SSD
  • Used Package: Docker image

Component, Query & Data

Affected feature:
ArangoSearch query using web interface

AQL query (if applicable):
Query 1:

RETURN NGRAM_MATCH("flaviobacter", "obact", 1, "ngram3")

Query 2:

RETURN NGRAM_MATCH("xxobaxxxbacxxxactxxx", "obact", 1, "ngram3")

AQL explain and/or profile (if applicable):
N/A

Dataset:
N/A

Size of your Dataset on disk:
N/A

Replication Factor & Number of Shards (Cluster only):
RF: 3
Shards: N/A

Steps to reproduce

  1. Add the following analyzer to the DB:
{
    "name": "test::ngram3",
    "type": "pipeline",
    "features": [
        "frequency",
        "position"
    ],
    "properties": {
        "pipeline": [
            {
                "type": "norm",
                "properties": {
                    "locale": "en",
                    "case": "lower",
                    "accent": false
                }
            },
            {
                "type": "ngram",
                "properties": {
                    "min": 3,
                    "max": 3,
                    "preserveOriginal": false,
                    "streamType": "utf8",
                    "startMarker": "",
                    "endMarker": ""
                }
            }
        ]
    }
}
  2. Execute query 1, note the true result
  3. Execute query 2, note the true result

Problem:
The documentation describes the scoring of the ngram match as:

The similarity is calculated by counting how long the longest sequence of matching n-grams is, divided by the target’s total n-gram count. Only fully matching n-grams are counted. [Emphasis mine]

Based on that description I would expect query 1 to match (3 trigrams in the query, a sequence of 3 matching trigrams in the target for a score of 1) and query 2 to fail (3 trigrams in the query, the longest matching sequence is 1 trigram in the target for a score of 0.33), but both match.
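
Spelling out my expectation with the trigram arrays written by hand:

"obact"        -> [oba, bac, act]  (3 trigrams)
"flaviobacter" -> [fla, lav, avi, vio, iob, oba, bac, act, cte, ter]
                  contiguous run oba bac act -> 3/3 = 1, match
"xxobaxxxbacxxxactxxx" -> oba, bac and act all occur, but never adjacent,
                  so the longest contiguous run is 1 -> 1/3 ≈ 0.33, no match at threshold 1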

I was hoping to use NGRAM_MATCH as a fast (and it's definitely really fast) substring query but given the current behavior that's not going to work, unless I'm doing something wrong.

Expected result:
Query 2's result should be false if I understand the documentation correctly

Thanks for any advice you can give me if there's something incorrect with my setup

@MBkkt
Contributor

MBkkt commented Nov 16, 2023

Hello.

query 2 to fail (3 trigrams in the query, the longest matching sequence is 1 trigram in the target for a score of 0.33)

That's not correct; the longest matching sequence is also 3 here:
xxobaxxxbacxxxactxxx

@MBkkt
Contributor

MBkkt commented Nov 16, 2023

In general you can imagine it like this:

"obact" -> target n-gram array [oba bac act]
"xxobaxxxbacxxxactxxx" -> n-gram array [xxo xob oba bax ...]

We then find the LCS (longest common subsequence) of the two n-gram arrays; the score is the LCS length divided by the target array's length.
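
A minimal sketch of that scoring in Python, just to illustrate the LCS idea (this is not ArangoDB's actual implementation):

def ngrams(text, n=3):
    # all contiguous character n-grams, like the ngram analyzer above
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def lcs_length(a, b):
    # classic dynamic-programming longest common subsequence
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def ngram_match_score(attribute, target, n=3):
    # score = LCS of the two n-gram arrays, divided by the target's n-gram count
    target_ngrams = ngrams(target, n)
    return lcs_length(ngrams(attribute, n), target_ngrams) / len(target_ngrams)

print(ngram_match_score("flaviobacter", "obact"))          # 1.0
print(ngram_match_score("xxobaxxxbacxxxactxxx", "obact"))  # 1.0 -> both match at threshold 1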

@MBkkt
Contributor

MBkkt commented Nov 16, 2023

If you have long tokens (which get split into n-grams), it would probably be good to increase the n-gram size.
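
For example, the pipeline from the issue with its ngram step changed to 5-grams (a name like test::ngram5 is hypothetical):

{
    "type": "ngram",
    "properties": {
        "min": 5,
        "max": 5,
        "preserveOriginal": false,
        "streamType": "utf8",
        "startMarker": "",
        "endMarker": ""
    }
}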

@MrCreosote
Author

Thanks for clearing up my understanding.

That's not correct; the longest matching sequence is also 3 here:
xxobaxxxbacxxxactxxx

Hmm, that's not what I would think of as a sequence. I would think they'd need to be contiguous. I understand that things are working as intended though, so at most we're talking about a terminology difference.

That being said: is there a way to use ArangoSearch to find an arbitrary-length substring in a field without scanning every document value for the field? There's LIKE, which the documentation says is backed by indexes, but for a search like %foo% to use an index you'd need a suffix tree or something like that, not just a regular inverted index (right?). My naive assumption is that inverted indexes will only accelerate LIKE searches where the left side of the pattern is anchored, like foo%.
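
For example (view and field names made up), I'd expect this anchored pattern to be index-friendly:

FOR doc IN myView
  SEARCH ANALYZER(doc.name LIKE "foo%", "identity")
  RETURN doc

but not the same query with "%foo%".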

@MBkkt
Contributor

MBkkt commented Nov 16, 2023

There's LIKE, which the documentation says is backed by indexes, but for a search like %foo% to use an index you'd need a suffix tree or something like that, not just a regular inverted index (right?). My naive assumption is that inverted indexes will only accelerate LIKE searches where the left side of the pattern is anchored, like foo%.

Yep, you're completely right, but we plan to add a wildcard analyzer in 3.12. It helps with leading-wildcard queries too (internally it uses n-grams and post-filtering).
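
For reference, a sketch of what such an analyzer definition could look like; the type and property names below are an assumption about the planned feature, not a released API:

{
    "name": "test::wild",
    "type": "wildcard",
    "properties": {
        "ngramSize": 3
    }
}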

@MBkkt
Contributor

MBkkt commented Nov 16, 2023

As another option, you can try making n-grams of a different size and searching for your substring with an exact term search. But of course that can make the inverted index big.
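
For example, with a hypothetical test::ngram5 analyzer (min = max = 5, as sketched above) indexing the field, a 5-character substring becomes an exact term lookup (view and field names made up):

FOR doc IN myView
  SEARCH ANALYZER(doc.name == "obact", "test::ngram5")
  RETURN doc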

@MrCreosote
Author

we plan to add a wildcard analyzer in 3.12

Oh great, looking forward to 3.12 then. Hurry hurry hurry!
