sql: Similarity queries (%) opt to not use GIN indexes based on query parameters #93830
A few other things to note: it seems that "ca" is what triggers the behavior:

```
root@localhost:26257/defaultdb> EXPLAIN SELECT food_id FROM words WHERE word % 'ca';
                                                                info
-----------------------------------------------------------------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • filter
  │ estimated row count: 214,860
  │ filter: word % 'ca'
  │
  └── • scan
        estimated row count: 644,580 (100% of the table; stats collected 32 minutes ago; using stats forecast for 31 minutes ago)
        table: words@words_pkey
        spans: FULL SCAN
(11 rows)

Time: 5ms total (execution 4ms / network 0ms)
```
```
root@localhost:26257/defaultdb> EXPLAIN SELECT food_id FROM words WHERE word % 'ch';
                                                                info
-----------------------------------------------------------------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • filter
  │ estimated row count: 214,860
  │ filter: word % 'ch'
  │
  └── • scan
        estimated row count: 644,580 (100% of the table; stats collected 32 minutes ago; using stats forecast for 31 minutes ago)
        table: words@words_pkey
        spans: FULL SCAN
(11 rows)

Time: 3ms total (execution 3ms / network 0ms)
```

Increasing the length of queries using these problematic prefixes doesn't seem to do much either:

```
root@localhost:26257/defaultdb> EXPLAIN SELECT food_id FROM words WHERE word % 'cheese';
                                                                info
-----------------------------------------------------------------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • filter
  │ estimated row count: 214,860
  │ filter: word % 'cheese'
  │
  └── • scan
        estimated row count: 644,580 (100% of the table; stats collected 33 minutes ago; using stats forecast for 32 minutes ago)
        table: words@words_pkey
        spans: FULL SCAN
(11 rows)

Time: 4ms total (execution 4ms / network 0ms)
```
```
root@localhost:26257/defaultdb> EXPLAIN SELECT food_id FROM words WHERE word % 'cakes';
                                                                info
-----------------------------------------------------------------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • filter
  │ estimated row count: 214,860
  │ filter: word % 'cakes'
  │
  └── • scan
        estimated row count: 644,580 (100% of the table; stats collected 33 minutes ago; using stats forecast for 32 minutes ago)
        table: words@words_pkey
        spans: FULL SCAN
(11 rows)

Time: 4ms total (execution 3ms / network 0ms)
```

These suboptimal plans will also "infect" otherwise good plans when using an OR clause:

```
root@localhost:26257/defaultdb> EXPLAIN SELECT food_id FROM words WHERE word % 'eggs';
                                                                  info
---------------------------------------------------------------------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • filter
  │ estimated row count: 214,860
  │ filter: word % 'eggs'
  │
  └── • index join
      │ estimated row count: 0
      │ table: words@words_pkey
      │
      └── • inverted filter
          │ estimated row count: 0
          │ inverted column: word_inverted_key
          │ num spans: 5
          │
          └── • scan
                estimated row count: 0 (<0.01% of the table; stats collected 34 minutes ago; using stats forecast for 33 minutes ago)
                table: words@words_word_idx
                spans: 5 spans
(20 rows)

Time: 3ms total (execution 3ms / network 1ms)
```
```
root@localhost:26257/defaultdb> EXPLAIN SELECT food_id FROM words WHERE word % 'eggs' OR word % 'cakes';
                                                                info
-----------------------------------------------------------------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • filter
  │ estimated row count: 214,860
  │ filter: (word % 'eggs') OR (word % 'cakes')
  │
  └── • scan
        estimated row count: 644,580 (100% of the table; stats collected 34 minutes ago; using stats forecast for 33 minutes ago)
        table: words@words_pkey
        spans: FULL SCAN
(11 rows)

Time: 4ms total (execution 3ms / network 0ms)
```
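As an aside, the `num spans: 5` in the 'eggs' plan lines up with pg_trgm-style trigram extraction: each distinct trigram of the search term becomes one span to scan in the inverted index, and an OR of two terms has to union the spans of both, so more trigrams make the index plan costlier relative to a full scan. A rough sketch of the counting (assuming CockroachDB follows the pg_trgm padding convention; `trigrams` here is illustrative, not the actual implementation):

```python
def trigrams(s: str) -> set[str]:
    """pg_trgm-style extraction: pad with two leading spaces and one
    trailing space, then take every 3-character window."""
    padded = "  " + s.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

# One inverted-index span per distinct trigram of the search term.
print(sorted(trigrams("eggs")))   # ['  e', ' eg', 'egg', 'ggs', 'gs '] -> "num spans: 5"

# An OR over two terms unions both trigram sets, so more spans to scan.
print(len(trigrams("eggs") | trigrams("cakes")))   # 11
```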
It's important to note that the optimizer cannot (and will never, unless there's a bug) plan a query that uses a trigram inverted index if the search term contains fewer than 3 letters. So the example queries you're seeing that don't work (`ca` and `ch`) are expected not to work. Are there examples of 2-character queries that do use the inverted index? As for the OR case, there are more trigrams in that query, and the more trigrams, the more scans, and the more selective the trigrams must be to "win" over the full scan. You can use …
I take back what I said about two-character % queries not being able to use the index. They're able to use it because padding is added. I think any issue here is related to stats. I also think that, unfortunately, the stats collection is working fine, because hinting the inverted index for the …

There is some execution-performance low-hanging fruit here, though, so hopefully we can speed this stuff up a bit regardless of the stats issues.
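To illustrate the padding mentioned above (a sketch assuming the pg_trgm convention of two leading spaces and one trailing space, not CockroachDB's actual code), even a 2-character term like `ca` produces three trigrams once padded, which is why short `%` queries can still use the inverted index:

```python
def trigrams(s: str) -> set[str]:
    # pg_trgm-style padding: two leading spaces, one trailing space,
    # then every 3-character window of the padded string.
    padded = "  " + s.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

# '  ca ' has three 3-character windows, so the 2-letter term
# still maps onto inverted-index entries.
print(sorted(trigrams("ca")))   # ['  c', ' ca', 'ca ']
```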
Huh, I could have sworn that forcing the usage of the index resulted in a notable speed up but after rebuilding the table it's significantly slower.
Here's the updated EXPLAIN output:
I put together a little repo to toy around with various querying methods and it will rebuild the exact database that I've been using while poking around this issue: https://github.com/chrisseto/cockroach-trigrams.
@chrisseto try playing around with a build on this PR: #93757
Also note that running with …
93757: trigram: support multi-byte string trigrams; perf improvements r=jordanlewis a=jordanlewis

Fixes #93744
Related to #93830

- Add multi-byte character support
- Improve performance

```
name           old time/op    new time/op    delta
Similarity-32    1.72µs ± 0%    0.60µs ± 3%  -64.98%  (p=0.000 n=9+10)

name           old alloc/op   new alloc/op   delta
Similarity-32    1.32kB ± 0%    0.37kB ± 0%  -72.10%  (p=0.000 n=10+10)

name           old allocs/op  new allocs/op  delta
Similarity-32      15.0 ± 0%      6.0 ± 0%   -60.00%  (p=0.000 n=10+10)
```

Release note (sql change): previously, trigrams ignored multi-byte characters from input strings. This is now corrected.

94122: sql: implement the pg_timezone_names table r=rafiss a=otan

Informs #84505

Release note (sql change): Implement the `pg_timezone_names` pg_catalog table, which lists all supported timezones.

Co-authored-by: Jordan Lewis <jordanthelewis@gmail.com>
Co-authored-by: Oliver Tan <otan@cockroachlabs.com>
#93757 shaves about 0.5 seconds off some of my queries! Wow, I did not expect that to be the source of any slowness. Oddly, the queries against the already "analyzed" (tokenized, stemmed, joined with spaces) strings only saw a 0.05 second improvement. Overall an amazing change, though the results are a bit curious in some cases.
The problem with … I investigated with Postgres, and I'm noticing that they have similar weirdness: depending on the index, a …
Interesting! Thanks for digging into this further. I wonder if we could overload the …

What should we do with this issue? It seems like this is less of a bug and more a very niche use case that may or may not be worth pouring more time into.
I agree, I don't think it's a bug, so I'll close it. I think the answer is tsvectors if you're looking for normal word search as opposed to fuzzy search. But fuzzy search still shouldn't be super slow; then again, I think we're working "as expected" here in terms of performance, at least for now. Hopefully tsvectors are coming... we'll see!
SGTM! Thanks for the improvements and all the debugging.
Describe the problem
A simple

```
SELECT * FROM table WHERE column % 'word'
```

query will ignore the trigram index for certain phrases.

To Reproduce

I have a newly created single-node CockroachDB instance and have indexed the USDA Food Dataset food names after manually tokenizing and stemming the strings. When experimenting with various queries, I noticed that certain words will not use the trigram indexes:
The above queries don't return dramatically different result counts, but their execution times differ dramatically.
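For context, the `%` similarity operator's match semantics can be sketched as follows (pg_trgm-style, which CockroachDB is assumed here to approximate; not the actual implementation): similarity is the number of shared trigrams divided by the size of the union of both trigram sets, and `a % b` is true when that ratio meets the threshold (0.3 by default in Postgres).

```python
def trigrams(s: str) -> set[str]:
    # pg_trgm-style padding and 3-character windows.
    padded = "  " + s.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def similarity(a: str, b: str) -> float:
    # Shared trigrams over the union of both trigram sets.
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

def similar(a: str, b: str, threshold: float = 0.3) -> bool:
    # Rough model of the % operator.
    return similarity(a, b) >= threshold

print(similar("cheese", "cheeses"))  # True  (high trigram overlap)
print(similar("cheese", "eggs"))     # False (no shared trigrams)
```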
Expected behavior

I would expect that any `%` query would result in a query plan that uses the available trigram index.

Additional data / screenshots
Just to give you an idea of the trigram distribution and table size:
Environment:
cockroach sql
Additional context
It's a bit concerning to me that a simple parameter change, which is likely to be user-controlled in the case of trigram search, can dramatically alter the query plan and execution time of a query.
Jira issue: CRDB-22544