
sql: Similarity queries (%) opt to not use GIN indexes based on query parameters #93830

Closed · chrisseto opened this issue Dec 16, 2022 · 11 comments
Labels: C-bug, T-sql-queries


chrisseto commented Dec 16, 2022

Describe the problem

A simple SELECT * FROM table WHERE column % 'word' query will ignore the trigram index for certain phrases.

To Reproduce

I have a newly created single node CockroachDB instance and have indexed the USDA Food Dataset food names after manually tokenizing and stemming the strings.

root@localhost:26257/defaultdb> SHOW CREATE TABLE words;
  table_name |                              create_statement
-------------+------------------------------------------------------------------------------
  words      | CREATE TABLE public.words (
             |     food_id UUID NOT NULL,
             |     word STRING NOT NULL,
             |     count INT8 NOT NULL,
             |     weight INT8 NOT NULL,
             |     rowid INT8 NOT VISIBLE NOT NULL DEFAULT unique_rowid(),
             |     CONSTRAINT words_pkey PRIMARY KEY (rowid ASC),
             |     INVERTED INDEX words_food_id_word_idx (food_id ASC, word gin_trgm_ops),
             |     INDEX words_food_id_idx (food_id ASC),
             |     INVERTED INDEX words_word_idx (word gin_trgm_ops)
             | )
(1 row)

When experimenting with various queries, I noticed that certain words will not use the trigram indexes:

root@localhost:26257/defaultdb> EXPLAIN  SELECT food_id FROM words WHERE word % 'egg';
                                                                 info
---------------------------------------------------------------------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • filter
  │ estimated row count: 214,860
  │ filter: word % 'egg'
  │
  └── • index join
      │ estimated row count: 0
      │ table: words@words_pkey
      │
      └── • inverted filter
          │ estimated row count: 0
          │ inverted column: word_inverted_key
          │ num spans: 4
          │
          └── • scan
                estimated row count: 0 (<0.01% of the table; stats collected 19 minutes ago; using stats forecast for 18 minutes ago)
                table: words@words_word_idx
                spans: 4 spans
(20 rows)


Time: 3ms total (execution 3ms / network 0ms)

root@localhost:26257/defaultdb> EXPLAIN  SELECT food_id FROM words WHERE word % 'cake';
                                                               info
-----------------------------------------------------------------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • filter
  │ estimated row count: 214,860
  │ filter: word % 'cake'
  │
  └── • scan
        estimated row count: 644,580 (100% of the table; stats collected 19 minutes ago; using stats forecast for 18 minutes ago)
        table: words@words_pkey
        spans: FULL SCAN
(11 rows)


Time: 3ms total (execution 3ms / network 0ms)

The above queries return similar result counts, but their execution times differ dramatically.

root@localhost:26257/defaultdb> SELECT COUNT(*) FROM words WHERE word % 'egg';
  count
---------
   1096
(1 row)


Time: 129ms total (execution 129ms / network 0ms)

root@localhost:26257/defaultdb> SELECT COUNT(*) FROM words WHERE word % 'cake';
  count
---------
   1425
(1 row)


Time: 876ms total (execution 876ms / network 0ms)

Expected behavior
I would expect that any % query would result in a query plan that uses the available trigram index.

Additional data / screenshots
Just to give you an idea of the trigram distribution and table size:

root@localhost:26257/defaultdb> SELECT COUNT(*) FROM words;
  count
----------
  644580
(1 row)


Time: 140ms total (execution 140ms / network 0ms)

root@localhost:26257/defaultdb> SELECT unnest(show_trgm(word)) as trigram, COUNT(*) FROM words GROUP BY trigram ORDER BY COUNT(*) DESC LIMIT 25;
  trigram | count
----------+--------
    c     | 95313
    s     | 80770
    b     | 52114
    p     | 47186
    f     | 39878
    m     | 39123
   ch     | 34820
  er      | 27228
    w     | 26876
    g     | 25592
    t     | 25329
    o     | 24770
    r     | 22522
    a     | 21489
    d     | 20076
   co     | 19451
    h     | 19332
  ri      | 19137
  an      | 18802
  in      | 17735
    l     | 17033
   sa     | 15627
   cr     | 14996
  che     | 14521
  rea     | 14453

Environment:

  • CockroachDB version v22.2.0
  • Server OS: MacOS
  • Client app cockroach sql

Additional context

It's a bit concerning to me that a simple parameter change, one that is likely to be user-controlled in the case of trigram search, could dramatically alter the query plan and execution time of a query.

Jira issue: CRDB-22544

chrisseto commented:

A few other things to note:

It seems that the "ca" prefix is what triggers the behavior, and "ch" does as well.

root@localhost:26257/defaultdb> EXPLAIN  SELECT food_id FROM words WHERE word % 'ca';
                                                               info
-----------------------------------------------------------------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • filter
  │ estimated row count: 214,860
  │ filter: word % 'ca'
  │
  └── • scan
        estimated row count: 644,580 (100% of the table; stats collected 32 minutes ago; using stats forecast for 31 minutes ago)
        table: words@words_pkey
        spans: FULL SCAN
(11 rows)


Time: 5ms total (execution 4ms / network 0ms)

root@localhost:26257/defaultdb> EXPLAIN  SELECT food_id FROM words WHERE word % 'ch';
                                                               info
-----------------------------------------------------------------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • filter
  │ estimated row count: 214,860
  │ filter: word % 'ch'
  │
  └── • scan
        estimated row count: 644,580 (100% of the table; stats collected 32 minutes ago; using stats forecast for 31 minutes ago)
        table: words@words_pkey
        spans: FULL SCAN
(11 rows)


Time: 3ms total (execution 3ms / network 0ms)

Increasing the length of search terms that use these problematic prefixes doesn't seem to help either:

root@localhost:26257/defaultdb> EXPLAIN  SELECT food_id FROM words WHERE word % 'cheese';
                                                               info
-----------------------------------------------------------------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • filter
  │ estimated row count: 214,860
  │ filter: word % 'cheese'
  │
  └── • scan
        estimated row count: 644,580 (100% of the table; stats collected 33 minutes ago; using stats forecast for 32 minutes ago)
        table: words@words_pkey
        spans: FULL SCAN
(11 rows)


Time: 4ms total (execution 4ms / network 0ms)

root@localhost:26257/defaultdb> EXPLAIN  SELECT food_id FROM words WHERE word % 'cakes';
                                                               info
-----------------------------------------------------------------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • filter
  │ estimated row count: 214,860
  │ filter: word % 'cakes'
  │
  └── • scan
        estimated row count: 644,580 (100% of the table; stats collected 33 minutes ago; using stats forecast for 32 minutes ago)
        table: words@words_pkey
        spans: FULL SCAN
(11 rows)


Time: 4ms total (execution 3ms / network 0ms)

These suboptimal plans will also "infect" otherwise good plans when using an OR clause.

root@localhost:26257/defaultdb> EXPLAIN  SELECT food_id FROM words WHERE word % 'eggs';
                                                                 info
---------------------------------------------------------------------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • filter
  │ estimated row count: 214,860
  │ filter: word % 'eggs'
  │
  └── • index join
      │ estimated row count: 0
      │ table: words@words_pkey
      │
      └── • inverted filter
          │ estimated row count: 0
          │ inverted column: word_inverted_key
          │ num spans: 5
          │
          └── • scan
                estimated row count: 0 (<0.01% of the table; stats collected 34 minutes ago; using stats forecast for 33 minutes ago)
                table: words@words_word_idx
                spans: 5 spans
(20 rows)


Time: 3ms total (execution 3ms / network 1ms)

root@localhost:26257/defaultdb> EXPLAIN  SELECT food_id FROM words WHERE word % 'eggs' OR word % 'cakes';
                                                               info
-----------------------------------------------------------------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • filter
  │ estimated row count: 214,860
  │ filter: (word % 'eggs') OR (word % 'cakes')
  │
  └── • scan
        estimated row count: 644,580 (100% of the table; stats collected 34 minutes ago; using stats forecast for 33 minutes ago)
        table: words@words_pkey
        spans: FULL SCAN
(11 rows)


Time: 4ms total (execution 3ms / network 0ms)

jordanlewis commented:

It's important to note that the optimizer cannot (and never will, unless there's a bug) plan a query that uses a trigram inverted index if the search term contains fewer than 3 letters. So the example queries that you're seeing that don't work (ca and ch) are expected not to work. Are there examples of 2-character queries that do use the inverted index?

As for the OR case: there are more trigrams in that query, and the more trigrams there are, the more scans are needed and the more selective the trigrams must be to "win" over the full scan.
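
For illustration, the span counts in the plans above line up with the number of trigrams extracted from the padded search term, so a quick show_trgm check gives a rough sense of how much extra index work each OR term adds (a sketch, assuming pg_trgm-compatible trigram extraction):

```sql
-- Rough proxy for index work per term: each trigram of the padded
-- search string becomes a span to scan. 'eggs' yields 5 trigrams,
-- which matches "num spans: 5" in the 'eggs' plan above.
SELECT array_length(show_trgm('eggs'), 1)  AS eggs_trgms,
       array_length(show_trgm('cakes'), 1) AS cakes_trgms;
```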

You can use EXPLAIN (opt, memo) to get enough detail to track down what's going on here.

jordanlewis commented:

I take back what I said about two-character % queries not being able to use the index. They're able to use it because padding is added.
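
To make the padding concrete: pg_trgm-style extraction pads the input with two leading spaces and one trailing space, so even a two-character term yields three indexable trigrams. A quick sanity check (expected output assumes pg_trgm-compatible semantics):

```sql
SELECT show_trgm('ca');
-- expected: {"  c"," ca","ca "}
-- two leading pad spaces and one trailing pad space produce
-- three trigrams from a two-character input
```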

I think any issue here is related to stats. I also think that unfortunately the stats collection is working fine, because hinting the inverted index for the cake query actually results in worse performance.

There is some low-hanging fruit in execution performance here, though, so hopefully we can speed this up a bit regardless of the stats issues.

chrisseto commented:

Huh, I could have sworn that forcing the use of the index resulted in a notable speedup, but after rebuilding the table it's significantly slower.

root@localhost:26257/defaultdb> EXPLAIN ANALYZE SELECT * FROM foods@foods_name_idx WHERE name % 'cake' ORDER BY similarity(name, 'cake') LIMIT 10;
                                                  info
--------------------------------------------------------------------------------------------------------
  planning time: 2ms
  execution time: 12.7s
  distribution: full
  vectorized: true
  rows read from KV: 565,017 (50 MiB, 9 gRPC calls)
  cumulative time spent in KV: 10.9s
  maximum memory usage: 56 MiB
  network usage: 0 B (0 messages)

  • top-k
  │ nodes: n1
  │ actual row count: 10
  │ estimated max memory allocated: 10 KiB
  │ estimated max sql temp disk usage: 0 B
  │ estimated row count: 10
  │ order: +column9
  │ k: 10
  │
  └── • render
      │
      └── • filter
          │ nodes: n1
          │ actual row count: 91
          │ estimated row count: 131,179
          │ filter: name % 'cake'
          │
          └── • index join
              │ nodes: n1
              │ actual row count: 238,051
              │ KV time: 10.8s
              │ KV contention time: 0µs
              │ KV rows read: 238,051
              │ KV bytes read: 34 MiB
              │ KV gRPC calls: 7
              │ estimated max memory allocated: 41 MiB
              │ estimated max sql temp disk usage: 0 B
              │ estimated row count: 296,144
              │ table: foods@foods_pkey
              │
              └── • inverted filter
                  │ nodes: n1
                  │ actual row count: 238,051
                  │ estimated max memory allocated: 15 MiB
                  │ estimated max sql temp disk usage: 0 B
                  │ estimated row count: 296,144
                  │ inverted column: name_inverted_key
                  │ num spans: 5
                  │
                  └── • scan
                        nodes: n1
                        actual row count: 326,966
                        KV time: 104ms
                        KV contention time: 0µs
                        KV rows read: 326,966
                        KV bytes read: 16 MiB
                        KV gRPC calls: 2
                        estimated max memory allocated: 10 MiB
                        estimated row count: 296,144 (75% of the table; stats collected 2 minutes ago)
                        table: foods@foods_name_idx
                        spans: 5 spans
(60 rows)


Time: 12.656s total (execution 12.656s / network 0.000s)

root@localhost:26257/defaultdb> EXPLAIN ANALYZE SELECT * FROM foods WHERE name % 'cake' ORDER BY similarity(name, 'cake') LIMIT 10;
                                              info
-------------------------------------------------------------------------------------------------
  planning time: 2ms
  execution time: 2.5s
  distribution: full
  vectorized: true
  rows read from KV: 393,537 (59 MiB, 6 gRPC calls)
  cumulative time spent in KV: 130ms
  maximum memory usage: 10 MiB
  network usage: 0 B (0 messages)

  • top-k
  │ nodes: n1
  │ actual row count: 10
  │ estimated max memory allocated: 10 KiB
  │ estimated max sql temp disk usage: 0 B
  │ estimated row count: 10
  │ order: +column9
  │ k: 10
  │
  └── • render
      │
      └── • filter
          │ nodes: n1
          │ actual row count: 91
          │ estimated row count: 131,179
          │ filter: name % 'cake'
          │
          └── • scan
                nodes: n1
                actual row count: 393,537
                KV time: 130ms
                KV contention time: 0µs
                KV rows read: 393,537
                KV bytes read: 59 MiB
                KV gRPC calls: 6
                estimated max memory allocated: 10 MiB
                estimated row count: 393,537 (100% of the table; stats collected 2 minutes ago)
                table: foods@foods_pkey
                spans: FULL SCAN
(38 rows)


Time: 2.500s total (execution 2.500s / network 0.000s)

Here's the updated EXPLAIN output:

root@localhost:26257/defaultdb> EXPLAIN(opt,memo) SELECT * FROM foods WHERE name % 'cake' ORDER BY similarity(name, 'cake') LIMIT 10;
                                                                                info
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
  memo (optimized, ~10KB, required=[presentation: info:10])
   ├── G1: (explain G2 [presentation: id:1,name:2,analyzed:3,weight:4] [ordering: +9])
   │    └── [presentation: info:10]
   │         ├── best: (explain G2="[presentation: id:1,name:2,analyzed:3,weight:4] [ordering: +9]" [presentation: id:1,name:2,analyzed:3,weight:4] [ordering: +9])
   │         └── cost: 655458.61
   ├── G2: (limit G3 G4 ordering=+9) (top-k G3 &{10 +9 })
   │    ├── [presentation: id:1,name:2,analyzed:3,weight:4] [ordering: +9]
   │    │    ├── best: (top-k G3 &{10 +9 })
   │    │    └── cost: 655458.59
   │    └── []
   │         ├── best: (top-k G3 &{10 +9 })
   │         └── cost: 655458.59
   ├── G3: (project G5 G6 id name analyzed weight)
   │    ├── [ordering: +9] [limit hint: 10.00]
   │    │    ├── best: (sort G3)
   │    │    └── cost: 700593.11
   │    └── []
   │         ├── best: (project G5 G6 id name analyzed weight)
   │         └── cost: 644119.16
   ├── G4: (const 10)
   ├── G5: (select G7 G8) (select G9 G8)
   │    └── []
   │         ├── best: (select G7 G8)
   │         └── cost: 641495.56
   ├── G6: (projections G10)
   ├── G7: (scan foods,cols=(1-4))
   │    └── []
   │         ├── best: (scan foods,cols=(1-4))
   │         └── cost: 637560.16
   ├── G8: (filters G11)
   ├── G9: (index-join G12 foods,cols=(1-4))
   │    └── []
   │         ├── best: (index-join G12 foods,cols=(1-4))
   │         └── cost: 2357336.33
   ├── G10: (function G13 similarity)
   ├── G11: (mod G14 G15)
   ├── G12: (inverted-filter G16 name_inverted_key)
   │    └── []
   │         ├── best: (inverted-filter G16 name_inverted_key)
   │         └── cost: 408708.76
   ├── G13: (scalar-list G14 G15)
   ├── G14: (variable name)
   ├── G15: (const 'cake')
   └── G16: (scan foods@foods_name_idx,cols=(1,7),constrained inverted)
        └── []
             ├── best: (scan foods@foods_name_idx,cols=(1,7),constrained inverted)
             └── cost: 405747.30
  top-k
   ├── k: 10
   └── project
        ├── select
        │    ├── scan foods
        │    └── filters
        │         └── name % 'cake'
        └── projections
             └── similarity(name, 'cake')
(56 rows)


Time: 6ms total (execution 5ms / network 1ms)

I put together a little repo for toying around with various querying methods; it rebuilds the exact database I've been using while poking at this issue: https://github.com/chrisseto/cockroach-trigrams.

jordanlewis commented:

@chrisseto try playing around with a build on this PR: #93757

jordanlewis commented:

Also note that running a query under EXPLAIN ANALYZE is much slower than running it without.

craig bot pushed a commit that referenced this issue Dec 23, 2022
93757: trigram: support multi-byte string trigrams; perf improvements r=jordanlewis a=jordanlewis

Fixes #93744 
Related to #93830

- Add multi-byte character support
- Improve performance

```
name           old time/op    new time/op    delta
Similarity-32    1.72µs ± 0%    0.60µs ± 3%  -64.98%  (p=0.000 n=9+10)

name           old alloc/op   new alloc/op   delta
Similarity-32    1.32kB ± 0%    0.37kB ± 0%  -72.10%  (p=0.000 n=10+10)

name           old allocs/op  new allocs/op  delta
Similarity-32      15.0 ± 0%       6.0 ± 0%  -60.00%  (p=0.000 n=10+10)
```

Release note (sql change): previously, trigrams ignored multi-byte characters from input strings. This is now corrected.

94122: sql: implement the pg_timezone_names table r=rafiss a=otan

Informs #84505

Release note (sql change): Implement the `pg_timezone_names` pg_catalog table, which lists all supported timezones.

Co-authored-by: Jordan Lewis <jordanthelewis@gmail.com>
Co-authored-by: Oliver Tan <otan@cockroachlabs.com>
chrisseto commented:

#93757 shaves about 0.5 seconds off some of my queries! Wow, I did not expect that to be the source of any slowness.

Oddly, the queries against the already "analyzed" strings (tokenized, stemmed, joined with spaces) only saw a 0.05 second improvement. Overall an amazing change, though the results are a bit curious in some cases.

jordanlewis commented:

The problem with % compared to ILIKE is that % needs to search the boundary trigrams: the ones consisting of padding spaces followed by a letter, or a letter followed by a padding space. Those are typically very high cardinality for this dataset, because they essentially reduce to a 1-gram. It's a bit of an annoying problem...

I investigated with Postgres, and I'm noticing similar weirdness there: depending on the data and index, a % query may or may not be accelerated by the index, due to how expensive searching each of the boundary trigrams is.
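
To make the boundary-trigram cost concrete (a quick check against the distribution posted earlier, assuming pg_trgm-compatible extraction): the padded trigrams of 'cake' include the boundary trigrams "  c" and "ke ", and the words-table distribution above shows the "  c" trigram alone covering 95,313 of 644,580 rows.

```sql
SELECT show_trgm('cake');
-- expected: {"  c"," ca","ake","cak","ke "}
-- "  c" and "ke " are the boundary trigrams; per the earlier
-- distribution on the words table, "  c" alone matches ~95k rows,
-- so scanning its span approaches a sizable fraction of the table.
```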

chrisseto commented:

Interesting! Thanks for digging into this further. I wonder if we could overload the similarity function to accept some tuning options. It would hurt compatibility a little but might result in a better UX? Though I suppose the actual solution here is tsvectors, isn't it?

What should we do with this issue? It seems like this is less of a bug and more of a very niche use case that may or may not be worth pouring more time into.

jordanlewis commented:

I agree, I don't think it's a bug, so I'll close it. I think the answer is tsvectors if you're looking for normal word search as opposed to fuzzy search. But fuzzy search still shouldn't be super slow; then again, I think we're working "as expected" here in terms of performance, at least for now.

Hopefully tsvectors are coming... we'll see!

chrisseto commented:

SGTM! Thanks for the improvements and all the debugging.
