feat(query): Inverted index support set filters and index record options #15254

b41sh · 2024-04-17T09:10:10Z

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

Inverted indexes now support setting filters and index record options.
Change the estimated memory display format in explain memo, so that the test results do not need to change whenever the PushDownInfo struct changes.

Filters

The tokenizer can apply several types of filters:

english_stop: removes English stop words, such as a, an, and the.
english_stemmer: maps different forms of the same word to a common stem; for example, walking and walked are both mapped to walk.
chinese_stop: removes Chinese stop words; currently this covers only Chinese punctuation, such as ， and 。.

In addition to these optional filters, the tokenizer always adds the lowercase filter, which converts all terms to lowercase so that a search matches rows regardless of the case of the stored text or the search term.

Users can set filters as required. Multiple filters must be separated by commas (,).

Filters can make searches match the queried data more accurately, but each additional filter also makes index refresh slightly slower.
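To make the filter chain concrete, here is a minimal Python sketch of how a comma-separated `filters` string could drive token processing. It is a toy illustration only, not the Databend or Tantivy implementation: the stop-word list, the `naive_stem` suffix rules, and the `analyze` function are all hypothetical, and only the two English filters are modeled.

```python
import re

# Tiny, hypothetical stop list; a real english_stop filter uses a much longer one.
ENGLISH_STOP = {"a", "an", "the", "is", "in", "of"}


def naive_stem(term: str) -> str:
    """Very rough stand-in for a real English stemmer (assumption, not Tantivy's)."""
    for suffix, repl in (("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")):
        if term.endswith(suffix) and len(term) - len(suffix) >= 2:
            return term[: len(term) - len(suffix)] + repl
    return term


def analyze(text: str, filters: str = "") -> list[str]:
    """Tokenize text, always lowercase it, then apply the optional filters in order."""
    names = [f.strip() for f in filters.split(",") if f.strip()]
    # Split on non-alphanumeric characters; the lowercase filter is always applied.
    terms = [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]
    for name in names:
        if name == "english_stop":
            terms = [t for t in terms if t not in ENGLISH_STOP]
        elif name == "english_stemmer":
            terms = [naive_stem(t) for t in terms]
        else:
            raise ValueError(f"filter not modeled in this sketch: {name}")
    return terms


if __name__ == "__main__":
    print(analyze("Time flies like an arrow", "english_stop,english_stemmer"))
    print(analyze("The quick brown fox"))
```

The same analysis is applied to the indexed text and to the search term, which is why a query for `fly` can match a row containing `flies` once the stemmer is enabled, and why `the` matches nothing once stop words are removed.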

Index record

There are three types of index records that support different needs.

basic: stores only the DocId; takes up the least space, but cannot search for phrase terms such as "quick brown fox".
freq: stores the DocId and term frequency; takes up a medium amount of space and also cannot search for phrase terms, but gives better scoring.
position: stores the DocId, term frequency, and positions; takes up the most space, gives better scoring, and can search for phrase terms.

To support more sophisticated queries, the index record defaults to position. Users who want to reduce the size of the index file and do not need to search for phrase terms can choose freq. Users who are also not concerned about the match score of the search term can choose basic.
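The trade-off between the three record types can be sketched with a toy in-memory inverted index in Python. This is an illustration under stated assumptions, not how Databend or Tantivy stores postings: `build_index` and `phrase_search` are hypothetical names, tokenization is a plain whitespace split, and the sketch raises a `ValueError` where the real system reports a query parser error.

```python
from collections import defaultdict


def build_index(docs: dict[int, str], index_record: str = "position") -> dict:
    """Build term -> {doc_id: payload}; the payload depends on index_record."""
    if index_record not in ("basic", "freq", "position"):
        raise ValueError(f"unknown index_record: {index_record}")
    postings: dict = defaultdict(dict)
    for doc_id, text in docs.items():
        positions = defaultdict(list)
        for pos, term in enumerate(text.lower().split()):
            positions[term].append(pos)
        for term, pos_list in positions.items():
            if index_record == "basic":
                postings[term][doc_id] = None           # DocId only
            elif index_record == "freq":
                postings[term][doc_id] = len(pos_list)  # DocId + term frequency
            else:
                postings[term][doc_id] = pos_list       # DocId + freq + positions
    return dict(postings)


def phrase_search(postings: dict, index_record: str, phrase: str) -> list[int]:
    """Return doc ids containing the phrase terms at consecutive positions."""
    if index_record != "position":
        raise ValueError("positions are not indexed; phrase search is unavailable")
    terms = phrase.lower().split()
    if not terms or any(t not in postings for t in terms):
        return []
    candidates = set(postings[terms[0]])
    for t in terms[1:]:
        candidates &= set(postings[t])
    hits = []
    for doc_id in sorted(candidates):
        starts = postings[terms[0]][doc_id]
        if any(all(s + i in postings[t][doc_id] for i, t in enumerate(terms))
               for s in starts):
            hits.append(doc_id)
    return hits


if __name__ == "__main__":
    docs = {1: "The quick brown fox jumps over the lazy dog",
            3: "The early bird catches the worm"}
    index = build_index(docs, "position")
    print(phrase_search(index, "position", "quick brown fox"))
```

Phrase matching needs to know that the terms appear next to each other, which only the position payload records; basic and freq can still answer single-term lookups, with freq additionally supplying the counts a scoring function would use.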

For example:

mysql> CREATE TABLE t (id int, content string);
Query OK, 0 rows affected (0.12 sec)

# add filter `english_stop` and `english_stemmer`
mysql> CREATE INVERTED INDEX IF NOT EXISTS idx1 ON t(content) tokenizer = 'english' filters = 'english_stop,english_stemmer';
Query OK, 0 rows affected (0.06 sec)

mysql> INSERT INTO t VALUES
    -> (1, 'The quick brown fox jumps over the lazy dog'),
    -> (2, 'A picture is worth a thousand words'),
    -> (3, 'The early bird catches the worm'),
    -> (4, 'Actions speak louder than words'),
    -> (5, 'Time flies like an arrow; fruit flies like a banana'),
    -> (6, 'Beauty is in the eye of the beholder'),
    -> (7, 'When life gives you lemons, make lemonade'),
    -> (8, 'Put all your eggs in one basket'),
    -> (9, 'You can not judge a book by its cover'),
    -> (10, 'An apple a day keeps the doctor away');
Query OK, 10 rows affected (0.25 sec)

# word `the` can not be matched, because it is an English stop word and is removed by `english_stop` filter.
mysql> SELECT id, score(), content FROM t WHERE match(content, 'the');
Empty set (0.06 sec)
Read 0 rows, 0.00 B in 0.017 sec., 0 rows/sec., 0.00 B/sec.

# word `fly` can match `flies`, because `flies` is mapped to `fly` by the `english_stemmer` filter.
mysql> SELECT id, score(), content FROM t WHERE match(content, 'fly');
+------+-----------+-----------------------------------------------------+
| id   | score()   | content                                             |
+------+-----------+-----------------------------------------------------+
|    5 | 2.4594712 | Time flies like an arrow; fruit flies like a banana |
+------+-----------+-----------------------------------------------------+
1 row in set (0.07 sec)
Read 10 rows, 504.00 B in 0.018 sec., 560.38 rows/sec., 27.58 KiB/sec.

# phrase terms can be searched, because the default index record is `position`.
mysql> SELECT id, score(), content FROM t WHERE query('content:"quick brown fox"');
+------+-----------+---------------------------------------------+
| id   | score()   | content                                     |
+------+-----------+---------------------------------------------+
|    1 | 5.4435673 | The quick brown fox jumps over the lazy dog |
+------+-----------+---------------------------------------------+
1 row in set (0.07 sec)
Read 10 rows, 504.00 B in 0.028 sec., 357.48 rows/sec., 17.59 KiB/sec.

# create new inverted index without filters and change `index_record` to `basic`
mysql> CREATE OR REPLACE INVERTED INDEX idx1 ON t(content) tokenizer = 'english' index_record='basic';
Query OK, 0 rows affected (0.07 sec)

mysql> REFRESH INVERTED INDEX idx1 ON t;
Query OK, 0 rows affected (0.19 sec)

# word `the` can be matched, because `english_stop` filter is not used.
mysql> SELECT id, score(), content FROM t WHERE match(content, 'the');
+------+-----------+---------------------------------------------+
| id   | score()   | content                                     |
+------+-----------+---------------------------------------------+
|    1 | 0.8323383 | The quick brown fox jumps over the lazy dog |
|    3 | 0.9893832 | The early bird catches the worm             |
|    6 | 0.8788376 | Beauty is in the eye of the beholder        |
|   10 | 0.8788376 | An apple a day keeps the doctor away        |
+------+-----------+---------------------------------------------+
4 rows in set (0.12 sec)
Read 10 rows, 504.00 B in 0.036 sec., 276.09 rows/sec., 13.59 KiB/sec.

# word `fly` can not be matched, because the `english_stemmer` filter is not used.
mysql> SELECT id, score(), content FROM t WHERE match(content, 'fly');
Empty set (0.07 sec)
Read 0 rows, 0.00 B in 0.015 sec., 0 rows/sec., 0.00 B/sec.

# can not search with phrase terms, because index record is basic, position is not stored.
mysql> SELECT id, score(), content FROM t WHERE query('content:"quick brown fox"')
    -> ;
ERROR 1105 (HY000): TantivyQueryParserError. Code: 1903, Text = The field 'content' does not have positions indexed.

part of #14825

Tests

Unit Test
Logic Test
Benchmark Test
No Test - Explain why

Type of change

Bug Fix (non-breaking change which fixes an issue)
New Feature (non-breaking change which adds functionality)
Breaking Change (fix or feature that could cause existing functionality not to work as expected)
Documentation Update
Refactoring
Performance Improvement
Other (please describe):

feat(query): Inverted index support set filters and index record options

bdb9e80

b41sh requested a review from sundy-li April 17, 2024 09:10

github-actions bot added the pr-feature label (this PR introduces a new feature to the codebase) Apr 17, 2024

b41sh added 4 commits April 17, 2024 18:59

fix

5472585

fix mem size

ceb8cc3

fix

8074f03

fix Memo estimated memory

8c97e7d

b41sh requested a review from Dousir9 April 17, 2024 14:27

b41sh added 3 commits April 17, 2024 22:47

fix

71957e7

Merge branch 'main' into feat-inverted-index-11

83e1ede

fix

e7de8b8

sundy-li approved these changes Apr 18, 2024

b41sh added 2 commits April 18, 2024 10:26

fix

33bb2ae

Merge branch 'main' into feat-inverted-index-11

c7a1f50

b41sh force-pushed the feat-inverted-index-11 branch from 5fac4e8 to c7a1f50 April 18, 2024 02:29

b41sh added this pull request to the merge queue Apr 18, 2024

BohuTANG removed this pull request from the merge queue due to a manual request Apr 18, 2024

BohuTANG merged commit d14cb55 into datafuselabs:main Apr 18, 2024
72 checks passed
