feat(query): Inverted index support set filters and index record options #15254

Merged
merged 10 commits into from
Apr 18, 2024

@b41sh b41sh commented Apr 17, 2024

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

  • Inverted index supports setting filters and index record options.
  • Change the estimated memory display format in explain memo, so that the test results do not need to be updated whenever the PushDownInfo struct changes.

Filters

The tokenizer can add several types of filters:

  1. english_stop: removes English stop words, such as a, an, and the.
  2. english_stemmer: maps different forms of the same word to a common root; for example, walking and walked are both mapped to walk.
  3. chinese_stop: removes Chinese stop words; currently only Chinese punctuation marks are supported.

In addition to these optional filters, the tokenizer always adds the lowercase filter, which converts all terms to lowercase so that a search matches rows regardless of letter case.

Users can set filters as required. Multiple filters must be separated by commas (,).

Filters can make a search match the intended data more accurately, but each additional filter also makes index refresh a little slower.
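
To make the filter chain concrete, here is a minimal Python sketch of the idea. It is a toy model, not the Databend or Tantivy implementation: the stop word set, the `toy_stem` suffix rules, and the `analyze` function are all hypothetical, and applying filters in the order they are listed is an assumption.

```python
# Toy model of a tokenizer filter chain; NOT Databend's or Tantivy's code.
import re

# Tiny illustrative subset of English stop words (assumption).
ENGLISH_STOP = {"a", "an", "the"}


def toy_stem(term):
    """Very rough suffix stripping, for illustration only."""
    for suffix, replacement in (("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")):
        if term.endswith(suffix) and len(term) - len(suffix) >= 2:
            return term[: len(term) - len(suffix)] + replacement
    return term


def analyze(text, filters=""):
    """Split into words, always lowercase, then apply the optional filters.

    `filters` is a comma-separated string, mirroring the option in this PR.
    """
    names = [name.strip() for name in filters.split(",") if name.strip()]
    terms = [t.lower() for t in re.findall(r"[A-Za-z]+", text)]
    for name in names:
        if name == "english_stop":
            terms = [t for t in terms if t not in ENGLISH_STOP]
        elif name == "english_stemmer":
            terms = [toy_stem(t) for t in terms]
        else:
            raise ValueError(f"unknown filter: {name}")
    return terms


print(analyze("The fruit flies", "english_stop,english_stemmer"))
print(analyze("The fruit flies"))
```

The same analysis has to run on the search term as well, which is why a query for fly can match a row containing flies once the stemmer is enabled, and why a query for the matches nothing once stop words are removed.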

Index record

There are three types of index records that support different needs.

  1. basic: stores only the DocId; takes up the least space, but cannot search for phrase terms, like "quick brown fox".
  2. freq: stores the DocId and term frequency; takes up a medium amount of space and gives better scoring, but still cannot search for phrase terms.
  3. position: stores the DocId, term frequency, and positions; takes up the most space, gives better scoring, and can search for phrase terms.

To support more sophisticated queries, the index record uses position by default. Users who want to reduce the size of the index file and do not need to search for phrase terms can choose freq. Users who are also not very concerned about the match score of the search term can choose basic.
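
The trade-off between the three record types can be sketched with a toy inverted index in Python. This is only an illustration under stated assumptions: the `ToyIndex` class and its methods are hypothetical and do not reflect how Databend or Tantivy store postings on disk.

```python
# Toy inverted index showing what each index record level keeps.
# NOT Databend's or Tantivy's storage format; names are hypothetical.
from collections import defaultdict


class ToyIndex:
    def __init__(self, docs, index_record="position"):
        """docs maps doc_id -> list of already-analyzed terms."""
        if index_record not in ("basic", "freq", "position"):
            raise ValueError(f"unknown index_record: {index_record}")
        self.index_record = index_record
        positions = defaultdict(dict)
        for doc_id, terms in docs.items():
            for pos, term in enumerate(terms):
                positions[term].setdefault(doc_id, []).append(pos)
        # term -> {doc_id: payload}; the payload depends on the record level.
        self.postings = {}
        for term, by_doc in positions.items():
            if index_record == "basic":
                self.postings[term] = {d: None for d in by_doc}
            elif index_record == "freq":
                self.postings[term] = {d: len(p) for d, p in by_doc.items()}
            else:
                self.postings[term] = {d: list(p) for d, p in by_doc.items()}

    def match(self, term):
        """Doc ids containing the term; works at every record level."""
        return sorted(self.postings.get(term, {}))

    def phrase(self, terms):
        """Doc ids where the terms appear consecutively; needs positions."""
        if self.index_record != "position":
            raise ValueError("positions are not indexed")
        if not terms:
            return []
        hits = []
        for doc_id, starts in self.postings.get(terms[0], {}).items():
            for start in starts:
                if all(
                    start + i in self.postings.get(t, {}).get(doc_id, [])
                    for i, t in enumerate(terms)
                ):
                    hits.append(doc_id)
                    break
        return sorted(hits)


docs = {
    1: ["the", "quick", "brown", "fox"],
    2: ["brown", "quick", "fox"],
    3: ["the", "the", "end"],
}
print(ToyIndex(docs, "position").phrase(["quick", "brown", "fox"]))
print(ToyIndex(docs, "freq").postings["the"])
```

A single-term match only needs the doc ids, scoring benefits from the per-document frequencies, and a phrase query has to compare positions, so it cannot be answered once positions are dropped.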

For example:

mysql> CREATE TABLE t (id int, content string);
Query OK, 0 rows affected (0.12 sec)

# add the `english_stop` and `english_stemmer` filters
mysql> CREATE INVERTED INDEX IF NOT EXISTS idx1 ON t(content) tokenizer = 'english' filters = 'english_stop,english_stemmer';
Query OK, 0 rows affected (0.06 sec)

mysql> INSERT INTO t VALUES
    -> (1, 'The quick brown fox jumps over the lazy dog'),
    -> (2, 'A picture is worth a thousand words'),
    -> (3, 'The early bird catches the worm'),
    -> (4, 'Actions speak louder than words'),
    -> (5, 'Time flies like an arrow; fruit flies like a banana'),
    -> (6, 'Beauty is in the eye of the beholder'),
    -> (7, 'When life gives you lemons, make lemonade'),
    -> (8, 'Put all your eggs in one basket'),
    -> (9, 'You can not judge a book by its cover'),
    -> (10, 'An apple a day keeps the doctor away');
Query OK, 10 rows affected (0.25 sec)

# the word `the` cannot be matched, because it is an English stop word and is removed by the `english_stop` filter.
mysql> SELECT id, score(), content FROM t WHERE match(content, 'the');
Empty set (0.06 sec)
Read 0 rows, 0.00 B in 0.017 sec., 0 rows/sec., 0.00 B/sec.

# the word `fly` can match `flies`, because `flies` is mapped to `fly` by the `english_stemmer` filter.
mysql> SELECT id, score(), content FROM t WHERE match(content, 'fly');
+------+-----------+-----------------------------------------------------+
| id   | score()   | content                                             |
+------+-----------+-----------------------------------------------------+
|    5 | 2.4594712 | Time flies like an arrow; fruit flies like a banana |
+------+-----------+-----------------------------------------------------+
1 row in set (0.07 sec)
Read 10 rows, 504.00 B in 0.018 sec., 560.38 rows/sec., 27.58 KiB/sec.

# phrase terms can be searched, because the default index record stores positions.
mysql> SELECT id, score(), content FROM t WHERE query('content:"quick brown fox"');
+------+-----------+---------------------------------------------+
| id   | score()   | content                                     |
+------+-----------+---------------------------------------------+
|    1 | 5.4435673 | The quick brown fox jumps over the lazy dog |
+------+-----------+---------------------------------------------+
1 row in set (0.07 sec)
Read 10 rows, 504.00 B in 0.028 sec., 357.48 rows/sec., 17.59 KiB/sec.

# create a new inverted index without filters and change `index_record` to `basic`
mysql> CREATE OR REPLACE INVERTED INDEX idx1 ON t(content) tokenizer = 'english' index_record='basic';
Query OK, 0 rows affected (0.07 sec)

mysql> REFRESH INVERTED INDEX idx1 ON t;
Query OK, 0 rows affected (0.19 sec)

# the word `the` can be matched, because the `english_stop` filter is not used.
mysql> SELECT id, score(), content FROM t WHERE match(content, 'the');
+------+-----------+---------------------------------------------+
| id   | score()   | content                                     |
+------+-----------+---------------------------------------------+
|    1 | 0.8323383 | The quick brown fox jumps over the lazy dog |
|    3 | 0.9893832 | The early bird catches the worm             |
|    6 | 0.8788376 | Beauty is in the eye of the beholder        |
|   10 | 0.8788376 | An apple a day keeps the doctor away        |
+------+-----------+---------------------------------------------+
4 rows in set (0.12 sec)
Read 10 rows, 504.00 B in 0.036 sec., 276.09 rows/sec., 13.59 KiB/sec.

# the word `fly` cannot be matched, because the `english_stemmer` filter is not used.
mysql> SELECT id, score(), content FROM t WHERE match(content, 'fly');
Empty set (0.07 sec)
Read 0 rows, 0.00 B in 0.015 sec., 0 rows/sec., 0.00 B/sec.

# phrase terms cannot be searched, because the index record is `basic` and positions are not stored.
mysql> SELECT id, score(), content FROM t WHERE query('content:"quick brown fox"')
    -> ;
ERROR 1105 (HY000): TantivyQueryParserError. Code: 1903, Text = The field 'content' does not have positions indexed.

part of #14825

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):

@b41sh b41sh requested a review from sundy-li April 17, 2024 09:10
@github-actions github-actions bot added the pr-feature this PR introduces a new feature to the codebase label Apr 17, 2024
@b41sh b41sh requested a review from Dousir9 April 17, 2024 14:27
@b41sh b41sh added this pull request to the merge queue Apr 18, 2024
@BohuTANG BohuTANG removed this pull request from the merge queue due to a manual request Apr 18, 2024
@BohuTANG BohuTANG merged commit d14cb55 into datafuselabs:main Apr 18, 2024
72 checks passed