feat(query): inverted index use empty position data when query not contain phrase terms #15362

b41sh · 2024-04-28T10:33:15Z

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

Among the inverted index files, the positions file records the position of each term in the original text. It is used to check whether terms are adjacent to each other when searching for phrases. Positions files are usually large, so reading the inverted index data takes a long time and slows down queries. For example, the files of a 55M inverted index have the following sizes:

positions file 37.8M
postings file 10.7M
terms file 6.3M
store file 19.2K
field_norms file 19.1K
fast file 0.1K
meta.json file 0.7K
managed.json file 0.2K

The positions file takes up about 69% of the total size of all the files, which is the main reason reading the index data is slow. The positions file is only used for phrase queries; other queries never touch it. When a query doesn't contain phrase terms, we can skip reading the positions file and use an empty positions file instead, which greatly speeds up the query. In our tests, the query time for queries without phrase terms was reduced by about 50%.
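The decision described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `needs_positions` and `files_to_read` are hypothetical helpers, and treating a double-quoted span with more than one word as a phrase term is an assumption based on the `body:"..."` query used in the benchmark in this description.

```python
import re

# File names follow the size listing above; the selection logic is an assumption.
ALL_FILES = ["positions", "postings", "terms", "store", "field_norms", "fast"]

# Assumption: phrase terms are written as double-quoted spans, e.g. body:"a b c".
QUOTED_RE = re.compile(r'"([^"]*)"')


def needs_positions(query_text):
    """Return True if any quoted span holds more than one word (hypothetical helper)."""
    return any(len(span.split()) > 1 for span in QUOTED_RE.findall(query_text))


def files_to_read(query_text):
    """Skip the positions file when the query has no phrase term."""
    if needs_positions(query_text):
        return list(ALL_FILES)
    # A real reader would substitute empty position data for the skipped file.
    return [name for name in ALL_FILES if name != "positions"]


if __name__ == "__main__":
    for q in ["body:test", 'body:"third canonical disulfide bridge"']:
        print(q, "->", files_to_read(q))
```

With this sketch, `body:test` and `body:glioblastoma` would skip the positions file, while the quoted phrase query would still read it.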

For example:

mysql> CREATE TABLE pmc20 (
    ->   name VARCHAR NULL,
    ->   journal VARCHAR NULL,
    ->   date VARCHAR NULL,
    ->   volume VARCHAR NULL,
    ->   issue VARCHAR NULL,
    ->   accession VARCHAR NULL,
    ->   timestamp TIMESTAMP NULL,
    ->   pmid VARCHAR NULL,
    ->   body VARCHAR NULL
    -> );
Query OK, 0 rows affected (0.10 sec)

mysql> create INVERTED INDEX idx1 on pmc20(name, journal, accession, body);
Query OK, 0 rows affected (0.06 sec)

mysql> COPY INTO pmc20 FROM 'fs:///data2/b41sh/bench/documents.json' FILE_FORMAT = (type = NDJSON);
+----------------------------------+-------------+-------------+-------------+------------------+
| File                             | Rows_loaded | Errors_seen | First_error | First_error_line |
+----------------------------------+-------------+-------------+-------------+------------------+
| data2/b41sh/bench/documents.json |      574199 |           0 | NULL        |             NULL |
+----------------------------------+-------------+-------------+-------------+------------------+
1 row in set (11 min 1.82 sec)
Read 1148398 rows, 43.02 GiB in 661.731 sec., 1.74 thousand rows/sec., 66.57 MiB/sec.

## old query
mysql> select count(*) from pmc20 where query('body:test');
+----------+
| count(*) |
+----------+
|   347849 |
+----------+
1 row in set (2.53 sec)
Read 574199 rows, 0.00 B in 2.485 sec., 231.09 thousand rows/sec., 0.00 B/sec.

mysql> select count(*) from pmc20 where query('body:glioblastoma');
+----------+
| count(*) |
+----------+
|    10955 |
+----------+
1 row in set (1.79 sec)
Read 574199 rows, 0.00 B in 1.706 sec., 336.6 thousand rows/sec., 0.00 B/sec.

mysql> select count(*) from pmc20 where query('body:"third canonical disulfide bridge"');
+----------+
| count(*) |
+----------+
|        1 |
+----------+
1 row in set (1.69 sec)
Read 3039 rows, 0.00 B in 1.619 sec., 1.88 thousand rows/sec., 0.00 B/sec.


## new query
mysql> select count(*) from pmc20 where query('body:test');
+----------+
| count(*) |
+----------+
|   347849 |
+----------+
1 row in set (1.37 sec)
Read 574199 rows, 0.00 B in 1.301 sec., 441.49 thousand rows/sec., 0.00 B/sec.

mysql> select count(*) from pmc20 where query('body:glioblastoma');
+----------+
| count(*) |
+----------+
|    10955 |
+----------+
1 row in set (0.62 sec)
Read 574199 rows, 0.00 B in 0.542 sec., 1.06 million rows/sec., 0.00 B/sec.

mysql> select count(*) from pmc20 where query('body:"third canonical disulfide bridge"');
+----------+
| count(*) |
+----------+
|        1 |
+----------+
1 row in set (1.52 sec)
Read 3039 rows, 0.00 B in 1.435 sec., 2.12 thousand rows/sec., 0.00 B/sec.
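
The reduction implied by these timings can be worked out directly. The snippet below is just arithmetic on the wall-clock times reported in the old and new runs above; `reduction_percent` is a helper name introduced here for illustration.

```python
# Wall-clock times in seconds, copied from the "old query" and "new query" runs above.
timings = {
    "body:test": (2.53, 1.37),
    "body:glioblastoma": (1.79, 0.62),
    'body:"third canonical disulfide bridge"': (1.69, 1.52),
}


def reduction_percent(old, new):
    """Percentage by which the query time dropped from old to new."""
    return (old - new) / old * 100.0


if __name__ == "__main__":
    for query, (old, new) in timings.items():
        pct = reduction_percent(old, new)
        print(f"{query}: {old:.2f}s -> {new:.2f}s ({pct:.0f}% less)")
```

The two queries without phrase terms drop by roughly 46% and 65%, consistent with the "about 50%" figure in the summary. The phrase query changes by only about 10%, which fits the description, since phrase queries still need the positions file.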

part of #14825

Tests

Unit Test
Logic Test
Benchmark Test
No Test - Explain why

Type of change

Bug Fix (non-breaking change which fixes an issue)
New Feature (non-breaking change which adds functionality)
Breaking Change (fix or feature that could cause existing functionality not to work as expected)
Documentation Update
Refactoring
Performance Improvement
Other (please describe):


BohuTANG · 2024-04-28T14:23:13Z

Some panic during the tests :/

feat(query): inverted index use empty position data when query not co…

ff6b077

…ntain phrase terms

b41sh requested review from BohuTANG and sundy-li April 28, 2024 10:33

github-actions bot added the pr-feature label (this PR introduces a new feature to the codebase) Apr 28, 2024

sundy-li approved these changes Apr 28, 2024


score as optional

26be0f8

b41sh added 6 commits April 28, 2024 22:39

fix

0d4f7f3

fix

69e3ae1

check need position in pruner

5f54557

fix

ce5cd72

Merge branch 'main' into feat-inverted-index-12

a0e86ce

fix tests

ec1af71

b41sh enabled auto-merge April 28, 2024 17:16

fix tests

37f93f8

b41sh changed the title ~~feat(query): inverted index use empty position data when query not co…~~ feat(query): inverted index use empty position data when query not contain phrase terms Apr 28, 2024

b41sh added this pull request to the merge queue Apr 28, 2024

Merged via the queue into datafuselabs:main with commit 9fb0a2f Apr 28, 2024
78 checks passed

b41sh deleted the feat-inverted-index-12 branch April 28, 2024 18:30

b41sh mentioned this pull request Apr 29, 2024

chore: purge inverted index #15354

Merged

11 tasks
