Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(query): inverted index use empty position data when query not contain phrase terms #15362

Merged
merged 9 commits into from
Apr 28, 2024

Conversation

b41sh
Copy link
Member

@b41sh b41sh commented Apr 28, 2024

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

In the files of the inverted index, the position file records the position of each term in the original text, which is used to judge whether the terms are adjacent to each other when searching for phrases. Position files are usually large in size, which leads to a long time of reading the data of the inverted index and affects the query speed. For example, the size of each file of a 55M inverted index data is as follows:

  1. positions file 37.8M
  2. postings file 10.7M
  3. terms file 6.3M
  4. store file 19.2K
  5. field_norms file 19.1K
  6. fast file 0.1K
  7. meta.json file 0.7K
  8. managed.json file 0.2K

We can see that the positions file takes up 68% of the total size of all the files, which is the main reason for the slow speed of reading the index data. Since the positions file is only used when querying for phrases, it is not used for other queries. We can choose not to read the positions file when the querying don't contain phrase terms, and use an empty positions file instead, which can greatly speed up the query. After testing, we found that the query time was reduced by about 50%.

for example

mysql> CREATE TABLE pmc20 (
    ->   name VARCHAR NULL,
    ->   journal VARCHAR NULL,
    ->   date VARCHAR NULL,
    ->   volume VARCHAR NULL,
    ->   issue VARCHAR NULL,
    ->   accession VARCHAR NULL,
    ->   timestamp TIMESTAMP NULL,
    ->   pmid VARCHAR NULL,
    ->   body VARCHAR NULL
    -> );
Query OK, 0 rows affected (0.10 sec)

mysql> create INVERTED INDEX idx1 on pmc20(name, journal, accession, body);
Query OK, 0 rows affected (0.06 sec)

mysql> COPY INTO pmc20 FROM 'fs:///data2/b41sh/bench/documents.json' FILE_FORMAT = (type = NDJSON);
+----------------------------------+-------------+-------------+-------------+------------------+
| File                             | Rows_loaded | Errors_seen | First_error | First_error_line |
+----------------------------------+-------------+-------------+-------------+------------------+
| data2/b41sh/bench/documents.json |      574199 |           0 | NULL        |             NULL |
+----------------------------------+-------------+-------------+-------------+------------------+
1 row in set (11 min 1.82 sec)
Read 1148398 rows, 43.02 GiB in 661.731 sec., 1.74 thousand rows/sec., 66.57 MiB/sec.

## old query
mysql> select count(*) from pmc20 where query('body:test');
+----------+
| count(*) |
+----------+
|   347849 |
+----------+
1 row in set (2.53 sec)
Read 574199 rows, 0.00 B in 2.485 sec., 231.09 thousand rows/sec., 0.00 B/sec.

mysql> select count(*) from pmc20 where query('body:glioblastoma');
+----------+
| count(*) |
+----------+
|    10955 |
+----------+
1 row in set (1.79 sec)
Read 574199 rows, 0.00 B in 1.706 sec., 336.6 thousand rows/sec., 0.00 B/sec.

mysql> select count(*) from pmc20 where query('body:"third canonical disulfide bridge"');
+----------+
| count(*) |
+----------+
|        1 |
+----------+
1 row in set (1.69 sec)
Read 3039 rows, 0.00 B in 1.619 sec., 1.88 thousand rows/sec., 0.00 B/sec.


## new query
mysql> select count(*) from pmc20 where query('body:test');
+----------+
| count(*) |
+----------+
|   347849 |
+----------+
1 row in set (1.37 sec)
Read 574199 rows, 0.00 B in 1.301 sec., 441.49 thousand rows/sec., 0.00 B/sec.

mysql> select count(*) from pmc20 where query('body:glioblastoma');
+----------+
| count(*) |
+----------+
|    10955 |
+----------+
1 row in set (0.62 sec)
Read 574199 rows, 0.00 B in 0.542 sec., 1.06 million rows/sec., 0.00 B/sec.

mysql> select count(*) from pmc20 where query('body:"third canonical disulfide bridge"');
+----------+
| count(*) |
+----------+
|        1 |
+----------+
1 row in set (1.52 sec)
Read 3039 rows, 0.00 B in 1.435 sec., 2.12 thousand rows/sec., 0.00 B/sec.

part of #14825

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):

This change is Reviewable

@b41sh b41sh requested review from BohuTANG and sundy-li April 28, 2024 10:33
@github-actions github-actions bot added the pr-feature this PR introduces a new feature to the codebase label Apr 28, 2024
@BohuTANG
Copy link
Member

Some panic during the tests :/

@b41sh b41sh enabled auto-merge April 28, 2024 17:16
@b41sh b41sh changed the title feat(query): inverted index use empty position data when query not co… feat(query): inverted index use empty position data when query not contain phrase terms Apr 28, 2024
@b41sh b41sh added this pull request to the merge queue Apr 28, 2024
Merged via the queue into datafuselabs:main with commit 9fb0a2f Apr 28, 2024
78 checks passed
@b41sh b41sh deleted the feat-inverted-index-12 branch April 28, 2024 18:30
@b41sh b41sh mentioned this pull request Apr 29, 2024
11 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr-feature this PR introduces a new feature to the codebase
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants