
refactor: generate inverted indexes for each block #15150

Merged
merged 20 commits into from Apr 3, 2024

Conversation


@b41sh b41sh commented Apr 1, 2024

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

In the previous PR #14997, we implemented the inverted index so that one index file corresponds to the data of multiple blocks. In testing, we found that this design leads to large index files and longer read times, which hurts performance. This PR therefore reworks the index structure; the detailed design is as follows.

Index Design

We generate one index file for each block. When filtering, the index file is used to determine which rows in the block match the query keywords; if no row matches, the block can be pruned, as shown in the following figure:

 ┌────────┐     ┌────────┐   ┌────────┐     ┌────────┐
 │ Index1 │ ... │ IndexM │   │ IndexN │ ... │ IndexZ │
 └────────┘     └────────┘   └────────┘     └────────┘
     |              |            |              |
     |              |            |              |
 ┌────────┐     ┌────────┐   ┌────────┐     ┌────────┐
 │ Block1 │ ... │ BlockM │   │ BlockN │ ... │ BlockZ │
 └────────┘     └────────┘   └────────┘     └────────┘
  \                     /     \                     /
   \          _________/       \          _________/
    \        /                  \        /
     Segment1           ...      SegmentN

Compared with the previous design, this has the following benefits:

  1. There is no need to maintain separate index info metadata, which simplifies the snapshot design.
  2. Index data can be generated and read in parallel, speeding up both indexing and querying.
  3. Inverted index filtering can run after bloom filtering and range filtering, so part of the index does not need to be read at all.

Index file structure

The index data generated by Tantivy is stored in a directory containing several files; we merge those files into a single file to use as the index data. The index file contains the following parts:

┌──────────────┐
│ fast fields  │
├──────────────┤
│ store        │
├──────────────┤
│ field norms  │
├──────────────┤
│ positions    │
├──────────────┤
│ postings     │
├──────────────┤
│ terms        │
├──────────────┤
│ meta.json    │
├──────────────┤
│ managed.json │
├──────────────┤
│ offsets      │
└──────────────┘
  • terms stores the term dictionary as an FST (Finite State Transducer). The key is the term after tokenization; the value is an address into the postings and the positions.
  • postings stores the lists of document ids and term frequencies.
  • positions stores the positions of terms in each document.
  • field norms stores the total length (number of terms) of each field in each document.
  • fast fields stores column-oriented documents (not used).
  • store stores row-oriented documents (not used).
  • meta.json stores the meta information associated with the index, for example:
{
  "index_settings": {
    "docstore_compression": "lz4",
    "docstore_blocksize": 16384
  },
  "segments":[{
    "segment_id": "94bce521-d5bc-4ecc-bf3e-7a0212093622",
    "max_doc": 6,
    "deletes": null
  }],
  "schema":[{
    "name": "title",
    "type": "text",
    "options": {
      "indexing": {
        "record": "position",
        "fieldnorms": true,
        "tokenizer": "en"
      },
      "stored": false,
      "fast": false
    }
  }],
  "opstamp":0
}
  • managed.json stores the names of the files that the index contains, for example:
[
  "94bce521d5bc4eccbf3e7a0212093622.pos",
  "94bce521d5bc4eccbf3e7a0212093622.fieldnorm",
  "94bce521d5bc4eccbf3e7a0212093622.fast",
  "meta.json",
  "94bce521d5bc4eccbf3e7a0212093622.term",
  "94bce521d5bc4eccbf3e7a0212093622.store",
  "94bce521d5bc4eccbf3e7a0212093622.idx"
]
  • offsets stores the offset of each part in the index file.
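The single-file layout above can be sketched as follows. This is a hypothetical illustration, not Databend's actual on-disk format: each part is concatenated into one buffer, and a footer records the end offset of each part so a reader can slice out any part without extra metadata.

```rust
use std::io::Write;

// Merge named index parts into one buffer and append an "offsets" footer.
// Footer layout (illustrative): newline-separated "name:end_offset" entries,
// followed by the footer length as a little-endian u32.
fn merge_parts(parts: &[(&str, &[u8])]) -> Vec<u8> {
    let mut buf = Vec::new();
    let mut offsets = Vec::new();
    for (name, data) in parts {
        // Writing to a Vec<u8> cannot fail, so unwrap is safe here.
        buf.write_all(data).unwrap();
        // Record where this part ends in the merged buffer.
        offsets.push(format!("{}:{}", name, buf.len()));
    }
    let footer = offsets.join("\n");
    let footer_len = footer.len() as u32;
    buf.extend_from_slice(footer.as_bytes());
    buf.extend_from_slice(&footer_len.to_le_bytes());
    buf
}
```

A reader would first read the trailing 4 bytes to find the footer, then use the recorded offsets to locate any part (e.g. only `terms` and `postings`) without scanning the whole file.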

When querying, we first read terms to get the addresses of the postings and positions. postings contains the matched rows and the frequency of the term in each row; together with field norms, the frequency is used to calculate a match score for each row. If the query keyword is a phrase, we also use the positions of each word to determine whether the matched terms actually form the phrase.
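The scoring and phrase-matching steps above can be sketched as follows. The BM25-style formula and the function names are assumptions for illustration only; Tantivy's internal scoring may differ in detail.

```rust
// BM25-style score for one term in one row: `tf` comes from postings,
// `field_len` from field norms. Constants k1/b are conventional defaults.
fn bm25_term_score(tf: f64, field_len: f64, avg_field_len: f64, doc_count: f64, doc_freq: f64) -> f64 {
    let (k1, b) = (1.2, 0.75);
    // Rarer terms (low doc_freq) contribute a higher inverse document frequency.
    let idf = ((doc_count - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0).ln();
    // Term frequency is saturated and normalized by field length.
    idf * (tf * (k1 + 1.0)) / (tf + k1 * (1.0 - b + b * field_len / avg_field_len))
}

// Phrase check for a two-word phrase "w1 w2": the words form the phrase in a
// row if some position p of w1 is immediately followed by a position p + 1 of w2.
fn is_phrase(pos_w1: &[u32], pos_w2: &[u32]) -> bool {
    pos_w1.iter().any(|p| pos_w2.contains(&(p + 1)))
}
```

For example, a row where "vellore" appears at position 9 and "institute" at position 10 matches the phrase query, while the two words appearing far apart does not.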

Block pruner

Databend supports several kinds of pruners, which determine whether a block can be pruned, in the following order:

  • The range pruner uses the maximum and minimum values of the fields in the block to determine whether a range query can prune the block.
  • The bloom pruner uses a bloom filter to determine whether a point query can prune the block.
  • The limit pruner uses the query's LIMIT to determine whether the block can be pruned.
  • The inverted index pruner uses the inverted index to determine whether a search query can prune the block.

If the query condition specifies more than one filter field, such as a time range, some blocks may already be pruned before the inverted index is consulted, so part of the index data does not need to be read, which speeds up the query.
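The ordered pruner chain can be sketched as follows. The trait and names are hypothetical, not Databend's actual API; the point is that cheaper pruners run first, so a block discarded early never reaches the inverted index pruner.

```rust
// A pruner decides whether a block might still contain matching rows.
trait Pruner {
    fn should_keep(&self, block_id: u64) -> bool;
}

// Apply pruners in order; a block survives only if every pruner keeps it.
// Because `all` short-circuits, later (more expensive) pruners are skipped
// for blocks already rejected by earlier ones.
fn prune(pruners: &[&dyn Pruner], blocks: &[u64]) -> Vec<u64> {
    blocks
        .iter()
        .copied()
        .filter(|b| pruners.iter().all(|p| p.should_keep(*b)))
        .collect()
}
```

In the real pipeline the slice would hold the range, bloom, limit, and inverted index pruners in that order.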

Tokenizer

Currently, we support two tokenizers, English and Chinese, to split the input text. Users can specify the tokenizer via an option when creating the index; if not specified, the default is the English tokenizer. For example:

CREATE TABLE t (id int, content string);

CREATE INVERTED INDEX IF NOT EXISTS idx1 ON t(content) tokenizer = 'english';

CREATE INVERTED INDEX IF NOT EXISTS idx2 ON t(content) tokenizer = 'chinese';
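A minimal sketch of what the English-style tokenization step does during indexing: lowercase the text and split on non-alphanumeric characters. The real 'english' and 'chinese' tokenizers are far more involved (the Chinese one needs word segmentation, since Chinese text has no spaces); this illustration is an assumption, not the actual implementation.

```rust
// Lowercase and split on non-alphanumeric characters, dropping empty tokens.
fn tokenize_en(text: &str) -> Vec<String> {
    text.to_lowercase()
        .split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_string())
        .collect()
}
```

Lowercasing at index time is what makes a search for '"Vellore INSTITUTE"' match the same rows as '"Vellore Institute"' in the benchmark below.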

Examples

Let's use pmc data as an example for testing.

The test data can be downloaded from this URL: https://s3.amazonaws.com/benchmarks.redislabs/redisearch/datasets/pmc/documents.json.bz2

mysql> create table pmc(
    ->     name string,
    ->     journal string,
    ->     date string,
    ->     volume string,
    ->     issue string,
    ->     accession string,
    ->     timestamp timestamp,
    ->     pmid string,
    ->     body string
    -> );
Query OK, 0 rows affected (0.42 sec)

mysql> copy into pmc from 'fs:///data2/b41sh/bench/documents.json' FILE_FORMAT = (type = ndjson);

+----------------------------------+-------------+-------------+-------------+------------------+
| File                             | Rows_loaded | Errors_seen | First_error | First_error_line |
+----------------------------------+-------------+-------------+-------------+------------------+
| data2/b41sh/bench/documents.json |      574199 |           0 | NULL        |             NULL |
+----------------------------------+-------------+-------------+-------------+------------------+
1 row in set (3 min 55.75 sec)
Read 574199 rows, 21.66 GiB in 235.667 sec., 2.44 thousand rows/sec., 94.11 MiB/sec.

mysql> create inverted index idx1 on pmc(name, journal, accession, body);
Query OK, 0 rows affected (0.79 sec)

mysql> refresh inverted index idx1 on pmc;
Query OK, 0 rows affected (14 min 11.57 sec)

# match cold run
mysql> select count(*) from pmc where match (body, '"Vellore Institute"');
+----------+
| count(*) |
+----------+
|       26 |
+----------+
1 row in set (2.35 sec)
Read 74142 rows, 0.00 B in 2.299 sec., 32.35 thousand rows/sec., 0.00 B/sec.

# match hot run
# use upper case word to ignore pruner cache
mysql> select count(*) from pmc where match (body, '"Vellore INSTITUTE"');
+----------+
| count(*) |
+----------+
|       26 |
+----------+
1 row in set (1.57 sec)
Read 74142 rows, 0.00 B in 1.486 sec., 49.89 thousand rows/sec., 0.00 B/sec.

# like cold run
mysql> select count(*) from pmc where body like '%Vellore Institute%';
+----------+
| count(*) |
+----------+
|       25 |
+----------+
1 row in set (31.32 sec)
Read 574199 rows, 21.31 GiB in 31.228 sec., 18.39 thousand rows/sec., 698.86 MiB/sec.

# like hot run
mysql> select count(*) from pmc where body like '%Vellore Institute%';
+----------+
| count(*) |
+----------+
|       25 |
+----------+
1 row in set (31.56 sec)
Read 574199 rows, 21.31 GiB in 31.472 sec., 18.24 thousand rows/sec., 693.44 MiB/sec.

# match cold run
mysql> select count(*) from pmc where match (body, 'Introduction');
+----------+
| count(*) |
+----------+
|   317491 |
+----------+
1 row in set (2.46 sec)
Read 574199 rows, 0.00 B in 2.365 sec., 242.84 thousand rows/sec., 0.00 B/sec.

# match hot run
# use lower case word to ignore pruner cache
mysql> select count(*) from pmc where match (body, 'introduction');
+----------+
| count(*) |
+----------+
|   317491 |
+----------+
1 row in set (1.59 sec)
Read 574199 rows, 0.00 B in 1.490 sec., 385.39 thousand rows/sec., 0.00 B/sec.

# like cold run
mysql> select count(*) from pmc where body like '%Introduction%';
+----------+
| count(*) |
+----------+
|   260654 |
+----------+
1 row in set (29.63 sec)
Read 574199 rows, 21.31 GiB in 29.537 sec., 19.44 thousand rows/sec., 738.88 MiB/sec.

# like hot run
mysql> select count(*) from pmc where body like '%Introduction%';
+----------+
| count(*) |
+----------+
|   260654 |
+----------+
1 row in set (30.15 sec)
Read 574199 rows, 21.31 GiB in 30.058 sec., 19.1 thousand rows/sec., 726.06 MiB/sec.

Other changes

  • TableIndex meta adds the fields version, options and refreshed_on.
  • Remove the index_info_locations field from Snapshot.
  • Add an indexes field to Snapshot to record the indexed segments of each index.
  • Add inverted index pruning metrics.
  • Add an option so users can specify a tokenizer when creating an inverted index.

part of #14825

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):

@b41sh b41sh requested a review from sundy-li April 1, 2024 18:25
@b41sh b41sh requested a review from drmingdrmer as a code owner April 1, 2024 18:25
@github-actions github-actions bot added the pr-refactor this PR changes the code base without new features or bugfix label Apr 1, 2024
@BohuTANG BohuTANG added the ci-cloud Build docker image for cloud test label Apr 2, 2024

github-actions bot commented Apr 2, 2024

Docker Image for PR

  • tag: pr-15150-8bc2f1b

note: this image tag is only available for internal use,
please check the internal doc for more details.

@BohuTANG BohuTANG requested a review from dantengsky April 2, 2024 05:48

BohuTANG commented Apr 3, 2024


b41sh commented Apr 3, 2024

Does this PR affect the memory usage of the plan? https://github.com/datafuselabs/databend/actions/runs/8531541115/job/23373653532?pr=15150#step:4:202

Yes, PushDown adds a new field index_version.


BohuTANG commented Apr 3, 2024

There are some tests that need to be fixed:

  1. unit test
Differences (-left|+right):
 | 'test-node' | 'bloom_index_filter_cache'       | 0        | 0        |
 | 'test-node' | 'bloom_index_meta_cache'         | 0        | 0        |
 | 'test-node' | 'file_meta_data_cache'           | 0        | 0        |
 | 'test-node' | 'inverted_index_filter_cache'    | 0        | 0        |
-| 'test-node' | 'inverted_index_info_cache'      | 0        | 0        |
 | 'test-node' | 'prune_partitions_cache'         | 0        | 0        |
 | 'test-node' | 'segment_info_cache'             | 0        | 0        |
 | 'test-node' | 'table_snapshot_cache'           | 0        | 0        |
 | 'test-node' | 'table_snapshot_statistic_cache' | 0        | 0        |

https://github.com/datafuselabs/databend/actions/runs/8533209730/job/23375577396?pr=15150#step:4:4210

  2. sql logic test
[Diff] (-expected|+actual)
    Memo
Error: SelfError("sqllogictest failed")
    ├── root group: #5
-   ├── estimated memory: 6032 bytes
+   ├── estimated memory: 6240 bytes

https://github.com/datafuselabs/databend/actions/runs/8533209730/job/23375815590?pr=15150#step:4:204

@BohuTANG BohuTANG merged commit fecea50 into datafuselabs:main Apr 3, 2024
72 checks passed
Labels
ci-cloud Build docker image for cloud test pr-refactor this PR changes the code base without new features or bugfix

5 participants