
refactor: generate inverted indexes for each block #15150

Merged
merged 20 commits into from Apr 3, 2024

Conversation


@b41sh b41sh commented Apr 1, 2024

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

In the previous PR #14997, we implemented the inverted index so that one index file corresponds to the data of multiple blocks. In testing, we found that this design leads to large index files and longer read times, which hurts performance. This PR therefore reworks the index structure; the detailed design is as follows.

Index Design

We generate one index file for each block. When filtering, the index file is used to determine which rows in the block match the query keywords; if no row matches, the block can be pruned, as shown in the following figure:

 ┌────────┐     ┌────────┐   ┌────────┐     ┌────────┐
 │ Index1 │ ... │ IndexM │   │ IndexN │ ... │ IndexZ │
 └────────┘     └────────┘   └────────┘     └────────┘
     |              |            |              |
     |              |            |              |
 ┌────────┐     ┌────────┐   ┌────────┐     ┌────────┐
 │ Block1 │ ... │ BlockM │   │ BlockN │ ... │ BlockZ │
 └────────┘     └────────┘   └────────┘     └────────┘
  \                     /     \                     /
   \          _________/       \          _________/
    \        /                  \        /
     Segment1           ...      SegmentN

Compared with the previous design, this has the following benefits:

  1. There is no need to maintain separate index info metadata, which simplifies the snapshot design.
  2. Index data can be generated and read in parallel, speeding up both indexing and querying.
  3. Inverted index filtering can run after bloom filtering and range filtering, so part of the index does not need to be read at all.

Index file structure

The index data generated by Tantivy is stored in a directory containing several files; we merge those files into a single file to use as the index data. The index file contains the following parts:

┌──────────────┐
│ fast fields  │
├──────────────┤
│ store        │
├──────────────┤
│ field norms  │
├──────────────┤
│ positions    │
├──────────────┤
│ postings     │
├──────────────┤
│ terms        │
├──────────────┤
│ meta.json    │
├──────────────┤
│ managed.json │
├──────────────┤
│ offsets      │
└──────────────┘
  • terms stores the term dictionary as an FST (Finite State Transducer). The key is the term after tokenization; the value is an address into the postings and the positions.
  • postings stores the lists of document ids and term frequencies.
  • positions stores the positions of terms in each document.
  • field norms stores the total length (number of terms) of each field in each document.
  • fast fields stores column-oriented documents (not used).
  • store stores row-oriented documents (not used).
  • meta.json stores the meta information associated with the index, for example:
{
  "index_settings": {
    "docstore_compression": "lz4",
    "docstore_blocksize": 16384
  },
  "segments":[{
    "segment_id": "94bce521-d5bc-4ecc-bf3e-7a0212093622",
    "max_doc": 6,
    "deletes": null
  }],
  "schema":[{
    "name": "title",
    "type": "text",
    "options": {
      "indexing": {
        "record": "position",
        "fieldnorms": true,
        "tokenizer": "en"
      },
      "stored": false,
      "fast": false
    }
  }],
  "opstamp":0
}
  • managed.json stores the names of the files that the index contains, for example:
[
  "94bce521d5bc4eccbf3e7a0212093622.pos",
  "94bce521d5bc4eccbf3e7a0212093622.fieldnorm",
  "94bce521d5bc4eccbf3e7a0212093622.fast",
  "meta.json",
  "94bce521d5bc4eccbf3e7a0212093622.term",
  "94bce521d5bc4eccbf3e7a0212093622.store",
  "94bce521d5bc4eccbf3e7a0212093622.idx"
]
  • offsets stores the offset of each part in the index file.
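The single-file layout above can be sketched as follows. This is a hypothetical illustration, not Databend's actual on-disk format: each part is concatenated into one buffer, and a footer records the end offset of each part so a reader can slice out any part without extra metadata.

```rust
use std::io::Write;

// Merge named index parts into one buffer and append an "offsets" footer.
// Footer layout (illustrative): newline-separated "name:end_offset" entries,
// followed by the footer length as a little-endian u32.
fn merge_parts(parts: &[(&str, &[u8])]) -> Vec<u8> {
    let mut buf = Vec::new();
    let mut offsets = Vec::new();
    for (name, data) in parts {
        // Writing to a Vec<u8> cannot fail, so unwrap is safe here.
        buf.write_all(data).unwrap();
        // Record where this part ends in the merged buffer.
        offsets.push(format!("{}:{}", name, buf.len()));
    }
    let footer = offsets.join("\n");
    let footer_len = footer.len() as u32;
    buf.extend_from_slice(footer.as_bytes());
    buf.extend_from_slice(&footer_len.to_le_bytes());
    buf
}
```

A reader would first read the trailing 4 bytes to find the footer, then use the recorded offsets to locate any part (e.g. only `terms` and `postings`) without scanning the whole file.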

When querying, we first read terms to get the addresses of the postings and positions. postings contains the matched rows and the frequency of the term in each row; together with field norms, the frequency is used to calculate a match score for each row. If the query keyword is a phrase, we also use the positions of each word to determine whether the matched terms actually form the phrase.
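The scoring and phrase-matching steps above can be sketched as follows. The BM25-style formula and the function names are assumptions for illustration only; Tantivy's internal scoring may differ in detail.

```rust
// BM25-style score for one term in one row: `tf` comes from postings,
// `field_len` from field norms. Constants k1/b are conventional defaults.
fn bm25_term_score(tf: f64, field_len: f64, avg_field_len: f64, doc_count: f64, doc_freq: f64) -> f64 {
    let (k1, b) = (1.2, 0.75);
    // Rarer terms (low doc_freq) contribute a higher inverse document frequency.
    let idf = ((doc_count - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0).ln();
    // Term frequency is saturated and normalized by field length.
    idf * (tf * (k1 + 1.0)) / (tf + k1 * (1.0 - b + b * field_len / avg_field_len))
}

// Phrase check for a two-word phrase "w1 w2": the words form the phrase in a
// row if some position p of w1 is immediately followed by a position p + 1 of w2.
fn is_phrase(pos_w1: &[u32], pos_w2: &[u32]) -> bool {
    pos_w1.iter().any(|p| pos_w2.contains(&(p + 1)))
}
```

For example, a row where "vellore" appears at position 9 and "institute" at position 10 matches the phrase query, while the two words appearing far apart does not.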

Block pruner

Databend supports several kinds of pruners, which determine whether a block can be pruned, in the following order:

  • The range pruner uses the maximum and minimum values of the fields in the block to determine whether a range query can prune the block.
  • The bloom pruner uses a bloom filter to determine whether a point query can prune the block.
  • The limit pruner uses the query's LIMIT to determine whether the block can be pruned.
  • The inverted index pruner uses the inverted index to determine whether a search query can prune the block.

If the query condition specifies more than one filter field, such as a time range, some blocks may already be pruned before the inverted index is consulted, so part of the index data does not need to be read, which speeds up the query.
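The ordered pruner chain can be sketched as follows. The trait and names are hypothetical, not Databend's actual API; the point is that cheaper pruners run first, so a block discarded early never reaches the inverted index pruner.

```rust
// A pruner decides whether a block might still contain matching rows.
trait Pruner {
    fn should_keep(&self, block_id: u64) -> bool;
}

// Apply pruners in order; a block survives only if every pruner keeps it.
// Because `all` short-circuits, later (more expensive) pruners are skipped
// for blocks already rejected by earlier ones.
fn prune(pruners: &[&dyn Pruner], blocks: &[u64]) -> Vec<u64> {
    blocks
        .iter()
        .copied()
        .filter(|b| pruners.iter().all(|p| p.should_keep(*b)))
        .collect()
}
```

In the real pipeline the slice would hold the range, bloom, limit, and inverted index pruners in that order.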

Tokenizer

Currently, we support two tokenizers, English and Chinese, to split the input text. Users can specify the tokenizer via an option when creating the index; if not specified, the default is the English tokenizer. For example:

CREATE TABLE t (id int, content string);

CREATE INVERTED INDEX IF NOT EXISTS idx1 ON t(content) tokenizer = 'english';

CREATE INVERTED INDEX IF NOT EXISTS idx2 ON t(content) tokenizer = 'chinese';
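A minimal sketch of what the English-style tokenization step does during indexing: lowercase the text and split on non-alphanumeric characters. The real 'english' and 'chinese' tokenizers are far more involved (the Chinese one needs word segmentation, since Chinese text has no spaces); this illustration is an assumption, not the actual implementation.

```rust
// Lowercase and split on non-alphanumeric characters, dropping empty tokens.
fn tokenize_en(text: &str) -> Vec<String> {
    text.to_lowercase()
        .split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_string())
        .collect()
}
```

Lowercasing at index time is what makes a search for '"Vellore INSTITUTE"' match the same rows as '"Vellore Institute"' in the benchmark below.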

Examples

Let's use pmc data as an example for testing.

The test data can be downloaded from this URL: https://s3.amazonaws.com/benchmarks.redislabs/redisearch/datasets/pmc/documents.json.bz2

mysql> create table pmc(
    ->     name string,
    ->     journal string,
    ->     date string,
    ->     volume string,
    ->     issue string,
    ->     accession string,
    ->     timestamp timestamp,
    ->     pmid string,
    ->     body string
    -> );
Query OK, 0 rows affected (0.42 sec)

mysql> copy into pmc from 'fs:///data2/b41sh/bench/documents.json' FILE_FORMAT = (type = ndjson);

+----------------------------------+-------------+-------------+-------------+------------------+
| File                             | Rows_loaded | Errors_seen | First_error | First_error_line |
+----------------------------------+-------------+-------------+-------------+------------------+
| data2/b41sh/bench/documents.json |      574199 |           0 | NULL        |             NULL |
+----------------------------------+-------------+-------------+-------------+------------------+
1 row in set (3 min 55.75 sec)
Read 574199 rows, 21.66 GiB in 235.667 sec., 2.44 thousand rows/sec., 94.11 MiB/sec.

mysql> create inverted index idx1 on pmc(name, journal, accession, body);
Query OK, 0 rows affected (0.79 sec)

mysql> refresh inverted index idx1 on pmc;
Query OK, 0 rows affected (14 min 11.57 sec)

# match cold run
mysql> select count(*) from pmc where match (body, '"Vellore Institute"');
+----------+
| count(*) |
+----------+
|       26 |
+----------+
1 row in set (2.35 sec)
Read 74142 rows, 0.00 B in 2.299 sec., 32.35 thousand rows/sec., 0.00 B/sec.

# match hot run
# use upper case word to ignore pruner cache
mysql> select count(*) from pmc where match (body, '"Vellore INSTITUTE"');
+----------+
| count(*) |
+----------+
|       26 |
+----------+
1 row in set (1.57 sec)
Read 74142 rows, 0.00 B in 1.486 sec., 49.89 thousand rows/sec., 0.00 B/sec.

# like cold run
mysql> select count(*) from pmc where body like '%Vellore Institute%';
+----------+
| count(*) |
+----------+
|       25 |
+----------+
1 row in set (31.32 sec)
Read 574199 rows, 21.31 GiB in 31.228 sec., 18.39 thousand rows/sec., 698.86 MiB/sec.

# like hot run
mysql> select count(*) from pmc where body like '%Vellore Institute%';
+----------+
| count(*) |
+----------+
|       25 |
+----------+
1 row in set (31.56 sec)
Read 574199 rows, 21.31 GiB in 31.472 sec., 18.24 thousand rows/sec., 693.44 MiB/sec.

# match cold run
mysql> select count(*) from pmc where match (body, 'Introduction');
+----------+
| count(*) |
+----------+
|   317491 |
+----------+
1 row in set (2.46 sec)
Read 574199 rows, 0.00 B in 2.365 sec., 242.84 thousand rows/sec., 0.00 B/sec.

# match hot run
# use lower case word to ignore pruner cache
mysql> select count(*) from pmc where match (body, 'introduction');
+----------+
| count(*) |
+----------+
|   317491 |
+----------+
1 row in set (1.59 sec)
Read 574199 rows, 0.00 B in 1.490 sec., 385.39 thousand rows/sec., 0.00 B/sec.

# like cold run
mysql> select count(*) from pmc where body like '%Introduction%';
+----------+
| count(*) |
+----------+
|   260654 |
+----------+
1 row in set (29.63 sec)
Read 574199 rows, 21.31 GiB in 29.537 sec., 19.44 thousand rows/sec., 738.88 MiB/sec.

# like hot run
mysql> select count(*) from pmc where body like '%Introduction%';
+----------+
| count(*) |
+----------+
|   260654 |
+----------+
1 row in set (30.15 sec)
Read 574199 rows, 21.31 GiB in 30.058 sec., 19.1 thousand rows/sec., 726.06 MiB/sec.

Other changes

  • TableIndex meta adds the fields version, options and refreshed_on.
  • Remove the index_info_locations field from Snapshot.
  • Add an indexes field to Snapshot to record the indexed segments of each index.
  • Add inverted index pruning metrics.
  • Add an option so users can specify a tokenizer when creating an inverted index.

part of #14825

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):

@b41sh b41sh requested a review from sundy-li April 1, 2024 18:25
@b41sh b41sh requested a review from drmingdrmer as a code owner April 1, 2024 18:25
@github-actions github-actions bot added the pr-refactor this PR changes the code base without new features or bugfix label Apr 1, 2024
@BohuTANG BohuTANG added the ci-cloud Build docker image for cloud test label Apr 2, 2024

github-actions bot commented Apr 2, 2024

Docker Image for PR

  • tag: pr-15150-8bc2f1b

note: this image tag is only available for internal use,
please check the internal doc for more details.

@BohuTANG BohuTANG requested a review from dantengsky April 2, 2024 05:48

BohuTANG commented Apr 3, 2024


b41sh commented Apr 3, 2024

Does this PR affect the memory usage of the plan? https://github.com/datafuselabs/databend/actions/runs/8531541115/job/23373653532?pr=15150#step:4:202

Yes, PushDown adds a new field index_version.


BohuTANG commented Apr 3, 2024

There are some tests that need to be fixed:

  1. unit test
Differences (-left|+right):
 | 'test-node' | 'bloom_index_filter_cache'       | 0        | 0        |
 | 'test-node' | 'bloom_index_meta_cache'         | 0        | 0        |
 | 'test-node' | 'file_meta_data_cache'           | 0        | 0        |
 | 'test-node' | 'inverted_index_filter_cache'    | 0        | 0        |
-| 'test-node' | 'inverted_index_info_cache'      | 0        | 0        |
 | 'test-node' | 'prune_partitions_cache'         | 0        | 0        |
 | 'test-node' | 'segment_info_cache'             | 0        | 0        |
 | 'test-node' | 'table_snapshot_cache'           | 0        | 0        |
 | 'test-node' | 'table_snapshot_statistic_cache' | 0        | 0        |

https://github.com/datafuselabs/databend/actions/runs/8533209730/job/23375577396?pr=15150#step:4:4210

  2. sql logic test
[Diff] (-expected|+actual)
    Memo
Error: SelfError("sqllogictest failed")
    ├── root group: #5
-   ├── estimated memory: 6032 bytes
+   ├── estimated memory: 6240 bytes

https://github.com/datafuselabs/databend/actions/runs/8533209730/job/23375815590?pr=15150#step:4:204

@BohuTANG BohuTANG merged commit fecea50 into datafuselabs:main Apr 3, 2024
72 checks passed
Labels
ci-cloud Build docker image for cloud test pr-refactor this PR changes the code base without new features or bugfix

5 participants