
Try out a tantivy's term dictionary format #12513

Open
Tony-X opened this issue Aug 17, 2023 · 16 comments

@Tony-X
Contributor

Tony-X commented Aug 17, 2023

Description

Hello!

I've been working on a benchmark for a while to compare the features and performance of Lucene and Tantivy, a Rust search engine library heavily inspired by Lucene.

The benchmark uses the corpus and queries from luceneutil (the framework for Lucene nightly bench). Since not all query types are supported by Tantivy, it currently focuses on Term/Boolean/PhraseQuery. Tantivy has generally shown performance advantages so far, and I got motivated to understand why.

I documented the two engines' inverted index implementations per my understanding; here is the wiki. Specifically, both engines use an FST to aid term lookup, but the ways they use it are quite different. In summary, Lucene uses the FST to map term prefixes and then scans the on-disk blocks of terms, while Tantivy uses the FST to map every term to its ordinal and uses that ordinal to decode at most one full block.

The proposal here is to try Tantivy's term dictionary, in which I can see some advantages:

  1. it can determine that a term does not exist with FST operations alone.
  2. it decodes fewer terms in the worst case (a term within a large gap between two prefixes).
  3. it is simpler? (might be subjective, but it took me days to digest Lucene90BlockTreeTermsWriter and I'm still not sure I got every bit correct...)

What do you think?

@Tony-X changed the title from "Try out a simpler term dictionary" to "Try out a tantivy's term dictionary format" on Aug 17, 2023
@Tony-X
Contributor Author

Tony-X commented Aug 17, 2023

Copy-pasting the relevant wiki section about Tantivy's TermDictionary:


tantivy TermDictionary

Tantivy's term dictionary has two components: an FST that encodes term -> term_ordinal (u64), where the term ordinal is the index of the term in sorted order, and the TermInfoStore, which maps the ordinal to the postings metadata. Conceptually, the TermInfoStore is just a big vector of postings metadata ordered by term ordinal. Internally, it applies additional compression but still offers random access.

TermInfoStore encodings



                                 │
        Metadata Section         │                  Blocks
  ┌───┬───┬──────────────┬───┬───┼───────────┬──────────────────┬─────────┐
  │   │   │              │   │   │           │                  │         │
  │   │   │              │   │   │           │                  │         │
  │ 1 │   │   ... ...    │n-1│ n │  block 1  │     ... ...      │ block n │
  │   │   │              │   │   │           │                  │         │
  └───┴───┴──────────────┴───┴───┼───────────┴──────────────────┴─────────┘
                                 │
        n fixed-size records      │
                                 │

The metadata record contains all information needed to decode the corresponding block.

struct TermInfoBlockMeta {
    offset: u64,
    ref_term_info: TermInfo,
    doc_freq_nbits: u8,
    postings_offset_nbits: u8,
    positions_offset_nbits: u8,
}

pub struct TermInfo {
    /// Number of documents in the segment containing the term
    pub doc_freq: u32,
    /// Byte range of the posting list within the postings (`.idx`) file.
    pub postings_range: Range<usize>,
    /// Byte range of the positions of this term in the positions (`.pos`) file.
    pub positions_range: Range<usize>,
}
  • offset: the start offset of the data block in the block section
  • ref_term_info: the reference TermInfo on top of which the delta is applied
  • *_nbits: the bit-width used to bit-unpack freq and postings/position offsets in the data block.

At search time, both the FST and the TermInfoStore are loaded into memory. To search for a term, it first consults the FST to get back the term ordinal, if the term exists. Note that the FST contains all the terms, so the lookup alone can tell whether a term does not exist. The term ordinal then determines which metadata record to read (ordinal divided by the fixed block size) and the term's relative offset within the data block (ordinal modulo the block size). Decoding the metadata record locates the data block, and the TermInfo (postings metadata) is decoded from there.
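A rough sketch of that addressing math, assuming a fixed block size of 256 terms (the constant and function names are illustrative, not Tantivy's actual code):

const BLOCK_SIZE: u64 = 256; // terms per block (assumed)

/// Which fixed-size metadata record covers a given term ordinal.
fn metadata_record_index(term_ord: u64) -> u64 {
    term_ord / BLOCK_SIZE
}

/// Bit offset of a term's record inside its data block. Every record in a
/// block shares the same bit-packed width, so this is pure arithmetic.
fn record_bit_offset(meta: &TermInfoBlockMeta, term_ord: u64) -> u64 {
    let record_nbits = meta.doc_freq_nbits as u64
        + meta.postings_offset_nbits as u64
        + meta.positions_offset_nbits as u64;
    (term_ord % BLOCK_SIZE) * record_nbits
}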

@mikemccand
Member

it took me days to digest Lucene90BlockTreeTermsWriter and I'm still not sure I got every bit correct

Sorry! This is my fault :) It'd be awesome to simplify the block-tree terms dictionary if you have ideas ... it is truly hairy. Yet it is fast, compactish, and paged-memory friendly (hot stuff is localized together so the OS can cache it; large pages of cold stuff can be mostly left on disk, for indices that do not entirely fit in RAM).

Also, thank you @Tony-X for creating the open-source-licensed (ASL2) combined Tantivy/Lucene benchmark in your repo, enabling us to isolate and understand the performance and functional differences.

This has already led to some nice cross-fertilization gains in Lucene, such as optimizing count() for disjunctive queries -- see the new nightly chart for count(OrHighHigh) -- thank you @jpountz and @fulmicoton (Tantivy's creator!) for the idea. The added cost of G1GC memory barriers (a separate issue; a 4.9% latency hit to AndHighHigh; thanks @slow-J and @uschindler for suggesting we test/isolate GC effects) was surprising to me.

+1 to explore a terms dictionary format similar to Tantivy's. I think the experimental (no backwards compatibility!) FSTPostingsFormat is close? It holds all terms in a single FST (for each segment) and maps each term to a byte[] blob holding all of its metadata (corpus statistics, maybe a pulsed posting if the term appears in only one doc, else pointers into the .doc/.pos/etc. postings files). To match Tantivy's approach we would change that to dereference through a long ordinal (there can be > 2.1 B terms in one segment) instead of inlining all metadata in a single byte[], so that the FST only stores this ordinal and the term metadata is looked up in a different data structure? But that FST can get quite large and take quite a bit of time to create during indexing, though FSTs are off-heap now, so perhaps letting the OS decide the hot vs. warm pages will be fine at search time. Term-dictionary-heavy queries, e.g. FuzzyQuery or RegexpQuery, might become faster? Maybe this eventually becomes Lucene's default terms dictionary!

@Tony-X
Contributor Author

Tony-X commented Aug 18, 2023

Thanks @mikemccand for bringing in the context. I should've done that part better :)

FSTPostingsFormat is close? It holds all terms in a single FST (for each segment) and maps each term to a byte[] blob holding all of its metadata (corpus statistics, maybe a pulsed posting if the term appears in only one doc, else pointers into the .doc/.pos/etc. postings files).

Yes, I actually tried FSTPostingsFormat in the benchmark game, and I had to increase the heap size from 4g to 32g to work around the on-heap memory demand. Search-wise, the performance got slightly worse. So I set out to dig deeper and realized what you pointed out -- the FST maps each term to a byte[] blob (the postings format's term metadata). I have not gone through the full details of the paper that underpins the FSTCompiler implementation, but I believe mapping to 8-byte, monotonically increasing ordinals is much easier than mapping to variable-length, unordered byte[] blobs. Also, compression-wise, the FST may do a great job compressing the keys, but not so for the blobs.

But, that FST can get quite large, and take quite a bit of time to create during indexing

I think if we move the values out of the FST we could balance the size. Time-wise, I'm not sure; hopefully the simplified value space makes building the FST easier. This requires some experimentation.

Term dictionary heavy queries, e.g. FuzzyQuery or RegexpQuery, might become faster? Maybe this eventually becomes Lucene's default terms dictionary!

Yes, this can be very promising :) The fact that it is an FST containing all the terms makes it efficient to skip non-existent terms.

@Tony-X
Contributor Author

Tony-X commented Aug 31, 2023

I'd like to seek some advice regarding the situation I am in --

I want to preserve the nice properties of Tantivy's term dictionary as I port it over to Lucene:

  1. definitive term lookup; no additional scan is required for a non-existent term after the FST lookup.
  2. random access to term information given an ordinal; again, no additional scan.

(2) is possible because, after reading the metadata record, the reader can determine the record size in the corresponding data block, so it knows the term's data starts at data_block_offset + (term_ord % block_size) * record_size. This is the benefit of having a fixed record size per term within a block. However, a few things in the way the current Lucene90PostingsFormat encodes its TermState make this challenging:

  1. docid pulsing -- when a term has only a single document associated with it, we don't write any postings but instead reuse docStartFP in the term's metadata to record that singleton docid.
  2. skipOffset -- only set when the term has >= 128 docs.
  3. lastPosBlockOffset -- similarly, this optionally indicates whether there is a VInt-encoded remainder block of positions.

One way I can think of to solve this is to change the postings format so that each postings/positions record is self-descriptive, e.g. adding a small header per postings list to store the singleton docid and the skip offset, as well as the number of packed blocks before the positions data starts (sketched below). But this would require changing the postings reader/writer.

I wonder if there are smarter ways (compared to storing everything at full width) to achieve the fixed-size record, preferably without changing the postings format.
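For illustration, a hypothetical header along those lines -- the field names and layout are my own, not Lucene's actual format:

/// Hypothetical per-postings-list header that absorbs the optional fields,
/// leaving a fixed-width record in the term dictionary.
struct PostingsListHeader {
    singleton_doc_id: i32,      // the lone docid when doc_freq == 1, else -1
    skip_offset: u64,           // 0 when doc_freq < 128 (no skip data written)
    num_packed_pos_blocks: u32, // packed position blocks before the VInt remainder
}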

@Tony-X
Contributor Author

Tony-X commented Sep 3, 2023

Now that I think about it, I don't think we really need to store lastPosBlockOffset in the term metadata, because we can instead use totalTermFreq to track whether we are reading the last block of positions.

I opened #12536, which should be an improvement without trading off anything.
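A minimal sketch of that check, assuming Lucene's fixed 128-position packed blocks:

const POS_BLOCK_SIZE: u64 = 128; // positions per packed block

/// Number of full packed position blocks for this term.
fn num_packed_position_blocks(total_term_freq: u64) -> u64 {
    total_term_freq / POS_BLOCK_SIZE
}

/// Whether a trailing VInt-encoded remainder block of positions exists,
/// derived from totalTermFreq alone -- no stored lastPosBlockOffset needed.
fn has_vint_remainder_block(total_term_freq: u64) -> bool {
    total_term_freq % POS_BLOCK_SIZE != 0
}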

@mikemccand
Member

Yes, I actually tried FSTPostingsFormat in the benchmark game, and I had to increase the heap size from 4g to 32g to work around the on-heap memory demand.

Do you know whether Tantivy produces a truly minimal FST? Doing so (which Lucene does by default, I think!) requires a big hash map during indexing to keep track of already-seen suffixes and share them, yielding a minimal FST that is like a combined prefix and suffix trie. If you disable this entirely, the FST becomes a prefix trie. You can make Lucene NOT try so hard to produce a minimal FST by increasing minSuffixCount1 and minSuffixCount2 in FSTCompiler.Builder from 0 to 2 or 3 -- this should make FST compilation faster and less RAM-hungry, and increase the FST's size somewhat (perhaps negligibly in practice; maybe we should change Lucene90PostingsWriter's defaults here?).

Maybe we could try a standalone test to build an FST with both Tantivy and Lucene and compare the byte size? I'll open an issue over in search-benchmark-game!

Search-wise, the performance got slightly worse.

This is curious -- I would expect the terms dictionary lookup cost to be unimportant for queries that visit many hits: it should be dwarfed by the cost of iterating the postings. Did you see the slowdown only for terms-dictionary-intensive queries? We should really test e.g. FuzzyQuery or RegexpQuery, or maybe TermInSetQuery with a large set of primary-key values, to suss out the impact here. Or maybe the performance drop was under the noise floor of the benchy...?

Term dictionary heavy queries, e.g. FuzzyQuery or RegexpQuery, might become faster? Maybe this eventually becomes Lucene's default terms dictionary!

Yes, this can be very promising :) The fact that it is FST and contains all terms makes it efficient to skip no-existent terms.

+1, this is an exciting exploration!

Note that there are certain cases (perhaps rare in practice, not sure) where even Lucene's "prefix" FST can skip a segment without checking the on-disk terms blocks. It happens when the prefix of a given term never occurs in any other term in that segment, which might be frequent if, say, documents are indexed in sorted order by their primary keys. This would cause certain "dense" regions of the primary-key space to be indexed into each segment, and might then mean on lookup that the prefix FST can know a given PK cannot occur in the segment without even scanning the on-disk blocks.

@mikemccand
Member

  2. random access to term information given an ordinal; again, no additional scan.

Hmm, indeed this would require a fixed width for every term's metadata record.

Does Tantivy do pulsing (inlining the postings for singleton terms into the terms dictionary)?

Another option might be for each term block to have some sort of header array that quickly maps a term ordinal (within the block) to its corresponding file pointer location (sketched below)?

Or perhaps we keep the scanning within a block when looking up an ord within that block? The FST would still be definitive about whether a term exists, but when it does exist we would still need to do some scanning.

Or, for starters, just make all term metadata fixed width (yes, wasting bytes for the terms that don't need the extra stuff). It'd be a start just to simplify playing with this idea, which we could then iterate on?
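To illustrate the header-array option above (a hypothetical layout, not an existing Lucene structure):

/// Each term block starts with an offsets array, so a within-block ordinal
/// resolves to its metadata bytes without scanning.
struct TermBlock {
    offsets: Vec<u32>, // offsets[i] = start of term i's metadata in `payload`
    payload: Vec<u8>,  // variable-width metadata records, back to back
}

impl TermBlock {
    fn metadata(&self, within_block_ord: usize) -> &[u8] {
        let start = self.offsets[within_block_ord] as usize;
        let end = self
            .offsets
            .get(within_block_ord + 1)
            .map(|&o| o as usize)
            .unwrap_or(self.payload.len());
        &self.payload[start..end]
    }
}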

I have not gone through the full details of the paper that underpins the FSTCompiler implementation, but I believe mapping to 8-byte, monotonically increasing ordinals is much easier than mapping to variable-length, unordered byte[] blobs.

Yeah -- this is very true! An FST is very efficient at encoding monotonically increasing int/long outputs, much more so than semi-random byte[] blobs that don't often share prefixes. Also, with a monotonically increasing output, reverse lookup (ord -> term) becomes possible! I think FSTOrdPostingsFormat does that (it implements the optional TermsEnum.seekExact(long ord) API)?

@fulmicoton

Does Tantivy do pulsing (inlining postings for a singleton terms into the terms dictionary)?

No but we should. It has been on my task list for a long time.

@Tony-X
Contributor Author

Tony-X commented Sep 5, 2023

Do you know whether Tantivy is producing a truly minimal FST?

Maybe @fulmicoton can shed more light on this topic :)

A related question: can Tantivy read a Lucene-built FST, and vice versa?

@mikemccand
Member

I'm poking around trying to understand Tantivy's FST implementation, and found that it was originally forked from this FST implementation into this Tantivy-specific version (which seems to have fallen behind on merging the upstream changes?).

There is a wonderful blog post describing it. Now I want to try building a Lucene FST from that giant Common Crawl corpus -- 1.6 B URLs!

Some clear initial differences from Lucene's implementation:

  • The original fst package (linked above) can build Levenshtein FSTs! Lucene can build Levenshtein Automata, but not FSTs.
  • It can also search FSTs using regexps! Lucene can do that with Automaton, but not FSTs.
  • Generally, the Rust FST implementation does a stronger job of unifying Automata and FSTs, whereas in Lucene these are strongly divorced classes despite clear overlapping functionality.
  • Building the FST looks crazy fast compared to Lucene -- I'm really curious how it works :) Specifically, how the suffixes are shared -- this uses tons of RAM in Lucene to ensure a precisely minimal FST.

@mikemccand
Member

How the blog post author models his cat reminds me of how I modeled the scoring of a single tennis game as an FSA, and uncovered that there is absolutely no difference between deuce and 30-all if you just want to know who won the game!

If there are any tennis players reading this, you can save a wee bit of brain state when tracking the score: just announce deuce once you get to 30-all and never say 30-all again!

@mikemccand
Member

Aha! This is an interesting approach:

It is possible to mitigate the onerous memory required by sacrificing
guaranteed minimality of the resulting FST. Namely, one can maintain a
hash table that is bounded in size. This means that commonly reused
states are kept in the hash table while less commonly reused states
are evicted. In practice, a hash table with about 10,000 slots
achieves a decent compromise and closely approximates minimality in my
own unscientific experiments. (The actual implementation does a little
better and stores a small LRU cache in each slot, so that if two
common but distinct nodes map to the same bucket, they can still be
reused.)

Lucene can also bound the size of this "suffix hashmap" using the crazy cryptic minSuffixCount1, minSuffixCount2, and shareMaxTailLength parameters, but these are a poor (and basically unintelligible!) way to bound the map versus what the Tantivy FST does with this LRU-style cache. Yes, it sacrifices a truly minimal FST, but in practice it looks like it gets close enough, while massively reducing the RAM required during construction. I'll open a spinoff issue for this...
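A toy sketch of that bounded registry, as I read the blog post (names and sizes are illustrative; a real implementation also compares node contents, not just hashes, so distinct nodes are never aliased):

const NUM_SLOTS: usize = 10_000;   // bounded number of hash buckets
const LRU_PER_SLOT: usize = 2;     // tiny LRU cache per bucket

struct NodeRegistry {
    // Each entry is (node hash, address of the frozen node in the output).
    slots: Vec<Vec<(u64, u64)>>,
}

impl NodeRegistry {
    fn new() -> Self {
        NodeRegistry { slots: vec![Vec::new(); NUM_SLOTS] }
    }

    /// If an equivalent node is already cached, return its address so the
    /// caller can reuse it; otherwise cache (hash, addr), evicting the least
    /// recently used entry when the slot is full, and return None.
    fn find_or_insert(&mut self, hash: u64, addr: u64) -> Option<u64> {
        let slot = &mut self.slots[(hash as usize) % NUM_SLOTS];
        if let Some(pos) = slot.iter().position(|&(h, _)| h == hash) {
            let hit = slot.remove(pos);
            slot.insert(0, hit); // move to front: most recently used
            return Some(slot[0].1);
        }
        if slot.len() == LRU_PER_SLOT {
            slot.pop(); // evict the least recently used entry
        }
        slot.insert(0, (hash, addr));
        None
    }
}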

@mikemccand
Member

The next paragraph in the blog is also very interesting!

An interesting consequence of using a bounded hash table which only
stores some of the states is that construction of an FST can be
streamed to a file on disk. Namely, when states are frozen as
described in the previous two sections, there’s no reason to keep all
of them in memory. Instead, we can immediately write them to disk (or
a socket, or whatever).

Since Lucene's FSTs are now "off-heap", we could take a similar approach. I think this is lower priority, but I'll open a spinoff issue for it too...

@Tony-X
Contributor Author

Tony-X commented Sep 18, 2023

I've been designing how to account for the optional state that each term may end up with, namely how to deal with the following:

  • whether a term has a singleton docid
  • whether a term has skip data
  • whether a term has a last VInt-encoded position block (only relevant when the field has positions enabled)

I came up with a divide et impera (divide and conquer) approach. The idea is to classify which of the 8 cases (2^3, as there are 3 dimensions) a term belongs to. At indexing time, for a given field, the term information within each category shares the same structure, and we can apply the RefBlock + bit-packing encoding scheme. We will still use an FST to encode the term's ordinal; however, instead of storing the global ordinal, we will store (category, ord), where ord is the ordinal within the category. This fits into a long, with 3 bits for the category and the rest for the ordinal.

At search time, to look up a term, we consult the FST to get back the category and local ordinal, then locate the data file for that category and extract the term information using the local ordinal.
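An illustrative sketch of the (category, ord) packing -- the exact bit layout is an assumption:

const ORD_BITS: u32 = 61; // 64 - 3 category bits
const ORD_MASK: u64 = (1u64 << ORD_BITS) - 1;

/// Pack a 3-bit category (2^3 = 8 combinations of the optional features)
/// and a 61-bit local ordinal into the single long stored in the FST.
fn pack(category: u8, local_ord: u64) -> u64 {
    debug_assert!(category < 8 && local_ord <= ORD_MASK);
    ((category as u64) << ORD_BITS) | local_ord
}

fn unpack(encoded: u64) -> (u8, u64) {
    ((encoded >> ORD_BITS) as u8, encoded & ORD_MASK)
}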

Of course, I still need to handle multiple fields, etc., and there are details like how to organize the files. Beyond that, I believe this scheme can work out. In particular, it has a few nice properties:

  • the FST can be definitive about whether a term exists or not
  • after the FST lookup, locating the term information takes only two random-access seeks

On the other hand, it might not compress as well as it potentially could, especially for monotonically increasing values such as postings start offsets: nearby terms (by global ordinal) may be spread across different categories, losing a little locality.

@dungba88
Contributor

I'm still consuming this thread, so pardon me if I ask something that's already been discussed.

Yes, I actually tried FSTPostingsFormat in the benchmark game, and I had to increase the heap size from 4g to 32g to work around the on-heap memory demand

Are you referring to the FSTTermsWriter case for building, since FSTTermsReader is already off-heap? If so, the recent change to stream the FST to disk while building could help: #12624. We can plug the IndexOutput already created in FSTTermsWriter into the FieldMetadata.FSTCompiler. One catch is that we need another IndexOutput for storing the FST metadata, as it's not possible to write the metadata into the same IndexOutput as the main FST body :(

I think we can even do this as a separate PR? I could look into it as part of #12902. Let me know if there are any other places that should be doing this.

Yes, this can be very promising :) The fact that it is an FST containing all the terms makes it efficient to skip non-existent terms.

Sounds exciting. I imagine we could drop an entire clause with a single FST lookup?

@dungba88
Contributor

Are you referring to the FSTTermsWriter case for building

Ah nvm, I see you already created a brand-new writer. But that random-access terms dictionary writer could be written off-heap as well. I added a comment on the PR.
