Try out Tantivy's term dictionary format #12513
Copy-pasting the relevant wiki section about Tantivy's TermDictionary: Tantivy's term dictionary has two components: an FST that maps each term to its ordinal, and a TermInfoStore. TermInfoStore encodings:
The metadata record contains all the information needed to decode the corresponding block.

```rust
struct TermInfoBlockMeta {
    offset: u64,
    ref_term_info: TermInfo,
    doc_freq_nbits: u8,
    postings_offset_nbits: u8,
    positions_offset_nbits: u8,
}

pub struct TermInfo {
    /// Number of documents in the segment containing the term.
    pub doc_freq: u32,
    /// Byte range of the posting list within the postings (`.idx`) file.
    pub postings_range: Range<usize>,
    /// Byte range of the positions of this term in the positions (`.pos`) file.
    pub positions_range: Range<usize>,
}
```
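To make the decoding concrete, here is a minimal sketch (mirroring the structs above) of how a bit-packed record could be unpacked from a block given the per-block bit widths. The LSB-first bit layout and the delta-from-reference encoding are our assumptions for illustration, not Tantivy's exact wire format; the end of each byte range would come from the next record's offsets, which we omit here.

```rust
use std::ops::Range;

// Structs mirrored from the Tantivy snippet above (simplified).
pub struct TermInfo {
    pub doc_freq: u32,
    pub postings_range: Range<usize>,
    pub positions_range: Range<usize>,
}

pub struct TermInfoBlockMeta {
    pub offset: u64,
    pub ref_term_info: TermInfo,
    pub doc_freq_nbits: u8,
    pub postings_offset_nbits: u8,
    pub positions_offset_nbits: u8,
}

// Minimal LSB-first bit packer, used here only to build example data.
pub fn pack_bits(fields: &[(u64, u8)]) -> Vec<u8> {
    let total: usize = fields.iter().map(|&(_, n)| n as usize).sum();
    let mut bytes = vec![0u8; (total + 7) / 8];
    let mut pos = 0usize;
    for &(v, n) in fields {
        for i in 0..n as usize {
            if (v >> i) & 1 == 1 {
                bytes[(pos + i) / 8] |= 1 << ((pos + i) % 8);
            }
        }
        pos += n as usize;
    }
    bytes
}

// Read `nbits` starting at bit `pos`, LSB-first.
fn read_bits(data: &[u8], pos: usize, nbits: u8) -> u64 {
    let mut out = 0u64;
    for i in 0..nbits as usize {
        let bit = (data[(pos + i) / 8] >> ((pos + i) % 8)) & 1;
        out |= (bit as u64) << i;
    }
    out
}

// Decode the `ord_in_block`-th record: every record in the block has the same
// bit width, so its position is a single multiply. Offsets are assumed to be
// stored as deltas from the block's reference TermInfo. Returns
// (doc_freq, postings_start, positions_start).
pub fn decode_record(block: &[u8], ord_in_block: usize, meta: &TermInfoBlockMeta) -> (u32, usize, usize) {
    let width = meta.doc_freq_nbits as usize
        + meta.postings_offset_nbits as usize
        + meta.positions_offset_nbits as usize;
    let mut pos = ord_in_block * width;
    let doc_freq = read_bits(block, pos, meta.doc_freq_nbits) as u32;
    pos += meta.doc_freq_nbits as usize;
    let postings_delta = read_bits(block, pos, meta.postings_offset_nbits) as usize;
    pos += meta.postings_offset_nbits as usize;
    let positions_delta = read_bits(block, pos, meta.positions_offset_nbits) as usize;
    (
        doc_freq,
        meta.ref_term_info.postings_range.start + postings_delta,
        meta.ref_term_info.positions_range.start + positions_delta,
    )
}
```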
At search time, both the FST and the TermInfoStore are consulted.
Sorry! This is my fault :) It'd be awesome to simplify the block-tree terms dictionary if you have ideas ... it is truly hairy. Yet it is fast and compact-ish and paged-memory friendly (hot stuff is localized together so the OS can cache it, while large pages of cold stuff can be mostly left on disk, for indices that do not entirely fit in RAM).

Also, thank you @Tony-X for creating the open-source-licensed (ASL2) combination of Tantivy's and Lucene's benchmarks in your repo, enabling us to isolate/understand the performance and functional differences. This has already led to some nice cross-fertilization gains in Lucene, such as optimizing

+1 to explore a terms dictionary format similar to Tantivy's. I think the experimental (no backwards compatibility!)
Thanks @mikemccand for bringing in the context. I should've done that part better :)
Yes, I actually tried to use
I think if we move the values out of the FST we could balance the size. Time-wise, I'm not sure. Hopefully the simplified value space makes building the FST easier. This requires some experimentation.
Yes, this can be very promising :) The fact that it is an FST and contains all terms makes it efficient to skip non-existent terms.
I'd like to seek some advice regarding the situation I am in -- I want to preserve the nice properties of Tantivy's termdict as I port it over to Lucene.
(2) is possible because, after reading the metadata block, the reader can determine the record size in the corresponding data block, so it knows where the term's data starts.
One way I can think of to solve them is to change the posting format to make each posting/position record self-descriptive, e.g. adding a small header per postings list in the `.pos` file to store the singleton docid and the skip_offset, as well as storing the number of packed blocks before the positions data starts. But this would require changing the posting reader/writer. I wonder if there are smarter ways (compared to storing everything at full width) to achieve fixed-size records, preferably without changing Postings.
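To spell out why fixed-size records matter here: once every record in a block has the same bit width, a term's ordinal maps to an exact bit address with one divide and one multiply. A tiny sketch (the block size and names are our own assumptions, not Tantivy's or Lucene's actual layout):

```rust
// Assumed number of terms per block, for illustration only.
const TERMS_PER_BLOCK: u64 = 256;

/// Map a global term ordinal to (block_id, bit offset of the record inside
/// that block's data region), given a fixed per-record bit width.
fn locate(term_ord: u64, record_width_bits: u64) -> (u64, u64) {
    let block_id = term_ord / TERMS_PER_BLOCK;
    let ord_in_block = term_ord % TERMS_PER_BLOCK;
    (block_id, ord_in_block * record_width_bits)
}
```

With variable-width records this direct addressing is impossible, which is exactly why self-descriptive (or full-width) records come up above.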
Now that I think about it, I don't think we really need to store

I opened #12536, which should be an improvement without trading off anything.
Do you know whether Tantivy produces a truly minimal FST? Doing so (which Lucene does by default, I think!) requires a big hash map during indexing to keep track of already-seen suffixes and share them, making a minimal FST that is like a combined prefix and suffix trie. If you disable this entirely, the FST becomes a prefix trie. You can play with Lucene NOT trying so hard to make a minimal FST by increasing the

Maybe we could try a standalone test to build an FST with both Tantivy and Lucene and compare the byte sizes? I'll open an issue over in
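As a toy illustration of what that suffix-sharing hash map buys (not Lucene's or Tantivy's actual FST code -- the structure and names here are ours), this sketch counts the states of a plain prefix trie versus an automaton where identical suffix sub-trees are hash-consed into a single state:

```rust
use std::collections::HashMap;

// Signature of a node: its final flag plus its outgoing (byte, child id) edges.
#[derive(Hash, PartialEq, Eq)]
struct Sig {
    is_final: bool,
    edges: Vec<(u8, usize)>,
}

// Partition sorted, unique `terms` by their byte at `depth`.
fn partition<'a>(terms: &[&'a str], depth: usize) -> Vec<(u8, Vec<&'a str>)> {
    let mut out: Vec<(u8, Vec<&'a str>)> = Vec::new();
    for &t in terms {
        if t.len() > depth {
            let b = t.as_bytes()[depth];
            match out.last_mut() {
                Some((lb, group)) if *lb == b => group.push(t),
                _ => out.push((b, vec![t])),
            }
        }
    }
    out
}

// Number of nodes in a plain prefix trie (no suffix sharing).
fn trie_states(terms: &[&str], depth: usize) -> usize {
    1 + partition(terms, depth)
        .iter()
        .map(|(_, g)| trie_states(g, depth + 1))
        .sum::<usize>()
}

// Number of states after sharing identical suffix sub-trees via hash-consing:
// this is what the "big hash map during indexing" buys you.
fn minimal_states(terms: &[&str]) -> usize {
    fn build(terms: &[&str], depth: usize, ids: &mut HashMap<Sig, usize>) -> usize {
        let edges: Vec<(u8, usize)> = partition(terms, depth)
            .iter()
            .map(|(b, g)| (*b, build(g, depth + 1, ids)))
            .collect();
        let sig = Sig { is_final: terms.iter().any(|t| t.len() == depth), edges };
        let next = ids.len();
        *ids.entry(sig).or_insert(next)
    }
    let mut ids = HashMap::new();
    build(terms, 0, &mut ids);
    ids.len()
}
```

For `["dog", "dogs", "log", "logs"]` the trie has 9 nodes while the suffix-shared automaton needs only 5 states, since the `d`/`l` branches have identical suffix sub-trees.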
This is curious -- I would expect the terms dict lookup cost to be unimportant for queries that visit many hits. The terms dict cost should be dwarfed by the cost of iterating the postings. Did you see the slowdown only for terms dict intensive queries? We should really test e.g.
+1, this is an exciting exploration! Note that there are certain cases (perhaps rare in practice, not sure) where even Lucene's "prefix" FST can skip a segment without checking the on-disk terms blocks. It happens when even the prefix for a given term never occurs in any other term in that segment, which might be frequent if, say, documents are indexed in sorted order by their primary keys. This would cause certain "dense" regions of primary-key space to be indexed into each segment, and might then mean that on lookup the prefix FST can know that a given PK cannot occur in the segment without even scanning the on-disk blocks.
Hmm, indeed this would require a fixed block size for every term's metadata. Does Tantivy do pulsing (inlining postings for singleton terms into the terms dictionary)?

Another option might be to have each term block carry some sort of header array to quickly map a term ordinal (within the block) to its corresponding file pointer location. Or perhaps we keep the scanning within a block when looking for an ord within that block? The FST would still be definitive about whether a term exists or not, but when it does exist, we would still need to do some scanning.

Or, for starters, just make all term metadata fixed width (yes, wasting bytes for those terms that don't need the extra stuff). It'd be a start just to simplify playing with this idea, which we could then iterate from?
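To put rough numbers on the "just make it fixed width" fallback (illustrative figures only -- the field widths and the average packed width are assumptions, not either engine's actual layout):

```rust
// Full-width records: doc_freq as u32 plus two u64 file offsets per term.
// Lookup becomes a single multiply (ord * 20 bytes), at the cost of slack bits.
fn fixed_width_bytes(num_terms: u64) -> u64 {
    num_terms * (4 + 8 + 8)
}

// Bit-packed records at some assumed average width per term.
fn bit_packed_bytes(num_terms: u64, avg_record_bits: u64) -> u64 {
    (num_terms * avg_record_bits + 7) / 8
}
```

E.g. at an assumed 40 packed bits per term, a million terms cost ~5 MB packed versus 20 MB full-width -- a 4x size gap, which is the trade against the simpler addressing.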
Yeah -- this is very true! FSTs are very efficient at encoding monotonically increasing int/long outputs, much more so than semi-random-looking
No, but we should. It has been on my task list for a long time.
Maybe @fulmicoton can shed more light on this topic :) A related question: can Tantivy read a Lucene-built FST, and vice versa?
I'm poking around trying to understand Tantivy's FST implementation, and found it was originally forked from this FST implementation into this Tantivy-specific version (which seems to have fallen behind on merging the upstream changes?). There is a wonderful blog post describing it. Now I want to try building a Lucene FST from that giant Common Crawl corpus -- 1.6 B URLs! Some clear initial differences from Lucene's implementation:
How the blog post's author models his cat reminds me of how I modeled the scoring of a single tennis game as an FSA, and uncovered that there is absolutely no difference between

If there are any tennis players reading this, you can save a wee bit of brain state when tracking the score! Just announce
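For fun, the tennis automaton is easy to sketch and check mechanically. The state names below are ours, and the equivalences asserted (30-30 behaving exactly like deuce, and 40-30 like advantage) are well-known properties of tennis scoring, offered here purely as an illustration of FSA state equivalence:

```rust
#[derive(Clone, Copy, PartialEq)]
enum Game {
    Points(u8, u8), // points per side, 0..=3 (i.e. 0/15/30/40), before deuce
    Deuce,
    AdvA,
    AdvB,
    WinA,
    WinB,
}

// One point played: `a_wins_point` says which side took it.
fn step(s: Game, a_wins_point: bool) -> Game {
    use Game::*;
    match s {
        WinA | WinB => s, // game over: absorbing states
        Deuce => if a_wins_point { AdvA } else { AdvB },
        AdvA => if a_wins_point { WinA } else { Deuce },
        AdvB => if a_wins_point { Deuce } else { WinB },
        Points(a, b) => {
            let (a, b) = if a_wins_point { (a + 1, b) } else { (a, b + 1) };
            match (a, b) {
                (4, _) => WinA,
                (_, 4) => WinB,
                (3, 3) => Deuce,
                _ => Points(a, b),
            }
        }
    }
}

// Bounded observational equivalence: both states yield the same terminal
// result for every point sequence of up to `depth` points.
fn equivalent(s1: Game, s2: Game, depth: u32) -> bool {
    use Game::*;
    let done1 = matches!(s1, WinA | WinB);
    let done2 = matches!(s2, WinA | WinB);
    if done1 || done2 {
        return s1 == s2;
    }
    if depth == 0 {
        return true;
    }
    equivalent(step(s1, true), step(s2, true), depth - 1)
        && equivalent(step(s1, false), step(s2, false), depth - 1)
}
```

Merging the equivalent states is exactly FSA minimization -- the same idea behind sharing suffixes in a terms-dictionary FST.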
Aha! This is an interesting approach:
Lucene can also bound the size of this "suffix hashmap" using the crazy cryptic |
The next paragraph in the blog is also very interesting!
Since Lucene's FSTs are now "off-heap", we could take a similar approach. I think this is lower priority, but I'll open a spinoff issue for it too...
I've been designing how to account for the optional states that each term may end up with -- namely, how to deal with the following:
I came up with a divide et impera (divide and conquer) approach. The idea is to classify which of the 8 outcomes (2^3, as there are 3 dimensions) a term belongs to. At indexing time, for a given field, the terms within each category share the same structure for their term information, and we can apply the RefBlock + bit-packing encoding scheme.

We will still use an FST to encode the term's ordinal. However, instead of storing the global ordinal we will store (category, ord), where ord is the ordinal within the category. This can fit into a long, with 3 bits for the category and the rest for the ordinal.

At search time, to look up a term, we consult the FST to get back the category and local ordinal, then locate the data file for that category and extract the term information with the local ordinal.

Of course, I need to handle multiple fields, etc., and there are details like how to organize the files. Besides all that, I believe this scheme can work out. In particular, it has a few nice properties
On the other hand, it might not compress as well as it potentially could, especially for monotonically increasing values such as the postings start offset. That's because nearby terms (by their global ordinal) may be spread into different categories, thus losing a little bit of locality.
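The (category, local ordinal) encoding described above is simple to sketch. Here is one possible packing into a single 64-bit FST output -- 3 bits of category, 61 bits of local ordinal -- with names of our own choosing:

```rust
const CATEGORY_BITS: u32 = 3;
const ORD_BITS: u32 = 64 - CATEGORY_BITS; // 61 bits left for the local ordinal

/// Pack a category (0..8) and a within-category ordinal into one u64.
fn pack(category: u8, local_ord: u64) -> u64 {
    assert!(category < (1u32 << CATEGORY_BITS) as u8);
    assert!(local_ord < (1u64 << ORD_BITS));
    ((category as u64) << ORD_BITS) | local_ord
}

/// Recover (category, local ordinal) from the packed value.
fn unpack(encoded: u64) -> (u8, u64) {
    ((encoded >> ORD_BITS) as u8, encoded & ((1u64 << ORD_BITS) - 1))
}
```

61 bits is far more ordinal space than any segment needs, so the 3 category bits are effectively free.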
I'm still consuming this thread; pardon me if I ask something that's already been discussed.
Are you referring to the

I think we can even do this as a separate PR? I could look into it as part of #12902. Let me know if there are any other places that should be doing this.
Sounds exciting. I could imagine we could drop an entire clause with a single FST lookup?
Ah nvm, I see you already created a brand-new writer. But that random-access terms dict writer could be written off-heap as well. I added a comment in the PR.
Description
Hello!
I've been working for a while on a benchmark to compare the features and performance of Lucene and Tantivy, a Rust search engine library that was heavily inspired by Lucene.
The benchmark uses the corpus and queries from luceneutil (the framework for the Lucene nightly bench). Since not all query types are supported by Tantivy, it currently focuses on Term/Boolean/PhraseQuery. Tantivy in general showed performance advantages, and I got motivated to understand why.
I documented the two engines' inverted index implementations per my understanding; here is the wiki. Specifically, both engines use an FST to aid term lookup, but the ways they use it are quite different. In summary, Lucene uses the FST to map term prefixes, followed by scanning the on-disk blocks of terms; Tantivy uses the FST to map all terms to their ordinals, and uses that ordinal/index to decode at most one full block.
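To make the contrast concrete, here is a toy sketch of the two lookup styles, with sorted in-memory data standing in for the FST in both cases (structure and names are ours, not either engine's actual code):

```rust
// "Lucene-style": an index maps a block's leading term to a block of terms,
// which is then scanned linearly for the exact term.
fn lookup_by_prefix_scan(blocks: &[(&str, Vec<&str>)], term: &str) -> Option<usize> {
    // Find the last block whose leading term is <= the query term...
    let block = blocks.iter().take_while(|(lead, _)| *lead <= term).last()?;
    // ...then scan within that block.
    block.1.iter().position(|t| *t == term)
}

// "Tantivy-style": the index maps every term directly to its ordinal, so a
// lookup answers existence and yields the ordinal in one step.
fn lookup_by_ordinal(all_terms: &[&str], term: &str) -> Option<usize> {
    all_terms.binary_search(&term).ok()
}
```

The key behavioral difference the summary describes: in the first style a miss may still scan a block, while in the second a miss is decided by the term index alone.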
The proposal here is to try out Tantivy's term dictionary format, in which I can see some advantages
What do you think?