Where should we stream FST to disk directly? #12902

dungba88 · 2023-12-10T16:27:30Z

Description

Most FST use cases seem to end with the FST being written to a DataOutput (is it IndexOutput?). In that case we currently write the FST to an on-heap DataOutput (previously BytesStore, now ReadWriteDataOutput) and then save it to disk.

With #12624 it's possible to write the FST directly to an on-disk DataOutput. So maybe let's first compile a list of places that can be migrated to the new way?

Note: there is a catch with the new way. We can't write the metadata to the same DataOutput as the main FST body; it has to be written separately.

dungba88 · 2023-12-10T16:38:54Z

For reference, Tantivy saves the metadata at the end of the file, and first jumps to the end to read the size and the starting node. But we can't do that, as one file might contain multiple FSTs.
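To make the footer idea concrete, here is a minimal sketch of a "metadata at the end" layout. It is only an illustration: the class name, the two-long footer, and the field order are assumptions made for this example, not Tantivy's or Lucene's actual format.

```java
import java.nio.ByteBuffer;

/** Hypothetical footer layout: [body bytes][numBytes: long][startNode: long]. */
public class FooterLayoutSketch {
    // Fixed-size footer, so a reader can locate it from the end of the file alone.
    static final int FOOTER_BYTES = 2 * Long.BYTES;

    /** Appends the metadata after the body, which is what makes one-pass streaming possible. */
    static byte[] writeWithFooter(byte[] body, long startNode) {
        return ByteBuffer.allocate(body.length + FOOTER_BYTES)
                .put(body)
                .putLong(body.length)
                .putLong(startNode)
                .array();
    }

    /** Jumps to the end of the file and returns {numBytes, startNode}. */
    static long[] readFooter(byte[] file) {
        ByteBuffer footer = ByteBuffer.wrap(file, file.length - FOOTER_BYTES, FOOTER_BYTES);
        long numBytes = footer.getLong();
        long startNode = footer.getLong();
        return new long[] {numBytes, startNode};
    }
}
```

With several FSTs in one file, jumping to the end only finds the footer of the last one, so the reader would need some other index to locate the earlier ones.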

dungba88 · 2023-12-15T04:23:00Z

A candidate could be FSTTermsWriter, which would let us build FSTPostingsFormat with much less heap. This would help #12513, where we are trying to build an FST with every single term in the search index.

dungba88 · 2023-12-15T05:33:41Z

I only looked briefly at the code, but it seems FSTTermsWriter writes the field metadata (number of terms, term freq, doc freq, etc.), the FST metadata, and the FST main body for each field into a single IndexOutput.

As we can't write the FST main body into the same IndexOutput as the FST metadata, there are 2 possibilities:

To keep the behavior the same as before, we would write the FST to a temporary IndexOutput (createTempOutput()?), then copy it to out. The file format stays exactly the same and no other code change is required, but it would be slower (more than 2x slower?), since we need to write the FST twice.
Or we could write the FST main body to a separate IndexOutput and the metadata to another IndexOutput. This would require propagating the change to the other code that touches these files, and it would be backward incompatible.
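A rough sketch of the two options, using plain in-memory streams as stand-ins for IndexOutput. Everything here is hypothetical (the class, the method names, the two-long metadata encoding); it does not use Lucene's API and only illustrates where the bytes go. The start node is passed in for simplicity, although in a real build it is only known once the FST is finished.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;

public class FstOutputOptionsSketch {
    /** Hypothetical metadata encoding (start node + body size), not Lucene's layout. */
    static byte[] encodeMeta(long startNode, long numBytes) {
        return ByteBuffer.allocate(2 * Long.BYTES).putLong(startNode).putLong(numBytes).array();
    }

    /**
     * Option 1: stream the body to a temp output, then write metadata followed by
     * a copy of the body into the single final output. Returns body bytes written.
     */
    static long viaTempOutput(byte[][] chunks, long startNode, ByteArrayOutputStream out) {
        ByteArrayOutputStream temp = new ByteArrayOutputStream(); // stand-in for a temp file
        long bodyBytesWritten = 0;
        for (byte[] chunk : chunks) {
            temp.write(chunk, 0, chunk.length); // first write of the body
            bodyBytesWritten += chunk.length;
        }
        byte[] meta = encodeMeta(startNode, temp.size());
        out.write(meta, 0, meta.length); // metadata still precedes the body, as before
        byte[] body = temp.toByteArray();
        out.write(body, 0, body.length); // second write of the body
        bodyBytesWritten += body.length;
        return bodyBytesWritten;
    }

    /**
     * Option 2: stream the body straight to its own output and put the metadata
     * in a different one. Returns body bytes written.
     */
    static long viaSeparateOutputs(byte[][] chunks, long startNode,
                                   ByteArrayOutputStream metaOut, ByteArrayOutputStream dataOut) {
        long bodyBytesWritten = 0;
        for (byte[] chunk : chunks) {
            dataOut.write(chunk, 0, chunk.length); // the only write of the body
            bodyBytesWritten += chunk.length;
        }
        byte[] meta = encodeMeta(startNode, bodyBytesWritten);
        metaOut.write(meta, 0, meta.length);
        return bodyBytesWritten;
    }
}
```

In this sketch, option 1 writes every body byte twice but leaves the single-output layout untouched, while option 2 writes each byte once at the cost of a second output that readers must know about.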

dungba88 · 2023-12-15T07:00:37Z

I realized FSTPostingsFormat is an experimental format, which is only used in 5 places! The Lucene9xPostingsFormat classes seem to be the active ones, and they in turn use Lucene90BlockTreeTermsWriter.

If we change FSTPostingsFormat, are we free to make any backward-incompatible change?

dungba88 · 2023-12-28T05:54:02Z

Put up the first PR for FSTPostingsFormat: #12980

dungba88 added the type:task label Dec 10, 2023

dungba88 mentioned this issue Dec 15, 2023

Try out a tantivy's term dictionary format #12513

Open
