Where should we stream FST to disk directly? #12902

Open
dungba88 opened this issue Dec 10, 2023 · 5 comments

Comments

@dungba88
Contributor

Description

Most FST use cases seem to end with the FST being written to a DataOutput (is it an IndexOutput?). In those cases we currently write the FST to an on-heap DataOutput (previously BytesStore, now ReadWriteDataOutput) and then save it to disk.

With #12624 it's possible to write the FST directly to an on-disk DataOutput. So maybe let's first compile a list of places that can be migrated to the new way?

Note: the new way has a catch. We can't write the metadata to the same DataOutput as the main FST body; it has to be written separately.

@dungba88
Contributor Author

For reference, Tantivy saves the metadata at the end of the file, and a reader first jumps to the end to learn the size and starting node. But we can't do that, since one file might contain multiple FSTs.
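To make the footer idea concrete, here is a minimal sketch using only the JDK. It is not Tantivy's or Lucene's actual format: the class name, the two-long footer, and the helper methods are all hypothetical. It shows why the trick works for one FST per file and why the end of the file only locates the last FST once several are concatenated.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical illustration only; not the real Tantivy or Lucene on-disk layout.
public class FooterLayoutSketch {
    // Assumed footer: start node (long) followed by body length in bytes (long).
    static final int FOOTER_BYTES = 2 * Long.BYTES;

    // Writes the body first and the metadata last, so the body can be streamed
    // before the start node and size are known.
    public static byte[] write(byte[] body, long startNode) {
        ByteBuffer buf = ByteBuffer.allocate(body.length + FOOTER_BYTES);
        buf.put(body);
        buf.putLong(startNode);
        buf.putLong(body.length);
        return buf.array();
    }

    // Jumps to the end of the file and returns {startNode, numBytes}.
    public static long[] readFooter(byte[] file) {
        ByteBuffer buf = ByteBuffer.wrap(file);
        buf.position(file.length - FOOTER_BYTES);
        return new long[] {buf.getLong(), buf.getLong()};
    }

    // Simulates one file that holds two FSTs back to back.
    public static byte[] concat(byte[] first, byte[] second) {
        byte[] out = Arrays.copyOf(first, first.length + second.length);
        System.arraycopy(second, 0, out, first.length, second.length);
        return out;
    }

    public static void main(String[] args) {
        byte[] a = write(new byte[] {1, 2, 3}, 2L);
        byte[] b = write(new byte[] {7, 8}, 1L);

        // One FST per file: the footer is found by seeking to the end.
        System.out.println(Arrays.toString(readFooter(a)));            // [2, 3]

        // Two FSTs in one file: seeking to the end only finds the footer of b.
        // Locating the footer of a needs an offset stored somewhere else.
        System.out.println(Arrays.toString(readFooter(concat(a, b)))); // [1, 2]
    }
}
```

With several FSTs in one file, something outside the FST itself would have to record where each footer lives, which brings back the need for separately written metadata.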

@dungba88
Contributor Author

dungba88 commented Dec 15, 2023

One candidate could be FSTTermsWriter, which would let us build FSTPostingsFormat with much less heap. This would help #12513, where we are trying to build an FST with every single term in the search index.

@dungba88
Contributor Author

dungba88 commented Dec 15, 2023

I only looked at the code briefly, but it seems FSTTermsWriter writes the field metadata (number of terms, term freq, doc freq, etc.), the FST metadata, and the FST main body for each field into a single IndexOutput.

As we can't write the FST main body into the same IndexOutput as the FST metadata, there are two possibilities:

  • To keep the behavior the same as before, we write the FST into a temporary IndexOutput (createTempOutput()?), then copy it to out. The resulting file is exactly the same as before and no other code change is required, but it would be slower (more than 2x slower?), since we need to write the FST twice.
  • Or we could write the FST main body into one IndexOutput and the metadata into another. This would require propagating the change to the callers, and it would be backward incompatible.
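The two options can be sketched with plain JDK streams. This is only an illustration of the data flow, not Lucene code: BodyStreamer, viaTempFile, and viaSeparateOutputs are hypothetical names, and the "FST" is simulated by a callback that streams its body and returns a start node that is only known once the body is complete.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of the two layouts; none of these names exist in Lucene.
public class FstLayoutOptions {
    // Stand-in for an FST compiler: streams the body, then reports the start node.
    public interface BodyStreamer {
        long streamTo(OutputStream out) throws IOException;
    }

    // Option 1: same single-file layout as before (metadata, then body),
    // at the cost of writing the body twice.
    public static void viaTempFile(BodyStreamer fst, DataOutputStream out) throws IOException {
        Path tmp = Files.createTempFile("fst-body", ".tmp");
        try {
            long startNode;
            try (OutputStream tmpOut = Files.newOutputStream(tmp)) {
                startNode = fst.streamTo(tmpOut);   // first write: temp file
            }
            out.writeLong(startNode);               // metadata is now known
            out.writeLong(Files.size(tmp));
            Files.copy(tmp, out);                   // second write: copy the body
        } finally {
            Files.deleteIfExists(tmp);
        }
    }

    // Option 2: body and metadata go to different outputs; the body is written once,
    // but readers must now open two outputs (a format change).
    public static void viaSeparateOutputs(
            BodyStreamer fst, DataOutputStream bodyOut, DataOutputStream metaOut) throws IOException {
        int before = bodyOut.size();                // bytes already in the body output
        long startNode = fst.streamTo(bodyOut);
        metaOut.writeLong(startNode);
        metaOut.writeLong(bodyOut.size() - before);
    }

    public static void main(String[] args) throws IOException {
        BodyStreamer fake = o -> { o.write(new byte[] {10, 20, 30}); return 2L; };

        ByteArrayOutputStream single = new ByteArrayOutputStream();
        viaTempFile(fake, new DataOutputStream(single));
        System.out.println("option 1 bytes: " + single.size());

        ByteArrayOutputStream body = new ByteArrayOutputStream();
        ByteArrayOutputStream meta = new ByteArrayOutputStream();
        viaSeparateOutputs(fake, new DataOutputStream(body), new DataOutputStream(meta));
        System.out.println("option 2 bytes: " + body.size() + " body, " + meta.size() + " meta");
    }
}
```

In this sketch option 1 keeps readers unchanged because the metadata still precedes the body in one stream, while option 2 avoids the extra copy but changes what a reader has to open.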

@dungba88
Contributor Author

dungba88 commented Dec 15, 2023

I realized FSTPostingsFormat is an experimental format that is used in only 5 places! The Lucene9xPostingsFormat classes seem to be the active ones, and they in turn use Lucene90BlockTreeTermsWriter.

If we change FSTPostingsFormat, are we free to make any backward-incompatible change?

@dungba88
Contributor Author

Opened the first PR, for FSTPostingsFormat: #12980
