Serialisation of a corpus #8

Closed
claudius108 opened this issue Apr 15, 2023 · 6 comments
Comments

@claudius108

Hi,

Thanks for this very nice library! It recognizes very well "tattvacintam" when the corpus contains "???avaci??am".

My question is whether such a corpus could be stored on disk. I would like to index some strings and use the corpus in the browser, with the library compiled to WebAssembly.

Best regards!
Claudius Teodorescu

@compenguy
Owner

It would be possible. The most storage-efficient serialization of a corpus is just the list of words, from which the full corpus can be rebuilt. The most CPU-efficient serialization is the full pair of mappings: words-to-ngrams and ngrams-to-words. This means the Serialize trait is easy to implement, but implementing the Deserialize trait would require a key translation function, which cannot be recovered from the serialized data.

To solve the Deserialize problem, we would need a breaking change that makes the key translation function optional, plus methods to set a key translation function after the fact.
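
As a rough illustration of the two ideas above (word-list serialization and an optional key translation function that is set after loading), here is a minimal std-only sketch. `WordCorpus`, `set_key_trans`, `to_word_list`, `from_word_list`, and `candidates` are hypothetical names invented for this example and are not ngrammatic's actual API. It also assumes words never contain newlines.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical stand-in for a corpus type; NOT ngrammatic's real API.
pub struct WordCorpus {
    arity: usize,
    words: BTreeSet<String>,
    /// n-gram -> words whose (translated) key contains that n-gram.
    ngrams: HashMap<String, BTreeSet<String>>,
    /// Optional, so a corpus can be rebuilt from data without it.
    key_trans: Option<Box<dyn Fn(&str) -> String>>,
}

impl WordCorpus {
    pub fn new(arity: usize) -> Self {
        assert!(arity > 0, "n-gram arity must be at least 1");
        WordCorpus { arity, words: BTreeSet::new(), ngrams: HashMap::new(), key_trans: None }
    }

    /// Install a key translation function after the fact and re-index.
    pub fn set_key_trans(&mut self, f: Box<dyn Fn(&str) -> String>) {
        self.key_trans = Some(f);
        let words: Vec<String> = self.words.iter().cloned().collect();
        self.ngrams.clear();
        for w in &words {
            self.add(w);
        }
    }

    fn key(&self, word: &str) -> String {
        match &self.key_trans {
            Some(f) => f(word),
            None => word.to_string(),
        }
    }

    /// Character n-grams of `key`; keys shorter than `arity` yield one gram.
    fn grams(&self, key: &str) -> Vec<String> {
        let chars: Vec<char> = key.chars().collect();
        if chars.len() < self.arity {
            return vec![key.to_string()];
        }
        chars.windows(self.arity).map(|w| w.iter().collect()).collect()
    }

    pub fn add(&mut self, word: &str) {
        let key = self.key(word);
        for gram in self.grams(&key) {
            self.ngrams.entry(gram).or_default().insert(word.to_string());
        }
        self.words.insert(word.to_string());
    }

    /// Storage-efficient form: one word per line, sorted.
    pub fn to_word_list(&self) -> String {
        self.words.iter().map(String::as_str).collect::<Vec<_>>().join("\n")
    }

    /// Rebuild the full index from the word list (no key translation yet).
    pub fn from_word_list(arity: usize, data: &str) -> Self {
        let mut corpus = Self::new(arity);
        for line in data.lines().filter(|l| !l.is_empty()) {
            corpus.add(line);
        }
        corpus
    }

    /// Words sharing at least one n-gram with the query.
    pub fn candidates(&self, query: &str) -> BTreeSet<String> {
        let key = self.key(query);
        let mut out = BTreeSet::new();
        for gram in self.grams(&key) {
            if let Some(ws) = self.ngrams.get(&gram) {
                out.extend(ws.iter().cloned());
            }
        }
        out
    }
}
```

With this shape, a caller would write the output of `to_word_list` to disk, later call `from_word_list` on the file contents, and then optionally call `set_key_trans` (for example with a lowercasing closure) to restore the original matching behaviour. The trade-off is that the index is recomputed on every load, which is the CPU cost mentioned above.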

@claudius108
Author

claudius108 commented Apr 16, 2023 via email

@LucaCappelletti94
Contributor

I have implemented serde serialization/deserialization by replacing the boxed function with chainable key transformers. See pull request #10.
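
The code in pull request #10 is not shown in this thread, so the following is only a guess at what "chainable key transformers" could look like, written without serde to stay self-contained. The point is that a chain of plain enum values is data rather than a closure, so it can be written out and read back. `KeyTransformer`, `KeyChain`, and the comma-separated encoding are all assumptions made up for this sketch. With serde, the hand-written `encode`/`decode` would presumably be replaced by derived `Serialize`/`Deserialize` impls.

```rust
/// Hypothetical transformer steps; plain data, unlike a boxed closure.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KeyTransformer {
    Lowercase,
    Trim,
    AlphanumericOnly,
}

impl KeyTransformer {
    pub fn apply(self, input: &str) -> String {
        match self {
            Self::Lowercase => input.to_lowercase(),
            Self::Trim => input.trim().to_string(),
            Self::AlphanumericOnly => input.chars().filter(|c| c.is_alphanumeric()).collect(),
        }
    }

    pub fn name(self) -> &'static str {
        match self {
            Self::Lowercase => "lowercase",
            Self::Trim => "trim",
            Self::AlphanumericOnly => "alphanumeric",
        }
    }

    pub fn from_name(name: &str) -> Option<Self> {
        match name {
            "lowercase" => Some(Self::Lowercase),
            "trim" => Some(Self::Trim),
            "alphanumeric" => Some(Self::AlphanumericOnly),
            _ => None,
        }
    }
}

/// An ordered chain of transformers, applied left to right.
#[derive(Debug, Clone, PartialEq, Eq, Default)]
pub struct KeyChain(Vec<KeyTransformer>);

impl KeyChain {
    /// Append a step, builder-style, so calls can be chained.
    pub fn then(mut self, step: KeyTransformer) -> Self {
        self.0.push(step);
        self
    }

    pub fn apply(&self, input: &str) -> String {
        self.0.iter().fold(input.to_string(), |acc, step| step.apply(&acc))
    }

    /// Illustrative text encoding: step names joined by commas.
    pub fn encode(&self) -> String {
        self.0.iter().map(|s| s.name()).collect::<Vec<_>>().join(",")
    }

    /// Returns `None` if any step name is unknown.
    pub fn decode(encoded: &str) -> Option<Self> {
        encoded
            .split(',')
            .filter(|part| !part.is_empty())
            .map(KeyTransformer::from_name)
            .collect::<Option<Vec<_>>>()
            .map(KeyChain)
    }
}
```

A chain would be built as `KeyChain::default().then(KeyTransformer::Trim).then(KeyTransformer::Lowercase)` and stored next to the corpus data, so a deserialized corpus can reconstruct the same key translation without the caller supplying a function.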

@claudius108
Author

Thank you!

@LucaCappelletti94
Contributor

I just finished adding the necessary extra bells and whistles to trie-rs, and I will shortly be testing how much memory we can save by switching from the hashmap to that. Will keep you posted. Until @compenguy considers merging my pull request, you can use my fork here: https://github.com/LucaCappelletti94/ngrammatic/tree/master

@claudius108
Author

claudius108 commented Apr 1, 2024 via email
