Serialisation of a corpus #8
It would be possible. The most storage-efficient serialization of a corpus is just the list of words, from which the full corpus can be rebuilt. The most CPU-efficient serialization is the full mappings of words to n-grams and n-grams to words. This means the Serialize trait is easy to implement, but implementing the Deserialize trait would require a key translation function, which is not part of the serialized data. Solving the Deserialize problem would need a breaking change: make the key translation function optional, and add methods to set a key translation function after the fact.
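The storage-efficient option described above can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not ngrammatic's actual API: `TinyCorpus`, `to_word_list`, and `from_word_list` are hypothetical names, n-grams are plain unpadded character windows, and words are assumed not to contain newlines.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical stand-in for a corpus (not the ngrammatic type):
/// it keeps both mappings mentioned in the comment above.
#[derive(Debug, Default, PartialEq)]
struct TinyCorpus {
    arity: usize,
    words_to_ngrams: HashMap<String, BTreeSet<String>>,
    ngrams_to_words: HashMap<String, BTreeSet<String>>,
}

impl TinyCorpus {
    fn new(arity: usize) -> Self {
        // `windows(0)` would panic, so reject a zero arity up front.
        assert!(arity > 0, "n-gram arity must be at least 1");
        TinyCorpus { arity, ..Default::default() }
    }

    /// Index one word under every character n-gram it contains.
    fn add_word(&mut self, word: &str) {
        let chars: Vec<char> = word.chars().collect();
        let mut grams = BTreeSet::new();
        for window in chars.windows(self.arity) {
            grams.insert(window.iter().collect::<String>());
        }
        for gram in &grams {
            self.ngrams_to_words
                .entry(gram.clone())
                .or_default()
                .insert(word.to_string());
        }
        self.words_to_ngrams.insert(word.to_string(), grams);
    }

    /// Storage-efficient form: only the words, sorted, one per line.
    /// Assumes no word contains a newline.
    fn to_word_list(&self) -> String {
        let mut words: Vec<&str> =
            self.words_to_ngrams.keys().map(String::as_str).collect();
        words.sort_unstable();
        words.join("\n")
    }

    /// Rebuild both mappings from the word list (costs CPU time on load).
    fn from_word_list(arity: usize, text: &str) -> Self {
        let mut corpus = TinyCorpus::new(arity);
        for word in text.lines().filter(|line| !line.is_empty()) {
            corpus.add_word(word);
        }
        corpus
    }
}
```

Writing the output of `to_word_list` to a file and calling `from_word_list` on load trades rebuild time for a small file; storing both maps instead would skip the rebuild at the cost of a larger file.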
Thank you!
I have a list of words with question marks instead of diacritics. I will
test the first approach, if the list is not that large, and I will let you
know about the results.
All the best,
Claudius
(In reply to Will Page's comment of Sun, 16 Apr 2023, 04:39.)
I have implemented serde serialization/deserialization by replacing the boxed function with chainable key transformers. See pull request #10.
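The thread does not show the design used in pull request #10, so the following is only an assumed illustration of the general idea: a boxed closure cannot be written to disk, but a chain of named transformation steps is plain data. `KeyTransformer`, `KeyPipeline`, and the text encoding are hypothetical, and a real implementation would presumably derive the serde traits rather than use the hand-rolled text format shown here.

```rust
/// Hypothetical, data-only description of one key transformation step.
#[derive(Debug, Clone, PartialEq)]
enum KeyTransformer {
    Lowercase,
    Trim,
    StripNonAlphanumeric,
}

impl KeyTransformer {
    fn apply(&self, key: &str) -> String {
        match self {
            KeyTransformer::Lowercase => key.to_lowercase(),
            KeyTransformer::Trim => key.trim().to_string(),
            KeyTransformer::StripNonAlphanumeric => {
                key.chars().filter(|c| c.is_alphanumeric()).collect()
            }
        }
    }

    fn name(&self) -> &'static str {
        match self {
            KeyTransformer::Lowercase => "lowercase",
            KeyTransformer::Trim => "trim",
            KeyTransformer::StripNonAlphanumeric => "strip_non_alphanumeric",
        }
    }

    fn from_name(name: &str) -> Option<Self> {
        match name {
            "lowercase" => Some(KeyTransformer::Lowercase),
            "trim" => Some(KeyTransformer::Trim),
            "strip_non_alphanumeric" => Some(KeyTransformer::StripNonAlphanumeric),
            _ => None,
        }
    }
}

/// An ordered chain of steps; unlike `Box<dyn Fn(&str) -> String>`,
/// it can be written out and read back.
#[derive(Debug, Clone, Default, PartialEq)]
struct KeyPipeline {
    steps: Vec<KeyTransformer>,
}

impl KeyPipeline {
    /// Append a step and return the pipeline, so calls can be chained.
    fn then(mut self, step: KeyTransformer) -> Self {
        self.steps.push(step);
        self
    }

    /// Run every step in order on the key.
    fn apply(&self, key: &str) -> String {
        self.steps
            .iter()
            .fold(key.to_string(), |acc, step| step.apply(&acc))
    }

    /// Serialize as comma-separated step names.
    fn to_text(&self) -> String {
        let names: Vec<&str> = self.steps.iter().map(|s| s.name()).collect();
        names.join(",")
    }

    /// Parse the text form; returns `None` if any step name is unknown.
    fn from_text(text: &str) -> Option<Self> {
        let steps = text
            .split(',')
            .filter(|name| !name.is_empty())
            .map(KeyTransformer::from_name)
            .collect::<Option<Vec<_>>>()?;
        Some(KeyPipeline { steps })
    }
}
```

With this shape, `KeyPipeline::default().then(KeyTransformer::Trim).then(KeyTransformer::Lowercase)` describes the same work a closure would do, and the description survives a round trip through `to_text` and `from_text`.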
Thank you!
I just finished adding the necessary extra bells and whistles to trie-rs, and I will shortly be testing how much memory we can save by switching from the hashmap to it. Will keep you posted. Until @compenguy considers merging my pull request, you can use my fork here: https://github.com/LucaCappelletti94/ngrammatic/tree/master
Hi!
I think this can also be useful: https://crates.io/crates/fst
I am using it for browser-based search engines. It allows fast regex
search, prefix search, and more.
Claudius
(In reply to Luca Cappelletti's comment of Mon, 1 Apr 2024, 19:01, about trie-rs <https://github.com/LucaCappelletti94/trie-rs>.)
Hi,
Thanks for this very nice library! It recognizes "tattvacintam" very well when the corpus contains "???avaci??am".
My question is whether such a corpus could be stored on disk. I would like to index some strings and use the corpus in the browser, with the library compiled to WebAssembly.
Best regards!
Claudius Teodorescu
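As a rough illustration of why "tattvacintam" can still be matched against "???avaci??am": the two strings share several character n-grams. The sketch below uses plain unpadded bigrams and Jaccard similarity as an assumed stand-in; `ngrams` and `similarity` are hypothetical helpers, and ngrammatic's own n-gram construction and scoring may differ.

```rust
use std::collections::HashSet;

/// Set of unpadded character n-grams of `word` (hypothetical helper).
fn ngrams(word: &str, arity: usize) -> HashSet<String> {
    if arity == 0 {
        // `windows(0)` would panic; treat a zero arity as "no n-grams".
        return HashSet::new();
    }
    let chars: Vec<char> = word.chars().collect();
    chars
        .windows(arity)
        .map(|window| window.iter().collect::<String>())
        .collect()
}

/// Jaccard similarity: shared n-grams divided by all distinct n-grams.
/// This metric is an assumption made for the sketch.
fn similarity(a: &str, b: &str, arity: usize) -> f64 {
    let grams_a = ngrams(a, arity);
    let grams_b = ngrams(b, arity);
    let union = grams_a.union(&grams_b).count();
    if union == 0 {
        return 0.0;
    }
    grams_a.intersection(&grams_b).count() as f64 / union as f64
}
```

With bigrams, the query and the damaged corpus entry share "va", "ac", "ci", and "am", so `similarity("tattvacintam", "???avaci??am", 2)` is well above the score of an unrelated word even though every diacritic was replaced by a question mark.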