Serialisation of a corpus #8

claudius108 · 2023-04-15T20:54:24Z

Hi,

Thanks for this very nice library! It recognizes very well "tattvacintam" when the corpus contains "???avaci??am".

My question is whether such a corpus could be stored on disk. I would like to index some strings and use the corpus in the browser, with the library compiled to WebAssembly.

Best regards!
Claudius Teodorescu

compenguy · 2023-04-16T01:39:20Z

It would be possible. The most storage-efficient serialization of a corpus is just the list of words, from which the full corpus can be rebuilt. The most CPU-efficient serialization is the full mappings of words-to-ngrams and ngrams-to-words. This means the Serialize trait is easy to implement, but to implement the Deserialize trait we would need a key translation function.
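As a rough illustration of the storage-efficient option, here is a minimal sketch that saves only the word list and rebuilds both mappings on load. `ToyCorpus` and all of its methods are hypothetical stand-ins written for this example, not ngrammatic's actual types or API, and the sketch assumes that words never contain newlines.

```rust
use std::collections::{BTreeSet, HashMap};

/// Toy stand-in for an n-gram corpus (hypothetical, not ngrammatic's real type).
struct ToyCorpus {
    arity: usize,
    words_to_ngrams: HashMap<String, BTreeSet<String>>,
    ngrams_to_words: HashMap<String, BTreeSet<String>>,
}

impl ToyCorpus {
    fn new(arity: usize) -> Self {
        assert!(arity > 0, "arity must be at least 1");
        ToyCorpus {
            arity,
            words_to_ngrams: HashMap::new(),
            ngrams_to_words: HashMap::new(),
        }
    }

    /// Index one word: compute its character n-grams and update both mappings.
    fn add_text(&mut self, word: &str) {
        let chars: Vec<char> = word.chars().collect();
        let mut grams = BTreeSet::new();
        // `windows` yields nothing when the word is shorter than `arity`.
        for window in chars.windows(self.arity) {
            grams.insert(window.iter().collect::<String>());
        }
        for gram in &grams {
            self.ngrams_to_words
                .entry(gram.clone())
                .or_default()
                .insert(word.to_string());
        }
        self.words_to_ngrams.insert(word.to_string(), grams);
    }

    /// Storage-efficient form: only the sorted words, one per line.
    fn to_word_list(&self) -> String {
        let mut words: Vec<&str> = self.words_to_ngrams.keys().map(String::as_str).collect();
        words.sort();
        words.join("\n")
    }

    /// Rebuild the full mappings by re-indexing every stored word.
    fn from_word_list(arity: usize, text: &str) -> Self {
        let mut corpus = ToyCorpus::new(arity);
        for word in text.lines().filter(|line| !line.is_empty()) {
            corpus.add_text(word);
        }
        corpus
    }
}
```

The trade-off is the one described above: the saved form is small, but loading costs a full re-index, whereas storing both maps would skip that work at the price of a larger file.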

In order to solve the Deserialize problem, we would need to make a breaking change to make the key translation function optional, and add methods to set a key translation function after the fact.
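To make the idea concrete, here is a small sketch of a corpus whose key translation function is optional and can be attached after loading. The names `LazyKeyCorpus`, `KeyTrans`, `from_entries`, and `set_key_trans` are invented for this example and are not the library's API; the identity fallback when no function is set is also an assumption.

```rust
use std::collections::HashMap;

/// Hypothetical key translation function: maps raw text to its lookup key.
type KeyTrans = Box<dyn Fn(&str) -> String>;

/// Toy corpus with an optional key translation function (not the library's real struct).
struct LazyKeyCorpus {
    /// key -> original word; this is the part that could be serialized.
    entries: HashMap<String, String>,
    /// A closure cannot be serialized, so a freshly loaded corpus starts with `None`.
    key_trans: Option<KeyTrans>,
}

impl LazyKeyCorpus {
    /// Roughly what a Deserialize implementation could produce: data only, no function.
    fn from_entries(entries: HashMap<String, String>) -> Self {
        LazyKeyCorpus { entries, key_trans: None }
    }

    /// Attach the key translation function after the fact.
    fn set_key_trans(&mut self, f: KeyTrans) {
        self.key_trans = Some(f);
    }

    /// Apply the key translation if one is set, otherwise use the text unchanged.
    fn key(&self, text: &str) -> String {
        match &self.key_trans {
            Some(f) => f(text),
            None => text.to_string(),
        }
    }

    /// Store a word under its translated key.
    fn add_text(&mut self, text: &str) {
        let key = self.key(text);
        self.entries.insert(key, text.to_string());
    }

    /// Look up a query by its translated key.
    fn get(&self, query: &str) -> Option<&str> {
        self.entries.get(&self.key(query)).map(String::as_str)
    }
}
```

In this sketch, a caller would load the entries, then call `set_key_trans(Box::new(|s: &str| s.to_lowercase()))` with the same function that was used when the corpus was built; lookups made before that call fall back to the untranslated text.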

claudius108 · 2023-04-16T05:04:08Z

Thank you! I have a list of words with question marks instead of diacritics. I will test the first approach, if the list is not that large, and I will let you know about the results. All the best, Claudius

LucaCappelletti94 · 2024-03-31T20:39:41Z

I have implemented the serde serialization/deserialization by swapping the boxed function with chainable key transformers. See pull request #10

claudius108 · 2024-04-01T05:05:58Z

Thank you!

LucaCappelletti94 · 2024-04-01T16:01:02Z

I just finished adding the necessary extra bells and whistles to trie-rs (https://github.com/LucaCappelletti94/trie-rs), and I will shortly be testing how much memory we can save by switching from the hashmap to it. Will keep you posted. Until @compenguy considers merging my pull request, you can use my fork here: https://github.com/LucaCappelletti94/ngrammatic/tree/master

claudius108 · 2024-04-01T18:46:27Z

Hi! I think this could also be useful: https://crates.io/crates/fst (I am using it for browser-based search engines). It allows fast regex search, prefix search, and more. Claudius

claudius108 closed this as completed Apr 1, 2024
