Using a memmap for the dictionary #148

Closed
kirawi opened this issue Feb 3, 2024 · 2 comments

Comments

kirawi commented Feb 3, 2024

Is your feature request related to a problem? Please describe.

Bigger dictionaries make it harder to support lower-end hardware with limited memory.

Describe the solution you'd like

I would like the option to memory-map an uncompressed dictionary instead of loading it into memory, since storage is usually cheaper than memory. My application does not need extreme performance, so I feel the IO penalty would be acceptable. If Vibrato processes the dictionary into some other internal form, it would also be nice to be able to serialize that to a file and memmap it as well. fst offers something like that: https://docs.rs/fst/latest/fst/#example-stream-to-a-file-and-memory-map-it-for-searching
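To make the trade-off concrete, here is a minimal sketch of looking up entries directly in an uncompressed on-disk file without loading it into memory. It is not Vibrato's API or dictionary format: the record layout (a `u32` key and a `u32` value, little-endian, sorted by key) and the names `write_records` and `lookup` are assumptions made up for illustration. It also swaps in plain `seek`/`read_exact` calls from the standard library as a stand-in for a memory map, so that it runs without extra crates. With a real mapping (for example via the `memmap2` crate), the same binary search would index into the mapped byte slice instead.

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom, Write};
use std::path::Path;

/// Size of one on-disk record: a u32 key followed by a u32 value.
/// This layout is hypothetical, not Vibrato's dictionary format.
const RECORD_SIZE: u64 = 8;

/// Writes `(key, value)` pairs as fixed-size little-endian records.
/// The caller must pass the pairs already sorted by key.
pub fn write_records(path: &Path, sorted: &[(u32, u32)]) -> io::Result<()> {
    let mut file = File::create(path)?;
    for (key, value) in sorted {
        file.write_all(&key.to_le_bytes())?;
        file.write_all(&value.to_le_bytes())?;
    }
    file.flush()
}

/// Binary-searches the file for `key`, reading one 8-byte record per step,
/// so resident memory stays constant regardless of the file size.
pub fn lookup(file: &mut File, key: u32) -> io::Result<Option<u32>> {
    let record_count = file.metadata()?.len() / RECORD_SIZE;
    let (mut lo, mut hi) = (0u64, record_count);
    let mut buf = [0u8; 8];
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        file.seek(SeekFrom::Start(mid * RECORD_SIZE))?;
        file.read_exact(&mut buf)?;
        let found = u32::from_le_bytes([buf[0], buf[1], buf[2], buf[3]]);
        match found.cmp(&key) {
            std::cmp::Ordering::Equal => {
                let value = u32::from_le_bytes([buf[4], buf[5], buf[6], buf[7]]);
                return Ok(Some(value));
            }
            std::cmp::Ordering::Less => lo = mid + 1,
            std::cmp::Ordering::Greater => hi = mid,
        }
    }
    Ok(None)
}
```

Each lookup costs a handful of small reads (about log2 of the record count), which is the IO penalty mentioned above; the operating system's page cache typically absorbs repeated reads of hot regions.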

Describe alternatives you've considered

None that I'm aware of.

Additional context

None

kirawi commented Feb 3, 2024

Actually, this might not make sense since it's a niche requirement. I'll explore it in my own fork though.

@kirawi kirawi closed this as completed Feb 3, 2024
@vbkaisetsu
Member

@kirawi I think the Vaporetto tokenizer is a better choice for small devices.
Lattice-based tokenizers (including Vibrato) require large dictionaries, while pointwise tokenizers (including Vaporetto) work with smaller models.

There is an example that works on STM32F3DISCOVERY.
https://github.com/daac-tools/vaporetto/tree/main/examples/embedded_device


2 participants