experiment using fsts for Unicode tables #32

Closed
BurntSushi opened this issue Feb 19, 2017 · 4 comments

@BurntSushi
Owner

BurntSushi commented Feb 19, 2017

I'm interested to see how well FSTs would do for storing Unicode tables, particularly for use in the regex-syntax crate. It could make set operations especially efficient and reduce binary size.

In order for this to work, the fst crate itself can't depend on regex-syntax. I think the right answer here would be to create a new sub-crate, called fst-regex, which provides the Automaton impl. That would trim off the current regex-syntax and utf8-ranges deps, and the mmap feature could be disabled, which would leave only byteorder, which is fine.

@fulmicoton As my sole (public) user, would this pose any problems for you? (I don't think it would.)

@fulmicoton
Contributor

@BurntSushi Can you elaborate on what you mean by mmap being disabled?

@BurntSushi
Owner Author

@fulmicoton I meant disabled inside regex-syntax. It is already an optional dep that is enabled by default. :-) See #27.

@fulmicoton
Contributor

No problem for me!

@BurntSushi
Owner Author

I think the time for this has come and gone. It's unlikely I'd ever be okay with regex-syntax depending on fst, which is a pretty meaty dependency. Moreover, while fsts are reasonably better in terms of space efficiency, they tend to be slower than simple binary search or specialty tries. So I'm going to close this.
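
For readers unfamiliar with the "simple binary search" alternative mentioned above, here is a minimal sketch of the general technique: a sorted table of inclusive codepoint ranges searched with the standard library's `binary_search_by`. The table name `ASCII_ALNUM` and the function `contains` are hypothetical names chosen for illustration, and this is not necessarily how regex-syntax lays out its tables.

```rust
use std::cmp::Ordering;

/// Hypothetical illustrative table: inclusive codepoint ranges,
/// sorted by start and non-overlapping (ASCII digits and letters).
const ASCII_ALNUM: &[(u32, u32)] = &[
    (0x30, 0x39), // '0'..='9'
    (0x41, 0x5A), // 'A'..='Z'
    (0x61, 0x7A), // 'a'..='z'
];

/// Returns true if `c` falls inside any range of `table`.
/// Assumes `table` is sorted and its ranges do not overlap.
fn contains(table: &[(u32, u32)], c: char) -> bool {
    let cp = c as u32;
    table
        .binary_search_by(|&(start, end)| {
            if end < cp {
                // This range lies entirely below the codepoint.
                Ordering::Less
            } else if start > cp {
                // This range lies entirely above the codepoint.
                Ordering::Greater
            } else {
                // start <= cp <= end: the codepoint is in this range.
                Ordering::Equal
            }
        })
        .is_ok()
}
```

A lookup such as `contains(ASCII_ALNUM, 'q')` takes O(log n) comparisons over a flat, statically allocated slice, which is part of why this layout is hard to beat on speed even when it is larger than an FST.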

FWIW, the ucd-generate tool can output FSTs for many different Unicode things, including properties and Unicode character names.
