ewts indexes? #12

Open
eroux opened this issue Oct 29, 2017 · 0 comments
Collaborator

eroux commented Oct 29, 2017

It's not very clear how indexes are serialized on disk in terms of character encoding (see there), but it seems to me it could be UTF-8 and not UTF-16. If that's the case, storing the indexes in EWTS would divide their on-disk size by roughly 2. The situation should be clarified first, but if this is correct, EWTS indexes should be relatively easy to implement, although they'll make indexing a bit slower. This could certainly be done after the tokenizer, in a separate filter. It's quite important that the EWTS string is first converted to Unicode and then back to EWTS, so that it's normalized.
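To give a rough idea of the sizes involved, here is a minimal sketch comparing the encoded length of one Tibetan word with its EWTS form. It assumes index terms are stored as raw encoded bytes, which is exactly the point that still needs confirming. The sample word, its hand-written transliteration, and the `encoded_sizes` helper are only illustrative and are not part of the project. Tibetan code points (U+0F00 to U+0FFF) take 3 bytes each in UTF-8 and 2 in UTF-16, while EWTS is plain ASCII but often needs several letters per Tibetan character.

```python
# Rough size comparison for a single term, assuming the index stores
# terms as raw encoded bytes (an assumption, not a confirmed fact).

# The word "bkra shis", written with explicit code points:
# ba, ka, subjoined ra, tsheg, sha, vowel i, sa.
tibetan = "\u0f56\u0f40\u0fb2\u0f0b\u0f64\u0f72\u0f66"

# Hand-written EWTS for the same word (the tsheg becomes a space).
ewts = "bkra shis"


def encoded_sizes(unicode_term: str, ewts_term: str) -> dict:
    """Return the byte length of each candidate on-disk representation."""
    return {
        "unicode_utf8": len(unicode_term.encode("utf-8")),
        # "utf-16-le" avoids counting a byte-order mark.
        "unicode_utf16": len(unicode_term.encode("utf-16-le")),
        "ewts_utf8": len(ewts_term.encode("utf-8")),
    }


sizes = encoded_sizes(tibetan, ewts)
print(sizes)
# {'unicode_utf8': 21, 'unicode_utf16': 14, 'ewts_utf8': 9}
```

For this one word the EWTS form is a bit under half the size of the UTF-8 Unicode form. The ratio will vary from term to term, so a real measurement on an actual index would be needed before drawing conclusions.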
