Coalesced words - tokenization #76

Open
drahnr opened this issue Jul 12, 2021 · 1 comment

@drahnr
Contributor

drahnr commented Jul 12, 2021

I've attempted to deal with contracted forms such as we've, I'd and it's as part of drahnr/cargo-spellcheck#186, which is a mere workaround.

Probably out of scope for nlprule, yet a pitfall in real-life usage.

Since nlprule is going to support spellchecking as well, it might be worth discussing, or at least keeping in mind.

@bminixhofer
Owner

Good point. This should already be handled by the spellchecking PR, but to be honest I'm not exactly sure how :) I'll make sure to check it when I revisit spellchecking.

Some rarer words that are tokenized as multiple tokens, like Côte d'Azur, L'Oréal and cc'ing, are covered by a special spelling.txt file, but right now I'm not sure how common contractions are handled.
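To make the problem concrete, here is a rough sketch of what coalescing contractions before spellchecking could look like. It is neither nlprule's nor cargo-spellcheck's actual implementation. The function name `coalesce_contractions`, the suffix list, and the assumption that the tokenizer emits the apostrophe as its own token (e.g. `we`, `'`, `ve`) are all hypothetical and only serve as illustration.

```rust
/// Hypothetical helper, not part of nlprule or cargo-spellcheck.
/// Merges `word ' suffix` token triples into one token when the suffix
/// is a known English contraction ending, so that a spellchecker sees
/// `we've` instead of `we`, `'`, `ve`.
fn coalesce_contractions(tokens: &[&str]) -> Vec<String> {
    // Assumed list of contraction suffixes; extend as needed.
    const SUFFIXES: [&str; 7] = ["ve", "d", "s", "ll", "re", "m", "t"];

    let mut out = Vec::with_capacity(tokens.len());
    let mut i = 0;
    while i < tokens.len() {
        // A contraction needs three consecutive tokens: word, apostrophe, suffix.
        let is_contraction = i + 2 < tokens.len()
            && !tokens[i].is_empty()
            && tokens[i].chars().all(char::is_alphabetic)
            && (tokens[i + 1] == "'" || tokens[i + 1] == "’")
            && {
                let suffix = tokens[i + 2].to_lowercase();
                SUFFIXES.iter().any(|s| *s == suffix.as_str())
            };

        if is_contraction {
            // Glue the three pieces back together, keeping the original apostrophe.
            out.push(format!("{}{}{}", tokens[i], tokens[i + 1], tokens[i + 2]));
            i += 3;
        } else {
            out.push(tokens[i].to_string());
            i += 1;
        }
    }
    out
}
```

Called with `["we", "'", "ve", "seen", "it"]`, this merges the first three tokens into `we've` and leaves the rest untouched. A fixed suffix list is the simplest option; names like Côte d'Azur or L'Oréal would not match it and would still need a dedicated word list such as the spelling.txt approach mentioned above.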
