Replacements not considering whitespace #186
Comments
I'm sure the Rust people have it covered :) Probably, in my case, other replacement items interfere (not that I could find one). It shouldn't be the nltk code though; it did not run at that phase. I'm at the final stage of blacklisting; after that, I'll dive deeper into this.
Yeah, as this is standard Rust behavior (probably a nice exercise on the side to figure out why that is the behavior), I think we'd need to handle this on our side. I can definitely see why we'd want to allow whitespace in the pattern so it does not interfere with occurrences where it should not be replaced. There are a few potential approaches to fix this here.
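To illustrate the standard-Rust behavior referred to above: `str::replace` matches raw substrings with no notion of word boundaries, so a pattern fires inside longer words too. A minimal boundary-aware variant (a sketch only; `replace_word` is a hypothetical helper, not part of this project) could look like this:

```rust
/// Replace `pat` with `to` only where it occurs as a whole word,
/// i.e. its neighbours are non-alphanumeric or the string edges.
/// (Hypothetical helper; not from this crate.)
fn replace_word(text: &str, pat: &str, to: &str) -> String {
    let mut out = String::with_capacity(text.len());
    let mut rest = text;
    let mut offset = 0; // byte offset of `rest` within `text`
    while let Some(i) = rest.find(pat) {
        let start = offset + i;
        let end = start + pat.len();
        // Check the characters adjacent to the match in the full text.
        let left_ok = text[..start]
            .chars()
            .next_back()
            .map_or(true, |c| !c.is_alphanumeric());
        let right_ok = text[end..]
            .chars()
            .next()
            .map_or(true, |c| !c.is_alphanumeric());
        if left_ok && right_ok {
            out.push_str(&rest[..i]);
            out.push_str(to);
        } else {
            // Not a whole word: copy the match through unchanged.
            out.push_str(&rest[..i + pat.len()]);
        }
        rest = &rest[i + pat.len()..];
        offset = end;
    }
    out.push_str(rest);
    out
}

fn main() {
    let s = "cat concatenate category";
    // std behavior: raw substring replacement hits inside words.
    assert_eq!(s.replace("cat", "dog"), "dog condogenate dogegory");
    // Boundary-aware variant only touches the standalone word.
    assert_eq!(replace_word(s, "cat", "dog"), "dog concatenate category");
    println!("ok");
}
```

A regex with `\b` anchors would do the same job, at the cost of escaping every pattern in the replace list.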
I'll leave this open as an improvement; I think it would come in useful.
Not necessarily. In my scripts for language models I had pre-, mid-, and post-replacement needs. Pre & post: mostly cleanup, correction, and normalization; no single words, full strings.
These, with all the apostrophe/suffix possibilities! A second problem here: what is a word boundary that is valid for all languages? Think of apostrophes or abbreviations, for example. The related rules (e.g. for apostrophes, or the change over time in "a" vs. "â" usage) and the source quality behind them are also questionable (i.e., are the authors and editors of those Wiki articles good enough?). Half of my blacklist is spelling errors. One could use word replacements for old texts, for example (an author from 75 years ago writes "aceba" where the modern form is "acaba", which can be handled), but Wiki is definitely chaotic.
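As a concrete illustration of the boundary problem above (a sketch, not project code): a simple "non-alphanumeric means word break" rule treats the apostrophe as a boundary, so a whole-word match for a stem like "aceba" would still fire inside an apostrophe-suffixed form such as "aceba'da", while accented letters like "â" are word characters in Unicode even though ASCII-only rules assume otherwise:

```rust
fn main() {
    // The apostrophe is not alphanumeric, so any naive boundary rule
    // splits "aceba'da" at the apostrophe and matches the stem alone.
    assert!(!'\''.is_alphanumeric());
    // "â" is a Unicode letter: it does not open a word boundary,
    // unlike what an ASCII-only \w / \b definition would assume.
    assert!('â'.is_alphanumeric());
}
```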
I think this would be a good addition. On the other hand, such replacements are risky to apply to a full-scale resource. The whole replace/regex business is error-prone, and every change should be checked and tested extensively. E.g., I encountered the mentioned problem while checking words for blacklisting: some weird ones ending with "...örnek" or "...sayfa".
This did not happen in further tests; it was probably a mistake in my regex formats (they changed a lot over time).
I'm closing this for now, as the playground example now also shows the correct behavior.
I have several settings for abbreviations in the replace array, such as:
I did put a space in front of them to prevent substring replacement, but somehow it did not prevent it. Looking at the code, which does a simple global string replace (and the rules are not pre-processed anywhere), I could not pinpoint the reason.
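Two possible reasons the leading space does not fully guard the patterns (a sketch; `" alt"` is a hypothetical list entry, the real replace array is not shown here): the padding only blocks mid-word matches on the left, not a match at the start of a longer word, and it only matches an ASCII space, so occurrences after tabs, newlines, or non-breaking spaces are skipped entirely:

```rust
fn main() {
    let pat = " alt"; // hypothetical space-padded entry

    // The leading space does not stop the pattern from matching
    // at the start of a longer word:
    let text = "the altitude rose";
    assert_eq!(text.replace(pat, " ALT"), "the ALTitude rose");

    // Occurrences preceded by a tab or a non-breaking space (or at
    // the start of a line) are not preceded by an ASCII space, so
    // the padded pattern never matches them at all:
    let text2 = "alt\talt\u{00A0}alt";
    assert_eq!(text2.replace(pat, " ALT"), text2); // unchanged
}
```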
Here is a sample original text part:
And this is what it becomes:
There are many samples of similar replacements; I recognized them while checking non-dictionary words. Very weird...