Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replacements not considering whitespace #186

Closed
HarikalarKutusu opened this issue Jun 18, 2023 · 6 comments
Closed

Replacements not considering whitespace #186

HarikalarKutusu opened this issue Jun 18, 2023 · 6 comments
Labels
enhancement New feature or request help wanted Extra attention is needed rules

Comments

@HarikalarKutusu
Copy link
Contributor

I have several settings for abbr in replace array, such as:

  [" sf.", " sayfa"],
  [" dk.", " dakika"],
  [" ör.", " örnek"],

I did put a space in front of them to prevent substring replacement, but somehow it did not prevent that. Looking at the code, which does a simple global string-replace (and Rules are not pre-processed anywhere) I could not pinpoint the reason.

Here is a sample original text part:

..30 mermi kapasiteli yeni bir şarjör. 30 mermilik bu şarjörü...

And this is what it becomes:

..30 mermi kapasiteli yeni bir şarjörnek 30 mermilik bu şarjörü...

There are many samples of similar replacements, recognized them while checking non-dictionary words... Very weird...

@MichaelKohler MichaelKohler added help wanted Extra attention is needed needs debugging labels Jun 19, 2023
@MichaelKohler
Copy link
Member

Simplified test case: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=a7c7bf52aa926a63c0bbbc3d87eeecfe

@HarikalarKutusu
Copy link
Contributor Author

I'm sure Rust people got it covered :) Probably, in my case, other replacement items interfere (not that I could find one)... It shouldn't be the nltk code thou, it did not run at that phase.

I'm at the final stage of blacklisting, after that, I'll dive deeper into this.

@MichaelKohler
Copy link
Member

Yeah, as this is standard Rust behavior (probably a nice exercise on the side to figure out why that is the behavior) I think we'd need to handle this on our side. I definitely can see why we'd want to allow a whitespace in the pattern to not interfere with occurances where it should not be replaced. Potential approaches here to fix this:

  • is there a way to replace only on word boundaries? Is this what we'd eventually want?
  • should replacements additionally support Regex and therefore we could handle these cases with Regex?

I'll leave this open as improvement, I think this would come in useful.

@MichaelKohler MichaelKohler added the enhancement New feature or request label Jun 19, 2023
@MichaelKohler MichaelKohler changed the title Weird bug in replace Replacements not considering whitespace Jun 19, 2023
@HarikalarKutusu
Copy link
Contributor Author

is there a way to replace only on word boundaries? Is this what we'd eventually want?

Not necessarily. In my scripts for Language Models, I had pre-, mid- and, post-replacement needs

pre & post: Mostly cleanup & correction & normalization - no words, full strings.
mid: One or multi-word replacements. E.g.:

  • "21. yy. => yirmi birinci yüzyıl" (to replace centuries)
  • number => string conversion (or back)
  • spelling corrections.

These, with all apostrophe/suffix possibilities!

A second problem here: What is a word boundary - which is valid for all languages? Think of apostrophes or abbreviations for example.

And related rules (e.g. for apostrophes or "a" vs "â" usage change in time) and source quality of those are also questionable (i.e. are the authors & editors of those Wiki articles good enough?). Half of my blacklist is spelling errors.

One could use word replacements for old texts for example (e.g. an author from 75 years ago uses "aceba" which is "acaba" in modern form, which can be handled), but Wiki is definitely chaotic.

should replacements additionally support Regex and therefore we could handle these cases with Regex?

I think this will be a good addition. On the other hand, they are risky to work on a full-scale resource.

The whole replace/regex business is open to errors and every change should be checked & tested extensively. E.g. I encountered the mentioned problem while checking words for blacklisting, some weird ones ending with "...örnek" or "...sayfa".

@HarikalarKutusu
Copy link
Contributor Author

This did not happen on further tests, it was probably a mistake in my regex formats (changed a lot in time).
You may like to close this - unless you may like to keep it open for the regex option - which would be a great option if someone tests them carefully.

@MichaelKohler
Copy link
Member

I'm closing this for now, as it seems that now also the playground example has the correct behavior.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request help wanted Extra attention is needed rules
Projects
None yet
Development

No branches or pull requests

2 participants