Replacements not considering whitespace #186

HarikalarKutusu · 2023-06-18T22:11:50Z

I have several abbreviation rules in the replace array, such as:

  [" sf.", " sayfa"],
  [" dk.", " dakika"],
  [" ör.", " örnek"],

I put a space in front of each pattern to prevent substring replacement, but somehow that did not prevent it. Looking at the code, which does a simple global string replace (and the rules are not pre-processed anywhere), I could not pinpoint the reason.

Here is a sample original text part:

..30 mermi kapasiteli yeni bir şarjör. 30 mermilik bu şarjörü...

And this is what it becomes:

..30 mermi kapasiteli yeni bir şarjörnek 30 mermilik bu şarjörü...

There are many similar replacements; I noticed them while checking non-dictionary words... Very weird...


MichaelKohler · 2023-06-19T11:12:13Z

Simplified test case: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=a7c7bf52aa926a63c0bbbc3d87eeecfe
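The contents of the playground gist are not reproduced here. As a minimal sketch, assuming the replacement step boils down to calling `str::replace` once per rule (the helper name `apply_rules` is made up for illustration), the following shows what plain substring replacement does with the sample text from the report. With the leading space in the pattern the text is left untouched; the reported output is what a rule without the leading space would produce.

```rust
/// Apply each (pattern, replacement) rule in order using plain substring
/// replacement. Hypothetical helper, not the project's actual code.
fn apply_rules(text: &str, rules: &[(&str, &str)]) -> String {
    let mut out = text.to_string();
    for &(pattern, replacement) in rules {
        out = out.replace(pattern, replacement);
    }
    out
}

fn main() {
    let text = "..30 mermi kapasiteli yeni bir şarjör. 30 mermilik bu şarjörü...";

    // Rule as given in the report: "ör." is preceded by "j", not by a space,
    // so the pattern does not match and the text stays unchanged.
    let with_space = apply_rules(text, &[(" ör.", " örnek")]);
    println!("{}", with_space);
    // prints: ..30 mermi kapasiteli yeni bir şarjör. 30 mermilik bu şarjörü...

    // Same rule without the leading space (an assumption, for comparison):
    // this reproduces the output described in the report.
    let without_space = apply_rules(text, &[("ör.", "örnek")]);
    println!("{}", without_space);
    // prints: ..30 mermi kapasiteli yeni bir şarjörnek 30 mermilik bu şarjörü...
}
```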

HarikalarKutusu · 2023-06-19T11:22:49Z

I'm sure the Rust people have it covered :) Probably, in my case, other replacement items interfere (not that I could find one)... It shouldn't be the nltk code though, since it did not run at that phase.

I'm at the final stage of blacklisting; after that, I'll dive deeper into this.

MichaelKohler · 2023-06-19T14:30:22Z

Yeah, as this is standard Rust behavior (figuring out why is probably a nice exercise on the side), I think we'd need to handle this on our side. I can definitely see why we'd want to allow whitespace in the pattern so that it does not match occurrences that should not be replaced. Potential approaches to fix this:

Is there a way to replace only on word boundaries? Is this what we'd eventually want?
Should replacements additionally support regex, so that we could handle these cases with a regex?

I'll leave this open as an improvement; I think it would come in useful.
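To make the first approach concrete, here is a rough sketch of boundary-aware replacement using only the standard library. The function name `replace_at_boundaries` and the boundary definition (a match counts only if it is not directly adjacent to an alphanumeric character) are assumptions for illustration, not something the project implements.

```rust
/// Replace `pattern` with `replacement` only where the match is not glued to
/// an alphanumeric character on either side. Hypothetical helper.
fn replace_at_boundaries(text: &str, pattern: &str, replacement: &str) -> String {
    if pattern.is_empty() {
        return text.to_string();
    }
    let mut out = String::with_capacity(text.len());
    let mut last = 0; // byte offset up to which `text` has been copied
    for (start, matched) in text.match_indices(pattern) {
        let end = start + matched.len();
        // Character right before / right after the match, if any.
        let before_ok = text[..start]
            .chars()
            .next_back()
            .map_or(true, |c| !c.is_alphanumeric());
        let after_ok = text[end..]
            .chars()
            .next()
            .map_or(true, |c| !c.is_alphanumeric());
        if before_ok && after_ok {
            out.push_str(&text[last..start]);
            out.push_str(replacement);
            last = end;
        }
    }
    out.push_str(&text[last..]);
    out
}

fn main() {
    // "ör." inside "şarjör." is preceded by a letter, so it is skipped.
    println!("{}", replace_at_boundaries("yeni bir şarjör. 30", "ör.", "örnek"));
    // prints: yeni bir şarjör. 30

    // A standalone abbreviation is replaced.
    println!("{}", replace_at_boundaries("bkz. ör. 3", "ör.", "örnek"));
    // prints: bkz. örnek 3
}
```

With this approach the rule no longer needs a leading space in the pattern, but the result depends entirely on how a boundary is defined.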

HarikalarKutusu · 2023-06-19T21:40:49Z

> is there a way to replace only on word boundaries? Is this what we'd eventually want?

Not necessarily. In my scripts for language models, I had pre-, mid-, and post-replacement needs:

pre & post: Mostly cleanup & correction & normalization - no words, full strings.
mid: One or multi-word replacements. E.g.:

"21. yy. => yirmi birinci yüzyıl" (to replace centuries)
number => string conversion (or back)
spelling corrections.

These, with all apostrophe/suffix possibilities!
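As a rough illustration of the pre/mid/post split described above, the sketch below applies three rule lists in a fixed order. The function names, the specific pre and post rules, and the sample sentence are assumptions made up for this example; only the "21. yy." rule comes from the list above.

```rust
/// Apply one stage of plain substring rules in order. Hypothetical helper.
fn apply_stage(text: &str, rules: &[(&str, &str)]) -> String {
    let mut out = text.to_string();
    for &(pattern, replacement) in rules {
        out = out.replace(pattern, replacement);
    }
    out
}

/// Run the three stages in order: cleanup first, word-level rules next,
/// final normalization last.
fn run_pipeline(
    text: &str,
    pre: &[(&str, &str)],
    mid: &[(&str, &str)],
    post: &[(&str, &str)],
) -> String {
    let cleaned = apply_stage(text, pre);
    let expanded = apply_stage(&cleaned, mid);
    apply_stage(&expanded, post)
}

fn main() {
    // Assumed example rules: normalize the typographic apostrophe (pre),
    // expand a century abbreviation (mid), collapse double spaces (post).
    let pre = [("\u{2019}", "'")];
    let mid = [("21. yy.", "yirmi birinci yüzyıl")];
    let post = [("  ", " ")];

    let out = run_pipeline("Bu  21. yy. başında", &pre, &mid, &post);
    println!("{}", out);
    // prints: Bu yirmi birinci yüzyıl başında
}
```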

A second problem here: what is a word boundary, defined in a way that is valid for all languages? Think of apostrophes or abbreviations, for example.
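A small sketch of why this matters: the same text splits into different "words" depending on whether the apostrophe counts as a word character. The helper `split_words` and the sample phrase are made up for illustration.

```rust
/// Split `text` into words, where `is_word_char` decides which characters
/// belong to a word. Hypothetical helper for comparing boundary definitions.
fn split_words<F: Fn(char) -> bool>(text: &str, is_word_char: F) -> Vec<&str> {
    text.split(|c: char| !is_word_char(c))
        .filter(|w| !w.is_empty())
        .collect()
}

fn main() {
    let text = "Ankara'nın 21. yy. tarihi";

    // Definition 1: only alphanumeric characters are part of a word,
    // so the apostrophe cuts the suffix off.
    let naive = split_words(text, |c| c.is_alphanumeric());
    println!("{:?}", naive);
    // prints: ["Ankara", "nın", "21", "yy", "tarihi"]

    // Definition 2: the apostrophe is also a word character,
    // so the suffixed form stays in one piece.
    let with_apostrophe = split_words(text, |c| c.is_alphanumeric() || c == '\'');
    println!("{:?}", with_apostrophe);
    // prints: ["Ankara'nın", "21", "yy", "tarihi"]
}
```

Neither definition keeps "yy." together with its period, which is the abbreviation side of the same problem.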

The related rules (e.g. for apostrophes, or "a" vs. "â" usage) change over time, and the quality of the sources is also questionable (i.e. are the authors and editors of those Wiki articles good enough?). Half of my blacklist is spelling errors.

One could use word replacements for old texts, for example (e.g. an author from 75 years ago uses "aceba", which is "acaba" in modern form, and that can be handled), but Wiki is definitely chaotic.

> should replacements additionally support Regex and therefore we could handle these cases with Regex?

I think this would be a good addition. On the other hand, regexes are risky to apply to a full-scale resource.

The whole replace/regex business is error-prone, and every change should be checked and tested extensively. E.g. I encountered the problem above while checking words for blacklisting: some weird ones ending with "...örnek" or "...sayfa".

HarikalarKutusu · 2023-06-29T13:05:16Z

This did not happen in further tests; it was probably a mistake in my regex formats (which changed a lot over time).
You may want to close this, unless you'd like to keep it open for the regex option, which would be a great addition if someone tests it carefully.

MichaelKohler · 2023-08-12T12:03:35Z

I'm closing this for now, as it seems the playground example now also shows the correct behavior.

MichaelKohler added help wanted and needs debugging labels Jun 19, 2023

MichaelKohler added rules and removed needs debugging labels Jun 19, 2023

MichaelKohler added the enhancement label Jun 19, 2023

MichaelKohler changed the title ~~Weird bug in replace~~ Replacements not considering whitespace Jun 19, 2023

MichaelKohler closed this as completed Aug 12, 2023
