Fix stripping (japanese) unicode characters #879

glubsy · 2021-04-29T03:23:07Z

This "stripping of accents" doesn't look right to me. It would be just as good to keep words accentuated and compare them that way.

For example, "ébé.mp3" and "ebe.mp3" appear in the results as "100%" match, which obviously isn't right at all.

This algorithm might be a bit dated now, and perhaps there is some sort of technical debt left from the python2 to python3 migration (since Unicode strings are now the default).

Maybe it could be safe to remove this altogether, I'm not sure. For now I've left it as is, but it might be worth considering.

* Accents are getting removed from Unicode characters to generate similar "words". * Non-latin characters which cannot be processed that way (eg. japanese, greek, russian, etc.) should not be filtered out at all otherwise files are erroneously skipped or detected as dupes if only some characters make it passed the filter. * Starting from an arbitrary unicode codepoint (converted to decimal), above which we know it is pointless to try any sort of processing, we leave the characters as is. * Fix arsenetar#878.

glubsy · 2021-04-29T03:25:33Z

core/engine.py

+    # HACK We shouldn't ignore non-ascii characters altogether. Any Unicode char
+    # above common european characters that cannot be "sanitized" (ie. stripped
+    # of their accents, etc.) are preserved as is. The arbitrary limit is
+    # obtained from this one: ord("\u037e") GREEK QUESTION MARK
    s = "".join(


This is the statement that could be dropped maybe. But this would require some tweak in the test suite as well (fortunately, not a lot, as I have already tried).

arsenetar · 2021-04-29T03:38:32Z

Last time I looked at the file name comparison based matches it seemed in general unicode was not really supported due to the way the various functions work. Probably worth a look over as likely to be other issues with unicode values within. I'm wondering here if we just remove stripping out characters completely and just allow all unicode characters to pass, things like emoticons in filenames may cause similar issues.

arsenetar · 2021-05-26T00:11:15Z

I'm going to merge this, but really this character processing probably needs to be revisited.

glubsy · 2021-05-26T01:16:31Z

I agree. I hope to find some time later to look it over more thoroughly.

glubsy commented Apr 29, 2021

View reviewed changes

Fix including character at the border

f1ae478

arsenetar merged commit 0b46ca2 into arsenetar:master May 26, 2021

glubsy deleted the fix_unicode branch June 19, 2021 17:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix stripping (japanese) unicode characters #879

Fix stripping (japanese) unicode characters #879

glubsy commented Apr 29, 2021

glubsy Apr 29, 2021 •

edited

Loading

arsenetar commented Apr 29, 2021

arsenetar commented May 26, 2021

glubsy commented May 26, 2021

Fix stripping (japanese) unicode characters #879

Fix stripping (japanese) unicode characters #879

Conversation

glubsy commented Apr 29, 2021

glubsy Apr 29, 2021 • edited Loading

Choose a reason for hiding this comment

arsenetar commented Apr 29, 2021

arsenetar commented May 26, 2021

glubsy commented May 26, 2021

glubsy Apr 29, 2021 •

edited

Loading