Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix stripping (japanese) unicode characters #879

Merged
merged 2 commits into from
May 26, 2021

Conversation

glubsy
Copy link
Contributor

@glubsy glubsy commented Apr 29, 2021

This "stripping of accents" doesn't look right to me. It would be just as good to keep words accentuated and compare them that way.

For example, "ébé.mp3" and "ebe.mp3" appear in the results as "100%" match, which obviously isn't right at all.

This algorithm might be a bit dated now, and perhaps there is some sort of technical debt left from the python2 to python3 migration (since Unicode strings are now the default).

Maybe it could be safe to remove this altogether, I'm not sure. For now I've left it as is, but it might be worth considering.

* Accents are getting removed from Unicode characters to generate similar "words".
* Non-latin characters which cannot be processed that way (eg. japanese, greek, russian, etc.) should not be filtered out at all otherwise files are erroneously skipped or detected as dupes if only some characters make it passed the filter.
* Starting from an arbitrary unicode codepoint (converted to decimal), above which we know it is pointless to try any sort of processing, we leave the characters as is.
* Fix arsenetar#878.
# HACK We shouldn't ignore non-ascii characters altogether. Any Unicode char
# above common european characters that cannot be "sanitized" (ie. stripped
# of their accents, etc.) are preserved as is. The arbitrary limit is
# obtained from this one: ord("\u037e") GREEK QUESTION MARK
s = "".join(
Copy link
Contributor Author

@glubsy glubsy Apr 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the statement that could be dropped maybe. But this would require some tweak in the test suite as well (fortunately, not a lot, as I have already tried).

@arsenetar
Copy link
Owner

Last time I looked at the file name comparison based matches it seemed in general unicode was not really supported due to the way the various functions work. Probably worth a look over as likely to be other issues with unicode values within. I'm wondering here if we just remove stripping out characters completely and just allow all unicode characters to pass, things like emoticons in filenames may cause similar issues.

@arsenetar
Copy link
Owner

I'm going to merge this, but really this character processing probably needs to be revisited.

@arsenetar arsenetar merged commit 0b46ca2 into arsenetar:master May 26, 2021
@glubsy
Copy link
Contributor Author

glubsy commented May 26, 2021

I agree. I hope to find some time later to look it over more thoroughly.

@glubsy glubsy deleted the fix_unicode branch June 19, 2021 17:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants