Malformed UTF-8 characters when no default character is set - fails to "fall through" #14

Closed
eribertomota opened this issue Dec 13, 2018 · 7 comments
@eribertomota

From Debian bug 861537[1].

[1] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=861537

It seems there is a patch in a Detox fork[2].

[2] https://github.com/mikrosimage/detox/tree/1.3.0.mikros

Cheers,

Eriberto

@dharple dharple mentioned this issue Jan 13, 2020
@dharple
Owner

dharple commented Jan 13, 2020

Thanks for the heads up @eribertomota. I'm looking into this.

@eribertomota
Author

Thanks! Can you release a new version? Debian will release version 11 soon and I have only a few days left to send new uploads.

@dharple
Owner

dharple commented Jan 30, 2021

Took me forever to remember how to do it all, but, yeah, I'm tagging v1.3.1 now.

Interestingly enough, the bug showed up in my extremely simple tests. The tests literally just do a dry run against a bunch of files and then diff the output between versions.
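
A version-to-version comparison of this kind could be sketched roughly as follows. This is a minimal illustration and not the project's actual test script: `capture` and `diff_runs` are hypothetical helper names, the command passed to `capture` is whatever dry-run invocation each build needs, and the result is a unified diff rather than the normal-format diff quoted in this comment.

```python
import difflib
import subprocess


def capture(cmd):
    """Run one build's dry-run command and return its stdout as lines.

    Bytes that are not valid UTF-8 are kept visible as backslash escapes
    instead of raising, since broken filenames are what the test looks for.
    """
    result = subprocess.run(
        cmd,
        capture_output=True,
        encoding="utf-8",
        errors="backslashreplace",
        check=True,
    )
    return result.stdout.splitlines()


def diff_runs(old_lines, new_lines, old_label="old", new_label="new"):
    """Return unified-diff lines between two captured outputs."""
    return list(
        difflib.unified_diff(
            old_lines, new_lines, fromfile=old_label, tofile=new_label, lineterm=""
        )
    )
```

An empty list from `diff_runs` means the two builds produced identical output; any other result is a candidate regression to inspect by hand.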

The "<C3>" and "<C2>" in the filenames below, generated by v1.3.0, are the bug.

The terminal doesn't render them, but they show up inside of less.

checking: lower
1c1
< detox 1.3.0
---
> detox 1.3.1
43,52c43,52
< sample_test_bed/iso8859/<C3>_capital_ae
< sample_test_bed/iso8859/<C2>_copy
< sample_test_bed/iso8859/<C2>_ellipsis
< sample_test_bed/iso8859/<C2>_reg
< sample_test_bed/iso8859/<C3><C3><C3>_thin_<C2><C3><C3><C3>_south
< sample_test_bed/iso8859/<C3>_latin_small_letter_y_with_diaeresis
< sample_test_bed/iso8859/<C2>_yen
< sample_test_bed/iso8859/<C2>_trade_mark
< sample_test_bed/iso8859/<C2>_pound
< sample_test_bed/iso8859/<C2>_cent
---
> sample_test_bed/iso8859/®_reg
> sample_test_bed/iso8859/¢_cent
> sample_test_bed/iso8859/Æ_capital_ae
> sample_test_bed/iso8859/<U+0085>_ellipsis
> sample_test_bed/iso8859/ÐÏÑ_thin_<U+008A>ØÙÞ_south
> sample_test_bed/iso8859/<U+0099>_trade_mark
> sample_test_bed/iso8859/ÿ_latin_small_letter_y_with_diaeresis
> sample_test_bed/iso8859/¥_yen
> sample_test_bed/iso8859/©_copy
> sample_test_bed/iso8859/£_pound
54,63c54,63
< sample_test_bed/unicode/<C3><C2>_reg
< sample_test_bed/unicode/<C3><C2>_copy
< sample_test_bed/unicode/<C3><C2>_trade_mark
< sample_test_bed/unicode/<C3><C2>_cent
< sample_test_bed/unicode/<C3><C2>_latin_small_letter_y_with_diaeresis
< sample_test_bed/unicode/<C3><C2>_yen
< sample_test_bed/unicode/<C3><C2>_ellipsis
< sample_test_bed/unicode/<C3><C2><C3><C2><C3><C2>_thin_<C3><C2><C3><C2><C3><C2><C3><C2>_south
< sample_test_bed/unicode/<C3><C2>_pound
< sample_test_bed/unicode/<C3><C2>_capital_ae
---
> sample_test_bed/unicode/Â¥_yen
> sample_test_bed/unicode/Ã<U+0086>_capital_ae
> sample_test_bed/unicode/¢_cent
> sample_test_bed/unicode/Ã<U+0090>Ã<U+008F>Ã<U+0091>_thin_Â<U+008A>Ã<U+0098>Ã<U+0099>Ã<U+009E>_south
> sample_test_bed/unicode/£_pound
> sample_test_bed/unicode/®_reg
> sample_test_bed/unicode/©_copy
> sample_test_bed/unicode/ÿ_latin_small_letter_y_with_diaeresis
> sample_test_bed/unicode/Â<U+0099>_trade_mark
> sample_test_bed/unicode/Â<U+0085>_ellipsis
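
The `<C3>` and `<C2>` markers above are how less displays bytes that do not form valid UTF-8. A similar view can be produced without less. The sketch below assumes the stray bytes are lone UTF-8 lead bytes (0xC3, 0xC2), as the rendering suggests; the handler name `angle_hex` and the function `visible` are made up for this example.

```python
import codecs


def _angle_hex(err):
    """Decode-error handler: show each undecodable byte as <XX>."""
    bad = err.object[err.start:err.end]
    return "".join(f"<{b:02X}>" for b in bad), err.end


# Register the handler under a hypothetical name so decode() can use it.
codecs.register_error("angle_hex", _angle_hex)


def visible(name: bytes) -> str:
    """Decode a filename as UTF-8, marking invalid bytes instead of hiding them."""
    return name.decode("utf-8", errors="angle_hex")


print(visible(b"\xc3_capital_ae"))   # <C3>_capital_ae
print(visible(b"\xc3\xc2_reg"))      # <C3><C2>_reg
```

Valid UTF-8 passes through unchanged, so a clean filename prints exactly as it would in the terminal.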

@dharple
Owner

dharple commented Jan 30, 2021

Hey Eriberto,

v1.3.1 is tagged and ready for review. Let me know if you encounter any problems.

Doug

@eribertomota
Author

eribertomota commented Jan 31, 2021 via email

@dharple dharple reopened this Jan 31, 2021
@dharple
Owner

dharple commented Jan 31, 2021

Eriberto,

You're absolutely correct. There were at least two off-by-one errors in the UTF-8 translation. It is working correctly for me now, using the test file "mÉ Æ.txt". If I have no translation for Unicode values 0x00c9 or 0x00c6, nothing happens to them.

So, using the test table and detoxrc from the original Debian bug, I get:

$ rm m*.txt ; touch  "mÉ Æ.txt" ; ~/work/detox/src/detox -vs gnu *.txt
Scanning: mÉ Æ.txt
mÉ Æ.txt -> mÉ_Æ.txt
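
The "fall through" behaviour shown above (characters with no table entry are left alone) can be illustrated with a small sketch. It is written in Python for brevity and is not detox's code: `TABLE`, `utf8_width`, and `translate` are hypothetical, and the table only maps a space to an underscore rather than reproducing the table from the Debian bug.

```python
# Hypothetical translation table: code point -> replacement text.
# It deliberately has no entries for U+00C9 or U+00C6.
TABLE = {0x20: "_"}


def utf8_width(lead: int) -> int:
    """Number of bytes in the UTF-8 sequence that starts with `lead`."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    return 1  # stray continuation or invalid byte: treat as a single unit


def translate(name: bytes, table=TABLE) -> bytes:
    """Apply `table` to each character; pass unknown characters through."""
    out = bytearray()
    i = 0
    while i < len(name):
        width = utf8_width(name[i])
        chunk = name[i:i + width]
        try:
            code_point = ord(chunk.decode("utf-8"))
        except UnicodeDecodeError:
            # Malformed or truncated sequence: copy one byte and move on.
            out += name[i:i + 1]
            i += 1
            continue
        replacement = table.get(code_point)
        if replacement is None:
            out += chunk  # fall through: keep all `width` bytes
        else:
            out += replacement.encode("utf-8")
        i += width
    return bytes(out)


print(translate("mÉ Æ.txt".encode("utf-8")).decode("utf-8"))  # mÉ_Æ.txt
```

The fall-through branch has to copy the whole sequence and advance by exactly `width`. Copying or skipping one byte too few or too many there is the kind of slip that would leave a lone lead byte in the output; whether that is precisely what went wrong in v1.3.0 is not stated in this thread.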

@eribertomota
Author

Working fine now!

Uploaded to Debian.

Thank you very much.

@dharple dharple assigned dharple and unassigned dharple Feb 12, 2021