--noctrl option tries to replace unicode right single quotation mark #14

Open
averms opened this issue Nov 18, 2021 · 6 comments

averms commented Nov 18, 2021

Let's say I have a file called bob’s.txt. I run prename on it and get:

• prename -n --noctrl bob’s.txt
'bob’s.txt' would be renamed to 'bob�_s.txt'

Actually running the command would give the file a name consisting of the bytes 62 6F 62 E2 5F 73 2E 74 78 74, which is invalid UTF-8 because the E2 lead byte is no longer followed by its two continuation bytes. I don't think this should happen.
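The byte sequence above can be reproduced outside of rename. What follows is a minimal Python sketch, not rename's actual Perl code. It assumes that, without -T, each byte of the filename is treated as one character, and that runs of control characters are collapsed into a single underscore (which is what the reported output suggests). The names `cntrl_run` and `is_valid_utf8` are made up for illustration.

```python
import re

# U+2019 RIGHT SINGLE QUOTATION MARK encodes to the bytes E2 80 99 in UTF-8.
raw = "bob\u2019s.txt".encode("utf-8")

# Assumption: the name is never decoded, so every byte is seen as one
# character (equivalent to a Latin-1 view of the bytes).
as_chars = raw.decode("latin-1")

# Unicode control characters (general category Cc): C0, DEL, and C1.
# 0x80 and 0x99 fall in the C1 range, so they match; 0xE2 does not.
cntrl_run = re.compile(r"[\x00-\x1f\x7f-\x9f]+")
renamed = cntrl_run.sub("_", as_chars).encode("latin-1")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(raw.hex(" "))            # 62 6f 62 e2 80 99 73 2e 74 78 74
print(renamed.hex(" "))        # 62 6f 62 e2 5f 73 2e 74 78 74
print(is_valid_utf8(renamed))  # False
```

The two continuation bytes 80 and 99 are swallowed as "control characters", leaving E2 stranded in front of the underscore.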

ap (Owner) commented Nov 18, 2021

That won’t happen if you add -T UTF-8.

I’m very much unsure whether that is the correct answer or just a workaround. Should rename try to do that for you? Should it try to do that for you but only under specific circumstances?
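Why -T UTF-8 avoids the problem can be sketched in Python. This assumes -T decodes the filename into characters before the substitution and encodes it again afterwards; it is an illustration of that idea, not rename's implementation, and `noctrl_transcoded` is a hypothetical helper name.

```python
import re

# Same character range as Unicode general category Cc (C0, DEL, C1).
CNTRL_RUN = re.compile(r"[\x00-\x1f\x7f-\x9f]+")


def noctrl_transcoded(raw: bytes, encoding: str = "utf-8") -> bytes:
    """Decode first, replace control-character runs, then re-encode."""
    text = raw.decode(encoding)
    return CNTRL_RUN.sub("_", text).encode(encoding)


original = "bob\u2019s.txt".encode("utf-8")
print(noctrl_transcoded(original) == original)  # True

# A real C1 control character (U+0085 NEXT LINE) is still replaced.
with_nel = "a\u0085b.txt".encode("utf-8")
print(noctrl_transcoded(with_nel))  # b'a_b.txt'
```

Once the bytes E2 80 99 have been decoded into the single character U+2019, the pattern no longer sees 0x80 or 0x99 at all, so the name is left alone.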

averms (Author) commented Nov 18, 2021

The larger issue is that --noctrl should, by default, only replace the 33 control characters defined by ASCII. This way it works with ASCII filenames and UTF-8 filenames. I'm very much a beginner at Perl so I won't try too hard to guess at what the program does, but it seems like the character class [:cntrl:] is somehow matching more than just the ASCII control characters.

Upon further reading it seems like [:cntrl:] is Unicode-aware with use feature 'unicode_strings' and rename automatically enables a feature bundle for the user's version of Perl. This breaks when the strings are not Unicode, which is the case when rename is run without any -T args.
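The "only the 33 ASCII control characters" behaviour proposed above could look roughly like this Python sketch. It works directly on bytes, so it does not depend on how the filename is encoded. The function name `noctrl_ascii` is hypothetical, and collapsing runs into a single underscore is an assumption carried over from the reported output.

```python
import re

# ASCII control characters only: 0x00-0x1F plus DEL (0x7F), 33 in total.
ASCII_CNTRL_RUN = re.compile(rb"[\x00-\x1f\x7f]+")


def noctrl_ascii(name: bytes) -> bytes:
    """Replace runs of ASCII control bytes with a single underscore."""
    return ASCII_CNTRL_RUN.sub(b"_", name)


matching = [b for b in range(256) if ASCII_CNTRL_RUN.fullmatch(bytes([b]))]
print(len(matching))  # 33

utf8_name = "bob\u2019s.txt".encode("utf-8")
print(noctrl_ascii(utf8_name) == utf8_name)  # True
print(noctrl_ascii(b"a\tb\x7f.txt"))         # b'a_b_.txt'
```

Because every byte of a multi-byte UTF-8 sequence is 0x80 or higher, a pattern limited to the ASCII range can never split one.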

ap (Owner) commented Nov 20, 2021

You’re right.

> The larger issue is that --noctrl should, by default, only replace the 33 control characters defined by ASCII.

I take it you’re saying it should use Unicode-aware [:cntrl:] under -T?

averms (Author) commented Nov 22, 2021

> You’re right.
>
> > The larger issue is that --noctrl should, by default, only replace the 33 control characters defined by ASCII.
>
> I take it you’re saying it should use Unicode-aware [:cntrl:] under -T?

For -T, I'm not really sure what the best path is. Currently rename does match all Unicode control characters under -T, so it would be a breaking change to restrict it to the control characters defined by ASCII.

I think the simplest solution is to no feature 'unicode_strings' if the user hasn't specified -T.

ap (Owner) commented Nov 22, 2021

> I think the simplest solution is to no feature 'unicode_strings' if the user hasn't specified -T.

Just for --noctrl or in general? Doing that in general would affect all patterns, including those in user code passed with -e.

averms (Author) commented Nov 23, 2021

> > I think the simplest solution is to no feature 'unicode_strings' if the user hasn't specified -T.
>
> Just for --noctrl or in general? Doing that in general would affect all patterns, including those in user code passed with -e.

In general. As it stands, enabling feature unicode_strings is incorrect when the internal encoding for a string is not UTF-8. Looking at https://metacpan.org/pod/Unicode::Semantics, I imagine one could devise many examples of rename(1) outputting broken UTF-8.
