Make readdir return unicode #11

Closed
Astara opened this Issue · 21 comments

3 participants

@Astara

Example prog:

> 'ls' -1|tail -6 |perl -CSD -e'use 5.14.0; use utf8::all;while (<>) {
print
}
print "opening dir\n";
opendir(my $dh, ".");
my @files =  grep { /^[^.]/ } readdir $dh;
my @sfiles=sort @files;
my $start= @sfiles-6;
for (my $i=$start; $i<@sfiles;++$i) {
        printf "%s\n", ${sfiles[$i]};
}
'
zwadobef.ttf
Ü-chan.ttf
みかちゃん-p.ttf
みかちゃん-pb.ttf
みかちゃん-ps.ttf
みかちゃん.ttf
opening dir
zwadobef.ttf
Ã-chan.ttf
ã¿ãã¡ãã-p.ttf
ã¿ãã¡ãã-pb.ttf
ã¿ãã¡ãã-ps.ttf
ã¿ãã¡ãã.ttf

The top part, starting with zwadobef.ttf down to "opening dir", is as UTF-8 as perl gets -- it's reading text in from STDIN and writing it out to STDOUT...

However, when it reads the same files itself and prints them to the screen, they come out mangled.

So perl can't really read unicode filenames. I did have a case where a prog tried to append an absolute path to the filenames and then wasn't able to open the resulting files, but in redoing the tests I couldn't reproduce it, so I dunno about that part of it -- if perl isn't treating those chars like unicode, there's no telling what might happen as soon as you start doing string ops on them...

@doherty
Owner

You will need to decode the filenames. readdir appears to be giving you some binary data, and you're trying to interpret that as a UTF8 string.

use Encode qw(decode);
opendir my $dh, '.' or die "Couldn't open directory '.': $!";
my @files = grep { /^[^.]/ } readdir $dh;
say $files[0];                      # bytes
my $u = decode("utf8", $files[0]);
say $u;                             # character (utf8) string

utf8::all currently doesn't do anything to make readdir read in UTF8 - but we should. Patches are welcome!

@Leont

readdir is an overridable builtin, that shouldn't be too hard. Making it play nice with something like autodie OTOH could be tricky.
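A global override along those lines might be sketched as follows. This is a hypothetical, list-context-only illustration -- not utf8::all's actual code -- and it assumes a POSIX system where filenames on disk are UTF-8 bytes:

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Install the override at compile time so that every readdir compiled
# after this point goes through the wrapper.
BEGIN {
    *CORE::GLOBAL::readdir = sub (*) {
        my $dh = shift;
        # Decode each raw byte string the builtin returns into a
        # character string (list context only, for brevity).
        map { decode_utf8($_) } CORE::readdir($dh);
    };
}

opendir my $dh, '.' or die "Couldn't open directory '.': $!";
my @files = readdir $dh;    # now character strings, not bytes
closedir $dh;
```

Note the override must be installed in a `BEGIN` block (or from a module's `import`), because `CORE::GLOBAL` overrides only affect code compiled after the override exists.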

@Astara
@Leont

The information readdir is giving me is already in UTF-8 format -- bytewise, but perl treats it as Latin1...

readdir should be considered binary data unless specified otherwise, at least on non-darwin unices. Perl's current behavior is perfectly sane there. The mojibake is showing up because you're printing that binary data to a utf8 filehandle. Either you must handle it as binary, or as unicode, but mixing them is a recipe for trouble.
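Decoding those bytes is lossless, which a round trip demonstrates. The filename below is a made-up example matching the ones in this thread:

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

my $bytes = "\xC3\x9C-chan.ttf";   # raw UTF-8 bytes, as readdir returns them
my $text  = decode_utf8($bytes);   # now a character string

printf "characters: %d\n", length $text;    # 10 characters
printf "bytes:      %d\n", length $bytes;   # 11 bytes
print  "lossless\n" if encode_utf8($text) eq $bytes;
```

The byte string and the character string carry the same information; only `length`, the regex engine, and printing to a `:utf8` handle treat them differently.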

@Astara

Um.... "ls" seems to be taking that same data and displaying it correctly. Could you explain how it is doing the wrong thing?

Given that all of the utilities like du, ls, and stat take the data on disk and transport it unaltered to the console -- why is it that perl is corrupting the data?

@doherty
Owner

@Leont: Yes, playing nicely with autodie is definitely a requirement. Do you have any advice on that?

@Astara

Your comment doesn't appear to be related to the subject at hand.

What does autodie have to do with the price of rice in China, or with this issue?

@Astara

Here's an updated version of the initial program with autodie ':all', full warnings, and use strict -- just to avert any more questions about perl 'thinking' it is doing anything wrong by corrupting output. Notice there is no difference in execution (NOTE: with or without -CSD, the output is the same):

'ls' -1|tail -6 |perl -CSD  -wE'use autodie qw(:all);use strict;use 5.14.0; use utf8::all;while (<>) {
print
}
print "opening dir\n";
opendir(my $dh, ".");
my @Files = grep { /^[^.]/ } readdir $dh;
my @sfiles=sort @Files;
my $start= @sfiles-6;
for (my $i=$start; $i<@sfiles;++$i) {
        printf "%s\n", ${sfiles[$i]};
}
'
zwadobef.ttf
Ü-chan.ttf
みかちゃん-p.ttf
みかちゃん-pb.ttf
みかちゃん-ps.ttf
みかちゃん.ttf
opening dir
zwadobef.ttf
�-chan.ttf
�����-p.ttf
�����-pb.ttf
�����-ps.ttf
�����.ttf
@doherty
Owner

@Astara: Sorry, I think you haven't followed what we've been talking about.

utf8::all currently doesn't make readdir give you utf8. It will still give you bytes when utf8::all is in scope. You'll need to decode it manually, as I showed you in my first comment. When you got the filenames piped in over STDIN, they were in UTF-8 because utf8::all was in scope. So there's the difference you're looking for.

Now, we can potentially fix utf8::all so it does that for you, however it will involve overriding CORE::readdir. That's fine - unless something else (like autodie) is also trying to override CORE::readdir. Then things might get tricky - and I've asked Leon for advice on that question, if he has any.

@Leont

Yes, playing nicely with autodie is definitely a requirement. Do you have any advice on that?

Both pragmas should check whether the functions they override are already overridden, and if so should wrap those instead of the core ones. It seems autodie does that already, actually.
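That cooperation pattern could be sketched like this (hypothetical code, not autodie's or utf8::all's actual source): capture whatever override is already installed and delegate to it, instead of clobbering it.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

BEGIN {
    # If another pragma already installed an override, wrap that;
    # otherwise fall back to the core builtin. (List context only,
    # for brevity.)
    my $inner = defined &CORE::GLOBAL::readdir
        ? \&CORE::GLOBAL::readdir
        : sub { CORE::readdir($_[0]) };

    # The redefine warning would only fire if an override already existed.
    no warnings 'redefine';
    *CORE::GLOBAL::readdir = sub (*) {
        map { decode_utf8($_) } $inner->(shift);
    };
}
```

Whichever pragma loads second then stacks on top of the first, rather than silently discarding its behaviour.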

@Astara

@doherty I also wasn't, exactly, responding to what you were talking about, but to Leont's statement:
readdir should be considered binary data unless specified otherwise, at least on non-darwin unices. Perl's current behavior is perfectly sane there.

To that I disagreed -- since if it is binary data, then I submit that perl doesn't know HOW it is encoded, and therefore should do nothing to 'convert' it -- OR it should treat all inputs as UTF-8 encoded and output them the same way.

In neither case should any translation from one character set to another be required or done.

That's my issue: data in can't be losslessly processed with perl because it assumes binary data is already in some specific format which it uses to convert to UTF-8.

That only works for those using a Latin-1 charset, and seems to me to be a symptom of using it as a default (or, restated, the presumption that the data is Latin-1)...

Am I making any sense? ;-)

(Added, looking at other calls/issues.)
While I would hope that you can make utf8::all work with every call, it's not JUST readdir -- there's also text coming in from a DBM (hopefully it is treated as UTF-8 and no conversion is done -- or it is left alone).
Others -- maybe the worst might be syscall: what is the output to/from a random syscall encoded in? You can't say -- and that includes not saying it is Latin-1 or UTF-8.

The only reason it MATTERS is that perl is often, if not most often, used as a text-processing language. To do that non-problematically, all binary input needs to be UTF-8 encoded -- and NOT range-checked for Unicode compatibility -- i.e. just use the UTF-8 encoding algorithm "blindly" at both ends (or not, by option). If I say use utf8::all, then the impact of that should be that functions like length and the other string functions treat the string as UTF-8, not cause any conversions.

If I don't ask it to use UTF-8, then it would use byte semantics in string functions...

OR -- most preferably -- the default becomes use UTF-8 -- and for chars that are undefined in Unicode, they match as 'undefined' if asked for Properties (alpha, number, symbol)...

Would that work or why wouldn't it work?

@doherty
Owner

I'm sorry, I will not copy-and-paste perlunitut here for you.

In this example, you are getting some bytes from readdir, and then printing them through a UTF-8 filehandle -- so you are the one doing the conversion. Please understand that the programmer must know what data they are working with, and in particular what encoding their text is in. Perl is not magic, and neither is utf8::all.

@Astara

I see, getting condescending now.

Try to keep it civil and what do I get...

Um... when I use utf8::all, that means I want all of my handles / streams to be treated as UTF-8.

A directory is just a file handle with special handling. That file handle fails to be treated as UTF-8 as I, the programmer, specifically specified by saying utf8::all. You hit the nail on the head -- I the programmer know that my directory names are UTF-8 encoded, and that my output is also UTF-8 encoded.

Unfortunately, a design flaw was introduced in perl 5.8 that treats input asymmetrically with output. Someone said treating all as UTF-8 caused problems, but maybe that was due to the conversion algorithm, I don't know.

I'm saying that as a programmer, I should be able to turn off that asymmetry -- ideally by default -- and that only under the control of a "use bytes" or use locale (russian/latin/whatever) would it assume otherwise.

It's the presumption of asymmetry that is my issue. I don't need to post the perlunicode man page or the design documents of UTF-8 here, do I? (snoot)...

The perlunitut is buggy... (I guess I should report that)...

It says:

    Encoding (as a verb) is the conversion from text to binary. To encode, you have to supply the target encoding, for example "iso-8859-1" or "UTF-8".

Tell me... if I have a wav file, why is lame called an mp3 encoder? Is a wav file text?

Nope.

What is 'text'? That is undefined. You have to know the source encoding as well as the target encoding in order to encode. See iconv (not the perl version -- as it is incompatible by design). But you will see that it requires both a "From" and a "To" encoding.

Or is there another section from the perlunitut you wanted me to proof?

@doherty
Owner

Yes, I understand what you want. And utf8::all doesn't do that yet. I already said we'd try to make readdir give you UTF-8 instead of binary, what more do you want?

@Astara

For you to not have to do it.

That perl wouldn't mangle it in the first place....

It's like it's hard enough with you having to go through contortions to handle conditions like this, but on top of that, there are a bunch of other issues potentially lurking and waiting to cause probs...

I just wish there was a way to tell perl to really assume UTF-8 itself, and to know it was designed into the language (or committed to being supported as part of Perl's core) rather than relying on a module that is outside the core.

On top of THAT, perl's unicode handling is still developing, so that's almost sure to destabilize something as comprehensive as this module is intended to be.

So ... what else would I like? to see this module as part of perl's core?
;-)

I know that's probably not something you can just flick a switch and make happen, but that's something that I think would be great to see...

@Leont

That's my issue: data in can't be losslessly processed with perl because it assumes binary data is already in some specific format which it uses to convert to UTF-8.

That is bullshit. Complete and utter bullshit. doherty already showed you how you can handle this in his first reply. There is no data loss. The fact that print won't work correctly until you've decoded your data doesn't change that.

To that I disagreed -- since if it is binary data, then I submit that perl doesn't know HOW it is encoded, and therefore should do nothing to 'convert' it -- OR it should treat all inputs as UTF-8 encoded and output them the same way.

I agree things could be better, but we have to deal with 25 years of history. Perl is older than Unicode.

Am I making any sense? ;-)

Not really, to be honest.

@Leont

I see, getting condescending now.

No, he's not. He's pointing you to the manual because you don't seem to have a deep understanding of how this stuff works (which is ok, this is confusing and contorted matter). You're not recognizing the limits of your knowledge, instead you appear to act with a sense of entitlement, getting mad at the author of a module that puts effort into fixing an issue you ran into.

@doherty doherty referenced this issue from a commit
@doherty Wrap CORE::readdir to provide UTF-8 filenames
readdir returns filenames as bytes, but that is
inconvenient. Now, we wrap readdir and decode the
returned filenames so the user is presented with
the UTF-8 filenames they expected.

Fixes GH #11
e3de34f
@doherty
Owner

@Leont: If you have a few moments, could you let me know if the attached commit looks any good?

@Astara

"I agree things could be better, but we have to deal with 25 years of history. Perl is older than Unicode."
So was ascii.
Notice how it is no longer used in modern standards.
Morse code used to be a standard at one time too.
Things that don't change go extinct.
The point I would make is that I had a mail filter prog that worked up to and through 5.8.0 -- then 5.8.1 shipped an incompatible 'fix' for the unicode conversion introduced in 5.8.0, and that BROKE existing code. Everyone whines about compatibility, but it is a sham. It was the introduction of a broken algorithm that broke compatibility.

I understand why they introduced the change in 5.8.1, because 5.8.0 broke binary compatibility with other programs that didn't encode everything in UTF-8. But they screwed it up when they decided to go with a broken output encoding algorithm that treats all values >256 as to-be-encoded Unicode values, and values <256 as binary. If they had gone all binary or all unicode, previous programs wouldn't have broken.

But 5.8 didn't come out 25 years ago...more like 10-15 years ago. The breakage that was implemented came long after UTF-8 was established.

Not that they care, their daddy had to protect them from mean me...laugh...
at least it gave me permission to do my own thing.

Sorry, read comment above the one you just sent and it inflamed a few wounds.

Really, the best thing for perl to have done would have been to be compatible with the rest of Linux's utils and "listen" to the environment. If they had done that, they never would have had the problems in 5.8.0 that they did. Now they are all too scared to do it right, so it won't ever get fixed until a perl5 fork comes out, since the current maintainers of perl5 are the reason Larry had to "fork" and create a perl6 -- because he would have had the same whining about compatibility that I've had to deal with. Guido van Rossum had to do the same with Python, but he didn't have as many old-timers to yell loudly when he had to re-architect things and issue a completely incompatible change. But the same thing happened there. Larry had the baggage of a larger and older (more conservative) community; it's still unclear when Perl6 will be ready to take the place of perl5... but those who maintain perl5 can't make the choices to keep it viable into the future either. Even MS learned that lesson when they dropped compat with Win98...

I don't know enough about the CORE wrap to know if that will work or not... but here's the rub: are there users besides me who already convert to UTF-8 when they readdir, and how will this code affect those programs?

@Leont

If you have a few moments, could you let me know if the attached commit looks any good?

You're not handling scalar context right (and why the «no warnings qw(redefine);»?). Otherwise it looks like it should work on Unix systems. Not sure what's going to happen on Windows, it may very well break badly there. Perl uses legacy APIs instead of Unicode ones there, for lack of someone fixing that.
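For reference, CORE::readdir in scalar context returns the next single filename (not a count), so a context-correct wrapper has to branch on wantarray. A sketch with a hypothetical helper name, not the module's actual code:

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Context-aware wrapper: decode every entry in list context; in
# scalar context decode the single next entry, passing undef through
# untouched at end-of-directory.
sub utf8_readdir {
    my $dh = shift;
    return map { decode_utf8($_) } CORE::readdir($dh) if wantarray;
    my $next = CORE::readdir($dh);
    return defined $next ? decode_utf8($next) : undef;
}
```

The `defined` guard matters twice over: it avoids decoding undef at end-of-directory, and callers looping with `while (my $f = ...)` should test `defined` anyway, since a file named "0" is false but valid.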

@doherty doherty referenced this issue from a commit
@doherty Mimic readdir's context-sensitivity
CORE::readdir is context-sensitive: in list context, it returns
all the files, which we emulated with utf8 goodness. But in
scalar context, it returns the next file, rather than the number
of files. Mimic this behaviour in utf8::all's wrapper function.

Thanks to Leon Timmermans for catching this omission.

Fixes GH #11
ba5ad7e
@doherty doherty referenced this issue from a commit
@doherty Wrap CORE::readdir to provide UTF-8 filenames
readdir returns filenames as bytes, but that is
inconvenient. Now, we wrap readdir and decode the
returned filenames so the user is presented with
the UTF-8 filenames they expected.

Thanks to Leon Timmermans for the code review.

Fixes GH #11
bc37b2a
@doherty
Owner

Thanks for your help, Leon. This is in the 0.008 release.

@doherty doherty closed this