UTF8 filenames come back mangled from scandir/walk #56

Closed
ghost opened this issue Sep 21, 2015 · 2 comments

@ghost

ghost commented Sep 21, 2015

Code points of the real filename string:

":".join("{:02x}".format(ord(c)) for c in u"C:\\Temp\\xx鳭僣yy.txt")
'43:3a:5c:54:65:6d:70:5c:78:78:9ced:50e3:79:79:2e:74:78:74'
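
The values `9ced` and `50e3` above are Unicode code points rather than encoded bytes. As a rough sketch of the difference (the helper names `hex_code_points` and `hex_utf8_bytes` are made up for illustration and are not part of scandir), the same name can be dumped both ways:

```python
# -*- coding: utf-8 -*-
# Illustrative helpers only; neither function exists in scandir or os.
name = u"xx\u9ced\u50e3yy.txt"


def hex_code_points(text):
    """Hex of each Unicode code point, as in the dump above."""
    return ":".join("{:02x}".format(ord(c)) for c in text)


def hex_utf8_bytes(text):
    """Hex of each byte of the UTF-8 encoding of the text."""
    # bytearray yields integers on both Python 2 and Python 3.
    return ":".join("{:02x}".format(b) for b in bytearray(text.encode("utf-8")))


print(hex_code_points(name))  # 78:78:9ced:50e3:79:79:2e:74:78:74
print(hex_utf8_bytes(name))   # 78:78:e9:b3:ad:e5:83:a3:79:79:2e:74:78:74
```

Each of the two non-ASCII characters takes three bytes in UTF-8, so neither form can be confused with the two `3f` values reported further down.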

Write something to file, and check size to confirm

with open(u"C:\\Temp\\xx鳭僣yy.txt","w") as f: f.write("abc")
op.getsize(u"C:\\Temp\\xx鳭僣yy.txt")
3L # success!

Let's look for filenames in the directory using scandir (this is the only file in the directory):

for f in scandir.scandir_c("C:\Temp"): ":".join("{:02x}".format(ord(c)) for c in f.name)
'78:78:3f:3f:79:79:2e:74:78:74'
for f in scandir.scandir_python("C:\Temp"): ":".join("{:02x}".format(ord(c)) for c in f.name)
'78:78:3f:3f:79:79:2e:74:78:74'
for f in scandir.scandir_generic("C:\Temp"): ":".join("{:02x}".format(ord(c)) for c in f.name)
'78:78:3f:3f:79:79:2e:74:78:74'

Note the "3f:3f" is "??", so the filename is being printed as 'xx??yy.txt'

Scandir seems unable to retrieve the UTF8 encoded filename, even though I am able to write to this file and check its size using Python. The standard listdir/walk in the os module also suffer from the same problem.

How can I get a directory listing with UTF8 filenames preserved?

@benhoyt
Owner

benhoyt commented Sep 21, 2015

This is annoying but expected behaviour, due to the way byte and unicode filenames are handled in Python 2.x. To get around it, just pass a unicode string instead of a byte string to scandir, like you're doing with open(), for example:

scandir.scandir(u"C:\\Temp\\xx鳭僣yy.txt")

Let me know if this works. Note that this is (or it should be!) the same behaviour as os.listdir() on Python 2.x.

@benhoyt
Owner

benhoyt commented Sep 22, 2015

Sorry, I said "filename" and copied the unicode filename rather than the unicode directory.

But no, my module isn't "broken". :-) It's operating by design, as per os.listdir(). The behaviour of bytes paths on Windows is kinda weird -- if you pass in a byte string, you get out byte strings with non-ASCII chars replaced by ? characters on Windows. This is different from on Linux, where you get UTF-8. So bytes paths are kind of half broken on Windows Python.
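
To illustrate why the `?` characters appear, here is a rough sketch that imitates the lossy conversion in pure Python. It assumes a Windows-1252 ANSI code page, which is only an assumption for illustration; the real conversion is done by Windows itself and depends on the system code page, not on Python's `encode()`:

```python
# -*- coding: utf-8 -*-
# Imitation of the lossy bytes conversion; cp1252 is an assumed code page.
name = u"xx\u9ced\u50e3yy.txt"

# Characters that the code page cannot represent are replaced with "?".
lossy = name.encode("cp1252", "replace")

# UTF-8 can represent every character, so nothing is lost.
utf8 = name.encode("utf-8")

# Decoding the lossy bytes cannot bring the original characters back.
round_trip = lossy.decode("cp1252")

print(repr(lossy))         # the two CJK characters are now "??"
print(round_trip == name)  # False: the information is gone
```

Once the name has been squeezed through such a code page there is no way to recover it, which is why the fix has to happen at the call site by passing a unicode path.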

What you need to do is simply pass a unicode path to scandir. Like so:

>>> import os, scandir
>>> os.mkdir('temp')
>>> f = open(u'temp\\xx\u9ced\u50e3yy.txt', 'w')  # create a unicode filename
>>> f.close()
>>> [e.name for e in scandir.scandir('temp')]  # this is what you are doing
['xx??yy.txt']
>>> [e.name for e in scandir.scandir(u'temp')]  # this is what you need to do
[u'xx\u9ced\u50e3yy.txt']
>>> [e.name.encode('utf-8') for e in scandir.scandir(u'temp')]  # or as UTF-8
['xx\xe9\xb3\xad\xe5\x83\xa3yy.txt']

Note that, by design, this exactly matches the behaviour of os.listdir() on Python 2.x:

>>> os.listdir('temp')
['xx??yy.txt']
>>> os.listdir(u'temp')
[u'xx\u9ced\u50e3yy.txt']
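
If paths reach the listing code from several places, one option is to normalise them to unicode first. The following is only a sketch: `to_text` and `list_names` are hypothetical helpers, and it uses `os.listdir()` so that it runs without the scandir package (per the comparison above, `scandir.scandir()` treats bytes and unicode paths the same way):

```python
import os
import sys

try:
    text_type = unicode  # Python 2
except NameError:
    text_type = str      # Python 3


def to_text(path):
    """Return path as a unicode string, decoding a byte string if needed."""
    if isinstance(path, text_type):
        return path
    # Assumption: byte paths are in the filesystem encoding.
    encoding = sys.getfilesystemencoding() or "utf-8"
    return path.decode(encoding)


def list_names(directory):
    """List a directory, always asking the OS for unicode names."""
    return sorted(os.listdir(to_text(directory)))
```

With this in place, `list_names('temp')` and `list_names(u'temp')` should both return unicode names, since the byte string is decoded before it reaches `os.listdir()`.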

@benhoyt benhoyt closed this as completed Sep 22, 2015