Unicode file name issue on windows #42

pombredanne · 2013-09-20T15:05:52Z

This is the same as #9
BUT I cannot encode to utf-8 as I do not know which encoding was used to create that filename. The file was created on some linux and is extracted on windows.

A test file is in this archive: https://github.com/pombredanne/python-magic/blob/issue_42/testdata/a.zip

When extracted on windows I can os.walk or os.listdir(u'some dir') (using a unicode string for the input dir) where this file has been extracted and open the file in Python alright, i.e. os.path.exists is True and the file can be opened.

Yet magic.from_file fails when passing the filename, whether as a string or unicode.

The point being that you cannot rely on my file system encoding to convert the file name, because it was created elsewhere, and the encoding is arbitrary, possibly unknown.

The best explanation of the problem is here IMHO: http://web.archive.org/web/20081219193532/http://boodebr.org/main/python/all-about-python-and-unicode#UNI_FILENAMES

NB: FWIW, I cannot get the file command line to work on that path either... so it could be a problem with the file code itself, not your ctypes wrapper.

ahupp · 2013-09-27T16:13:53Z

I was under the impression that linux used utf-8 by default but that might not be right. Does it actually fail if you use the .decode('utf-8') trick?

pombredanne · 2013-09-27T17:17:44Z

@ahupp you cannot convert these back to UTF reliably and consistently unless you know the original encoding that was used. And you cannot know the original encoding that was used, because it is not stored anywhere along with the filename.

Linux does not use any special encoding for file names: a name is just a sequence of bytes. The bytes are encoded according to the locale of the user that created the file originally. Yes, this is a per-user setting. Once the file moves to another user, that info is lost and you cannot reconstruct it. And the file may end up with a different decoded name according to that other user's locale. This is easy to reproduce when moving archives around: say a Chinese guy creates an archive with Chinese filenames encoded in GB2312 and sends that over to me. Unless I know that he was using that encoding, the filenames will look like gibberish.
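
To make the point above concrete, here is a minimal sketch (written for Python 3, not the Python 2 used in this thread) of one byte sequence read under three assumed encodings. The filename is a made-up example, not the one in a.zip.

```python
# Hypothetical filename: two Chinese characters plus ".txt",
# stored the way a GB2312 user would have written it to disk.
raw_name = "\u4e2d\u6587.txt".encode("gb2312")

# Correct only because we happen to know the original encoding.
as_gb2312 = raw_name.decode("gb2312")

# latin-1 maps every byte to a character, so this never fails,
# but the result is mojibake rather than the original name.
as_latin1 = raw_name.decode("latin-1")

# The same bytes are not valid UTF-8, so this guess fails outright.
try:
    as_utf8 = raw_name.decode("utf-8")
except UnicodeDecodeError:
    as_utf8 = None

# ascii() keeps the output printable whatever the terminal encoding is.
print(ascii(raw_name), ascii(as_gb2312), ascii(as_latin1), as_utf8)
```

Nothing in the bytes themselves says which of these readings is the intended one.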

pombredanne · 2013-09-27T17:18:08Z

so decoding to utf-8 fails

ahupp · 2013-09-27T19:51:13Z

Sorry, I meant encode, not decode. The issue in #9 was that they passed a unicode string to magic, which was (by default) encoded to ascii bytes by ctypes. If you turn that into a str (this is all pre-3.0, mind you) by running encode manually on it, it avoids the problem.

In your case, are you passing a unicode string as the filename? Or are you passing in bytes? A transcript from the terminal would help understand the specific issue.

pombredanne · 2013-10-14T11:11:08Z

@ahupp Adam, see this zip:
https://github.com/pombredanne/python-magic/blob/issue_42/testdata/a.zip

When extracted on windows I can os.listdir(u'some dir') where this file has been extracted and open the file in Python alright.
magic.from_file fails when passing the filename whether as a string or unicode

The point being that you cannot rely on my file system encoding to convert the file name, because it was created elsewhere, and the encoding is arbitrary, possibly unknown.

The best explanation of the problem is here IMHO: http://web.archive.org/web/20081219193532/http://boodebr.org/main/python/all-about-python-and-unicode#UNI_FILENAMES

NB: FWIW, I cannot get the file command line to work on that path either... so it could be a problem with the file code itself, not your ctypes wrapper.

mojotx · 2013-10-15T21:32:10Z

I was having a similar issue. I'm trying to detect mime types on an arbitrary set of files, some with various encodings. I was using os.walk() to read a bunch of files in an input directory. I fixed my program by just declaring the source directory as a unicode string, and everything magically began working. (pun intended)

The whole trick is to get the filename to be a unicode string instead of an ascii string. If you pass a unicode string into os.walk(), then you get unicode strings back, and everything works as it should.
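
A small sketch of that behaviour, assuming Python 3 (where str plays the role of Python 2's unicode and bytes the role of Python 2's str). The temporary directory and the file name sample.txt are placeholders for illustration only.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create one empty file so the listing has something to return.
    open(os.path.join(tmp, "sample.txt"), "w").close()

    # Text path in, text names out.
    text_names = os.listdir(tmp)

    # Bytes path in, raw bytes names out (no decoding is attempted).
    byte_names = os.listdir(os.fsencode(tmp))

print(text_names, byte_names)
```

The type of the directory argument decides the type of every name that comes back, which is why changing only the source directory changes what later gets passed to magic.from_file.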

pombredanne · 2013-10-16T22:15:11Z

@mojotx Thanks for the input, but this is what I am doing alright: "I can os.listdir(u'some dir')".
The filename in https://github.com/pombredanne/python-magic/blob/issue_42/testdata/a.zip can be worked out consistently with Python alright using a unicode string as an input dir for os.walk or listdir.
But python-magic still chokes when getting this same unicode filename.

mojotx · 2013-10-17T00:49:01Z

@pombredanne, which version of Python are you using? Just curious. I'm still having similar issues, using Python 2.7.

mojotx · 2013-10-17T02:06:53Z

I think I may have found the problem. See http://www.python.org/peps/pep-0263.html

Adding the following as the first line of magic.py seems to have fixed it.

# -*- coding: utf-8 -*-

After stepping tediously through the code with pdbpp, it appears that even though you pass in a UTF-8 encoded Unicode string, the magic.py is defaulting back to ascii, which is then causing problems. There may be a better way to fix this, but this appears to work for me. I may be uninstalling the pypi version and just cloning the Git repository and doing my own branch.

pombredanne · 2013-10-17T07:41:17Z

@mojotx Can you be more explicit? do you have a unit test passing that uses the file I provided?
I cannot fathom how changing the source file encoding could impact this :^) and I could not reproduce the fix you claim above by forcing a source file encoding. And btw, I am talking about head here (though other versions have likely the same issue)

pombredanne · 2013-10-17T07:43:04Z

@mojotx I am using CPython 2.5, 2.6 and 2.7 on Linux/Win/Mac

mojotx · 2013-10-17T12:49:03Z

@pombredanne Strange. Now I can't reproduce the issue. You can grab my testing program by cloning https://github.com/mojoTX/mojoTX.git

I put your image file into my repository as well, since it's good for testing.

mojotx · 2013-10-17T12:52:58Z

@pombredanne You can tinker with my "test_magic.py" script. If you initialize SOURCEDIR as a unicode string it works, but otherwise it does not.

mojotx · 2013-10-17T20:59:39Z

@pombredanne @ahupp:

I think one of the problems is that if you pass in a Unicode filename, the "Unicode-ness" gets discarded at line 183 of magic.py, in coerce_filename():

def coerce_filename(filename):
    if filename is None:
        return None
    return filename.encode(sys.getfilesystemencoding())

A test program I wrote illustrates what happens:

#!/usr/bin/env python2
# -*- coding: utf-8 -*-

import sys, pprint

s = u"Python является моим любимым языком программирования"
pprint.pprint( s )
print "sys.getfilesystemencoding() is %s" % sys.getfilesystemencoding()
s2 = s.encode(sys.getfilesystemencoding())
pprint.pprint( s2 )
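
For readers on Python 3, a rough variant of the same experiment might look like the sketch below. It passes the codec explicitly instead of reading sys.getfilesystemencoding(), so the outcome does not depend on the machine's locale. The helper try_encode and the sample name are hypothetical and are not part of magic.py.

```python
def try_encode(text, encoding):
    """Return text encoded with the given codec, or None if it cannot be represented."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None

# Hypothetical non-ASCII filename (Cyrillic letters plus ".txt").
name = "\u044f\u0437\u044b\u043a.txt"

utf8_result = try_encode(name, "utf-8")   # succeeds, two bytes per Cyrillic letter
ascii_result = try_encode(name, "ascii")  # fails: ASCII has no Cyrillic

print(ascii(utf8_result), ascii_result)
```

Whether the encode step in coerce_filename() succeeds therefore depends entirely on which codec the interpreter reports for the file system.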

pombredanne · 2013-10-18T09:44:49Z

@mojotx I appreciate your efforts, but I feel you are entirely missing the point of this bug: "When extracted on windows I can os.listdir(u'some dir') where this file has been extracted and open the file in Python alright."
Sorry if I was not clear enough that this is a windows-only issue. Things are ok on posix. Let me update the issue.

pombredanne · 2013-10-18T14:00:47Z

@mojotx @ahupp and FWIW patching the code to bypass the re-encoding with the file system encoding does not matter here. The issue is IMHO the fact that the file code itself cannot cope with these filenames at all on windows. In a sense this is a bug in file (the command) rather than a bug here.

mojotx · 2013-10-21T15:37:51Z

@pombredanne Okay, sorry, I misunderstood the issue.

pombredanne · 2013-10-21T15:50:42Z

@mojotx no problem, this is a tough one. In fact it is intractable in some corner cases like this one: for instance, you cannot reliably store a path string on one OS and expect it to work on another OS at all. For now the problem lies in the file command, which at least needs to support unicode paths on windows, IMHO.

fluxer · 2013-11-19T23:28:32Z

It seems that changing "return filename.encode(sys.getfilesystemencoding())" to "return filename" in coerce_filename() solves the decode issue for me, not sure why filename is encoded in the first place.

pombredanne · 2013-11-20T08:50:43Z

@fluxer Thanks Ivailo: let me recheck that, did you have a test case that was failing and is no longer failing now?
which os/files/locale were you using?

fluxer · 2013-11-20T09:35:01Z

I'm not using a famous GNU/Linux distribution, nor Windows.

I have a test case that is always failing for me, here is an archive with a few example files: https://dl.dropboxusercontent.com/u/54183088/magic.tar.gz.

Locales as follows:
LANG=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=

ahupp · 2014-01-04T20:43:57Z

Ok, I see the problem. The coerce_filename change is there because passing unicode values to ctypes causes it to do an implicit .encode('ascii') (at least in python 3). This of course fails for non-ascii unicode. But for non-unicode strings I'd expected this to be a no-op, yet apparently it chokes in a way that I don't really understand. The fix is to only do the conversion for unicode strings and leave byte-strings unchanged. Of course supporting this in both python 3 and 2 is a PITA.
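
One possible shape for such a fix is sketched below. This is not the code that was committed to master. The name coerce_filename_sketch and its optional encoding parameter are hypothetical. For context on the choking: in Python 2, calling .encode() on a byte string first decodes it implicitly as ASCII, which raises UnicodeDecodeError as soon as the name contains a non-ASCII byte.

```python
import sys

def coerce_filename_sketch(filename, encoding=None):
    """Hypothetical helper: encode text filenames, pass byte strings through."""
    if filename is None:
        return None
    if isinstance(filename, bytes):
        # Already raw bytes (for example from a bytes-mode directory listing);
        # re-encoding could only fail or corrupt the name, so leave it alone.
        return filename
    # Text string: convert to bytes before handing it to the C library.
    return filename.encode(encoding or sys.getfilesystemencoding())

print(coerce_filename_sketch(b"\xd6\xd0.txt"))
print(coerce_filename_sketch(u"a.txt", "utf-8"))
```

Because bytes is an alias for str on Python 2.6 and later, the same isinstance check covers both major versions.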


ahupp · 2014-01-04T20:44:44Z

I could reproduce the problem with Ivailo's files, and now I get:

for i in os.listdir('.'): print magic.from_file(i)
PEM certificate
PEM certificate

(BTW, I hope those are public keys :)

ahupp · 2014-01-04T20:59:59Z

I believe I've pushed a fix for this to master. Could someone hitting this error give it a try?

fluxer · 2014-01-04T21:04:14Z

I will give it a shot. I only have Python v2.7.6 installed though, so someone else will have to test this with Python v3.x.

The keys are from ca-certificates (http://ftp.debian.org/debian/pool/main/c/ca-certificates/) so yeah - they are public.

fluxer · 2014-01-04T21:14:31Z

Works fine for me now, thanks a bunch!

PhobosK · 2015-08-03T20:09:24Z

Since this is a real issue (fixed now), but the fix was committed well after the last available official release (even the one here, 0.4.3), would you please consider preparing a new release with this and all the other fixes you have added since 2012?

Thanks

ahupp · 2015-10-19T06:25:17Z

Yeah, the release schedule has been poky. I just pushed 0.4.7 which is up-to-date.

ahupp closed this as completed Jan 4, 2014
