Unicode file name issue on windows #42

Closed
pombredanne opened this issue Sep 20, 2013 · 28 comments

Comments

@pombredanne

This is the same as #9, but I cannot encode to UTF-8 because I do not know which encoding was used to create that filename. The file was created on some Linux system and is extracted on Windows.

A test file is in this archive: https://github.com/pombredanne/python-magic/blob/issue_42/testdata/a.zip

When extracted on Windows, I can os.walk or os.listdir(u'some dir') (using a unicode string for the input dir) the directory where this file has been extracted and open the file in Python without trouble, i.e. os.path.exists is True and the file can be opened.

Yet magic.from_file fails when passing the filename, whether as a str or as unicode.

The point is that you cannot rely on my file system encoding to convert the file name, because it was created elsewhere, and the encoding is arbitrary and ultimately unknown.

The best explanation of the problem is here IMHO: http://web.archive.org/web/20081219193532/http://boodebr.org/main/python/all-about-python-and-unicode#UNI_FILENAMES

NB: FWIW, I cannot get the file command-line tool to work on that path either, so this could be a problem with the file code itself, not your ctypes wrapper.

@ahupp
Owner

ahupp commented Sep 27, 2013

I was under the impression that Linux used UTF-8 by default, but that might not be right. Does it actually fail if you use the .decode('utf-8') trick?

@pombredanne
Author

@ahupp you cannot convert these back to UTF-8 reliably and consistently unless you know the original encoding that was used. And you cannot know it, because it is not stored anywhere along with the filename.

Linux does not use any special encoding for file names: a name is just a sequence of bytes. The bytes are encoded according to the locale of the user who originally created the file. Yes, this is a per-user setting. Once the file moves to another user, that information is lost and you cannot reconstruct it, and the file will end up with a different decoded name under that other user's locale. This is easy to reproduce when moving archives around: say someone in China creates an archive with Chinese filenames encoded in GB2312 and sends it to me. Unless I know that encoding was used, the filenames will look like gibberish.
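
To illustrate the point about lost encodings, here is a minimal Python 3 sketch (not from the thread; the sample name is an arbitrary assumption). It shows the same filename bytes turning into different text depending on which encoding the reader guesses:

```python
# Hypothetical name "中文.txt", written with escapes so the snippet is ASCII-only.
original = "\u4e2d\u6587.txt"

# The bytes a GB2312 user might end up storing on disk for that name.
raw_name = original.encode("gb2312")

# Decoding with the right encoding recovers the original name.
as_gb2312 = raw_name.decode("gb2312")

# Latin-1 maps every byte to some character, so decoding "succeeds" but gives gibberish.
as_latin1 = raw_name.decode("latin-1")

# UTF-8 rejects these particular bytes outright.
try:
    raw_name.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(repr(as_gb2312), repr(as_latin1), utf8_ok)
```

Nothing in `raw_name` itself says which of these readings is the intended one, which is the core of the problem described here.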

@pombredanne
Author

so decoding to utf-8 fails

@ahupp
Owner

ahupp commented Sep 27, 2013

Sorry, I meant encode, not decode. The issue in #9 was that they passed a unicode string to magic, which was (by default) encoded to ASCII bytes by ctypes. If you turn that into a str (this is all pre-3.0, mind you) by running encode manually on it, it avoids the problem.

In your case, are you passing a unicode string as the filename? Or are you passing in bytes? A transcript from the terminal would help understand the specific issue.

@pombredanne
Author

@ahupp Adam, see this zip:
https://github.com/pombredanne/python-magic/blob/issue_42/testdata/a.zip

When extracted on Windows, I can os.listdir(u'some dir') the directory where this file has been extracted and open the file in Python without trouble.
magic.from_file fails when passing the filename, whether as a str or as unicode.

The point is that you cannot rely on my file system encoding to convert the file name, because it was created elsewhere, and the encoding is arbitrary and ultimately unknown.

The best explanation of the problem is here IMHO: http://web.archive.org/web/20081219193532/http://boodebr.org/main/python/all-about-python-and-unicode#UNI_FILENAMES

NB: FWIW, I cannot get the file command-line tool to work on that path either, so this could be a problem with the file code itself, not your ctypes wrapper.

@mojotx

mojotx commented Oct 15, 2013

I was having a similar issue. I'm trying to detect mime types on an arbitrary set of files, some with various encodings. I was using os.walk() to read a bunch of files in an input directory. I fixed my program by just declaring the source directory as a unicode string, and everything magically began working. (pun intended)

The whole trick is to get the filename to be a unicode string instead of an ascii string. If you pass a unicode string into os.walk(), then you get unicode strings back, and everything works as it should.
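
A minimal sketch of that behaviour, written for Python 3, where the same rule holds with `str` and `bytes` (in Python 2 the corresponding pair is `unicode` and `str`). The temporary directory, the file name, and the helper `list_both_ways` are illustrative assumptions:

```python
import os
import tempfile

def list_both_ways(directory):
    """Return the directory listing as text names and as raw byte names."""
    text_names = os.listdir(directory)               # text path in -> text names out
    byte_names = os.listdir(os.fsencode(directory))  # bytes path in -> bytes names out
    return text_names, byte_names

with tempfile.TemporaryDirectory() as tmp:
    # Create one empty file so there is something to list.
    open(os.path.join(tmp, "sample.txt"), "w").close()
    text_names, byte_names = list_both_ways(tmp)
    print(text_names, byte_names)
```

The type of the argument decides the type of the names that come back, which is why declaring the source directory as a unicode string changes everything downstream.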

@pombredanne
Author

@mojotx Thanks for the input, but this is what I am already doing: "I can os.listdir(u'some dir')".
The filename in https://github.com/pombredanne/python-magic/blob/issue_42/testdata/a.zip can be handled consistently in Python using a unicode string as the input dir for os.walk or os.listdir.
But python-magic still chokes when given this same unicode filename.

@mojotx

mojotx commented Oct 17, 2013

@pombredanne, which version of Python are you using? Just curious. I'm still having similar issues, using Python 2.7.

@mojotx

mojotx commented Oct 17, 2013

I think I may have found the problem. See http://www.python.org/peps/pep-0263.html

Adding the following as the first line of magic.py seems to have fixed it.

# -*- coding: utf-8 -*-

After stepping tediously through the code with pdbpp, it appears that even though you pass in a UTF-8 encoded Unicode string, magic.py falls back to ASCII, which then causes problems. There may be a better way to fix this, but this appears to work for me. I may uninstall the PyPI version, clone the Git repository, and work on my own branch.

@pombredanne
Author

@mojotx Can you be more explicit? Do you have a passing unit test that uses the file I provided?
I cannot fathom how changing the source file encoding could affect this :^) and I could not reproduce the fix you describe above by forcing a source file encoding. And btw, I am talking about HEAD here (though other versions likely have the same issue).

@pombredanne
Author

@mojotx I am using CPython 2.5, 2.6 and 2.7 on Linux/Win/Mac

@mojotx

mojotx commented Oct 17, 2013

@pombredanne Strange. Now I can't reproduce the issue. You can grab my testing program by cloning https://github.com/mojoTX/mojoTX.git

I put your image file into my repository as well, since it's good for testing.

@mojotx

mojotx commented Oct 17, 2013

@pombredanne You can tinker with my "test_magic.py" script. If you initialize SOURCEDIR as a unicode string it works, but otherwise it does not.

@mojotx

mojotx commented Oct 17, 2013

@pombredanne @ahupp:

I think one of the problems is that if you pass in a Unicode filename, the "Unicode-ness" gets discarded at line 183 of magic.py, in coerce_filename():

    def coerce_filename(filename):
        if filename is None:
            return None
        return filename.encode(sys.getfilesystemencoding())

A test program I wrote illustrates what happens:

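    #!/usr/bin/env python2
    # -*- coding: utf-8 -*-
    # The coding declaration must sit on line 1 or 2 (PEP 263).

    import sys, pprint

    # A unicode string containing non-ASCII (Cyrillic) characters.
    s = u"Python является моим любимым языком программирования"
    pprint.pprint( s )
    print "sys.getfilesystemencoding() is %s" % sys.getfilesystemencoding()
    # Encoding with the file system encoding turns the unicode string into bytes.
    s2 = s.encode(sys.getfilesystemencoding())
    pprint.pprint( s2 )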

@pombredanne
Author

@mojotx I appreciate your efforts, but I feel you are entirely missing the point of this bug: "When extracted on windows I can os.listdir(u'some dir') where this file has been extracted and open the file in Python alright."
Sorry if I was not clear enough that this is a Windows-only issue. Things are OK on POSIX. Let me update the issue.

@pombredanne
Author

@mojotx @ahupp and FWIW, patching the code to bypass the re-encoding with the file system encoding makes no difference here. The issue is IMHO that the file code itself cannot cope with these filenames at all on Windows. In a sense this is a bug in file (the command) rather than a bug here.

@mojotx

mojotx commented Oct 21, 2013

@pombredanne Okay, sorry, I misunderstood the issue.

@pombredanne
Author

@mojotx no problem, this is a tough one. In fact it is intractable in some corner cases like this one: for instance, you cannot reliably store a path string on one OS and expect it to work on another OS at all. For now, IMHO, the problem lies in the file command, which at the very least needs to support unicode paths on Windows.
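
If the path itself is what the underlying file code cannot handle, one possible workaround (an assumption, not something proposed or tested in this thread) is to open the file in Python, which can open it, and pass only the bytes on for detection. The helper `identify_by_content` and the stand-in `fake_detect` below are hypothetical names used for illustration:

```python
import os
import tempfile

def identify_by_content(path, detect, sample_size=2048):
    """Read the start of the file in Python and pass the bytes to `detect`."""
    with open(path, "rb") as handle:
        head = handle.read(sample_size)
    return detect(head)

# Stand-in detector so the sketch runs without libmagic installed.
def fake_detect(data):
    return "PEM certificate" if data.startswith(b"-----BEGIN") else "data"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.pem")
    with open(path, "wb") as handle:
        handle.write(b"-----BEGIN CERTIFICATE-----\n")
    result = identify_by_content(path, fake_detect)
    print(result)
```

With python-magic installed, `magic.from_buffer` could be passed as `detect` in place of the stand-in. Whether that sidesteps the Windows failure for the test file in this issue has not been verified here.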

@fluxer

fluxer commented Nov 19, 2013

It seems that changing "return filename.encode(sys.getfilesystemencoding())" to "return filename" in coerce_filename() solves the decode issue for me, not sure why filename is encoded in the first place.

@pombredanne
Author

@fluxer Thanks Ivailo: let me recheck that. Did you have a test case that was failing and is no longer failing now?
Which OS/files/locale were you using?

@fluxer

fluxer commented Nov 20, 2013

I'm not using a famous GNU/Linux distribution, nor Windows.

I have a test case that is always failing for me, here is an archive with a few example files: https://dl.dropboxusercontent.com/u/54183088/magic.tar.gz.

Locales as follows:
LANG=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=
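
As a rough illustration of why this locale matters: under a POSIX/C locale, Python 2 typically reports an ASCII file system encoding, and ASCII cannot represent non-ASCII names. The sketch below (Python 3, with the encoding passed explicitly because newer interpreters may report UTF-8 even under this locale) uses a hypothetical helper `encodable` and made-up file names:

```python
def encodable(name, encoding):
    """Return True if the text name can be encoded with the given encoding."""
    try:
        name.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True

plain_ok = encodable("cert.pem", "ascii")
accented_ok = encodable("T\u00dcRK.pem", "ascii")   # name contains U+00DC
accented_utf8_ok = encodable("T\u00dcRK.pem", "utf-8")

print(plain_ok, accented_ok, accented_utf8_ok)
```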

@ahupp
Owner

ahupp commented Jan 4, 2014

Ok, I see the problem. The coerce_filename change is there because passing
unicode values to ctypes causes it to do an implicit .encode('ascii') (at
least in python 3). This of course fails for non-ascii unicode. For
non-unicode strings I expected this to be a no-op, but apparently it chokes
in a way that I don't really understand. The fix is to only do the
conversion for unicode strings and leave byte-strings unchanged. Of course
supporting this in both python 3 and 2 is a PITA.
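
A minimal sketch of the fix described above, not necessarily the code that was committed. The optional `encoding` parameter is a hypothetical addition for testing. One plausible reason the byte-string case choked: in Python 2, calling `.encode()` on a byte string first decodes it implicitly as ASCII, which fails for non-ASCII bytes.

```python
import sys

def coerce_filename(filename, encoding=None):
    """Encode text filenames to bytes; pass bytes and None through unchanged."""
    if filename is None:
        return None
    if isinstance(filename, bytes):
        # Already raw bytes: re-encoding could only lose or corrupt information.
        return filename
    return filename.encode(encoding or sys.getfilesystemencoding())

print(coerce_filename(b"\xd6\xd0\xce\xc4.txt"))
print(coerce_filename("caf\u00e9.txt", "utf-8"))
```

Because `bytes` is an alias for `str` in Python 2.6 and later, the same `isinstance` check should work on both major versions, with `unicode` objects taking the encode branch on Python 2.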


@ahupp
Owner

ahupp commented Jan 4, 2014

I could reproduce the problem with Ivailo's files, and now I get:

for i in os.listdir('.'): print magic.from_file(i)
PEM certificate
PEM certificate

(BTW, I hope those are public keys :)

@ahupp
Owner

ahupp commented Jan 4, 2014

I believe I've pushed a fix for this to master. Could someone hitting this error give it a try?

@fluxer

fluxer commented Jan 4, 2014

I will give it a shot. I only have Python v2.7.6 installed though, so someone else will have to test this with Python v3.x.

The keys are from ca-certificates (http://ftp.debian.org/debian/pool/main/c/ca-certificates/) so yeah - they are public.

@fluxer

fluxer commented Jan 4, 2014

Works fine for me now, thanks a bunch!

ahupp closed this as completed Jan 4, 2014

@PhobosK

PhobosK commented Aug 3, 2015

Since this is a real issue (fixed now) but the fix was committed well after the last available official release (even the 0.4.3 one here), would you please consider preparing a new release with this and all the other fixes you have added since 2012?

Thanks

@ahupp
Owner

ahupp commented Oct 19, 2015

Yeah, the release schedule has been poky. I just pushed 0.4.7 which is up-to-date.
