Unicode file name issue on windows #42
Comments
I was under the impression that Linux used UTF-8 by default, but that might not be right. Does it actually fail if you use the .decode('utf-8') trick?
@ahupp you cannot convert these back to Unicode reliably and consistently unless you know the original encoding that was used. And you cannot know that encoding, because it is not stored anywhere along with the filename. Linux does not use any special encoding for file names: a name is just a sequence of bytes, encoded according to the locale of the user who originally created the file. Yes, this is a per-user setting. Once the file moves to another user, that info is lost and you cannot reconstruct it, and the file will end up with a different decoded name under that other user's locale. This is easy to reproduce when moving archives around: say someone creates an archive with Chinese filenames encoded in GB2312 and sends it over to me. Unless I know that encoding was used, the filenames will end up looking like gibberish.
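To make the point above concrete, here is a minimal sketch (Python 3, standard library only; the sample filename is made up for illustration and is not from the test archive) of how the same stored bytes yield different names depending on which encoding the reader assumes:

```python
# Hypothetical filename, chosen only for illustration.
original_name = "中文文件.txt"

# What ends up on disk on the creator's side: plain bytes, with no
# record of the encoding that produced them.
raw_bytes = original_name.encode("gb2312")

# A reader who happens to know the original encoding recovers the name.
recovered = raw_bytes.decode("gb2312")

# A reader who assumes UTF-8 cannot decode these bytes at all.
try:
    raw_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# A reader who assumes Latin-1 gets no error, only a gibberish name.
mojibake = raw_bytes.decode("latin-1")

print(repr(recovered), utf8_ok, repr(mojibake))
```

Nothing in raw_bytes says which of these readings is the right one, which is why guessing from the local filesystem encoding is unreliable.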
So decoding to UTF-8 fails.
Sorry, I meant encode, not decode. The issue in #9 was that they passed a unicode string to magic, which was (by default) encoded to ASCII bytes by ctypes. If you turn that into a str (this is all pre-3.0, mind you) by running encode manually on it, it avoids the problem. In your case, are you passing a unicode string as the filename? Or are you passing in bytes? A transcript from the terminal would help to understand the specific issue.
@ahupp Adam, see this zip: When extracted on Windows I can os.listdir(u'some dir') where this file has been extracted and open the file in Python alright. The point being that you cannot rely on my file system encoding to convert the file name, because it was created elsewhere, and the encoding is arbitrary, possibly unknown. The best explanation of the problem is here IMHO: http://web.archive.org/web/20081219193532/http://boodebr.org/main/python/all-about-python-and-unicode#UNI_FILENAMES NB: FWIW, I cannot get the file command line to work on that path either... so it could be a problem with the file code itself, not your ctypes wrapper.
I was having a similar issue. I'm trying to detect mime types on an arbitrary set of files, some with various encodings. I was using os.walk() to read a bunch of files in an input directory. I fixed my program by just declaring the source directory as a unicode string, and everything magically began working. (pun intended) The whole trick is to get the filename to be a unicode string instead of an ascii string. If you pass a unicode string into os.walk(), then you get unicode strings back, and everything works as it should.
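The behaviour described above (the type of the directory argument decides the type of the names you get back) can be sketched as follows. This is a Python 3 illustration, where the pair is str/bytes rather than the unicode/str pair of Python 2; the temporary directory and the file name example.txt are made up for the demo:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create one empty file to walk over.
    open(os.path.join(tmp, "example.txt"), "w").close()

    # Text path in, text names out.
    text_names = [name for _, _, files in os.walk(tmp) for name in files]

    # Bytes path in, bytes names out.
    byte_names = [name for _, _, files in os.walk(os.fsencode(tmp))
                  for name in files]

print(text_names, byte_names)
```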
@mojotx Thanks for the input, but this is what I am doing alright: "I can os.listdir(u'some dir')"
@pombredanne, which version of Python are you using? Just curious. I'm still having similar issues, using Python 2.7.
I think I may have found the problem. See http://www.python.org/peps/pep-0263.html
Adding the following as the first line of magic.py seems to have fixed it.
# -*- coding: utf-8 -*-
After stepping tediously through the code with pdbpp, it appears that even though you pass in a UTF-8 encoded Unicode string, magic.py is defaulting back to ascii, which is then causing problems. There may be a better way to fix this, but this appears to work for me. I may be uninstalling the pypi version and just cloning the Git repository and doing my own branch.
@mojotx Can you be more explicit? Do you have a unit test passing that uses the file I provided?
@mojotx I am using CPython 2.5, 2.6 and 2.7 on Linux/Win/Mac
@pombredanne Strange. Now I can't reproduce the issue. You can grab my testing program by cloning https://github.com/mojoTX/mojoTX.git I put your image file into my repository as well, since it's good for testing.
@pombredanne You can tinker with my "test_magic.py" script. If you initialize SOURCEDIR as a unicode string it works, but otherwise it does not.
I think one of the problems is that if you pass in a Unicode filename, the "Unicode-ness" gets discarded at line 183 of magic.py, in coerce_filename():
def coerce_filename(filename):
A test program I wrote illustrates what happens:
#!/usr/bin/env python2
# -*- coding: utf-8 -*-
import sys, pprint
s = u"Python является моим любимым языком программирования"
@mojotx I appreciate your efforts but I feel you are entirely missing the point of this bug: "When extracted on windows I can os.listdir(u'some dir') where this file has been extracted and open the file in Python alright."
@pombredanne Okay, sorry, I misunderstood the issue.
@mojotx no problem, this is a tough one. In fact it is intractable in some corner cases like this one: for instance, you cannot reliably store a path string on one OS and expect it to work on another OS at all. For now the problem lies in the file command itself, at least when it comes to supporting unicode paths on Windows, IMHO
It seems that changing "return filename.encode(sys.getfilesystemencoding())" to "return filename" in coerce_filename() solves the decode issue for me; not sure why filename is encoded in the first place.
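One way to read the suggestion above is that a filename which is already bytes should be handed to libmagic untouched, and only text should be encoded. The following is a rough Python 3 sketch of that idea. The name coerce_filename_sketch is hypothetical, this is not python-magic's actual code, and the fix that was eventually pushed is not shown in this thread:

```python
import os


def coerce_filename_sketch(filename):
    """Hypothetical helper: return a bytes path suitable for a C API."""
    if filename is None:
        return None
    if isinstance(filename, bytes):
        # Already raw bytes (e.g. from os.listdir(b"...")): pass through
        # unchanged, so names in an unknown encoding are never re-encoded.
        return filename
    # Text path: encode with the filesystem encoding and its error handler.
    return os.fsencode(filename)


print(coerce_filename_sketch(b"\xd6\xd0.txt"))
print(coerce_filename_sketch("plain.txt"))
```

The design choice here is to never decode bytes the caller already has, since the thread's point is that their original encoding cannot be known.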
@fluxer Thanks Ivailo. Let me recheck that: did you have a test case that was failing and is no longer failing now?
I'm not using a famous GNU/Linux distribution, nor Windows. I have a test case that is always failing for me, here is an archive with a few example files: https://dl.dropboxusercontent.com/u/54183088/magic.tar.gz. Locales as follows: |
Ok, I see the problem. The coerce_filename change is there because passing On Wed, Nov 20, 2013 at 3:35 AM, Ivailo Monev notifications@github.com wrote:
Adam Hupp | http://hupp.org/adam/ |
I could reproduce the problem with Ivailo's files, and now I get:
(BTW, I hope those are public keys :) |
I believe I've pushed a fix for this to master. Could someone hitting this error give it a try? |
I will give it a shot. I only have Python v2.7.6 installed though, so someone else will have to test this with Python v3.x. The keys are from ca-certificates (http://ftp.debian.org/debian/pool/main/c/ca-certificates/) so yeah - they are public.
Works fine for me now, thanks a bunch! |
Since this is a real issue (fixed now), but the fix was committed way after the last available official release (even the one here, 0.4.3), would you please consider preparing a new release with this and all the other fixes you have added since 2012? Thanks
Yeah, the release schedule has been poky. I just pushed 0.4.7 which is up-to-date. |
This is the same as #9
BUT I cannot encode to UTF-8, as I do not know which encoding was used to create that filename. The file was created on some Linux system and is extracted on Windows.
A test file is in this archive: https://github.com/pombredanne/python-magic/blob/issue_42/testdata/a.zip
When extracted on windows I can os.walk or os.listdir(u'some dir') (using a unicode string for the input dir) where this file has been extracted and open the file in Python alright. ie os.path.exists is True and the file can be opened.
Yet magic.from_file fails when passing the filename whether as a string or unicode
The point being that you cannot rely on my file system encoding to convert the file name, because it was created elsewhere, and the encoding is arbitrary, possibly unknown.
The best explanation of the problem is here IMHO: http://web.archive.org/web/20081219193532/http://boodebr.org/main/python/all-about-python-and-unicode#UNI_FILENAMES
NB: FWIW, I cannot get the file command line to work on that path either... so it could be a problem with the file code itself, not your ctypes wrapper