UnicodeDecodeError in coerce_filename #27

Closed
marians opened this issue Jan 15, 2013 · 7 comments

Comments

marians commented Jan 15, 2013

I have a file named:

  'aktuelle_Dokumente.jsp?docTyp=ST&wp=15&dokNum=8.+Schulrechts\xe4\xae\xa4erungsgesetz&searchDru=suchen'

When I try to read this with magic's from_file method, I get the following exception:

  Traceback (most recent call last):
    File "repo-audit.py", line 133, in <module>
      auditor.run()
    File "repo-audit.py", line 27, in run
      e = AuditEntry(fullpath, self.logfile, self.mime_magic)
    File "repo-audit.py", line 61, in __init__
      self.mimetype = self.file_type()
    File "repo-audit.py", line 119, in file_type
      return self.mime_magic.from_file(self.path)
    File "/.../venv/lib/python2.7/site-packages/magic.py", line 70, in from_file
      return magic_file(self.cookie, filename)
    File "/.../venv/lib/python2.7/site-packages/magic.py", line 170, in magic_file
      return _magic_file(cookie, coerce_filename(filename))
    File "/.../venv/lib/python2.7/site-packages/magic.py", line 146, in coerce_filename
      return filename.encode(sys.getfilesystemencoding())
  UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 130: ordinal not in range(128)

I am on Mac OS X 10.8.2 with Python 2.7.2.

ahupp (Owner) commented Jan 15, 2013

What happens when you call open() on that filename? Does it raise the
same error?

ahupp (Owner) commented Jun 3, 2013

Did you have a chance to look at this?

ghost commented May 19, 2014

I can confirm this behaviour. Seems like coerce_filename has problems with UTF-8 encoded file names. Can be reproduced like this:

  import magic
  path = "/tmp/test\xfc.txt"  # == /tmp/testü.txt (German u-umlaut)
  with open(path, "w") as f:
      f.write('\n')  # works
  magic.coerce_filename(path)  # fails with:

  Traceback (most recent call last):
    File "<input>", line 1, in <module>
    File "/usr/lib64/python2.7/site-packages/python_magic-0.4.6-py2.7.egg/magic.py", line 183, in coerce_filename
      return filename.encode(sys.getfilesystemencoding())
  UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 9: ordinal not in range(128)

I'm running Python 2.7.5 with python_magic-0.4.6
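
A likely reason an encode() call reports a *decode* error: in Python 2, calling .encode() on a byte str first decodes it implicitly with the default 'ascii' codec, and that hidden step fails on any non-ASCII byte. The sketch below makes the hidden step explicit in Python 3 syntax. The helper name py2_style_encode is hypothetical and not part of python-magic.

```python
def py2_style_encode(raw, target_encoding):
    """Emulate Python 2's str.encode() when called on a byte string.

    Hypothetical helper for illustration only: Python 2 first decoded the
    bytes with the default 'ascii' codec, then encoded the result.
    """
    text = raw.decode("ascii")  # the implicit step; fails on non-ASCII bytes
    return text.encode(target_encoding)


path = b"/tmp/test\xfc.txt"  # same bytes as the repro above
try:
    py2_style_encode(path, "utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# Report the exception type and the offset of the offending byte.
print(type(error).__name__, error.start if error else None)
```

The failing offset lines up with the "position 9" in the traceback, since "/tmp/test" is nine bytes long.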

ahupp (Owner) commented May 30, 2014

I think the problem is that getfilesystemencoding() only returns UTF-8 if your LANG is set appropriately. Otherwise it's ascii or similar. I can repro the error that way in the unit test.

Does this patch work for you?

  diff --git a/magic.py b/magic.py
  index cd5ff24..10685ac 100644
  --- a/magic.py
  +++ b/magic.py
  @@ -193,14 +193,15 @@ def coerce_filename(filename):
           return None

       # ctypes will implicitly convert unicode strings to bytes with
  -    # .encode('ascii').  A more useful default here is
  -    # getfilesystemencoding().  We need to leave byte-str unchanged.
  +    # .encode('ascii').  If you use the filesystem encoding
  +    # then you'll get inconsistent behavior (crashes) depending on the user's
  +    # LANG environment variable
       is_unicode = (sys.version_info[0] <= 2 and
                     isinstance(filename, unicode)) or \
                    (sys.version_info[0] >= 3 and
                     isinstance(filename, str))
       if is_unicode:
  -        return filename.encode(sys.getfilesystemencoding())
  +        return filename.encode('utf-8')
       else:
           return filename
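
For readers on Python 3 only, the patched logic reduces to roughly the following. This is a minimal sketch under that assumption, not the library's actual code, and the name coerce_filename_sketch is hypothetical.

```python
def coerce_filename_sketch(filename):
    """Return a bytes filename suitable for passing to a C API.

    Hypothetical Python 3-only restatement of the patch above: text is
    always encoded as UTF-8, independent of the LANG environment variable,
    while bytes and None pass through unchanged.
    """
    if filename is None:
        return None
    if isinstance(filename, str):
        return filename.encode("utf-8")
    return filename


encoded = coerce_filename_sketch("/tmp/test\u00fc.txt")
unchanged = coerce_filename_sketch(b"/tmp/test\xfc.txt")
print(encoded, unchanged)
```

Hard-coding UTF-8 trades locale awareness for predictability: the result no longer varies between environments, at the cost of assuming the filesystem stores names as UTF-8.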

ghost commented May 30, 2014

Yes, the patch works.
Thanks a lot!

ahupp (Owner) commented May 30, 2014

Fixed in 012f8a9

ahupp closed this as completed May 30, 2014