Allow unicode filenames in Python 2 #119

mgeier · 2015-03-25T16:03:37Z

Currently we check with isinstance(file, str), which doesn't allow the Python 2 unicode type.

We should probably allow both "raw bytes" and "unicode" in both Python 2 and 3.
We would have to call encode() only on the "unicode" strings.
Alternatively, we could disallow Python 3 bytes objects, but I think it's better to allow them, because they are allowed for standard Python file objects, too.

I'm not sure which encoding is used by default, but we should check if we should probably change to sys.getfilesystemencoding().
This idea is stolen from https://github.com/vokimon/python-wavefile

It's kind of annoying to check for unicode vs bytes, but I guess this would work in 2.6, 2.7 and 3.3+:

if isinstance(file, type(u"")):
    # unicode
if isinstance(file, type(b"")):
    # bytes

We could also just check if it's any of the "string-like" types and then try if there is an encode() method:

if isinstance(file, (type(u""), type(b""))):
    try:
        file = file.encode()
    except AttributeError:
        pass

In the combined check we could also use isinstance(file, (type(u""), bytes)), but I don't think that's better.

If Python 3.0-3.2 compatibility is important (which I think isn't), we could, instead of type(u""), use from __future__ import unicode_literals and check for type(""), but this would lead to problems in other parts of the code (e.g. mode strings) where isinstance(..., str) would have to be extended to accept unicode in Python 2 (which would make things more complicated than before).

As yet another alternative, we could try to encode() before the type check and then drop the check for str and only check for bytes:

try:
    file = file.encode()
except AttributeError:
    pass
if isinstance(file, bytes):
    ...
elif isinstance(file, int):
    ...
...

I think this doesn't restrict compatibility and even looks nicer than the other alternatives, so it is probably the way to go.

There should probably also be tests, but I don't know how the encoding stuff can be reasonably tested cross-platform with the least effort.

See #119.

bastibe · 2015-03-23T21:49:26Z

honestly, I would postpone this until we have tox running and can actually test this in different versions of Python. Also, this should be a relatively popular issue: Are there any best practices for cases like this?

mgeier · 2015-03-24T11:50:10Z

I would postpone this until we have tox running and can actually test this in different versions of Python.

Sure, this isn't urgent at all.
But tox alone doesn't solve the problem, we'd actually have to test it on file systems with different encodings ... I'm not sure if that's doable with reasonable effort.

The code looks quite innocuous:

try:
    file = file.encode(_sys.getfilesystemencoding())
except AttributeError:
    pass

I guess we could just add it and wait if there are any bug reports.

Also, this should be a relatively popular issue: Are there any best practices for cases like this?

I don't know.

I guess normally Python's open() function is used which handles all this internally, doesn't it?
The documentation talks about the encoding of the content, but I don't see anything about the encoding of the file name itself ...

According to this stackoverflow page, open() handles Unicode file names and does The Right Thing(TM). So I guess normally this isn't an issue at all; we just have to take care of the encoding manually because we don't use open().

BTW, the page also mentions sys.getfilesystemencoding(), I guess we should also use this.

See #119.

mgeier · 2015-03-24T14:34:49Z

Hmm ... this is more complicated than I thought ...
I thought using file.encode() in a try block takes care of everything, but the problem is that in Python 2, the str type also has an encode() method (unlike bytes in Python 3).

Therefore, we'll either have to add an additional check for not isinstance(file, bytes) or replace the try block with a check for unicode, which seems to be awkward to do in a Python 2/3 compatible manner ...

Another problem is the call to str() when checking the file name extension. This raises an error in Python 2 if file (or file.name) is unicode.

See #119.

mgeier · 2015-03-25T17:01:37Z

I added two commits to allow Unicode in Python 2, which didn't really work anywhere (ee4a400) and to specifically use sys.getfilesystemencoding() (7405827).

There are no tests yet, because I don't know how test this with little effort (but still reasonably well).
I don't really want to use obscure Unicode characters in our test WAV files.
The current tests pass on CPython 2.6, 2.7, 3.4 and PyPy on Linux.
I also tried with a few exotic Unicode file names and it seems to work fine.

mgeier · 2015-03-29T19:14:19Z

I added another commit (19b7e7c).
This fixes the handling of filenames of type bytes in Python 3 and at the same time it disallows bytes (but of course not Python 2 str) for mode/format/subtype/endian.

I think this is the expected behavior, because it's very close to the built-in open() function.

I currently don't have the motivation to create unit tests that thoroughly test all combinations of unicode/str/bytes containing ASCII and non-ASCII letters for file/mode/format/subtype/endian.
I suggest that we merge this (after some review) and create the tests later (issue #122).
The current tests all pass on CPython 2.6, 2.7, 3.4 and PyPy on Linux.

bastibe · 2015-03-29T19:22:06Z

Oh my, what a terrible situation.

Is there any way to factor the string sanitation out into a function?

At any rate, I feel like a few comments would make things clearer--otherwise this just looks unnecessarily complex.

mgeier · 2015-03-29T19:23:43Z

Sure, that's the next step I'm working on ...

But first I saw that the tests are actually not passing ... another commit is coming up ...

... and disallow bytes for mode/format/subtype/endian. Regarding mode, this is the same behavior as in the built-in open() function.

mgeier · 2015-03-29T19:46:05Z

OK, I replaced the last commit with a (hopefully) fixed one: 0ac4082

mgeier · 2015-03-29T22:01:15Z

I completely refactored __init__() in d2ab9c8.

Better?

mgeier · 2015-03-31T09:12:40Z

If someone is fighting with Unicode implementation issues, I highly recommend watching this video: http://nedbatchelder.com/text/unipain.html

This talks about Python 2 and 3 separately, it's even more pain if you want to support both with the same code base ...

bastibe · 2015-03-31T09:45:32Z

Better?

Yes, very much so.

bastibe · 2015-03-31T11:29:01Z

We now have a bunch of methods that kind of wrap sndfile commands. Maybe now would be the time to factor them out into their own class, and separate the Python part from the C-interop part.

mgeier · 2015-03-31T12:22:24Z

I think we should first make the release 0.7.0 and then do the suggested refactoring in a separate issue/PR.
My last commit isn't even strictly part of this PR, since it has nothing to do with Unicode or file names or anything like that (but it was kind of the same workflow, so I guess it's OK).

Allow unicode filenames in Python 2

mgeier added a commit that referenced this pull request Mar 23, 2015

Allow unicode file names in Python 2

67a7db8

See #119.

mgeier added a commit that referenced this pull request Mar 24, 2015

Allow unicode file names in Python 2

052f2f7

See #119.

mgeier added a commit that referenced this pull request Mar 25, 2015

Allow unicode strings in Python 2

932384c

See #119.

mgeier added 2 commits March 25, 2015 17:01

Allow unicode strings in Python 2

ee4a400

See #119.

Use sys.getfilesystemencoding() to encode Unicode file names

7405827

This was referenced Mar 28, 2015

support unicode filenames #121

Closed

Test Unicode and bytes handling (Python 2 and 3) in all string arguments #122

Open

Fix detection of file extension for bytes (in Py3k)

0ac4082

... and disallow bytes for mode/format/subtype/endian. Regarding mode, this is the same behavior as in the built-in open() function.

mgeier force-pushed the issue-119 branch from 19b7e7c to 0ac4082 Compare March 29, 2015 19:45

Re-factor SoundFile.__init__()

d2ab9c8

bastibe added a commit that referenced this pull request Apr 9, 2015

Merge pull request #119 from bastibe/issue-119

bd8f25e

Allow unicode filenames in Python 2

bastibe merged commit bd8f25e into master Apr 9, 2015

mgeier deleted the issue-119 branch April 9, 2015 17:10

mgeier added this to the 0.7.0 milestone Apr 9, 2015

This was referenced Apr 13, 2015

Support Unicode filenames in Windows with sf_wchar_open() #129

Merged

Full Unicode support on Windows vokimon/python-wavefile#16

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow unicode filenames in Python 2 #119

Allow unicode filenames in Python 2 #119

mgeier commented Mar 25, 2015

bastibe commented Mar 23, 2015

mgeier commented Mar 24, 2015

mgeier commented Mar 24, 2015

mgeier commented Mar 25, 2015

mgeier commented Mar 29, 2015

bastibe commented Mar 29, 2015

mgeier commented Mar 29, 2015

mgeier commented Mar 29, 2015

mgeier commented Mar 29, 2015

mgeier commented Mar 31, 2015

bastibe commented Mar 31, 2015

bastibe commented Mar 31, 2015

mgeier commented Mar 31, 2015

Allow unicode filenames in Python 2 #119

Allow unicode filenames in Python 2 #119

Conversation

mgeier commented Mar 25, 2015

bastibe commented Mar 23, 2015

mgeier commented Mar 24, 2015

mgeier commented Mar 24, 2015

mgeier commented Mar 25, 2015

mgeier commented Mar 29, 2015

bastibe commented Mar 29, 2015

mgeier commented Mar 29, 2015

mgeier commented Mar 29, 2015

mgeier commented Mar 29, 2015

mgeier commented Mar 31, 2015

bastibe commented Mar 31, 2015

bastibe commented Mar 31, 2015

mgeier commented Mar 31, 2015