Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow unicode filenames in Python 2 #119

Merged
merged 4 commits into from
Apr 9, 2015
Merged

Allow unicode filenames in Python 2 #119

merged 4 commits into from
Apr 9, 2015

Conversation

mgeier
Copy link
Contributor

@mgeier mgeier commented Mar 25, 2015

Currently we check with isinstance(file, str), which doesn't allow the Python 2 unicode type.

We should probably allow both "raw bytes" and "unicode" in both Python 2 and 3.
We would have to call encode() only on the "unicode" strings.
Alternatively, we could disallow Python 3 bytes objects, but I think it's better to allow them, because they are allowed for standard Python file objects, too.

I'm not sure which encoding is used by default, but we should check if we should probably change to sys.getfilesystemencoding().
This idea is stolen from https://github.com/vokimon/python-wavefile

It's kind of annoying to check for unicode vs bytes, but I guess this would work in 2.6, 2.7 and 3.3+:

if isinstance(file, type(u"")):
    # unicode
if isinstance(file, type(b"")):
    # bytes

We could also just check if it's any of the "string-like" types and then try if there is an encode() method:

if isinstance(file, (type(u""), type(b""))):
    try:
        file = file.encode()
    except AttributeError:
        pass

In the combined check we could also use isinstance(file, (type(u""), bytes)), but I don't think that's better.

If Python 3.0-3.2 compatibility is important (which I think isn't), we could, instead of type(u""), use from __future__ import unicode_literals and check for type(""), but this would lead to problems in other parts of the code (e.g. mode strings) where isinstance(..., str) would have to be extended to accept unicode in Python 2 (which would make things more complicated than before).

As yet another alternative, we could try to encode() before the type check and then drop the check for str and only check for bytes:

try:
    file = file.encode()
except AttributeError:
    pass
if isinstance(file, bytes):
    ...
elif isinstance(file, int):
    ...
...

I think this doesn't restrict compatibility and even looks nicer than the other alternatives, so it is probably the way to go.

There should probably also be tests, but I don't know how the encoding stuff can be reasonably tested cross-platform with the least effort.

mgeier added a commit that referenced this pull request Mar 23, 2015
@bastibe
Copy link
Owner

bastibe commented Mar 23, 2015

honestly, I would postpone this until we have tox running and can actually test this in different versions of Python. Also, this should be a relatively popular issue: Are there any best practices for cases like this?

@mgeier
Copy link
Contributor Author

mgeier commented Mar 24, 2015

I would postpone this until we have tox running and can actually test this in different versions of Python.

Sure, this isn't urgent at all.
But tox alone doesn't solve the problem, we'd actually have to test it on file systems with different encodings ... I'm not sure if that's doable with reasonable effort.

The code looks quite innocuous:

try:
    file = file.encode(_sys.getfilesystemencoding())
except AttributeError:
    pass

I guess we could just add it and wait if there are any bug reports.

Also, this should be a relatively popular issue: Are there any best practices for cases like this?

I don't know.

I guess normally Python's open() function is used which handles all this internally, doesn't it?
The documentation talks about the encoding of the content, but I don't see anything about the encoding of the file name itself ...

According to this stackoverflow page, open() handles Unicode file names and does The Right Thing(TM). So I guess normally this isn't an issue at all; we just have to take care of the encoding manually because we don't use open().

BTW, the page also mentions sys.getfilesystemencoding(), I guess we should also use this.

mgeier added a commit that referenced this pull request Mar 24, 2015
@mgeier
Copy link
Contributor Author

mgeier commented Mar 24, 2015

Hmm ... this is more complicated than I thought ...
I thought using file.encode() in a try block takes care of everything, but the problem is that in Python 2, the str type also has an encode() method (unlike bytes in Python 3).

Therefore, we'll either have to add an additional check for not isinstance(file, bytes) or replace the try block with a check for unicode, which seems to be awkward to do in a Python 2/3 compatible manner ...

Another problem is the call to str() when checking the file name extension. This raises an error in Python 2 if file (or file.name) is unicode.

mgeier added a commit that referenced this pull request Mar 25, 2015
@mgeier
Copy link
Contributor Author

mgeier commented Mar 25, 2015

I added two commits to allow Unicode in Python 2, which didn't really work anywhere (ee4a400) and to specifically use sys.getfilesystemencoding() (7405827).

There are no tests yet, because I don't know how test this with little effort (but still reasonably well).
I don't really want to use obscure Unicode characters in our test WAV files.
The current tests pass on CPython 2.6, 2.7, 3.4 and PyPy on Linux.
I also tried with a few exotic Unicode file names and it seems to work fine.

@mgeier
Copy link
Contributor Author

mgeier commented Mar 29, 2015

I added another commit (19b7e7c).
This fixes the handling of filenames of type bytes in Python 3 and at the same time it disallows bytes (but of course not Python 2 str) for mode/format/subtype/endian.

I think this is the expected behavior, because it's very close to the built-in open() function.

I currently don't have the motivation to create unit tests that thoroughly test all combinations of unicode/str/bytes containing ASCII and non-ASCII letters for file/mode/format/subtype/endian.
I suggest that we merge this (after some review) and create the tests later (issue #122).
The current tests all pass on CPython 2.6, 2.7, 3.4 and PyPy on Linux.

@bastibe
Copy link
Owner

bastibe commented Mar 29, 2015

Oh my, what a terrible situation.

Is there any way to factor the string sanitation out into a function?

At any rate, I feel like a few comments would make things clearer--otherwise this just looks unnecessarily complex.

@mgeier
Copy link
Contributor Author

mgeier commented Mar 29, 2015

Sure, that's the next step I'm working on ...

But first I saw that the tests are actually not passing ... another commit is coming up ...

... and disallow bytes for mode/format/subtype/endian.

Regarding mode, this is the same behavior as in the built-in open() function.
@mgeier
Copy link
Contributor Author

mgeier commented Mar 29, 2015

OK, I replaced the last commit with a (hopefully) fixed one: 0ac4082

@mgeier
Copy link
Contributor Author

mgeier commented Mar 29, 2015

I completely refactored __init__() in d2ab9c8.

Better?

@mgeier
Copy link
Contributor Author

mgeier commented Mar 31, 2015

If someone is fighting with Unicode implementation issues, I highly recommend watching this video: http://nedbatchelder.com/text/unipain.html

This talks about Python 2 and 3 separately, it's even more pain if you want to support both with the same code base ...

@bastibe
Copy link
Owner

bastibe commented Mar 31, 2015

Better?

Yes, very much so.

@bastibe
Copy link
Owner

bastibe commented Mar 31, 2015

We now have a bunch of methods that kind of wrap sndfile commands. Maybe now would be the time to factor them out into their own class, and separate the Python part from the C-interop part.

@mgeier
Copy link
Contributor Author

mgeier commented Mar 31, 2015

I think we should first make the release 0.7.0 and then do the suggested refactoring in a separate issue/PR.
My last commit isn't even strictly part of this PR, since it has nothing to do with Unicode or file names or anything like that (but it was kind of the same workflow, so I guess it's OK).

bastibe added a commit that referenced this pull request Apr 9, 2015
Allow unicode filenames in Python 2
@bastibe bastibe merged commit bd8f25e into master Apr 9, 2015
@mgeier mgeier deleted the issue-119 branch April 9, 2015 17:10
@mgeier mgeier added this to the 0.7.0 milestone Apr 9, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants