Use bytes paths in plat_other #13

takluyver · 2017-08-01T11:33:17Z

This fixes issues with non-ascii paths when using the fallback freedesktop trash implementation.

This is a rather ugly set of changes, but it's based on two things:

urllib.quote() on Python 2 requires bytes. On Python 3, it can handle unicode or bytes.
Paths on Unix filesystems are really bytes. We can usually represent them as unicode, but for 100% fidelity, they should be passed around as bytes.

So this is a complete switch around from #12: rather than converting all paths to unicode when they're passed in, convert them all to bytes.

I added tests for passing in a path as unicode or as bytes. Before the changes, both tests fail under Python 2. Afterwards, everything passes, on both versions of Python.

ghost · 2017-08-02T01:35:07Z

Thanks for tackling this. It is indeed a bit ugly, but necessary. I agree with you about handling path through bytes.

Before we decide to take this path, here's an idea: what about using pathlib? Doesn't it properly handle the whole str/bytes issue? Sure, in some cases, it means bringing in a dependency, but only on python2 (or python3 < 3.4) and unix. That doesn't seem like a big deal to me.

What do you think?

ghost · 2017-08-02T01:40:48Z

Oh, nevermind, I just fired an interpreter to see that Path() doesn't even take a byte string. useless.

Will review the PR soon...

miigotu · 2017-08-02T06:08:51Z

I disagree with your premise to just use bytes. You will have issues when the paths are not in the same encoding as the system. full unicode always works, just use something different than urllib.quote imo. There is a reason python3 uses unicode as default str type, and this PR breaks cases with python3

takluyver · 2017-08-02T08:07:50Z

@miigotu I'm generally on board with using unicode by default, but for Unix filenames specifically, I think the underlying implementation needs to deal with bytes, because that's what the filenames are. The public API definitely needs to accept unicode paths, though.

On my computer, sys.getfilesystemencoding() reports UTF-8, as it will on most modern Linux & Mac systems. But I can create a filename that's not valid UTF-8:

with open(b'abc\xe1', 'w'): pass

If we try to convert that to unicode on Python 2, it doesn't work:

>>> b'abc\xe1'.decode(sys.getfilesystemencoding())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe1 in position 3: unexpected end of data

There is no unicode string which my computer will understand as referring to that file (at least on Python 2 - Python 3 has a trick to do it). The only reliable way to store the name for it is as bytes.

You will have issues when the paths are not in the same encoding as the system.

The current implementation already has those issues - it decodes bytes paths using sys.getfilesystemencoding(), so if they're in a different encoding, it produces the wrong unicode, which Python will then convert back into the wrong bytes. I don't see anything clever we can do here: wrong data is wrong data.

Again, note that both the current implementation and this PR allow passing either bytes or unicode into the function. The only difference is what they do with it internally.

ghost · 2017-08-02T11:55:56Z

You will have issues when the paths are not in the same encoding as the system.

@miigotu this is precisely the case where you will have issues using unicode. A latin-1 FS mounted on a utf-8 system will likely have troubles. I haven't tested lately and I think that python 3 does some surrogate unicode decode dance, which partially mitigate the issue, but it's still a hack.

Even if os.listdir('/path') on a latin-1 FS properly returns a str with surrogates, you're still going to have problems if you mix this path with a utf-8 path. Your code will break when it's time to encode the thing. So, in all cases, you're in trouble.

In cases when you don't need to display the path, the most reliable way to refer to path is to use, for example os.listdir(b'/path'), which will return byte paths and then make sure you don't mix them with strings.

If you need to display paths in your UI, then you're in trouble anyway. You'll have to implement encoding-guessing algorithms and all. But that's something you'll do at the application level.

As @takluyver says, at the library level, we have to support byte paths properly without trying to decode them.

ghost · 2017-08-04T00:42:53Z

This is good stuff, merging. Thanks @takluyver . Will create a v1.4 release after a little cooldown period.

Also, as you suggested earlier, I've added travis config so that tests are ran automatically :)

takluyver · 2017-08-04T07:31:03Z

Thanks! If you could ping me here when the release happens, that would be great. :-)

ghost · 2017-08-08T01:35:31Z

@takluyver ping!

takluyver · 2017-08-08T10:46:39Z

Thank-you! That's made our tests pass :-).

I saw that there was a minor hiccup with Windows requiring a quick 1.4.1. I'm happy to write an Appveyor config and a couple of tests if you're interested in more CI. I'm not going to evangelise this, though - I only usually bother with one CI service for most of my projects.

takluyver added 3 commits August 1, 2017 11:07

Fix test on Python 3

dd69eda

Add failing test (on Python 2) for unicode file names

4181ed6

Use bytes throughout plat_other

7fece24

ghost self-assigned this Aug 2, 2017

ghost merged commit 7fece24 into arsenetar:master Aug 4, 2017

takluyver deleted the unicode-trash branch August 4, 2017 07:29

This pull request was closed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use bytes paths in plat_other #13

Use bytes paths in plat_other #13

takluyver commented Aug 1, 2017

ghost commented Aug 2, 2017

ghost commented Aug 2, 2017

miigotu commented Aug 2, 2017

takluyver commented Aug 2, 2017

ghost commented Aug 2, 2017 •

edited by ghost

ghost commented Aug 4, 2017

takluyver commented Aug 4, 2017

ghost commented Aug 8, 2017

takluyver commented Aug 8, 2017

Use bytes paths in plat_other #13

Use bytes paths in plat_other #13

Conversation

takluyver commented Aug 1, 2017

ghost commented Aug 2, 2017

ghost commented Aug 2, 2017

miigotu commented Aug 2, 2017

takluyver commented Aug 2, 2017

ghost commented Aug 2, 2017 • edited by ghost

ghost commented Aug 4, 2017

takluyver commented Aug 4, 2017

ghost commented Aug 8, 2017

takluyver commented Aug 8, 2017

ghost commented Aug 2, 2017 •

edited by ghost