Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use bytes paths in plat_other #13

Merged
3 commits merged into from Aug 4, 2017
Merged

Conversation

takluyver
Copy link
Contributor

This fixes issues with non-ascii paths when using the fallback freedesktop trash implementation.

This is a rather ugly set of changes, but it's based on two things:

  1. urllib.quote() on Python 2 requires bytes. On Python 3, it can handle unicode or bytes.
  2. Paths on Unix filesystems are really bytes. We can usually represent them as unicode, but for 100% fidelity, they should be passed around as bytes.

So this is a complete switch around from #12: rather than converting all paths to unicode when they're passed in, convert them all to bytes.

I added tests for passing in a path as unicode or as bytes. Before the changes, both tests fail under Python 2. Afterwards, everything passes, on both versions of Python.

@ghost
Copy link

ghost commented Aug 2, 2017

Thanks for tackling this. It is indeed a bit ugly, but necessary. I agree with you about handling path through bytes.

Before we decide to take this path, here's an idea: what about using pathlib? Doesn't it properly handle the whole str/bytes issue? Sure, in some cases, it means bringing in a dependency, but only on python2 (or python3 < 3.4) and unix. That doesn't seem like a big deal to me.

What do you think?

@ghost ghost self-assigned this Aug 2, 2017
@ghost
Copy link

ghost commented Aug 2, 2017

Oh, nevermind, I just fired an interpreter to see that Path() doesn't even take a byte string. useless.

Will review the PR soon...

@miigotu
Copy link

miigotu commented Aug 2, 2017

I disagree with your premise to just use bytes. You will have issues when the paths are not in the same encoding as the system. full unicode always works, just use something different than urllib.quote imo. There is a reason python3 uses unicode as default str type, and this PR breaks cases with python3

@takluyver
Copy link
Contributor Author

@miigotu I'm generally on board with using unicode by default, but for Unix filenames specifically, I think the underlying implementation needs to deal with bytes, because that's what the filenames are. The public API definitely needs to accept unicode paths, though.

On my computer, sys.getfilesystemencoding() reports UTF-8, as it will on most modern Linux & Mac systems. But I can create a filename that's not valid UTF-8:

with open(b'abc\xe1', 'w'): pass

If we try to convert that to unicode on Python 2, it doesn't work:

>>> b'abc\xe1'.decode(sys.getfilesystemencoding())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe1 in position 3: unexpected end of data

There is no unicode string which my computer will understand as referring to that file (at least on Python 2 - Python 3 has a trick to do it). The only reliable way to store the name for it is as bytes.

You will have issues when the paths are not in the same encoding as the system.

The current implementation already has those issues - it decodes bytes paths using sys.getfilesystemencoding(), so if they're in a different encoding, it produces the wrong unicode, which Python will then convert back into the wrong bytes. I don't see anything clever we can do here: wrong data is wrong data.

Again, note that both the current implementation and this PR allow passing either bytes or unicode into the function. The only difference is what they do with it internally.

@ghost
Copy link

ghost commented Aug 2, 2017

You will have issues when the paths are not in the same encoding as the system.

@miigotu this is precisely the case where you will have issues using unicode. A latin-1 FS mounted on a utf-8 system will likely have troubles. I haven't tested lately and I think that python 3 does some surrogate unicode decode dance, which partially mitigate the issue, but it's still a hack.

Even if os.listdir('/path') on a latin-1 FS properly returns a str with surrogates, you're still going to have problems if you mix this path with a utf-8 path. Your code will break when it's time to encode the thing. So, in all cases, you're in trouble.

In cases when you don't need to display the path, the most reliable way to refer to path is to use, for example os.listdir(b'/path'), which will return byte paths and then make sure you don't mix them with strings.

If you need to display paths in your UI, then you're in trouble anyway. You'll have to implement encoding-guessing algorithms and all. But that's something you'll do at the application level.

As @takluyver says, at the library level, we have to support byte paths properly without trying to decode them.

@ghost
Copy link

ghost commented Aug 4, 2017

This is good stuff, merging. Thanks @takluyver . Will create a v1.4 release after a little cooldown period.

Also, as you suggested earlier, I've added travis config so that tests are ran automatically :)

@ghost ghost merged commit 7fece24 into arsenetar:master Aug 4, 2017
@takluyver takluyver deleted the unicode-trash branch August 4, 2017 07:29
@takluyver
Copy link
Contributor Author

Thanks! If you could ping me here when the release happens, that would be great. :-)

@ghost
Copy link

ghost commented Aug 8, 2017

@takluyver ping!

@takluyver
Copy link
Contributor Author

Thank-you! That's made our tests pass :-).

I saw that there was a minor hiccup with Windows requiring a quick 1.4.1. I'm happy to write an Appveyor config and a couple of tests if you're interested in more CI. I'm not going to evangelise this, though - I only usually bother with one CI service for most of my projects.

This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants