Use pathlib everywhere in beets #1409

Open
brunal opened this Issue Apr 9, 2015 · 9 comments

Collaborator

brunal commented Apr 9, 2015

Pathlib is awesome. It abstracts away the platform for path manipulation. It would give us more robust code and let us get rid of all the calls to the beets.util.*_path() functions. It could also be extended to provide the repr() we want.

I feel like working on it. It is huge though.

  • Am I missing a delicate point that prevents us from using it?
  • Should it be a 1.3.12 milestone?
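To illustrate the platform abstraction mentioned above, here is a minimal sketch (not beets code; the paths are made up). The pure path classes apply each platform's rules without touching the filesystem, so both flavours can be tried on any OS:

```python
from pathlib import PurePosixPath, PureWindowsPath

# Pure paths only manipulate text, so both flavours work on any platform.
posix = PurePosixPath("/music") / "Artist" / "Album" / "01 Track.mp3"
windows = PureWindowsPath("C:/music") / "Artist" / "Album" / "01 Track.mp3"

print(posix)                # /music/Artist/Album/01 Track.mp3
print(windows)              # C:\music\Artist\Album\01 Track.mp3
print(posix.suffix)         # .mp3
print(windows.parent.name)  # Album
```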

sampsyo Apr 9, 2015

Member

I'm all for it. Our current path handling stuff is too fiddly.

I have one important concern: encodings. The core problem is that Unix paths need to be bytes and Windows paths need to be Unicode strings (more or less). We need to store paths in a database, so eventually they need to be represented internally as one or the other. Beets currently chooses byte strings, and on Windows the "true" paths are encoded with UTF-8 just for storage and uniform manipulation. I'm not quite clear on how pathlib would interact with this constraint.
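A small sketch of the storage constraint described above, assuming Python 3 and a made-up one-column table (this is not the real beets schema). Byte-string paths go into SQLite as BLOBs and come back unchanged, and a Unicode Windows path can be encoded to UTF-8 purely for storage:

```python
import sqlite3

# Hypothetical schema for illustration only; not beets' real database layout.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (path BLOB)")

# A Unix filename containing a byte that is not valid UTF-8.
raw_path = b"/music/caf\xe9.mp3"
conn.execute("INSERT INTO items (path) VALUES (?)", (raw_path,))

# The bytes come back exactly as they were stored.
(stored,) = conn.execute("SELECT path FROM items").fetchone()

# A Windows-style Unicode path, encoded as UTF-8 just for storage.
win_path = "C:\\music\\café.mp3"
encoded = win_path.encode("utf-8")
decoded = encoded.decode("utf-8")
```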

I also don't know how pathlib handles long names on Windows (the issue described here).

Maybe the right approach here would be to start with a small prototype, see how it goes on different platforms, and then attempt the full migration.

brunal Apr 10, 2015

Collaborator

Currently,

  • all paths are byte strings
  • when there is a path in the input command we encode it with beets.util._fsencoding()
  • when we display a path we decode it with that same encoding
  • displayable_path() does that operation
  • syspath() does some ugly dark magic
  • bytestring_path() always gives a byte string path, no matter the input
  • normpath() expands everything and gives a byte string

Is that right? Am I forgetting something?

Python 3.2 introduced os.fs{en,de}code(): https://docs.python.org/3/library/os.html#os.fsencode
The source is pretty simple (Lib/os.py:800) and shares some traits with our functions. However, with the mbcs encoding it uses a different error mode: strict instead of surrogateescape.
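For reference, the difference between the two error modes can be sketched in Python 3 as follows. The example spells out the codec and error handler explicitly instead of calling os.fsencode(), so the result does not depend on the local filesystem encoding; the filename is made up:

```python
raw = b"caf\xe9.mp3"  # Latin-1 bytes, invalid as UTF-8

# surrogateescape maps each undecodable byte to a lone surrogate code point...
text = raw.decode("utf-8", errors="surrogateescape")
# ...and encoding with the same handler restores the original bytes.
restored = text.encode("utf-8", errors="surrogateescape")

# strict mode rejects the same input instead of escaping it.
try:
    raw.decode("utf-8", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

Here `text` is an ordinary str that can be passed around, and the round trip back to bytes is lossless.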

I count ~100 {en,de}code calls in the beets/ folder. It'd be good to get rid of most of them.

brunal Apr 10, 2015

Collaborator

  • surrogateescape is not natively available in Python 2
  • pathlib expects Unicode paths
  • the pathlib port for Python 2 is not up to date with the 3.4 version and does not use surrogateescape (since it's not natively available)

I think the solution is to use python-future (https://github.com/PythonCharmers/python-future) which provides a future.utils.surrogateescape module.

sampsyo Apr 10, 2015

Member

Yes, surrogate escaping is probably now the Pythonic way of doing this. A few thoughts/questions:

  • How hard will it be to make pathlib use surrogate escaping on Python 2 via python-future?
  • How will this work with people's existing databases? Shall we incrementally "upgrade" current bytestring paths to surrogate-escaped Unicode?
  • How good is our current test coverage for unexpected, badly-encoded filenames? A few more tests might be useful to make sure we don't break anything.
  • I'm not sure what to do about that "strict" mode on Windows in the new fsencode. We'll at least need to handle this case to avoid crashing.
  • A cursory glance at the source suggests that pathlib does not, by itself, do the \\?\ prefix on Windows that syspath does. It's possible I'm missing something, but if that's the case, it's an important oversight—we'll need to find some way to add that back. 😬
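On the last point, a rough sketch of how the prefix could be added back on top of pathlib. The helper name is hypothetical and this is not what syspath actually does; it only handles absolute drive paths and leaves relative and UNC paths untouched:

```python
from pathlib import PureWindowsPath

# The literal characters \\?\ that lift the Windows path length limit.
PREFIX = "\\\\?\\"


def with_long_path_prefix(path):
    """Hypothetical helper: add the long-path prefix to a drive path."""
    text = str(path)
    if text.startswith(PREFIX):
        return text  # already prefixed
    win = PureWindowsPath(text)
    if not win.is_absolute() or win.drive.startswith("\\\\"):
        # Relative and UNC paths need different handling; left alone here.
        return text
    return PREFIX + str(win)


print(with_long_path_prefix("C:/music/a.mp3"))
```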

brunal Apr 10, 2015

Collaborator

Roughly,

  • I don't know yet. I feel like I won't be able to use the proposed pathlib backport as is, and will need another one that installs surrogate escaping from python-future before the real pathlib code.
  • I'm not sure updating the db model is needed: just take the bytes before sending it to the db
  • It's decent but not that good. Right now it breaks every few weeks (due to __future__.unicode_literals)
  • I believe Windows only provides Unicode filenames and deals with encoding problems itself, so we don't need to do anything
  • Indeed it does not! Another thing missing is path expansion (~ → /home/foo). Since Pathlib is OO, it's just a matter of subclassing it and adding/overriding required methods.
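A minimal sketch of the subclassing idea, under some assumptions: the class and method names are made up, and it subclasses PurePosixPath so it runs the same on every platform (the concrete Path class has historically been harder to subclass). It also shows taking the bytes just before handing a path to the database:

```python
from pathlib import PurePosixPath


class ExpandingPath(PurePosixPath):
    """Hypothetical subclass that adds home-directory expansion."""

    def expand_home(self, home):
        # Replace a leading "~" component with the given home directory.
        if self.parts and self.parts[0] == "~":
            return type(self)(home, *self.parts[1:])
        return self


path = ExpandingPath("~/music").expand_home("/home/foo")
print(path)         # /home/foo/music

# bytes() on a pure path encodes it with the filesystem encoding.
db_value = bytes(path)
```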

sampsyo Apr 10, 2015

Member

OK, this migration plan sounds good.

On Windows exceptions: Yes, if the OS can be trusted to always supply Unicode, then we should be OK. Hopefully, paths do not come from any other source that could contain surrogate escapes.

On the backport: FWIW, it looks like someone has tried starting a maintained backport, but it appears abandoned.

LordSputnik Jul 14, 2015

Collaborator

I don't think that pathlib is able to replace truncate/sanitize_path from what I've seen of it. It's a module that provides many of the functions of os.path, with some corrections, in an object-oriented way, and makes some useful information about the path available (e.g. root, suffix, filename). While this in itself is useful and could clean up a lot of code, it looks to me like we would still have to do the encoding/decoding of the path, and the removal of unsafe characters.
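To make the split concrete, here is a sketch in which pathlib handles the splitting and joining while the character filtering stays our own job. The set of unsafe characters and the function names are assumptions for illustration, not the rules beets actually applies, and truncation is not covered:

```python
import re
from pathlib import PurePosixPath

# Assumed set of characters to replace; a real list would be configurable.
UNSAFE = re.compile(r'[\\<>:"?*|\x00-\x1f]')


def sanitize_component(name, replacement="_"):
    """Replace unsafe characters in a single path component."""
    return UNSAFE.sub(replacement, name)


def sanitize(path):
    """Sanitize every component of a path except its root."""
    p = PurePosixPath(path)
    cleaned = [
        part if part == p.anchor else sanitize_component(part)
        for part in p.parts
    ]
    return PurePosixPath(*cleaned)


print(sanitize("/music/AC:DC/What?.mp3"))  # /music/AC_DC/What_.mp3
```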

sampsyo Jul 14, 2015

Member

Agreed; while pathlib will definitely be nice, it will not solve our encoding and sanitation problems. Those will still be 100% up to us.

jrobeson Jul 8, 2016

Contributor

there's now a maintained pathlib port that matches the stdlib version https://github.com/mcmtroffaes/pathlib2

@brunal : do you still wanna work on this? I'd love to help out.
