Use pathlib everywhere in beets #1409

Open
brunal opened this Issue Apr 9, 2015 · 9 comments

4 participants

@brunal
Collaborator
brunal commented Apr 9, 2015

Pathlib is awesome. It abstracts away the platform for path manipulation. It would allow us to have more robust code and get rid of all the calls to beets.util.*_path() functions. It could be extended to provide the desired repr().
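As a rough illustration of that platform abstraction, here is a minimal sketch using pathlib's pure path classes, which can be instantiated on any OS. The directory and file names are made up for the example.

```python
from pathlib import PurePosixPath, PureWindowsPath


def describe(path):
    """Return a few of the attributes pathlib exposes on every flavour."""
    return {"parent": str(path.parent), "name": path.name, "suffix": path.suffix}


# The same joining code works for both flavours; only the class differs.
posix_track = PurePosixPath("/music") / "Artist" / "Album" / "01 Track.mp3"
windows_track = PureWindowsPath("C:/music") / "Artist" / "Album" / "01 Track.mp3"

print(posix_track)    # uses forward slashes
print(windows_track)  # uses backslashes
print(describe(posix_track))
```

The pure classes do no I/O, so they are handy for testing path logic for a platform other than the one the tests run on.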

I feel like working on it. It is huge though.

  • Am I missing a delicate point that prevents us from using it?
  • Should it be a 1.3.12 milestone?
@sampsyo
Member
sampsyo commented Apr 9, 2015

I'm all for it. Our current path handling stuff is too fiddly.

I have one important concern: encodings. The core problem is that Unix paths need to be bytes and Windows paths need to be Unicode strings (more or less). We need to store paths in a database, so eventually they need to be represented internally as one or the other. Beets currently chooses byte strings, and on Windows the "true" paths are encoded with UTF-8 just for storage and uniform manipulation. I'm not quite clear on how pathlib would interact with this constraint.
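To make the storage constraint concrete, here is a minimal sketch of the convention described above: bytes everywhere internally, with Windows Unicode paths encoded as UTF-8. The function names `to_stored_bytes` and `from_stored_bytes` and the `on_windows` flag are hypothetical and are not the actual beets helpers.

```python
def to_stored_bytes(path):
    """Return the byte string that would be stored for this path.

    Assumption: Unix paths arrive as bytes and are kept untouched, while
    Windows paths arrive as Unicode strings and are encoded with UTF-8.
    """
    if isinstance(path, bytes):
        return path
    return path.encode("utf-8")


def from_stored_bytes(stored, on_windows):
    """Turn stored bytes back into the form the OS expects."""
    if on_windows:
        return stored.decode("utf-8")  # Windows APIs want Unicode
    return stored  # Unix APIs accept the raw bytes


stored = to_stored_bytes("C:\\Música\\song.mp3")
print(from_stored_bytes(stored, on_windows=True))
```

Any pathlib-based design would need an equivalent of this boundary, since pathlib objects wrap Unicode strings rather than bytes.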

I also don't know how pathlib handles long names on Windows (the issue described here).

Maybe the right approach here would be to start with a small prototype, see how it goes on different platforms, and then attempt the full migration.

@brunal
Collaborator
brunal commented Apr 10, 2015

Currently,

  • all paths are byte strings
  • when there is a path in the input command we encode it with beets.util._fsencoding()
  • when we display a path we decode it with that same encoding
  • displayable_path() does that operation
  • syspath() does some ugly dark magic
  • bytestring_path() always gives a byte string path, no matter the input
  • normpath() expands everything and gives a byte string

Is that right? Am I forgetting something?

Python 3.2 introduced os.fs{en,de}code(): https://docs.python.org/3/library/os.html#os.fsencode
The source is pretty simple (Lib/os.py:800) and shares some traits with our functions. However, the mbcs encoding only changes the error mode: strict instead of surrogateescape.
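For reference, a small sketch of what the two error modes do with a badly-encoded filename. It spells out the UTF-8 codec explicitly so the result does not depend on the local filesystem encoding; os.fsencode() and os.fsdecode() pick the encoding and error handler themselves. The helper names are made up for the example.

```python
def decode_path(raw, errors="surrogateescape"):
    """Decode path bytes as UTF-8 with the given error handler."""
    return raw.decode("utf-8", errors)


def encode_path(text, errors="surrogateescape"):
    """Encode a (possibly surrogate-escaped) string back to bytes."""
    return text.encode("utf-8", errors)


raw = b"caf\xe9.mp3"  # Latin-1 bytes, not valid UTF-8

# surrogateescape smuggles the undecodable byte through as a lone surrogate,
# so the original bytes can be recovered exactly.
text = decode_path(raw)
round_tripped = encode_path(text)

# strict refuses to decode the same bytes.
try:
    decode_path(raw, errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

The round trip is what makes surrogate-escaped strings usable as a lossless stand-in for byte paths.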

I count ~100 {en,de}code calls in the beets/ folder. It'd be good to get rid of most of them.

@brunal
Collaborator
brunal commented Apr 10, 2015
  • surrogateescape is not natively available in python 2
  • pathlib expects unicode paths
  • pathlib port on python 2 is not up-to-date with the 3.4 version and does not use surrogateescape (since it's not natively available)

I think the solution is to use python-future (https://github.com/PythonCharmers/python-future) which provides a future.utils.surrogateescape module.

@sampsyo
Member
sampsyo commented Apr 10, 2015

Yes, surrogate escaping is probably now the Pythonic way of doing this. A few thoughts/questions:

  • How hard will it be to make pathlib use surrogate escaping on Python 2 via python-future?
  • How will this work with people's existing databases? Shall we incrementally "upgrade" current bytestring paths to surrogate-escaped Unicode?
  • How good is our current test coverage for unexpected, badly-encoded filenames? A few more tests might be useful to make sure we don't break anything.
  • I'm not sure what to do about that "strict" mode on Windows in the new fsencode. We'll at least need to handle this case to avoid crashing.
  • A cursory glance at the source suggests that pathlib does not, by itself, do the \\?\ prefix on Windows that syspath does. It's possible I'm missing something, but if that's the case, it's an important oversight—we'll need to find some way to add that back. 😬
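One way the prefix could be layered on top of pathlib is sketched below. This is only an illustration under stated assumptions, not the logic of syspath: the name `with_long_path_prefix` is hypothetical, and PureWindowsPath is used so the snippet runs on any OS.

```python
from pathlib import PureWindowsPath

LONG_PATH_PREFIX = "\\\\?\\"  # the literal characters \\?\


def with_long_path_prefix(path):
    """Return a Windows path string carrying the \\\\?\\ prefix.

    Assumption: only absolute paths get the prefix; relative paths are
    returned normalized but otherwise unchanged.
    """
    text = str(path)
    if text.startswith(LONG_PATH_PREFIX):
        return text  # already prefixed
    normalized = PureWindowsPath(text)
    if not normalized.is_absolute():
        return str(normalized)
    text = str(normalized)
    if text.startswith("\\\\"):
        # UNC share: \\server\share becomes \\?\UNC\server\share
        return LONG_PATH_PREFIX + "UNC\\" + text[2:]
    return LONG_PATH_PREFIX + text


print(with_long_path_prefix("C:/Music/Artist/01 Track.mp3"))
```

Applying the prefix only at the point where a path is handed to the OS keeps the rest of the code free of it.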
@brunal
Collaborator
brunal commented Apr 10, 2015

Roughly,

  • I don't know yet. I feel like I won't be able to use the proposed pathlib backport, but rather another one that would install surrogate escaping from python-future before the real pathlib code.
  • I'm not sure updating the db model is needed: just take the bytes before sending it to the db
  • It's decent but not that good. Right now it breaks every few weeks (due to __future__.unicode_literals)
  • I believe Windows only provides Unicode filenames and deals with encoding problems itself, so we don't need to do anything
  • Indeed it does not! Another thing missing is path expansion (~ → /home/foo). Since Pathlib is OO, it's just a matter of subclassing it and adding/overriding required methods.
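A minimal sketch of the subclassing idea, assuming Python 3 and surrogate-escaped strings inside the path object. The class name `LibraryPath` and its methods are hypothetical. It subclasses PurePosixPath to stay platform-independent and shows a custom repr() plus "take the bytes" just before a database write.

```python
from pathlib import PurePosixPath


class LibraryPath(PurePosixPath):
    """Hypothetical path type with beets-style helpers bolted on."""

    def displayable(self):
        # Swap any escaped, undecodable bytes for U+FFFD so the result
        # is always safe to print.
        raw = str(self).encode("utf-8", "surrogateescape")
        return raw.decode("utf-8", "replace")

    def db_bytes(self):
        # Recover the original bytes right before storing them.
        return str(self).encode("utf-8", "surrogateescape")

    def __repr__(self):
        return "LibraryPath({!r})".format(self.displayable())


raw = b"/music/caf\xe9.mp3"  # badly-encoded name coming from the filesystem
path = LibraryPath(raw.decode("utf-8", "surrogateescape"))
print(repr(path))
print(path.db_bytes() == raw)
```

Operations such as `/` and `.parent` return instances of the subclass, so the extra methods stay available after path manipulation.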
@sampsyo
Member
sampsyo commented Apr 10, 2015

OK, this migration plan sounds good.

On Windows exceptions: Yes, if the OS can be trusted to always supply Unicode, then we should be OK. Hopefully, paths do not come from any other source that could contain surrogate escapes.

On the backport: FWIW, it looks like someone has tried starting a maintained backport, but it appears abandoned.

@LordSputnik
Collaborator

I don't think that pathlib is able to replace truncate_path/sanitize_path from what I've seen of it. It's a module for providing many of the functions of os.path with some corrections in an object-oriented way, and making some useful information about the path available (e.g. root, suffix, filename). While this in itself is useful and could clean up a lot of code, it looks to me like we would still have to do the encoding/decoding of the path, and removal of unsafe characters.
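To illustrate the point, here is a rough sketch of sanitizing and truncating on top of pathlib. The character set, the length limit and the function name `clean_path` are assumptions for the example and do not reproduce what beets' own functions do; pathlib only supplies the splitting into components and the suffix.

```python
import re
from pathlib import PurePosixPath

# Assumption: characters treated as unsafe (illegal on Windows or control codes).
UNSAFE_CHARS = re.compile(r'[<>:"?*|\\\x00-\x1f]')
MAX_COMPONENT_LENGTH = 255  # assumption: a common per-component limit


def clean_path(path, length=MAX_COMPONENT_LENGTH, replacement="_"):
    """Sanitize and truncate every component of a path.

    pathlib does neither of these on its own. Assumes length is larger
    than any suffix it has to preserve.
    """
    cleaned = []
    for part in PurePosixPath(path).parts:
        part = UNSAFE_CHARS.sub(replacement, part)
        suffix = PurePosixPath(part).suffix
        stem = part[: len(part) - len(suffix)] if suffix else part
        # Shorten the stem so the component, suffix included, fits the limit.
        cleaned.append(stem[: length - len(suffix)] + suffix)
    return PurePosixPath(*cleaned)


print(clean_path("Artist/What?/01: Intro.mp3"))
```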

@sampsyo
Member
sampsyo commented Jul 14, 2015

Agreed; while pathlib will definitely be nice, it will not solve our encoding and sanitization problems. Those will still be 100% up to us.

@jrobeson
Contributor
jrobeson commented Jul 8, 2016 edited

there's now a maintained pathlib port that matches the stdlib version https://github.com/mcmtroffaes/pathlib2

@brunal : do you still wanna work on this? I'd love to help out.
