Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use pathlib everywhere in beets #1409

Open
brunal opened this issue Apr 9, 2015 · 15 comments
Open

Use pathlib everywhere in beets #1409

brunal opened this issue Apr 9, 2015 · 15 comments

Comments

@brunal
Copy link
Collaborator

@brunal brunal commented Apr 9, 2015

Pathlib is awesome. It abstracts path the platform for path manipulation. It would allow us to have more robust code and get rid to all the calls to beets.util.*_path() functions. It could be extended to provide desired repr().

I feel like working on it. It is huge though.

  • Am I missing a delicate point that prevents us from using it
  • Should it be a 1.3.12 milestone?
@sampsyo
Copy link
Member

@sampsyo sampsyo commented Apr 9, 2015

I'm all for it. Our current path handling stuff is too fiddly.

I have one important concern: encodings. The core problem is that Unix paths need to be bytes and Windows paths need to be Unicode strings (more or less). We need to store paths in a database, so eventually they need to be represented internally as one or the other. Beets currently chooses byte strings, and on Windows the "true" paths are encoded with UTF-8 just for storage and uniform manipulation. I'm not quite clear on how pathlib would interact with this constraint.

I also don't know how pathlib handles long names on Windows (the issue described here).

Maybe the right approach here would be to start with a small prototype, see how it goes on different platforms, and then attempt the full migration.

@brunal
Copy link
Collaborator Author

@brunal brunal commented Apr 10, 2015

Currently,

  • all paths are byte strings
  • when there is a path in the input command we encode it with beets.util._fsencoding()
  • when we display a path we decode it with that same encoding
  • displayable_path() does that operation
  • syspath() does some ugly dark magic
  • bytestring_path() ensures always gives a byte string path, no matter the input
  • normpath() expands everything and gives a byte string

Is that right? Am I forgetting something?

Python 3.2 introduced os.fs{en,de}code(): https://docs.python.org/3/library/os.html#os.fsencode
The source is pretty simple (Lib/os.py:800) and shares some traits with our functions. However mbcs encoding only triggers another error mode: strict instead of surrogateescape.

I count ~100 {en,de}code calls in the beets/ folder. It'd be good to get rid of most of them.

@brunal
Copy link
Collaborator Author

@brunal brunal commented Apr 10, 2015

  • surrogateescape is not natively available in python 2
  • pathlib expects unicode paths
  • pathlib port on python 2 is not up-to-date with the 3.4 version and does not use surrogateescape (since it's not natively available)

I think the solution is to use python-future (https://github.com/PythonCharmers/python-future) which provides a future.utils.surrogateescape module.

@sampsyo
Copy link
Member

@sampsyo sampsyo commented Apr 10, 2015

Yes, surrogate escaping is probably now the Pythonic way of doing this. A few thoughts/questions:

  • How hard will it be to be to make pathlib use surrogate escaping on Python 2 via python-future?
  • How will this work with people's existing databases? Shall we incrementally "upgrade" current bytestring paths to surrogate-escaped Unicode?
  • How good is our current test coverage for unexpected, badly-encoded filenames? A few more tests might be useful to make sure we don't break anything.
  • I'm not sure what to do about that "strict" mode on Windows in the new fsencode. We'll at least need to handle this case to avoid crashing.
  • A cursory glance at the source suggests that pathlib does not, by itself, do the \\?\ prefix on Windows that syspath does. It's possible I'm missing something, but if that's the case, it's an important oversight—we'll need to find some way to add that back. 😬
@brunal
Copy link
Collaborator Author

@brunal brunal commented Apr 10, 2015

Roughly,

  • I don't know yet. I feel like I'll won't be able to use the proposed Pathlib backport but another one that would install surrogate escaping from python-future before the real pathlib code.
  • I'm not sure updating the db model is needed: just take the bytes before sending it to the db
  • It's decent but not that good. Right now it breaks every few weeks (due to future.unicode_literals)
  • I believe Windows only provides Unicode filenames and deals with encoding problems itself, so we don't need to do anything
  • Indeed it does not! Another thing missing is path expansion (~ → /home/foo). Since Pathlib is OO, it's just a matter of subclassing it and adding/overriding required methods.
@sampsyo
Copy link
Member

@sampsyo sampsyo commented Apr 10, 2015

OK, this migration plan sounds good.

On Windows exceptions: Yes, if the OS can be trusted to always supply Unicode, then we should be OK. Hopefully, paths do not come from any other source that could contain surrogate escapes.

On the backport: FWIW, it looks like someone has tried starting a maintained backport, but it appears abandoned.

@LordSputnik
Copy link
Collaborator

@LordSputnik LordSputnik commented Jul 14, 2015

I don't think that pathlib is able to replace truncate/sanitize _path from what I've seen of it. It's a module for providing many of the functions of os.path with some corrections in an object-oriented way, and making some useful information about the path available (eg. root, suffix, filename). While this in itself is useful and could clean up a lot of code, it looks to me like we would still have to do the encoding/decoding of the path, and removal of unsafe characters.

@sampsyo
Copy link
Member

@sampsyo sampsyo commented Jul 14, 2015

Agreed; while pathlib will definitely be nice, it will not solve our encoding and sanitation problems. Those will still be 100% up to us.

@jrobeson
Copy link
Contributor

@jrobeson jrobeson commented Jul 8, 2016

there's now a maintained pathlib port that matches the stdlib version https://github.com/mcmtroffaes/pathlib2

@brunal : do you still wanna work on this? I'd love to help out.

@jtpavlock
Copy link
Contributor

@jtpavlock jtpavlock commented Jul 28, 2020

I'd like to tackle this and I want to make sure I'm considering all the potential issues. The proposed solution I am thinking is using pathlib everywhere and storing paths as strings in the database (using pathlib's string representation).

PEP 519 on pathlib

The hope is that people will gravitate towards path objects like pathlib and that will move people away from operating directly with bytes.

Some current issues:

  1. Long windows paths

    • If I'm understanding it correctly, the issue is basically defined here.

    Issue: Pathlib doesn't seem to correctly deal with these situations, as illustrated by the answer to that stackoverflow question.
    Potential Solution: Don't attempt to solve this on beet's end. Instead, direct users to enable long path support explicitly. This is the same decision that python made.

  2. Dealing with user's current databases
    Issue: Currently, all user's paths are bytes, and pathlib doesn't accept bytes.
    Potential Solution: I'm not the most experienced with databases, but is this something that could be solved with a database migration?

  3. Unix uses bytes
    Issue: So pathlib abstracts this layer and has it's own method for opening files and such. However, how do we deal with native Unix utilities that return and deal with bytes?
    Potential Solution: Python3 has surrogate escapes, which can represent bytes as strings, which can then be passed to pathlib to create a path object. These path objects then get passed around instead of the surrogate escape strings, which should hopefully prevent any unforeseen issues that come with using surrogate escapes directly.

Other notes:

  1. As far as enforcement, type hints and then static type checking may prove to be useful.
  2. PEP 519 (mentioned above) introduces a lot of nice adoption features for pathlib, notably, most (all?) os modules and open now accept pathlib objects. The issue is that the above PEP was introduced to Python 3.6. Whether or not these features are necessary for beet's adoption, I'm not sure.

These are my rough initial thoughts on the matter, and I'm only just now looking into these issues, so I'm sure I'm missing something and am hoping others can help fill in the gaps so we can work this out.

Other relevant PEPS: 529, 538, 540, 383

@sampsyo
Copy link
Member

@sampsyo sampsyo commented Jul 28, 2020

Hi! Thanks for being interested in taking this on!

I think the big obstacle here is about Unicode, surrogate escapes, and SQLite storage. Namely, because surrogate escapes are a Python-specific quirk, SQLite cannot store strings that contain them. For example:

>>> weird_string = 'café'.encode('latin1').decode('utf8', 'surrogateescape')
>>> import sqlite3
>>> sqlite3.connect(':memory:').execute('select ?;', (weird_string,)).fetchall()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udce9' in position 3: surrogates not allowed

Basically, SQLite stores values as bytes internally. Python tries to "pretend" that it's storing Unicode objects by relying on UTF-8 encoding.

Regardless, it seems easier to attempt such a migration while leaving the database intact. At some point, we will need to convert between Path objects and "primitive" values that can go into the database. We might as well continue using the same conversion we use now—namely, represent the paths as the actual bytes that came from the operating system.


Separately, one thing I'm pretty worried about here is the interaction between command-line argument encodings and filesystem encodings. On Unix, if you specify a file as a CLI argument, that's bytes—and probably exactly the same bytes that appear on the filesystem. Python 3, however, will try to decode that using the argument encoding and then, when you send it to the filesystem, re-encode it using the filesystem encoding—neither of which might actually match the filename bytes as written! Surrogate escapes don't really solve this because it has to do with representable bytes rather than unrepresentable ones. I'm not sure how pathlib will make this worse or better.


Finally, for long filenames on Windows, I wonder whether we couldn't find a nice pathlib-native solution. I don't think the idea from SO where the actual Path object contains the magic prefix is the way to go, but maybe we can convince the library to add the prefix just before calling an OS function, as we currently do manually with syspath. Maybe with a subclass of WindowsPath or something.

An ambitious alternative would be to roll our own PEP 519-based library that uses a native representation everywhere.


Finally finally, you briefly remarked that type annotations could help with this. I agree! Maybe a good plan of attack would be (after dropping Python 2 support) just go whole-hog with type annotations in parts of the code that deal with paths. Then a later phase can explore holistic bytes -> Path conversion.

@jtpavlock
Copy link
Contributor

@jtpavlock jtpavlock commented Jul 28, 2020

Regardless, it seems easier to attempt such a migration while leaving the database intact. At some point, we will need to convert between Path objects and "primitive" values that can go into the database. We might as well continue using the same conversion we use now—namely, represent the paths as the actual bytes that came from the operating system.

I'll keep it as is, at least for the first iteration.

Separately, one thing I'm pretty worried about here is the interaction between command-line argument encodings and filesystem encodings. On Unix, if you specify a file as a CLI argument, that's bytes—and probably exactly the same bytes that appear on the filesystem. Python 3, however, will try to decode that using the argument encoding and then, when you send it to the filesystem, re-encode it using the filesystem encoding—neither of which might actually match the filename bytes as written! Surrogate escapes don't really solve this because it has to do with representable bytes rather than unrepresentable ones. I'm not sure how pathlib will make this worse or better.

Hopefully this is something the abstraction of the os module and pathlib objects will take care of. I suppose we'll see if that's true or not.

Finally, for long filenames on Windows, I wonder whether we couldn't find a nice pathlib-native solution. I don't think the idea from SO where the actual Path object contains the magic prefix is the way to go, but maybe we can convince the library to add the prefix just before calling an OS function, as we currently do manually with syspath. Maybe with a subclass of WindowsPath or something.

An ambitious alternative would be to roll our own PEP 519-based library that uses a native representation everywhere.

Might be worth digging into. If it begins to create more problems than it's worth, then maybe we can decide otherwise.

@jtpavlock
Copy link
Contributor

@jtpavlock jtpavlock commented Jul 28, 2020

My initial plan is to pass pathlib.PurePath objects around everywhere, converting to pathlib.Path objects if actual file operations are needed, and converting the objects to and from bytes for the database. Consider the following example:

>>> weird_bytes = os.fsencode('café')
>>> weird_bytes
b'caf\xc3\xa9'
>>> weird_str = os.fsdecode(weird_bytes)
>>> weird_str
'café'
>>> p = Path(weird_str)
>>> p
PosixPath('café')
>>> bytes(p)
b'caf\xc3\xa9'

Does this seem reasonable?

Note, it is possible to call str(path) for all windows paths and bytes(path) for all unix paths for storing in the database. I'm not sure if that makes things better or not.

The more I get into it though, the more it seems life would be much easier if we had the benefits of PEP 519 i.e. dropping python 3.5 support. For reference, here's the download stats for the last 180 days:

$ pypinfo -d 180 beets pyversion 
Served from cache: False
Data processed: 643.30 GiB
Data billed: 643.30 GiB
Estimated cost: $3.15

| python_version | download_count |
| -------------- | -------------- |
| 3.8            |          4,146 |
| 3.7            |          2,570 |
| 2.7            |          1,923 |
| 3.6            |          1,861 |
| 3.5            |            132 |
| 3.9            |             77 |
| 3.4            |             32 |
| 3.3            |              2 |
| Total          |         10,743 |

I wish I could do more, but I hit my free quota on these stats 😆

@sampsyo
Copy link
Member

@sampsyo sampsyo commented Jul 29, 2020

Does this seem reasonable?

Note, it is possible to call str(path) for all windows paths and bytes(path) for all unix paths for storing in the database. I'm not sure if that makes things better or not.

That seems reasonable! The only difference is that we will want to make sure we are doing the same str-to-bytes conversion as we currently do on Windows for storage.

No objection to dropping Python 3.5 at this point.

@jtpavlock
Copy link
Contributor

@jtpavlock jtpavlock commented Jul 29, 2020

No objection to dropping Python 3.5 at this point.

Great, this would be much harder to implement without the 3.6 improvements, and would probably be worth putting off at that point. I should have this done by the time beets 1.5.x (or whatever the next one is) gets released.

@jtpavlock jtpavlock mentioned this issue Jul 31, 2020
0 of 3 tasks complete
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
6 participants
@jrobeson @sampsyo @brunal @LordSputnik @jtpavlock and others
You can’t perform that action at this time.