Use pathlib everywhere in beets #1409
Comments
I'm all for it. Our current path handling stuff is too fiddly. I have one important concern: encodings. The core problem is that Unix paths need to be bytes and Windows paths need to be Unicode strings (more or less). We need to store paths in a database, so eventually they need to be represented internally as one or the other. Beets currently chooses byte strings, and on Windows the "true" paths are encoded with UTF-8 just for storage and uniform manipulation. I'm not quite clear on how pathlib would interact with this constraint. I also don't know how pathlib handles long names on Windows (the issue described here). Maybe the right approach here would be to start with a small prototype, see how it goes on different platforms, and then attempt the full migration. |
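To make the storage constraint above concrete, here is a minimal sketch of normalizing a path to bytes before it goes into the database. The helper name `to_storage_bytes` and its `windows` flag are hypothetical, chosen for illustration only; this is not the function beets actually uses.

```python
import os


def to_storage_bytes(path, windows=(os.name == "nt")):
    """Hypothetical helper: return the byte string to store for a path.

    Bytes are passed through untouched (the Unix case). On Windows, where
    the "true" path is a Unicode string, UTF-8 is used purely as a storage
    encoding. Elsewhere, str paths go through the filesystem encoding.
    """
    if isinstance(path, bytes):
        return path
    if windows:
        return path.encode("utf-8")
    return os.fsencode(path)


unix_stored = to_storage_bytes(b"/music/a.mp3")
windows_stored = to_storage_bytes("C:\\Music\\a.mp3", windows=True)
```

The open question in the comment is where a pathlib object would fit into this: it would have to be converted to one of these two primitive forms at the database boundary.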
Currently,
Is that right? Am I forgetting something? Python 3.2 introduced I count ~100 {en,de}code calls in the beets/ folder. It'd be good to get rid of most of them. |
I think the solution is to use python-future (https://github.com/PythonCharmers/python-future) which provides a |
Yes, surrogate escaping is probably now the Pythonic way of doing this. A few thoughts/questions:
|
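For readers unfamiliar with surrogate escaping, a small sketch of the round trip it enables. The file name is made up; the only assumption is that the bytes are not valid UTF-8.

```python
# Bytes as they might come from a Unix filesystem: Latin-1 encoded,
# so the 0xE9 byte is not valid UTF-8.
raw = b"caf\xe9.mp3"

# Decoding with surrogateescape maps the undecodable byte to a lone
# surrogate code point instead of raising.
text = raw.decode("utf-8", "surrogateescape")

# Encoding with the same error handler restores the original bytes.
back = text.encode("utf-8", "surrogateescape")
```

The resulting `str` can be manipulated like any other string, but it only encodes back to bytes when the `surrogateescape` handler is used again.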
Roughly,
|
OK, this migration plan sounds good. On Windows exceptions: Yes, if the OS can be trusted to always supply Unicode, then we should be OK. Hopefully, paths do not come from any other source that could contain surrogate escapes. On the backport: FWIW, it looks like someone has tried starting a maintained backport, but it appears abandoned. |
I don't think that pathlib is able to replace truncate_path/sanitize_path, from what I've seen of it. It's a module that provides many of the functions of os.path in an object-oriented way, with some corrections, and makes some useful information about the path available (e.g. root, suffix, filename). While this in itself is useful and could clean up a lot of code, it looks to me like we would still have to do the encoding/decoding of the path and the removal of unsafe characters ourselves. |
Agreed; while pathlib will definitely be nice, it will not solve our encoding and sanitization problems. Those will still be 100% up to us. |
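A short sketch of the split described in the two comments above: pathlib gives structured access to the pieces of a path, but cleaning up unsafe characters remains separate code. The `sanitize_component` function and its character set are hypothetical, for illustration only; they are not beets' actual sanitization rules.

```python
import re
from pathlib import PurePosixPath

p = PurePosixPath("/music/Artist/01 Track?.mp3")

# What pathlib provides: structured access to the parts of the path.
parts = (p.root, p.name, p.stem, p.suffix)


def sanitize_component(name, replacement="_"):
    """Hypothetical: replace characters that are unsafe on some filesystems."""
    return re.sub(r'[<>:"/\\|?*]', replacement, name)


# What pathlib does not provide: the sanitizing itself. It only helps
# with putting the cleaned component back into place.
clean = p.with_name(sanitize_component(p.name))
```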
There's now a maintained pathlib port that matches the stdlib version: https://github.com/mcmtroffaes/pathlib2 @brunal: do you still wanna work on this? I'd love to help out. |
I'd like to tackle this, and I want to make sure I'm considering all the potential issues. The solution I have in mind is to use pathlib everywhere and store paths as strings in the database (using pathlib's string representation).
Some current issues:
Other notes:
These are my rough initial thoughts on the matter, and I'm only just now looking into these issues, so I'm sure I'm missing something and am hoping others can help fill in the gaps so we can work this out. |
Hi! Thanks for being interested in taking this on! I think the big obstacle here is about Unicode, surrogate escapes, and SQLite storage. Namely, because surrogate escapes are a Python-specific quirk, SQLite cannot store strings that contain them. For example:
>>> weird_string = 'café'.encode('latin1').decode('utf8', 'surrogateescape')
>>> import sqlite3
>>> sqlite3.connect(':memory:').execute('select ?;', (weird_string,)).fetchall()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udce9' in position 3: surrogates not allowed
Basically, SQLite stores values as bytes internally. Python tries to "pretend" that it's storing Unicode objects by relying on UTF-8 encoding.
Regardless, it seems easier to attempt such a migration while leaving the database intact. At some point, we will need to convert between Path objects and "primitive" values that can go into the database. We might as well continue using the same conversion we use now: namely, represent the paths as the actual bytes that came from the operating system.
Separately, one thing I'm pretty worried about here is the interaction between command-line argument encodings and filesystem encodings. On Unix, if you specify a file as a CLI argument, that's bytes, and probably exactly the same bytes that appear on the filesystem. Python 3, however, will try to decode that using the argument encoding and then, when you send it to the filesystem, re-encode it using the filesystem encoding, neither of which might actually match the filename bytes as written! Surrogate escapes don't really solve this because it has to do with representable bytes rather than unrepresentable ones. I'm not sure how pathlib will make this worse or better.
Finally, for long filenames on Windows, I wonder whether we couldn't find a nice pathlib-native solution. I don't think the idea from SO where the actual Path object contains the magic prefix is the way to go, but maybe we can convince the library to add the prefix just before calling an OS function, as we currently do manually with
An ambitious alternative would be to roll our own PEP 519-based library that uses a native representation everywhere.
Finally finally, you briefly remarked that type annotations could help with this. I agree! Maybe a good plan of attack would be (after dropping Python 2 support) just go whole-hog with type annotations in parts of the code that deal with paths.
Then a later phase can explore holistic |
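The suggestion above, keeping paths in the database as the raw bytes from the operating system, can be sketched as follows. The table and column names are made up for the example, and a POSIX-style path with an explicit UTF-8 `surrogateescape` conversion is assumed so the behavior does not depend on the machine's filesystem encoding.

```python
import sqlite3
from pathlib import PurePosixPath

# Bytes as they came from the OS; 0xE9 makes this invalid UTF-8.
raw = b"/music/caf\xe9.mp3"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (path BLOB)")  # hypothetical schema

# Python bytes are bound as a BLOB, so no text encoding is involved.
conn.execute("INSERT INTO items (path) VALUES (?)", (raw,))
(stored,) = conn.execute("SELECT path FROM items").fetchone()

# Convert to a path object only after reading, and back to bytes
# before writing, using the same error handler in both directions.
as_path = PurePosixPath(stored.decode("utf-8", "surrogateescape"))
round_tripped = str(as_path).encode("utf-8", "surrogateescape")
```

Storing the decoded string instead would hit the `UnicodeEncodeError` shown in the traceback above, since the string contains a lone surrogate.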
I'll keep it as is, at least for the first iteration.
Hopefully this is something the abstraction of the os module and pathlib objects will take care of. I suppose we'll see if that's true or not.
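The argument-versus-filesystem encoding concern raised earlier can be illustrated with a toy example. The two encodings here are arbitrary stand-ins for "argument encoding" and "filesystem encoding"; which encodings Python actually uses depends on the platform and locale.

```python
# Bytes of a file name as written on disk (UTF-8 encoded "é.mp3").
on_disk = "é.mp3".encode("utf-8")

# Suppose the argument is decoded with one encoding...
decoded = on_disk.decode("latin-1")

# ...and later re-encoded with a different one before reaching the OS.
sent_to_os = decoded.encode("utf-8")

# Every byte was representable at each step, so no surrogate escapes
# were involved, yet the result no longer names the same file.
mismatch = sent_to_os != on_disk

# With the same encoding in both directions, the bytes survive.
same_encoding = on_disk.decode("latin-1").encode("latin-1")
```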
Might be worth digging into. If it begins to create more problems than it's worth, then maybe we can decide otherwise. |
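As a rough sketch of the "add the prefix just before calling an OS function" idea for long Windows paths: the path object stays clean, and the `\\?\` prefix is applied only when producing the string handed to the OS. The function name `with_long_prefix` is hypothetical, and this is not how beets or pathlib currently does it.

```python
from pathlib import PureWindowsPath

LONG_PREFIX = "\\\\?\\"  # the literal characters \\?\


def with_long_prefix(path: PureWindowsPath) -> str:
    """Hypothetical: string form of `path` for use in Windows OS calls."""
    text = str(path)
    # The prefix is only meaningful for absolute paths; leave others alone.
    if not path.is_absolute() or text.startswith(LONG_PREFIX):
        return text
    # UNC paths (\\server\share\...) use the \\?\UNC\ form.
    if text.startswith("\\\\"):
        return LONG_PREFIX + "UNC\\" + text[2:]
    return LONG_PREFIX + text


p = PureWindowsPath("C:/Music/Artist/track.mp3")
os_string = with_long_prefix(p)
```

Because `PureWindowsPath` does no I/O, this part of the logic can be exercised on any platform.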
My initial plan is to pass pathlib.PurePath objects around everywhere, converting to pathlib.Path objects if actual file operations are needed, and converting the objects to and from bytes for the database. Consider the following example:
Does this seem reasonable? Note, it is possible to call The more I get into it, though, the more it seems life would be much easier if we had the benefits of PEP 519, i.e. dropping Python 3.5 support. For reference, here are the download stats for the last 180 days:
I wish I could do more, but I hit my free quota on these stats |
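The plan described in this comment (PurePath objects passed around, Path objects only for real file operations, bytes at the database boundary) might look roughly like the sketch below. It is an illustration, not the commenter's original example. The function names are hypothetical, and it relies on PEP 519 support, i.e. Python 3.6 or later.

```python
import os
from pathlib import Path, PurePath


def path_to_db(path: PurePath) -> bytes:
    """Hypothetical: bytes to store for a path (os.fsencode accepts
    path-like objects since Python 3.6)."""
    return os.fsencode(path)


def path_from_db(raw: bytes) -> PurePath:
    """Hypothetical: rebuild a pure path from stored bytes."""
    return PurePath(os.fsdecode(raw))


def file_exists(path: PurePath) -> bool:
    """Convert to a concrete Path only when touching the filesystem."""
    return Path(path).exists()


item_path = PurePath("music") / "artist" / "track.mp3"
restored = path_from_db(path_to_db(item_path))
```

Keeping pure paths as the common currency makes it explicit which functions perform I/O: only those that build a `Path`.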
That seems reasonable! The only difference is that we will want to make sure we are doing the same No objection to dropping Python 3.5 at this point. |
Great, this would be much harder to implement without the 3.6 improvements, and would probably be worth putting off at that point. I should have this done by the time beets 1.5.x (or whatever the next one is) gets released. |
I'm happy to give this a shot, it seems like it'd be very helpful. |
Pathlib is awesome. It abstracts away the platform for path manipulation. It would allow us to have more robust code and get rid of all the calls to beets.util.*_path() functions. It could be extended to provide a desired repr(). I feel like working on it. It is huge, though.