Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

max_filename_length creates invalid unicode characters #1418

Open
hushpiper opened this issue Apr 13, 2015 · 15 comments
Open

max_filename_length creates invalid unicode characters #1418

hushpiper opened this issue Apr 13, 2015 · 15 comments
Labels
bug bugs that are confirmed and actionable

Comments

@hushpiper
Copy link

When using the max_filename_length config option to truncate long filenames, setting it to a number that is not a multiple of 4 seems to create invalid unicode characters at the end. Example from my server (CentOS 6.6, Beets 1.3.11), with an album by Belye Flagi Zazhigayte Medlenno, the full album name is:

Даже если пролетариат возьмёт власть в свои руки, весна всё равно достанется нам, а цели войны останутся целями войны

I set max_filename_length to 25, run beet mv and then beet ls -p and get this:

$ beet ls -p Белые
/var/subsonic/music/Белые флаги з/Даже если про/01 08.01.07.mp3
/var/subsonic/music/Белые флаги з/Даже если про/02 Мы вооруж.mp3
/var/subsonic/music/Белые флаги з/Даже если про/03 16.11.06.mp3

But if I try to actually ls the folder, when trying to tab complete I get this:

$ ls /var/subsonic/music/Белые\ флаги\ з
Белые флаги з�/ 

The final character is invalid, and I can't seem to input it, so I can't actually interact with the folder without using the inode. The same happens if I use a longer value like 100, with the exception that the full artist name comes out while the longer album name has the invalid character at the end.

I noticed that, in the case above using 25, the artist name was truncated even though it's only 13 unicode characters (<25), so I'm assuming that it's counting the bytes that make up the 1-4 byte unicode character, rather than each unicode character itself. I haven't tested this on a non-unicode album name, but I did test it with different numbers: even numbers don't necessarily work (100 did not), but powers of 2 (64) and multiples of 4 (24) both work without causing any invalid characters. Since it displays the path without issues through beets, I suspect it might be something to do with the way python handles the characters vs the way bash/ls does.

@sampsyo sampsyo added the bug bugs that are confirmed and actionable label Apr 13, 2015
@sampsyo
Copy link
Member

sampsyo commented Apr 13, 2015

Thank you! This is related to #496, where truncation was also creating invalid filenames by bypassing the replace rules.

To resolve this, you could imagine doing one of two things:

  1. Do the truncation on pre-encoded, Unicode character strings.
  2. Somehow ensure, after truncation on the byte string, that we trim partial encoded characters.

The trouble with option 1 is that, on Unix, paths are byte strings—so a filesystem's maximum length is specified in bytes, not characters. (Not sure about Windows; it's probably characters there.) And the difficulty with 2 is that finding the character boundaries amounts to re-decoding back to Unicode.

I'm sure there's something clever we can do to get this right, but I'm not quite sure what it is yet. 😃

@lorentzkim
Copy link

I'm not sure if this is the correct solution (I am not a python developer), but perhaps it's a path towards a solution? See referenced commit.

@sampsyo
Copy link
Member

sampsyo commented May 29, 2015

It's getting there, but since the truncation still happens after the original Unicode becomes bytes, it assumes a specific encoding (UTF-8). We should somehow do the truncation on the original Unicode string and only encode to bytes once.

@untitaker
Copy link
Collaborator

We should somehow do the truncation on the original Unicode string and only encode to bytes once.

Do you plan make max_filename_length specify amount of chars on Unix, as on Windows? Because if not, we might as well use @lorentzkim's solution, but using sys.getfilesystemencoding instead of utf-8.

@sampsyo
Copy link
Member

sampsyo commented Jun 2, 2015

Very good point, @untitaker—without looking again, I think the right answer will be characters on Windows and bytes on Unix. In which case you're probably right—we could just be sure to use the same encoding we used originally to get to bytes.

One trick though: in some encodings, it may be possible that half a character is actually a valid, but different, character. (I don't think it's possible in UTF-8, but it could happen elsewhere. Even in UTF-8, this could interact poorly with combining characters.) In which case we'd want to avoid encoding, splitting, and re-decoding for fear of losing information about character boundaries. I don't see an obvious way around that without encoding one character at a time, though…

untitaker added a commit to untitaker/beets that referenced this issue Jun 3, 2015
@untitaker
Copy link
Collaborator

One trick though: in some encodings, it may be possible that half a character
is actually a valid, but different, character.

I hadn't thought about that, but it is possible in UTF-8.

b'\xc3\xa4' means ä, but its first byte might as well mean Ã.

In which case we'd want to avoid encoding, splitting, and re-decoding for
fear of losing information about character boundaries. I don't see an obvious
way around that without encoding one character at a time, though…

I've created a branch where exactly this happens.

@sampsyo
Copy link
Member

sampsyo commented Jul 1, 2015

Very cool; this looks like it's on the right track.

For maintainability's sake, it would be awesome to combine the encoding and truncation steps. That is, rather than encoding, decoding again, and then re-encoding one byte at a time to truncate, we should do the encoding and truncation all at once. Where we currently call bytestring_path in the destination method, we should instead call a new function that does both encoding and truncation.

This way, we won't have to mentally maintain the invariant that the same encoding (with the same Unicode normalization?) has to be used in both places—something that I could imagine biting us in the future.

It would be fine to not truncate in fragment mode, which is only used in non-filesystem scenarios.

@untitaker
Copy link
Collaborator

I think the right answer will be characters on Windows and bytes on Unix

How will this distinction hold up with Python 3?

@sampsyo
Copy link
Member

sampsyo commented Jul 3, 2015

I believe it will be the same: these constraints come from the OS, and (fortunately) don't have to do with Python's representation.

@untitaker
Copy link
Collaborator

There are some restrictions coming from Python, concretely that Python 3 can't deal with non-ASCII byte paths. We'd have to convert those to strings everytime while Python 2 doesn't seem to have any problems with unicode paths.

@untitaker
Copy link
Collaborator

I'd suggest to just follow Python 3's lead and make everything unicode -- the stability problems with that approach regarding certain UNIX environments are already given through its stdlib.

@sampsyo
Copy link
Member

sampsyo commented Jul 3, 2015

Python 3 can't deal with non-ASCII byte paths

In my understanding/experience, bytestring paths work fine on Python 3:

$ python3
>>> import os
>>> os.listdir(b'Desktop/cafe\xcc\x81')
[b'test']

In fact, when you use Unicode paths on Python 3 on Unix, it actually just encodes them to bytes before passing them off to the OS. The surrogate codepoints let you shuttle arbitrary bytes through a Unicode representation, but that doesn't change the fact that you're still dealing with a bytes API.

In any case, I absolutely agree with you that we should stick with Unicode internally when we switch to Python 3—but, eventually, the limit enforced by the OS won't care about our internal representation. I believe (although it's unverified!) that all the length restrictions on Unix filenames, for any Unixy OS I know of, are in terms of bytes, so we'll need to enforce the truncation that way even if we use a Unicode representation internally.

Path encodings are fun! 😱

@untitaker
Copy link
Collaborator

In my understanding/experience, bytestring paths work fine on Python 3:

Ok well, I've tested with open().

@sampsyo
Copy link
Member

sampsyo commented Jul 3, 2015

Oddly, that works fine for me here too:

>>> open(b'Desktop/cafe\xcc\x81/test')
<_io.TextIOWrapper name=b'Desktop/cafe\xcc\x81/test' mode='r' encoding='UTF-8'>

@untitaker
Copy link
Collaborator

Argh, nevermind.

On Fri, Jul 03, 2015 at 03:48:41PM -0700, Adrian Sampson wrote:

Oddly, that works fine for me here too:

>>> open(b'Desktop/cafe\xcc\x81/test')
<_io.TextIOWrapper name=b'Desktop/cafe\xcc\x81/test' mode='r' encoding='UTF-8'>

Reply to this email directly or view it on GitHub:
#1418 (comment)

untitaker added a commit to untitaker/beets that referenced this issue Jul 7, 2015
untitaker added a commit to untitaker/beets that referenced this issue Jul 7, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug bugs that are confirmed and actionable
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants