max_filename_length creates invalid unicode characters #1418

hushpiper · 2015-04-13T04:49:19Z

When using the max_filename_length config option to truncate long filenames, setting it to a number that is not a multiple of 4 seems to create invalid unicode characters at the end. Example from my server (CentOS 6.6, Beets 1.3.11), with an album by Belye Flagi Zazhigayte Medlenno, the full album name is:

Даже если пролетариат возьмёт власть в свои руки, весна всё равно достанется нам, а цели войны останутся целями войны

I set max_filename_length to 25, run beet mv and then beet ls -p and get this:

$ beet ls -p Белые
/var/subsonic/music/Белые флаги з/Даже если про/01 08.01.07.mp3
/var/subsonic/music/Белые флаги з/Даже если про/02 Мы вооруж.mp3
/var/subsonic/music/Белые флаги з/Даже если про/03 16.11.06.mp3

But if I try to actually ls the folder, when trying to tab complete I get this:

$ ls /var/subsonic/music/Белые\ флаги\ з
Белые флаги з�/

The final character is invalid, and I can't seem to input it, so I can't actually interact with the folder without using the inode. The same happens if I use a longer value like 100, with the exception that the full artist name comes out while the longer album name has the invalid character at the end.

I noticed that, in the case above using 25, the artist name was truncated even though it's only 13 unicode characters (<25), so I'm assuming that it's counting the bytes that make up the 1-4 byte unicode character, rather than each unicode character itself. I haven't tested this on a non-unicode album name, but I did test it with different numbers: even numbers don't necessarily work (100 did not), but powers of 2 (64) and multiples of 4 (24) both work without causing any invalid characters. Since it displays the path without issues through beets, I suspect it might be something to do with the way python handles the characters vs the way bash/ls does.

The text was updated successfully, but these errors were encountered:

sampsyo · 2015-04-13T05:23:17Z

Thank you! This is related to #496, where truncation was also creating invalid filenames by bypassing the replace rules.

To resolve this, you could imagine doing one of two things:

Do the truncation on pre-encoded, Unicode character strings.
Somehow ensure, after truncation on the byte string, that we trim partial encoded characters.

The trouble with option 1 is that, on Unix, paths are byte strings—so a filesystem's maximum length is specified in bytes, not characters. (Not sure about Windows; it's probably characters there.) And the difficulty with 2 is that finding the character boundaries amounts to re-decoding back to Unicode.

I'm sure there's something clever we can do to get this right, but I'm not quite sure what it is yet. 😃

lorentzkim · 2015-05-29T14:16:47Z

I'm not sure if this is the correct solution (I am not a python developer), but perhaps it's a path towards a solution? See referenced commit.

sampsyo · 2015-05-29T14:25:01Z

It's getting there, but since the truncation still happens after the original Unicode becomes bytes, it assumes a specific encoding (UTF-8). We should somehow do the truncation on the original Unicode string and only encode to bytes once.

untitaker · 2015-06-01T22:36:57Z

We should somehow do the truncation on the original Unicode string and only encode to bytes once.

Do you plan make max_filename_length specify amount of chars on Unix, as on Windows? Because if not, we might as well use @lorentzkim's solution, but using sys.getfilesystemencoding instead of utf-8.

sampsyo · 2015-06-02T21:20:40Z

Very good point, @untitaker—without looking again, I think the right answer will be characters on Windows and bytes on Unix. In which case you're probably right—we could just be sure to use the same encoding we used originally to get to bytes.

One trick though: in some encodings, it may be possible that half a character is actually a valid, but different, character. (I don't think it's possible in UTF-8, but it could happen elsewhere. Even in UTF-8, this could interact poorly with combining characters.) In which case we'd want to avoid encoding, splitting, and re-decoding for fear of losing information about character boundaries. I don't see an obvious way around that without encoding one character at a time, though…

Fix beetbox#1418

untitaker · 2015-06-03T09:19:11Z

One trick though: in some encodings, it may be possible that half a character
is actually a valid, but different, character.

I hadn't thought about that, but it is possible in UTF-8.

b'\xc3\xa4' means ä, but its first byte might as well mean Ã.

In which case we'd want to avoid encoding, splitting, and re-decoding for
fear of losing information about character boundaries. I don't see an obvious
way around that without encoding one character at a time, though…

I've created a branch where exactly this happens.

sampsyo · 2015-07-01T21:33:39Z

Very cool; this looks like it's on the right track.

For maintainability's sake, it would be awesome to combine the encoding and truncation steps. That is, rather than encoding, decoding again, and then re-encoding one byte at a time to truncate, we should do the encoding and truncation all at once. Where we currently call bytestring_path in the destination method, we should instead call a new function that does both encoding and truncation.

This way, we won't have to mentally maintain the invariant that the same encoding (with the same Unicode normalization?) has to be used in both places—something that I could imagine biting us in the future.

It would be fine to not truncate in fragment mode, which is only used in non-filesystem scenarios.

untitaker · 2015-07-03T12:00:53Z

I think the right answer will be characters on Windows and bytes on Unix

How will this distinction hold up with Python 3?

sampsyo · 2015-07-03T17:25:56Z

I believe it will be the same: these constraints come from the OS, and (fortunately) don't have to do with Python's representation.

untitaker · 2015-07-03T21:08:56Z

There are some restrictions coming from Python, concretely that Python 3 can't deal with non-ASCII byte paths. We'd have to convert those to strings everytime while Python 2 doesn't seem to have any problems with unicode paths.

untitaker · 2015-07-03T21:13:02Z

I'd suggest to just follow Python 3's lead and make everything unicode -- the stability problems with that approach regarding certain UNIX environments are already given through its stdlib.

sampsyo · 2015-07-03T22:44:17Z

Python 3 can't deal with non-ASCII byte paths

In my understanding/experience, bytestring paths work fine on Python 3:

$ python3
>>> import os
>>> os.listdir(b'Desktop/cafe\xcc\x81')
[b'test']

In fact, when you use Unicode paths on Python 3 on Unix, it actually just encodes them to bytes before passing them off to the OS. The surrogate codepoints let you shuttle arbitrary bytes through a Unicode representation, but that doesn't change the fact that you're still dealing with a bytes API.

In any case, I absolutely agree with you that we should stick with Unicode internally when we switch to Python 3—but, eventually, the limit enforced by the OS won't care about our internal representation. I believe (although it's unverified!) that all the length restrictions on Unix filenames, for any Unixy OS I know of, are in terms of bytes, so we'll need to enforce the truncation that way even if we use a Unicode representation internally.

Path encodings are fun! 😱

untitaker · 2015-07-03T22:46:10Z

In my understanding/experience, bytestring paths work fine on Python 3:

Ok well, I've tested with open().

sampsyo · 2015-07-03T22:48:38Z

Oddly, that works fine for me here too:

>>> open(b'Desktop/cafe\xcc\x81/test')
<_io.TextIOWrapper name=b'Desktop/cafe\xcc\x81/test' mode='r' encoding='UTF-8'>

untitaker · 2015-07-04T10:39:29Z

Argh, nevermind.

On Fri, Jul 03, 2015 at 03:48:41PM -0700, Adrian Sampson wrote:

Oddly, that works fine for me here too:
>>> open(b'Desktop/cafe\xcc\x81/test')
<_io.TextIOWrapper name=b'Desktop/cafe\xcc\x81/test' mode='r' encoding='UTF-8'>
Reply to this email directly or view it on GitHub:
#1418 (comment)

Fix beetbox#1418

sampsyo added the bug bugs that are confirmed and actionable label Apr 13, 2015

untitaker added a commit to untitaker/beets that referenced this issue Jun 3, 2015

Truncate on character boundaries

625ec59

Fix beetbox#1418

untitaker mentioned this issue Jul 3, 2015

[WIP] Truncate on character boundaries #1533

Closed

untitaker added a commit to untitaker/beets that referenced this issue Jul 7, 2015

Truncate on character boundaries

6c4b659

Fix beetbox#1418

untitaker added a commit to untitaker/beets that referenced this issue Jul 7, 2015

Truncate on character boundaries

03dd315

Fix beetbox#1418

sampsyo mentioned this issue Jul 8, 2015

Fixed issue #496 - sanitize/truncate bug #1361

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

max_filename_length creates invalid unicode characters #1418

max_filename_length creates invalid unicode characters #1418

hushpiper commented Apr 13, 2015

sampsyo commented Apr 13, 2015

lorentzkim commented May 29, 2015

sampsyo commented May 29, 2015

untitaker commented Jun 1, 2015

sampsyo commented Jun 2, 2015

untitaker commented Jun 3, 2015

sampsyo commented Jul 1, 2015

untitaker commented Jul 3, 2015

sampsyo commented Jul 3, 2015

untitaker commented Jul 3, 2015

untitaker commented Jul 3, 2015

sampsyo commented Jul 3, 2015

untitaker commented Jul 3, 2015

sampsyo commented Jul 3, 2015

untitaker commented Jul 4, 2015

max_filename_length creates invalid unicode characters #1418

max_filename_length creates invalid unicode characters #1418

Comments

hushpiper commented Apr 13, 2015

sampsyo commented Apr 13, 2015

lorentzkim commented May 29, 2015

sampsyo commented May 29, 2015

untitaker commented Jun 1, 2015

sampsyo commented Jun 2, 2015

untitaker commented Jun 3, 2015

sampsyo commented Jul 1, 2015

untitaker commented Jul 3, 2015

sampsyo commented Jul 3, 2015

untitaker commented Jul 3, 2015

untitaker commented Jul 3, 2015

sampsyo commented Jul 3, 2015

untitaker commented Jul 3, 2015

sampsyo commented Jul 3, 2015

untitaker commented Jul 4, 2015