Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed #28949 -- Fixed index name truncation for multibyte characters. #15273

Closed

Conversation

jacobtylerwalls
Copy link
Member

@jacobtylerwalls
Copy link
Member Author

buildbot, test on oracle.

1 similar comment
@jacobtylerwalls
Copy link
Member Author

buildbot, test on oracle.

@kezabelle
Copy link
Contributor

kezabelle commented Jan 13, 2022

I suspect that the code as-is handles chopping halfway through multibyte characters more correctly than just naively doing joined_column_names = joined_column_names.encode('utf-8')[0:other_length].decode('utf-8'), but I'm not sure I can tell from the tests, and because I'm only running the tests locally on SQLite I can't readily check the other backends...

I think my question is: what happens when the last glyph/grapheme/whatever in the trimming is a mb utf-8 char like an emoji? Does that get taken wholesale, or dropped (or chopped erroneously, like my back of the napkin thing above)? If it's taken wholesale, is that definitely always fine and never going to push over the limit? (possibly, given it's tracking the working length)

If the test is answering that and I'm just not grokking enough of it to understand that, well, sorry and my bad :)


Edit: spitballing further, the following also passes on SQLite, so perhaps the intricacies of what's occurring in your current while loop aren't being fully exercised in the test?

while len(joined_column_names.encode('utf-8')) > other_length:
    joined_column_names = joined_column_names[0:-1]

@jacobtylerwalls
Copy link
Member Author

Hey @kezabelle, thanks for digging in. Let me know if this sounds right.

I suspect that the code as-is handles chopping halfway through multibyte characters more correctly than just naively doing joined_column_names = joined_column_names.encode('utf-8')[0:other_length].decode('utf-8')

Indeed, no dice:

>>> mysql = 'I♥Django' * 4
>>> len(mysql)
32
>>> len(mysql.encode('utf-8'))
40
>>> mysql.encode('utf-8')[:12]
b'I\xe2\x99\xa5DjangoI\xe2'
>>> mysql.encode('utf-8')[:12].decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 11: unexpected end of data

what happens when the last glyph/grapheme/whatever in the trimming is a mb utf-8 char like an emoji? Does that get taken wholesale, or dropped (or chopped erroneously, like my back of the napkin thing above)? If it's taken wholesale, is that definitely always fine and never going to push over the limit?

I'm intending the Oracle test to show that it gets dropped ("I" rather than "I♥D".) There the max length is 30, and it's at 28 without the three-byte heart. A comment might help?

>>> expected_result_for_oracle = 'indexes_a_c1_c2_I_67189ba7ix'
>>> len(expected_result_for_oracle)
28

because I'm only running the tests locally on SQLite I can't readily check the other backends...

I discovered it was simple to just temporarily fiddle with max_length a few lines above to mock the other constraints. But yes, this was finally the ticket to get me to clone django-docker-box 🤓


Edit: spitballing further, the following also passes on SQLite

OMG that snippet is so much simpler! I was so paranoid about checking the correct lengths (and indexing using the correct lengths) that I forgot that stripping from the right would just take care of things.

@jacobtylerwalls
Copy link
Member Author

jacobtylerwalls commented Jan 13, 2022

buildbot, test on oracle. (Edit: this didn't seem to get picked up, so trying again next)

@jacobtylerwalls
Copy link
Member Author

buildbot, test on oracle.

@charettes
Copy link
Member

charettes commented Jan 13, 2022

This might have other implications but have we investigated the possibility of normalizing the index names to drop multibyte characters instead?

# Taken from django.utils.slugify(allow_unicode=False)
index_name = unicodedata.normalize('NFKD', index_name).encode('ascii', 'ignore').decode('ascii')

@jacobtylerwalls
Copy link
Member Author

Interesting, what do you think the backwards compatibility ramifications of that would be? I was thinking that only stripping from the right when exceeding the character limit on a backend would be alright because such an index could never have been created.

@charettes
Copy link
Member

charettes commented Jan 13, 2022

@jacobtylerwalls I think that there might some code that will expect the index name generation logic to remain stable at least there was in Django 1.7 addition of migrations where we changed the format from the one used by South.

I was thinking that only stripping from the right when exceeding the character limit on a backend would be alright because such an index could never have been created.

Right, your approach is fully backward compatible for this reason but the systematic Unicode normalization of index names isn't as it changes the generated names of the index containing multibyte characters but fitting the max allowed length.

Only performing the normalization when we detect overflow would be backward compatible but then you still need to determine whether or not an index name overflows.

@jacobtylerwalls
Copy link
Member Author

Makes sense. I'm trying to think through whether one approach is better than the other. Both implementations seem low-complexity. My next thought is to wonder if users will find it unexpected that sometimes their index names contain multibyte characters and sometimes don't? (This would be avoided with the just-chop-the-right-end strategy.)

Copy link
Member

@felixxm felixxm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jacobtylerwalls Thanks 👍 I'd prefer your approach rather than Unicode normalization.

django/db/backends/base/schema.py Outdated Show resolved Hide resolved
django/db/backends/base/schema.py Outdated Show resolved Hide resolved
django/db/backends/base/schema.py Outdated Show resolved Hide resolved
Comment on lines 65 to 70
long_name_for_backend = {
'mysql': 'I♥Django' * 4,
'oracle': 'I♥D',
'postgresql': 'I♥Django' * 4,
'sqlite': 'I♥Django' * 17,
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not friendly for 3rd-party backends. We should use connection.ops.max_name_length() and generate names that are too long.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Smart. Yeah, I was just blindly following the test above.

tests/indexes/tests.py Outdated Show resolved Hide resolved
@django django deleted a comment from jacobtylerwalls Jan 28, 2022
Co-authored-by: Keryn Knight <keryn@kerynknight.com>
Copy link
Member

@felixxm felixxm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jacobtylerwalls Thanks for updates 👍 I pushed small edits.

index_name = '%s_%s_%s' % (table_name, '_'.join(column_names), hash_suffix_part)
if len(index_name) <= max_length:
index_name = '%s_%s_%s' % (table_name, joined_column_names, hash_suffix_part)
if len(index_name.encode('utf-8')) <= max_length:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'utf-8' is the default. We can remove it.

Comment on lines 1009 to 1012
while len(table_name.encode('utf-8')) > other_length:
table_name = table_name[:-1]
while len(joined_column_names.encode('utf-8')) > other_length:
joined_column_names = joined_column_names[:-1]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In most of cases names won't contain multibyte chars, so it should be worth avoiding multiple encoding and slicing, e.g.:

if len(table_name.encode()) == len(table_name):
    table_name = table_name[:other_length]
else:
    # Shorten table name accounting for multibyte characters.
    while len(table_name.encode()) > other_length:
        table_name = table_name[:-1]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a small hook for this.

@felixxm felixxm changed the title Fixed #28949 -- Accounted for multibyte characters in index creation when truncating long names. Fixed #28949 -- Fixed index name truncation for multibyte characters. Jan 28, 2022
@felixxm
Copy link
Member

felixxm commented Jan 28, 2022

@jacobtylerwalls I have doubts about using encode() 😕 As far as I'm aware, identifier limits are express in chars not bytes, chars that can have 2, 3, 4 bytes. It may also depend on encoding of the operating system or database 🤯, so this change is not fully backward compatible because it will truncate existing names in a different way. For example:

class żółćęśąźńżółćęśąźńżółćęśąźńżółćęśąźń(models.Model):
    żółćęśąźńżółćęśąźńżółćęśąźńżółćęśąźń = models.IntegerField(db_index=True)

Currently, the table name is truncated to backends_żółćęśąźńżółćęśąź3d64 and index name to backends_ż_żółćęśąźńż_bd4c8b2f, but with this patch all diacritic marks are counted as 2 chars and index name is truncated to backends__żółćę_bd4c8b2f.

I'm afraid that it may not be feasible to fix this without breaking backward compatibility (see also #9816 (comment)).

@felixxm
Copy link
Member

felixxm commented Jan 28, 2022

3-byte chars are also affects, e.g.

class żółćęśąźńżółćęśąźńżółćęśąźńżółćęśąźń(models.Model):
    żółćęśąźńżółćęśąźńżółćęśąźńżółćęśąźń = models.IntegerField(db_index=True)
                                                              
    class Meta:
        db_table="♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥"

Currently, the index name is truncated to ♥♥♥♥♥♥♥♥♥♥_żółćęśąźńż_9ba11e7f, but with this patch it is truncated to ♥♥♥_żółćę_9ba11e7f. It's worth mentioning that ♥♥♥♥♥♥♥♥♥♥_żółćęśąźńż_9ba11e7f works fine on Oracle.

@jacobtylerwalls
Copy link
Member Author

Thanks for checking. Yes, this patch looks wrong-track and so do most of the premises in the original ticket.

I looked up MySQL (5.7 docs), and it looks like the following has worked since 4.1.5:

The sizes of the identifier string columns in the grant tables are measured in characters. You can use multibyte characters without reducing the number of characters permitted for values stored in these columns.

I know nothing about Oracle, but Oracle v. 21 docs say:

Oracle Database prior to version 12.2 limit identifier names, such as table names, column names, and primary key names, to 30 characters. Oracle Database 12.2 and higher have a default limit of 128 characters.

...

If the identifiers use multi-byte characters, the MaxIdentifierLength may need to be set with character expansion ratio in mind to assure that all identifiers can be created in the Oracle database. For example, if the Oracle database character set is UTF8, a single character may require up to 4 bytes. Thus, to guarantee that all identifiers can be created in an Oracle database that does not support long identifiers, the MaxIdentifierLength should be set to 7 characters (i.e. 30 characters divided by 4).

I found that a little hard to parse, but I think it's saying that the >12.2 limit of 128 "characters" is actually "bytes", so Django's 30 character limit would no longer cause any problems with even 30 * 4 < 128. Is that right?

Maybe the original reporter was on an Oracle version below 12.2 at the time of the report?

There is probably nothing to fix here. Could close as "needsinfo"? We would need to know how the "name error" mentioned in the report happened. Did the database do an auto-truncation? I doubt we have to provide a general solution for names teasing the upper limit (see ticket-33169). Thanks again for looking.

@felixxm
Copy link
Member

felixxm commented Jan 28, 2022

I found that a little hard to parse, ...

Yes, it's really confusing.

... but I think it's saying that the >12.2 limit of 128 "characters" is actually "bytes", so Django's 30 character limit would no longer cause any problems with even 30 * 4 < 128. Is that right?

Probably 😄 It's documented this way, but a few times I noticed that if you use multibyte chars in identifiers and you're close to the edge anything can happen 😄 I'd say that if you decided to use non-ASCII chars in identifiers, you actually did this to yourself. Any solution would be error-prone.

@felixxm felixxm closed this Jan 28, 2022
@jacobtylerwalls jacobtylerwalls deleted the truncate-multibyte-column branch January 28, 2022 15:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants