Fixed #28949 -- Fixed index name truncation for multibyte characters. #15273

jacobtylerwalls · 2022-01-02T14:49:11Z

jacobtylerwalls · 2022-01-02T15:08:22Z

buildbot, test on oracle.

jacobtylerwalls · 2022-01-02T17:35:52Z

buildbot, test on oracle.

kezabelle · 2022-01-13T12:58:22Z

I suspect that the code as-is handles chopping halfway through multibyte characters more correctly than just naively doing joined_column_names = joined_column_names.encode('utf-8')[0:other_length].decode('utf-8'), but I'm not sure I can tell from the tests, and because I'm only running the tests locally on SQLite I can't readily check the other backends...

I think my question is: what happens when the last glyph/grapheme/whatever in the trimming is a mb utf-8 char like an emoji? Does that get taken wholesale, or dropped (or chopped erroneously, like my back of the napkin thing above)? If it's taken wholesale, is that definitely always fine and never going to push over the limit? (possibly, given it's tracking the working length)

If the test is answering that and I'm just not grokking enough of it to understand that, well, sorry and my bad :)

Edit: spitballing further, the following also passes on SQLite, so perhaps the intricacies of what's occurring in your current while loop aren't being fully exercised in the test?

while len(joined_column_names.encode('utf-8')) > other_length:
    joined_column_names = joined_column_names[0:-1]

jacobtylerwalls · 2022-01-13T14:55:57Z

Hey @kezabelle, thanks for digging in. Let me know if this sounds right.

I suspect that the code as-is handles chopping halfway through multibyte characters more correctly than just naively doing joined_column_names = joined_column_names.encode('utf-8')[0:other_length].decode('utf-8')

Indeed, no dice:

>>> mysql = 'I♥Django' * 4
>>> len(mysql)
32
>>> len(mysql.encode('utf-8'))
40
>>> mysql.encode('utf-8')[:12]
b'I\xe2\x99\xa5DjangoI\xe2'
>>> mysql.encode('utf-8')[:12].decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 11: unexpected end of data

what happens when the last glyph/grapheme/whatever in the trimming is a mb utf-8 char like an emoji? Does that get taken wholesale, or dropped (or chopped erroneously, like my back of the napkin thing above)? If it's taken wholesale, is that definitely always fine and never going to push over the limit?

I'm intending the Oracle test to show that it gets dropped ("I" rather than "I♥D".) There the max length is 30, and it's at 28 without the three-byte heart. A comment might help?

>>> expected_result_for_oracle = 'indexes_a_c1_c2_I_67189ba7ix'
>>> len(expected_result_for_oracle)
28

because I'm only running the tests locally on SQLite I can't readily check the other backends...

I discovered it was simple to just temporarily fiddle with max_length a few lines above to mock the other constraints. But yes, this was finally the ticket to get me to clone django-docker-box 🤓

Edit: spitballing further, the following also passes on SQLite

OMG that snippet is so much simpler! I was so paranoid about checking the correct lengths (and indexing using the correct lengths) that I forgot that stripping from the right would just take care of things.

jacobtylerwalls · 2022-01-13T14:59:23Z

buildbot, test on oracle. (Edit: this didn't seem to get picked up, so trying again next)

jacobtylerwalls · 2022-01-13T15:18:00Z

buildbot, test on oracle.

charettes · 2022-01-13T15:23:57Z

This might have other implications but have we investigated the possibility of normalizing the index names to drop multibyte characters instead?

# Taken from django.utils.slugify(allow_unicode=False)
index_name = unicodedata.normalize('NFKD', index_name).encode('ascii', 'ignore').decode('ascii')

jacobtylerwalls · 2022-01-13T19:00:17Z

Interesting, what do you think the backwards compatibility ramifications of that would be? I was thinking that only stripping from the right when exceeding the character limit on a backend would be alright because such an index could never have been created.

charettes · 2022-01-13T22:07:08Z

@jacobtylerwalls I think that there might some code that will expect the index name generation logic to remain stable at least there was in Django 1.7 addition of migrations where we changed the format from the one used by South.

I was thinking that only stripping from the right when exceeding the character limit on a backend would be alright because such an index could never have been created.

Right, your approach is fully backward compatible for this reason but the systematic Unicode normalization of index names isn't as it changes the generated names of the index containing multibyte characters but fitting the max allowed length.

Only performing the normalization when we detect overflow would be backward compatible but then you still need to determine whether or not an index name overflows.

jacobtylerwalls · 2022-01-14T20:17:28Z

Makes sense. I'm trying to think through whether one approach is better than the other. Both implementations seem low-complexity. My next thought is to wonder if users will find it unexpected that sometimes their index names contain multibyte characters and sometimes don't? (This would be avoided with the just-chop-the-right-end strategy.)

felixxm

@jacobtylerwalls Thanks 👍 I'd prefer your approach rather than Unicode normalization.

django/db/backends/base/schema.py

felixxm · 2022-01-27T08:03:21Z

tests/indexes/tests.py

+        long_name_for_backend = {
+            'mysql': 'I♥Django' * 4,
+            'oracle': 'I♥D',
+            'postgresql': 'I♥Django' * 4,
+            'sqlite': 'I♥Django' * 17,
+        }


This is not friendly for 3rd-party backends. We should use connection.ops.max_name_length() and generate names that are too long.

Smart. Yeah, I was just blindly following the test above.

tests/indexes/tests.py

Co-authored-by: Keryn Knight <keryn@kerynknight.com>

felixxm

@jacobtylerwalls Thanks for updates 👍 I pushed small edits.

felixxm · 2022-01-28T08:53:09Z

django/db/backends/base/schema.py

-        index_name = '%s_%s_%s' % (table_name, '_'.join(column_names), hash_suffix_part)
-        if len(index_name) <= max_length:
+        index_name = '%s_%s_%s' % (table_name, joined_column_names, hash_suffix_part)
+        if len(index_name.encode('utf-8')) <= max_length:


'utf-8' is the default. We can remove it.

felixxm · 2022-01-28T09:01:26Z

django/db/backends/base/schema.py

+        while len(table_name.encode('utf-8')) > other_length:
+            table_name = table_name[:-1]
+        while len(joined_column_names.encode('utf-8')) > other_length:
+            joined_column_names = joined_column_names[:-1]


In most of cases names won't contain multibyte chars, so it should be worth avoiding multiple encoding and slicing, e.g.:

if len(table_name.encode()) == len(table_name): table_name = table_name[:other_length] else: # Shorten table name accounting for multibyte characters. while len(table_name.encode()) > other_length: table_name = table_name[:-1]

I added a small hook for this.

felixxm · 2022-01-28T10:43:43Z

@jacobtylerwalls I have doubts about using encode() 😕 As far as I'm aware, identifier limits are express in chars not bytes, chars that can have 2, 3, 4 bytes. It may also depend on encoding of the operating system or database 🤯, so this change is not fully backward compatible because it will truncate existing names in a different way. For example:

class żółćęśąźńżółćęśąźńżółćęśąźńżółćęśąźń(models.Model):
    żółćęśąźńżółćęśąźńżółćęśąźńżółćęśąźń = models.IntegerField(db_index=True)

Currently, the table name is truncated to backends_żółćęśąźńżółćęśąź3d64 and index name to backends_ż_żółćęśąźńż_bd4c8b2f, but with this patch all diacritic marks are counted as 2 chars and index name is truncated to backends__żółćę_bd4c8b2f.

I'm afraid that it may not be feasible to fix this without breaking backward compatibility (see also #9816 (comment)).

felixxm · 2022-01-28T10:56:03Z

3-byte chars are also affects, e.g.

class żółćęśąźńżółćęśąźńżółćęśąźńżółćęśąźń(models.Model):
    żółćęśąźńżółćęśąźńżółćęśąźńżółćęśąźń = models.IntegerField(db_index=True)
                                                              
    class Meta:
        db_table="♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥"

Currently, the index name is truncated to ♥♥♥♥♥♥♥♥♥♥_żółćęśąźńż_9ba11e7f, but with this patch it is truncated to ♥♥♥_żółćę_9ba11e7f. It's worth mentioning that ♥♥♥♥♥♥♥♥♥♥_żółćęśąźńż_9ba11e7f works fine on Oracle.

jacobtylerwalls · 2022-01-28T14:33:39Z

Thanks for checking. Yes, this patch looks wrong-track and so do most of the premises in the original ticket.

I looked up MySQL (5.7 docs), and it looks like the following has worked since 4.1.5:

The sizes of the identifier string columns in the grant tables are measured in characters. You can use multibyte characters without reducing the number of characters permitted for values stored in these columns.

I know nothing about Oracle, but Oracle v. 21 docs say:

Oracle Database prior to version 12.2 limit identifier names, such as table names, column names, and primary key names, to 30 characters. Oracle Database 12.2 and higher have a default limit of 128 characters.

...

If the identifiers use multi-byte characters, the MaxIdentifierLength may need to be set with character expansion ratio in mind to assure that all identifiers can be created in the Oracle database. For example, if the Oracle database character set is UTF8, a single character may require up to 4 bytes. Thus, to guarantee that all identifiers can be created in an Oracle database that does not support long identifiers, the MaxIdentifierLength should be set to 7 characters (i.e. 30 characters divided by 4).

I found that a little hard to parse, but I think it's saying that the >12.2 limit of 128 "characters" is actually "bytes", so Django's 30 character limit would no longer cause any problems with even 30 * 4 < 128. Is that right?

Maybe the original reporter was on an Oracle version below 12.2 at the time of the report?

There is probably nothing to fix here. Could close as "needsinfo"? We would need to know how the "name error" mentioned in the report happened. Did the database do an auto-truncation? I doubt we have to provide a general solution for names teasing the upper limit (see ticket-33169). Thanks again for looking.

felixxm · 2022-01-28T15:13:15Z

I found that a little hard to parse, ...

Yes, it's really confusing.

... but I think it's saying that the >12.2 limit of 128 "characters" is actually "bytes", so Django's 30 character limit would no longer cause any problems with even 30 * 4 < 128. Is that right?

Probably 😄 It's documented this way, but a few times I noticed that if you use multibyte chars in identifiers and you're close to the edge anything can happen 😄 I'd say that if you decided to use non-ASCII chars in identifiers, you actually did this to yourself. Any solution would be error-prone.

felixxm reviewed Jan 27, 2022

View reviewed changes

django deleted a comment from jacobtylerwalls Jan 28, 2022

Fixed #28949 -- Fixed index name truncation for multibyte characters.

7d98c65

Co-authored-by: Keryn Knight <keryn@kerynknight.com>

felixxm force-pushed the truncate-multibyte-column branch from b7e964b to 7d98c65 Compare January 28, 2022 09:47

felixxm reviewed Jan 28, 2022

View reviewed changes

felixxm changed the title ~~Fixed #28949 -- Accounted for multibyte characters in index creation when truncating long names.~~ Fixed #28949 -- Fixed index name truncation for multibyte characters. Jan 28, 2022

felixxm closed this Jan 28, 2022

jacobtylerwalls deleted the truncate-multibyte-column branch January 28, 2022 15:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fixed #28949 -- Fixed index name truncation for multibyte characters. #15273

Fixed #28949 -- Fixed index name truncation for multibyte characters. #15273

jacobtylerwalls commented Jan 2, 2022

jacobtylerwalls commented Jan 2, 2022

jacobtylerwalls commented Jan 2, 2022

kezabelle commented Jan 13, 2022 •

edited

jacobtylerwalls commented Jan 13, 2022

jacobtylerwalls commented Jan 13, 2022 •

edited

jacobtylerwalls commented Jan 13, 2022

charettes commented Jan 13, 2022 •

edited

jacobtylerwalls commented Jan 13, 2022

charettes commented Jan 13, 2022 •

edited

jacobtylerwalls commented Jan 14, 2022

felixxm left a comment

felixxm Jan 27, 2022

jacobtylerwalls Jan 27, 2022

felixxm left a comment

felixxm Jan 28, 2022

felixxm Jan 28, 2022

felixxm Jan 28, 2022

felixxm commented Jan 28, 2022

felixxm commented Jan 28, 2022

jacobtylerwalls commented Jan 28, 2022

felixxm commented Jan 28, 2022

Fixed #28949 -- Fixed index name truncation for multibyte characters. #15273

Fixed #28949 -- Fixed index name truncation for multibyte characters. #15273

Conversation

jacobtylerwalls commented Jan 2, 2022

jacobtylerwalls commented Jan 2, 2022

jacobtylerwalls commented Jan 2, 2022

kezabelle commented Jan 13, 2022 • edited

jacobtylerwalls commented Jan 13, 2022

jacobtylerwalls commented Jan 13, 2022 • edited

jacobtylerwalls commented Jan 13, 2022

charettes commented Jan 13, 2022 • edited

jacobtylerwalls commented Jan 13, 2022

charettes commented Jan 13, 2022 • edited

jacobtylerwalls commented Jan 14, 2022

felixxm left a comment

Choose a reason for hiding this comment

felixxm Jan 27, 2022

Choose a reason for hiding this comment

jacobtylerwalls Jan 27, 2022

Choose a reason for hiding this comment

felixxm left a comment

Choose a reason for hiding this comment

felixxm Jan 28, 2022

Choose a reason for hiding this comment

felixxm Jan 28, 2022

Choose a reason for hiding this comment

felixxm Jan 28, 2022

Choose a reason for hiding this comment

felixxm commented Jan 28, 2022

felixxm commented Jan 28, 2022

jacobtylerwalls commented Jan 28, 2022

felixxm commented Jan 28, 2022

kezabelle commented Jan 13, 2022 •

edited

jacobtylerwalls commented Jan 13, 2022 •

edited

charettes commented Jan 13, 2022 •

edited

charettes commented Jan 13, 2022 •

edited