Fixed #28949 -- Fixed index name truncation for multibyte characters. #15273
buildbot, test on oracle. |
I suspect that the code as-is handles chopping halfway through multibyte characters more correctly than just naively doing
I think my question is: what happens when the last glyph/grapheme/whatever in the trimming is a mb utf-8 char like an emoji? Does that get taken wholesale, or dropped (or chopped erroneously, like my back of the napkin thing above)? If it's taken wholesale, is that definitely always fine and never going to push over the limit? (possibly, given it's tracking the working length)
If the test is answering that and I'm just not grokking enough of it to understand that, well, sorry and my bad :)
Edit: spitballing further, the following also passes on SQLite, so perhaps the intricacies of what's occurring in your current while loop aren't being fully exercised in the test?
Hey @kezabelle, thanks for digging in. Let me know if this sounds right.
Indeed, no dice:
>>> mysql = 'I♥Django' * 4
>>> len(mysql)
32
>>> len(mysql.encode('utf-8'))
40
>>> mysql.encode('utf-8')[:12]
b'I\xe2\x99\xa5DjangoI\xe2'
>>> mysql.encode('utf-8')[:12].decode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 11: unexpected end of data
I'm intending the Oracle test to show that it gets dropped ("I" rather than "I♥D"). There the max length is 30, and the name is at 28 without the three-byte heart. A comment might help?
>>> expected_result_for_oracle = 'indexes_a_c1_c2_I_67189ba7ix'
>>> len(expected_result_for_oracle)
28
I discovered it was simple to just temporarily fiddle with
OMG that snippet is so much simpler! I was so paranoid about checking the correct lengths (and indexing using the correct lengths) that I forgot that stripping from the right would just take care of things.
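For reference, a minimal sketch of one way to cut at a byte limit without hitting the `UnicodeDecodeError` shown above: slice the encoded bytes, then decode with `errors='ignore'` so a partial trailing character is dropped. The helper name `truncate_to_bytes` is hypothetical, and this is not necessarily the snippet referred to in this comment.

```python
def truncate_to_bytes(name, max_bytes):
    """Return the longest prefix of name whose UTF-8 encoding fits in max_bytes.

    Hypothetical helper for illustration; not part of Django.
    """
    # Slicing bytes may cut a multibyte character in half; errors='ignore'
    # discards the incomplete trailing sequence instead of raising.
    return name.encode('utf-8')[:max_bytes].decode('utf-8', errors='ignore')


mysql = 'I♥Django' * 4
shortened = truncate_to_bytes(mysql, 12)
print(shortened)  # I♥DjangoI
print(len(shortened.encode('utf-8')))  # 11
```

The result stays under the limit (11 bytes rather than 12) because the heart that would straddle the boundary is dropped whole.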
buildbot, test on oracle. (Edit: this didn't seem to get picked up, so trying again next)
buildbot, test on oracle.
This might have other implications, but have we investigated the possibility of normalizing the index names to drop multibyte characters instead?
# Taken from django.utils.text.slugify(allow_unicode=False)
index_name = unicodedata.normalize('NFKD', index_name).encode('ascii', 'ignore').decode('ascii')
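To make the effect of that normalization concrete, here is a small standalone sketch. The wrapper name `to_ascii` and the sample names are made up for illustration; only the `unicodedata.normalize(...)` chain comes from the comment.

```python
import unicodedata


def to_ascii(name):
    # Same chain as the suggestion: decompose (NFKD), then drop anything
    # that is not ASCII.
    return unicodedata.normalize('NFKD', name).encode('ascii', 'ignore').decode('ascii')


print(to_ascii('I♥Django'))  # IDjango (the heart has no ASCII decomposition, so it is dropped)
print(to_ascii('café_idx'))  # cafe_idx (the accent is stripped, the base letter stays)
```

Note that the output differs from the input even when the input is short, which is the backward-compatibility concern raised in the replies.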
Interesting, what do you think the backwards compatibility ramifications of that would be? I was thinking that only stripping from the right when exceeding the character limit on a backend would be alright because such an index could never have been created.
@jacobtylerwalls I think that there might be some code that expects the index name generation logic to remain stable; at least there was in the Django 1.7 addition of migrations, where we changed the format from the one used by South.
Right, your approach is fully backward compatible for this reason, but systematic Unicode normalization of index names isn't, as it changes the generated names of indexes that contain multibyte characters yet fit within the max allowed length. Only performing the normalization when we detect overflow would be backward compatible, but then you still need to determine whether or not an index name overflows.
Makes sense. I'm trying to think through whether one approach is better than the other. Both implementations seem low-complexity. My next thought is to wonder if users will find it unexpected that sometimes their index names contain multibyte characters and sometimes don't? (This would be avoided with the just-chop-the-right-end strategy.)
@jacobtylerwalls Thanks 👍 I'd prefer your approach rather than Unicode normalization.
tests/indexes/tests.py
Outdated
long_name_for_backend = {
    'mysql': 'I♥Django' * 4,
    'oracle': 'I♥D',
    'postgresql': 'I♥Django' * 4,
    'sqlite': 'I♥Django' * 17,
}
This is not friendly for 3rd-party backends. We should use connection.ops.max_name_length()
and generate names that are too long.
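A rough sketch of that suggestion, kept free of Django imports so it runs standalone. In a real test, `max_length` would come from `connection.ops.max_name_length()`; the helper name `name_exceeding` and the fallback of 200 for backends that report no limit are assumptions made for illustration.

```python
def name_exceeding(max_length, fallback=200, unit='I♥Django'):
    """Build a name with more characters than the backend's limit.

    Hypothetical test helper: max_length stands in for the value returned
    by connection.ops.max_name_length(), which may be None.
    """
    limit = max_length or fallback  # assumed fallback when no limit is reported
    repeats = limit // len(unit) + 1
    return unit * repeats


for limit in (30, 64, None):
    name = name_exceeding(limit)
    print(limit, len(name), len(name.encode()))
```

Deriving the name from the reported limit means a third-party backend needs no entry in a hard-coded per-vendor dictionary.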
Smart. Yeah, I was just blindly following the test above.
Co-authored-by: Keryn Knight <keryn@kerynknight.com>
b7e964b to 7d98c65
@jacobtylerwalls Thanks for the updates 👍 I pushed small edits.
django/db/backends/base/schema.py
Outdated
- index_name = '%s_%s_%s' % (table_name, '_'.join(column_names), hash_suffix_part)
- if len(index_name) <= max_length:
+ index_name = '%s_%s_%s' % (table_name, joined_column_names, hash_suffix_part)
+ if len(index_name.encode('utf-8')) <= max_length:
'utf-8' is the default. We can remove it.
django/db/backends/base/schema.py
Outdated
while len(table_name.encode('utf-8')) > other_length:
    table_name = table_name[:-1]
while len(joined_column_names.encode('utf-8')) > other_length:
    joined_column_names = joined_column_names[:-1]
In most cases names won't contain multibyte chars, so it should be worth avoiding repeated encoding and slicing, e.g.:
if len(table_name.encode()) == len(table_name):
    table_name = table_name[:other_length]
else:
    # Shorten table name accounting for multibyte characters.
    while len(table_name.encode()) > other_length:
        table_name = table_name[:-1]
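The same fast path, wrapped in a function so it can be run on its own. This is only a sketch of the suggestion: the name `shorten` and its signature are hypothetical and not necessarily the hook that ended up in the patch.

```python
def shorten(name, limit):
    """Trim name so its UTF-8 encoding is at most limit bytes (hypothetical helper)."""
    if len(name.encode()) == len(name):
        # ASCII-only: one character is one byte, so a plain slice is enough.
        return name[:limit]
    # Shorten name accounting for multibyte characters.
    while len(name.encode()) > limit:
        name = name[:-1]
    return name


print(shorten('indexes_article', 7))  # indexes
print(shorten('I♥Django', 4))  # I♥
```

The equality check works because UTF-8 encodes a string to exactly one byte per character only when every character is ASCII.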
I added a small hook for this.
@jacobtylerwalls I have doubts about using
class żółćęśąźńżółćęśąźńżółćęśąźńżółćęśąźń(models.Model):
    żółćęśąźńżółćęśąźńżółćęśąźńżółćęśąźń = models.IntegerField(db_index=True)
Currently, the table name is truncated to
I'm afraid that it may not be feasible to fix this without breaking backward compatibility (see also #9816 (comment)).
3-byte chars are also affected, e.g.
class żółćęśąźńżółćęśąźńżółćęśąźńżółćęśąźń(models.Model):
    żółćęśąźńżółćęśąźńżółćęśąźńżółćęśąźń = models.IntegerField(db_index=True)
    class Meta:
        db_table = "♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥"
Currently, the index name is truncated to
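As a quick illustration of why these names are awkward, the snippet below compares character counts with UTF-8 byte counts. The variable names are illustrative, and `db_table` is a shortened stand-in rather than the exact value from the model above.

```python
model_name = 'żółćęśąźń' * 4  # same pattern as the class name above
db_table = '♥' * 3            # shortened stand-in for the db_table above

for label, value in (('model_name', model_name), ('db_table', db_table)):
    # Characters vs. UTF-8 bytes: 2 bytes per Polish letter, 3 per heart.
    print(label, len(value), len(value.encode()))
# model_name 36 72
# db_table 3 9
```

A limit counted in characters and a limit counted in bytes therefore cut these names at very different points.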
Thanks for checking. Yes, this patch looks wrong-track and so do most of the premises in the original ticket. I looked up MySQL (5.7 docs), and it looks like the following has worked since 4.1.5:
I know nothing about Oracle, but Oracle v. 21 docs say:
...
I found that a little hard to parse, but I think it's saying that the >12.2 limit of 128 "characters" is actually "bytes", so Django's 30 character limit would no longer cause any problems with even 30 * 4 < 128. Is that right? Maybe the original reporter was on an Oracle version below 12.2 at the time of the report?
There is probably nothing to fix here. Could close as "needsinfo"? We would need to know how the "name error" mentioned in the report happened. Did the database do an auto-truncation? I doubt we have to provide a general solution for names teasing the upper limit (see ticket-33169).
Thanks again for looking.
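The "30 * 4 < 128" estimate can be checked directly. This sketch assumes, as the comment does, that the 128 limit is counted in bytes and that names are stored as UTF-8, where a character takes at most 4 bytes; the constant names are illustrative.

```python
ORACLE_BYTE_LIMIT = 128  # limit discussed above, read as bytes (assumption)
DJANGO_CHAR_LIMIT = 30   # Django's name limit for Oracle, per the comment

# Worst case: every character is a 4-byte UTF-8 sequence, e.g. an emoji.
worst_case = '😀' * DJANGO_CHAR_LIMIT
worst_case_bytes = len(worst_case.encode('utf-8'))

print(worst_case_bytes)  # 120
print(worst_case_bytes < ORACLE_BYTE_LIMIT)  # True
```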
Yes, it's really confusing.
Probably 😄 It's documented this way, but a few times I noticed that if you use multibyte chars in identifiers and you're close to the edge anything can happen 😄 I'd say that if you decided to use non-ASCII chars in identifiers, you actually did this to yourself. Any solution would be error-prone.
ticket-28949