Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

preventing integrityerrors by using a tag that has just another case variation #220

Closed
wants to merge 4 commits into from

Conversation

fabiant7t
Copy link

A tag won't get created if there is another one with the same name in another case variation. This prevents an IntegrityError being raised when using MySQL.

See #204

@fabiant7t
Copy link
Author

@hackedhead @hroncok CamelCase and camelCase are being stored as one Tag only.

@apollo13 I increased the number of expected queries based on the actual number. It could probably get optimized.

tag_objs.add(self.through.tag_model().objects.create(name=new_tag))
for probably_new_tag in str_tags - set(t.name for t in existing):
try:
tag_obj = self.through.tag_model().objects.filter(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should go via slug and slugify instead of name I think; all in all I am not to sure about this approach :( Also instead of expect IndexError we should use .exists on the query

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did slug matching in my fork, if you want to take a look:
https://github.com/hackedhead/django-taggit/blob/slugmatch/taggit/managers.py#L140

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm totally fine with using any other approach that prevents MySQL from raising an IntegrityError once a tag gets added that already exists in another case variation. That's the major blocker for me.

The root of all the evil here seems to be MySQL's implementation of a string comparison, which is case insensitive. Therefore tags like awesome and AWESOME are equal and considered a duplicate, resulting in an IntegrityError due to the index that forces uniqueness. Case sensitive comparisons are possible, but require at least one of the two strings to be stored as binary, e.g. as a varbinary. A CharField however maps to a varchar, and there is no Django Field that maps to a varbinary at the moment (see creation.py in django.db.backends.mysql).

My approach, to use the existing tag awesome instead of creating a new one AWESOME makes sense to me and is my preferred behavior in this particular situation (... tagging things ...). In comparison, databases like SQLite3 and probably PostgreSQL, that do proper string comparisons, will create a new tag with the name AWESOME and the slug awesome-1 (or any other number that is not used yet). But is something AWESOME really not awesome and vice verca?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was my thinking as well, if two tags are named similarly enough that the slugify function generates the same slug for both of them (if a user happened to use the wrong capitalization, or punctuation, etc) I want them to be tagged the same. I'm basically using the slugify process to avoid having to do fuzzy matching (case/punctuation/etc differences) myself on the Name part. Since tags are about groups things together, erring on the side of collision seems preferable to me as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On Mon, Apr 7, 2014 at 2:32 PM, Fabian Topfstedt
notifications@github.comwrote:

The root of all the evil here seems to be MySQL's implementation of a
string comparison, which is case insensitive. Therefore tags like
awesome and AWESOME are equal and considered a duplicate, resulting
in an IntegrityError due to the index that forces uniqueness. Case
sensitive comparisons are possible, but require at least one of the two
strings to be stored as binary, e.g. as a varbinary.

You just need to choose a case sensitive collation afaik.

But is something AWESOME really not awesome and vice verca?

A "tag" is not necessarily the "TAG" (Treatment Action Group) -- so this
would be highly backwards incompatible.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@apollo13 Using the utf8_bin collation seems to work. But the standard collation is case insensitive and you cannot set utf8_bin as the default collation in the settings due to sessions.Session and admin.LogEntry (see https://docs.djangoproject.com/en/dev/ref/databases/#collation-settings).

Which means that I (and everyone who reads this) is now able to solve the problem by manually changing the collation of the taggit tables. However, a lot of people will run into the same problem in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants