Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Efficient QuerySet.__contains__ #3906

Closed
wants to merge 1 commit into from
Closed

Conversation

gormster
Copy link

Doing x in qs current involves iterating over every single row in the
database. This allows the search to be done in the database with the
indexed primary key.

Doing `x in qs` current involves iterating over every single row in the
database. This allows the search to be done in the database with the
indexed primary key.
@@ -223,6 +223,9 @@ def __or__(self, other):
combined.query.combine(other.query, sql.OR)
return combined

def __contains__(self, obj):
return isinstance(obj, self.model) and self.filter(pk=obj.pk).exists()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will subtly change existing code. First, if the queryset is already evaluated, then currently contains checks against the cached results. Second, when calling x in qs, the queryset doesn't get evaluated any more.

My preference is to not change this unless there is large-scale support to change the behavior.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough - I'll change it to use the cached results, which, you're right, it should probably do.

However - if anyone's currently relying on __contains__ to evaluate a QuerySet, they probably shouldn't. It's a little-known side effect and causes a lot more harm than good. The Pythonic way of testing membership is in - every Python devs knows this and will use it naturally. The fact that using in causes an enormous performance hit on list-like QuerySets is something most devs won't see coming, and in a lot of cases won't notice until it's far too late.

This could be held up till a major release - 1.9 or 2.0 - where devs could expect changes to core behaviour. But I'm strongly of the opinion that, sooner or later, it should be known that you can use in for testing membership like you would any other Python collection, and that it has no side effects on your QuerySet.

@carljm
Copy link
Member

carljm commented Jan 13, 2015

The current behavior of __contains__ is intentional, and consistent with the behavior of e.g. if queryset and len(queryset), both of which also fully evaluate the queryset, fetching all rows.

In all of these cases, which approach is more efficient depends on whether the full results of the queryset will otherwise be used. If they will be, then the current code is more efficient, because it only does one database query (caching the results) instead of two. Given that Django can't know what else you'll be doing with the queryset, it chooses to assume that if you are instantiating and storing a full queryset, you probably intend to use its full result set at some point. If you don't intend to, then you shouldn't be passing around the whole queryset, you should be calling exists() (or count()) explicitly yourself.

@carljm carljm closed this Jan 13, 2015
@gormster
Copy link
Author

@carljm I wholeheartedly disagree. This approach is a tiny optimisation in return for laying a huge trap for developers and is completely unPythonic. See my line note above.

@carljm
Copy link
Member

carljm commented Jan 13, 2015

@gormster Halving the number of queries performed is not necessarily a tiny optimization. I don't think there's anything un-Pythonic about Django's behavior here: in works correctly and exactly as you'd expect it to. Django just optimizes for "if you are using a queryset, you want its full results", which is a fully consistent choice and, once understood, gives the developer better flexibility. If the behavior of __contains__ (and __len__ and __bool__) were changed as you request, there would be no way to use in with a queryset whose full result set you do intend to later use, without triggering an extra query. Now that's an un-Pythonic trap.

In any case, if you still disagree, this is a long-standing (and already many times discussed) design decision that definitely won't be changed via discussion on a pull request; you will need to bring it to the django-developers mailing list for sure.

@gormster
Copy link
Author

The major difference between if queryset/exists(), len(queryset)/count() and this pull, is that those methods actually exist. There is no method for testing membership in a QuerySet at the moment. At the very least this should be a .contains(), rather than the poxy .filter(pk=obj.pk).exists() garbage that uglies up all our code.

@carljm
Copy link
Member

carljm commented Jan 13, 2015

Adding this as a .contains() method to QuerySet is a reasonable feature request that's consistent with the current design philosophy; I don't have an issue with that (at least not one that comes to mind immediately). It would need a ticket.

@carljm
Copy link
Member

carljm commented Jan 13, 2015

The main argument against contains(), I think, would just be that it's rarely if ever needed. I can't remember the last time I did .filter(pk=obj.pk).exists(); not sure I ever have, so I think "uglies up all our code" might be a bit of an over-statement. If I have a model instance and a set of criteria, and I want to know if the instance matches those criteria, I would just check the criteria against the instance in Python, not do an extra query against the database.

I think I remember a while back that @ptone was working on something that could automatically do an entirely-in-Python check of a model instance against a QuerySet's filter criteria. Not sure if that ever went anywhere.

@gormster
Copy link
Author

Maybe it's a difference in our coding style - or (crazy this might be) we're working on very different kinds of project - but it seems like I have to use it all the time. ManyToMany relationships are generally the source of it, I have boatloads in my current project, often to User, so I've got a bunch of testing if request.user belongs to a specific M2M. It's actual membership I'm trying to test, not an arbitrary "set of criteria" - QuerySets are, after all, our main view into the database.

If you're used to working with very linear data models I can see why you wouldn't need it very often. My models are anything but linear.

Also, a quick search on SO will show that a lot of people have the same problem, so it can't be that unique.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants