New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Efficient QuerySet.__contains__ #3906
Conversation
Doing `x in qs` current involves iterating over every single row in the database. This allows the search to be done in the database with the indexed primary key.
@@ -223,6 +223,9 @@ def __or__(self, other): | |||
combined.query.combine(other.query, sql.OR) | |||
return combined | |||
|
|||
def __contains__(self, obj): | |||
return isinstance(obj, self.model) and self.filter(pk=obj.pk).exists() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will subtly change existing code. First, if the queryset is already evaluated, then currently contains checks against the cached results. Second, when calling x in qs, the queryset doesn't get evaluated any more.
My preference is to not change this unless there is large-scale support to change the behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough - I'll change it to use the cached results, which, you're right, it should probably do.
However - if anyone's currently relying on __contains__
to evaluate a QuerySet, they probably shouldn't. It's a little-known side effect and causes a lot more harm than good. The Pythonic way of testing membership is in
- every Python devs knows this and will use it naturally. The fact that using in
causes an enormous performance hit on list-like QuerySets is something most devs won't see coming, and in a lot of cases won't notice until it's far too late.
This could be held up till a major release - 1.9 or 2.0 - where devs could expect changes to core behaviour. But I'm strongly of the opinion that, sooner or later, it should be known that you can use in
for testing membership like you would any other Python collection, and that it has no side effects on your QuerySet.
The current behavior of In all of these cases, which approach is more efficient depends on whether the full results of the queryset will otherwise be used. If they will be, then the current code is more efficient, because it only does one database query (caching the results) instead of two. Given that Django can't know what else you'll be doing with the queryset, it chooses to assume that if you are instantiating and storing a full queryset, you probably intend to use its full result set at some point. If you don't intend to, then you shouldn't be passing around the whole queryset, you should be calling |
@carljm I wholeheartedly disagree. This approach is a tiny optimisation in return for laying a huge trap for developers and is completely unPythonic. See my line note above. |
@gormster Halving the number of queries performed is not necessarily a tiny optimization. I don't think there's anything un-Pythonic about Django's behavior here: In any case, if you still disagree, this is a long-standing (and already many times discussed) design decision that definitely won't be changed via discussion on a pull request; you will need to bring it to the django-developers mailing list for sure. |
The major difference between |
Adding this as a |
The main argument against I think I remember a while back that @ptone was working on something that could automatically do an entirely-in-Python check of a model instance against a QuerySet's filter criteria. Not sure if that ever went anywhere. |
Maybe it's a difference in our coding style - or (crazy this might be) we're working on very different kinds of project - but it seems like I have to use it all the time. ManyToMany relationships are generally the source of it, I have boatloads in my current project, often to If you're used to working with very linear data models I can see why you wouldn't need it very often. My models are anything but linear. Also, a quick search on SO will show that a lot of people have the same problem, so it can't be that unique. |
Doing
x in qs
current involves iterating over every single row in thedatabase. This allows the search to be done in the database with the
indexed primary key.