Efficient QuerySet.contains #3906

gormster · 2015-01-13T06:27:03Z

Doing x in qs current involves iterating over every single row in the
database. This allows the search to be done in the database with the
indexed primary key.

Doing `x in qs` current involves iterating over every single row in the database. This allows the search to be done in the database with the indexed primary key.

akaariai · 2015-01-13T06:31:19Z

django/db/models/query.py

@@ -223,6 +223,9 @@ def __or__(self, other):
        combined.query.combine(other.query, sql.OR)
        return combined

+    def __contains__(self, obj):
+        return isinstance(obj, self.model) and self.filter(pk=obj.pk).exists()


This will subtly change existing code. First, if the queryset is already evaluated, then currently contains checks against the cached results. Second, when calling x in qs, the queryset doesn't get evaluated any more.

My preference is to not change this unless there is large-scale support to change the behavior.

Fair enough - I'll change it to use the cached results, which, you're right, it should probably do.

However - if anyone's currently relying on __contains__ to evaluate a QuerySet, they probably shouldn't. It's a little-known side effect and causes a lot more harm than good. The Pythonic way of testing membership is in - every Python devs knows this and will use it naturally. The fact that using in causes an enormous performance hit on list-like QuerySets is something most devs won't see coming, and in a lot of cases won't notice until it's far too late.

This could be held up till a major release - 1.9 or 2.0 - where devs could expect changes to core behaviour. But I'm strongly of the opinion that, sooner or later, it should be known that you can use in for testing membership like you would any other Python collection, and that it has no side effects on your QuerySet.

carljm · 2015-01-13T06:39:52Z

The current behavior of __contains__ is intentional, and consistent with the behavior of e.g. if queryset and len(queryset), both of which also fully evaluate the queryset, fetching all rows.

In all of these cases, which approach is more efficient depends on whether the full results of the queryset will otherwise be used. If they will be, then the current code is more efficient, because it only does one database query (caching the results) instead of two. Given that Django can't know what else you'll be doing with the queryset, it chooses to assume that if you are instantiating and storing a full queryset, you probably intend to use its full result set at some point. If you don't intend to, then you shouldn't be passing around the whole queryset, you should be calling exists() (or count()) explicitly yourself.

gormster · 2015-01-13T06:42:52Z

@carljm I wholeheartedly disagree. This approach is a tiny optimisation in return for laying a huge trap for developers and is completely unPythonic. See my line note above.

carljm · 2015-01-13T06:48:17Z

@gormster Halving the number of queries performed is not necessarily a tiny optimization. I don't think there's anything un-Pythonic about Django's behavior here: in works correctly and exactly as you'd expect it to. Django just optimizes for "if you are using a queryset, you want its full results", which is a fully consistent choice and, once understood, gives the developer better flexibility. If the behavior of __contains__ (and __len__ and __bool__) were changed as you request, there would be no way to use in with a queryset whose full result set you do intend to later use, without triggering an extra query. Now that's an un-Pythonic trap.

In any case, if you still disagree, this is a long-standing (and already many times discussed) design decision that definitely won't be changed via discussion on a pull request; you will need to bring it to the django-developers mailing list for sure.

gormster · 2015-01-13T06:48:52Z

The major difference between if queryset/exists(), len(queryset)/count() and this pull, is that those methods actually exist. There is no method for testing membership in a QuerySet at the moment. At the very least this should be a .contains(), rather than the poxy .filter(pk=obj.pk).exists() garbage that uglies up all our code.

carljm · 2015-01-13T06:50:35Z

Adding this as a .contains() method to QuerySet is a reasonable feature request that's consistent with the current design philosophy; I don't have an issue with that (at least not one that comes to mind immediately). It would need a ticket.

carljm · 2015-01-13T07:04:11Z

The main argument against contains(), I think, would just be that it's rarely if ever needed. I can't remember the last time I did .filter(pk=obj.pk).exists(); not sure I ever have, so I think "uglies up all our code" might be a bit of an over-statement. If I have a model instance and a set of criteria, and I want to know if the instance matches those criteria, I would just check the criteria against the instance in Python, not do an extra query against the database.

I think I remember a while back that @ptone was working on something that could automatically do an entirely-in-Python check of a model instance against a QuerySet's filter criteria. Not sure if that ever went anywhere.

gormster · 2015-01-13T07:16:44Z

Maybe it's a difference in our coding style - or (crazy this might be) we're working on very different kinds of project - but it seems like I have to use it all the time. ManyToMany relationships are generally the source of it, I have boatloads in my current project, often to User, so I've got a bunch of testing if request.user belongs to a specific M2M. It's actual membership I'm trying to test, not an arbitrary "set of criteria" - QuerySets are, after all, our main view into the database.

If you're used to working with very linear data models I can see why you wouldn't need it very often. My models are anything but linear.

Also, a quick search on SO will show that a lot of people have the same problem, so it can't be that unique.

Efficient QuerySet.__contains__

9786a6d

Doing `x in qs` current involves iterating over every single row in the database. This allows the search to be done in the database with the indexed primary key.

akaariai reviewed Jan 13, 2015
View reviewed changes

carljm closed this Jan 13, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Efficient QuerySet.contains #3906

Efficient QuerySet.contains #3906

gormster commented Jan 13, 2015

akaariai Jan 13, 2015

gormster Jan 13, 2015

carljm commented Jan 13, 2015

gormster commented Jan 13, 2015

carljm commented Jan 13, 2015

gormster commented Jan 13, 2015

carljm commented Jan 13, 2015

carljm commented Jan 13, 2015

gormster commented Jan 13, 2015

Efficient QuerySet.__contains__ #3906

Efficient QuerySet.__contains__ #3906

Conversation

gormster commented Jan 13, 2015

akaariai Jan 13, 2015

Choose a reason for hiding this comment

gormster Jan 13, 2015

Choose a reason for hiding this comment

carljm commented Jan 13, 2015

gormster commented Jan 13, 2015

carljm commented Jan 13, 2015

gormster commented Jan 13, 2015

carljm commented Jan 13, 2015

carljm commented Jan 13, 2015

gormster commented Jan 13, 2015

Efficient QuerySet.contains #3906

Efficient QuerySet.contains #3906