Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problem with chained filters and Django ORM #745

Closed
eduardocalil opened this issue Jul 4, 2017 · 22 comments
Closed

Problem with chained filters and Django ORM #745

eduardocalil opened this issue Jul 4, 2017 · 22 comments

Comments

@eduardocalil
Copy link

eduardocalil commented Jul 4, 2017

As related in the issue #537, the Django ORM generate duplicated INNER JOINS for each .filter() .

For example:
Model.objecs.filter(table1__attr1=value1).filter(table1__attr2=value2)

The built query is:

SELECT *
FROM model_table
INNER JOIN table1 ON (model_table.id = table1.model_table_id)
INNER JOIN table1 T5 ON (model_table.id = T5.model_table_id)
WHERE (table1.attr1=value1 AND T5.attr2=value2)

But the correct query should be:

SELECT *
FROM model_table
INNER JOIN table1 ON (model_table.id = table1.model_table_id)
WHERE (table1.attr1=value1 AND table1.attr2=value2)

The first query may return unwanted or unexpected results.

Reading the code I found the method def qs(self) in class BaseFilterSet from the file filterset.py and I realized the following workflow:

  1. Get all the model objects using qs = self.queryset.all()
  2. Iterates in the list of declared filters (fields)
  3. In other words, for each field is done the same as qs = self.queryset.filter(field_n=value_n) , what causes the described problem.
  4. When the iteration finishes, the queryset (qs) is returned

The solution that I've developed for the chained filters problem is:

    def qs(self):
        kwargs = {}
        if not hasattr(self, '_qs'):
            if not self.is_bound:
                self._qs = self.queryset.all()
                return self._qs

            if not self.form.is_valid():
                if self.strict == STRICTNESS.RAISE_VALIDATION_ERROR:
                    raise forms.ValidationError(self.form.errors)
                elif self.strict == STRICTNESS.RETURN_NO_RESULTS:
                    self._qs = self.queryset.none()
                    return self._qs
                # else STRICTNESS.IGNORE...  ignoring

            # start with all the results and filter from there
            qs = self.queryset.all()
            for name, filter_ in six.iteritems(self.filters):
                value = self.form.cleaned_data.get(name)

                if value is not None and value:  # valid & clean data
                    # If all o these conditions are true, it means that
                    # field (filter_) is filtering the queryset
                    # So, it must be added to the filtering dict
                    kwargs[filter_.name] = value

            # The same as qs.filter(filter_.name1=value1, filter_.name2=value2, ...)
            self._qs = qs.filter(**kwargs)

        return self._qs

Instead of making queries for each field, all of them are stored in a dictionary and the queryset is filtered only one time using this dict.

What do you think about this solution? Would it break other parts of the code?

@carltongibson
Copy link
Owner

What do you think about this solution?

Still not yet convinced this is a DF issue, rather than an ORM one — what do the Django people say about this? This behaviour is as old as the ORM itself, so it's come up before.

Would it break other parts of the code?

There's an easy way to check that. 🙂 — Open an PR and see what breaks.

@carltongibson
Copy link
Owner

carltongibson commented Jul 4, 2017

what do the Django people say

Also, what do the DB people say? If I make a double join, is that not a no-op? (Maybe it isn't — but, if not, why not? And, why is this a DF issue?)

@rpkilby
Copy link
Collaborator

rpkilby commented Jul 4, 2017

If I'm not mistaken, this specifically refers to filtering across to-many relationships, which returns different results for .filter(related__a=1).filter(related__b=2) and .filter(related__a=1, related__b=2). There are no issues for to-one relationships.

If I make a double join, is that not a no-op?

Nope - the second join is aliased as T5, which is what the second where clause applies to. This is different from the latter example, where both where clauses operate on table1. The docs topic give a good example of the effective differences between the two queries. While both are technically valid, 90% of the time the latter query is what's intended, while the former is what django-filter constructs.

why is this a DF issue?

There really isn't a trivial way to implement the desired behavior at the moment.

@rpkilby
Copy link
Collaborator

rpkilby commented Jul 4, 2017

Couple of side notes:

@eduardocalil
Copy link
Author

I think that I've finally found the solution for this issue. Inside the model QuerySet exists a method called _next_is_sticky(). It prevents the duplicated INNER JOINS. You will undestand it better after reading a very short documentation about this method in the following link:

https://kite.com/docs/python/django.db.models.query.QuerySet._next_is_sticky

Here is my Pull Request: #753

@eduardocalil
Copy link
Author

eduardocalil commented Aug 1, 2017

But it has not passed in django-rest tests. Can someone who understands how it works help me?

@josert
Copy link

josert commented Oct 3, 2017

As a quick win it would be really great to adapt this behavior until a more consistent feature was available (as a config parameter to determine if it is desired to work with stiky or not in the current filter).

Current behavior breaks the possibility to use this magnificient plugin in project where is neccessary to filter between relations as these, a real shame : (

@steve-kasica
Copy link

I agree that the _next_is_sticky() approach looks promising; however, I found the implementation in PR #753 brakes the desired behavior of ModelMultipleChoiceFilter where conjoined=True. As you might imagine, it executed an SQL query along these lines and returns the empty set.

SELECT *
FROM model_table
INNER JOIN table1 ON (model_table.id = table1.model_table_id)
WHERE (table1.attrID=value1 AND table1.attrID=value2)

Having a solution to this to-many chaining issue would greatly simplify my REST API that uses django-filter to filter results across many-to-many relationships. Please let me know how I can lend a hand.

@carltongibson
Copy link
Owner

I'm essentially -1 on making a change here.

First off, I'm not going to add anything on the basis of it being "a quick win" — that way lays trouble later.

Second, more to the point, it seems to me you should be handling this with a custom filter, custom field using a multi-widget. i.e. a single field expecting multiple values which is then used in a single filter call on the queryset.

I'd be interested in seeing what people come up with that way. No doubt there's something we can add, which doesn't involve setting state flags on the core filtering behaviour.

@rpkilby
Copy link
Collaborator

rpkilby commented Oct 9, 2017

For what it's worth, I've been making a lot of headway on the nested filtering issue. Basically, queries will go from

Blog.objects.filter(entry__headline__contains='Lennon').filter(entry__pub_date__year=2008)

to something like

Blog.objects.filter(
    entry=Entry.objects.filter(
        headline__contains='Lennon',
        pub_date__year=2008,
    ),
)

In addition to enabling the correct behavior for to-many relationships, this will allow us to enable request-based filtering on related queries. eg, those entries that are currently published or are in the process of being authored by the user.

def visible_entries(request):
    return Entry.objects.filter(Q(is_published=True) | Q(author=request.user))

Blog.objects.filter(
    entry=visible_entries(request).filter(
        headline__contains='Lennon',
        pub_date__year=2008,
    ),
)

That said, I'm beating my head against the wall in regards to some potential performance issues. Unsurprisingly, instantiating n FilterSets is n times as expensive.

@carltongibson
Copy link
Owner

@rpkilby Push your branch. Let's see what you've got. 🙂

I don't want anything too complex. The point of DF is to avoid common boilerplate. There's nothing wrong in writing the filtering logic by hand in non-core cases. I much prefer that to magic no-one can debug.

@carltongibson
Copy link
Owner

Closing this as per #753

@rpkilby
Copy link
Collaborator

rpkilby commented Oct 18, 2017

@carltongibson, I'm fairly close to pushing out a PR, but it's for the drf-filters project.

@carltongibson
Copy link
Owner

OK. Cool. Drop a note here when it’s ready. I’ll have a look.

@rpkilby
Copy link
Collaborator

rpkilby commented Oct 24, 2017

Okay, a first pass is up (see: philipn/django-rest-framework-filters#197).

@jonathan-golorry
Copy link

jonathan-golorry commented Jan 30, 2018

I'm also encountering this issue. What was wrong with the dictionary approach? It seems much better than mucking around with next_is_sticky or nested querysets.

Edit: Ah, I think I understand the issue rpkilby is describing in the "couple of side notes" post. There are a lot of django ORM features like "distinct" that are locked off by sticking to a kwargs dictionary or even a series of Q objects.

bruno-fs pushed a commit to tikal-tech/django-filter that referenced this issue Sep 20, 2018
bruno-fs pushed a commit to tikal-tech/django-filter that referenced this issue Sep 20, 2018
bruno-fs added a commit to tikal-tech/django-filter that referenced this issue Sep 20, 2018
bruno-fs added a commit to tikal-tech/django-filter that referenced this issue Sep 20, 2018
bruno-fs pushed a commit to tikal-tech/django-filter that referenced this issue Sep 21, 2018
bruno-fs added a commit to tikal-tech/django-filter that referenced this issue Sep 21, 2018
bruno-fs added a commit to tikal-tech/django-filter that referenced this issue Sep 21, 2018
bruno-fs pushed a commit to tikal-tech/django-filter that referenced this issue Sep 21, 2018
bruno-fs added a commit to tikal-tech/django-filter that referenced this issue Sep 21, 2018
@jacebrowning
Copy link

jacebrowning commented Nov 3, 2019

I'm also blocked by this issue. Including three or more queries of fields on a many-to-many relationship (e.g. /api/things?other__foo=1&other__bar=2&other__qux=3) cases my application to time out due to the repeated INNER JOIN statements.

Applying @eduardocalil's fix the the current version of django-filter is still a one-line change: citizenlabsgr@7e228f5

@pawelzar
Copy link

pawelzar commented Nov 20, 2019

Hey, I also encountered some issues related to filtering on the relationships. When I want to use 2 filters on a related field I always expect the filters to be applied on the same table. I wasn't able to find a real world example, where django's approach would be justified and seems like this is problematic for many users.

As @jacebrowning mentioned sometimes it generates so many JOINs that it's not even possible to run the query (e.g. for multi-word search term). This is still an issue in django admin's search when it uses the relationships. There was even an attempt to change it quite a long time ago, but unfortunately it was reverted.

Would it be possible to add this _next_is_sticky one-line fix which @jacebrowning mentioned to django-filter? It looks promising. Maybe we could check if there is a FILTERS_USE_STICKY = True option in django settings and then apply the _next_is_sticky to filters? This would preserve the current behavior, but would also provide a convenient solution for users, for whom the default behavior is not suitable. This shouldn't harm anyone. @carltongibson what do you think?

If this sounds reasonable I could prepare a PR with the adjustment + docs update and some tests.

@carltongibson
Copy link
Owner

Ok, so, the approach that would be worth looking at here is a multi-filter that wrapped up the whole multi-widget and single filter call business. That’s the solve it, but folks don’t know how to configure that themselves. It would be a nice addition.

@jonathan-golorry
Copy link

jonathan-golorry commented Dec 16, 2019

@carltongibson It's currently pretty difficult to set up because you need to subclass MultiValueField and either subclass MultiWidget or pass it widgets through the filter.

A MultiWidget by itself doesn't give you type conversion. If I run a widgets=[forms.TextInput, forms.NumberInput] widget through a CharFilter, both values are strings.

If I create a filter with field_class = forms.MultiValueField and pass it fields and a widget, then I still run into a NotImplementedError on compress.

I think you basically need to implement all of the following:

class GroupedWidget(SuffixedMultiWidget):
    suffixes = ["my_string", "my_number"]

    def __init__(self):
        super().__init__(widgets=[forms.TextInput, forms.NumberInput])


class GroupedField(forms.MultiValueField):
    widget = GroupedWidget

    def __init__(self, *args, **kwargs):
        fields = [forms.CharField(), forms.DecimalField()]
        super().__init__(fields=fields, *args, **kwargs)

    def compress(self, data_list):
        return data_list


class GroupedFilter(filters.Filter):
    field_class = GroupedField

I've written some helpers to automatically put the nested field's widgets onto a SuffixedMultiWidget and filter on multiple related values in a single call:

class SuffixedMultiField(forms.MultiValueField):
    widget_parent = SuffixedMultiWidget

    def __init__(self, suffixes, fields, *args, **kwargs):
        if "widget" not in kwargs:
            widgets = [f.widget for f in fields]

            class ChildWidget(self.widget_parent):
                def __init__(self, *args, **kwargs):
                    super().__init__(widgets=widgets, *args, **kwargs)

            ChildWidget.suffixes = suffixes

            kwargs["widget"] = ChildWidget
        super().__init__(fields=fields, *args, **kwargs)

    def compress(self, data_list):
        return data_list


class SuffixedMultiFilter(filters.Filter):
    field_class = SuffixedMultiField

    def filter(self, qs, value):
        lookups = {}
        widget = self.field.widget
        name = self.field_name
        for suffix, v in zip(widget.suffixes, value):
            if v not in EMPTY_VALUES:
                # lookup = widget.suffixed(name, suffix)
                lookup = "__".join([name, suffix]) if suffix else name
                lookups[lookup] = v

        if not lookups:
            return qs
        if self.distinct:
            qs = qs.distinct()

        qs = self.get_method(qs)(**lookups)
        return qs

This simplifies usage to:

    my_group = SuffixedMultiFilter(
        suffixes=["my_string", "my_number"],
        fields=[forms.CharField(), forms.DecimalField()],
        method="my_method",  # optional
    )

It's not perfect, but it's a lot more usable. Note that SuffixedMultiFilter.filter by default groups the lookups into a single filter/exclude, solving the discussed issue.

I actually subclass SuffixedMultiWidget to put 2 underscores between the name and suffix so that url parameters follow the same exact naming as django filters, but I left that configurable via parent_widget. Speaking of SuffixedMultiWidget, shouldn't the empty decompress return [None]*len(self.suffixes) instead of assuming a length of 2?

I also considered writing a method that could transform a filterset into a single grouped filter. This approach is cool, but we'd need to split the filter kwarg generation logic out into a separate function on all the existing filters for it to really be worth it.

@yuekui
Copy link

yuekui commented Mar 3, 2022

For anybody who's still looking for a solution of this issue, here's another simple alternative from my gist:

from django.db.models import Q
from django.db.models.constants import LOOKUP_SEP
from django_filters import rest_framework as django_filters
from django_filters.constants import EMPTY_VALUES


class GroupFilterSet(django_filters.FilterSet):
    def filter_queryset(self, queryset):
        """
        Group the fitlers by the first join table to
        reduce inner join queries for performance.
        This would avoid filter chaining like:
        `Model.objects.filter(table_foo__id__in=[xx,xx]).filter(table_foo__name__in=[xx,xx])`
        Instead, it would be grouped as:
        `Model.objects.filter(table_foo__id__in=[xx,xx], table_foo__name__in=[xx,xx])`
        Inspired by discussion at:
        https://github.com/carltongibson/django-filter/issues/745
        https://github.com/carltongibson/django-filter/pull/1167
        """
        groups = {}
        is_distincted = False
        for name, value in self.form.cleaned_data.items():
            if value in EMPTY_VALUES:
                continue
            f = self.filters[name]
            # Do not merge Qs for customized filter method due to complexity.
            if f._method or not f.__class__.filter == django_filters.Filter.filter:
                queryset = self.filters[name].filter(queryset, value)
                continue
            # Use the joined table name as key
            group_name = f.field_name.split(LOOKUP_SEP)[0]
            q = Q(**{LOOKUP_SEP.join([f.field_name, f.lookup_expr]): value})
            if f.exclude:
                q = ~q
            # Check if there's any join query with the same table
            if group_name in groups:
                groups[group_name] = groups[group_name] & q
            else:
                groups[group_name] = q
            if f.distinct:
                is_distincted = True
        for q in groups.values():
            queryset = queryset.filter(q)
        if is_distincted:
            queryset = queryset.distinct()
        return queryset

@mike667
Copy link

mike667 commented Aug 8, 2023

For anybody who's still looking for a solution of this issue, here's another simple alternative from my gist:

from django.db.models import Q
from django.db.models.constants import LOOKUP_SEP
from django_filters import rest_framework as django_filters
from django_filters.constants import EMPTY_VALUES


class GroupFilterSet(django_filters.FilterSet):
    def filter_queryset(self, queryset):
        """
        Group the fitlers by the first join table to
        reduce inner join queries for performance.
        This would avoid filter chaining like:
        `Model.objects.filter(table_foo__id__in=[xx,xx]).filter(table_foo__name__in=[xx,xx])`
        Instead, it would be grouped as:
        `Model.objects.filter(table_foo__id__in=[xx,xx], table_foo__name__in=[xx,xx])`
        Inspired by discussion at:
        https://github.com/carltongibson/django-filter/issues/745
        https://github.com/carltongibson/django-filter/pull/1167
        """
        groups = {}
        is_distincted = False
        for name, value in self.form.cleaned_data.items():
            if value in EMPTY_VALUES:
                continue
            f = self.filters[name]
            # Do not merge Qs for customized filter method due to complexity.
            if f._method or not f.__class__.filter == django_filters.Filter.filter:
                queryset = self.filters[name].filter(queryset, value)
                continue
            # Use the joined table name as key
            group_name = f.field_name.split(LOOKUP_SEP)[0]
            q = Q(**{LOOKUP_SEP.join([f.field_name, f.lookup_expr]): value})
            if f.exclude:
                q = ~q
            # Check if there's any join query with the same table
            if group_name in groups:
                groups[group_name] = groups[group_name] & q
            else:
                groups[group_name] = q
            if f.distinct:
                is_distincted = True
        for q in groups.values():
            queryset = queryset.filter(q)
        if is_distincted:
            queryset = queryset.distinct()
        return queryset

I expected this as default django-filter behaviour. Thank you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

10 participants