New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed #27849 -- Added filtering support to aggregates. #8352

Merged
merged 1 commit into from Aug 12, 2017

Conversation

Projects
None yet
6 participants
@orf
Copy link
Contributor

orf commented Apr 12, 2017

Re: #8306, this MR is an initial stab at supporting .filter and .exclude on any Aggregate.

We do this in two ways: in the case of postgres we use the SQL 2003 FILTER syntax: AGG(field) FILTER (WHERE ...), and on other databases we emulate this with AGG(CASE ... THEN field ELSE NULL). This is currently only supported in Postgres.

The reason for supporting both, despite being functionally equivalent, is that the postgres syntax is faster.

I'm pretty sure the implementation might make someone more knowledgeable about django aggregates sick, but I think it's a good initial draft. It's slightly tricky because while the CASE syntax is nested within the aggregate the FILTER syntax appears outside, so I think we need a custom AggregateFilter expression to handle these two types.

Code quality is poor, I just wanted to use this as a proof of concept to spur a discussion about how (or even if) to best support this.

Made up example use-case.

one_week_ago = timezone.now().date() - timedelta(days=7)

Mailboxes.objects.annotate(
   read_emails=Count('emails').filter(unread=False),
   unread_emails=Count('emails').filter(unread=True),
   recent_emails=Count('emails').filter(received_date__lt=one_week_ago)
)
return compiler.compile(FilterWhere(self.value, self.condition, output_field=self.output_field))

def filter(self, **kwargs):
return AggregateFilter(value=self.value, condition=self.condition | Q(**kwargs))

This comment has been minimized.

@bluetech

bluetech Apr 12, 2017

Contributor

I think chaining filter/exclude should be an AND not an OR. It makes more sense to me and I'm pretty sure that what queryset filter/exclude do.

This comment has been minimized.

@charettes

charettes Apr 12, 2017

Member

This will also have to take into account filter(m2m__field=val1, criteria=m2m__otherfield=val2) != filter(m2m__field=val1).filter(m2m__otherfield=val2) as explained in spanning multi-valued relationships

This comment has been minimized.

@orf

orf Apr 12, 2017

Contributor

Hah, yeah, I'm not sure why I made it an OR. I've updated it. I'm not sure how to handle the spanning of relationships like that though :/


def test_double_filtered_aggregates(self):
agg = Sum('age').filter(name='test2').filter(name='test')

This comment has been minimized.

@bluetech

bluetech Apr 12, 2017

Contributor

i.e. here I would expect it to filter out everything.

@charettes

This comment has been minimized.

Copy link
Member

charettes commented Apr 12, 2017

I'm still wondering what's the use case for chaining aggregate filters. Is there a reason why allowing aggregates to accept a filter parameter accepting Q objects is not enough?

What if we want to add an AggregateOrderBy as well? How would the template be defined to make sure both ordering and filtering be used at the same time?

@orf

This comment has been minimized.

Copy link
Contributor

orf commented Apr 12, 2017

Hey @charettes,
I'm unclear of the nicest API for this feature. I personally quite like the idea of a .filter and .exclude method, it makes it feel quite natural and like usual querysets (and with the usual benefits). However if chaining them is too complex then maybe we can not support that - but then it's not quite like a queryset :(

My reasoning behind a separate method is also down to the implementation. As far as I can tell allowing aggregates to accept a Q object in the constructor wouldn't be possible without overriding the as_sql method of the Function expression. I wanted to avoid this as I couldn't work out how to do that nicely. Thus the filter (and exclude) methods return a separate Expression that wraps the inner aggregate.

Again, this is just a draft and is most likely not the best way to go implementation wise. If you think there is a better way and could share a gist or some more details that would be awesome!

@orf orf force-pushed the orf:filtered-aggregates branch from a8be734 to 8f3fed3 Apr 12, 2017

@orf

This comment has been minimized.

Copy link
Contributor

orf commented Apr 12, 2017

Ahh, I see that the FILTER syntax is standardized in SQL:2003. That means the FilterWhere should be moved out of the contrib.postgres package and into the aggregate file itself.

As it's standardized, perhaps it should be the cornerstone of this. As in, the FilterWhere class itself is responsible for falling back to a CASE WHEN emulation depending on a database feature flag? Rather than the current code with a class that produces a FilterWhere for postgres or a CASE WHEN expression for all others.

Thoughts?

@orf orf force-pushed the orf:filtered-aggregates branch from 6b557a6 to 6373199 Apr 23, 2017

@orf

This comment has been minimized.

Copy link
Contributor

orf commented Apr 23, 2017

Hey @charettes, I've changed my merge request to accept a Q object via a filter keyword now as discussed previously. Can you give me some feedback?

I'm not sure if I like adding WhereNode to the list of allowed types that When accepts, perhaps there is a better way? Also how should I test this, should I write two tests for each and every aggregate? Should I add them as a separate TestCase or inline them?

@charettes
Copy link
Member

charettes left a comment

I'm not sure if I like adding WhereNode to the list of allowed types that When accepts, perhaps there is a better way?

You shouldn't have to do that I believe, I'll give it a try during the week.

Also how should I test this, should I write two tests for each and every aggregate? Should I add them as a separate TestCase or inline them?

I don't think it's worth adding a test case for every existing aggregates. A single test case for the filtering behavior of Aggregate and and subclasses that override as_sql() should be enough (I haven't found any so far).


if not supports_filter:
case_statement = self.get_case_expression()
self.set_source_expressions([case_statement])

This comment has been minimized.

@charettes

charettes Apr 23, 2017

Member

I think you should should avoid altering self in this case, you should clone self and call set_source_expressions on the clone.

@orf orf force-pushed the orf:filtered-aggregates branch from c71be3c to 550141b Apr 24, 2017

@orf

This comment has been minimized.

Copy link
Contributor

orf commented Apr 24, 2017

Hey, I've added a few more tests, including one that supports using both a CASE statement and a filter parameter. It produces some weird query akin to:

SUM(CASE WHEN x THEN CASE WHEN y THEN z ELSE NULL END ELSE NULL END)

I thought this would fail to be honest but it seems to work fine. We could try and merge the query with the inner CASE to produce a single statement, but to be honest I don't think it's worth it.

I also changed the __repr__ of Aggregate to include the query:

Sum(F(test)) WHERE (AND: ('name', 'test')).

Also using Count(*) with 'filter' isn't supported, at least with CASE statements. I think it would work with postgres and the FILTER (WHERE...), but it's easiest to just catch it in the __init__ for all cases I think.

How can I do a test run on Oracle? It would be good to just do one test run about now to see if it passes.

@charettes

This comment has been minimized.

Copy link
Member

charettes commented Apr 24, 2017

Hey @orf,

We could try and merge the query with the inner CASE to produce a single statement, but to be honest I don't think it's worth it.

As you said, I don't think it's worth it.

Here's some adjustments that get rid of a lot boiler plate.

master...charettes:filter-aggregation

How can I do a test run on Oracle? It would be good to just do one test run about now to see if it passes.

I'll do that just now.

It would be great to have tests for aggregate that takes more than a single expression as well. E.g. a test GROUP_CONCAT might be working on all supported backend (GroupConcat(F('field'), Value(';'), filter=Q(field__contains='data')))

@charettes

This comment has been minimized.

Copy link
Member

charettes commented Apr 24, 2017

buildbot, test on oracle.

@orf orf force-pushed the orf:filtered-aggregates branch 2 times, most recently from 0a4e2c0 to 9dec834 Apr 29, 2017

@orf

This comment has been minimized.

Copy link
Contributor

orf commented Apr 29, 2017

Thank you very much for those changes @charettes, they look excellent. I have cherry picked them and rebased on master.

I agree about the tests, but I could not find any core aggregates that support multiple parameters. GROUP_CONCAT is supported on some backends but I'm not sure which ones. Do you want me to add a custom GroupConcat expression just for the tests, or am I getting the wrong end of the stick? I did add a test for the postgres StringAgg though.

I also cannot see the results of the oracle test, I had a look in Jenkins but could see no build triggered. Assuming that it passes, and that multiple parameter filtered expressions work as expected, shall I start adding some documentation?

Edit: I did change your __repr__ method to call super().__repr__ if no filter is specified. A few tests broke as it output filter=None, and I think it's best to only output filter=x if filter is specified. If you disagree I can fix up the tests to expect filter=x from the repr() calls.

@charettes
Copy link
Member

charettes left a comment

I would be great to have test coverage for the annotate() case and filtering across relations and m2m fields as well.

This will also require documentation.

from django.contrib.postgres.aggregates import StringAgg
q = Q(name='test')
agg = StringAgg(F('name'), delimiter=';', filter=q)
self.assertEqual(Author.objects.aggregate(name=agg)['name'], 'test')

This comment has been minimized.

@charettes

charettes Apr 29, 2017

Member

Since StringAgg is only implemented on PostgreSQL which uses FILTER and not the CASE(WHEN()) fallback this doesn't actually test the source_expressions[0] wrapping. Since we don't have any builtin aggregate with multiple argument this test might not be required after all.

@orf orf force-pushed the orf:filtered-aggregates branch from 418c439 to 9bc3cfd May 7, 2017

@orf

This comment has been minimized.

Copy link
Contributor

orf commented May 7, 2017

I added some extra tests for m2m and foreign key relations, and added an initial bit of documentation, however I found an issue with the current implementation. Running this query:

pages_annotate = Sum('book__pages', filter=Q(book__rating__gt=3))
age_agg = Sum('age', filter=Q(total_pages__gte=400))
Author.objects.annotate(total_pages=pages_annotate).aggregate(summed_age=age_agg)

will throw an error. I added a test called test_filtered_aggregate_on_annotate for it, and the reason it errors as far as I can tell is that when you call .aggregate() that relies on an annotated field it uses the source_expression of the annotation to get the field, which winds up throwing an error in Aggregate.resolve_expression as the .filter attribute is already resolved to a WhereNode.

If you bypass this error it will generate a query similar to this:

SELECT sum(CASE WHERE ... THEN sum(..)) FROM (... inner query)

Which throws a SQL error about misusing the sum() aggregate.

I've spent half a day trying to dig into this but I can't find a suitable solution. I will work on some more documentation this week.

@charettes

This comment has been minimized.

Copy link
Member

charettes commented May 7, 2017

@orf, thanks for your continued efforts on this, I'll try to have a look at the annotate().aggregate() issue this week!

@orf

This comment has been minimized.

Copy link
Contributor

orf commented May 18, 2017

@orf orf force-pushed the orf:filtered-aggregates branch from 9bc3cfd to d766c25 May 18, 2017

@charettes

This comment has been minimized.

Copy link
Member

charettes commented May 19, 2017

nice work @orf, I'll try to have a look at this issue this week-end.

@charettes

This comment has been minimized.

Copy link
Member

charettes commented Jun 5, 2017

Hey @orf! I spent some time tonight trying to figure out what could be wrong and after 2 hours of hair pulling I identified the issue.

When the ORM performs annotation aggregation it pushes the annotation into a subquery and perform the aggrations on it. The code that takes care of the alias rewrites was simply not aware of the filter attribute and wasn't rewriting it's content.

I've submited a PR to your branch with the fix but it's really just a hack. I think we should try shoving being a better get_source_expressions()/set_source_expression() citizen instead and stuff filter in it but my initial attempts are breaking other stuff.

@orf orf force-pushed the orf:filtered-aggregates branch 2 times, most recently from d01a30e to 782d902 Jun 8, 2017

@charettes charettes changed the title [WIP] Support shorthand syntax for filtered aggregates Fixed #27849 -- Added filtering support to aggregates. Jun 9, 2017

@charettes

This comment has been minimized.

Copy link
Member

charettes commented Jun 9, 2017

Thanks for merging the last adjustments!

I'll give it a review in the next few days but given it includes code from me as well I think it would be great if we could find someone else to review my changes.

Do you think you could take care of fixing the flake8/docs failures and the probable adjustments to the documentation for the remaining part of the review?

return [e._output_field_or_none for e in super().get_source_expressions()]

def get_source_expressions(self):
source_expressions = super(Aggregate, self).get_source_expressions()

This comment has been minimized.

@charettes

charettes Jun 9, 2017

Member

This could be a bare super().

if self.filter:
self.filter = exprs[-1]
exprs = exprs[:-1]
return super(Aggregate, self).set_source_expressions(exprs)

This comment has been minimized.

@charettes

charettes Jun 9, 2017

Member

This one as well.

if not summarize:
expressions = c.get_source_expressions()
expressions = super(Aggregate, c).get_source_expressions()

This comment has been minimized.

@charettes

charettes Jun 9, 2017

Member

We might want to mention we use super() to avoid checking against self.filter.

@@ -226,6 +226,11 @@ Models
number of rows fetched by the Python database client when streaming results
from the database. For databases that don't support server-side cursors, it
controls the number of results Django fetches from the database adapter.
Aggregations

This comment has been minimized.

@charettes

charettes Jun 9, 2017

Member

Needs a new line. Maybe drop the s?

Aggregations
~~~~~~~~~~~~

* Built in aggregates now accept a `filter` parameter that can be used to

This comment has been minimized.

@charettes

charettes Jun 9, 2017

Member

Double `` around filter. You might want to mention it's using the SQL 2003 FILTER clause when the backend supports it and fallback to When(Case())?


.. versionchanged:: 2.0

The filter argument to annotations was added

This comment has been minimized.

@charettes

charettes Jun 9, 2017

Member

Add a trailing .?

@@ -359,7 +379,7 @@ Given:

Here's an example with the ``Count`` aggregate::

>>> a, b = Publisher.objects.annotate(num_books=Count('book', distinct=True)).filter(book__rating__gt=3.0)
>>> a, b = Publisher.objects.annotate(num_books=Count('book', distinct=True)).filter(book__rating__gte=3.0)

This comment has been minimized.

@charettes

charettes Jun 9, 2017

Member

This change looks unrelated to our work here?

@orf

This comment has been minimized.

Copy link
Contributor

orf commented Jul 20, 2017

I've made the changes as requested, and can add tests for the _get_repr_options stuff once #8793 is merged.

I used the {**x} syntax which is 3.5+, which should be OK for master, but the tests seem to be running on 3.4 which obviously fails. For now I will replace {**...} with dict(), but I'm loathe to do so.

@orf orf force-pushed the orf:filtered-aggregates branch 2 times, most recently from 0dd0df1 to 842f8e7 Jul 20, 2017

@timgraham

This comment has been minimized.

Copy link
Member

timgraham commented Jul 21, 2017

The decision about whether or not to support Python 3.4 for Django 2.0 is still up for debate.

@orf orf force-pushed the orf:filtered-aggregates branch 2 times, most recently from cff7d97 to 17914a8 Jul 21, 2017

self.assertEqual(repr(StdDev('a', filter=filter)), "StdDev(F(a), filter=(AND: ('a', 1)), sample=False)")
self.assertEqual(repr(Sum('a', filter=filter)), "Sum(F(a), filter=(AND: ('a', 1)))")
self.assertEqual(repr(Variance('a', sample=True, filter=filter)),
"Variance(F(a), filter=(AND: ('a', 1)), sample=True)")

This comment has been minimized.

@felixxm

felixxm Jul 21, 2017

Member

Please use hanging indentation due to the coding-style:

self.assertEqual(
    repr(Variance('a', sample=True, filter=filter)),
    "Variance(F(a), filter=(AND: ('a', 1)), sample=True)",
)
def test_aggregate_str(self):
q = Q(name='test')
agg = Sum('test', filter=q)
self.assertIn(str(q), str(agg))

This comment has been minimized.

@felixxm

felixxm Jul 21, 2017

Member

I think test_aggregate_str can be removed.

@@ -15,6 +15,74 @@
from .models import Author, Book, Publisher, Store


class FilteredAggregateTests(TestCase):

This comment has been minimized.

@felixxm

felixxm Jul 21, 2017

Member

IMO it would be worth to add another Author e.g.:

cls.a3 = Author.objects.create(name='name', age=40)
cls.a1.friends.add(cls.a3)

because in most cases you test only single row aggregations. Maybe e.g.

def test_filtered_aggregates(self):
    agg = Sum('age', filter=Q(name__startswith='test'))
    self.assertEqual(Author.objects.aggregate(age=agg)['age'], 100)

or

def test_case_aggregate(self):
    agg = Sum(Case(When(friends__age=40, then=F('friends__age'))), filter=Q(friends__name='test'))
    self.assertEqual(Author.objects.aggregate(age=agg)['age'], 40)

instead of current versions.

This comment has been minimized.

@orf

orf Jul 21, 2017

Contributor

Good point, I've changed the tests to try and rely on more than one row.

@orf orf force-pushed the orf:filtered-aggregates branch 2 times, most recently from 26b1ed9 to cc90a17 Jul 21, 2017

@orf

This comment has been minimized.

Copy link
Contributor

orf commented Jul 27, 2017

I've rebased on master, and squashed all the changes from various reviews into a separate commit. Please let me know if there are any other changes anyone thinks would be good to make.

or aggregation, it is more efficient to use ``.filter()`` on the
queryset to exclude rows. The aggregation ``filter`` argument is
only useful when using two or more aggregations over the same
relations with differing conditionals.

This comment has been minimized.

@felixxm

felixxm Jul 27, 2017

Member

Wrap docs everywhere at 79 characters, e.g.:

    Be careful of using the ``filter`` argument with a single annotation or
    aggregation, it is more efficient to use ``.filter()`` on the queryset to
    exclude rows. The aggregation ``filter`` argument is only useful when using
    two or more aggregations over the same relations with differing
    conditionals.

This comment has been minimized.

@orf

orf Jul 27, 2017

Contributor

Good point, I've gone through and wrapped everywhere at 79 characters in the latest commit.

@orf orf force-pushed the orf:filtered-aggregates branch from cc90a17 to 309c30b Jul 27, 2017

@timgraham

This comment has been minimized.

Copy link
Member

timgraham commented Jul 30, 2017

I didn't do a complete review yet but please update the "Aggregation functions" section in ref/models/querysets.txt with the filter argument.

@orf

This comment has been minimized.

Copy link
Contributor

orf commented Jul 31, 2017

Done.

I also noticed that the StatAggregate classes in contrib.postgres didn't accept the filter argument, for completeness I added it to their __init__ functions and to the documentation. I wasn't sure if I should, or how much I should add to the docs. I can remove that commit if needed.

@orf orf force-pushed the orf:filtered-aggregates branch 2 times, most recently from 04bfff9 to c7c9991 Aug 7, 2017

@orf

This comment has been minimized.

Copy link
Contributor

orf commented Aug 7, 2017

I fixed the failing tests and rebased on master 👍

I've got some time for make any changes needed by a review in the next couple of weeks, I'm not sure about after that.

@timgraham timgraham force-pushed the orf:filtered-aggregates branch from c7c9991 to 09c876a Aug 12, 2017

@django django deleted a comment from atombrella Aug 12, 2017

@timgraham

This comment has been minimized.

Copy link
Member

timgraham commented Aug 12, 2017

Want to look at my edits commit before I squash it?

@orf

This comment has been minimized.

Copy link
Contributor

orf commented Aug 12, 2017

@timgraham timgraham force-pushed the orf:filtered-aggregates branch from 09c876a to d9d51f9 Aug 12, 2017

@timgraham timgraham force-pushed the orf:filtered-aggregates branch from d9d51f9 to b78d100 Aug 12, 2017

@timgraham timgraham merged commit b78d100 into django:master Aug 12, 2017

17 checks passed

docs Build #18414 ended
Details
flake8 Build #18526 ended
Details
isort Build #18542 succeeded in 20 sec
Details
pull-requests-javascript Build #14909 ended
Details
pull-requests-trusty/database=mysql,label=trusty-pr,python=python3.4 Build #14144 ended
Details
pull-requests-trusty/database=mysql,label=trusty-pr,python=python3.6 Build #14144 ended
Details
pull-requests-trusty/database=mysql_gis,label=trusty-pr,python=python3.4 Build #14144 ended
Details
pull-requests-trusty/database=mysql_gis,label=trusty-pr,python=python3.6 Build #14144 ended
Details
pull-requests-trusty/database=postgis,label=trusty-pr,python=python3.4 Build #14144 ended
Details
pull-requests-trusty/database=postgis,label=trusty-pr,python=python3.6 Build #14144 ended
Details
pull-requests-trusty/database=postgres,label=trusty-pr,python=python3.4 Build #14144 ended
Details
pull-requests-trusty/database=postgres,label=trusty-pr,python=python3.6 Build #14144 ended
Details
pull-requests-trusty/database=spatialite,label=trusty-pr,python=python3.4 Build #14144 ended
Details
pull-requests-trusty/database=spatialite,label=trusty-pr,python=python3.6 Build #14144 ended
Details
pull-requests-trusty/database=sqlite3,label=trusty-pr,python=python3.4 Build #14144 ended
Details
pull-requests-trusty/database=sqlite3,label=trusty-pr,python=python3.6 Build #14144 ended
Details
pull-requests-windows/database=sqlite3,label=windows,python=Python35 Build #10459 ended
Details
@orf

This comment has been minimized.

Copy link
Contributor

orf commented Aug 12, 2017

Thanks! And thank you @charettes for the hairy bits. Wouldn't have been possible without you.

@charettes

This comment has been minimized.

Copy link
Member

charettes commented Aug 13, 2017

Np @orf, thanks for your continued efforts on pushing this forward! 🎉

@orf orf deleted the orf:filtered-aggregates branch Aug 9, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment