Basic SUM/GROUP aggregation failure #147

Open
gardenia opened this Issue Aug 3, 2012 · 11 comments

Comments

Projects
None yet
4 participants

gardenia commented Aug 3, 2012

Given the basic model:

class PersonSpend(models.Model):
person_id = models.IntegerField()
when = models.DateTimeField()
spend = models.IntegerField()

And inserting the following rows:

PersonSpend.objects.create(person_id=1, spend=112, when=datetime(2012,1,1)) PersonSpend.objects.create(person_id=1, spend=74, when=datetime(2012,1,2))
PersonSpend.objects.create(person_id=2, spend=92, when=datetime(2012,1,1))

Running:
PersonSpend.objects.values('person_id').annotate(ttl_spend=Sum('spend'))

Yields:
[{'person_id': 1}, {'person_id': 1}, {'person_id': 2}]

Instead of the expected:
[{'person_id': 1, 'ttl_spend': 186}, {'person_id': 2, 'ttl_spend': 92}]

Is my expectation incorrect somehow? I do get the output I expect on sqlite which makes me think it is a bug. Does mongodb-engine support this workflow?

I will attach a failing test case which does the above.

gardenia commented Aug 3, 2012

diff --git a/tests/aggregations/models.py b/tests/aggregations/models.py
index c9439bc..3462ffb 100644
--- a/tests/aggregations/models.py
+++ b/tests/aggregations/models.py
@@ -3,3 +3,8 @@ from django.db import models
 class Person(models.Model):
     age = models.IntegerField()
     birthday = models.DateTimeField()
+
+class PersonSpend(models.Model):
+    person_id = models.IntegerField()
+    when = models.DateTimeField()
+    spend = models.IntegerField()
diff --git a/tests/aggregations/tests.py b/tests/aggregations/tests.py
index ebd63b1..703c841 100644
--- a/tests/aggregations/tests.py
+++ b/tests/aggregations/tests.py
@@ -1,9 +1,9 @@
 from datetime import datetime
 from django.db.models.aggregates import Count, Sum, Max, Min, Avg
 from .utils import TestCase
-from models import Person
+from models import Person, PersonSpend
 
-class SimpleTest(TestCase):
+class AggregationTest(TestCase):
     def test_aggregations(self):
         for age, birthday in (
             [4,  (2007, 12, 25)],
@@ -26,3 +26,11 @@ class SimpleTest(TestCase):
             'age__avg': 6.0,
             'id__count': 4
         })
+
+    def test_group_by_sum_aggregate(self):
+        PersonSpend.objects.create(person_id=1, spend=112, when=datetime(2012,1,1))
+        PersonSpend.objects.create(person_id=1, spend=74, when=datetime(2012,1,2))
+        PersonSpend.objects.create(person_id=2, spend=92, when=datetime(2012,1,1))
+
+        aggregates = PersonSpend.objects.values('person_id').annotate(ttl_spend=Sum('spend'))
+        self.assertEqual(sorted(aggregates), [{'person_id': 1, 'ttl_spend': 186}, {'person_id': 2, 'ttl_spend': 92}])

gardenia commented Aug 7, 2012

I need this functionality for a project and in particular for portability between sqlite and mongodb in production. I'm happy to take a stab at fixing it myself but I would appreciate a leg-up on what roughly needs done here from someone who understands this code better than me. I tried digging around the SQL update compiler which generates mongo specific code for updates and it looks like something similar might be needed here but I'm lacking in context so would appreciate some pointers.

Member

charettes commented Aug 7, 2012

@gardenia I can try to help you out if you find yourself blocked, I just have no time to write the patch myself. Just make sure to base your pull request against the develop branch and include the test you've written in it.

Contributor

jonashaag commented Aug 7, 2012

Is this even representable in MongoDB?

gardenia commented Aug 8, 2012

@jonashaag: yes:

MongoDB shell version: 2.0.4
connecting to: foo

db.bar.insert({ person_id: 1, spend: 112 })
db.bar.insert({ person_id: 1, spend: 74 })
db.bar.insert({ person_id: 2, spend: 92 })
db.bar.group({ key: { person_id: true }, cond: {}, reduce: function(obj,prev) { prev.ttl_spend += obj.spend }, initial: { ttl_spend: 0 } })
[
{
"person_id" : 1,
"ttl_spend" : 186
},
{
"person_id" : 2,
"ttl_spend" : 92
}
]

The above aggregates all rows in the collection but "cond" can match a subset of rows (as per find() synrax).

Caveat: (from the manual) "Note: currently one must use map/reduce instead of group() in sharded MongoDB configurations."

The "new aggregation framework" may have a more concise syntax for this:
(http://blog.mongodb.org/post/16015854270/operations-in-the-new-aggregation-framework)

Contributor

jonashaag commented Aug 8, 2012

We don't require 2.0 yet but I think it's okay to assume 2.0 anyway. Because at the time we'll do the release everyone should be on 2.0 already. (If we ever get to make a release...)

This looks like a bigger task to me because you need to rewrite all of the aggregation stuff to use the new group feature. #105

gardenia commented Aug 8, 2012

group() isn't a new feature, AFAICS. It seems to have been around since 1.x. The "new aggregation framework" is something different and is indeed only available in 2.x

In fact group() is already used in django_mongodb_engine/compiler.py so feature #105 may not be required to resolve this.

My debugging tells me that in the error case, in MongoQuery__init__ the following line:

self._mongo_query = getattr(compiler.query, 'raw_query', {})

... ends up with self._mongo_query being {} because compiler.query has no 'raw_query' attribute defined. However, if I stringify the compiler.query object I do get reasonable looking SQL back:

SELECT portal_api_personspend.person_id, SUM(portal_api_personspend.spend) AS ttl_spend FROM portal_api_personspend GROUP BY portal_api_personspend.person_id, portal_api_personspend.person_id

What should I expect self._mongo_query to be here? Should I expect it to look like a mongo query spec at this point? or should I expect it to look like SQL and some later post-processing converts it to a mongo query spec. I'm not clear on the intention so this is why I'm unclear how to proceed.

Contributor

jonashaag commented Aug 10, 2012

It should be ok for it to be the empty dict here. That raw_query stuff is solely for the purpose of implementing .raw_query(...). For queries done using Django syntax (.filter(...)) have a look at add_filter(s). But most of the code for aggregates is in execute_sql.

Some observations:

  • In my failing test case the annotate() seems to be a noop. I get the same result when I entirely remove the annotate() call from the mix. This suggests to me that annotate() is not wired up properly.
  • in my failing test case execute_sql is never called. In the existing passing test case (test_aggregations) I can see execute_sql is called. Since the only code which calls the mongo group() call is in execute_sql I can only assume something is not wired up correctly. I tried to establish why one and not the other and got lost somewhere between the django-nonrel sources and django_mongodb_engine/compiler.py.
  • In djangotoolbox/db/basecompiler.py

There is a build_query method which does stuff like this:

    query = self.query_class(self, fields)
    query.add_filters(self.query.where)
    query.order_by(self._get_ordering())

Is it possible that there is some missing logic here to wire in the "group by" case?

Mostly I'm just guessing. There are too many layers of code I lack context on here. Can someone someone who grocks this code provide input?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment