Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

querybuilder: qb.count() != len(qb.all()) #1600

Closed
ltalirz opened this issue May 27, 2018 · 11 comments
Closed

querybuilder: qb.count() != len(qb.all()) #1600

ltalirz opened this issue May 27, 2018 · 11 comments
Assignees
Milestone

Comments

@ltalirz
Copy link
Member

ltalirz commented May 27, 2018

I just stumbled over this - probably an edge case because I am adding Node twice (was actually not intentional), but still the result is unexpected:

In [35]: from aiida_raspa.calculations import RaspaCalculation
    ...: qb=QueryBuilder()
    ...:
    ...: qb.append(CifData, filters={'uuid': '06b938f5-30da-430d-b23a-d5fc602923be'}, tag='c
    ...: if')
    ...: qb.append(Node, descendant_of='cif')
    ...: qb.append(Node, descendant_of='cif', filters={'type': RaspaCalculation._plugin_type
    ...: _string})
    ...:
    ...: print(qb.count())
    ...: len(qb.all())
    ...:
32
Out[35]: 2

Should this be fixed?

@giovannipizzi
Copy link
Member

@lekah do you know why?

@lekah
Copy link
Contributor

lekah commented May 31, 2018

It's probably some unexpected behavior in the underlying ORM.
So, I guess from this query you get 2 distinct results (i.e. 2 Raspa calculations), but there are 32 nodes attached to the corresponding cif-files, so the join with Node produces 32 distinct nodes.
@ltalirz can you confirm this by running the queries (2nd and 3rd append) individually.

@giovannipizzi
Copy link
Member

Explanation of what happens:

If I do this:

qb = QueryBuilder()

qb.append(Node, filters={'id': 224276}, tag='c')
qb.append(Node, descendant_of='c')
#qb.append(Node, descendant_of='c')
#qb.append(Node, descendant_of='c')

print(qb.count())
print(qb.all())

In this case I get

5
[[<RemoteData: uuid: d52cd273-d07f-4520-a778-700e291129ac (pk: 224277)>], [<FolderData: uuid: fd8c650d-8118-482f-a86e-55cf5e9e9622 (pk: 224278)>], [<BandsData: uuid: c5ca0baf-db9a-4e72-84de-9b97c52be53a (pk: 224279)>], [<ParameterData: uuid: cfc8a7f2-6fa1-4774-a843-a4e60f4177f7 (pk: 224280)>], [<ArrayData: uuid: 45f39f91-6496-488f-a3f5-f84f19f19d3b (pk: 224281)>]]```

which is expected (5 results, and indeed qb.all() has length 5).

However if I uncomment one or two lines, I still get the same list, but I get 25 or 125.
I think is that we are not projecting on anything, but still you get all possible pairs/triplets of results. For some reason the .all() removes duplicates, but .count() doesn't.

Note that this is a sqlalchemy behaviour: if you store in a variable the SQLAlchemy query (qqq = qb.get_query()) then qqq.count() returns (without comments) 125 but len(qqq.all()) is 5).

Apparently, one should do DISTINCT to make it work correctly and always return 5 (also via count():

q2 = qqq.distinct()
print(q2.count())
print(len(q2.all())

returns

5
5

@giovannipizzi
Copy link
Member

Question for @lekah: is it safe/can we add always a .distinct() e.g. to QueryBuilder._build()?

@giovannipizzi giovannipizzi added this to the v1.0.0 milestone Dec 3, 2018
@lekah
Copy link
Contributor

lekah commented Dec 3, 2018

No, I don't think we should add distinct by default. Maybe the user wants to see duplicates. In addition, the distinct doesn't really work that well for attributes or extras, or to clarify, it doesn't work the same way for different backends.
If you add multiple entities, you make a cartesian product of joins when executing all. Why the count is different has to do with the backend, I guess. I'd rather overwrite our count to return len(self.all()), instead of delegating to the backend...

@giovannipizzi
Copy link
Member

As described above, apparently this is the behaviour of SQLAlchemy.
SQL however would return 125 rows for the result of the underlying query.
SQLAlchemy internally calls (when calling qqq.all()) the function qqq._execute_and_instances that internally checks the value of qqq._has_mapper_entities; if True, it removes duplicates (in the sqlalchemy.orm.loading.instances function.

Indeed, setting

query._has_mapper_entities = True

would then return a list of 125 elements.

To understand what the _has_mapper_entities really does.

@giovannipizzi
Copy link
Member

To understand the difference:

s.query(dummy_model.DbNode).join(dummy_model.DbLink, dummy_model.DbLink.input_id==dummy_model.DbNode.id).filter(dummy_model.DbNode.id==224276).all()

returns

[<aiida.orm.implementation.django.dummy_model.DbNode at 0x7fe9d7f2c0d0>]

(1 result) while

s.query(dummy_model.DbNode.id).join(dummy_model.DbLink, dummy_model.DbLink.input_id==dummy_model.DbNode.id).filter(dummy_model.DbNode.id==224276).all()

returns

[(224276), (224276), (224276), (224276), (224276)]

(5 results), consistent with the code (the second would have a _has_mapper_entities to False because it asks for a id and not for an entity).

@giovannipizzi
Copy link
Member

That's actually what also the author of sqlalchemy says we should do, see this sqlalchemy/sqlalchemy#4143 (comment)

@lekah
Copy link
Contributor

lekah commented Dec 4, 2018

We'll follow the advice given in sqlalchemy/sqlalchemy#4395 (comment) and use the hidden _has_member_entities flag to deduplicate results.
One additional test should be written here as well, testing for exactly that case of sqlalchemy deduplicating rows without the distinct having been triggered.

lekah added a commit to lekah/aiida_core that referenced this issue Dec 4, 2018
…te rows. Therefore, the backend (psql) behavior is more closely reproduced.
@sphuber
Copy link
Contributor

sphuber commented Dec 4, 2018

Fixed in PR #2281

@sphuber sphuber closed this as completed Dec 4, 2018
@giovannipizzi
Copy link
Member

Just a comment for the future that there is work in progress to change the behaviour in SQLAlchemy, and this will appear in v2.0 of SQLAlchemy

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants