Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix performance issue when exporting many groups #3681

Merged
merged 1 commit into from
Dec 17, 2019

Conversation

ltalirz
Copy link
Member

@ltalirz ltalirz commented Dec 17, 2019

When providing groups to verdi export, it was looping over all groups
and using the Group.nodes iterator to retrieve the nodes contained in
the groups. This results (at least) in one query per group, and is
therefore every inefficient for large numbers of groups. The new
implementation replaces this by a single query.

@lekah
Copy link
Contributor

lekah commented Dec 17, 2019

Is there any benchmark where you compare times to get the result? It should depend on what group.nodes is exactly doing, but in the end I guess the performance bottleneck is loading the nodes, not the query. Just out of interest (since the title is "fix performance issue")

# Could project on ['id', 'uuid', 'node_type'] for further performance enhancement
group_qb.append(orm.Node, with_group='groups', project=['*'])

for row in group_qb.all():
Copy link
Contributor

@lekah lekah Dec 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you use group_qb.iterall() here? It might lead to further improvement because the results would be fetched in batches in such case and not all together as is done now. If the subsequent code also takes time this could speed up the time-to-solution.

@ltalirz
Copy link
Member Author

ltalirz commented Dec 17, 2019

Is there any benchmark where you compare times to get the result?

Here come some actual numbers.

Test set: export of 67k groups containing 5 nodes each.
Old implementation: 204s
New implementation with qb.all(): 64s
New implementation with qb.iterall(): 59s

It should depend on what group.nodes is exactly doing, but in the end I guess the performance bottleneck is loading the nodes, not the query.

My guess was that the bottleneck are the N queries you make for N groups and I think the benchmark results back this up.

However, as you say and as I point out in the comment, there might still be significant speedup from avoiding to construct the ORM objects altogether, as well as memory savings.

@sphuber: Do you happen to know whether there is an easy check to replace these checks

if issubclass(entry.__class__, orm.Data):
given_data_entry_ids.add(entry.pk)
elif issubclass(entry.__class__, orm.ProcessNode):
given_calculation_entry_ids.add(entry.pk)

by checks on the node_type?

@sphuber
Copy link
Contributor

sphuber commented Dec 17, 2019

Yes, you can select those sets of nodes directly on the node_type.
For Data nodes:

SELECT * FROM db_dbnode WHERE node_type LIKE = 'data.%';

and for ProcessNode nodes, i.e. all other

SELECT * FROM db_dbnode WHERE node_type LIKE = 'process.%';

In those particular query builder definitions you can simply make two builders querying on the specific node class

builder_data = orm.QueryBuilder().append(
    orm.Group, filters={'id': {'in': given_group_entry_ids}}, tag='groups').append(
    orm.Data, with_group='groups', project='*')
builder_process = orm.QueryBuilder().append(
    orm.Group, filters={'id': {'in': given_group_entry_ids}}, tag='groups').append(
    orm.ProcessNode, with_group='groups', project='*')

@ltalirz
Copy link
Member Author

ltalirz commented Dec 17, 2019

Thanks @sphuber - the latest implementation now takes 11s on the test set (down from 204s on develop).


data_results = orm.QueryBuilder(**qh_groups).append(orm.Data, project=['id', 'uuid'], with_group='groups').all()

from builtins import zip # pylint: disable=redefined-builtin
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without this I get an error that zip is a module and not callable ( we have a module called zip in the same folder).
If you want me to rename the module instead, let me know.

When providing groups to `verdi export`, it was looping over all groups
and using the `Group.nodes` iterator to retrieve the nodes contained in
the groups. This results (at least) in one query per group, and is
therefore every inefficient for large numbers of groups. The new
implementation replaces this by two queries, one for Data nodes and one
for Process nodes.
It also no longer constructs the ORM objects since they are unnecessary.
@ltalirz
Copy link
Member Author

ltalirz commented Dec 17, 2019

@sphuber This is ready for review

Copy link
Contributor

@sphuber sphuber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good stuff, thanks a lot @ltalirz !

@ltalirz
Copy link
Member Author

ltalirz commented Dec 17, 2019

Do you want me to write the commit message?
Or are you waiting for another review?

@sphuber
Copy link
Contributor

sphuber commented Dec 17, 2019

Do you want me to write the commit message?
Or are you waiting for another review?

No, I was just leaving the honors to yourself, commit message looks great :)

@ltalirz ltalirz merged commit ff02e0a into aiidateam:develop Dec 17, 2019
}, tag='groups'
).queryhelp

# Delete this import once the dbexport.zip module has been renamed
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the future: This should be changed in #3242

@sphuber sphuber deleted the speed-up-groups-query branch December 18, 2019 09:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants