fix performance issue when exporting many groups #3681

ltalirz · 2019-12-17T12:57:16Z

When providing groups to verdi export, it was looping over all groups
and using the Group.nodes iterator to retrieve the nodes contained in
the groups. This results in (at least) one query per group, and is
therefore very inefficient for large numbers of groups. The new
implementation replaces this with a single query.
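
To illustrate the difference between one query per group and a single query, here is a minimal sketch using an in-memory SQLite database. The tables `node` and `group_node` and both helper functions are hypothetical stand-ins made up for this example; they are not AiiDA's schema or its export code.

```python
import sqlite3

# Hypothetical toy schema for illustration only (not AiiDA's real tables).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE node (id INTEGER PRIMARY KEY, node_type TEXT);
    CREATE TABLE group_node (group_id INTEGER, node_id INTEGER);
""")
conn.executemany("INSERT INTO node VALUES (?, ?)",
                 [(i, "data.dict.Dict.") for i in range(1, 11)])
# Two groups containing five nodes each.
conn.executemany("INSERT INTO group_node VALUES (?, ?)",
                 [(1 + (i - 1) // 5, i) for i in range(1, 11)])

group_ids = [1, 2]


def nodes_per_group_loop(conn, group_ids):
    """Old pattern: one query per group, i.e. N round trips for N groups."""
    found = set()
    for gid in group_ids:
        rows = conn.execute(
            "SELECT node_id FROM group_node WHERE group_id = ?", (gid,))
        found.update(row[0] for row in rows)
    return found


def nodes_single_query(conn, group_ids):
    """New pattern: a single query covering all requested groups."""
    placeholders = ",".join("?" * len(group_ids))
    rows = conn.execute(
        "SELECT DISTINCT node_id FROM group_node "
        f"WHERE group_id IN ({placeholders})", group_ids)
    return {row[0] for row in rows}
```

Both functions return the same set of node ids; only the number of round trips to the database differs.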

lekah · 2019-12-17T15:42:42Z

Is there any benchmark where you compare times to get the result? It should depend on what group.nodes is exactly doing, but in the end I guess the performance bottleneck is loading the nodes, not the query. Just out of interest (since the title is "fix performance issue")

lekah · 2019-12-17T15:44:10Z

aiida/tools/importexport/dbexport/__init__.py

+        # Could project on ['id', 'uuid', 'node_type'] for further performance enhancement
+        group_qb.append(orm.Node, with_group='groups', project=['*'])
+
+        for row in group_qb.all():


Could you use group_qb.iterall() here? It might lead to a further improvement, because the results would then be fetched in batches rather than all at once, as is done now. If the subsequent code also takes time, this could reduce the time-to-solution.
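
As a rough analogy for the difference between `all()` and `iterall()`, the sketch below uses plain `sqlite3` rather than the QueryBuilder. The helper `iter_rows` and the table `node` are hypothetical names for this example, and this is not how AiiDA implements `iterall()` internally.

```python
import sqlite3


def iter_rows(cursor, batch_size=100):
    """Yield rows one by one while fetching them from the cursor in batches."""
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            return
        yield from batch


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE node (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO node VALUES (?)", [(i,) for i in range(250)])

# Analogue of all(): the complete result list is built before processing starts.
all_rows = conn.execute("SELECT id FROM node ORDER BY id").fetchall()

# Analogue of iterall(): rows are pulled in batches of batch_size while iterating.
cursor = conn.execute("SELECT id FROM node ORDER BY id")
streamed = [row[0] for row in iter_rows(cursor, batch_size=100)]
```

The streamed variant lets processing start before the full result set has been retrieved, which is where a gain could come from if the loop body itself is slow.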

ltalirz · 2019-12-17T16:29:15Z

> Is there any benchmark where you compare times to get the result?

Here come some actual numbers.

Test set: export of 67k groups containing 5 nodes each.
Old implementation: 204s
New implementation with qb.all(): 64s
New implementation with qb.iterall(): 59s

> It should depend on what group.nodes is exactly doing, but in the end I guess the performance bottleneck is loading the nodes, not the query.

My guess was that the bottleneck is the N queries made for N groups, and I think the benchmark results back this up.

However, as you say and as I point out in the code comment, there might still be a significant speedup, as well as memory savings, from not constructing the ORM objects at all.

@sphuber: Do you happen to know whether there is an easy check to replace these checks

aiida-core/aiida/tools/importexport/dbexport/__init__.py

Lines 232 to 235 in 9172c60

    
        if issubclass(entry.__class__, orm.Data):
            given_data_entry_ids.add(entry.pk)
        elif issubclass(entry.__class__, orm.ProcessNode):
            given_calculation_entry_ids.add(entry.pk)

by checks on the node_type?

sphuber · 2019-12-17T16:37:21Z

Yes, you can select those sets of nodes directly on the node_type.
For Data nodes:

SELECT * FROM db_dbnode WHERE node_type LIKE 'data.%';

and for ProcessNode nodes, i.e. all others:

SELECT * FROM db_dbnode WHERE node_type LIKE 'process.%';

In those particular query builder definitions you can simply make two builders, each querying on the specific node class:

builder_data = orm.QueryBuilder().append(
    orm.Group, filters={'id': {'in': given_group_entry_ids}}, tag='groups').append(
    orm.Data, with_group='groups', project='*')
builder_process = orm.QueryBuilder().append(
    orm.Group, filters={'id': {'in': given_group_entry_ids}}, tag='groups').append(
    orm.ProcessNode, with_group='groups', project='*')
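
If only `id` and `node_type` are projected in a single query, the same split could be done on the `node_type` prefix without constructing ORM objects. The following is a minimal sketch under that assumption; the function `split_by_node_type` and the sample rows are hypothetical and only illustrate the prefix check.

```python
def split_by_node_type(rows):
    """Split (pk, node_type) rows into Data pks and ProcessNode pks by prefix."""
    data_ids, process_ids = set(), set()
    for pk, node_type in rows:
        if node_type.startswith("data."):
            data_ids.add(pk)
        elif node_type.startswith("process."):
            process_ids.add(pk)
    return data_ids, process_ids


# Illustrative sample rows; the node_type strings are examples only.
rows = [
    (1, "data.dict.Dict."),
    (2, "process.calculation.calcjob.CalcJobNode."),
    (3, "data.int.Int."),
]
data_ids, process_ids = split_by_node_type(rows)
```

The two-builder approach above pushes this split into the database instead, which avoids the Python-side loop entirely.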

ltalirz · 2019-12-17T17:25:51Z

Thanks @sphuber - the latest implementation now takes 11s on the test set (down from 204s on develop).

ltalirz · 2019-12-17T17:27:54Z

aiida/tools/importexport/dbexport/__init__.py

+
+        data_results = orm.QueryBuilder(**qh_groups).append(orm.Data, project=['id', 'uuid'], with_group='groups').all()
+
+        from builtins import zip  # pylint: disable=redefined-builtin


Without this I get an error that zip is a module and not callable (we have a module called zip in the same folder).
If you want me to rename the module instead, let me know.

When providing groups to `verdi export`, it was looping over all groups and using the `Group.nodes` iterator to retrieve the nodes contained in the groups. This results in (at least) one query per group, and is therefore very inefficient for large numbers of groups. The new implementation replaces this with two queries, one for Data nodes and one for Process nodes. It also no longer constructs the ORM objects, since they are unnecessary.

ltalirz · 2019-12-17T21:17:20Z

@sphuber This is ready for review

sphuber

Good stuff, thanks a lot @ltalirz !

ltalirz · 2019-12-17T21:53:08Z

Do you want me to write the commit message?
Or are you waiting for another review?

sphuber · 2019-12-17T21:55:25Z

> Do you want me to write the commit message?
> Or are you waiting for another review?

No, I was just leaving the honors to yourself, commit message looks great :)

CasperWA · 2019-12-18T09:19:54Z

aiida/tools/importexport/dbexport/__init__.py

+            }, tag='groups'
+        ).queryhelp
+
+        # Delete this import once the dbexport.zip module has been renamed


For the future: This should be changed in #3242

ltalirz requested review from lekah and CasperWA December 17, 2019 12:57

lekah reviewed Dec 17, 2019

ltalirz force-pushed the speed-up-groups-query branch from 88e406d to 4a001ef Compare December 17, 2019 17:25

ltalirz commented Dec 17, 2019

ltalirz force-pushed the speed-up-groups-query branch from 4a001ef to b3cdfa5 Compare December 17, 2019 17:42

ltalirz force-pushed the speed-up-groups-query branch from b3cdfa5 to f6c05d4 Compare December 17, 2019 17:43

sphuber approved these changes Dec 17, 2019

ltalirz merged commit ff02e0a into aiidateam:develop Dec 17, 2019

CasperWA reviewed Dec 18, 2019

sphuber deleted the speed-up-groups-query branch December 18, 2019 09:24
