
Solution for the fast node addition to group for v0.12 #2471

Merged
2 commits merged from szoupanos:issue_1319_for_v0.12 into aiidateam:release_v0.12.3
Feb 22, 2019

Conversation

szoupanos
Contributor

This PR solves issue #1319 by providing a fast, non-ORM way to add nodes to a group. Import/export, and more specifically import under SQLA, should benefit from this.

@ltalirz Could you please verify that it works as expected on one of your databases? (The corresponding tests pass.)

@giovannipizzi We should decide which version we will use in aiida/orm/implementation/sqlalchemy/group.py
I currently use the raw SQL query, which is the fastest. I commented out the version that uses SQLA to construct the query, since it needs SQLA 1.1+ and that is not supported by this version of AiiDA. Let's decide whether it is worth making AiiDA compatible with SQLA 1.1+ or whether to keep the raw SQL version for 0.12.

If we keep the raw SQL approach, maybe we should do the group addition in batches and launch several SQL queries; I am not sure whether 1M node IDs can be added in a single SQL INSERT.
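The batching idea could look roughly like this. This is an illustrative sketch only: stdlib sqlite3 stands in for PostgreSQL (which the actual code targets through SQLAlchemy), the table and column names are taken from the \d db_dbgroup_dbnodes output quoted later in this thread, and the batch size is an arbitrary choice.

```python
import sqlite3

# Illustrative batched insertion into the group-node join table. The real
# code targets PostgreSQL via SQLAlchemy; sqlite3 is used here only so the
# sketch is self-contained. The batch size of 10000 is an arbitrary choice.
def add_nodes_in_batches(conn, group_id, node_ids, batch_size=10000):
    sql = "INSERT INTO db_dbgroup_dbnodes (dbgroup_id, dbnode_id) VALUES (?, ?)"
    for start in range(0, len(node_ids), batch_size):
        batch = node_ids[start:start + batch_size]
        conn.executemany(sql, [(group_id, nid) for nid in batch])
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE db_dbgroup_dbnodes ("
    "id INTEGER PRIMARY KEY, dbgroup_id INTEGER, dbnode_id INTEGER)"
)
add_nodes_in_batches(conn, 1, list(range(25000)))
count = conn.execute("SELECT COUNT(*) FROM db_dbgroup_dbnodes").fetchone()[0]
print(count)  # 25000
```

Splitting a very large node list into several INSERTs keeps each statement (and its parameter list) bounded, which sidesteps the "can I add 1M IDs in one INSERT" question entirely.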

@coveralls

coveralls commented Feb 11, 2019

Coverage Status

Coverage increased (+7.9%) to 56.073% when pulling b14e551 on szoupanos:issue_1319_for_v0.12 into bbfc73c on aiidateam:release_v0.12.3.

@szoupanos
Contributor Author

Also, @sphuber, let me know how you think the flag (to skip the ORM) for group node addition should be passed to the backend layer in the provenance redesign.

E.g. maybe we could make it available only at the backend level, so that anyone who wants to use it has to go through that interface?

We can discuss in person when you have time.
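A backend-only flag of this kind might be sketched as follows. This is purely illustrative: only the skip_orm name comes from this discussion, while the class name and method bodies are stand-ins, not AiiDA's actual implementation.

```python
# Purely illustrative sketch of a backend-only skip_orm flag; only the
# flag name comes from this thread, everything else is a stand-in.
class SqlaGroup:
    def __init__(self):
        self.node_ids = []

    def _add_nodes_orm(self, node_ids):
        # Default path: one ORM relationship append per node.
        for nid in node_ids:
            self.node_ids.append(nid)

    def _add_nodes_raw_sql(self, node_ids):
        # Fast path: would issue a single bulk INSERT in the real code.
        self.node_ids.extend(node_ids)

    def add_nodes(self, node_ids, skip_orm=False):
        if skip_orm:
            self._add_nodes_raw_sql(node_ids)
        else:
            self._add_nodes_orm(node_ids)

group = SqlaGroup()
group.add_nodes([1, 2, 3], skip_orm=True)
print(group.node_ids)  # [1, 2, 3]
```

Keeping the flag on the backend class (rather than the front-end ORM API) means callers who want the fast path must deliberately reach for the backend interface, which matches the proposal above.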

@giovannipizzi
Member

I think it's ok to use raw SQL for 0.12.3, but we should use the commented version for 1.0.
Maybe it would also be good to briefly discuss the interface before merging, as long as it does not take too long (if we manage to have the same interface in 0.12.3 and in 1.0 that's better; otherwise people will have to do things in two different ways).

@ltalirz
Member

ltalirz commented Feb 12, 2019

@szoupanos
I tested the following:

  • adding existing nodes to a group: now definitely fast enough for me, thanks a lot!
  • exporting a group with 5800 ParameterData nodes (the dicts are empty): 76.4 s on my machine.
    That's not great but acceptable (btw, perhaps check that this isn't adding nodes to groups via the ORM)
  • importing an export file containing 5800 ParameterData nodes (5800_parameterdatas.zip): I stopped waiting after 5 minutes.
    EDIT: I should mention that adding the nodes + attributes took about 15 s, and I spent the 5 minutes waiting at STORING GROUP ELEMENTS...

In conclusion: I'm very happy with the improvement concerning adding nodes to groups.
The import problem remains unsolved. You can either try to address it in this PR or open a separate one for it (let me know), but it is very important to address.

@sphuber
Contributor

sphuber commented Feb 12, 2019

@ltalirz Does your last bullet point mean that the biggest part of the import is spent on adding the already imported nodes to a group? Does the add_nodes call in the import just not use the new skip_orm flag, or is there something else slowing it down?

@ltalirz
Member

ltalirz commented Feb 12, 2019

I think that's for @szoupanos to answer...

@szoupanos
Contributor Author

I didn't touch the export, and it is a common function for both backends (it's worth seeing whether there are any performance differences between the backends and, if so, whether they are significant; but this is a different discussion).

The import of SQLA uses (or at least should use) the new code. Let me have a closer look...

@szoupanos
Contributor Author

I missed passing the flag to one of the group.add_nodes calls.

The current timings for the import of the given export file:

For Django:

real	0m14.009s
user	0m7.374s
sys	0m2.030s

For SQLA:

real	0m27.320s
user	0m17.540s
sys	0m2.720s

(aiidapy2) aiida@ubuntu_vm1_new:~/leo_import_export_exp$ verdi -p issue_1759_dj group list -A
  PK  GroupName      NumNodes  User
----  -----------  ----------  ----------------------
   1  test-group         5881  leopold.talirz@epfl.ch

(aiidapy2) aiida@ubuntu_vm1_new:~/leo_import_export_exp$ verdi -p issue_1759_sqla group list -A
  PK  GroupName      NumNodes  User
----  -----------  ----------  ----------------------
   1  test-group         5881  leopold.talirz@epfl.ch

@ltalirz
Member

ltalirz commented Feb 12, 2019

That would be great - please push the fix and I will confirm

@szoupanos
Contributor Author

If we are all happy with this fix (raw SQL) for 0.12, I would like to clean up the group.add_nodes code a bit.

And for the export, it is as I expected (I exported the imported group for each backend):

SQLA

real	1m37.500s
user	1m24.828s
sys	0m2.518s

Django

real	1m36.823s
user	1m22.461s
sys	0m2.602s

@ltalirz
Member

ltalirz commented Feb 12, 2019

> If we are all happy with this fix (raw SQL) for 0.12.

I think we are.

> And for the export, it is as I expected

As mentioned before, I think this is acceptable for the moment, and there are other, more important problems in SQLA [1]. If the export does not involve adding nodes to groups (I guess it doesn't), we should just leave it as it is.

[1] E.g. the following code still scales super-linearly with the number of nodes (I think this goes back to the expire_on_commit problem):

from aiida.orm.data.parameter import ParameterData
N = 10000
node_list = [ParameterData(dict={}).store() for i in range(N)]

But this can be fixed in a separate PR.

@giovannipizzi
Member

Also, I would leave the code using the SQLA ORM commented out, with a note saying why we are using raw SQL here (i.e. the version of SQLA is too old; this is fixed in 1.0).

@szoupanos
Contributor Author

szoupanos commented Feb 12, 2019

OK, I will do the necessary cleaning and I will update the PR.

@ltalirz
Regarding export, we can discuss more in person in order to understand it better: e.g. if the export time is identical for both backends, I don't see an issue for SQLA specifically, or something to be improved there. Maybe we can improve the efficiency of the export code, which would affect both backends, but this is a different discussion.

Yes, [1] should be related to expire_on_commit. I think it is worth measuring the performance of [1] in provenance_redesign with expire_on_commit=True/False and against Django, and then, depending on the available time, ongoing developments (Django JSONB) and the importance of [1], re-discussing it.

… the group addition procedure and import method of SQLA should benefit from it
Member

@ltalirz ltalirz left a comment


I've tested it and confirm that the verdi import problem is resolved - yay!

Good to merge for me.

@sphuber @giovannipizzi any last comments?

@giovannipizzi
Member

Just going back to the new migration here: I was wondering, since it's not in 1.0.0, whether inserting it only here means that a person going to 0.12.3 cannot then migrate to 1.0? (Also, I think here we are not including all the logic that we have in 1.0 to deduplicate UUIDs, if that happens.) What would the timing be without the migration; would it be much worse?

@sphuber
Contributor

sphuber commented Feb 13, 2019

Yeah, I would be very careful with this. We might be breaking the migration for people who want to go from 0.12.* to 1.0.0. If it is not necessary we probably shouldn't do it, but if we do, we should really check how this would work after we merge 0.12.* into develop and how people can migrate to 1.0.0.

@szoupanos
Contributor Author

The migration adds the constraint to the database, a constraint that is used by the query to avoid a duplicate check at the Python level. This constraint is also added to the models.

So if we keep it in the models but not in the database, we will have inconsistencies and tests will fail (there are tests in SQLA that check this).

If I am not mistaken, Martin added this constraint a long time ago "somewhere", and maybe it is already in 1.0.*
If that is the case, maybe we could just copy-paste the SQLA migrations up to that point into 0.12.*, if they are just small fixes; but of course this needs a closer (and careful) look.

@sphuber
Contributor

sphuber commented Feb 13, 2019

Yes it is in 1.0.0. Was added in this commit

@szoupanos
Contributor Author

So I did some tests related to the discussion we had, and as long as the unique constraint is named differently in the two migrations, everything works fine; but, of course, you end up with two constraints covering the same columns.

E.g.

7a6587e16f4c is the migration of this PR
59edaf8a8b79 is the conflicting migration of develop/provenance_redesign

Revision history

7a6587e16f4c -> 59edaf8a8b79 (head), Adding indexes and constraints to the dbnode-dbgroup relationship table
35d4ee9a1b0e -> 7a6587e16f4c, Unique constraints for the db_dbgroup_dbnodes table
89176227b25 -> 35d4ee9a1b0e, Migrating 'hidden' properties from DbAttribute to DbExtra for code.Code. nodes
a6048f0ffca8 -> 89176227b25, Add indexes to dbworkflowdata table
70c7d732f1b2 -> a6048f0ffca8, Updating link types - This is a copy of the Django migration script
e15ef2630a1b -> 70c7d732f1b2, Deleting dbpath table and triggers
<base> -> e15ef2630a1b, Initial schema
aiidadb_issue_1759_sqla=# \d db_dbgroup_dbnodes
                              Table "public.db_dbgroup_dbnodes"
   Column   |  Type   | Collation | Nullable |                    Default
------------+---------+-----------+----------+------------------------------------------------
 id         | integer |           | not null | nextval('db_dbgroup_dbnodes_id_seq'::regclass)
 dbnode_id  | integer |           |          |
 dbgroup_id | integer |           |          |
Indexes:
    "db_dbgroup_dbnodes_pkey" PRIMARY KEY, btree (id)
    "db_dbgroup_dbnodes_dbgroup_id_dbnode_id_key" UNIQUE CONSTRAINT, btree (dbgroup_id, dbnode_id)
    "uix_dbnode_id_dbgroup_id" UNIQUE CONSTRAINT, btree (dbnode_id, dbgroup_id)
    "db_dbgroup_dbnodes_dbgroup_id_idx" btree (dbgroup_id)
    "db_dbgroup_dbnodes_dbnode_id_idx" btree (dbnode_id)
Foreign-key constraints:
    "db_dbgroup_dbnodes_dbgroup_id_fkey" FOREIGN KEY (dbgroup_id) REFERENCES db_dbgroup(id) DEFERRABLE INITIALLY DEFERRED
    "db_dbgroup_dbnodes_dbnode_id_fkey" FOREIGN KEY (dbnode_id) REFERENCES db_dbnode(id) DEFERRABLE INITIALLY DEFERRED

As soon as you use the same name, you get collisions during the migration, and Alembic complains that there is already a constraint with that name.

So what I would propose is to silently add the current migration (7a6587e16f4c) to develop and provenance_redesign at the right place (in migration order) and then modify migration 59edaf8a8b79 to also drop the old constraint.

We can also add a check to 59edaf8a8b79, if possible, to verify that the old constraint is there before dropping it. This would cover users who are on develop, stopped at a version just before 59edaf8a8b79, and try to apply 59edaf8a8b79 without having applied 7a6587e16f4c; this is a very rare scenario, but OK...
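The "drop the old constraint only if it is present" check could be sketched like this. Again illustrative only: stdlib sqlite3 stands in for PostgreSQL, where a real Alembic migration would inspect pg_constraint instead of sqlite_master; the constraint name is the one shown in the \d output above.

```python
import sqlite3

# Stand-in schema mirroring the db_dbgroup_dbnodes table discussed above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE db_dbgroup_dbnodes (dbgroup_id INTEGER, dbnode_id INTEGER)")
conn.execute(
    "CREATE UNIQUE INDEX uix_dbnode_id_dbgroup_id "
    "ON db_dbgroup_dbnodes (dbnode_id, dbgroup_id)"
)

def drop_old_constraint_if_present(conn, name="uix_dbnode_id_dbgroup_id"):
    """Drop the 0.12 unique constraint only if it exists, so the migration
    also works for databases that never applied 7a6587e16f4c."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'index' AND name = ?", (name,)
    ).fetchone()
    if row is None:
        return False  # nothing to do: the old constraint was never created
    # Identifiers cannot be bound as parameters, hence the string formatting.
    conn.execute("DROP INDEX %s" % name)
    return True

print(drop_old_constraint_if_present(conn))  # True: found and dropped
print(drop_old_constraint_if_present(conn))  # False: already gone
```

Making the drop conditional is what keeps the modified 59edaf8a8b79 safe both for users coming from 0.12.* (old constraint present) and for users on develop who never had it.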

Let me know

@sphuber
Contributor

sphuber commented Feb 13, 2019

So you would insert 59edaf8a8b79 in develop just before 7a6587e16f4c and then change the latter to first drop the constraints that will be added by 59edaf8a8b79, is that correct? Alembic only keeps the current revision of a database, not a history of the revisions it took to arrive at the current state, right? If that is the case, this might work, but let's involve @giovannipizzi as well.

@szoupanos
Contributor Author

szoupanos commented Feb 13, 2019

I would insert 7a6587e16f4c just after 35d4ee9a1b0e (as it should be for 0.12) and just before 0aebbeab274d of develop/provenance_redesign, which currently has 35d4ee9a1b0e as its previous revision.

@szoupanos
Contributor Author

59edaf8a8b79 will stay in its place, a few migrations later in develop/provenance_redesign, and it will be modified a bit to also remove the constraint of 7a6587e16f4c, if it is there, before adding it under another name.

@szoupanos
Contributor Author

I checked the Alembic migrations, and even though they support merging, my understanding is that this will not cover our case:
https://alembic.sqlalchemy.org/en/latest/branches.html#branch-dependencies

                            -- ae1027a6acf -->
                           /                   \
<base> --> 1975ea83b712 -->                      --> mergepoint
                           \                   /
                            -- 27c6a30d7c24 -->

We can have multiple branches, but at the mergepoint all the migrations (from both branches) are applied.

Let me know if I missed something.

I will proceed with the solution that we discussed in person over the previous days.

@szoupanos changed the title from "[Don't merge] Solution for the fast node addition to group for v0.12" to "Solution for the fast node addition to group for v0.12" on Feb 21, 2019
@szoupanos
Contributor Author

The complementary PR for provenance_redesign/develop is #2514

@ltalirz
Member

ltalirz commented Feb 21, 2019

Is it ready to merge?
Do you want @giovannipizzi to give a final look?

@szoupanos
Contributor Author

@ltalirz @giovannipizzi @sphuber
For me it is OK to proceed (I had a second look today).
Please merge #2514 first (into provenance_redesign) and then this one into release_v0.12.3

@szoupanos
Contributor Author

> I missed passing the flag to one of the group.add_nodes calls.
>
> The current timings for the import of the given export file:
>
> For Django:
>
> real	0m14.009s
> user	0m7.374s
> sys	0m2.030s
>
> For SQLA:
>
> real	0m27.320s
> user	0m17.540s
> sys	0m2.720s

@ltalirz
And just for the record, the old import time for SQLA for the given dataset was 27 min (the one that dropped to 27 s).
(I wanted to see the old performance, so I waited until the end of the import.)

@sphuber
Contributor

sphuber commented Feb 22, 2019

I merged PR #2514 for the backport of the migration included in this PR, so whenever this is ready, I guess it is good to be merged.

@ltalirz ltalirz merged commit 28eea8c into aiidateam:release_v0.12.3 Feb 22, 2019
@szoupanos szoupanos deleted the issue_1319_for_v0.12 branch December 12, 2019 16:15