🐛 FIX: Import archive into large DB #5740

Merged 8 commits into aiidateam:main on Nov 4, 2022

Conversation

@chrisjsewell (Member):

As detailed in https://www.sqlite.org/limits.html, SQLITE_MAX_VARIABLE_NUMBER limits how many variables can be used in a single SQL query. This limit can easily be reached when filtering by nodes in a large database, leading to:

sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) too many SQL variables
[SQL: SELECT db_dbnode_1.id, db_dbnode_1.uuid, db_dbnode_1.node_type, db_dbnode_1.process_type, db_dbnode_1.label, db_dbnode_1.description, db_dbnode_1.ctime, db_dbnode_1.mtime, JSON_QUOTE(JSON_EXTRACT(db_dbnode_1.attributes, ?)) AS anon_1, JSON_QUOTE(JSON_EXTRACT(db_dbnode_1.extras, ?)) AS anon_2, db_dbnode_1.repository_metadata, db_dbnode_1.dbcomputer_id, db_dbnode_1.user_id 
FROM db_dbnode AS db_dbnode_1 
WHERE CAST(db_dbnode_1.node_type AS VARCHAR) LIKE ? ESCAPE '\' AND (db_dbnode_1.uuid NOT IN (?, ?, ....

Therefore, this commit moves the filtering of UUIDs to the client side and then batches the queries for the full node fields by a fixed number (filter_size).
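
A minimal, self-contained sketch of the idea (hypothetical names; the actual implementation lives in aiida/tools/archive/imports.py and uses aiida's QueryBuilder and backend.bulk_insert rather than the stand-ins below):

from itertools import islice

# 999 is SQLite's default SQLITE_MAX_VARIABLE_NUMBER on builds older than
# 3.32.0, so it is a safe upper bound for the number of values bound per query.
FILTER_SIZE = 999


def batch_iter(items, size):
    """Yield successive batches of at most `size` items."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


def import_new_nodes(archive_uuids, profile_uuids, query_rows, insert_rows):
    """Copy nodes that are in the archive but not yet in the profile.

    `query_rows` and `insert_rows` stand in for querying the full node fields
    from the archive and for the backend bulk insert, respectively.
    """
    # Filter on the client side instead of sending one huge
    # `uuid NOT IN (...)` clause to the database.
    existing = set(profile_uuids)
    new_uuids = [uuid for uuid in archive_uuids if uuid not in existing]

    # Query the full node fields in batches, so at most FILTER_SIZE values
    # are bound as SQL variables per query.
    for batch in batch_iter(new_uuids, FILTER_SIZE):
        insert_rows(query_rows(batch))


# Toy usage with plain lists standing in for the two storage backends.
archive = [f'uuid-{i}' for i in range(2500)]
profile = [f'uuid-{i}' for i in range(2000)]
import_new_nodes(archive, profile, query_rows=lambda batch: batch, insert_rows=print)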

@sphuber (Contributor) left a comment:

Thanks @chrisjsewell. Apart from the comments in the code, some other questions:

  • Given that this is fixing a problem for importing from large sqlite archives, does the same problem not also exist when exporting to or from an sqlite database? I think that since these are often temporary databases, exporting from one is unlikely, but exporting to one would actually be a common use case, would it not?
  • I presume you have tested this manually? Would it be impossible or too costly to implement a test, or can we have a test that reproduces this simply by creating one with >999 nodes and importing it?

def import_archive(
    path: Union[str, Path],
    *,
    archive_format: Optional[ArchiveFormatAbstract] = None,
    filter_size: int = 999,
@sphuber (Contributor) commented on this change:

Might be useful to put a comment here (or in the commit message) saying that this is the default SQLITE_MAX_VARIABLE_NUMBER in SQLite versions prior to 3.32.0 (2020-05-22).
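
For reference, the limit a given sqlite build actually enforces can be checked at runtime; a small sketch (requires Python 3.11+ for Connection.getlimit):

import sqlite3

conn = sqlite3.connect(':memory:')
# Reports the effective SQLITE_MAX_VARIABLE_NUMBER: 999 on builds older than
# 3.32.0 (2020-05-22), typically 32766 on newer ones.
print(sqlite3.sqlite_version, conn.getlimit(sqlite3.SQLITE_LIMIT_VARIABLE_NUMBER))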

(A further review comment on aiida/tools/archive/imports.py was marked outdated and resolved.)
@@ -117,6 +129,7 @@ def import_archive(
     type_check(test_run, bool)
     backend = backend or get_manager().get_profile_storage()
     type_check(backend, StorageBackend)
+    qparams = QueryParams(batch_size=batch_size, filter_size=filter_size)
@sphuber (Contributor) commented on this change:

nitpick: why not just use query_params for increased legibility? The few extra characters shouldn't hurt, should they?

).iterdict(batch_size=batch_size)

# collect the unique entities from the input backend to be added to the output backend
ufields = []
@sphuber (Contributor) commented on this change:

Do I understand correctly that here you now read all the unique fields for this entity into memory from the input backend, whereas before this was streamed directly into the bulk_insert? Could this lead to memory problems, given that this very fix is meant to deal with large imports?

@chrisjsewell (Member, Author) replied:

Well, only those that are not already in the output backend, i.e. it will be a list of all node UUIDs that are in the archive but not in the profile.
So no, I don't think this is a particularly big hit on memory usage, because you are only reading the UUIDs, as opposed to the full node content (attributes etc.).
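
For a rough sense of scale (back-of-the-envelope, not measured): 300,000 UUID strings at 36 characters each is on the order of 10 MB of raw string data, or a few tens of MB including Python object overhead, far less than the full node rows with attributes and extras would require.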

@chrisjsewell (Member, Author) commented:

> Given that this is fixing a problem for importing from large sqlite archives

To clarify, this is a problem importing into large profiles, not from large archives

@sphuber (Contributor) commented Nov 4, 2022:

> To clarify, this is a problem importing into large profiles, not from large archives

I see, I didn't get that clearly. Could you perhaps add more information in the comments in the relevant code changes and the commit message?

And if it is an sqlite limitation, does that mean someone is importing data into a large sqlite database? That should not be recommended anyway, should it? Where did you come across this bug?

Looking at the code, I don't really understand why the problem is with the target backend. The only point where the target backend (backend_to) is concerned is the backend_to.bulk_insert call, but that code is untouched. Just before it, you change the batch_iter to batch on filter_size instead of batch_size, but that still operates on backend_from. On top of that, filter_size is just 1 smaller than batch_size. So if the problem was there, was it really that batch_size=1000 was 1 too big?

@chrisjsewell (Member, Author) commented Nov 4, 2022:

This reproduces the problem, and is fixed after the PR, for Python 3.8 from conda-forge with sqlite v3.39.4 (h9ae0607_0). You'll note that I couldn't actually recreate it until there were 300,000 nodes (200,000 was still fine), so perhaps SQLITE_MAX_VARIABLE_NUMBER is bigger here 🤷

The test takes ~105 seconds to run, so maybe not great to include

from aiida import orm, manage, tools
from aiida.tools.archive import create_archive, import_archive


def test_import_into_large_profile(aiida_profile_clean, tmp_path):
    backend = manage.get_manager().get_profile_storage()
    user_id = backend.default_user.id
    pks = backend.bulk_insert(
        orm.EntityTypes.NODE,
        [{"user_id": user_id} for _ in range(300_000)],
        allow_defaults=True,
    )
    assert orm.QueryBuilder().append(orm.Node).count() == 300_000
    create_archive(None, filename=tmp_path / 'archive.aiida')
    tools.delete_nodes([pks[0]], dry_run=False)
    assert orm.QueryBuilder().append(orm.Node).count() == 300_000 - 1
    import_archive(tmp_path / 'archive.aiida')
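
To run it (a usage note, not from the PR): make the aiida_profile_clean fixture available, e.g. via aiida-core's aiida.manage.tests.pytest_fixtures pytest plugin (tmp_path is a built-in pytest fixture), save it as a test file, and invoke pytest -k test_import_into_large_profile.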

@chrisjsewell (Member, Author) commented Nov 4, 2022:

> Looking at the code, I don't really understand why the problem is with the target backend.

@sphuber it's not a problem with the target backend; it's when the target backend and input backend have lots of nodes in common, so that backend_unique_id below is very long and you hit the parameter limit for the (input) sqlite db query:

    filters={
        unique_field: {
            '!in': list(backend_unique_id)
        }
    } if backend_unique_id else {},
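
For intuition (a toy illustration, not code from the PR): every element of the list handed to '!in' becomes one bound parameter in the generated uuid NOT IN (?, ?, ...) clause, which is exactly what the traceback quoted at the top of this PR shows.

from aiida import orm  # assumes a loaded AiiDA profile

# Stand-in for backend_unique_id: UUIDs already present in the output profile.
existing_uuids = [f'{i:032x}' for i in range(300_000)]

qb = orm.QueryBuilder().append(
    orm.Node,
    filters={'uuid': {'!in': existing_uuids}},  # one SQL variable per list element
    project=['uuid'],
)
# Iterating this against an sqlite-backed storage would raise the
# "too many SQL variables" OperationalError shown above.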

@sphuber (Contributor) commented Nov 4, 2022:

> This reproduces the problem, and is fixed after the PR, for Python 3.8 from conda-forge

Cheers, appreciate that. Confirmed that it also breaks for me locally. I agree that we don't want to add this to the unit tests 😅

> it's not a problem with the target backend; it's when the target backend and input backend have lots of nodes in common, so that backend_unique_id below is very long and you hit the parameter limit for the (input) sqlite db query

Thanks, that really helps; I now understand the problem. I think it would be great to add a comment at the line "for nrows, ufields_batch in batch_iter(ufields, qparams.filter_size):", saying something to the effect of:

Batch the bulk insert in batch sizes of query_params.filter_size since the query is filtered on ufields_batch. If ufields is large and would not be batched, the maximum number of query variables that certain database backends impose (such as sqlite) can be exceeded.

Last thing: I noticed you changed qparams to query_params (thanks for that) but only did a few. If you grep-replace the remaining ones and add the comment described above, I will approve.

@chrisjsewell (Member, Author) commented:

> Where did you come across this bug?

It was reported to me independently by @Crivella and @azadoks.

@sphuber (Contributor) left a comment:

All good, thanks @chrisjsewell

chrisjsewell merged commit 899471a into aiidateam:main on Nov 4, 2022
chrisjsewell deleted the fix-archive-import branch on November 4, 2022 at 18:15
@chrisjsewell (Member, Author):

Thanks for the review 😄
