limit batch size for bulk_create operations #3713

Merged
merged 2 commits into aiidateam:develop from issue_3712_max_postgres on Feb 12, 2020

Conversation

@ltalirz (Member) commented Jan 15, 2020

partially addresses #3712

PostgreSQL has a "MaxAllocSize" that defaults to 1 GB [1].
If you try to insert more than that in one go (e.g. during import of a
large AiiDA export file), you encounter the error

psycopg2.errors.ProgramLimitExceeded: out of memory
DETAIL:  Cannot enlarge string buffer containing 0 bytes by 1257443654 more bytes.

This commit avoids this issue (for the django backend) by setting a
batch size for bulk_create operations.

[1] https://github.com/postgres/postgres/blob/master/src/include/utils/memutils.h#L40
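
For illustration, a minimal sketch of the batching pattern used here (not the actual aiida-core code; the function is a placeholder, and `model` / `objects_to_create` mirror the names in the diff further down):

    # A minimal sketch (not aiida-core code) of batched bulk insertion with Django.
    # `model` is any Django model class and `objects_to_create` is a list of
    # unsaved instances of it.

    def bulk_store(model, objects_to_create, batch_size=100000):
        """Insert the given unsaved model instances in batches.

        Without `batch_size`, Django issues a single INSERT statement for all
        objects, which can exceed PostgreSQL's MaxAllocSize (1 GB) and fail with
        `psycopg2.errors.ProgramLimitExceeded`. With `batch_size`, Django splits
        the work into several smaller INSERT statements.
        """
        model.objects.bulk_create(objects_to_create, batch_size=batch_size)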

Open questions:

  • What should the batch size be? I guess it is in units of the Django objects you are passing (see e.g. here), so the current value of 100k would allow an average of up to 10 kB of strings per object before hitting the 1 GB limit (it did fix the issue for me).
  • I didn't look into the SqlAlchemy side; pointers welcome, but let's fix this in another PR.
  • From the tests it seems the BATCH_SIZE variable should be moved somewhere other than settings.py. Happy to move it wherever you think it should go.

@ltalirz (Member, Author) commented Jan 21, 2020

@sphuber I think Giovanni is quite busy now - would you mind having a look?
I think something like this should probably be in 1.1.0

@sphuber (Contributor) commented Jan 22, 2020

Thanks @ltalirz, good start, but a couple of things remain. I think we should also add this for SqlAlchemy. There is a page in the documentation about bulk insertions, the various available solutions and their performance. More detailed information on these methods can be found here. Currently, the SqlAlchemy import implementation just uses the session.add method, i.e. no bulk operations or batching whatsoever. I can have a look at some point to see if I can adapt it, but I am not sure when I will have the time.

Since this functionality will have to be implemented for both backends, I think it makes sense to have the batch size configured the same way. This would also allow us to make it configurable through verdi config. This solves your first checkbox: we simply go with that value for now, and if that stops working for someone, they can adapt it through configuration. It also solves the second checkbox: the value will simply be taken from the configuration, for both backends.
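
For reference, a hedged sketch of what chunked bulk insertion could look like on the SqlAlchemy side (this is not the aiida-core implementation; the model, table, and column names are placeholders):

    # Sketch (not aiida-core code) of chunked bulk insertion with SqlAlchemy,
    # as an alternative to adding objects one by one with session.add().
    from sqlalchemy import Column, Integer, Text
    from sqlalchemy.ext.declarative import declarative_base

    Base = declarative_base()

    class DbNode(Base):
        """Placeholder model standing in for an AiiDA database model."""
        __tablename__ = 'db_dbnode'
        id = Column(Integer, primary_key=True)
        attributes = Column(Text)

    def bulk_store(session, mappings, batch_size=100000):
        """Insert dictionaries of column values in batches.

        `session` is an open SqlAlchemy Session. `bulk_insert_mappings` skips
        most per-object ORM bookkeeping, and inserting in batches keeps the
        amount of data sent per statement well below PostgreSQL's MaxAllocSize.
        """
        for start in range(0, len(mappings), batch_size):
            session.bulk_insert_mappings(DbNode, mappings[start:start + batch_size])
        session.commit()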

@ltalirz (Member, Author) commented Jan 22, 2020

Putting it into the config makes sense to me. Shall we say AIIDADB_BATCH_SIZE?

As for adding it to sqla as well - in order to get at least the django fix into the code, I suggest I go ahead with django only for the moment and open an issue with pointers on how to proceed for sqla. Sounds good?

@ltalirz (Member, Author) commented Feb 11, 2020

I've updated the PR to make it a config option.
In the interest of time, I suggest we leave the sqla implementation to later.

@sphuber (Contributor) left a comment

Two minor things

@@ -471,9 +472,9 @@ def import_data_dj(
     if 'mtime' in [field.name for field in model._meta.local_fields]:
         with models.suppress_auto_now([(model, ['mtime'])]):
             # Store them all in once; however, the PK are not set in this way...
-            model.objects.bulk_create(objects_to_create)
+            model.objects.bulk_create(objects_to_create, batch_size=get_config_option('db.batch_size'))
sphuber (Contributor) commented on this diff:

probably better to just fetch the value once and assign it to a local variable that is then reused
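
An illustrative rewrite of the diff above along those lines (only a sketch; names are taken from the diff, and the surrounding function and imports are omitted):

    # Sketch of the review suggestion: fetch the configured batch size once
    # and reuse the local variable in the bulk_create call(s).
    batch_size = get_config_option('db.batch_size')

    if 'mtime' in [field.name for field in model._meta.local_fields]:
        with models.suppress_auto_now([(model, ['mtime'])]):
            # Store them all in once; however, the PK are not set in this way...
            model.objects.bulk_create(objects_to_create, batch_size=batch_size)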

ltalirz (Member, Author) replied:

hm yes... I was thinking there might be cases where you are creating a new profile and don't really want to inherit the setting of the current profile, so I thought it might be safer to start with a global option. Anyhow, I think making it not only global is fine.

sphuber (Contributor) replied:

I think you meant this to be a reply to the other comment ;) Anyway, that is not really what global_only means. When an option is global_only, it can only be set instance-wide and not per profile. Currently this only applies to the user_* options used to make repeated profile creation easier, where a profile-specific value doesn't make sense. The resolution order of configuration settings is profile -> global -> default. So if an existing profile defines a specific db.batch_size for itself, a new profile would still get the default, because there is nothing set globally.
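
As a small, self-contained illustration of that resolution order (not the actual aiida-core configuration code):

    # Illustrative pseudologic for the resolution order: profile -> global -> default.
    DEFAULTS = {'db.batch_size': 100000}

    def get_option(name, profile_options, global_options):
        """Return the value for `name`, preferring profile, then global, then default."""
        if name in profile_options:
            return profile_options[name]
        if name in global_options:
            return global_options[name]
        return DEFAULTS[name]

    # A new profile with no setting of its own falls back to the default,
    # even if another profile has overridden the option for itself:
    assert get_option('db.batch_size', profile_options={}, global_options={}) == 100000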

    'description':
        'Batch size for bulk CREATE operations in the database. Avoids hitting MaxAllocSize of PostgreSQL '
        '(1GB) when creating large numbers of database records in one go.',
    'global_only': True,
sphuber (Contributor) commented on this snippet:

What is the reason to limit it to a global-only option? Having different batch sizes is maybe not the most common use case, but it is possible to have multiple profiles in one installation that connect to different databases on different machines with different memory limitations. Best to leave it unconstrained, I would say.
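
For illustration only, the option entry might then look something like the sketch below; apart from 'description' and 'global_only' (visible in the snippet above) and the 100k value mentioned earlier, the key names and structure are assumptions rather than the actual aiida-core code:

    # Hypothetical option entry; keys other than 'description'/'global_only' are assumptions.
    DB_BATCH_SIZE_OPTION = {
        'key': 'db.batch_size',
        'valid_type': int,
        'default': 100000,
        'description': (
            'Batch size for bulk CREATE operations in the database. Avoids hitting MaxAllocSize '
            'of PostgreSQL (1GB) when creating large numbers of database records in one go.'
        ),
        # 'global_only' omitted so the option can also be set per profile,
        # as suggested in the review comment above.
    }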

@ltalirz (Member, Author) commented Feb 11, 2020

Sorry for the long wait - was in meetings, meetings, meetings ;-)

@sphuber (Contributor) commented Feb 12, 2020

> Sorry for the long wait - was in meetings, meetings, meetings ;-)

Not to worry, thanks a lot for the improvement 👍

@sphuber sphuber merged commit b7bcf96 into aiidateam:develop Feb 12, 2020
@sphuber sphuber deleted the issue_3712_max_postgres branch February 12, 2020 09:10