limit batch size for bulk_create operations #3713
Conversation
@sphuber I think Giovanni is quite busy now - would you mind having a look?
Thanks @ltalirz, good start, but a couple of things remain. I think we should also add this for SqlAlchemy. There is a page in the documentation about bulk insertions, the various available solutions and their performance; more detailed information on these methods can be found there. The SqlAlchemy import implementation currently does not use any of these batched methods. Since this functionality will have to be implemented for both backends, I think it makes sense to have the batch size be configured the same way. This would also allow us to make it configurable through `verdi config`.
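For illustration, a minimal sketch of what a batched bulk insertion could look like on the SqlAlchemy side, assuming an open `session` and a mapped class such as `DbNode`; the helper and its names are hypothetical, not the actual import implementation:

```python
# Hypothetical sketch: batched bulk insertion with SqlAlchemy.
# `session` is an open SQLAlchemy session, `mapper` a mapped class (e.g. DbNode),
# `mappings` a list of dicts of column values. Names are illustrative only.
def bulk_insert_in_batches(session, mapper, mappings, batch_size=100000):
    """Insert the mappings in chunks so each INSERT stays well below MaxAllocSize."""
    for start in range(0, len(mappings), batch_size):
        session.bulk_insert_mappings(mapper, mappings[start:start + batch_size])
    session.commit()
```

The batch size here would ideally come from the same configuration option as the Django backend, so both backends behave consistently.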
Putting it into the config makes sense. As for adding it to sqla as well: in order to get at least the django fix into the code, I suggest I go ahead with django only for the moment and open an issue with pointers on how to proceed for sqla. Sounds good?
(force-pushed from 5f21fbc to f55598d)
I've updated the PR to make it a config option.
Two minor things
@@ -471,9 +472,9 @@ def import_data_dj(
     if 'mtime' in [field.name for field in model._meta.local_fields]:
         with models.suppress_auto_now([(model, ['mtime'])]):
             # Store them all in once; however, the PK are not set in this way...
-            model.objects.bulk_create(objects_to_create)
+            model.objects.bulk_create(objects_to_create, batch_size=get_config_option('db.batch_size'))
Probably better to just fetch the value once and assign it to a local variable that can be reused.
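As a sketch of the suggested change, assuming the surrounding import code loops over several models (the loop and its variable names here are made up for illustration):

```python
# Fetch the option once instead of on every bulk_create call.
# `get_config_option` is the helper used in the diff above.
batch_size = get_config_option('db.batch_size')

for model, objects_to_create in objects_to_create_per_model.items():  # hypothetical loop
    model.objects.bulk_create(objects_to_create, batch_size=batch_size)
```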
Hm yes... I was thinking there might be cases where you are creating a new profile and don't really want to use the setting of the current profile, so I thought it might be safer to start with a global option.
Anyhow, I think making it not global-only is fine.
I think you meant this to be a reply to the other comment ;) Anyway, that is not really what `global_only` means. When an option is `global_only`, it means it can only be set instance-wide and not per profile. Currently this only applies to the `user_*` options used for making repeated profile creation easier, where a profile-specific value doesn't make sense. The resolution order of configuration settings is profile -> global -> default. So if an existing profile defines a specific `db.batch_size` for itself, a new profile would still get the default, because there is nothing set globally.
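A minimal sketch of that resolution order, with made-up dictionaries standing in for the profile, global and default option stores (not the actual aiida-core code, and the default value below is an assumption):

```python
# Illustrative only: resolve an option as profile -> global -> default.
def resolve_option(name, profile_options, global_options, defaults):
    for store in (profile_options, global_options):
        if name in store:
            return store[name]
    return defaults[name]

defaults = {'db.batch_size': 100000}        # assumed default value
existing_profile = {'db.batch_size': 5000}  # set only for that profile
new_profile = {}                            # nothing set, nothing global either

print(resolve_option('db.batch_size', new_profile, {}, defaults))       # -> 100000 (default)
print(resolve_option('db.batch_size', existing_profile, {}, defaults))  # -> 5000
```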
    'description':
        'Batch size for bulk CREATE operations in the database. Avoids hitting MaxAllocSize of PostgreSQL '
        '(1GB) when creating large numbers of database records in one go.',
    'global_only': True,
What is the reason to limit it to a global-only option? It may not be the most common use case to have different batch sizes, but it is possible to have multiple profiles in one installation that connect to different databases on different machines with different memory limitations. Best to leave it unconstrained, I would say.
postgresql has a "MaxAllocSize" that defaults to 1 GB [1]. If you try to insert more than that in one go (e.g. during import of a large AiiDA export file), you encounter the error:

psycopg2.errors.ProgramLimitExceeded: out of memory
DETAIL: Cannot enlarge string buffer containing 0 bytes by 1257443654 more bytes.

This commit avoids this issue (for the django backend) by setting a batch size for bulk_create operations (via a verdi config option).

[1] https://github.com/postgres/postgres/blob/master/src/include/utils/memutils.h#L40
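For a rough sense of the numbers involved, a back-of-the-envelope check; the per-row size below is an assumption, not a measurement:

```python
# Only the allocation limit and the failed request come from the sources above;
# the batch size and row size are assumptions for illustration.
MAX_ALLOC_SIZE = 0x3fffffff     # ~1 GB, from postgres memutils.h
failed_allocation = 1257443654  # bytes requested in the error message

print(failed_allocation > MAX_ALLOC_SIZE)  # True: the single INSERT exceeded the limit

# With e.g. 100000 rows per batch and an assumed ~1 kB per serialised row,
# each statement needs on the order of 100 MB, comfortably below the limit.
print(100000 * 1024 < MAX_ALLOC_SIZE)  # True
```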
(force-pushed from f55598d to 5e470ac)
Sorry for the long wait - was in meetings, meetings, meetings ;-)
Not to worry, thanks a lot for the improvement 👍
partially addresses #3712
postgresql has a "MaxAllocSize" that defaults to 1 GB [1].
If you try to insert more than that in one go (e.g. during import of a
large AiiDA export file), you encounter the error

psycopg2.errors.ProgramLimitExceeded: out of memory
DETAIL: Cannot enlarge string buffer containing 0 bytes by 1257443654 more bytes.

This commit avoids this issue (for the django backend) by setting a
batch size for bulk_create operations.
[1] https://github.com/postgres/postgres/blob/master/src/include/utils/memutils.h#L40
Open questions:
- Didn't look into the sqla side, pointers welcome (let's fix this in another PR).
- The `BATCH_SIZE` variable should be moved somewhere else than `settings.py`. Happy to move it wherever you guys think it should go.