
Fixed #23646 -- Added QuerySet.bulk_update() to update many models efficiently. #9606

Merged
merged 1 commit into django:master from orf:29037-bulk-save on Sep 18, 2018

Conversation

@orf
Contributor

orf commented Jan 20, 2018

Ticket: https://code.djangoproject.com/ticket/23646

ToDo:

  • Add tests for things like JSONField, ArrayField, GIS fields etc

@orf orf force-pushed the orf:29037-bulk-save branch from 5d04533 to ec075d5 Jan 20, 2018

@orf orf changed the title from "Fixes #29037 - Add a bulk_save method to models" to "Fixed #29037 - Add a bulk_save method to models" Jan 20, 2018

@orf orf force-pushed the orf:29037-bulk-save branch 3 times, most recently from e96a5d7 to e54f3ee Jan 23, 2018

pks = [obj.pk for obj in batch_objs]
update_kwargs = {
    field: Case(
        *(When(pk=obj.pk, then=Value(getattr(obj, field)))

@jarshwah

jarshwah Jan 23, 2018

Member

Careful - Value only works with basic types that the adapter itself knows how to convert to SQL. You could try passing the field's Field instance as the output_field param to Value, which will get it to translate the value using the field's get_db_prep_* methods.

Unfortunately Fields are not themselves Expressions, which is why you've chosen to wrap the field in a Value.

We could consider the following options:

  1. Implement the Expression API on fields directly
  2. Add a FieldExpression that wraps a field and translates the Expressions API onto the Field API

Personally, I've been thinking a lot about option 1 being useful. Fields should just be another type of expression; if I were designing fields from scratch, they would be. If just passing the field's Field instance to output_field doesn't work for more complex types, then you might need to consider the options above.
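
For illustration, a minimal sketch of that suggestion (an assumed shape, not the final implementation; here field is resolved to a models.Field instance rather than the plain string used in the snippet above):

from django.db.models import Case, Value, When

def build_case_for_field(field, batch_objs):
    # field is a models.Field instance; batch_objs are saved model instances.
    # Passing the field as output_field lets Value() convert complex values
    # through the field's get_db_prep_* machinery, per the suggestion above.
    return Case(
        *(
            When(pk=obj.pk, then=Value(getattr(obj, field.attname), output_field=field))
            for obj in batch_objs
        ),
        output_field=field,
    )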

@orf

orf Jan 25, 2018

Contributor

Thanks for the feedback! OK, I will add some tests for more complex fields and try passing them as an output_field. Can you give me some examples of such complex fields? Do you mean fields related to GIS, or Postgres-specific ones?

This would indeed be a limitation. Implementing the expression API on fields seems like the cleaner way to do things.

@loic

loic Jan 27, 2018

Member

I agree with "Implement the Expression API on fields directly". Is there a ticket for that already?

Going off on a tangent, but we should keep in mind that not every field inherits from models.Field (e.g. contrib.contenttypes). I wonder if we should introduce a BaseField class where things like expression support would be added.

@@ -470,6 +470,37 @@ def bulk_create(self, objs, batch_size=None):

return objs

def bulk_save(self, objs, fields, batch_size=None):

@jarshwah

jarshwah Jan 23, 2018

Member

Since this only works for instances with a pk, do you think that bulk_update would be a better name? The regular save() method can either create or update depending on pk status, which may confuse users here.

@orf

orf Jan 24, 2018

Contributor

I considered this, but queryset.update() is the best 'bulk update' method. I didn't want to confuse the two; this is more about saving multiple model fields with multiple differing values, hence bulk_save. Open to changing it though.
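
To illustrate the distinction being drawn (a hedged example; the Entry fields and the compute_status() helper are hypothetical):

# QuerySet.update() writes one and the same value to every matched row:
Entry.objects.filter(published=False).update(status='draft')

# bulk_save() is for writing a different value per object in a single query:
entries = list(Entry.objects.filter(published=False))
for entry in entries:
    entry.status = compute_status(entry)  # hypothetical per-object value
Entry.objects.bulk_save(entries, ['status'])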

@adamchainz

adamchainz Sep 15, 2018

Member

Separate rename: I think fields -> update_fields, to keep in line with Model.save(), is probably good for sanity.

@orf orf force-pushed the orf:29037-bulk-save branch 2 times, most recently from 342413e to b65feab Jan 25, 2018

@orf orf force-pushed the orf:29037-bulk-save branch 2 times, most recently from 7d245ac to 2e9d616 Feb 4, 2018

assert batch_size is None or batch_size > 0
if not fields:
    raise ValueError('Specific field names must be given to bulk_save')
objs = list(objs)

@michaeljohnbarr

michaeljohnbarr Feb 22, 2018

I have a concern with forcing the QuerySet to evaluate here. If there are a very large number of records (millions), this could be detrimental to the memory of whatever system it is running on, because it will populate the model's cache. Is there not a way that we can support iterators here?

I understand that this reduces the amount of time it takes, but we also have to be mindful of why someone such as myself would use it, which is that I have a ton of data that I need to update efficiently. The example below is not the greatest, but it lends itself to what @freakboy3742 mentioned in his comment on the ticket.

queryset = Books.objects.values("pk").annotate(price=models.Sum("fk__value")).values("pk", "price").iterator()
# [{'pk': 1, 'price': 10}, {'pk': 2, 'price': 25}]
Books.objects.bulk_save(values=queryset)

@orf

orf Feb 22, 2018

Contributor

Hey Michael,
Thanks for your feedback, but there seems to be some confusion. In this case objs is an iterable of model instances, not a queryset. The API is very similar to bulk_create(), and in fact so is the code. The list cast is just to ensure that if someone passes in a generator we can use len() on it to create the batches.
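
A tiny illustration of the generator point (assuming the Entry model used in the docs examples):

gen = (Entry(headline='Entry %d' % i) for i in range(10))
# len(gen) would raise TypeError: object of type 'generator' has no len()
objs = list(gen)  # materialize so len() and slicing work for batching
batch_size = 4
batches = [objs[i:i + batch_size] for i in range(0, len(objs), batch_size)]
# -> batches of sizes 4, 4 and 2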

@michaeljohnbarr

michaeljohnbarr Feb 23, 2018

Oh yes, there definitely was some confusion. When looking at the code, I saw references to obj.pk and thought we were potentially dealing with a QuerySet (which I guess technically we could be, but that wouldn't be the right way to use it). My apologies! Carry on. 😋

with transaction.atomic(using=self.db, savepoint=False):
    for batch_objs in [objs[i:i + batch_size] for i in range(0, len(objs), batch_size)]:
        pks = [obj.pk for obj in batch_objs]
        update_kwargs = {

@schinckel

schinckel Mar 9, 2018

Contributor

My approach to solving this was slightly different: http://schinckel.net/2017/06/30/django-bulk_update-without-upsert/

Basically, instead of a CASE WHEN for each field/row, it puts the data into a VALUES list, and joins that in.

 UPDATE my_table
    SET foo = upsert_source.column2,
        column2 = upsert_source.column3
   FROM (
     VALUES (...), (...)
   ) AS upsert_source
  WHERE upsert_source.column1 = my_table.id;

I'd be interested to see how these two different approaches compare in terms of performance.

The big advantage of your approach is that it uses the ORM "more".

@orf

orf Mar 10, 2018

Contributor

How widely supported is that method? I know Postgres supports it, and I was intending to work on adding database-specific methods at a later point (mainly for performance reasons more than anything), but as far as I can tell the currently implemented method will run on any database.

(GitHub added this as a separate comment rather than replying to this thread, so I'm copy-pasting it here)

@orf orf force-pushed the orf:29037-bulk-save branch from 2e9d616 to d00997a Apr 10, 2018

@orf orf changed the title from "Fixed #29037 - Add a bulk_save method to models" to "Fixed #23646 - Add a bulk_save method to models" Apr 13, 2018

@orf orf force-pushed the orf:29037-bulk-save branch from 27a1d83 to 42bd665 Apr 13, 2018

@orf orf force-pushed the orf:29037-bulk-save branch 3 times, most recently from 5c1809c to b720a50 Jun 3, 2018

@orf

Contributor

orf commented Jun 4, 2018

I've added tests for updating more complex Postgres fields, but I'm not sure how to write tests for the GIS features. It seems runtests.py has special handling for the gis_tests directory, and I'm not sure what I would have to do to create a single GIS-enabled model for just a few tests.

@orf orf force-pushed the orf:29037-bulk-save branch 3 times, most recently from e284b84 to ac42a88 Aug 7, 2018

@timgraham
Member

timgraham left a comment

Add versionadded annotation.

@@ -188,6 +188,8 @@ Models
:class:`~django.db.models.DateTimeField` to the Monday of a week.

* Query expressions can now be negated using a minus sign.
* The new :meth:`~django.db.models.query.QuerySet.bulk_save` method allows

@timgraham

timgraham Aug 23, 2018

Member
:meth:`.QuerySet.bulk_save`
class BulkSaveTests(TestCase):
    def setUp(self):
        self.town = Town.objects.create(name='Saffron Walden')

@timgraham

timgraham Aug 23, 2018

Member

Use fewer blank lines

field_names = [f.attname for f in fields]

ops = connections[self.db].ops
# We use PK twice in the resulting update query, once in the filter

@timgraham

timgraham Aug 23, 2018

Member

Avoid use of "we" in comments per our style guide.

if batch_size is not None and batch_size < 0:
    raise ValueError('Batch size must be a positive integer')
if not fields:
    raise ValueError('Specific field names must be given to bulk_save')

@timgraham

timgraham Aug 23, 2018

Member

When referring to bulk_save() in messages, include parentheses (and periods).

* The model's ``save()`` method will not be called, and the ``pre_save`` and
``post_save`` signals will not be sent.
* If updating a large number of columns in a large number of rows the SQL
generated can be very large. Specifying a suitable batch_size can avoid this.

@timgraham

timgraham Aug 23, 2018

Member

Backticks around batch_size
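
For context, the caveat being documented here amounts to usage like this (the field names are hypothetical; batch_size is the argument under discussion):

# Cap each generated UPDATE at 500 objects so the CASE WHEN SQL stays a manageable size.
Entry.objects.bulk_save(entries, ['headline', 'body'], batch_size=500)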

@timgraham timgraham changed the title from "Fixed #23646 - Add a bulk_save method to models" to "Fixed #23646 -- Added QuerySet.bulk_save() to update many models efficiently." Aug 23, 2018

@orf orf force-pushed the orf:29037-bulk-save branch 4 times, most recently from bcd8f9f to f8269f6 Aug 23, 2018

@django django deleted a comment from orf Aug 24, 2018

@orf orf force-pushed the orf:29037-bulk-save branch from 0e804e6 to 20ebb85 Aug 25, 2018

@django django deleted a comment from orf Aug 26, 2018

@orf orf force-pushed the orf:29037-bulk-save branch 4 times, most recently from 9c2e3ca to 360ea35 Sep 1, 2018

@django django deleted a comment from orf Sep 5, 2018

@@ -194,6 +194,9 @@ Models
:class:`~django.db.models.DateTimeField`, and the new :lookup:`iso_year`
lookup allows querying by an ISO-8601 week-numbering year.

* The new:meth:`.QuerySet.bulk_save` method allows efficiently saving specific

@pope1ni

pope1ni Sep 5, 2018

Contributor

Missing a space character after "new".

@orf orf force-pushed the orf:29037-bulk-save branch from 360ea35 to 8b69bdf Sep 5, 2018

database in an efficient manner (generally only one query)::

>>> objects = [
... Entry.objects.create(headline='Entry 1')

@felixxm

felixxm Sep 5, 2018

Member

Comma is missing.

@orf orf force-pushed the orf:29037-bulk-save branch from 8b69bdf to 5783070 Sep 5, 2018


>>> objects = [
... Entry.objects.create(headline='Entry 1'),
... Entry.objects.create(headline='Entry 2')

@pope1ni

pope1ni Sep 5, 2018

Contributor

To really nitpick, there could be a comma here too... 😛

@orf

orf Sep 5, 2018

Contributor

😂 fixed

@orf orf force-pushed the orf:29037-bulk-save branch from 5783070 to c14594c Sep 5, 2018

* If updating a large number of columns in a large number of rows the SQL
generated can be very large. Specifying a suitable ``batch_size`` can avoid
this.
* Updating fields defined on multi-table inheritance ancestors will incur an

@timgraham

timgraham Sep 14, 2018

Member

Is this correct (I didn't see a test for it and didn't spot why this would be the case when inspecting the implementation)? Is it a limitation of QuerySet.update()?

@orf

orf Sep 14, 2018

Contributor

I see you made some edits, but yes, that is correct (it actually does 3 queries sometimes, if it has to look up the content type, I think). I did have a test, but I believe it got lost when I moved/refactored the tests.
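
As an illustration of the multi-table inheritance caveat (a hedged sketch using the classic Place/Restaurant example, not code from the patch; restaurants is an assumed list of saved Restaurant instances):

from django.db import models

class Place(models.Model):
    name = models.CharField(max_length=50)

class Restaurant(Place):
    serves_pizza = models.BooleanField(default=False)

# 'name' lives on the Place table and 'serves_pizza' on the Restaurant table,
# so updating both is expected to issue one UPDATE per table, i.e. the extra
# ancestor query discussed above.
Restaurant.objects.bulk_save(restaurants, ['name', 'serves_pizza'])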

def test_update_primary_key(self):
    for note in self.notes:
        note.pk += 1
    Note.objects.bulk_save(self.notes, ['id'])

@timgraham

timgraham Sep 14, 2018

Member

Is it intentional not to have any assertions?

@orf

orf Sep 14, 2018

Contributor

Nope, I must have forgotten that as well. I can add one this evening.

@orf

orf Sep 15, 2018

Contributor

I'm glad you flagged this, because this test does not make any sense in retrospect. We cannot update the primary key using this method, so this needs to raise an exception and be documented.

We build a CASE statement based on the primary key value, so if you change that to an arbitrary value, how did I expect the CASE statement to be valid or work at all?
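
Roughly, the shape of the statement each batch produces makes this clear (an illustrative sketch, not the exact generated SQL):

illustrative_sql = """
UPDATE note
   SET id = CASE WHEN id = 1 THEN 2
                 WHEN id = 2 THEN 3
            END
 WHERE id IN (1, 2);
"""
# Both the WHEN conditions and the WHERE filter key off the objects' current
# pks, so rewriting the pk itself cannot work; hence the exception and the
# documentation note mentioned above.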

@timgraham timgraham force-pushed the orf:29037-bulk-save branch 2 times, most recently from 642868a to 0b7502d Sep 14, 2018

ops = connections[self.db].ops
# PK is used twice in the resulting update query, once in the filter
# and once in the WHEN. Each field will also have one CAST.
max_batch_size = ops.bulk_batch_size(['pk', 'pk'] + fields, objs)

@raphaelm

raphaelm Sep 14, 2018

Contributor

I might be wrong (currently struggling to wrap my head around bulk_batch_size), but isn't it one WHEN clause, and therefore one pk variable, per field, so this should read more like ['pk'] + 2 * fields?

@felixxm

felixxm Sep 14, 2018

Member

@raphaelm This line is correct. We use pk twice for each object, once in a WHEN id = ... THEN clause and once in a WHERE id IN (...) clause to filter rows for update.
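
To make that arithmetic concrete, a hypothetical helper (not part of the patch; 999 is SQLite's historical bound-parameter limit):

def approx_max_batch_size(param_limit, num_fields):
    # Each object needs its pk twice (once for WHERE pk IN (...), once for
    # WHEN pk = ... THEN) plus one bound value per updated field, mirroring
    # the ops.bulk_batch_size(['pk', 'pk'] + fields, objs) call above.
    return max(param_limit // (2 + num_fields), 1)

approx_max_batch_size(999, 3)  # -> 199 objects per batch when updating 3 fields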

case_statement = Case(*when_statements, output_field=field)
if requires_casting:
    case_statement = Cast(case_statement, output_field=field)
update_kwargs[field.attname] = case_statement

@adamchainz

adamchainz Sep 14, 2018

Member

Pre-computing the updates by moving lines 501-513 before line 500 would reduce the length of time the transaction is open, which is a good thing for database throughput.
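
A hedged sketch of that reordering, reusing the names from the snippets above (objs, fields, batch_size, and self are the method's own locals; this shows the reviewer's suggestion, not the merged code verbatim):

# Build all the CASE expressions up front...
updates = []
for batch_objs in [objs[i:i + batch_size] for i in range(0, len(objs), batch_size)]:
    pks = [obj.pk for obj in batch_objs]
    update_kwargs = {}
    for field in fields:
        when_statements = [
            When(pk=obj.pk, then=Value(getattr(obj, field.attname), output_field=field))
            for obj in batch_objs
        ]
        update_kwargs[field.attname] = Case(*when_statements, output_field=field)
    updates.append((pks, update_kwargs))

# ...then hold the transaction open only for the UPDATE queries themselves.
with transaction.atomic(using=self.db, savepoint=False):
    for pks, update_kwargs in updates:
        self.filter(pk__in=pks).update(**update_kwargs)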

@orf orf force-pushed the orf:29037-bulk-save branch from 2968679 to 1a6b6ff Sep 15, 2018

than iterating through the list of models and calling ``save()`` on each of
them, but it has a few caveats:

* You cannot update the models primary key.

@pope1ni

pope1ni Sep 15, 2018

Contributor

model's

if any(not f.concrete or (f.is_relation and f.many_to_many) for f in fields):
    raise ValueError('bulk_save() can only be used with concrete fields.')
if any(f.primary_key for f in fields):
    raise ValueError('bulk_save() cannot be used with primary_key fields.')

@pope1ni

pope1ni Sep 15, 2018

Contributor

primary_key → primary key

@orf orf changed the title from "Fixed #23646 -- Added QuerySet.bulk_save() to update many models efficiently." to "Fixed #23646 -- Added QuerySet.bulk_update() to update many models efficiently." Sep 16, 2018

@orf orf force-pushed the orf:29037-bulk-save branch from 3b70ada to e78fafe Sep 16, 2018

@charettes
Member

charettes left a comment

Out of curiosity.

Did we explore using the VALUES() syntax instead of CASE(WHEN)?

UPDATE foo SET
    foo.a = foo_update.a,
    foo.b = foo_update.b
FROM (
    VALUES
        ('123', 1, '---'),
        ('345', 2, '+++')
) AS foo_update(id, a, b)
WHERE foo.id IN (...) AND foo.id = foo_update.id -- the IN filtering makes a big difference.

This is by no means a blocker, as I'm not sure if it's supported on SQLite and Oracle, but it might be great to investigate adapting UpdateQuery and SQLUpdateCompiler to enable the optimization in the future, as I suspect it will perform significantly better on backends supporting it than the large CASE(WHEN(...)) we're currently generating.

Thanks for pushing this forward by the way. 🎉

if not all(obj.pk for obj in objs):
    raise ValueError('All objects must have a primary key set.')
fields = [self.model._meta.get_field(name) for name in fields]
if any(not f.concrete or (f.is_relation and f.many_to_many) for f in fields):

@charettes

charettes Sep 16, 2018

Member

Are there any cases of not f.is_relation and f.many_to_many? If not, then f.many_to_many should be sufficient.

@orf

orf Sep 16, 2018

Contributor

Good point, fixed 👍

@@ -2089,6 +2089,42 @@ instance (if the database normally supports it).

The ``ignore_conflicts`` parameter was added.

``bulk_update()``
~~~~~~~~~~~~~~~

@charettes

charettes Sep 16, 2018

Member

~~

@timgraham timgraham force-pushed the orf:29037-bulk-save branch from 66f162c to 90366da Sep 18, 2018

@timgraham timgraham merged commit 9cbdb44 into django:master Sep 18, 2018

19 checks passed

docs Build #16494 ended
flake8 Build #16602 ended
isort Build #16634 succeeded in 27 sec
pr-mariadb/database=mysql,label=mariadb,python=python3.7 Build #1104 ended
pr-mariadb/database=mysql_gis,label=mariadb,python=python3.7 Build #1104 ended
pull-requests-bionic/database=mysql,label=bionic-pr,python=python3.5 Build #1598 ended
pull-requests-bionic/database=mysql,label=bionic-pr,python=python3.7 Build #1598 ended
pull-requests-bionic/database=mysql_gis,label=bionic-pr,python=python3.5 Build #1598 ended
pull-requests-bionic/database=mysql_gis,label=bionic-pr,python=python3.7 Build #1598 ended
pull-requests-bionic/database=postgis,label=bionic-pr,python=python3.5 Build #1598 ended
pull-requests-bionic/database=postgis,label=bionic-pr,python=python3.7 Build #1598 ended
pull-requests-bionic/database=postgres,label=bionic-pr,python=python3.5 Build #1598 ended
pull-requests-bionic/database=postgres,label=bionic-pr,python=python3.7 Build #1598 ended
pull-requests-bionic/database=spatialite,label=bionic-pr,python=python3.5 Build #1598 ended
pull-requests-bionic/database=spatialite,label=bionic-pr,python=python3.7 Build #1598 ended
pull-requests-bionic/database=sqlite3,label=bionic-pr,python=python3.5 Build #1598 ended
pull-requests-bionic/database=sqlite3,label=bionic-pr,python=python3.7 Build #1598 ended
pull-requests-javascript Build #12994 ended
pull-requests-windows/database=sqlite3,label=windows,python=Python35 Build #8651 ended

@orf orf deleted the orf:29037-bulk-save branch Sep 18, 2018

@adamchainz

Member

adamchainz commented Sep 18, 2018

🎺 Nice one @orf 🎺
