
Mjl index speedup #154

Closed · wants to merge 12 commits

Conversation

@mjl (Contributor) commented Jan 30, 2019

This pull request addresses the problems that arise when indexing large data sets, both in memory consumption and run time, as discussed in detail in issue #137.

These changes have been running in a production environment for several months now without any problems.

Martin J. Laubach added 7 commits December 5, 2018 10:51
@barseghyanartur (Contributor)

@sabricot:

+1 for merge!

@mjl:

Thank you!

acdha added a commit to LibraryOfCongress/concordia that referenced this pull request Apr 9, 2019
Using prefetch_related to avoid on-demand queries for the nested object hierarchy, which reduces the total runtime by roughly a factor of 4; that will hopefully be enough until @mjl’s upstream pull request is merged:

django-es/django-elasticsearch-dsl#154

queryset_pagination seemed like a potential area for improvement but in practice was not faster.

Using select_related was faster than nothing but about half as fast as prefetch_related, since it retrieves so much duplicate data.
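For readers who want to apply the same workaround, here is a minimal sketch of overriding get_queryset() on a document to prefetch the parent hierarchy, in the DocType style of this era; the model, index, and relation names are placeholders, not the concordia code:

```python
from django_elasticsearch_dsl import DocType, Index

from .models import Asset  # placeholder model with a nested parent hierarchy

asset_index = Index('assets')


@asset_index.doc_type
class AssetDocument(DocType):
    class Meta:
        model = Asset
        fields = ['title']

    def get_queryset(self):
        # Fetch the related parents in a handful of up-front queries instead
        # of issuing one query per document while indexing.
        return super(AssetDocument, self).get_queryset().prefetch_related(
            'item', 'item__project'
        )
```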
@pySilver (Contributor)

any updates on this?

@barseghyanartur (Contributor)

Come on, guys, this is one of the best improvements out there! It's a shame it's not merged yet.

@safwanrahman (Collaborator)

Sorry for taking so long. @mjl Can you remove the merge commit and rebase your changes upon master?

```python
# devolve into multi-second run time at large offsets.
if self.chunk_size:
    if last_max_pk is not None:
        current_qs = small_cache_qs.filter(pk__gt=last_max_pk)[:self.chunk_size]
```
@safwanrahman (Collaborator) commented on this diff, Aug 27, 2019

What about when the primary key is not an integer? This approach cannot be generalized, as the pk can also be a UUID!
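For context, the chunked fetch in the diff above is keyset pagination: each batch filters on pk greater than the last pk already seen instead of using OFFSET, which stays fast at any depth. A standalone sketch of the pattern (the helper name and chunk size are illustrative, not the PR's code):

```python
def iterate_in_chunks(queryset, chunk_size=1000):
    """Yield objects in pk order, fetching at most chunk_size rows per query.

    Remembers the largest pk seen so far and asks only for rows with a
    strictly greater pk, so the database never has to skip an OFFSET worth
    of rows.
    """
    last_max_pk = None
    while True:
        chunk = queryset.order_by('pk')
        if last_max_pk is not None:
            chunk = chunk.filter(pk__gt=last_max_pk)
        chunk = list(chunk[:chunk_size])
        if not chunk:
            break
        for obj in chunk:
            yield obj
        last_max_pk = chunk[-1].pk
```

Any pk type the database can order (including UUIDs) gives a stable traversal over a static data set; the concern raised here is mainly about rows inserted while the reindex is running, whose pks are not guaranteed to sort after last_max_pk.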

@safwanrahman (Collaborator)

I have gone through the code and it seems like it needs some improvement.

@barseghyanartur I understand your frustration, but the fact that this works for you does not mean it will work for everyone. When maintaining a package you need to keep it generalized, so it's not so easy to merge just anything!

@barseghyanartur (Contributor)

@safwanrahman:

I haven't seen any breaking changes at the API level here. Everything stays as it was.

@safwanrahman (Collaborator) commented Aug 27, 2019

It's not a breaking change at the API level, but you need to ensure that it works properly for everyone who is using it, so it's not an easy call.

@mjl (Contributor, Author) commented Aug 28, 2019 via email

@safwanrahman (Collaborator)

@mjl With a UUID you cannot ensure that a later record is always greater, since UUIDs are not incremental; the same goes for arbitrary unique strings. Since you are filtering with a greater-than query on the pk, such a record will never be indexed.

@mjl (Contributor, Author) commented Aug 28, 2019 via email

@safwanrahman (Collaborator) commented Aug 28, 2019

> That is only a problem if you are adding data while doing a full reindex and expect the full reindex to include those entries. That, to be blunt, I would call user error, as iterating over changing data sets is always brittle. And note that it won't even work as you expect with the current code: anything that is added while the index is rebuilt will not be included in the index, as the SQL query is run at a certain point in time (and thus its result set is frozen); anything added after that time will not be indexed by the rebuild command. You might catch that case by using signals or whatnot, but that is outside the scope of the indexing code.
>
> But fair enough, perhaps let's add some warning in the docs? What should it say?

@mjl I understand your point. I think adding documentation will be the better idea.

I would prefer not to change the get_queryset method; as it is a public method, we should not change it.
According to the documentation, you could use iterator() with chunk_size; Django would then manage everything you are trying to do here itself, either using a database-level cursor or loading the rows into memory in chunks.
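A minimal sketch of that suggestion, assuming Django 2.0+ for the chunk_size argument (the function name, callback, and chunk size are placeholders):

```python
def index_all(queryset, index_one, chunk_size=2000):
    # iterator() streams results: with server-side cursors (e.g. PostgreSQL)
    # Django reads from a database cursor, otherwise it fetches chunk_size
    # rows per round trip, so memory use stays flat regardless of table size.
    for obj in queryset.iterator(chunk_size=chunk_size):
        index_one(obj)  # placeholder for whatever turns an instance into an ES action
```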

```python
if last_max_pk is not None:
    current_qs = small_cache_qs.filter(pk__gt=last_max_pk)[:self.chunk_size]
else:
    current_qs = small_cache_qs[:self.chunk_size]
```
Collaborator review comment:
What's the reason to slice before running the iterator? Django handles it by default if you pass a chunk_size.

```python
def _bulk(self, *args, **kwargs):
    """Helper for switching between normal and parallel bulk operation"""
    parallel = kwargs.pop('parallel', False)
    if parallel:
```
Collaborator review comment:
We should find a way to ensure that parallel is always false when updating a single object; we only need it when indexing a large number of objects. When a model instance is updated, the Django signal calls this same method to update the index.
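One way to get that behaviour, sketched here as a free-standing helper rather than the PR's actual method (the client is passed in explicitly and the function name is illustrative): default parallel to False so the signal-driven single-object path never touches the thread pool, and let only the bulk-reindexing call site opt in.

```python
from collections import deque

from elasticsearch.helpers import bulk, parallel_bulk


def run_bulk(client, actions, parallel=False, **kwargs):
    """Send bulk actions, using parallel_bulk only when explicitly requested."""
    if parallel:
        # parallel_bulk returns a lazy generator; drain it (deque with
        # maxlen=0 is a cheap way to do that) so the requests actually run.
        deque(parallel_bulk(client, actions, **kwargs), maxlen=0)
    else:
        bulk(client, actions, **kwargs)
```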

@safwanrahman self-assigned this Aug 28, 2019
@pySilver (Contributor) commented Oct 2, 2019

Yes, I'll handle this in a few minutes. Is there anything in particular you'd like me to check?

@pySilver (Contributor) commented Oct 2, 2019

If there is a bench script, I can run it.

@mjl (Contributor, Author) commented Oct 2, 2019

No, just try whatever you feel you need to try. You seem to have a pretty involved model structure and queries, so if it works for you, it probably works for pretty much everybody else :-)

Let me know how it goes, and whether it helps speed-wise. Thanks!

@pySilver (Contributor) commented Oct 2, 2019

@mjl it works perfectly, thank you!

@sabricot isn't it ready to be merged into master?

@safwanrahman (Collaborator) commented Oct 2, 2019

@pySilver I have raised some issues that need to be fixed in this PR; once they are fixed, it can be merged! :D

@safwanrahman (Collaborator)

@mjl do you need any kind of support to fix the issues I have mentioned? Feel free to let me know.

@pySilver (Contributor)

@mjl @safwanrahman

I've checked the current code, and 2 of the 3 issues mentioned by @safwanrahman look outdated at the moment. The one left is about django_elasticsearch_dsl.documents.DocType._bulk, which apparently is also called for single-item updates (signals). Is it still relevant? Maybe simply propagate some flag from the management command there?

@mjl (Contributor, Author) commented Oct 17, 2019 via email

@safwanrahman (Collaborator) commented Oct 17, 2019

@mjl I think it can be done, but I have mentioned a couple of issues above that need to be fixed.
@pySilver There are some more issues to fix, like removing the slicing and removing the extreme complexity. Can you explain how they are outdated?

@safwanrahman (Collaborator)

I think it's better to pass an argument while reindexing rather than setting parallel indexing at the document level.

@mjl (Contributor, Author) commented Oct 17, 2019 via email

@mjl (Contributor, Author) commented Oct 17, 2019 via email

@safwanrahman (Collaborator)

@mjl It is only used when someone runs the populate or reindex management command. I think it does not make sense to have it at the document level, because it is not related to the document: if we add the configuration at the document level, people may think it is also used when indexing a single object, which with our approach is not the case.

@safwanrahman (Collaborator) commented Oct 17, 2019

@mjl The following work is actually needed, then I can give another review:

  • Remove PagingQuerysetProxy
  • Fix the queryset slicing
  • Add a management command argument for parallel indexing (see the sketch below)
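For the last item, a rough sketch of what such an argument could look like; the option name, the registry call, and the update(..., parallel=...) pass-through are assumptions for illustration, not the merged implementation:

```python
from django.core.management.base import BaseCommand

from django_elasticsearch_dsl.registries import registry


class Command(BaseCommand):
    help = 'Reindex documents, optionally using parallel_bulk.'

    def add_arguments(self, parser):
        parser.add_argument(
            '--parallel',
            action='store_true',
            default=False,
            help='Index with elasticsearch.helpers.parallel_bulk instead of bulk.',
        )

    def handle(self, *args, **options):
        for doc in registry.get_documents():
            qs = doc().get_queryset()
            # Only this explicit bulk-reindex path can turn parallel on;
            # signal-driven single-object updates keep the default.
            doc().update(qs, parallel=options['parallel'])
```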

@mjl (Contributor, Author) commented Oct 17, 2019 via email

@mjl (Contributor, Author) commented Oct 17, 2019 via email

@safwanrahman (Collaborator)

> Okay, I see your point. Still, I like the "set a default and forget" convenience; so perhaps rename it to parallel_only_used_on_full_reindex :-) or somesuch?

What about adding a value in settings where people can set the default? I think adding a setting makes more sense if you want something you can set once and forget! 😉
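That idea became the ELASTICSEARCH_DSL_PARALLEL default setting mentioned in the merged commits below; a minimal sketch of the "set once and forget" lookup (the helper name is illustrative):

```python
from django.conf import settings


def parallel_default(explicit=None):
    """An explicit flag wins; otherwise fall back to the project-wide setting."""
    if explicit is not None:
        return explicit
    # e.g. ELASTICSEARCH_DSL_PARALLEL = True in settings.py
    return getattr(settings, 'ELASTICSEARCH_DSL_PARALLEL', False)
```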

@mjl (Contributor, Author) commented Oct 17, 2019 via email

@safwanrahman (Collaborator)

@mjl Have you pushed your changes? I don't see an update!

@safwanrahman (Collaborator)

I think you have pushed to another branch, not this one!

@mjl (Contributor, Author) commented Oct 17, 2019 via email

@mjl closed this Oct 17, 2019
@mjl deleted the mjl-index-speedup branch October 17, 2019 18:33
@pySilver (Contributor) commented Oct 17, 2019

I agree with @safwanrahman that parallel does not belong at the document level but in the management command. And I think it should be an optional flag, to follow what ES does (it does not imply parallel bulk indexing by default). Having a config flag for "set and forget" sounds like a good idea.

@safwanrahman (Collaborator) commented Oct 17, 2019

@pySilver, @safwanrahman and @sabricot are different people! 😅

mjl pushed a commit to mjl/django-elasticsearch-dsl that referenced this pull request Oct 17, 2019
…_PARALLEL default setting and parameters to management command.

Use qs.iterator() for fetching data during reindex, as this is much more memory efficient and performant.
Instead of finding out which methods to call to prepare fields, do that finagling once and cache it for subsequent model instance prepares.

See issue django-es#154 for performance analysis and details.
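The "do that finagling once" part refers to resolving the per-field prepare callables a single time and reusing them for every instance. A simplified, self-contained sketch of the idea (not the PR's exact code; the class and attribute layout are illustrative):

```python
from functools import partial


class PreparedFieldsSketch(object):
    """Cache which callable produces each field's value, then reuse it."""

    def __init__(self, field_names):
        # Resolve once: a prepare_<name> method on the document wins,
        # otherwise fall back to reading the attribute off the instance.
        self._prepared_fields = [
            (name, getattr(self, 'prepare_%s' % name, partial(self._attr_prepare, name)))
            for name in field_names
        ]

    @staticmethod
    def _attr_prepare(name, instance):
        return getattr(instance, name)

    def prepare(self, instance):
        # Per-instance work is now just calling the cached callables.
        return {name: fn(instance) for name, fn in self._prepared_fields}
```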
@mjl mentioned this pull request Oct 17, 2019
@mjl (Contributor, Author) commented Oct 17, 2019 via email

safwanrahman pushed a commit that referenced this pull request Oct 28, 2019
* Use elasticsearch's parallel_bulk for indexing, add ELASTICSEARCH_DSL_PARALLEL default setting and parameters to management command.
Use qs.iterator() for fetching data during reindex, as this is much more memory efficient and performant.
Instead of finding out which methods to call to prepare fields, do that finagling once and cache it for subsequent model instance prepares.

See issue #154 for performance analysis and details.

* Move collection of prepare functions to __init__, where it's conceptually cleaner. Also shaves off a test per object.

* Minor cleanup: Move prepare cache to Document object instead of Model, as it's conceptually possible to have several indices on the same model.
Also remove forced ordering that is a remnant of earlier code.

* chunk_size parameter for queryset.iterator() appeared in Django 2

* Do not crash in init_prepare when no fields have been defined

* Crank up diff size to see what is going on

* Adapt test to changed call pattern

* Adapt tests to changed call patterns

* Mark pagination test as expected failure for now.

* Define _prepared_fields as attribute in class so to_dict() won't pick it up as document field

* remove debugging

* Add parameter to not do a count(*) before indexing, as for complex querysets that might be expensive.

* Fixing example application

* Correctly clean up after test run (delete indices with the right name).

* Remove paginator test.
Add tests for usage of init_prepare() and _prepared_fields.
Add tests for correct calling of bulk/parallel_bulk.

* Make sure we compare w/ stable order

* Adjust for different types for methods/partials in py2

* Correct es dependency (was conflicting with requirements.txt)

* Pass queryset_pagination as chunk_size into parallel_bulk too.

* Add explanation why we use deque()

* Correct typo in explanation of test

* Remove leftover instrumentation print

* Better formatting to avoid backslash-continuation line