Mjl index speedup #154
Conversation
… get rid of unnecessarily cached data asap to allow larger data sets to be indexed
…at finagling once and cache it for subsequent model instance prepares.
…bulk during indexing on or off.
Using prefetch_related to avoid on-demand queries for the nested object hierarchy reduces the total runtime by roughly a factor of 4, which will hopefully be enough until @mjl's upstream pull request is merged: django-es/django-elasticsearch-dsl#154. queryset_pagination seemed like a potential area for improvement but in practice was not faster. Using select_related was faster than nothing but about half as fast as prefetch_related, since it retrieves so much duplicate data.
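The trade-off described above can be sketched without Django. In the following toy simulation (all names invented, a dict standing in for the database), we simply count "queries" to show why one batched related-object lookup beats one lookup per parent object:

```python
# Toy simulation (NOT Django): a dict plays the database, and we count
# how many "queries" each strategy issues. All names are invented.
query_count = 0

AUTHORS = {1: "alice", 2: "bob"}
BOOKS = {10: {"author_id": 1}, 11: {"author_id": 1}, 12: {"author_id": 2}}

def fetch_authors(ids):
    """Pretends to run one SQL query returning the requested authors."""
    global query_count
    query_count += 1
    return {i: AUTHORS[i] for i in ids}

# On-demand (N+1) pattern: one author query per book.
query_count = 0
for book in BOOKS.values():
    fetch_authors([book["author_id"]])
n_plus_one_queries = query_count

# Prefetch pattern: collect the needed ids, fetch them in one query.
query_count = 0
needed_ids = {book["author_id"] for book in BOOKS.values()}
authors_by_id = fetch_authors(needed_ids)
prefetch_queries = query_count

print(n_plus_one_queries, prefetch_queries)  # 3 1
```

select_related sits in between: it avoids the extra queries by joining, but the join repeats the parent's columns on every row, which is the duplicate-data cost mentioned above.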
… mjl-index-speedup
# Conflicts: # django_elasticsearch_dsl/documents.py
Any updates on this?
Come on, guys, this is one of the best improvements out there! It's a shame it's not merged yet.
Sorry for taking so long. @mjl, can you remove the merge commit and rebase your changes onto master?
```
# devolve into multi-second run time at large offsets.
if self.chunk_size:
    if last_max_pk is not None:
        current_qs = small_cache_qs.filter(pk__gt=last_max_pk)[:self.chunk_size]
```
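The pk__gt chunking in this snippet is keyset pagination: instead of OFFSET-style paging, remember the largest primary key already processed and filter past it, so each chunk query stays cheap at any depth. A minimal plain-Python sketch of the idea (toy data, no Django; `fetch_chunk` stands in for the queryset filter):

```python
# Keyset pagination sketch (plain Python, no Django). Instead of
# OFFSET-style paging, remember the largest pk already processed and
# fetch rows with pk greater than it; each "query" scans forward from
# that key rather than re-skipping everything before it.
ROWS = [(pk, "row-%d" % pk) for pk in range(1, 11)]  # pretend table, ordered by pk
CHUNK_SIZE = 3

def fetch_chunk(last_max_pk, chunk_size):
    """Stands in for small_cache_qs.filter(pk__gt=last_max_pk)[:chunk_size]."""
    if last_max_pk is None:
        matching = ROWS
    else:
        matching = [row for row in ROWS if row[0] > last_max_pk]
    return matching[:chunk_size]

indexed = []
last_max_pk = None
while True:
    chunk = fetch_chunk(last_max_pk, CHUNK_SIZE)
    if not chunk:
        break
    indexed.extend(chunk)
    last_max_pk = chunk[-1][0]  # remember the key, not an offset

print(len(indexed))  # 10
```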
What about when the primary key is not an integer? This approach cannot be generalized, as the pk can also be a UUID!
I have gone through the code, and it seems like it needs some improvement. @barseghyanartur, I understand your frustration, but the fact that this works for you does not mean it will work for everyone. While maintaining a package, you need to keep it generalized, so it's not so easy to merge anything!
I haven't seen breaking changes at the API level here. Everything stays as it was.
It's not a breaking change at the API level, but you need to ensure that it works properly for everyone who is using it. So it's not an easy call.
> What about when the `primary key` is not an integer?

It doesn't matter what type the primary key is, as long as it's sortable by the database. Integers, strings, UUIDs: all should work fine. Also, the feature is opt-in, so if one really runs into problems, just don't enable it?
@mjl With a `uuid` you cannot always ensure that a later one is greater, as it's not incremental. Likewise, if it's a unique string, you cannot always ensure that a later one will be greater. Since you are making a `gt` query, such a later object will not be indexed.
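The concern can be made concrete: version-4 UUIDs are random, so generation order and sort order are unrelated, and a row created during the run may sort below last_max_pk and thus never match the pk__gt filter in the same pass. A small demonstration using only the standard library:

```python
# uuid4 primary keys are random: creation order and sort order are
# unrelated, so a row created "later" can compare smaller than
# last_max_pk and would never match a pk__gt filter in the same pass.
import uuid

created_in_order = [str(uuid.uuid4()) for _ in range(100)]

# With 100 random values, the chance they were generated already sorted
# is negligible, so the two orders will practically always differ.
print(created_in_order == sorted(created_in_order))  # almost certainly False
```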
That is only a problem if you are adding data while doing a full reindex and expect the full reindex to include those entries. That, to be blunt, I would call user error, as iterating over changing data sets is always brittle. And note that this won't even work as you expect with the current code: the SQL query is done at a certain point in time (and thus its result set frozen), so anything added after that point will not be indexed by the rebuild command either. You might catch that case by using signals or whatnot, but that is outside the scope of the indexing code.

But fair enough, perhaps let's add some warning in the docs? What should it say?
@mjl I understand your point. I think adding documentation will be the better idea. I have a feeling we should not change the
```
if last_max_pk is not None:
    current_qs = small_cache_qs.filter(pk__gt=last_max_pk)[:self.chunk_size]
else:
    current_qs = small_cache_qs[:self.chunk_size]
```
What's the reason to slice before running the iterator? Django handles this by default if you pass a certain `chunk_size`.
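For context, the two strategies differ in query shape: the PR's slicing issues a new limited query per chunk via the pk__gt filter, while Django's QuerySet.iterator(chunk_size=...) (the chunk_size parameter exists since Django 2.0, as noted later in this thread) streams a single result set in batches. A plain-Python emulation of the batching half (no Django; names invented):

```python
# Plain-Python emulation (no Django, names invented) of what
# iterator(chunk_size=...) does on the client side: one pass over a
# single result set, materializing rows in fixed-size batches. This is
# different from the PR's approach, which issues a NEW limited query
# per chunk via pk__gt filtering.

def stream_in_batches(result_set, chunk_size):
    """Yield rows from one 'result set', fetched batch by batch."""
    batch = []
    for row in result_set:
        batch.append(row)
        if len(batch) == chunk_size:
            yield from batch
            batch.clear()
    yield from batch  # final, possibly short batch

rows = list(range(7))
streamed = list(stream_in_batches(iter(rows), 3))
print(streamed == rows)  # True
```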
```
def _bulk(self, *args, **kwargs):
    """Helper for switching between normal and parallel bulk operation"""
    parallel = kwargs.pop('parallel', False)
    if parallel:
```
We should find a way to ensure that, when updating a single object, parallel is always false. We need it only when indexing a large number of objects; when a model object is updated, the Django signal calls the same method to update the index.
Yes, I'll handle this in a few minutes. Is there anything you'd like me to check in particular?
If there's a bench script, I can run it.
No, just try whatever you feel you need to try. You seem to have a pretty involved model structure and queries, so if it works for you, it probably works for pretty much everybody else :-) Let me know how it goes, and whether it helps speed-wise. Thanks!
@pySilver I have listed some issues that need to be fixed for this PR; once they are fixed, it will be ready to merge! :D
@mjl Do you need any kind of support to fix the issues I have mentioned? Feel free to let me know.
I've checked the current code, and 2 out of 3 issues mentioned by @safwanrahman look outdated at the moment. The one left is about
> The one left is about `django_elasticsearch_dsl.documents.DocType._bulk`
That's a coincidence, I just looked into that... _bulk() only switches to parallel mode if a parallel=True kwarg is present. This comes through update() and is copied there from the document's options. To make it apply only to a full reindex, I'd have to pass that parameter down from _populate() in management/commands/search_index, which sounds not too bad.

Basically, three smallish changes. Would that do the trick, @sabricot?
```
@@ -172,7 +172,7 @@ class DocType(DSLDocument):
         else:
             return self.bulk(*args, **kwargs)

-    def update(self, thing, refresh=None, action='index', **kwargs):
+    def update(self, thing, refresh=None, action='index',
+               may_parallelise=False, **kwargs):
         """
         Update each document in ES for a model, iterable of models or
         queryset
         """
@@ -188,7 +188,7 @@ class DocType(DSLDocument):
         return self._bulk(
             self._get_actions(object_list, action),
-            parallel=self.django.parallel_indexing,
+            parallel=may_parallelise and self.django.parallel_indexing,
             **kwargs
         )
diff --git a/django_elasticsearch_dsl/management/commands/search_index.py b/django_elasticsearch_dsl/management/commands/search_index.py
index e4cc7ef..00d25a9 100644
--- a/django_elasticsearch_dsl/management/commands/search_index.py
+++ b/django_elasticsearch_dsl/management/commands/search_index.py
@@ -89,7 +89,7 @@ class Command(BaseCommand):
                 doc().get_queryset().count(),
                 doc.django.model.__name__)
             )
         qs = doc().get_indexing_queryset()
-        doc().update(qs)
+        doc().update(qs, may_parallelise=True)
```
I think it's better to pass an argument while reindexing rather than setting parallel indexing at the document level.
> @mjl I think it can be done. But I have mentioned a couple of issues above that need to be fixed.

I went over this thread, and I think all the issues you mentioned have been resolved (by ripping out the PagingQuerysetProxy). Did I miss some?
> I think it's better to pass an argument while reindexing rather than setting parallel indexing at the document level.

But then you would have to set that argument every time you do a reindex? I think a parameter makes sense in addition (so you can "try it out" without having to change things), but it's kinda nice that you can set a default in the document...?
@mjl It's only used at the time when anyone runs
@mjl The following work is actually needed; then I can give another review.
Okay, I see your point. Still, I like the "set a default and forget" convenience; so perhaps rename it to parallel_only_used_on_full_reindex :-) or somesuch?
> - Removing `PagingQuerysetProxy`
> - Fixing the queryset slicing

Those are done.

> - management command argument for parallel indexing

Give me a minute :-)
What about adding a value in settings where people can set the default? I think adding a setting makes more sense if you want something like "add once and forget"! 😉
Fair enough! I've added ELASTICSEARCH_DSL_PARALLEL to settings and --parallel to the search_index management command. Let me know if anything else needs adjusting!
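Going by the comment above, opting in would presumably look like this. This is a sketch based only on the names stated in this thread (ELASTICSEARCH_DSL_PARALLEL, --parallel), not verified against released documentation:

```python
# settings.py: project-wide default for parallel indexing, per this PR.
# The setting name comes from the comment above; treat it as an
# assumption until checked against the merged code.
ELASTICSEARCH_DSL_PARALLEL = True

# Alternatively, a one-off full rebuild can request it per run:
#   python manage.py search_index --rebuild --parallel
```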
@mjl Have you pushed your changes? I don't see them updated!
I think you have pushed to another branch, not this one!
Ah sorry, I'll incorporate the changes into this branch in a sec.
I agree with @safwanrahman that parallel does not belong at the document level but in the management command. And I think it should be an optional flag, to follow what ES does (it does not imply bulk indexing by default). Having a config flag for "set and forget" sounds like a good idea.
@pySilver, @safwanrahman and @sabricot are different people! 😅
…_PARALLEL default setting and parameters to management command. Use qs.iterator() for fetching data during reindex, as this is much more memory efficient and performant. Instead of finding out which methods to call to prepare fields, do that finagling once and cache it for subsequent model instance prepares. See issue django-es#154 for performance analysis and details.
I had to delete and re-create my branch because it would not rebase easily; a new pull request was necessary -> #213
* Use elasticsearch's parallel_bulk for indexing, add ELASTICSEARCH_DSL_PARALLEL default setting and parameters to management command. Use qs.iterator() for fetching data during reindex, as this is much more memory efficient and performant. Instead of finding out which methods to call to prepare fields, do that finagling once and cache it for subsequent model instance prepares. See issue #154 for performance analysis and details.
* Move collection of prepare functions to __init__, where it's conceptually cleaner. Also shaves off a test per object.
* Minor cleanup: Move prepare cache to Document object instead of Model, as it's conceptually possible to have several indices on the same model. Also remove forced ordering that is a remnant of earlier code.
* chunk_size parameter for queryset.iterator() appeared in Django 2
* Do not crash in init_prepare when no fields have been defined
* Crank up diff size to see what is going on
* Adapt test to changed call pattern
* Adapt tests to changed call patterns
* Mark pagination test as expected failure for now.
* Define _prepared_fields as attribute in class so to_dict() won't pick it up as document field
* Remove debugging
* Add parameter to not do a count(*) before indexing, as for complex querysets that might be expensive.
* Fix example application
* Correctly clean up after test run (delete indices with the right name).
* Remove paginator test. Add tests for usage of init_prepare() and _prepared_fields. Add tests for correct calling of bulk/parallel_bulk.
* Make sure we compare w/ stable order
* Adjust for different types for methods/partials in py2
* Correct es dependency (was conflicting with requirements.txt)
* Pass queryset_pagination as chunk_size into parallel_bulk too.
* Add explanation why we use deque()
* Correct typo in explanation of test
* Remove leftover instrumentation print
* Better formatting to avoid backslash-continuation line
This pull request addresses the problems when indexing large data sets, both in memory consumption and run time, as discussed in detail in issue #137.
Those changes have been running in a production environment for several months now without any problems.