
Mjl index speedup #154

Closed · wants to merge 12 commits

Conversation

@mjl (Contributor) commented Jan 30, 2019

This pull request addresses the problems that arise when indexing large data sets, both in memory consumption and run time, as discussed in detail in issue #137.

These changes have been running in a production environment for several months now without any problems.

Martin J. Laubach added 7 commits December 5, 2018 10:51
@barseghyanartur (Contributor)

@sabricot:

+1 for merge!

@mjl:

Thank you!

acdha added a commit to LibraryOfCongress/concordia that referenced this pull request Apr 9, 2019
Using prefetch_related to avoid on-demand queries for the nested object hierarchy, which reduces the total runtime by roughly a factor of 4; that will hopefully be enough until @mjl’s upstream pull request is merged:

django-es/django-elasticsearch-dsl#154

queryset_pagination seemed like a potential area for improvement but in practice was not faster.

Using select_related was faster than nothing but about half as fast as prefetch_related, since it retrieves so much duplicate data.
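For readers who want to apply the same workaround, here is a minimal sketch of overriding get_queryset() on a document to prefetch the parent hierarchy, in the DocType style of this era; the model, index, and relation names are placeholders, not the concordia code:

```python
from django_elasticsearch_dsl import DocType, Index

from .models import Asset  # placeholder model with a nested parent hierarchy

asset_index = Index('assets')


@asset_index.doc_type
class AssetDocument(DocType):
    class Meta:
        model = Asset
        fields = ['title']

    def get_queryset(self):
        # Fetch the related parents in a handful of up-front queries instead
        # of issuing one query per document while indexing.
        return super(AssetDocument, self).get_queryset().prefetch_related(
            'item', 'item__project'
        )
```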
@pySilver (Contributor)

any updates on this?

@barseghyanartur (Contributor)

Come on, guys, this is one of the best improvements out there! It's a shame it's not merged yet.

@safwanrahman (Collaborator)

Sorry for taking so long. @mjl Can you remove the merge commit and rebase your changes upon master?

```python
# devolve into multi-second run time at large offsets.
if self.chunk_size:
    if last_max_pk is not None:
        current_qs = small_cache_qs.filter(pk__gt=last_max_pk)[:self.chunk_size]
```
@safwanrahman (Collaborator) commented on this diff, Aug 27, 2019

What about when the primary key is not an integer? This approach cannot be generalized, as the pk can also be a UUID!
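For context, the chunked fetch in the diff above is keyset pagination: each batch filters on pk greater than the last pk already seen instead of using OFFSET, which stays fast at any depth. A standalone sketch of the pattern (the helper name and chunk size are illustrative, not the PR's code):

```python
def iterate_in_chunks(queryset, chunk_size=1000):
    """Yield objects in pk order, fetching at most chunk_size rows per query.

    Remembers the largest pk seen so far and asks only for rows with a
    strictly greater pk, so the database never has to skip an OFFSET worth
    of rows.
    """
    last_max_pk = None
    while True:
        chunk = queryset.order_by('pk')
        if last_max_pk is not None:
            chunk = chunk.filter(pk__gt=last_max_pk)
        chunk = list(chunk[:chunk_size])
        if not chunk:
            break
        for obj in chunk:
            yield obj
        last_max_pk = chunk[-1].pk
```

Any pk type the database can order (including UUIDs) gives a stable traversal over a static data set; the concern raised here is mainly about rows inserted while the reindex is running, whose pks are not guaranteed to sort after last_max_pk.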

@safwanrahman (Collaborator)

I have gone through the code and it seems like it needs some improvement.

@barseghyanartur I understand your frustration, but the fact that this works for you does not mean it will work for everyone. When maintaining a package you need to keep it generalized, so it's not so easy to merge just anything!

@barseghyanartur (Contributor)

@safwanrahman:

I haven't seen any breaking changes at the API level here. Everything stays as it was.

@safwanrahman (Collaborator) commented Aug 27, 2019

It's not a breaking change at the API level, but you need to ensure that it works properly for everyone who is using it, so it's not an easy call.

@mjl (Contributor, Author) commented Aug 28, 2019 via email

@safwanrahman (Collaborator)

@mjl With a UUID you cannot ensure that a later record is always greater, since UUIDs are not incremental; the same goes for arbitrary unique strings. Since you are filtering with a greater-than query on the pk, such a record will never be indexed.

@mjl (Contributor, Author) commented Aug 28, 2019 via email

@safwanrahman (Collaborator) commented Aug 28, 2019

> That is only a problem if you are adding data while doing a full reindex and expect the full reindex to include those entries. That, to be blunt, I would call user error, as iterating over changing data sets is always brittle. And note that it won't even work as you expect with the current code: anything that is added while the index is rebuilt will not be included in the index, as the SQL query is run at a certain point in time (and thus its result set is frozen); anything added after that time will not be indexed by the rebuild command. You might catch that case by using signals or whatnot, but that is outside the scope of the indexing code.
>
> But fair enough, perhaps let's add some warning in the docs? What should it say?

@mjl I understand your point. I think adding documentation will be the better idea.

I would prefer not to change the get_queryset method; as it is a public method, we should not change it.
According to the documentation, you could use iterator() with chunk_size; Django would then manage everything you are trying to do here itself, either using a database-level cursor or loading the rows into memory in chunks.
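A minimal sketch of that suggestion, assuming Django 2.0+ for the chunk_size argument (the function name, callback, and chunk size are placeholders):

```python
def index_all(queryset, index_one, chunk_size=2000):
    # iterator() streams results: with server-side cursors (e.g. PostgreSQL)
    # Django reads from a database cursor, otherwise it fetches chunk_size
    # rows per round trip, so memory use stays flat regardless of table size.
    for obj in queryset.iterator(chunk_size=chunk_size):
        index_one(obj)  # placeholder for whatever turns an instance into an ES action
```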

```python
if last_max_pk is not None:
    current_qs = small_cache_qs.filter(pk__gt=last_max_pk)[:self.chunk_size]
else:
    current_qs = small_cache_qs[:self.chunk_size]
```
Collaborator review comment:
What's the reason to slice before running the iterator? Django handles it by default if you pass a chunk_size.

```python
def _bulk(self, *args, **kwargs):
    """Helper for switching between normal and parallel bulk operation"""
    parallel = kwargs.pop('parallel', False)
    if parallel:
```
Collaborator review comment:
We should find a way to ensure that parallel is always false when updating a single object; we only need it when indexing a large number of objects. When a model instance is updated, the Django signal calls this same method to update the index.
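One way to get that behaviour, sketched here as a free-standing helper rather than the PR's actual method (the client is passed in explicitly and the function name is illustrative): default parallel to False so the signal-driven single-object path never touches the thread pool, and let only the bulk-reindexing call site opt in.

```python
from collections import deque

from elasticsearch.helpers import bulk, parallel_bulk


def run_bulk(client, actions, parallel=False, **kwargs):
    """Send bulk actions, using parallel_bulk only when explicitly requested."""
    if parallel:
        # parallel_bulk returns a lazy generator; drain it (deque with
        # maxlen=0 is a cheap way to do that) so the requests actually run.
        deque(parallel_bulk(client, actions, **kwargs), maxlen=0)
    else:
        bulk(client, actions, **kwargs)
```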

@safwanrahman self-assigned this Aug 28, 2019
@pySilver (Contributor) commented Oct 2, 2019

Yes, I'll handle this in a few minutes. Is there anything in particular you'd like me to check?

@pySilver (Contributor) commented Oct 2, 2019

If there is a bench script, I can run it.

@mjl (Contributor, Author) commented Oct 2, 2019

No, just try whatever you feel you need to try. You seem to have a pretty involved model structure and queries, so if it works for you, it probably works for pretty much everybody else :-)

Let me know how it goes, and whether it helps speed-wise. Thanks!

@pySilver (Contributor) commented Oct 2, 2019

@mjl it works perfectly, thank you!

@sabricot isn't it ready to be merged into master?

@safwanrahman (Collaborator) commented Oct 2, 2019

@pySilver I have raised some issues that need to be fixed in this PR; once they are fixed, it can be merged! :D

@safwanrahman (Collaborator)

@mjl do you need any kind of support to fix the issues I have mentioned? Feel free to let me know.

@pySilver (Contributor)

@mjl @safwanrahman

I've checked the current code, and 2 of the 3 issues mentioned by @safwanrahman look outdated at the moment. The one left is about django_elasticsearch_dsl.documents.DocType._bulk, which apparently is also called for single-item updates (signals). Is it still relevant? Maybe simply propagate some flag from the management command there?

@mjl (Contributor, Author) commented Oct 17, 2019 via email

@safwanrahman (Collaborator) commented Oct 17, 2019

@mjl I think it can be done, but I have mentioned a couple of issues above that need to be fixed.
@pySilver There are some more issues to fix, like removing the slicing and removing the extreme complexity. Can you explain how they are outdated?

@safwanrahman (Collaborator)

I think it's better to pass an argument while reindexing rather than setting parallel indexing at the document level.

@mjl (Contributor, Author) commented Oct 17, 2019 via email

@mjl (Contributor, Author) commented Oct 17, 2019 via email

@safwanrahman (Collaborator)

@mjl It is only used when someone runs the populate or reindex management command. I think it does not make sense to have it at the document level, because it is not related to the document: if we add the configuration at the document level, people may think it is also used when indexing a single object, which with our approach is not the case.

@safwanrahman (Collaborator) commented Oct 17, 2019

@mjl The following work is actually needed, then I can give another review:

  • Remove PagingQuerysetProxy
  • Fix the queryset slicing
  • Add a management command argument for parallel indexing (see the sketch below)
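For the last item, a rough sketch of what such an argument could look like; the option name, the registry call, and the update(..., parallel=...) pass-through are assumptions for illustration, not the merged implementation:

```python
from django.core.management.base import BaseCommand

from django_elasticsearch_dsl.registries import registry


class Command(BaseCommand):
    help = 'Reindex documents, optionally using parallel_bulk.'

    def add_arguments(self, parser):
        parser.add_argument(
            '--parallel',
            action='store_true',
            default=False,
            help='Index with elasticsearch.helpers.parallel_bulk instead of bulk.',
        )

    def handle(self, *args, **options):
        for doc in registry.get_documents():
            qs = doc().get_queryset()
            # Only this explicit bulk-reindex path can turn parallel on;
            # signal-driven single-object updates keep the default.
            doc().update(qs, parallel=options['parallel'])
```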

@mjl (Contributor, Author) commented Oct 17, 2019 via email

@mjl (Contributor, Author) commented Oct 17, 2019 via email

@safwanrahman (Collaborator)

> Okay, I see your point. Still, I like the "set a default and forget" convenience; so perhaps rename it to parallel_only_used_on_full_reindex :-) or somesuch?

What about adding a value in settings where people can set the default? I think adding a setting makes more sense if you want something you can set once and forget! 😉
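That idea became the ELASTICSEARCH_DSL_PARALLEL default setting mentioned in the merged commits below; a minimal sketch of the "set once and forget" lookup (the helper name is illustrative):

```python
from django.conf import settings


def parallel_default(explicit=None):
    """An explicit flag wins; otherwise fall back to the project-wide setting."""
    if explicit is not None:
        return explicit
    # e.g. ELASTICSEARCH_DSL_PARALLEL = True in settings.py
    return getattr(settings, 'ELASTICSEARCH_DSL_PARALLEL', False)
```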

@mjl (Contributor, Author) commented Oct 17, 2019 via email

@safwanrahman (Collaborator)

@mjl Have you pushed your changes? I don't see an update!

@safwanrahman (Collaborator)

I think you have pushed to another branch, not this one!

@mjl (Contributor, Author) commented Oct 17, 2019 via email

@mjl closed this Oct 17, 2019
@mjl deleted the mjl-index-speedup branch October 17, 2019 18:33
@pySilver (Contributor) commented Oct 17, 2019

I agree with @safwanrahman that parallel does not belong at the document level but in the management command. And I think it should be an optional flag, to follow what ES does (it does not imply parallel bulk indexing by default). Having a config flag for "set and forget" sounds like a good idea.

@safwanrahman (Collaborator) commented Oct 17, 2019

@pySilver, @safwanrahman and @sabricot are different people! 😅

mjl pushed a commit to mjl/django-elasticsearch-dsl that referenced this pull request Oct 17, 2019
…_PARALLEL default setting and parameters to management command.

Use qs.iterator() for fetching data during reindex, as this is much more memory efficient and performant.
Instead of finding out which methods to call to prepare fields, do that finagling once and cache it for subsequent model instance prepares.

See issue django-es#154 for performance analysis and details.
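The "do that finagling once" part refers to resolving the per-field prepare callables a single time and reusing them for every instance. A simplified, self-contained sketch of the idea (not the PR's exact code; the class and attribute layout are illustrative):

```python
from functools import partial


class PreparedFieldsSketch(object):
    """Cache which callable produces each field's value, then reuse it."""

    def __init__(self, field_names):
        # Resolve once: a prepare_<name> method on the document wins,
        # otherwise fall back to reading the attribute off the instance.
        self._prepared_fields = [
            (name, getattr(self, 'prepare_%s' % name, partial(self._attr_prepare, name)))
            for name in field_names
        ]

    @staticmethod
    def _attr_prepare(name, instance):
        return getattr(instance, name)

    def prepare(self, instance):
        # Per-instance work is now just calling the cached callables.
        return {name: fn(instance) for name, fn in self._prepared_fields}
```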
@mjl mentioned this pull request Oct 17, 2019
@mjl (Contributor, Author) commented Oct 17, 2019 via email

safwanrahman pushed a commit that referenced this pull request Oct 28, 2019
* Use elasticsearch's parallel_bulk for indexing, add ELASTICSEARCH_DSL_PARALLEL default setting and parameters to management command.
Use qs.iterator() for fetching data during reindex, as this is much more memory efficient and performant.
Instead of finding out which methods to call to prepare fields, do that finagling once and cache it for subsequent model instance prepares.

See issue #154 for performance analysis and details.

* Move collection of prepare functions to __init__, where it's conceptually cleaner. Also shaves off a test per object.

* Minor cleanup: Move prepare cache to Document object instead of Model, as it's conceptually possible to have several indices on the same model.
Also remove forced ordering that is a remnant of earlier code.

* chunk_size parameter for queryset.iterator() appeared in Django 2

* Do not crash in init_prepare when no fields have been defined

* Crank up diff size to see what is going on

* Adapt test to changed call pattern

* Adapt tests to changed call patterns

* Mark pagination test as expected failure for now.

* Define _prepared_fields as attribute in class so to_dict() won't pick it up as document field

* remove debugging

* Add parameter to not do a count(*) before indexing, as for complex querysets that might be expensive.

* Fixing example application

* Correctly clean up after test run (delete indices with the right name).

* Remove paginator test.
Add tests for usage of init_prepare() and _prepared_fields.
Add tests for correct calling of bulk/parallel_bulk.

* Make sure we compare w/ stable order

* Adjust for different types for methods/partials in py2

* Correct es dependency (was conflicting with requirements.txt)

* Pass queryset_pagination as chunk_size into parallel_bulk too.

* Add explanation why we use deque()

* Correct typo in explanation of test

* Remove leftover instrumentation print

* Better formatting to avoid backslash-continuation line