Mjl index speedup #213
Conversation
…_PARALLEL default setting and parameters to management command. Use qs.iterator() for fetching data during reindex, as this is much more memory efficient and performant. Instead of finding out which methods to call to prepare fields, do that finagling once and cache it for subsequent model instance prepares. See issue django-es#154 for performance analysis and details.
@@ -124,6 +145,12 @@ def to_field(cls, field_name, model_field):
    def bulk(self, actions, **kwargs):
        return bulk(client=self._get_connection(), actions=actions, **kwargs)

    def parallel_bulk(self, actions, **kwargs):
        deque(parallel_bulk(client=self._get_connection(), actions=actions, **kwargs), maxlen=0)
Why use `deque` here? Shouldn't `parallel_bulk` be enough?
No, `parallel_bulk` is a generator, so it only runs when it is actually consumed.
I stole the `deque` method from here:
https://discuss.elastic.co/t/helpers-parallel-bulk-in-python-not-working/39498/2
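The laziness being discussed can be shown in plain Python. This toy sketch (no Elasticsearch involved; the generator is a stand-in for `parallel_bulk`) shows why wrapping a generator in `deque(..., maxlen=0)` is needed to actually run it:

```python
from collections import deque

consumed = []

def lazy_actions():
    # stand-in for parallel_bulk(): a generator whose work only happens when iterated
    for i in range(3):
        consumed.append(i)
        yield i

gen = lazy_actions()
assert consumed == []  # merely creating the generator runs nothing

# deque with maxlen=0 drains the generator while storing no results,
# which forces all of its side effects to happen
deque(gen, maxlen=0)
assert consumed == [0, 1, 2]
```

This is the same trick the PR applies to `parallel_bulk`: drain the generator for its side effects while discarding the per-chunk results.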
@mjl Thanks for the update. I will give it a closer review tomorrow and let you know.
I was thinking about how a user can configure the parameters of …
I understand. As …
I would not want to expose too many configuration options in settings; it just confuses 99% of the people.

If someone really, really wants to override those parameters, they can always subclass Document and overload the parallel_bulk method to stuff those parameters into kwargs.
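A minimal sketch of that subclassing pattern, using a stand-in base class rather than the real `django_elasticsearch_dsl.Document`; `thread_count` and `queue_size` here are examples of kwargs one might inject:

```python
class BaseDocument:
    # stand-in for Document.parallel_bulk, which forwards **kwargs
    # on to elasticsearch's parallel_bulk helper
    def parallel_bulk(self, actions, **kwargs):
        return kwargs


class MyDocument(BaseDocument):
    # inject custom parallelism parameters without any new settings
    def parallel_bulk(self, actions, **kwargs):
        kwargs.setdefault("thread_count", 8)
        kwargs.setdefault("queue_size", 8)
        return super().parallel_bulk(actions, **kwargs)


received = MyDocument().parallel_bulk(actions=[])
```

`setdefault` keeps explicit call-site kwargs winning over the subclass defaults.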
def prepare(self, instance):
Why do we need to change the prepare function?
It calls init_prepare() on first call to construct/gather prepare functions for all fields.
Actually, it might be better to do the `init_prepare()` in `__init__()`. Let me change that real quick...
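The caching idea reads roughly like this toy sketch (the field names and prepare callables are invented for illustration, standing in for the real per-field-type lookup): resolve the prepare callables once, then reuse the cached list on every `prepare()` call.

```python
class Doc:
    # invented field -> prepare-callable mapping, standing in for the
    # "which prepare method do I call for this field?" lookup
    fields = {"name": str.upper, "slug": str.lower}

    def __init__(self):
        # the expensive finagling happens once, here
        self._prepared_fields = list(self.fields.items())

    def prepare(self, instance):
        # per-instance work is now just calling the cached callables
        return {name: func(instance) for name, func in self._prepared_fields}


doc = Doc()
result = doc.prepare("Ab")
```

With thousands of instances to reindex, moving the lookup out of the per-instance path is where the speedup comes from.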
…ally cleaner. Also shaves off a test per object.
…, as it's conceptually possible to have several indices on the same model. Also remove forced ordering that is a remnant of earlier code.
The test failures seem related. Can you fix them?
Oh my, I didn't expect that much breakage. I'll start mopping up right now.
I need some guidance. There is a test for pagination. However, pagination has been totally replaced with using qs.iterator(). Should I remove the test, adapt it to match the current query patterns, or re-add pagination on top of iterator() (which doesn't make a lot of sense to me and probably negates using iterator in the first place, but...)?
""" | ||
qs = self.get_queryset() | ||
kwargs = {} | ||
if DJANGO_VERSION >= (2,) and self.django.queryset_pagination: |
I think we need to handle the use case of people who are using Django < 2. They should be able to paginate the queryset like before.
tests/test_integration.py
Outdated
@@ -135,6 +135,7 @@ def test_get_doc_with_many_to_many_relationships(self):
        ])

    def test_doc_to_dict(self):
        self.maxDiff = None  # XXX: mjl temporary
Remove it
I think the paginator can not be totally replaced if we want to keep support of django<2.0.

@mjl Will you consider adding some tests for your workflow? I know it's hard to add tests for this workflow, but I don't want things to get broken in the future.
I just checked the Django commit history. dj<2 does not have the chunk_size parameter but instead hardcodes it at 2000, so it's fine. dj<1.11 does not have server-side cursor support in iterator(), so it would need manual pagination support (it should still work, but blow up with large datasets, with or without pagination -- that was part of the reason I started working on this in the first place).

Considering that official support for dj 1.10 ended in December 2017, is it worth the complexity?
Django 1.10 is not supported, so it's not needed. But Django 1.11 is an important LTS release and it's the last release for Python 2.x. So I would consider supporting the …
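The version split under discussion can be sketched like this (mirroring the `DJANGO_VERSION` check in the diff above; the tuple values and the helper name are illustrative):

```python
def iterator_kwargs(django_version, queryset_pagination):
    # Django >= 2.0: iterator() accepts chunk_size.
    # Django 1.11: no chunk_size parameter, it is hardcoded at 2000.
    # Django < 1.11: no server-side cursor support in iterator() at all.
    kwargs = {}
    if django_version >= (2,) and queryset_pagination:
        kwargs["chunk_size"] = queryset_pagination
    return kwargs


new = iterator_kwargs((2, 1), 5000)   # chunk_size is forwarded
old = iterator_kwargs((1, 11), 5000)  # falls back to the hardcoded default
```

Tuple comparison makes `(1, 11) >= (2,)` false, so older versions simply get no extra kwargs.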
Ok, I added/adapted some tests. I fear they might be a bit brittle because they test the inner workings of the indexing process, but hey...
@@ -124,6 +148,12 @@ def to_field(cls, field_name, model_field):
    def bulk(self, actions, **kwargs):
        return bulk(client=self._get_connection(), actions=actions, **kwargs)

    def parallel_bulk(self, actions, **kwargs):
        deque(parallel_bulk(client=self._get_connection(), actions=actions, **kwargs), maxlen=0)
@mjl Can you add the `chunk_size` parameter here, which will use the value of `queryset_pagination`?
tests/test_documents.py
Outdated
car2 = Car()
car3 = Car()
with patch('django_elasticsearch_dsl.documents.bulk') as mock_bulk, \
        patch('django_elasticsearch_dsl.documents.parallel_bulk') as mock_parallel_bulk:
Better to have this:
patch('django_elasticsearch_dsl.documents.parallel_bulk') as mock_parallel_bulk:
with (patch('django_elasticsearch_dsl.documents.bulk') as mock_bulk,
      patch('django_elasticsearch_dsl.documents.parallel_bulk') as mock_parallel_bulk):
What about this?
patch('django_elasticsearch_dsl.documents.parallel_bulk') as mock_parallel_bulk:
bulk = "django_elasticsearch_dsl.documents.bulk"
parallel_bulk = "django_elasticsearch_dsl.documents.parallel_bulk"
with patch(bulk) as mock_bulk, patch(parallel_bulk) as mock_parallel_bulk:
@@ -124,6 +148,12 @@ def to_field(cls, field_name, model_field):
    def bulk(self, actions, **kwargs):
        return bulk(client=self._get_connection(), actions=actions, **kwargs)

    def parallel_bulk(self, actions, **kwargs):
        deque(parallel_bulk(client=self._get_connection(), actions=actions, **kwargs), maxlen=0)
deque(parallel_bulk(client=self._get_connection(), actions=actions, **kwargs), maxlen=0)
bulk_actions = parallel_bulk(client=self._get_connection(), actions=actions,
                             chunk_size=self.queryset_pagination, **kwargs)
# As `parallel_bulk` is lazy, we need to pass it to `deque` to run it instantly
deque(bulk_actions, maxlen=0)
@mjl Thanks a lot for the work you have done. I have added a couple of review comments; after fixing these, I think it's good to merge. 👍
```suggestion
with (patch('django_elasticsearch_dsl.documents.bulk') as mock_bulk,
      patch('django_elasticsearch_dsl.documents.parallel_bulk') as mock_parallel_bulk):
```
Doesn't work; this isn't valid syntax according to my Python 3.7. I incorporated the rest of your suggestions.
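For context: parenthesized with-statements only became valid syntax in Python 3.10, which is why the suggestion fails on 3.7. One backslash-free alternative that works on older versions is `contextlib.ExitStack`; this sketch patches stdlib `json` functions instead of the project's modules:

```python
from contextlib import ExitStack
from unittest.mock import patch
import json

with ExitStack() as stack:
    # enter_context stacks context managers one per line,
    # replacing the multi-line `with a, b:` without any backslashes
    mock_dumps = stack.enter_context(patch("json.dumps", return_value="{}"))
    mock_loads = stack.enter_context(patch("json.loads", return_value={}))
    dumped = json.dumps([1, 2])  # hits the mock
    loaded = json.loads("{}")    # hits the mock

# both patches are unwound when the block exits
restored = json.dumps([1])
```

`ExitStack` guarantees the patches unwind in reverse order even if the body raises, matching nested `with` semantics.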
    name: prep_func(instance)
    for name, field, prep_func in self._prepared_fields
}
# print("-> %s" % data)
Can you remove this?
# print("-> %s" % data)
@@ -124,6 +147,17 @@ def to_field(cls, field_name, model_field):
    def bulk(self, actions, **kwargs):
        return bulk(client=self._get_connection(), actions=actions, **kwargs)

    def parallel_bulk(self, actions, **kwargs):
        if self.django.queryset_pagination and 'chunk_size' not in kwargs:
            kwargs['chunk_size'] = self.django.queryset_pagination
fix the indent.
+ def parallel_bulk(self, actions, **kwargs):
+ if self.django.queryset_pagination and 'chunk_size' not in kwargs:
+ kwargs['chunk_size'] = self.django.queryset_pagination
fix the indent.
ERR_CONFUSED, I don't understand? Looks fine to me?
Oh, sorry. It did not catch my eye that it's a separate block. Sorry!
That is weird. Can you check how to break the line without using `\`? The `\` backslash is not very readable.
Agreed, but there is no way from a syntax POV. I could nest two with statements, but that doesn't really help much.
Looks really nice! Thanks a lot for the awesome work @mjl. It is really nice to see this kind of significant contribution get merged into open source projects. r++ 💯

It is amazing that, after many rounds of review and a long wait, these things are in a state to be merged. Congratulations to you both! Thank you for your time and hard work!
Thanks to all involved, it was great working with you guys. See you on the next pull request! :-)
Just released a new branch …
@safwanrahman The release notes description is not quite complete. I'm not sure it's of interest to end users to include in the release notes, though. Perhaps that is too much detail.
@mjl Oops. I have updated the release note. Check it and let me know if anything needs to be changed.
Use elasticsearch's parallel_bulk for indexing, add ELASTICSEARCH_DSL_PARALLEL default setting and parameters to management command.
Use qs.iterator() for fetching data during reindex, as this is much more memory efficient and performant.
Instead of finding out which methods to call to prepare fields, do that finagling once and cache it for subsequent model instance prepares.
See issue #154 for performance analysis and details.