
Mjl index speedup #213

Merged
merged 23 commits into from Oct 28, 2019

Conversation

@mjl (Contributor) commented Oct 17, 2019

Use Elasticsearch's parallel_bulk for indexing, add an ELASTICSEARCH_DSL_PARALLEL default setting, and add corresponding parameters to the management command.
Use qs.iterator() for fetching data during reindexing, as it is much more memory efficient and performant.
Instead of working out which methods to call to prepare each field on every instance, do that once and cache the result for subsequent model instance prepares.

See issue #154 for performance analysis and details.

@mjl mjl changed the title Use elasticsearch's parallel_bulk for indexing, add ELASTICSEARCH_DSL… Mjl index speedup Oct 17, 2019
@mjl mjl mentioned this pull request Oct 17, 2019
@@ -124,6 +145,12 @@ def to_field(cls, field_name, model_field):
```python
    def bulk(self, actions, **kwargs):
        return bulk(client=self._get_connection(), actions=actions, **kwargs)

    def parallel_bulk(self, actions, **kwargs):
        deque(parallel_bulk(client=self._get_connection(), actions=actions, **kwargs), maxlen=0)
```
Collaborator:

Why use deque here? Shouldn't parallel_bulk alone be enough?

@mjl (Contributor, author) commented Oct 17, 2019 via email

@safwanrahman (Collaborator) commented:

@mjl Thanks for the update. I will give it a closer review tomorrow and let you know.

@safwanrahman (Collaborator) commented Oct 17, 2019:

I was wondering how a user can configure the parameters of parallel_bulk, e.g. thread_count=4 or chunk_size=500. @mjl Do you have any idea?

@safwanrahman (Collaborator) commented:

> No, parallel_bulk is a generator and thus is only started when iterated.

I understand. As parallel_bulk works lazily, we can use it for further optimization! I will give more review tomorrow.
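The laziness discussed here is why the implementation wraps parallel_bulk in a zero-length deque: consuming a generator with deque(..., maxlen=0) runs it to exhaustion without retaining any results in memory. A minimal stdlib sketch of the pattern (fake_parallel_bulk is a hypothetical stand-in for elasticsearch's helper):

```python
from collections import deque

executed = []

def fake_parallel_bulk(actions):
    # Stand-in for elasticsearch.helpers.parallel_bulk: a lazy generator
    # that only performs work as it is iterated.
    for action in actions:
        executed.append(action)
        yield (True, action)  # (ok, item) tuples, like the real helper

gen = fake_parallel_bulk(["a", "b", "c"])
assert executed == []  # nothing has run yet: generators are lazy

deque(gen, maxlen=0)   # consume the whole generator, keeping nothing
assert executed == ["a", "b", "c"]
```

The maxlen=0 trick is the idiomatic way to drain an iterator for its side effects only; a plain `list(gen)` would run it too, but would buffer every result.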

@mjl (Contributor, author) commented Oct 17, 2019 via email


```python
    def prepare(self, instance):
```

Collaborator:

Why do we need to change the prepare function?

@mjl (Contributor, author) commented Oct 17, 2019 via email

@mjl (Contributor, author) commented Oct 17, 2019 via email

Martin J. Laubach added 2 commits October 17, 2019 23:46
…ally cleaner. Also shaves off a test per object.
…, as it's conceptually possible to have several indices on the same model.

Also remove forced ordering that is a remnant of earlier code.
@safwanrahman (Collaborator) commented:

The test failures seem related. Can you fix them?

@mjl (Contributor, author) commented Oct 18, 2019 via email

@mjl (Contributor, author) commented Oct 18, 2019 via email

"""
qs = self.get_queryset()
kwargs = {}
if DJANGO_VERSION >= (2,) and self.django.queryset_pagination:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to handle the usecase of people who are using django< 2 version. They should be able to paginate the queryset like before.

@@ -135,6 +135,7 @@ def test_get_doc_with_many_to_many_relationships(self):

```python
        ])

    def test_doc_to_dict(self):
        self.maxDiff = None  # XXX: mjl temporary
```

Collaborator:

Remove it.

@safwanrahman (Collaborator) commented:

> I need some guidance. There is a test for pagination. However, pagination has been completely replaced by qs.iterator(). Should I remove the test, adapt it to match the current query patterns, or re-add pagination on top of iterator() (which doesn't make much sense to me and probably negates using iterator in the first place, but...)?

I think the paginator cannot be totally replaced if we want to keep supporting Django < 2.0. So you should use the paginator on Django versions where iterator() has no chunk_size parameter, and use iterator() with chunk_size where it does. Regarding tests, can you fix them so they cover both the iterator and pagination cases?

@mjl Will you consider adding some tests for your workflow? I know it's hard to write tests for this workflow, but I don't want things to get broken in the future.
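The version split described here could be sketched roughly as follows. This is a simplified illustration, not the PR's actual code: iterate_queryset is a hypothetical helper, plain lists stand in for querysets, and the Django < 2.0 paginator branch is reduced to slicing.

```python
CHUNK_SIZE = 2  # stand-in for self.django.queryset_pagination

def iterate_queryset(qs, django_version, chunk_size=CHUNK_SIZE):
    """Yield objects, choosing the fetch strategy by Django version."""
    if django_version >= (2,):
        # Django >= 2.0: iterator() accepts chunk_size and streams rows.
        # Real code would do: yield from qs.iterator(chunk_size=chunk_size)
        yield from iter(qs)
    else:
        # Django < 2.0: iterator() has no chunk_size parameter, so fall
        # back to pagination (django.core.paginator.Paginator in real code).
        for start in range(0, len(qs), chunk_size):
            yield from qs[start:start + chunk_size]

data = list(range(5))
assert list(iterate_queryset(data, (2, 2))) == data
assert list(iterate_queryset(data, (1, 11))) == data
```

Either branch yields the same objects; only the memory profile of fetching differs.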

@mjl (Contributor, author) commented Oct 19, 2019 via email

@safwanrahman (Collaborator) commented:

> Considering that official support for Django 1.10 ended in December 2017, is it worth the complexity?

Django 1.10 is not supported, so that is not needed. But Django 1.11 is an important LTS release and the last one to support Python 2.x, so I would keep supporting Django 1.11 for a long time. As pagination can be configured on the document, we must support it for Django 1.11 as well.

@mjl (Contributor, author) commented Oct 28, 2019

Ok, I added/adapted some tests. I fear it might be a bit brittle because it tests the inner workings of the indexing process, but hey...

@@ -124,6 +148,12 @@ def to_field(cls, field_name, model_field):
```python
    def bulk(self, actions, **kwargs):
        return bulk(client=self._get_connection(), actions=actions, **kwargs)

    def parallel_bulk(self, actions, **kwargs):
        deque(parallel_bulk(client=self._get_connection(), actions=actions, **kwargs), maxlen=0)
```
Collaborator:

@mjl Can you add a chunk_size parameter here, taking its value from queryset_pagination?

```python
        car2 = Car()
        car3 = Car()
        with patch('django_elasticsearch_dsl.documents.bulk') as mock_bulk, \
                patch('django_elasticsearch_dsl.documents.parallel_bulk') as mock_parallel_bulk:
```
Collaborator:

Better to have this:

Suggested change
```python
with (patch('django_elasticsearch_dsl.documents.bulk') as mock_bulk,
      patch('django_elasticsearch_dsl.documents.parallel_bulk') as mock_parallel_bulk):
```

Collaborator:

What about this?

Suggested change
```python
bulk = "django_elasticsearch_dsl.documents.bulk"
parallel_bulk = "django_elasticsearch_dsl.documents.parallel_bulk"
with patch(bulk) as mock_bulk, patch(parallel_bulk) as mock_parallel_bulk:
```

@@ -124,6 +148,12 @@ def to_field(cls, field_name, model_field):
```python
    def bulk(self, actions, **kwargs):
        return bulk(client=self._get_connection(), actions=actions, **kwargs)

    def parallel_bulk(self, actions, **kwargs):
        deque(parallel_bulk(client=self._get_connection(), actions=actions, **kwargs), maxlen=0)
```
Collaborator:

Suggested change
```python
        # As `parallel_bulk` is lazy, drain it through a zero-length deque to run it immediately
        bulk_actions = parallel_bulk(client=self._get_connection(), actions=actions,
                                     chunk_size=self.queryset_pagination, **kwargs)
        deque(bulk_actions, maxlen=0)
```

@safwanrahman (Collaborator) commented:

@mjl Thanks a lot for the work you have done. I have added a couple of review comments; after fixing these, I think it's good to merge. 👍

@mjl (Contributor, author) commented Oct 28, 2019 via email

```python
            name: prep_func(instance)
            for name, field, prep_func in self._prepared_fields
        }
        # print("-> %s" % data)
```

Collaborator:

Can you remove this?

Suggested change: delete the commented-out `# print("-> %s" % data)` line.
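The _prepared_fields list in this hunk is the heart of the caching optimization in this PR: instead of probing for prepare_<field> methods on every instance, the lookup is resolved once and reused. A rough self-contained illustration, with hypothetical names and plain dicts standing in for model instances:

```python
class Doc:
    def __init__(self):
        # field name -> how to serialize its value (toy stand-in for DED fields)
        self._fields = {"name": str.upper, "year": str}
        self._prepared_fields = None

    def init_prepare(self):
        """Resolve, once, how each field is prepared (custom method vs field value)."""
        fields = []
        for name, transform in self._fields.items():
            # django_elasticsearch_dsl checks for a prepare_<name> method
            # on the document here; we just bind a fallback callable.
            fn = getattr(self, "prepare_%s" % name, None)
            if fn is None:
                fn = lambda instance, t=transform, n=name: t(instance[n])
            fields.append((name, None, fn))
        return fields

    def prepare(self, instance):
        # The expensive method resolution happens at most once per document class.
        if self._prepared_fields is None:
            self._prepared_fields = self.init_prepare()
        return {name: prep(instance) for name, _field, prep in self._prepared_fields}

doc = Doc()
data = doc.prepare({"name": "bmw", "year": 2019})
assert data == {"name": "BMW", "year": "2019"}
```

With thousands of instances to index, moving the getattr probing out of the per-instance loop is a meaningful saving.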

@@ -124,6 +147,17 @@ def to_field(cls, field_name, model_field):
```python
    def bulk(self, actions, **kwargs):
        return bulk(client=self._get_connection(), actions=actions, **kwargs)

    def parallel_bulk(self, actions, **kwargs):
        if self.django.queryset_pagination and 'chunk_size' not in kwargs:
            kwargs['chunk_size'] = self.django.queryset_pagination
```
Collaborator:

Fix the indentation.
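The chunk_size handling in the hunk above is a common keyword-argument pattern: honour an explicit caller value, otherwise inject the configured default. A standalone sketch (the module-level constant is hypothetical, standing in for self.django.queryset_pagination):

```python
QUERYSET_PAGINATION = 500  # stand-in for self.django.queryset_pagination

def parallel_bulk(actions, **kwargs):
    # Only inject the default when the caller did not pass chunk_size explicitly.
    if QUERYSET_PAGINATION and 'chunk_size' not in kwargs:
        kwargs['chunk_size'] = QUERYSET_PAGINATION
    # The real method forwards these kwargs to the Elasticsearch helper;
    # returning them here just makes the behaviour visible.
    return kwargs

assert parallel_bulk([])['chunk_size'] == 500
assert parallel_bulk([], chunk_size=100)['chunk_size'] == 100
```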

@safwanrahman (Collaborator) commented:

> Doesn't work, this isn't valid syntax according to my Python 3.7. I incorporated the rest of your suggestions.

That is weird. Can you check how to break the line without using \? A backslash is not very readable.
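For context: the parenthesized multi-context with statement suggested earlier only became valid syntax in Python 3.10, which is why it failed on 3.7. Before 3.10 the usual options are a backslash continuation or contextlib.ExitStack. A sketch of both, patching stdlib functions so it is self-contained:

```python
import os
from contextlib import ExitStack
from unittest.mock import patch

# Option 1: backslash continuation (valid on every Python 3 version)
with patch('os.getcwd') as mock_getcwd, \
        patch('os.listdir') as mock_listdir:
    mock_getcwd.return_value = '/mocked-1'
    mock_listdir.return_value = []
    assert os.getcwd() == '/mocked-1'
    assert os.listdir('.') == []

# Option 2: contextlib.ExitStack, no backslash needed
with ExitStack() as stack:
    mock_getcwd = stack.enter_context(patch('os.getcwd'))
    mock_getcwd.return_value = '/mocked-2'
    assert os.getcwd() == '/mocked-2'

# The parenthesized form `with (patch(...) as a, patch(...) as b):`
# is only valid on Python 3.10+.
```

ExitStack also scales nicely when the set of patches is built dynamically.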

@mjl (Contributor, author) commented Oct 28, 2019 via email

@safwanrahman (Collaborator) commented:

> ERR_CONFUSED, I don't understand? Looks fine to me?

Oh, sorry. I did not notice that it's a separate block. Sorry.

@mjl (Contributor, author) commented Oct 28, 2019 via email

@safwanrahman (Collaborator) left a review:

Looks really nice! Thanks a lot for the awesome work, @mjl. It is really nice to see this kind of significant contribution merged into open source projects. r++ 💯

It is amazing that, after many rounds of review and a long wait, these changes are finally in a state to be merged.

@safwanrahman safwanrahman merged commit 13a4d5c into django-es:master Oct 28, 2019
@pySilver (Contributor) commented:

Congratulations to you both! Thank you for your time and hard work!

@mjl (Contributor, author) commented Oct 28, 2019 via email

@safwanrahman (Collaborator) commented:

Just released a new version, 7.1.0, with the changes: https://pypi.org/project/django-elasticsearch-dsl/7.1.0/#description

@mjl mjl deleted the mjl-index-speedup branch October 29, 2019 12:01
@mjl (Contributor, author) commented Oct 29, 2019

@safwanrahman The release notes description is not quite complete. Most of the speed improvement comes from using queryset.iterator(), which also uses far less memory. Next comes precomputing the prepare functions, which moves all those comparisons out of the critical path; and then there is the parallel indexing.

I'm not sure end users need that level of detail in the release notes, though; perhaps it is too much.

@safwanrahman (Collaborator) commented:

@mjl Oops. I have updated the release note. Check it and let me know if anything needs to be changed.
