
Out of memory when populating large dataset #433

Open

mathijsfr opened this issue Nov 11, 2022 · 1 comment

Comments

mathijsfr commented Nov 11, 2022

I am trying to re-index more than 100 million documents, but the job fails because it runs out of RAM.

Is it possible that the problem is in the Elasticsearch implementation when executing parallel indexing?

Here is an issue where they talk about the memory leak:
elastic/elasticsearch-py#1101 (comment)

Looks like my memory fills up after this line when using streaming_bulk:
https://github.com/elastic/elasticsearch-py/blob/8d10e1545e2572d3ab1e92cfaf0968085145eb4d/elasticsearch/helpers/actions.py#L232
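For context, `streaming_bulk` is designed to consume its action source lazily and buffer only one chunk at a time. The pure-Python sketch below (illustrative names like `make_action` and `action_stream`, not the elasticsearch-py internals) shows why feeding it a generator rather than a materialized list keeps peak memory at one chunk, and why memory blows up if the source itself loads everything up front:

```python
import itertools

def make_action(doc_id):
    # Illustrative bulk action; "my-index" is a placeholder index name.
    return {"_index": "my-index", "_id": doc_id, "_source": {"n": doc_id}}

def action_stream(n):
    # Lazily yields actions one at a time; peak memory is a single action,
    # not all n of them.
    for doc_id in range(n):
        yield make_action(doc_id)

def chunked(iterable, chunk_size):
    # Mirrors the kind of chunking streaming_bulk does internally:
    # consume the stream chunk_size items at a time.
    it = iter(iterable)
    while True:
        chunk = list(itertools.islice(it, chunk_size))
        if not chunk:
            return
        yield chunk

sizes = [len(c) for c in chunked(action_stream(1050), 500)]
print(sizes)  # [500, 500, 50]
```

If the source generator secretly materializes the whole dataset (as the Django iterator discussion below suggests for MySQL), this laziness is defeated no matter how `streaming_bulk` itself chunks.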

lowdeyou commented Feb 17, 2023

I experienced the same issue with a MySQL database. Are you also using MySQL as the database for Django?

I did some testing and attempted to replicate this issue on both MySQL and Postgres, and found that Postgres does not run out of memory when dealing with large amounts of data. I have not tested other databases.

I did some more investigation and found that the issue stems from Django's iterator on this line:

It seems that for MySQL, the iterator still loads the entire result set regardless, ignoring the chunk_size argument.
I'm not sure why the behavior differs, but I suspect it could be due to this:
https://docs.djangoproject.com/en/4.1/ref/models/querysets/#without-server-side-cursors

To get around this, I wrote a custom iterator that does the chunking at the app level, which should behave correctly regardless of which database you use:
lowdeyou@b43e6d2
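The app-level chunking idea can be sketched as keyset pagination: repeatedly fetch the next `chunk_size` rows whose primary key is greater than the last one seen, so neither the driver nor the database ever has to hold the full result set. This is a hedged sketch, not the actual commit above; `fetch_batch` stands in for a Django query along the lines of `Model.objects.filter(pk__gt=last_pk).order_by("pk")[:chunk_size]`:

```python
def keyset_chunks(fetch_batch, chunk_size):
    # Yields lists of rows, chunk_size at a time, keyed on the primary key.
    # Works on any database because only one batch is ever in memory.
    last_pk = 0
    while True:
        batch = fetch_batch(last_pk, chunk_size)
        if not batch:
            return
        yield batch
        last_pk = batch[-1]["pk"]

# In-memory stand-in for a table ordered by pk (pks 1..11).
rows = [{"pk": i, "title": f"doc {i}"} for i in range(1, 12)]

def fetch_batch(last_pk, size):
    # Stand-in for filter(pk__gt=last_pk).order_by("pk")[:size].
    return [r for r in rows if r["pk"] > last_pk][:size]

chunks = list(keyset_chunks(fetch_batch, 4))
print([len(c) for c in chunks])  # [4, 4, 3]
```

One caveat with this pattern: it assumes a monotonically increasing primary key and gives a consistent snapshot only if rows aren't inserted with lower pks mid-iteration.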
