heavy load from spark to ES errors #706
I have done the same test with the saveToEs method and I am currently loading 360,000,000 documents without any issue. I'm not sure what the problem with saveToEsWithMeta is; I assume ES needs to look up the doc id when upserting, and that's costly.
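For context, the difference between the two calls boils down to whether the caller supplies a doc id. A minimal sketch of both (assuming elasticsearch-spark; the index name `events/event` and the sample data are hypothetical, and this needs a live Spark context and ES cluster to actually run):

```scala
import org.elasticsearch.spark._

// Plain indexing: ES auto-generates the doc id, so every write is a
// pure, append-only index operation.
val docs = sc.makeRDD(Seq(
  Map("user" -> "alice"),
  Map("user" -> "bob")
))
docs.saveToEs("events/event")

// Indexing with metadata: the key of each pair becomes the doc id, so
// a repeated id turns the write into an update (delete + index) rather
// than a pure insert.
val docsWithId = sc.makeRDD(Seq(
  (1, Map("user" -> "alice")),
  (2, Map("user" -> "bob"))
))
docsWithId.saveToEsWithMeta("events/event")
```

With auto-generated ids ES can skip the existence check entirely, which is one reason the saveToEs run scales so much further.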
If I understand correctly, you have 24 partitions across 4 workers which means around 6 tasks per worker.
The issue is that in case of updates (which are basically two operations - delete and index) performance suffers after 80M documents while pure indexing works fine (360M and working).
Since I don't know what hardware you have allocated to ES, I can only infer the following:
Basically you are overloading the cluster, which eventually gives up. Again, an update is more expensive than a simple index and triggers segment merging, which likely starts around the 50-60M document mark and seriously affects performance by the 80M mark.
You have various options here (see the performance page for more info).
Note the above is typical ES / distributed systems advice.
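For instance, a few of the standard elasticsearch-hadoop settings trade throughput for stability by shrinking bulk requests and retrying longer when ES pushes back. The setting names below are the documented options; the values are purely illustrative starting points, not recommendations:

```scala
// Illustrative tuning config passed per write; smaller bulks put less
// pressure on the cluster, longer retries absorb ES pushback.
val cfg = Map(
  "es.batch.size.entries"      -> "500",  // docs per bulk request (default 1000)
  "es.batch.size.bytes"        -> "1mb",  // size cap per bulk request
  "es.batch.write.retry.count" -> "10",   // retries before failing the task
  "es.batch.write.retry.wait"  -> "30s"   // wait between retries
)
docsWithId.saveToEsWithMeta("events/event", cfg)
```

Reducing the number of concurrent Spark tasks writing to ES (fewer partitions) is another lever the performance page covers.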
Not sure why you are referring to idempotence. The issue in your case is not the outcome but rather performance.
Last but not least, this is not a bug - ES is pushing back and eventually the job fails since it takes too long to process requests. If you need more help, please use the forums.