Hadoop-Spark2Elasticsearch data ingestion problem: Elasticsearch index docs count is greater than Hive table rows count #628

Closed
steccami opened this Issue Dec 16, 2015 · 3 comments

steccami commented Dec 16, 2015

Hi all. I am trying to store a Hive table in an Elasticsearch 1.7 index, following this approach: https://www.elastic.co/guide/en/elasticsearch/hadoop/current/hive.html.
At the end of the ingestion phase, the ES index document count is greater than the Hive table row count.
The same problem occurs when sending the same data from Spark. Hive and Spark are running on CDH 5.4.

I see some job failures caused by this exception:
WARN TaskSetManager: Lost task 58.0 in stage 11.0 (TID 4791, xxx): org.elasticsearch.hadoop.EsHadoopException: Could not write all entries [32/471680](maybe ES was overloaded?). Bailing out...
This is a documented issue (https://www.elastic.co/guide/en/elasticsearch/hadoop/current/performance.html), but it seems that the usual symptom is fewer docs in ES, not more.

I suspect that the YARN job re-submission mechanism causes Hadoop to re-send records, which in turn creates duplicate docs in ES. Does this explanation make sense? Any suggestions on how to fix it?
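
The suspected mechanism can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions: the "index" is a plain Python dict rather than a real Elasticsearch index, the helper names (`index_partition`, `run_with_retry`) and the `id` field are hypothetical, and the failure point is arbitrary. It shows why a retried task inflates the doc count when each write gets a fresh auto-generated ID, and why the count stays equal to the row count when the document ID is derived from the row itself.

```python
import uuid


def index_partition(index, rows, fail_after=None, id_field=None):
    """Write rows into a dict standing in for an index.

    Without id_field, every write gets a fresh random ID (auto-generated IDs).
    With id_field, the row's own key is used as the document ID.
    If fail_after is set, the task dies before writing that row number.
    """
    for n, row in enumerate(rows):
        if fail_after is not None and n == fail_after:
            raise RuntimeError("simulated task failure")
        doc_id = row[id_field] if id_field else uuid.uuid4().hex
        index[doc_id] = row


def run_with_retry(index, rows, id_field=None):
    """First attempt fails after 3 rows; the retry re-sends the whole partition."""
    try:
        index_partition(index, rows, fail_after=3, id_field=id_field)
    except RuntimeError:
        index_partition(index, rows, id_field=id_field)


rows = [{"id": str(i), "value": i * 10} for i in range(5)]

auto_ids = {}
run_with_retry(auto_ids, rows)              # rows already written are written again

keyed = {}
run_with_retry(keyed, rows, id_field="id")  # re-sent rows overwrite themselves

print(len(rows), len(auto_ids), len(keyed))  # prints "5 8 5"
```

In the auto-ID case the three rows written before the failure are indexed a second time on retry, so the store ends up with more docs than source rows. ES-Hadoop has an `es.mapping.id` setting for taking the document ID from a field; whether the table here has a suitable unique column is not stated in this thread.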

Thanks in advance for your help.

Member

costin commented Jan 15, 2016

Hi.
Yes, it does. In the case of MapReduce, one can actually try to prevent this from happening, as documented here. Hive also has an option for this (which ES-Hadoop should document), namely hive.mapred.reduce.tasks.speculative.execution. Can you set that to false and see whether it makes any difference?
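
As a rough illustration of the suggestion above, the sketch below renders the relevant properties as Hive `SET` statements that could be pasted at the top of a Hive script. Only `hive.mapred.reduce.tasks.speculative.execution` comes from this thread; `mapreduce.map.speculative` and `mapreduce.reduce.speculative` are the Hadoop 2 MapReduce properties and are included here as an assumption about what the cluster also honors. The helper name `hive_set_statements` is hypothetical.

```python
# Properties that turn speculative execution off.
# Only the first key is named in this thread; the other two are assumed
# to apply to the Hadoop 2 MapReduce runtime in use.
SPECULATION_OFF = {
    "hive.mapred.reduce.tasks.speculative.execution": "false",
    "mapreduce.map.speculative": "false",
    "mapreduce.reduce.speculative": "false",
}


def hive_set_statements(props):
    """Render a property dict as Hive SET statements, sorted by key."""
    return [f"SET {key}={value};" for key, value in sorted(props.items())]


for statement in hive_set_statements(SPECULATION_OFF):
    print(statement)
```

For the Spark path mentioned in the original report, the analogous switch is `spark.speculation`, which is off by default. Note that disabling speculation does not stop retries of failed tasks, so it may reduce duplicates without eliminating them.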

costin added a commit that referenced this issue Jan 15, 2016

costin added a commit that referenced this issue Jan 16, 2016

steccami commented Jan 19, 2016

Thank you very much for your reply! In the future I will check this parameter very carefully.
Regards.

Member

costin commented Jan 29, 2016

Closing the issue.

costin closed this Jan 29, 2016