inconsistent num of docs written to ES #167

Closed
mbaig opened this Issue Mar 13, 2014 · 6 comments

@mbaig commented Mar 13, 2014

Seeing an issue with M2. Specifically, es-hadoop is not writing the proper number of docs to ES. For instance, es-hadoop should write 2,964,515 docs to ES; instead, I'm seeing the following inconsistent write counts:

attempt    num of docs written
1st        2,876,634
2nd        2,964,515
3rd        2,935,711

Note that I deleted the index between write attempts; without doing so, es-hadoop would write more docs than were actually produced, which is even weirder.
Note also that only the 2nd attempt succeeded.

I would like to point you to some trace logs, but given their sheer size after ~3 million writes, I'm not sure how to truncate them to a manageable size.
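
(For reference, one way to sanity-check the count on the ES side, independent of the job logs, is the index's _count API. A minimal Java sketch, assuming the cluster is reachable on localhost:9200 and the target index is named "docs" - both placeholders to adjust:)

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class IndexCountCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder host and index name - point this at your cluster/index
        URL url = new URL("http://localhost:9200/docs/_count");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            // Response is a small JSON body, e.g. {"count":2964515,"_shards":{...}}
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
        conn.disconnect();
    }
}

Running this after each attempt (with the index deleted in between, as above) makes it easy to compare what ES actually holds against what the job claims to have written.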

@mbaig (Author) commented Mar 13, 2014

Also of note: I had the M2 client connect to our ES cluster via SSH tunnels, which, by the way, work perfectly in M2 but NOT in SNAPSHOT.

@costin (Member) commented Mar 13, 2014

Without any logs or settings regarding networking and the difference between M2 and SNAPSHOT, I can only guess at the issue in your case - namely that the discovered cluster settings were not actually propagated to the job.
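
(A general note for anyone hitting the same symptom behind SSH tunnels: node discovery can hand the job the cluster's own addresses, which are not reachable through the tunnel. A minimal sketch of pinning the connector to the tunnelled address - it assumes the es.nodes.discovery setting from later es-hadoop releases, so check that it exists in the version you actually run:)

import org.apache.hadoop.mapred.JobConf;

public class TunnelledEsSettings {
    public static JobConf apply(JobConf conf) {
        // Point the connector at the local end of the SSH tunnel (placeholder address)
        conf.set("es.nodes", "localhost:9200");
        // Don't discover and switch to the cluster's own node addresses,
        // which are unreachable from outside the tunnel (assumed setting, see note above)
        conf.set("es.nodes.discovery", "false");
        return conf;
    }
}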

costin added a commit that referenced this issue Mar 17, 2014

@costin (Member) commented Mar 17, 2014

Considering the issues you had with M2, it would be great if you could retry your tests with the current master.
Not only does it support proxying (both SOCKS - probably what you use - and HTTP), it also records stats. These are captured for each job and are likely to give a better picture of what's going on.
For example:
22:37:49,228 INFO main mapred.JobClient - Elasticsearch Hadoop Counters
22:37:49,228 INFO main mapred.JobClient - Bulk Retries=0
22:37:49,228 INFO main mapred.JobClient - Bytes Written=159129
22:37:49,228 INFO main mapred.JobClient - Network Retries=0
22:37:49,228 INFO main mapred.JobClient - Bytes Retried=0
22:37:49,228 INFO main mapred.JobClient - Bulk Writes=20
22:37:49,228 INFO main mapred.JobClient - Documents Recorded=993
22:37:49,228 INFO main mapred.JobClient - Documents Read=0
22:37:49,229 INFO main mapred.JobClient - Node Retries=0
22:37:49,229 INFO main mapred.JobClient - Documents Written=993
22:37:49,229 INFO main mapred.JobClient - Bytes Recorded=159129
22:37:49,229 INFO main mapred.JobClient - Bytes Read=79921

Notice the various fields - retries vs. written vs. recorded.
I'd be curious how master fares against your current job.
And make sure you disable speculative execution, otherwise you will get duplicates.
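
(A minimal sketch of those two points with the old mapred API - turning speculative execution off and reading the counters back after the job. The actual job wiring, mapper, formats and es.* settings, is elided; the class name is a placeholder:)

import org.apache.hadoop.mapred.Counters;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class EsJobWithCounters {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf();
        // ... mapper, input/output formats and the es.* settings for your job go here ...

        // Speculative execution re-runs slow tasks; with ES as the sink,
        // the duplicate attempts would index the same documents twice.
        conf.setMapSpeculativeExecution(false);
        conf.setReduceSpeculativeExecution(false);

        RunningJob job = JobClient.runJob(conf);

        // Dump every counter group recorded for the job; the es-hadoop group
        // shows up as "Elasticsearch Hadoop Counters", as in the log above.
        Counters counters = job.getCounters();
        for (Counters.Group group : counters) {
            System.out.println(group.getDisplayName());
            for (Counters.Counter counter : group) {
                System.out.println("  " + counter.getDisplayName() + "=" + counter.getValue());
            }
        }
    }
}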

@costin (Member) commented Mar 18, 2014

By the way, master contains reporting for local Cascading as well - which means that whether you are using local or Hadoop mode, you can get hold of these stats in Cascading through Flow#getStats.
Can you please retry your job and look at the stats instead - they should give us better insight into what is going on.
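
(A rough sketch of pulling those counters out of a Cascading flow via Flow#getStats; the flow wiring is elided and the helper class is a placeholder:)

import cascading.flow.Flow;
import cascading.stats.FlowStats;

public class FlowCounterDump {
    public static void runAndDumpCounters(Flow<?> flow) {
        // Run the (already connected) flow writing to Elasticsearch
        flow.complete();

        FlowStats stats = flow.getStats();
        // Iterate every counter group and counter the flow recorded;
        // the es-hadoop counters appear alongside the usual Hadoop ones.
        for (String group : stats.getCounterGroups()) {
            System.out.println(group);
            for (String counter : stats.getCountersFor(group)) {
                System.out.println("  " + counter + "=" + stats.getCounterValue(group, counter));
            }
        }
    }
}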

@costin (Member) commented Mar 24, 2014

Any update on this front?

costin added the bug label and removed the v1.3.0.M2 label Mar 24, 2014

@costin (Member) commented Apr 8, 2014

@mbaig Since there hasn't been any update on this issue, I'm closing it down. If you still see issues, please re-open it but make sure you test the latest version.
