
reading an index and writing into another one #156

Closed
jpparis-orange opened this Issue Feb 28, 2014 · 6 comments

@jpparis-orange commented Feb 28, 2014

Hello,

I'm trying to read an ES index from Hive and write into another one at the same time. The job runs without any errors, but no documents appear in the ES index backing the eswrite table.
Is such a copy possible?

Here is the version of the different components I use:

  • elasticsearch-1.0.0
  • elasticsearch-hadoop-yarn.jar from 1.3.0.M2
  • hadoop-2.2.0-bin
  • hive-0.12.0-bin

I have prepared a gist reproduction here: https://gist.github.com/jpparis-orange/9319913#file-hivecopyesindextoanother. At the end (after the shell exit), you'll find the Hive commands used to perform the copy.

The copy command gives the following output:
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
...
Ended Job = job_1393406785607_0155
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 2.33 sec HDFS Read: 1683 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 330 msec
OK

After that, my hwrite index is still empty, but the version of the docs in hread is 2! It seems to me that I'm writing into the hread index.

When using a temporary Hive table, I can copy my data with the following two commands:
INSERT OVERWRITE TABLE tmp SELECT * FROM es_read;
INSERT OVERWRITE TABLE es_write SELECT * FROM tmp;

This issue seems to be related to #125 and #70.

thanks
jp

@jpparis-orange commented Mar 3, 2014

In the above gist, if I change the number of shards for the hread index to 4, I get the following exception (it is still OK with 3 shards):
Error: java.lang.RuntimeException: Hive Runtime Error while closing operators
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.close(ExecMapper.java:240)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:429)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Caused by: java.lang.NullPointerException
at org.elasticsearch.hadoop.mr.EsOutputFormat$ESRecordWriter.doClose(EsOutputFormat.java:245)
at org.elasticsearch.hadoop.mr.EsOutputFormat$ESRecordWriter.close(EsOutputFormat.java:229)
at org.elasticsearch.hadoop.hive.EsHiveOutputFormat$ESHiveRecordWriter.close(EsHiveOutputFormat.java:66)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator$FSPaths.closeWriters(FileSinkOperator.java:181)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.closeOp(FileSinkOperator.java:866)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:596)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:613)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:613)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:613)
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.close(ExecMapper.java:207)
... 8 more

@costin commented Mar 4, 2014

The issue is that the table properties are written to the same Hadoop configuration object, which means only one es.resource property will exist in the end, and it will be used for both reading and writing. For your case to work, there would have to be some kind of isolation between the table properties, but I'm not sure how that would work.
In fact, I'm not even sure whether generic Hive table properties don't override each other through the main Hadoop configuration object...
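To illustrate the collision described above, here is a hedged sketch of a pre-fix setup (table definitions, index paths, and the storage-handler class name are assumptions, not taken from this thread; the gist has the actual commands):

```sql
-- Hypothetical pre-fix setup: both tables set the same 'es.resource' key.
-- When Hive merges the table properties into the single job configuration,
-- one value overwrites the other, so reads and writes hit the same index.
CREATE EXTERNAL TABLE es_read (id BIGINT, name STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'hread/doc');

CREATE EXTERNAL TABLE es_write (id BIGINT, name STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'hwrite/doc');

-- Both the scan of es_read and the sink es_write end up pointing at the
-- same resource inside the one MapReduce job, matching the symptom
-- reported above (hwrite empty, hread documents at version 2).
INSERT OVERWRITE TABLE es_write SELECT * FROM es_read;
```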

@jpparis-orange commented Mar 5, 2014

I'm a bit disappointed, because using an intermediate Hive table in my use case is not feasible: my indices are far too large. Anyway, I understand your explanation. I'll now try to use your Hadoop-ES bridge directly in Java to see if I can go further.

Thanks for the quick answer!

@costin costin closed this in 68cd50e Mar 10, 2014

@costin commented Mar 10, 2014

Hi,

This has been fixed in master — can you please try it out? The next nightly build (#336) should include it, but of course you can build it yourself until the build is published.

Basically, there's no need for an intermediate table - you can have different input/output indices in the same job.
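With different indices configured per table, the direct copy should then reduce to a single statement (a sketch; the es_read/es_write names follow the workaround earlier in the thread and each table is assumed to declare its own 'es.resource'):

```sql
-- Post-fix: es_read and es_write are backed by different ES indices via
-- their own 'es.resource' table properties, so one job both reads and
-- writes, with no intermediate Hive table needed.
INSERT OVERWRITE TABLE es_write SELECT * FROM es_read;
```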

@jpparis-orange commented Mar 12, 2014

Hi,

Great job!

I have updated the gist https://gist.github.com/jpparis-orange/9319913#file-hivecopyesindextoanother with the appropriate commands. They are doing the right job now!

I'm really pleased to see this issue closed!

@costin commented Mar 12, 2014

Great! Could you also post some stats between using the intermediate table vs using ES directly?

If it's easier we can just chat on IRC about this - I'm costin on #elasticsearch.

costin added a commit that referenced this issue Apr 8, 2014

Split global resource into read/write targets
Improve the configuration to allow dedicated read and write resources as opposed to a single, unified resource used for both. This allows different ES indices to be used in the same job, one as a source and the other as a sink.

'es.resource' is still supported and used as a fall back.
Higher level abstractions, such as Cascading, Hive and Pig, set the
proper property automatically.

fix #156
fix #45
fix #26
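Following the commit message above, the split surfaces as separate read and write properties, with 'es.resource' kept as a fallback. A hedged sketch of setting them explicitly on a single Hive table (the property names come from the commit; the table and index names are hypothetical, and Hive normally sets the proper property automatically):

```sql
-- Hedged sketch: after this commit, the read and write targets can be
-- configured independently; 'es.resource' remains the fallback for both.
CREATE EXTERNAL TABLE es_copy (id BIGINT, name STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES(
  'es.resource.read'  = 'hread/doc',   -- source index/type
  'es.resource.write' = 'hwrite/doc'   -- sink index/type
);
```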