
reading an index and writing into another one #156

Closed
jpparis-orange opened this Issue Feb 28, 2014 · 6 comments

@jpparis-orange commented Feb 28, 2014

Hello,

I'm trying to read an ES index from Hive and write into another one at the same time. The job runs without any errors, but no documents appear in the ES index backing the eswrite table.
Is such a copy possible?

Here is the version of the different components I use:

  • elasticsearch-1.0.0
  • elasticsearch-hadoop-yarn.jar from 1.3.0.M2
  • hadoop-2.2.0-bin
  • hive-0.12.0-bin

I have prepared a gist reproduction here: https://gist.github.com/jpparis-orange/9319913#file-hivecopyesindextoanother. At the end (after the shell exit), you'll find the Hive commands used to perform the copy.

The copy command gives the following output:
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
...
Ended Job = job_1393406785607_0155
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 2.33 sec HDFS Read: 1683 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 330 msec
OK

After that, my hwrite index is still empty, but the version of the docs in hread is 2! It seems to me that I'm writing into the hread index.

When using a temporary Hive table, I can copy my data with the following two commands:
INSERT OVERWRITE TABLE tmp SELECT * FROM es_read;
INSERT OVERWRITE TABLE es_write SELECT * FROM tmp;

This issue seems to be related to #125 and #70.

thanks
jp

@jpparis-orange commented Mar 3, 2014

In the above gist, if I change the number of shards for the hread index to 4, I get the following exception (it is still OK with 3 shards):
Error: java.lang.RuntimeException: Hive Runtime Error while closing operators
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.close(ExecMapper.java:240)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:429)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Caused by: java.lang.NullPointerException
at org.elasticsearch.hadoop.mr.EsOutputFormat$ESRecordWriter.doClose(EsOutputFormat.java:245)
at org.elasticsearch.hadoop.mr.EsOutputFormat$ESRecordWriter.close(EsOutputFormat.java:229)
at org.elasticsearch.hadoop.hive.EsHiveOutputFormat$ESHiveRecordWriter.close(EsHiveOutputFormat.java:66)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator$FSPaths.closeWriters(FileSinkOperator.java:181)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.closeOp(FileSinkOperator.java:866)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:596)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:613)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:613)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:613)
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.close(ExecMapper.java:207)
... 8 more

@costin commented Mar 4, 2014

The issue is that the table properties are written to the same Hadoop configuration object, which means only one es.resource property will exist in the end, and it will be used for both reading and writing. For your case to work, there would have to be some kind of isolation between the table properties, but I'm not sure how that would work.
In fact, I'm not even sure whether generic Hive table properties don't override each other through the main Hadoop configuration object...
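To illustrate the collision described above, here is a hedged sketch of a pre-fix setup (table definitions, index paths, and the storage-handler class name are assumptions, not taken from this thread; the gist has the actual commands):

```sql
-- Hypothetical pre-fix setup: both tables set the same 'es.resource' key.
-- When Hive merges the table properties into the single job configuration,
-- one value overwrites the other, so reads and writes hit the same index.
CREATE EXTERNAL TABLE es_read (id BIGINT, name STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'hread/doc');

CREATE EXTERNAL TABLE es_write (id BIGINT, name STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'hwrite/doc');

-- Both the scan of es_read and the sink es_write end up pointing at the
-- same resource inside the one MapReduce job, matching the symptom
-- reported above (hwrite empty, hread documents at version 2).
INSERT OVERWRITE TABLE es_write SELECT * FROM es_read;
```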

@jpparis-orange commented Mar 5, 2014

I'm a bit disappointed, because using an intermediate Hive table in my use case is not feasible: my indices are far too large. Anyway, I understand your explanation. I'll now try to use your Hadoop-ES bridge directly in Java to see if I can go further.

Thanks for the quick answer!

@costin costin closed this in 68cd50e Mar 10, 2014

@costin commented Mar 10, 2014

Hi,

This has been fixed in master — can you please try it out? The next nightly build (#336) should include it, but of course you can build it yourself until the build is published.

Basically, there's no need for an intermediate table - you can have different input/output indices in the same job.
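With different indices configured per table, the direct copy should then reduce to a single statement (a sketch; the es_read/es_write names follow the workaround earlier in the thread and each table is assumed to declare its own 'es.resource'):

```sql
-- Post-fix: es_read and es_write are backed by different ES indices via
-- their own 'es.resource' table properties, so one job both reads and
-- writes, with no intermediate Hive table needed.
INSERT OVERWRITE TABLE es_write SELECT * FROM es_read;
```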

@jpparis-orange commented Mar 12, 2014

Hi,

Great job!

I have updated the gist https://gist.github.com/jpparis-orange/9319913#file-hivecopyesindextoanother with the appropriate commands. They are doing the right job now!

I'm really pleased to see this issue closed!

@costin commented Mar 12, 2014

Great! Could you also post some stats between using the intermediate table vs using ES directly?

If it's easier we can just chat on IRC about this - I'm costin on #elasticsearch.

costin added a commit that referenced this issue Apr 8, 2014

Split global resource into read/write targets
Improve the configuration to allow dedicated read and write resources as opposed to a single, unified resource used for both. This allows different ES indices to be used in the same job, one as a source and the other as a sink.

'es.resource' is still supported and used as a fall back.
Higher level abstractions, such as Cascading, Hive and Pig, set the
proper property automatically.

fix #156
fix #45
fix #26
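Following the commit message above, the split surfaces as separate read and write properties, with 'es.resource' kept as a fallback. A hedged sketch of setting them explicitly on a single Hive table (the property names come from the commit; the table and index names are hypothetical, and Hive normally sets the proper property automatically):

```sql
-- Hedged sketch: after this commit, the read and write targets can be
-- configured independently; 'es.resource' remains the fallback for both.
CREATE EXTERNAL TABLE es_copy (id BIGINT, name STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES(
  'es.resource.read'  = 'hread/doc',   -- source index/type
  'es.resource.write' = 'hwrite/doc'   -- sink index/type
);
```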