
Not possible to query and create indexes in the same Job #26

Closed
tzolov opened this Issue Apr 9, 2013 · 6 comments

tzolov commented Apr 9, 2013

Because the Configuration allows only one es.resource setting, it is impossible to query one index and create a different one in the same Job. (see #18)

tzolov added a commit to tzolov/elasticsearch-hadoop that referenced this issue Apr 9, 2013

costin commented Apr 18, 2013

That's actually by design - why would you want to read from and write to ES in the same job? You don't gain anything, and it's actually quite problematic (though I might be wrong here until the implementation is finished) for parallel querying/insertion.

ash211 commented Apr 18, 2013

Suppose I had a workflow that pre-computed an attribute on each Elasticsearch document and I wanted to save it back to that document?

One example I can think of: you have a large number of stock purchases with (unitCount, unitPrice) pairs and you want to multiply those two together to get (unitCount, unitPrice, totalPrice) tuples. You'd want to read from and write to the same index there, I think.

This should probably be done by whatever puts data into ES in the first place, but being able to go back and enrich previously-existing data with a Hadoop job is a powerful capability.
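The enrichment described above can be sketched as a plain function over a document. In a real job this logic would sit inside a Mapper that reads from and writes back to ES; here a `java.util.Map` stands in for the record type, and all names (`enrich`, the field names) are illustrative, not part of elasticsearch-hadoop.

```java
import java.util.HashMap;
import java.util.Map;

public class PurchaseEnricher {

    // Derives totalPrice = unitCount * unitPrice for one document,
    // returning an enriched copy (the original map is left untouched).
    public static Map<String, Object> enrich(Map<String, Object> doc) {
        int unitCount = ((Number) doc.get("unitCount")).intValue();
        double unitPrice = ((Number) doc.get("unitPrice")).doubleValue();
        Map<String, Object> out = new HashMap<>(doc);
        out.put("totalPrice", unitCount * unitPrice);
        return out;
    }
}
```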


tzolov commented Apr 19, 2013

I have a couple of integration tests that (should) read and write from ES in a single Job. The flow looks like this: (1) read from ES, (2) aggregate in HDP and write the statistics back to ES. Here is the source:
https://github.com/tzolov/elasticsearch-hadoop/tree/master/src/test/java/org/elasticsearch/hadoop/integration/crunch/writable/e2e

I might be missing the details, but isn't the parallelism (on the map side) defined by the number of InputSplits? That is independent of the number of Jobs. Or perhaps you meant something else?

costin commented Apr 19, 2013

A Hadoop job is made up of a Mapper/Reducer pair. In the case you mentioned, a read and a write, you end up doing two operations and thus two Hadoop jobs.
This is not apparent when one uses Cascading/Crunch because the resolution process happens behind the scenes.
Even for cases like Map/Reduce/Reduce (quite common) you actually end up with Map/Reduce, IdentityMapper/Reduce.

As for reading and writing data to the same index in the same job: considering it's done per shard, there are some consistency considerations that need to be taken into account.

tzolov commented Apr 21, 2013

I'm familiar with Hadoop and M/R. Also, as a Crunch committer, I have a certain understanding of what's going on behind the scenes. Although complex pipelines can produce multi-job execution plans, that is not the case with the example above. It uses a single Hadoop Job (1 mapper + 1 reducer) to query ES, aggregate the data and write the result to a (possibly different) ES index. A standalone version of the sample app: https://gist.github.com/tzolov/5429016

I'm still trying to understand what makes this use case difficult or not worthwhile to support.

In the https://github.com/tzolov/elasticsearch-hadoop master branch the issue is solved by introducing ES_QUERY alongside ES_RESOURCE. The former holds the ES query and the latter keeps the target index. ESRecordReader.init() checks for ES_QUERY first and falls back to ES_RESOURCE (to preserve the existing semantics).
This patch works fine for me. Do you see any problem with this approach?
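The fallback order described above can be sketched in isolation. Plain `java.util.Properties` stands in for the project's Settings class here, and the key names merely mirror the patch for illustration; this is not the actual elasticsearch-hadoop code.

```java
import java.util.Properties;

public class ResourceResolver {
    // Illustrative key names mirroring the ES_QUERY / ES_RESOURCE
    // options described in the patch.
    static final String ES_QUERY = "es.query";
    static final String ES_RESOURCE = "es.resource";

    // Returns what the record reader should query: ES_QUERY if set,
    // otherwise the shared ES_RESOURCE (the existing semantics).
    public static String readTarget(Properties settings) {
        String query = settings.getProperty(ES_QUERY);
        return (query != null) ? query : settings.getProperty(ES_RESOURCE);
    }
}
```

With only `es.resource` set, reads and writes share one index; once `es.query` is also set, reads go to the query target while `es.resource` keeps naming the index to write to.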

tzolov commented May 23, 2013

To support this feature (e.g. create and update indices in one job), in addition to the patches mentioned above one also has to patch BufferedRestClient as follows:

    public BufferedRestClient(Settings settings) {
        this.client = new RestClient(settings);
        String tempIndex = settings.getTargetResource();
        if (tempIndex == null) {
            // FIX for issue #26 -- start
            tempIndex = settings.getProperty(ConfigurationOptions.ES_QUERY);
            if (tempIndex == null) {
                tempIndex = "";
            }
            // FIX for issue #26 -- end
        }
     ...

@costin costin removed the v1.3.0.M2 label Feb 6, 2014

@costin costin closed this in 68cd50e Mar 10, 2014

costin added a commit that referenced this issue Apr 8, 2014

Split global resource into read/write targets
Improve conf to allow for dedicated read and write resources as opposed
to a single, unified resource used for both. This allows different ES
indices to be used in the same job, one as a source and the other as
a sink.

'es.resource' is still supported and used as a fallback.
Higher-level abstractions, such as Cascading, Hive and Pig, set the
proper property automatically.

fix #156
fix #45
fix #26
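The resolution order introduced by the commit above (dedicated read/write resources with `es.resource` as a fallback) can be sketched as follows. Plain `java.util.Properties` stands in for the Hadoop Configuration/Settings object; only the property names `es.resource.read` and `es.resource.write` come from the project, the rest is illustrative.

```java
import java.util.Properties;

public class SplitResources {

    // The index to read from: dedicated read target if configured,
    // otherwise the legacy unified 'es.resource'.
    public static String readResource(Properties conf) {
        String r = conf.getProperty("es.resource.read");
        return (r != null) ? r : conf.getProperty("es.resource");
    }

    // The index to write to, with the same fallback.
    public static String writeResource(Properties conf) {
        String w = conf.getProperty("es.resource.write");
        return (w != null) ? w : conf.getProperty("es.resource");
    }
}
```

This is exactly the shape of fix the thread asked for: a job can now query one index and create a different one.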