
Not possible to query and create indexes in the same Job #26

Closed
tzolov opened this Issue Apr 9, 2013 · 6 comments

tzolov commented Apr 9, 2013

Because the Configuration allows only one es.resource setting, it is impossible to query one index and create a different one in the same Job. (see #18)

tzolov added a commit to tzolov/elasticsearch-hadoop that referenced this issue Apr 9, 2013

costin commented Apr 18, 2013

That's actually by design - why would you want to read from and write to ES in the same job? You don't gain anything, and it's actually quite problematic (though I might be wrong here until the implementation is finished) for parallel querying/insertion.

ash211 commented Apr 18, 2013

Suppose I had a workflow that pre-computed an attribute on each Elasticsearch document and I wanted to save it back to that document?

One example I can think of: you have a large number of stock purchases with (unitCount, unitPrice) pairs and you want to multiply those two together to get (unitCount, unitPrice, totalPrice) tuples. You'd want to read from and write to the same index there, I think.

This should probably be done by whatever puts data into ES in the first place, but being able to go back and enrich previously-existing data with a Hadoop job is a powerful capability.
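The enrichment described above can be sketched as a plain function over a document. In a real job this logic would sit inside a Mapper that reads from and writes back to ES; here a `java.util.Map` stands in for the record type, and all names (`enrich`, the field names) are illustrative, not part of elasticsearch-hadoop.

```java
import java.util.HashMap;
import java.util.Map;

public class PurchaseEnricher {

    // Derives totalPrice = unitCount * unitPrice for one document,
    // returning an enriched copy (the original map is left untouched).
    public static Map<String, Object> enrich(Map<String, Object> doc) {
        int unitCount = ((Number) doc.get("unitCount")).intValue();
        double unitPrice = ((Number) doc.get("unitPrice")).doubleValue();
        Map<String, Object> out = new HashMap<>(doc);
        out.put("totalPrice", unitCount * unitPrice);
        return out;
    }
}
```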


tzolov commented Apr 19, 2013

I have a couple of integration tests that (should) read and write from ES in a single Job. The flow looks like this: (1) read from ES, (2) aggregate in HDP and write the statistics back to ES. Here is the source:
https://github.com/tzolov/elasticsearch-hadoop/tree/master/src/test/java/org/elasticsearch/hadoop/integration/crunch/writable/e2e

I might be missing the details, but isn't the parallelism (on the map side) defined by the number of InputSplits? That is independent of the number of Jobs. Or perhaps you meant something else?

costin commented Apr 19, 2013

A Hadoop job is made up of a Mapper/Reducer pair. In the case you mentioned, a read and a write, you end up doing two operations and thus two Hadoop jobs.
This is not apparent when one uses Cascading/Crunch because the resolution process happens behind the scenes.
Even for cases like Map/Reduce/Reduce (quite common) you actually end up with Map/Reduce, IdentityMapper/Reduce.

As for reading and writing data to the same index in the same job: considering it's done per shard, there are some consistency considerations that need to be taken into account.

tzolov commented Apr 21, 2013

I'm familiar with Hadoop and M/R. Also, as a Crunch committer, I have a certain understanding of what's going on behind the scenes. Although complex pipelines can produce multi-job execution plans, that is not the case with the example above. It uses a single Hadoop Job (1 mapper + 1 reducer) to query ES, aggregate the data and write the result to a (possibly different) ES index. A standalone version of the sample app: https://gist.github.com/tzolov/5429016

I'm still trying to understand what makes this use case difficult or not worthwhile to support.

In the https://github.com/tzolov/elasticsearch-hadoop master branch the issue is solved by introducing ES_QUERY alongside ES_RESOURCE. The former holds the ES query and the latter keeps the target index. ESRecordReader.init() checks for ES_QUERY first and falls back to ES_RESOURCE (to preserve the existing semantics).
This patch works fine for me. Do you see any problem with this approach?
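The fallback order described above can be sketched in isolation. Plain `java.util.Properties` stands in for the project's Settings class here, and the key names merely mirror the patch for illustration; this is not the actual elasticsearch-hadoop code.

```java
import java.util.Properties;

public class ResourceResolver {
    // Illustrative key names mirroring the ES_QUERY / ES_RESOURCE
    // options described in the patch.
    static final String ES_QUERY = "es.query";
    static final String ES_RESOURCE = "es.resource";

    // Returns what the record reader should query: ES_QUERY if set,
    // otherwise the shared ES_RESOURCE (the existing semantics).
    public static String readTarget(Properties settings) {
        String query = settings.getProperty(ES_QUERY);
        return (query != null) ? query : settings.getProperty(ES_RESOURCE);
    }
}
```

With only `es.resource` set, reads and writes share one index; once `es.query` is also set, reads go to the query target while `es.resource` keeps naming the index to write to.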

tzolov commented May 23, 2013

To support this feature (e.g. create and update indices in one job), in addition to the patches mentioned above one also has to patch BufferedRestClient as follows:

    public BufferedRestClient(Settings settings) {
        this.client = new RestClient(settings);
        String tempIndex = settings.getTargetResource();
        if (tempIndex == null) {
            // FIX for issue #26 -- start
            tempIndex = settings.getProperty(ConfigurationOptions.ES_QUERY);
            if (tempIndex == null) {
                tempIndex = "";
            }
            // FIX for issue #26 -- end
        }
     ...

@costin costin removed the v1.3.0.M2 label Feb 6, 2014

@costin costin closed this in 68cd50e Mar 10, 2014

costin added a commit that referenced this issue Apr 8, 2014

Split global resource into read/write targets
Improve conf to allow for dedicated read and write resources as opposed
to a single, unified resource used for both. This allows different ES
indices to be used in the same job, one as a source and the other as
a sink.

'es.resource' is still supported and used as a fallback.
Higher-level abstractions, such as Cascading, Hive and Pig, set the
proper property automatically.

fix #156
fix #45
fix #26
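The resolution order introduced by the commit above (dedicated read/write resources with `es.resource` as a fallback) can be sketched as follows. Plain `java.util.Properties` stands in for the Hadoop Configuration/Settings object; only the property names `es.resource.read` and `es.resource.write` come from the project, the rest is illustrative.

```java
import java.util.Properties;

public class SplitResources {

    // The index to read from: dedicated read target if configured,
    // otherwise the legacy unified 'es.resource'.
    public static String readResource(Properties conf) {
        String r = conf.getProperty("es.resource.read");
        return (r != null) ? r : conf.getProperty("es.resource");
    }

    // The index to write to, with the same fallback.
    public static String writeResource(Properties conf) {
        String w = conf.getProperty("es.resource.write");
        return (w != null) ? w : conf.getProperty("es.resource");
    }
}
```

This is exactly the shape of fix the thread asked for: a job can now query one index and create a different one.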