Not possible to query and create indexes in the same Job #26
Comments
That's actually by design - why would you want to read and write in the same job to ES? You don't gain anything and actually it's quite problematic (though I might be wrong here until the implementation is finished) for parallel querying/insertion.
Suppose I had a workflow that pre-computed an attribute on each … One example I can think of is you have a large amount of stock purchases … This should probably be done by whatever puts data into ES in the first …

On Thu, Apr 18, 2013 at 11:13 AM, Costin Leau notifications@github.com wrote:
I have a couple of integration tests that (should) read and write from ES in a single Job. The flow looks like this: (1) read from ES, (2) aggregate in Hadoop and write the statistics back to ES. Here is the source: I might be missing the details, but isn't the parallelism (on the map side) defined by the number of InputSplits? That is irrelevant to the number of Jobs. Or perhaps you meant something else?
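The single-job flow described here (query ES, aggregate, write the statistics back) can be sketched in plain Java. The in-memory collections below stand in for the source and target indices, and the counting loop mirrors what the single mapper/reducer pair would do; all names are illustrative, not the actual integration-test code:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SingleJobFlowSketch {
    // Aggregation step of the single job: count occurrences per key,
    // standing in for the 1 mapper + 1 reducer described above.
    static Map<String, Integer> aggregate(List<String> records) {
        Map<String, Integer> stats = new HashMap<>();
        for (String record : records) {
            stats.merge(record, 1, Integer::sum);
        }
        return stats;
    }

    public static void main(String[] args) {
        // (1) "Read from ES": records returned by the query against the source index
        List<String> sourceIndex = List.of("radio", "tv", "radio", "web");
        // (2) Aggregate inside the single mapper/reducer pair
        Map<String, Integer> stats = aggregate(sourceIndex);
        // (3) "Write back to ES": statistics land in a (possibly different) index
        System.out.println(stats); // counts per key, e.g. radio=2, tv=1, web=1
    }
}
```

In the real test, steps (1) and (3) would be the ES-backed input and output formats of the same Hadoop Job, which is exactly the combination this issue asks for.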
A Hadoop job is made up of a Mapper/Reducer. In the cases you mentioned (read and write), you end up doing two operations, and thus two Hadoop jobs. As for reading and writing data to the same index in the same job: considering it's done per shard, there are some consistency considerations that need to be taken into account.
I'm familiar with Hadoop and M/R. Also, as a Crunch committer, I have a certain understanding of what goes on behind the scenes. Although complex pipelines can produce multi-job execution plans, this is not the case with the example above. It uses a single Hadoop Job (1 mapper + 1 reducer) to query ES, aggregate the data and write the result to a (possibly different) ES index. Standalone version of the sample app: https://gist.github.com/tzolov/5429016 I'm still trying to understand what makes it difficult or unworthy to support this use case. In the https://github.com/tzolov/elasticsearch-hadoop master branch the issue is solved by introducing ES_QUERY alongside ES_RESOURCE. The former holds the ES query and the latter keeps the target index. ESRecordReader.init() checks for ES_QUERY first and falls back to ES_RESOURCE (to preserve the existing semantics).
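The ES_QUERY / ES_RESOURCE fallback described above boils down to a simple lookup order on the read side. A minimal sketch using `java.util.Properties` — the key names mirror the patch's intent, but this is illustrative code, not the actual ESRecordReader:

```java
import java.util.Properties;

public class ResourceResolver {
    // Key names mirroring ES_QUERY / ES_RESOURCE from the patch (assumed here)
    static final String ES_QUERY = "es.query";
    static final String ES_RESOURCE = "es.resource";

    // Read side: prefer the dedicated query setting, fall back to the unified
    // resource so existing single-resource configurations keep working.
    static String readResource(Properties conf) {
        String query = conf.getProperty(ES_QUERY);
        return (query != null) ? query : conf.getProperty(ES_RESOURCE, "");
    }
}
```

With only `es.resource` set, reads behave exactly as before; additionally setting `es.query` lets the same job read from the queried index while writes go to the index named by `es.resource`.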
To support this feature (i.e. create and update indices in one job), in addition to the patches mentioned above one also has to patch BufferedRestClient:

```java
public BufferedRestClient(Settings settings) {
    this.client = new RestClient(settings);
    String tempIndex = settings.getTargetResource();
    if (tempIndex == null) {
        // FIX for issue #26 -- start
        tempIndex = settings.getProperty(ConfigurationOptions.ES_QUERY);
        if (tempIndex == null) {
            tempIndex = "";
        }
        // FIX for issue #26 -- end
    }
    ...
```
Improve conf to allow for dedicated read and write resources as opposed to a single, unified resource used for both. This allows different ES indices to be used in the same job, one as a source and the other as a sink. 'es.resource' is still supported and used as a fallback. Higher-level abstractions, such as Cascading, Hive and Pig, set the proper property automatically. fix #156 fix #45 fix #26
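Under that change, a job querying one index and writing to another would presumably be configured along these lines; the dedicated key names `es.resource.read` and `es.resource.write` follow the commit description, but treat the exact spelling as an assumption:

```properties
# Dedicated read and write resources: same job, different indices
es.resource.read  = source-index/type
es.resource.write = target-index/type

# Legacy fallback: used for both sides when the dedicated keys are absent
es.resource = unified-index/type
```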
Because the Configuration allows only one es.resource setting, it is impossible to query one index and create a different one in the same Job. (see #18)