
Issue with named queries #273

Closed · SHSE opened this issue Sep 20, 2014 · 9 comments

SHSE (Contributor) commented Sep 20, 2014

It throws an exception when the _name attribute is set in the query:

elasticsearch.hadoop.EsHadoopIllegalArgumentException: expected object, found START_ARRAY
        org.elasticsearch.hadoop.util.Assert.isTrue(Assert.java:50)
        org.elasticsearch.hadoop.serialization.ScrollReader.readHit(ScrollReader.java:107)
        org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:99)
        org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:78)
        org.elasticsearch.hadoop.rest.RestRepository.scroll(RestRepository.java:297)
        org.elasticsearch.hadoop.rest.ScrollQuery.hasNext(ScrollQuery.java:76)
        org.elasticsearch.spark.rdd.AbstractEsRDDIterator.hasNext(AbstractEsRDDIterator.scala:46)
        scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        scala.collection.Iterator$class.foreach(Iterator.scala:727)
        scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:759)
        org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:759)
        org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
        org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
        org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
        org.apache.spark.scheduler.Task.run(Task.scala:54)
        org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
        java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        java.lang.Thread.run(Thread.java:744)
Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
        at akka.actor.ActorCell.invoke(ActorCell.scala:456)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

costin (Member) commented Sep 20, 2014

@SHSE can you provide a bit more context and the query you are trying to execute? Sample JSON/logs would be great.

SHSE (Contributor, Author) commented Sep 21, 2014

@costin here is how to reproduce it.

Elasticsearch 1.3.2:

$ curl localhost:9200/bar/docs/1 -d'{"text":"bar"}'

Spark 1.1.0, Elasticsearch Hadoop 2.1.0.BUILD-20140920.154752-136:

context
  .esRDD(Map(
    "es.nodes" -> "localhost",
    "es.resource" -> "bar/docs",
    "es.query" -> "{\"query\":{\"query_string\":{\"query\":\"text:bar\",\"_name\":\"myquery\"}}}"
  ))
  .count()

SHSE (Contributor, Author) commented Sep 21, 2014

@costin, I've fixed the issue, but there is no way to retrieve the value of the matched_queries field.

The ScrollReader.readHit method returns a list of two values: the document id and the document data. Maybe we should add a third value to the list for the document metadata (index, type, matched queries, ...)?
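
To illustrate, a rough sketch of what the proposed hit shape could look like (the Hit name and the types here are hypothetical, not the actual ScrollReader internals):

// Hypothetical illustration only: today a hit is roughly (id, source);
// the proposal adds a third element for the hit metadata.
case class Hit(
  id: String,                    // document _id
  source: Map[String, AnyRef],   // document _source
  metadata: Map[String, AnyRef]) // _index, _type, matched_queries, ...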

costin (Member) commented Sep 21, 2014

Can you explain why you need the matched_queries information? The metadata is potentially important for the results, but the question is how it would be mapped onto the returned result when some type of structure is involved (basically everywhere except M/R and Storm).

SHSE (Contributor, Author) commented Sep 21, 2014

@costin, we have an application where users create hierarchical queries:

  • Query A (for example: phone)
    • Sub-query B (for example: price, so the query is: phone AND price)
    • Sub-query C (for example: battery, so the query is: phone AND battery)

At some point we need to aggregate all matched documents (for all the queries). For now we do it separately for each query:

{ "query": { "query_string": { "query": "text:phone" } } }
{ "query": { "query_string": { "query": "text:(phone AND price)" } } }
{ "query": { "query_string": { "query": "text:(phone AND battery)" } } }

But since the result set for "Query A" contains the result sets for "Sub-query B" and "Sub-query C", we could load only the documents that matched "Query A":

{
   "query": {
      "bool": {
         "must": { "query_string": { "query": "text:phone", "_name": "Query A" } },
         "should": [
            { "query_string": { "query": "text:price", "_name": "Sub-query B" } },
            { "query_string": { "query": "text:battery", "_name": "Sub-query C" } }
         ]
      }
   }
}

So now we have only one result set instead of N.

costin (Member) commented Sep 22, 2014

Fair enough, but are you actually using the matched_queries value that comes back from the query (potentially to identify the type of the result set)?

SHSE (Contributor, Author) commented Sep 22, 2014

Yes, I am using the matched_queries attribute to categorize the resulting documents.

costin (Member) commented Sep 24, 2014

@SHSE Support has been added to master: set es.read.metadata to true and all the metadata will be returned under a dedicated _metadata field, whose name can be changed through es.read.metadata.field.
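
For example, reusing the reproduction above, a minimal sketch of reading with metadata enabled (assuming the metadata surfaces as an extra _metadata entry in each document map, with matched_queries inside it):

import org.elasticsearch.spark._ // brings esRDD into scope on SparkContext

context
  .esRDD(Map(
    "es.nodes" -> "localhost",
    "es.resource" -> "bar/docs",
    "es.read.metadata" -> "true", // ask for hit metadata alongside the source
    "es.query" -> "{\"query\":{\"query_string\":{\"query\":\"text:bar\",\"_name\":\"myquery\"}}}"
  ))
  .foreach { case (id, doc) =>
    // _metadata should carry _index, _type and, for named queries, matched_queries
    println(s"$id -> ${doc.get("_metadata")}")
  }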

costin (Member) commented Oct 7, 2014

Hi,

I'm closing this issue - if you are interested in metadata support, please see master (upcoming Beta2).

costin closed this as completed Oct 7, 2014