
Issue with named queries #273

Closed · SHSE opened this issue Sep 20, 2014 · 9 comments

SHSE (Contributor) commented Sep 20, 2014

It throws an exception when the _name attribute is set in the query:

elasticsearch.hadoop.EsHadoopIllegalArgumentException: expected object, found START_ARRAY
        org.elasticsearch.hadoop.util.Assert.isTrue(Assert.java:50)
        org.elasticsearch.hadoop.serialization.ScrollReader.readHit(ScrollReader.java:107)
        org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:99)
        org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:78)
        org.elasticsearch.hadoop.rest.RestRepository.scroll(RestRepository.java:297)
        org.elasticsearch.hadoop.rest.ScrollQuery.hasNext(ScrollQuery.java:76)
        org.elasticsearch.spark.rdd.AbstractEsRDDIterator.hasNext(AbstractEsRDDIterator.scala:46)
        scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        scala.collection.Iterator$class.foreach(Iterator.scala:727)
        scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:759)
        org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:759)
        org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
        org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
        org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
        org.apache.spark.scheduler.Task.run(Task.scala:54)
        org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
        java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        java.lang.Thread.run(Thread.java:744)
Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
        at akka.actor.ActorCell.invoke(ActorCell.scala:456)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

costin (Member) commented Sep 20, 2014

@SHSE can you provide a bit more context and the query you are trying to execute? Sample JSON/logs would be great.

SHSE (Contributor, Author) commented Sep 21, 2014

@costin here is how to reproduce it.

Elasticsearch 1.3.2:

$ curl localhost:9200/bar/docs/1 -d'{"text":"bar"}'

Spark 1.1.0, Elasticsearch Hadoop 2.1.0.BUILD-20140920.154752-136:

context
  .esRDD(Map(
    "es.nodes" -> "localhost",
    "es.resource" -> "bar/docs",
    "es.query" -> "{\"query\":{\"query_string\":{\"query\":\"text:bar\",\"_name\":\"myquery\"}}}"
  ))
  .count()

SHSE (Contributor, Author) commented Sep 21, 2014

@costin, I've fixed the issue, but there is no way to retrieve the value of the matched_queries field.

The ScrollReader.readHit method returns a list of two values: the document id and the document data. Maybe we should add a third value to the list for the document metadata (index, type, matched queries, ...)?
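
To illustrate, a rough sketch of what the proposed hit shape could look like (the Hit name and the types here are hypothetical, not the actual ScrollReader internals):

// Hypothetical illustration only: today a hit is roughly (id, source);
// the proposal adds a third element for the hit metadata.
case class Hit(
  id: String,                    // document _id
  source: Map[String, AnyRef],   // document _source
  metadata: Map[String, AnyRef]) // _index, _type, matched_queries, ...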

costin (Member) commented Sep 21, 2014

Can you explain why you need the matched_queries information? The metadata is potentially important for the results, but the question is how it would be mapped onto the returned result when some type of structure is involved (basically everywhere except M/R and Storm).

SHSE (Contributor, Author) commented Sep 21, 2014

@costin, we have an application where users create hierarchical queries:

  • Query A (for example: phone)
    • Sub-query B (for example: price, so the query is: phone AND price)
    • Sub-query C (for example: battery, so the query is: phone AND battery)

At some point we need to aggregate all matched documents (for all the queries). For now we do it separately for each query:

{ "query": { "query_string": { "query": "text:phone" } } }
{ "query": { "query_string": { "query": "text:(phone AND price)" } } }
{ "query": { "query_string": { "query": "text:(phone AND battery)" } } }

But since the result set for "Query A" contains the result sets for "Sub-query B" and "Sub-query C", we could load only the documents that matched "Query A":

{
   "query": {
      "bool": {
         "must": { "query_string": { "query": "text:phone", "_name": "Query A" } },
         "should": [
            { "query_string": { "query": "text:price", "_name": "Sub-query B" } },
            { "query_string": { "query": "text:battery", "_name": "Sub-query C" } }
         ]
      }
   }
}

So now we have only one result set instead of N.

costin (Member) commented Sep 22, 2014

Fair enough, but are you actually using the matched_queries value that comes back from the query (potentially to identify the type of the result set)?

SHSE (Contributor, Author) commented Sep 22, 2014

Yes, I am using the matched_queries attribute to categorize the resulting documents.

costin (Member) commented Sep 24, 2014

@SHSE Support has been added to master: set es.read.metadata to true and all the metadata will be returned under a dedicated _metadata field, whose name can be changed through es.read.metadata.field.
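
For example, reusing the reproduction above, a minimal sketch of reading with metadata enabled (assuming the metadata surfaces as an extra _metadata entry in each document map, with matched_queries inside it):

import org.elasticsearch.spark._ // brings esRDD into scope on SparkContext

context
  .esRDD(Map(
    "es.nodes" -> "localhost",
    "es.resource" -> "bar/docs",
    "es.read.metadata" -> "true", // ask for hit metadata alongside the source
    "es.query" -> "{\"query\":{\"query_string\":{\"query\":\"text:bar\",\"_name\":\"myquery\"}}}"
  ))
  .foreach { case (id, doc) =>
    // _metadata should carry _index, _type and, for named queries, matched_queries
    println(s"$id -> ${doc.get("_metadata")}")
  }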

costin (Member) commented Oct 7, 2014

Hi,

I'm closing this issue - if you are interested in metadata support, please see master (upcoming Beta2).

costin closed this as completed Oct 7, 2014