Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Different result between sc.esRDD versus EsSpark.esRDD #687

Closed
randallwhitman opened this issue Feb 1, 2016 · 2 comments

Comments

@randallwhitman
Copy link

commented Feb 1, 2016

In a certain test, I'm seeing one result with EsSpark.esRDD but two results with sc.esRDD.

Can you reproduce the following result? (It may be that I am getting confused by instability in my development environment, though I've got the following result consistently multiple times.)

Elasticsearch-hadoop 2.2.0.BUILD-SNAPSHOT; Elasticsearch Java API 1.7.[14]; Elasticsearch server 1.6.2; Spark-1.6.0; CentOS-6.

    val rawCore = List( Map("colint" -> 1, "colstr" -> "s"),
                        Map("colint" -> null, "colstr" -> null) )
    sc.parallelize(rawCore, 1).saveToEs(coreResource, connectorConfig)
    val qjson =
      """{"query":{"range":{"colint":{"from":null,"to":"9","include_lower":true,"include_upper":true}}}}"""
    val rdd2 = EsSpark.esRDD(sc, coreResource, qjson, connectorConfig)
    val got = rdd2.collect().size   // 1
    val rdd2 = sc.esRDD(coreResource, qjson, connectorConfig)
    val got = rdd2.collect().size   // 2
costin added a commit that referenced this issue Feb 1, 2016
Add extra test
relates #687
@costin

This comment has been minimized.

Copy link
Member

commented Feb 1, 2016

Something is off and based on your code you are probably running two separate tests against the same stateful ES instance, resulting in the difference.
If you look at the source, you'll notice that SparkContext.esRDD is a simple alias for def esRDD(resource: String, query: String) = EsSpark.esRDD(sc, resource, query).
In other words, it is the exact same method, no more, no less...
I don't know what else to tell you except you're probably doing something wrong (make sure you are refreshing Elasticsearch before the count since otherwise you might see the count before and then after the refresh and that will affect the outcome).

@costin costin closed this Feb 1, 2016

@costin costin reopened this Feb 1, 2016

@randallwhitman

This comment has been minimized.

Copy link
Author

commented Feb 1, 2016

Agree that is strange because the methods are aliases.
I can make a note to look into the refresh API.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants
You can’t perform that action at this time.