
Conversation

@drudim (Contributor) commented Feb 11, 2017

This PR fixes a `scala.MatchError` thrown by Spark SQL queries that include the `_metadata` field.
More details on the issue: #924
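
For reference, a minimal Scala sketch of the kind of query that hits the error (the index name, column name, and session setup are illustrative, not taken from the failing job):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("metadata-repro")
  .master("local[*]")
  .getOrCreate()

// Read from Elasticsearch with metadata enabled, so each row also carries the
// synthetic _metadata column provided by the connector.
val df = spark.read
  .format("org.elasticsearch.spark.sql")
  .option("es.read.metadata", "true")
  .load("spark-test/doc")

// Selecting _metadata alongside regular columns is what currently fails
// with scala.MatchError.
df.select("name", "_metadata").show()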

I tried to add an integration test for this case,
but the tests fail even for the unmodified code on branch `5.1`.

[x] I have signed the Contributor License Agreement (CLA)

@drudim drudim force-pushed the fix/spark-sql-metadata-field branch from 3e6db02 to f202946 on February 11, 2017 22:31
@jbaiera (Member) commented Apr 24, 2017

@drudim Sorry for the very delayed review. I'm taking a look through this right now, and it looks like it's written against 5.1. Could you rebase these changes on top of 5.x? There are a few recent upstream changes related to source field filtering that you may want to pull into this PR.

@jbaiera (Member) left a comment

I left some comments on the code. I think you're headed in a very good direction. I'd like to hear about what kind of testing problems you ran into as well to see if I can help out.

static final String DATA_SOURCE_KEEP_HANDLED_FILTERS = "es.internal.spark.sql.pushdown.keep.handled.filters";

// columns selected by Spark SQL query
static final String DATA_SOURCE_REQUIRED_COLUMNS = "es.internal.spark.sql.required.columns";
@jbaiera (Member) commented:

Could we change this to a different location? Like InternalConfigurationOptions?

@drudim (Contributor, author) commented:

Are you sure about this? It looks like all "es.internal.spark." options are located here. I feel like the set of columns selected by Spark SQL fits into the "es.internal.spark." group.

@jbaiera (Member) commented:

Yeah, this comment was my mistake originally. It's fine in this location.

}

val requiredFields = settings.getProperty(Utils.DATA_SOURCE_REQUIRED_COLUMNS)
val scrollFields = settings.getScrollFields()
@jbaiera (Member) commented:

This logic has changed a little bit since the PR was opened. I'd recommend rebasing the whole PR onto the most recent 5.x branch.

// By default, SearchRequestBuilder includes all fields, if INTERNAL_ES_TARGET_FIELDS is empty.
// To prevent us from querying useless data, we set _id field,
// which isn't a part of _source, but is allowed by SearchRequestBuilder
paramWithScan += (InternalConfigurationOptions.INTERNAL_ES_TARGET_FIELDS -> "_id")
@jbaiera (Member) commented:

I'm not sure I like relying on this hack. Perhaps we could find a way to signal that we don't want to read the _source field at all by sending _source=false with the request. That way we can avoid depending on obscure and potentially easy-to-break functionality in Elasticsearch.
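
For illustration, that suggestion amounts to sending a search body that turns off source fetching outright; a hedged sketch of the raw request body (shown here as a Scala string, not the connector's actual request-building code):

// With "_source": false Elasticsearch returns hits without any _source payload,
// so no placeholder field such as _id needs to be requested just to keep the
// target-fields list non-empty.
val noSourceSearchBody =
  """|{
     |  "_source": false,
     |  "query": { "match_all": {} }
     |}""".stripMargin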

@jbaiera (Member) commented May 9, 2017

@drudim Are you still interested in moving forward with the PR?

@drudim (Contributor, author) commented May 9, 2017

@jbaiera sure! I was on vacation and just got back. I am going to take a look at it this weekend.

@drudim drudim force-pushed the fix/spark-sql-metadata-field branch from f202946 to b244d9f on May 14, 2017 06:25
@drudim drudim changed the base branch from 5.1 to 5.x on May 14, 2017 06:25
@drudim drudim force-pushed the fix/spark-sql-metadata-field branch 2 times, most recently from ce61988 to cd96841 on May 15, 2017 00:13
@drudim (Contributor, author) left a comment

@jbaiera ready for review; the following things changed:

  • rebased the branch on 5.x to use `determineSourceFields` instead of `getScrollFields`
  • introduced the `es.internal.exclude.source` option to handle queries with an empty `_source` field

Let me know what you think about the logic around excludeSource. We avoided the hack with _id, but it introduces new parts that I am not confident in (from a design perspective).
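
A rough sketch of the intended wiring (assumptions, not the exact code in this PR): when the query needs nothing from _source at all, set the new flag instead of falling back to the _id placeholder.

import scala.collection.mutable.LinkedHashMap

// Stand-in for the paramWithScan map used in DefaultSource (see the snippet
// quoted earlier in this review); names below are illustrative.
val paramWithScan = LinkedHashMap.empty[String, String]

// Columns the Spark SQL query actually needs from _source; empty when only
// metadata columns such as _metadata were selected.
val requiredSourceColumns: Seq[String] = Seq.empty

if (requiredSourceColumns.isEmpty) {
  // Skip fetching _source altogether instead of requesting a placeholder
  // field like _id.
  paramWithScan += ("es.internal.exclude.source" -> "true")
}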


@drudim drudim force-pushed the fix/spark-sql-metadata-field branch from cd96841 to 2cd8aaf on May 15, 2017 00:32
@drudim (Contributor, author) commented May 15, 2017

> I'd like to hear about what kind of testing problems you ran into as well to see if I can help out.

I tried to run:

./gradlew -Pscala=211 -Pdistro=hadoopYarn :elasticsearch-spark-20:integrationTest

and got:

:elasticsearch-spark-20:integrationTest

Gradle Test Run :elasticsearch-spark-20:integrationTest > Gradle Test Executor 2 > org.elasticsearch.spark.integration.SparkScalaSuite STARTED

Gradle Test Run :elasticsearch-spark-20:integrationTest > Gradle Test Executor 2 > org.elasticsearch.spark.integration.SparkSQLScalaSuite STARTED

Gradle Test Run :elasticsearch-spark-20:integrationTest > Gradle Test Executor 2 > org.elasticsearch.spark.integration.SparkSQLScalaSuite FAILED

Gradle Test Run :elasticsearch-spark-20:integrationTest > Gradle Test Executor 2 > org.elasticsearch.spark.integration.SparkSQLSuite STARTED

Gradle Test Run :elasticsearch-spark-20:integrationTest > Gradle Test Executor 2 > org.elasticsearch.spark.integration.SparkSQLSuite FAILED

Gradle Test Run :elasticsearch-spark-20:integrationTest > Gradle Test Executor 2 > org.elasticsearch.spark.integration.SparkStreamingScalaSuite STARTED

Gradle Test Run :elasticsearch-spark-20:integrationTest > Gradle Test Executor 2 > org.elasticsearch.spark.integration.SparkStreamingScalaSuite FAILED

Gradle Test Run :elasticsearch-spark-20:integrationTest > Gradle Test Executor 2 > org.elasticsearch.spark.integration.SparkStreamingSuite STARTED

Gradle Test Run :elasticsearch-spark-20:integrationTest > Gradle Test Executor 2 > org.elasticsearch.spark.integration.SparkStreamingSuite FAILED

Gradle Test Run :elasticsearch-spark-20:integrationTest > Gradle Test Executor 2 > org.elasticsearch.spark.integration.SparkSuite STARTED

Gradle Test Run :elasticsearch-spark-20:integrationTest > Gradle Test Executor 2 > org.elasticsearch.spark.sql.UtilsTest STARTED

58 tests completed, 8 failed, 1 skipped

And nothing interesting inside of /spark/sql-20/build/test-results/TEST-org.elasticsearch.spark.integration.SparkSQLSuite.xml.

@drudim drudim closed this May 15, 2017
@drudim drudim reopened this May 15, 2017
@jbaiera (Member) commented May 15, 2017

When the build executes, test failures are logged to files in the build directory. You'll need to grab the error logs from there.

@drudim (Contributor, author) commented May 15, 2017

Found the reason why SparkSQLSuite.xml was empty: it's derived from AbstractJavaEsSparkSQLTest.xml. Here is the error: https://gist.github.com/drudim/38faf2d3cc36b2e1a08f4779eee1ce8e

@jbaiera (Member) commented May 15, 2017

I actually just got bit by that exact same problem this morning. The problem here is that we import Hadoop 1.2.1 for tests, which is a problem for Spark 2.0 in some cases. We have a shim implemented for the `ShutdownHookManager` in the spark-20 integTest source root, but it does not include the `isShutdownInProgress` method. You may need to leave the `-Pdistro=hadoopYarn` bit off the testing command for the time being. Alternatively, you could see if it works by adding that method to our mock object.

@drudim (Contributor, author) commented May 15, 2017

That's cool, I will try to extend the tests with my case later today. Let me know if you are OK with the new option to exclude the `_source` field.

@jbaiera (Member) commented May 15, 2017

@drudim Will do, I'll take a look at your changes and comments today.

@drudim (Contributor, author) commented May 16, 2017

@jbaiera added the test for the `_metadata` use case. Ideally other parts should be covered as well (like sql-1.3 and `ScrollReader`), so I need more time.

@drudim (Contributor, author) commented May 17, 2017

@jbaiera done:

@jbaiera (Member) left a comment

Thanks for the updates. I left a few more comments about simplifying some code and a couple sanity assertions. This is looking very good.

}

public SearchRequestBuilder excludeSource(boolean value) {
this.excludeSource = value;
@jbaiera (Member) commented:

Could we add an assertion here that fields cannot be set if excludeSource is set, and vice-versa?
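
For illustration, a guard along these lines (a minimal Scala sketch of the mutual exclusion; the connector's actual SearchRequestBuilder is Java, and the field and method names here are assumptions):

// Requesting specific fields and excluding _source are mutually exclusive.
class SearchRequestBuilderSketch {
  private var fields: String = ""
  private var excludeSource: Boolean = false

  def withFields(value: String): this.type = {
    require(!excludeSource || value.isEmpty,
      "Fields cannot be requested when _source is excluded")
    fields = value
    this
  }

  def withExcludeSource(value: Boolean): this.type = {
    require(!value || fields.isEmpty,
      "_source cannot be excluded when specific fields are requested")
    excludeSource = value
    this
  }
}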



// In case when we read all fields without metadata
else if (StringUtils.hasText(sourceFields))
rowInfo._1.setProperty(ROOT_LEVEL_NAME, sourceFields)
@jbaiera (Member) commented:

Curly braces please


// In case when we read all fields including metadata
else if (StringUtils.hasText(sourceFields) && settings.getReadMetadata)
rowInfo._1.setProperty(ROOT_LEVEL_NAME, sourceFields + StringUtils.DEFAULT_DELIMITER + settings.getReadMetadataField)
@jbaiera (Member) commented:

This line is a little confusing to me: Is there ever a point in time where any of the else statements will be executed? I believe that with the current code requiredFields will ALWAYS be set with text, and if it is not, then sourceFields will also be empty. I think this could be simplified to just be:

val requiredFields = // get required fields
if (StringUtils.hasText(requiredFields)) {
  // set the rowInfo._1 properties
}

In its current state it makes me wonder whether, by the time we reach this code, something earlier in the connector might not have been set up entirely, but I don't see where that could happen....

@drudim (Contributor, author) commented:

Yes, you are right, one condition is enough. My original assumption was:

if (StringUtils.hasText(requiredFields)) {
   // for queries like dataFrame.select("specific", "fields").take(1)
} else if (StringUtils.hasText(sourceFields) && settings.getReadMetadata) {
   // for queries like dataFrame.take(1) when "es.read.metadata" is set to true
} else if (StringUtils.hasText(sourceFields)) {
   // for queries like dataFrame.take(1) when "es.read.metadata" is set to false
}

But that isn't the case, because of the way we set requiredFields.


// In case when we read all fields without metadata
else if (StringUtils.hasText(sourceFields))
rowInfo._1.setProperty(ROOT_LEVEL_NAME, sourceFields)
@jbaiera (Member) commented:

Curly braces please

rowInfo._1.setProperty(ROOT_LEVEL_NAME, requiredFields)

// In case when we read all fields including metadata
else if (StringUtils.hasText(sourceFields) && settings.getReadMetadata)
@jbaiera (Member) commented:

Same confusion about simplifying the conditions here as in the sql-13 package.

@drudim (Contributor, author) commented May 21, 2017

@jbaiera, ready for review:

  • fixed style issues
  • added assertions for excludeSource scenario (+tests)
  • removed useless conditions in SchemaUtils

@jbaiera (Member) left a comment

LGTM! Thanks so much @drudim for the contribution! I'll merge this and forward port it to master.

@jbaiera jbaiera merged commit 3b67b57 into elastic:5.x May 24, 2017
jbaiera pushed a commit that referenced this pull request Aug 3, 2017
This PR fixes a `scala.MatchError` in Spark SQL queries that include the `_metadata` field.
When Spark SQL reads a metadata field, the data source removes the metadata
field from the given fields and re-adds it at the end, because the metadata
field is an abstract field provided by the scroll reader rather than one that is
explicitly requested. With the field removed and re-appended, the order of fields in the
projection returned by the scroll reader no longer matches the order of fields in the
Spark execution plan, and a match error is thrown. This PR now preserves the ordering of fields.
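
To make the ordering problem concrete, a small illustration (column names invented; the actual fix lives in the connector's schema handling):

val planColumns   = Seq("name", "_metadata", "age")  // order in the Spark execution plan
val metadataField = "_metadata"

// Previous behaviour (sketch): drop the metadata column and append it at the end,
// so the scroll reader's projection no longer matches the plan's ordering and
// pattern matching on the resulting rows throws scala.MatchError.
val previousOrdering = planColumns.filterNot(_ == metadataField) :+ metadataField
// previousOrdering == Seq("name", "age", "_metadata")

// Behaviour after the fix (sketch): the projection keeps the plan's ordering.
val preservedOrdering = planColumns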
jbaiera pushed a commit that referenced this pull request Aug 7, 2017
@jbaiera jbaiera mentioned this pull request Feb 5, 2018