[SPARK-21538][SQL] Attribute resolution inconsistency in the Dataset API #18740

aokolnychyi · 2017-07-26T20:13:23Z

What changes were proposed in this pull request?

This PR contains a tiny update that removes an attribute resolution inconsistency in the Dataset API. The following example is taken from the ticket description:

spark.range(1).withColumnRenamed("id", "x").sort(col("id"))  // works
spark.range(1).withColumnRenamed("id", "x").sort($"id")  // works
spark.range(1).withColumnRenamed("id", "x").sort('id) // works
spark.range(1).withColumnRenamed("id", "x").sort("id") // fails with:
org.apache.spark.sql.AnalysisException: Cannot resolve column name "id" among (x);

The above AnalysisException happens because the last case calls Dataset.apply() to convert strings into columns, which triggers attribute resolution. To make the API consistent between overloaded methods, this PR defers the resolution and constructs columns directly.
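As a rough illustration of the difference described above (a sketch, not the actual patch): `Dataset.apply` looks a name up against the Dataset's current output columns as soon as it is called, while `functions.col` only records the name and leaves resolution to the analyzer. The object name `ResolutionSketch` and its helper methods are hypothetical and exist only for this example; a local Spark session is assumed.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import scala.util.Try

// Hypothetical helper object, for illustration only.
object ResolutionSketch {
  // Eager path: Dataset.apply resolves the name against df's output columns immediately,
  // so a name that is not in the output fails here, before any plan is built.
  def eagerColumn(df: DataFrame, name: String): Try[Column] = Try(df(name))

  // Deferred path: functions.col only wraps the name in an unresolved column;
  // it is resolved later, when a plan that uses it is analyzed.
  def deferredColumn(name: String): Column = col(name)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("resolution-sketch")
      .getOrCreate()
    try {
      val df = spark.range(1).withColumnRenamed("id", "x")
      // "id" is no longer an output column of df, so the eager lookup fails.
      println(s"eager lookup of id failed: ${eagerColumn(df, "id").isFailure}")
      // The deferred column is accepted by sort, as in the col("id") case above.
      df.sort(deferredColumn("id")).show()
    } finally {
      spark.stop()
    }
  }
}
```

Constructing the column without an eager lookup is what lets the string overload behave like the `col("id")`, `$"id"`, and `'id` variants.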

SparkQA · 2017-07-26T22:42:23Z

Test build #79973 has finished for PR 18740 at commit b0fbd5e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2017-07-27T04:24:15Z

Please add a test like #18744 did. Thanks.

viirya · 2017-07-27T04:25:34Z

LGTM except for test.

viirya · 2017-07-27T04:28:14Z

cc @cloud-fan @gatorsmile

gatorsmile · 2017-07-27T06:30:33Z

Thanks for fixing it! Please remember to add a test case whenever you fix anything else in the future. We really need to improve our test case coverage. BTW, you are welcome to contribute more test-only PRs.

gatorsmile · 2017-07-27T17:30:18Z

sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala

+    checkAnswer(df.sort($"id"), df.sort("id"))
+    checkAnswer(df.sort('id), df.sort("id"))
+    checkAnswer(df.orderBy('id), df.sort("id"))
+    checkAnswer(df.orderBy("id"), df.sort("id"))


Since it sounds like you are very interested in contributing to Spark, let me explain how we write test cases. Normally, we do not want to duplicate code, and we also want to verify the results. Thus, maybe you can change the code like this:

val df = spark.range(3).withColumnRenamed("id", "x")
val expected = Row(0) :: Row(1) :: Row(2) :: Nil
checkAnswer(df.sort("id"), expected)
checkAnswer(df.sort(col("id")), expected)
checkAnswer(df.sort($"id"), expected)
checkAnswer(df.sort('id), expected)
checkAnswer(df.orderBy('id), expected)
checkAnswer(df.orderBy("id"), expected)

Indeed, that looks much better. I appreciate the explanation and will take this into account in the future. I will update the test in a minute, thanks.

SparkQA · 2017-07-27T19:14:10Z

Test build #80007 has finished for PR 18740 at commit 309cb8f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-07-27T19:47:15Z

LGTM pending Jenkins

SparkQA · 2017-07-27T20:48:28Z

Test build #80009 has finished for PR 18740 at commit 0b0eea9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

## What changes were proposed in this pull request?

This PR contains a tiny update that removes an attribute resolution inconsistency in the Dataset API. The following example is taken from the ticket description:

```
spark.range(1).withColumnRenamed("id", "x").sort(col("id"))  // works
spark.range(1).withColumnRenamed("id", "x").sort($"id")  // works
spark.range(1).withColumnRenamed("id", "x").sort('id) // works
spark.range(1).withColumnRenamed("id", "x").sort("id") // fails with:
org.apache.spark.sql.AnalysisException: Cannot resolve column name "id" among (x);
```

The above `AnalysisException` happens because the last case calls `Dataset.apply()` to convert strings into columns, which triggers attribute resolution. To make the API consistent between overloaded methods, this PR defers the resolution and constructs columns directly.

Author: aokolnychyi <anton.okolnychyi@sap.com>

Closes #18740 from aokolnychyi/spark-21538.

(cherry picked from commit f44ead8)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>

gatorsmile · 2017-07-27T23:50:30Z

Thanks! Merging to master/2.2

[SPARK-21538][SQL] Fix attribute resolution inconsistency in the Data…

b0fbd5e

…set API

HyukjinKwon mentioned this pull request Jul 27, 2017

[SPARK-21538][SQL] Fix attribute resolution inconsistency in Dataset API #18744

Closed

[SPARK-21538][SQL] Added a test case

309cb8f

gatorsmile reviewed Jul 27, 2017

[SPARK-21538][SQL] Updated test case

0b0eea9

asfgit closed this in f44ead8 Jul 27, 2017
