[SPARK-21538][SQL] Attribute resolution inconsistency in the Dataset API #18740

Closed
wants to merge 3 commits into from

Contributor

@aokolnychyi aokolnychyi commented Jul 26, 2017

What changes were proposed in this pull request?

This PR contains a tiny update that removes an attribute resolution inconsistency in the Dataset API. The following example is taken from the ticket description:

// Assumes a spark-shell session, where col, $"..." and Symbol-to-Column conversions are in scope.
spark.range(1).withColumnRenamed("id", "x").sort(col("id"))  // works
spark.range(1).withColumnRenamed("id", "x").sort($"id")      // works
spark.range(1).withColumnRenamed("id", "x").sort('id)        // works
spark.range(1).withColumnRenamed("id", "x").sort("id")       // fails with:
// org.apache.spark.sql.AnalysisException: Cannot resolve column name "id" among (x);

The above AnalysisException happens because the last case calls Dataset.apply() to convert strings into columns, which triggers attribute resolution. To make the API consistent between overloaded methods, this PR defers the resolution and constructs columns directly.
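
The difference between eager and deferred resolution can be reproduced from user code. The following is a minimal sketch, not the patch itself. It assumes Spark SQL is on the classpath and a local session is acceptable; the object name `Spark21538Sketch` and the helper `failsEagerly` are hypothetical names chosen for illustration.

```scala
import org.apache.spark.sql.{AnalysisException, DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import scala.util.{Failure, Try}

// Hypothetical demo object; none of these names come from the PR.
object Spark21538Sketch {

  // True when df(name), i.e. Dataset.apply, rejects the name at call time.
  def failsEagerly(df: DataFrame, name: String): Boolean =
    Try(df(name)) match {
      case Failure(_: AnalysisException) => true
      case _                             => false
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("spark-21538-sketch")
      .getOrCreate()

    val df = spark.range(1).withColumnRenamed("id", "x")

    // Dataset.apply looks the name up in df's current output (only "x"),
    // so the lookup fails before any sort is built.
    println(s"df(id) fails eagerly: ${failsEagerly(df, "id")}")

    // col("id") only records the name; no lookup against df happens here.
    val unresolved = col("id")

    // Per the PR description, sorting by such a column works.
    df.sort(unresolved).show()

    spark.stop()
  }
}
```

The point of the sketch is that `df("id")` and `col("id")` are not interchangeable: the first is checked against the Dataset immediately, while the second leaves resolution to the analyzer. This is why the string overload of `sort` behaved differently from the `Column` overloads before this change.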

@SparkQA

SparkQA commented Jul 26, 2017

Test build #79973 has finished for PR 18740 at commit b0fbd5e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member

viirya commented Jul 27, 2017

Please add a test like #18744 did. Thanks.

@viirya
Member

viirya commented Jul 27, 2017

LGTM except for test.

@viirya
Member

viirya commented Jul 27, 2017

cc @cloud-fan @gatorsmile

@gatorsmile
Member

Thanks for fixing it! In the future, please remember to add a test case whenever you fix something. We really need to improve our test coverage. BTW, you are welcome to contribute more test-only PRs.

checkAnswer(df.sort($"id"), df.sort("id"))
checkAnswer(df.sort('id), df.sort("id"))
checkAnswer(df.orderBy('id), df.sort("id"))
checkAnswer(df.orderBy("id"), df.sort("id"))
Member

Since it sounds like you are very interested in contributing to Spark, let me explain how we write test cases. Normally, we do not want to duplicate code, and we also want to verify the actual results. So maybe you can change the code to something like:

    val df = spark.range(3).withColumnRenamed("id", "x")
    val expected = Row(0) :: Row(1) :: Row(2) :: Nil
    checkAnswer(df.sort("id"), expected)
    checkAnswer(df.sort(col("id")), expected)
    checkAnswer(df.sort($"id"), expected)
    checkAnswer(df.sort('id), expected)
    checkAnswer(df.orderBy('id), expected)
    checkAnswer(df.orderBy("id"), expected)

Contributor Author

Indeed, this looks much better. I appreciate the explanation and will take it into account in the future. I will update the test in a minute, thanks.

@SparkQA

SparkQA commented Jul 27, 2017

Test build #80007 has finished for PR 18740 at commit 309cb8f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

LGTM pending Jenkins

@SparkQA

SparkQA commented Jul 27, 2017

Test build #80009 has finished for PR 18740 at commit 0b0eea9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Jul 27, 2017
## What changes were proposed in this pull request?

This PR contains a tiny update that removes an attribute resolution inconsistency in the Dataset API. The following example is taken from the ticket description:

```
spark.range(1).withColumnRenamed("id", "x").sort(col("id"))  // works
spark.range(1).withColumnRenamed("id", "x").sort($"id")  // works
spark.range(1).withColumnRenamed("id", "x").sort('id) // works
spark.range(1).withColumnRenamed("id", "x").sort("id") // fails with:
org.apache.spark.sql.AnalysisException: Cannot resolve column name "id" among (x);
```
The above `AnalysisException` happens because the last case calls `Dataset.apply()` to convert strings into columns, which triggers attribute resolution. To make the API consistent between overloaded methods, this PR defers the resolution and constructs columns directly.

Author: aokolnychyi <anton.okolnychyi@sap.com>

Closes #18740 from aokolnychyi/spark-21538.

(cherry picked from commit f44ead8)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
@gatorsmile
Member

Thanks! Merging to master/2.2

@asfgit asfgit closed this in f44ead8 Jul 27, 2017
MatthewRBruce pushed a commit to Shopify/spark that referenced this pull request Jul 31, 2018