[SPARK-5579][SQL][DataFrame] Support for project/filter using SQL expressions #4348

Closed
wants to merge 3 commits into from

Conversation

rxin
Contributor

@rxin rxin commented Feb 4, 2015

df.selectExpr("abs(colA)", "colB")
df.filter("age > 21")

@SparkQA

SparkQA commented Feb 4, 2015

Test build #26693 has started for PR 4348 at commit ac65f4b.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Feb 4, 2015

Test build #26693 has finished for PR 4348 at commit ac65f4b.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Dsl(object):
    • class ExamplePointUDT(UserDefinedType):
    • class SQLTests(ReusedPySparkTestCase):

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26693/

… to use UDFs

A more convenient way to define user-defined functions.

Author: Reynold Xin <rxin@databricks.com>

Closes apache#4345 from rxin/defineUDF and squashes the following commits:

639c0f8 [Reynold Xin] udf tests.
0a0b339 [Reynold Xin] defineUDF -> udf.
b452b8d [Reynold Xin] Fix UDF registration.
d2e42c3 [Reynold Xin] SQLContext.udf.register() returns a UserDefinedFunction also.
4333605 [Reynold Xin] [SQL][DataFrame] defineUDF.
…ressions.

e.g.

df.selectExpr("abs(colA)", "colB")

df.filter("age > 21")
@@ -2126,10 +2126,9 @@ def sort(self, *cols):
         """
         if not cols:
             raise ValueError("should sort by at least one column")
-        jcols = ListConverter().convert([_to_java_column(c) for c in cols[1:]],
+        jcols = ListConverter().convert([_to_java_column(c) for c in cols],
Contributor Author

@davies take a look at the Python changes.
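
For readers skimming the hunk above, the visible change is that the list comprehension now converts every entry of `cols` instead of only `cols[1:]`. The rest of the hunk is not shown, so whether the first column was previously handled separately cannot be told from this page. A minimal pure-Python sketch of the slicing difference, where `to_java_column`, `convert_tail`, and `convert_all` are hypothetical stand-ins and no JVM or Spark is involved:

```python
def to_java_column(col):
    """Hypothetical stand-in for PySpark's _to_java_column; it only tags the name."""
    return "jcol(%s)" % col


def convert_tail(*cols):
    """Mirrors the removed line: only the columns after the first are converted."""
    if not cols:
        raise ValueError("should sort by at least one column")
    return [to_java_column(c) for c in cols[1:]]


def convert_all(*cols):
    """Mirrors the added line: every requested column is converted."""
    if not cols:
        raise ValueError("should sort by at least one column")
    return [to_java_column(c) for c in cols]


print(convert_tail("age", "name"))  # "age" is not in this list
print(convert_all("age", "name"))   # both columns are in this list
```

With the old slice, the caller has to pass `cols[0]` along some other path; with the new line, the single converted list carries all sort columns.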

@SparkQA

SparkQA commented Feb 4, 2015

Test build #26723 has started for PR 4348 at commit 2baeef2.

  • This patch merges cleanly.

@@ -179,10 +179,20 @@ private[sql] class DataFrameImpl protected[sql](
     select((col +: cols).map(Column(_)) :_*)
   }

+  override def selectExpr(exprs: String*): DataFrame = {
Contributor

I think this one could be merged into select(), since a column name is also a valid expression.

Contributor Author

Not if the column name has a space in it ... it will just fail.

Contributor

It should work in these cases with this implementation.

select('*', 'a', '`the name`', 'a + 1', 'min(b) * 3')

Contributor Author

yea - but asking users to wrap a column name in backticks in strings is fairly annoying.

@davies
Contributor

davies commented Feb 4, 2015

select() and filter() in Python do not support expressions yet.

@SparkQA

SparkQA commented Feb 4, 2015

Test build #26723 has finished for PR 4348 at commit 2baeef2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26723/

@rxin
Contributor Author

rxin commented Feb 4, 2015

We can discuss more offline. For now let's keep this separate; otherwise it can be fairly annoying to use column names that contain spaces or SQL keywords.
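
To make the concern in this thread concrete: if every string passed to select() were parsed as an expression, plain column names with spaces or keyword names would need backticks. The sketch below is a toy model of that rule, not Spark's parser. `SQL_KEYWORDS`, `needs_backticks`, and `quote_for_expression` are hypothetical names, the keyword set is deliberately tiny, and names that themselves contain backticks are not handled.

```python
import re

# Hypothetical, deliberately small keyword list; a real SQL parser reserves many more.
SQL_KEYWORDS = {"select", "from", "where", "order", "group", "by", "and", "or", "not"}

# A bare identifier in this toy model: a letter or underscore, then letters, digits, underscores.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def needs_backticks(column_name):
    """Return True if the toy expression parser could not read this name as-is."""
    if _IDENTIFIER.fullmatch(column_name) is None:
        return True  # spaces or other non-identifier characters
    return column_name.lower() in SQL_KEYWORDS


def quote_for_expression(column_name):
    """Wrap the name in backticks only when the toy rule says it is required."""
    if needs_backticks(column_name):
        return "`%s`" % column_name
    return column_name


for name in ["age", "the name", "order"]:
    print(name, "->", quote_for_expression(name))
```

Keeping selectExpr() separate means select("the name") can keep treating its argument as a literal column name, and only callers who opt into expressions have to think about quoting.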

@asfgit asfgit closed this in 40c4cb2 Feb 4, 2015
asfgit pushed a commit that referenced this pull request Feb 4, 2015
…ressions

```scala
df.selectExpr("abs(colA)", "colB")
df.filter("age > 21")
```

Author: Reynold Xin <rxin@databricks.com>

Closes #4348 from rxin/SPARK-5579 and squashes the following commits:

2baeef2 [Reynold Xin] Fix Python.
b416372 [Reynold Xin] [SPARK-5579][SQL][DataFrame] Support for project/filter using SQL expressions.

(cherry picked from commit 40c4cb2)
Signed-off-by: Reynold Xin <rxin@databricks.com>