[SPARK-29821][SQL] Allow calling non-aggregate SQL functions with column name #26435
Can one of the admins verify this patch?
Can you file a JIRA first and add the JIRA ID to the title? See: https://spark.apache.org/contributing.html
    [Row(r1=False, r2=False), Row(r1=True, r2=True)]
    """
    sc = SparkContext._active_spark_context
    if type(col) is str:
This already seems to work:
>>> from pyspark.sql.functions import isnan
>>> df = spark.createDataFrame([(1.0, float('nan')), (float('nan'), 2.0)], ("a", "b"))
>>> df.select(isnan("a")).collect()
[Row(isnan(a)=False), Row(isnan(a)=True)]

   * @group normal_funcs
   * @since 1.6.0
   */
  def isnan(columnName: String): Column = isnan(Column(columnName))
We won't add this, per the comments at the top of this file.
spark/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
Lines 58 to 60 in f8b1424
 * This function APIs usually have methods with `Column` signature only because it can support not
 * only `Column` but also other types such as a native string. The other variants currently exist
 * for historical reasons.
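To illustrate the point of that comment: a single signature can accept either a column object or a bare column name if it normalizes its argument first, which is why separate string overloads are not needed. The following is a minimal sketch of that pattern only, not Spark's implementation. The `Column` class here is a hypothetical stand-in (it is not `pyspark.sql.Column`), `_to_column` is an illustrative helper name, and this toy `isnan` merely builds an expression string.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Column:
    """Hypothetical stand-in for a column expression (not pyspark.sql.Column)."""
    expr: str


def _to_column(col: Union[Column, str]) -> Column:
    """Treat a plain string as a column name; pass Column objects through."""
    if isinstance(col, str):
        return Column(col)
    if isinstance(col, Column):
        return col
    raise TypeError(f"expected Column or str, got {type(col).__name__}")


def isnan(col: Union[Column, str]) -> Column:
    """Toy isnan: builds an expression string instead of a real query plan."""
    return Column(f"isnan({_to_column(col).expr})")


# Both call styles produce the same expression.
by_name = isnan("a")
by_column = isnan(Column("a"))
print(by_name == by_column)
```

With this shape, adding a second `isnan(columnName)` variant would only duplicate what the one function already handles.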
What changes were proposed in this pull request?
I added overloads that accept a column name for the functions in the non-aggregate functions section of functions.scala:
- isnan(columnName: String): Column
- isnull(columnName: String): Column
- nanvl(col1Name: String, col2Name: String): Column
- negate(columnName: String): Column
- not(columnName: String): Column
- bitwiseNOT(columnName: String): Column
Why are the changes needed?
This pull request makes it possible to check for NaN values in the column x by calling isnan("x") instead of isnan($"x"). In PySpark: isnan("x") instead of isnan(col("x")). This way, users don't need to remember to convert the value to a column. This makes it consistent with other functions, such as sqrt, that can already be called with the column name.
Does this PR introduce any user-facing change?
Yes
See previous section.
How was this patch tested?
I couldn't find a test file where the SQL functions and the PySpark SQL functions are tested. Please point me in the right direction.