@MaxHaertwig MaxHaertwig commented Nov 8, 2019

What changes were proposed in this pull request?

I added overloads that take a column name (String) for the functions in the non-aggregate functions section of functions.scala.

  • isnan(columnName: String): Column
  • isnull(columnName: String): Column
  • nanvl(col1Name: String, col2Name: String): Column
  • negate(columnName: String): Column
  • not(columnName: String): Column
  • bitwiseNOT(columnName: String): Column

Why are the changes needed?

This pull request makes it possible to check for NaN values in column x by calling isnan("x") instead of isnan($"x"), or in PySpark isnan("x") instead of isnan(col("x")). Users then don't need to remember to convert the name to a column first. This is consistent with other functions, such as sqrt, that can already be called with a column name.
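
To illustrate the idea outside of Spark, here is a minimal plain-Python sketch of name-to-column coercion. Everything in it is a hypothetical stand-in: `Column`, `to_column`, and this `isnan` are toy definitions for illustration, not Spark's classes or functions, and the real implementation may look quite different.

```python
# Illustrative stand-in only: `Column`, `to_column`, and this `isnan` are
# hypothetical and are not Spark's classes or functions.
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Column:
    """Toy expression node that just records its SQL-like text."""
    expr: str


def to_column(col: Union[str, Column]) -> Column:
    """Treat a plain string as a column name; pass Column objects through."""
    return Column(col) if isinstance(col, str) else col


def isnan(col: Union[str, Column]) -> Column:
    """Build an isnan(...) expression from a column name or a Column."""
    return Column(f"isnan({to_column(col).expr})")


# Both call styles produce the same expression.
by_name = isnan("x")
by_column = isnan(Column("x"))
```

In this toy model the caller can pass either form, because the coercion happens once inside the helper rather than at every call site.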

Does this PR introduce any user-facing change?

Yes
See previous section.

How was this patch tested?

I couldn't find a test file where the SQL functions and PySpark SQL functions are tested. Please point me in the right direction.

@AmplabJenkins

Can one of the admins verify this patch?

@maropu
Member

maropu commented Nov 9, 2019

Can you file a JIRA ticket first and add the JIRA ID to the title? See: https://spark.apache.org/contributing.html

@MaxHaertwig MaxHaertwig changed the title from "Allow calling non-aggregate SQL functions with column name" to "[SPARK-29821][SQL] Allow calling non-aggregate SQL functions with column name" Nov 9, 2019
[Row(r1=False, r2=False), Row(r1=True, r2=True)]
"""
sc = SparkContext._active_spark_context
if type(col) is str:
Member

This already seems to work.

Member

>>> from pyspark.sql.functions import isnan
>>> df = spark.createDataFrame([(1.0, float('nan')), (float('nan'), 2.0)], ("a", "b"))
>>> df.select(isnan("a")).collect()
[Row(isnan(a)=False), Row(isnan(a)=True)]

* @group normal_funcs
* @since 1.6.0
*/
def isnan(columnName: String): Column = isnan(Column(columnName))
Member

We won't add this, per the comments at the top of this file.

Member

* This function APIs usually have methods with `Column` signature only because it can support not
* only `Column` but also other types such as a native string. The other variants currently exist
* for historical reasons.

5 participants