[SPARK-26979][PYTHON] Add missing string column name support for some SQL functions #23882

Closed
wants to merge 4 commits into from
32 changes: 22 additions & 10 deletions python/pyspark/sql/functions.py
@@ -37,8 +37,8 @@
 from pyspark.sql.udf import UserDefinedFunction, _create_udf


-def _create_function(name, doc=""):
-    """ Create a function for aggregator by name"""
+def _create_name_function(name, doc=""):
+    """ Create a function that takes a column name argument, by name"""
     def _(col):
         sc = SparkContext._active_spark_context
         jc = getattr(sc._jvm.functions, name)(col._jc if isinstance(col, Column) else col)
@@ -48,6 +48,17 @@ def _(col):
     return _


+def _create_function(name, doc=""):
+    """ Create a function that takes a Column object, by name"""
+    def _(col):
+        sc = SparkContext._active_spark_context
+        jc = getattr(sc._jvm.functions, name)(_to_java_column(col))
+        return Column(jc)
+    _.__name__ = name
+    _.__doc__ = doc
+    return _
+
+
 def _wrap_deprecated_function(func, message):
     """ Wrap the deprecated function to print out deprecation warnings"""
     def _(col):
@@ -85,13 +96,16 @@ def _():
     >>> df.select(lit(5).alias('height')).withColumn('spark_user', lit(True)).take(1)
     [Row(height=5, spark_user=True)]
     """
-_functions = {
+_name_functions = {
+    # name functions take a column name as their argument
     'lit': _lit_doc,
Member

Does lit take the string as a column name? How do we create a string literal?

Member

And what really is a "name function"?

Author

No, lit() takes a literal value and creates a column with that literal. That's what it's for. It doesn't make sense for it to accept a column name, given its nature. So if you give it a string it will create a column with that string literal.

The name "name function" is something I came up with just to distinguish these functions from the ones that take columns as input. They are defined by that distinction - they are "functions that take a column name as their argument", exclusively.
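
The distinction between the two factories can be sketched without a Spark session. This is only an illustration under stated assumptions: `FakeJvmFunctions`, the `Column` class, and this `_to_java_column` are simplified, hypothetical stand-ins for the py4j bridge and PySpark internals, not the real implementations.

```python
class Column(object):
    """Hypothetical stand-in for pyspark.sql.Column; wraps a description string."""
    def __init__(self, jc):
        self._jc = jc


class FakeJvmFunctions(object):
    """Hypothetical stand-in for sc._jvm.functions; methods return description strings."""
    @staticmethod
    def col(name):
        return "col(%s)" % name

    @staticmethod
    def lit(value):
        return "lit(%r)" % (value,)

    @staticmethod
    def upper(jcol):
        return "upper(%s)" % jcol


def _to_java_column(col):
    # Simplified assumption: unwrap a Column, resolve a string as a column name.
    if isinstance(col, Column):
        return col._jc
    return FakeJvmFunctions.col(col)


def _create_name_function(name):
    # Passes a non-Column argument through unchanged (a name for col, a value for lit).
    def _(col):
        jc = getattr(FakeJvmFunctions, name)(col._jc if isinstance(col, Column) else col)
        return Column(jc)
    _.__name__ = name
    return _


def _create_function(name):
    # Converts the argument to a column first, so a string means "column with this name".
    def _(col):
        return Column(getattr(FakeJvmFunctions, name)(_to_java_column(col)))
    _.__name__ = name
    return _


lit = _create_name_function("lit")
upper = _create_function("upper")

print(lit("value")._jc)    # the string is kept as a literal value
print(upper("value")._jc)  # the string is resolved as a column name
```

In this sketch the only difference between the two factories is whether a plain string reaches the JVM-side function untouched or is first turned into a column reference.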

Member

lit doesn't take a column name as its argument. Why did we put lit in the name function category?

Author

To be fair, lit is such a unique function semantically that it would warrant its own category. But since the implementation is exactly the same as those "name functions", I left it there for practical purposes.

Member

One (lit) of five (col, column, asc, desc, lit) doesn't sound like a special case, though. There should be a better category if 20% of the items don't fit the category.

     'col': 'Returns a :class:`Column` based on the given column name.',
     'column': 'Returns a :class:`Column` based on the given column name.',
     'asc': 'Returns a sort expression based on the ascending order of the given column name.',
     'desc': 'Returns a sort expression based on the descending order of the given column name.',
+}

+_functions = {
     'upper': 'Converts a string expression to upper case.',
     'lower': 'Converts a string expression to upper case.',
     'sqrt': 'Computes the square root of the specified float value.',
@@ -141,7 +155,7 @@ def _():
     'bitwiseNOT': 'Computes bitwise not.',
 }

-_functions_2_4 = {
+_name_functions_2_4 = {
     'asc_nulls_first': 'Returns a sort expression based on the ascending order of the given' +
                        ' column name, and null values return before non-null values.',
     'asc_nulls_last': 'Returns a sort expression based on the ascending order of the given' +
@@ -254,6 +268,8 @@ def _():
 _functions_deprecated = {
 }

+for _name, _doc in _name_functions.items():
+    globals()[_name] = since(1.3)(_create_name_function(_name, _doc))
 for _name, _doc in _functions.items():
     globals()[_name] = since(1.3)(_create_function(_name, _doc))
 for _name, _doc in _functions_1_4.items():
@@ -268,8 +284,8 @@ def _():
     globals()[_name] = since(2.1)(_create_function(_name, _doc))
 for _name, _message in _functions_deprecated.items():
     globals()[_name] = _wrap_deprecated_function(globals()[_name], _message)
-for _name, _doc in _functions_2_4.items():
-    globals()[_name] = since(2.4)(_create_function(_name, _doc))
+for _name, _doc in _name_functions_2_4.items():
+    globals()[_name] = since(2.4)(_create_name_function(_name, _doc))
 del _name, _doc


@@ -1437,10 +1453,6 @@ def hash(*cols):
     'ascii': 'Computes the numeric value of the first character of the string column.',
     'base64': 'Computes the BASE64 encoding of a binary column and returns it as a string column.',
     'unbase64': 'Decodes a BASE64 encoded string column and returns it as a binary column.',
-    'initcap': 'Returns a new string column by converting the first letter of each word to ' +
-               'uppercase. Words are delimited by whitespace.',
-    'lower': 'Converts a string column to lower case.',
-    'upper': 'Converts a string column to upper case.',
Member

This should have stayed in the string functions!

Author

initcap was also defined explicitly below, with better documentation, so I preserved that instance. lower and upper have been introduced in the 1.3 API, and were being defined there as well, so I prioritised that instance because it used the correct @since number. Arguably we could've used a separate mapping like string_functions_1_3 or something, but that didn't seem appropriate since the point of these is to minimise code.
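
The overwriting described here can be sketched in isolation. This is a minimal illustration, not the code in functions.py: the `since` and `_create_function` below are simplified stand-ins, `registry` plays the role of `globals()`, and the `_string_functions` mapping name and its `1.5` version tag are assumptions made for the example.

```python
def since(version):
    """Simplified stand-in for pyspark's since(): records the version on the function."""
    def deco(f):
        f.added_in = version
        return f
    return deco


def _create_function(name, doc=""):
    # Stand-in factory: the generated function only echoes its name and argument.
    def _(col):
        return "%s(%s)" % (name, col)
    _.__name__ = name
    _.__doc__ = doc
    return _


# Two mappings that both define 'lower' (docstrings are illustrative).
_functions = {"lower": "Converts a string expression to lower case."}
_string_functions = {"lower": "Converts a string column to lower case."}

registry = {}  # plays the role of globals() in this sketch
for _name, _doc in _functions.items():
    registry[_name] = since(1.3)(_create_function(_name, _doc))
for _name, _doc in _string_functions.items():
    registry[_name] = since(1.5)(_create_function(_name, _doc))

# The second loop replaced the first definition, including its version tag.
print(registry["lower"].added_in, registry["lower"].__doc__)
```

Whichever loop runs last determines the docstring and version tag that users see, which is why a name defined in two mappings ends up carrying only one of the two annotations.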

Member

I was referring to lower and upper. It looks like they were being overwritten by this definition, so if we keep them here, we keep the previous definition. Did you check which PR and JIRA added lower and upper?

Also, strictly speaking, this should not have been removed in this PR, since the PR doesn't aim to remove overwritten functions. As you said, we should avoid defining functions this way later.

Member

Also, if it was an exception, it should have been described explicitly.

    # name functions take a column name as their argument
    'lit': _lit_doc,

This doesn't make sense if you read the code from scratch.

Author

The java/scala API documentation says it was added in 1.3, but I just tracked down the JIRA/PR and it seems it actually was 1.0.

https://issues.apache.org/jira/browse/SPARK-1995
#936

As for removing overwritten functions, maybe it would've been better to make a separate PR, but the first fix did require removing them. When I changed the approach it seemed reasonable to keep the change, since the problem was obvious and easy to fix.

     'ltrim': 'Trim the spaces from left end for the specified string value.',
     'rtrim': 'Trim the spaces from right end for the specified string value.',
     'trim': 'Trim the spaces from both ends for the specified string column.',
Member

Did you not add the column name support because the Scala side has the signatures below?

  def trim(e: Column): Column
  def trim(e: Column, trimString: String): Column

That's not allowed in Python

>>> from pyspark.sql.functions import trim, lit
>>> spark.range(1).select(lit('a').alias("value")).select(trim("value", "a"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: _() takes exactly 1 argument (2 given)

Member

@HyukjinKwon Mar 18, 2019

This is what I initially expected when I asked to whitelist them, @asmello

Author

I don't understand what you're saying here. trim with 2 arguments was never supported in PySpark, and that's a separate issue. What this patch changes is that trim("value") is now supported, when it wasn't previously.
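
The two behaviours discussed here can be separated with a small sketch. It assumes a simplified stand-in for the generated wrapper (this `_create_function` only echoes its arguments and does not call into Spark), so it shows only the Python-side arity limit, not the real `trim`.

```python
def _create_function(name):
    # Stand-in for the generated wrapper: it accepts exactly one argument.
    def _(col):
        return "%s(%s)" % (name, col)
    _.__name__ = name
    return _


trim = _create_function("trim")

# A single argument, whether a Column or a column name, fits the wrapper.
print(trim("value"))

# A second argument has no parameter to bind to, so Python raises TypeError
# before anything could be forwarded to the JVM.
error = None
try:
    trim("value", "a")
except TypeError as exc:
    error = exc
    print("TypeError:", exc)
```

The exact wording of the TypeError message depends on the Python version, so the sketch only checks the exception type.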

Member

I was saying this because I didn't get why you excluded string functions.
