[SPARK-26979][PYTHON][FOLLOW-UP] Make binary math/string functions take string as columns as well #24121

Closed
wants to merge 5 commits into from

HyukjinKwon
Member

@HyukjinKwon HyukjinKwon commented Mar 18, 2019

What changes were proposed in this pull request?

This is a followup of #23882 to handle binary math/string functions. For instance, see the cases below:

Before:

>>> from pyspark.sql.functions import lit, ascii
>>> spark.range(1).select(lit('a').alias("value")).select(ascii("value"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/functions.py", line 51, in _
    jc = getattr(sc._jvm.functions, name)(col._jc if isinstance(col, Column) else col)
  File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1286, in __call__
  File "/.../spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py", line 332, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.ascii. Trace:
py4j.Py4JException: Method ascii([class java.lang.String]) does not exist
	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:339)
	at py4j.Gateway.invoke(Gateway.java:276)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
>>> from pyspark.sql.functions import atan2
>>> spark.range(1).select(atan2("id", "id"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/functions.py", line 78, in _
    jc = getattr(sc._jvm.functions, name)(col1._jc if isinstance(col1, Column) else float(col1),
ValueError: could not convert string to float: id

After:

>>> from pyspark.sql.functions import lit, ascii
>>> spark.range(1).select(lit('a').alias("value")).select(ascii("value"))
DataFrame[ascii(value): int]
>>> from pyspark.sql.functions import atan2
>>> spark.range(1).select(atan2("id", "id"))
DataFrame[ATAN2(id, id): double]

Note that,

  • This PR causes a slight behaviour change for math functions. For instance, numbers as strings (e.g., "1") were previously supported as arguments of binary math functions. After this PR, such strings are recognised as column names.

  • I also intentionally didn't document this behaviour change, since we're going ahead with Spark 3.0 and I don't think numbers as strings make much sense in math functions.

  • There is another exception, when, which takes strings as literal values, as shown below. This PR doesn't fix this ambiguity.

    >>> spark.range(1).select(when(lit(True), col("id"))).show()
    +--------------------------+
    |CASE WHEN true THEN id END|
    +--------------------------+
    |                         0|
    +--------------------------+
    
    >>> spark.range(1).select(when(lit(True), "id")).show()
    +--------------------------+
    |CASE WHEN true THEN id END|
    +--------------------------+
    |                        id|
    +--------------------------+
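
The behaviour change for binary math functions can be sketched without a Spark session. The snippet below is only an illustrative stand-in, not the code in functions.py: Column, old_arg and new_arg are hypothetical names, the tagged tuples replace real JVM objects, and the float fallback for non-string values in new_arg is an assumption rather than something stated in this description.

```python
class Column:
    """Hypothetical stand-in for pyspark.sql.Column; it only holds a name."""

    def __init__(self, name):
        self.name = name


def old_arg(arg):
    # Before: anything that is not a Column is coerced with float(),
    # so "1" becomes 1.0 and a column name such as "id" raises ValueError.
    if isinstance(arg, Column):
        return ("column", arg.name)
    return ("literal", float(arg))


def new_arg(arg):
    # After: a string is treated as a column name.
    if isinstance(arg, Column):
        return ("column", arg.name)
    if isinstance(arg, str):
        return ("column", arg)
    # Assumption: other values (ints, floats) are still coerced to float.
    return ("literal", float(arg))
```

Under this sketch, old_arg("1") yields a float literal while new_arg("1") yields a column reference, and old_arg("id") raises ValueError in the same way as the atan2 traceback in the "Before" output.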
    

This PR also makes the following change.

#23882 changed the code to:

  • Rename _create_function to _create_name_function.
  • Define new _create_function to take strings as column names.

In this PR, I propose to:

  • Revert _create_name_function name to _create_function.
  • Define new _create_function_over_column to take strings as column names.
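
The split between the two factories can be illustrated roughly as follows. Only the names _create_function and _create_function_over_column come from this description; the bodies, the Column stand-in, FakeJvmFunctions and _to_column are hypothetical and exist only so that the sketch runs without Spark or py4j.

```python
class Column:
    """Hypothetical stand-in for pyspark.sql.Column."""

    def __init__(self, name):
        self.name = name


class FakeJvmFunctions:
    """Hypothetical stand-in for the JVM functions object reached through py4j."""

    @staticmethod
    def ascii(col):
        # Mimics a JVM method that only has a Column overload.
        if not isinstance(col, Column):
            raise TypeError(
                "Method ascii([class %s]) does not exist" % type(col).__name__)
        return Column("ascii(%s)" % col.name)


def _to_column(col):
    # Interpret a string as a column name; pass Column objects through.
    return col if isinstance(col, Column) else Column(col)


def _create_function(name):
    # Passes the argument to the JVM-side function unchanged.
    def _(col):
        return getattr(FakeJvmFunctions, name)(col)
    _.__name__ = name
    return _


def _create_function_over_column(name):
    # Converts a string argument to a column before the call.
    def _(col):
        return getattr(FakeJvmFunctions, name)(_to_column(col))
    _.__name__ = name
    return _


ascii_passthrough = _create_function("ascii")
ascii_over_column = _create_function_over_column("ascii")
```

In this sketch, ascii_over_column("value") succeeds because the string is wrapped first, whereas ascii_passthrough("value") fails in the stand-in with an error comparable to the "Method ascii([class java.lang.String]) does not exist" message in the "Before" output.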

How was this patch tested?

Some unit tests were added for binary math / string functions.

@HyukjinKwon
Member Author

@asmello, @rxin, @srowen, @cloud-fan, this is the followup of #23882 to address my own comments.

If the discussion in this PR is going to take longer, I would like to revert #23882.

@SparkQA

SparkQA commented Mar 18, 2019

Test build #103599 has finished for PR 24121 at commit a0574d6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 18, 2019

Test build #103604 has finished for PR 24121 at commit 9f65807.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asmello

asmello commented Mar 18, 2019

I'm definitely in favour of the change in how math functions work. Somehow I missed that cast to float, so I thought they just passed the argument directly when given a string (which would've worked, since the Scala API has string overloads for all math functions). So this is something I should've included in my PR.

I'll add some inline comments for stuff I disagree with.

@SparkQA

SparkQA commented Mar 18, 2019

Test build #103612 has finished for PR 24121 at commit 7f78cc4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

Assuming this is the direction we want, I'll merge this one if there are no more comments this week.

Adding @holdenk, @felixcheung, @BryanCutler, @ueshin, @viirya for visibility.

Review thread on python/pyspark/sql/functions.py (outdated, resolved)
Review thread on python/pyspark/sql/tests/test_functions.py (outdated, resolved)
@SparkQA

SparkQA commented Mar 19, 2019

Test build #103670 has finished for PR 24121 at commit ccbe51e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@srowen srowen left a comment

Just add a quick release-notes tag and Docs text to the JIRA for the one potential behavior change. This looks good.


@asmello asmello left a comment

Looks good!

@HyukjinKwon
Member Author

Merged to master.

@HyukjinKwon HyukjinKwon deleted the SPARK-26979 branch March 3, 2020 01:18