[SPARK-26979][PYTHON] Add missing string column name support for some SQL functions #23882
Conversation
Almost all functions in the SQL API have support for taking in column names in place of Column objects. This patch eliminates the few exceptions to make the API more consistent. Affected functions are:

- lower()
- upper()
- abs()
- bitwiseNOT()

As a side-effect, this fixes the redefinition of lower() and upper(), which were being defined at lines 95-96 and again at 1442-1443.
As a side-effect, fixed double definition of initcap().
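The Column-or-name duality the description refers to can be sketched with toy stand-ins (illustrative classes, not the real PySpark ones): each function first normalises its argument, so a Column object and a bare column-name string behave identically.

```python
# Toy stand-ins, not the real PySpark classes: a minimal sketch of the
# Column/name duality this patch makes universal.
class Column:
    def __init__(self, expr):
        self.expr = expr

def _to_column(col):
    # Accept either a Column or a string naming a column.
    return col if isinstance(col, Column) else Column(col)

def lower(col):
    return Column("lower(%s)" % _to_column(col).expr)

print(lower("name").expr)          # lower(name)
print(lower(Column("name")).expr)  # lower(name)
```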
@HyukjinKwon let's continue the discussion from PR #23879 here. |
I still think we should decide whether we're going to drop the string ones on the Scala side first. Although removing the string ones looks preferred, I think there's a compatibility concern, for instance. Ideally we should either consistently allow or consistently disallow across APIs. Yes, I do think it's something we should fix if we see consensus either way. Would you be interested in sending a discussion thread to the mailing list? |
Adding @cloud-fan, @jsnowacki whom I remember talking with before, @maropu from the previous PR, @srowen and @rxin who have better insight from an API perspective (imho). It's about whether we should consistently allow strings as columns in the functions API or not. |
I can do that. 👍 |
Adding @MaxGekk who also I talked with before. |
So this removes support for column names as strings in these functions? or adds some support? It looks like the former but description suggests the latter. Why remove this overload? |
@srowen I edited the description to make this more clear. The patch adds support for column names as arguments to those functions. The function mechanism was removed in those instances because it provided no support for this, so those functions had to be explicitly defined instead. The ones I've seen declared twice were

The function mechanism is very dangerous for allowing this. Is there a good reason for it to be there at all? |
Agree about removing the duplicates from these lists of course. So, is the difference between, say, As far as I see, the inconsistency still exists on the Scala side. That can be considered separately. |
I can see the reasoning for the mechanism, but since this duplication has happened several times already, maybe it would be worth adding tests for this? I don't think the explicit approach is much worse anyway, especially if better (multiline, with examples) documentation is added. I'm not particularly bothered by this, but I can also see it causing more problems in the future.
I think so, except we then have to break out |
I'm not against breaking out all the functions, just might be a lot of work for you. I like the idea of solving it in one go by modifying I'm missing why |
I will not make that change now, especially since it's not directly relevant to this issue, but good to know that you're supportive.
They only take strings, that is the problem. If we modify |
I actually think we should consider both together while we're here. Some Scala APIs have string versions and most of the Python ones have string versions. I have already seen many requests about this across APIs. Also, if we're going to change this in Python alone, let's make very sure that all column references can be done with strings, for instance, in |
It should really be fully checked. As I said, some functions like from_json take a Column but also a DataType. |
I see. Maybe there's a good reason asc/desc should only take a named column, not sure. Well, we can leave that alone for now if in doubt. I think it's OK to focus here on fixing the duplicated definition, and the small change to On the Scala side, on second thought, I'm not sure whether it's better to add all the String-based methods, or deprecate them. On the Scala side the syntax is For dataframe.py, I actually didn't see many places this comes up. I found |
Ok, I will change this PR so that we turn
Thanks for looking into this, I haven't had the time. I will also update the PR to include a change there so that it supports Column objects too. When I have time I will also look for other cases like this, but it might take me a day or two before I can work on these changes at all. |
Yes, anyway let's whitelist what we're fixing in this PR. Also, it should better be fully checked. Please check whether other APIs on the PySpark side can also easily be done by |
Also, we had better stick to all strings or all Columns on the Scala side in the end. If we don't see it as worth fixing on the Scala side for now, I don't see it as worth fixing on the Python side either, to be honest. |
If we did it over again, I think Scala should support |
I deem this to be an anti-pattern, actually. By defining an explicit dependency on the
|
I haven't read through the entire thread, just want to provide some background. At the beginning, having a string version of a function was considered good for UX. However, as we added more and more functions, problems occurred. When a function takes a column, the column argument can be of 3 types:

1. a Column object;
2. a string, interpreted as a column name;
3. a string, interpreted as a literal value.

2 and 3 may conflict if it's a string column. I think a reasonable rule is giving up 3. In my experience, passing a column name string is more common than passing a literal of the column. |
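The three interpretations and the "give up 3 for bare strings" rule can be illustrated with a toy model — `col`, `lit`, and `upper` here are simplified stand-ins, not the real PySpark implementations:

```python
class Column:
    def __init__(self, expr):
        self.expr = expr

def col(name):    # interpretation 2: string as a column name
    return Column(name)

def lit(value):   # interpretation 3: value as a literal
    return Column(repr(value))

def upper(c):
    # Proposed rule: a bare string means a column name; literals must be
    # wrapped explicitly with lit().
    c = c if isinstance(c, Column) else col(c)
    return Column("upper(%s)" % c.expr)

print(upper("name").expr)       # upper(name)    -- column reference
print(upper(lit("name")).expr)  # upper('name')  -- explicit literal
```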
Side comment: I also don't find the |
@asmello I think you're clear to proceed as per your last comment. |
@srowen ok, I'm whitelisting |
Ok, this was tedious. I've whitelisted all functions defined in All these functions take columns as arguments, and they explicitly handle Column/name duality:
@HyukjinKwon you mentioned As for the dataframe methods in
I was even a little surprised by The following function is also 100% compliant with both API styles, but it relies on the jvm backend to do the conversion:
This is something that should be kept in mind should the Scala API change. EDIT: this list was larger, but I realised I'd mistaken The following functions are violations, they only take strings:
The reason these are exceptions is that they're implemented in Since this patch is about making name support universal, however, I don't think it falls under scope to do anything about the One other special case I should mention is With these checks in place, I'm confident the list of exceptions in this PR's description is exhaustive. I will work on solving those now as agreed (by redefining the |
This reverses the previous strategy of breaking out inconsistent functions from the automated mechanism, and solves the inconsistency problem by instead always passing a Column object to the jvm callee. A few auto-gen functions, however, can only take a column name as input - and that is as it should be, semantically:

- lit()
- col()
- column()
- asc()
- desc()
- asc_nulls_first()
- asc_nulls_last()
- desc_nulls_first()
- desc_nulls_last()

To avoid breaking those functions, a different kind of automation mechanism is introduced, and applied to those only. This is just the previous implementation renamed. The reason the original exceptions aren't being handled specially instead is that the original mechanism was inconsistent with how most jvm functions are called. The functions listed above are the true exceptions that should be handled specially, since it would not make sense for them to accept Column objects like the others.
Alright, as noted in the commit, I've restored the original exceptions to be handled by the mechanism, and handled these separately:
The original mechanism has been renamed and applied to those only. The new mechanism is consistent with how most functions are handled outside the mechanism - it converts the argument to a Column object by default. |
@srowen @HyukjinKwon we should be good to go now. |
I ran another test build to be sure, and it failed, but I'm pretty sure it's spurious. @shaneknapp for some weird reason sometimes builds fail because the java style checker refers to old DTDs that disappeared. We fixed that in master a week or two ago. I don't know why. I'll trigger another build. If it passes I think this is good to go. |
Test build #4623 has finished for PR 23882 at commit
|
yeah, that's definitely weird. the workspace that the builds run in is
liberally hit by `git clean -fdx`...
…On Thu, Mar 14, 2019 at 2:19 PM Sean Owen ***@***.***> wrote:
|
```diff
@@ -85,13 +96,16 @@ def _():
     >>> df.select(lit(5).alias('height')).withColumn('spark_user', lit(True)).take(1)
     [Row(height=5, spark_user=True)]
     """
-_functions = {
+_name_functions = {
+    # name functions take a column name as their argument
     'lit': _lit_doc,
```
Does `lit` take strings as column names? How do we create a string literal?
and .. what's really "name function" ... ?
No, `lit()` takes a literal value and creates a column with that literal. That's what it's for. It doesn't make sense for it to accept a column name, given its nature. So if you give it a string it will create a column with that string literal.

The name "name function" is something I came up with just to distinguish these functions from the ones that take columns as input. They are defined by that distinction - they are "functions that take a column name as their argument", exclusively.
`lit` doesn't take a column name as its argument. Why did we use such a "name function" category for `lit`?
To be fair, `lit` is such a unique function semantically that it would warrant its own category. But since the implementation is exactly the same as those "name functions", I left it there for practical purposes.
One (lit) of five (col, column, asc, desc, lit) doesn't sound like a special case, though. It should have had a better category if 20% of its items don't fit the category.
Merged to master |
What happened to math functions?

```python
>>> from pyspark.sql.functions import atan2
>>> spark.range(1).select(atan2("id", "id"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/functions.py", line 78, in _
    jc = getattr(sc._jvm.functions, name)(col1._jc if isinstance(col1, Column) else float(col1),
ValueError: could not convert string to float: id
```

```python
>>> from pyspark.sql.functions import atan2, col
>>> spark.range(1).select(atan2(col("id"), col("id")))
DataFrame[ATAN2(id, id): double]
``` |
@srowen, I haven't finished my review yet. It's been only 2 days, which were the weekend. |
Math cases are difficult to fix because of the existing support:

```python
def _create_binary_mathfunction(name, doc=""):
    """ Create a binary mathfunction by name"""
    def _(col1, col2):
        sc = SparkContext._active_spark_context
        # users might write ints for simplicity. This would throw an error on the JVM side.
        jc = getattr(sc._jvm.functions, name)(col1._jc if isinstance(col1, Column) else float(col1),
                                              col2._jc if isinstance(col2, Column) else float(col2))
```

Therefore, if |
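One way to resolve this (a toy sketch with a stand-in Column class, no JVM involved — not the actual Spark code) is to treat Columns, strings, and numbers distinctly, so strings become column references instead of being forced through `float()`:

```python
class Column:
    def __init__(self, expr):
        self.expr = expr

def _to_operand(v):
    if isinstance(v, Column):
        return v.expr
    if isinstance(v, str):
        return v              # after the fix: a string is a column name
    return repr(float(v))     # ints/floats remain numeric literals

def atan2(col1, col2):
    return Column("ATAN2(%s, %s)" % (_to_operand(col1), _to_operand(col2)))

print(atan2("id", "id").expr)       # ATAN2(id, id)
print(atan2(1, Column("id")).expr)  # ATAN2(1.0, id)
```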
There is another exception, `when`:

```python
>>> spark.range(1).select(when(lit(True), col("id"))).show()
>>> spark.range(1).select(when(lit(True), "id")).show()
```

Here's another ambiguity |
```python
'initcap': 'Returns a new string column by converting the first letter of each word to ' +
           'uppercase. Words are delimited by whitespace.',
'lower': 'Converts a string column to lower case.',
'upper': 'Converts a string column to upper case.',
```
This had to stay in string functions!
`initcap` was also defined explicitly below, with better documentation, so I preserved that instance. `lower` and `upper` have been introduced in the 1.3 API, and were being defined there as well, so I prioritised that instance because it used the correct `@since` number. Arguably we could've used a separate mapping like `string_functions_1_3` or something, but that didn't seem appropriate since the point of these is to minimise code.
I was saying `lower` and `upper`. Looks like they had been overwritten by this, so if we keep them here, it keeps the previous definition. Did you check which PR and JIRAs added `lower` and `upper`?

Also, strictly speaking, this should not have been removed in this PR, as it doesn't target removing overwritten functions. As you said, we should avoid such a function definition style later.
Also, if it was an exception, it had to be described specifically.

```python
# name functions take a column name as their argument
'lit': _lit_doc,
```

This doesn't make sense if you read the code from scratch.
The java/scala API documentation says it was added in 1.3, but I just tracked down the JIRA/PR and it seems it actually was 1.0.
https://issues.apache.org/jira/browse/SPARK-1995
#936
As for removing overwritten functions, maybe it would've been better to make a separate PR, but the first fix did require removing them. When I changed the approach it seemed reasonable to keep the change, since the problem was obvious and easy to fix.
Why was

```python
>>> spark.range(1).select(lit('a').alias("value")).select(ascii(col("value"))).show()
>>> spark.range(1).select(lit('a').alias("value")).select(ascii("value")).show()
```
|
```python
'initcap': 'Returns a new string column by converting the first letter of each word to ' +
           'uppercase. Words are delimited by whitespace.',
'lower': 'Converts a string column to lower case.',
'upper': 'Converts a string column to upper case.',
'ltrim': 'Trim the spaces from left end for the specified string value.',
'rtrim': 'Trim the spaces from right end for the specified string value.',
'trim': 'Trim the spaces from both ends for the specified string column.',
```
Did you not add the column name support because the Scala side has this signature below?

```scala
def trim(e: Column): Column
def trim(e: Column, trimString: String): Column
```

That's not allowed in Python:

```python
>>> from pyspark.sql.functions import trim, lit
>>> spark.range(1).select(lit('a').alias("value")).select(trim("value", "a"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: _() takes exactly 1 argument (2 given)
```
This is what I initially expected when I asked to whitelist them, @asmello
I don't understand what you're saying here. `trim` with 2 arguments was never supported in PySpark, and that's a separate issue. What this patch changes is that `trim("value")` is now supported, when it wasn't previously.
I was saying this because I didn't get why you excluded string functions.
Same reason as the other string functions; they were being defined using the automatic mechanism, which assumed the argument was either a Column object or a literal. So if it was a string (column name) and the Scala API didn't have an overload for that, it would fail. |
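That failure mode can be modelled without Spark at all — `jvm_ascii` below is a hypothetical stand-in for the Scala function, which only has a Column overload, mirroring the `Method ascii([class java.lang.String]) does not exist` error:

```python
class Column:
    def __init__(self, expr):
        self.expr = expr

def jvm_ascii(col):
    # Stand-in for the Scala side: only a Column overload exists.
    if not isinstance(col, Column):
        raise TypeError("Method ascii([%s]) does not exist" % type(col).__name__)
    return Column("ascii(%s)" % col.expr)

def old_ascii(col):
    # Old mechanism: forwards the argument unchanged, so a name string fails.
    return jvm_ascii(col)

def new_ascii(col):
    # Fixed mechanism: convert a name string to a Column first.
    return jvm_ascii(col if isinstance(col, Column) else Column(col))

try:
    old_ascii("value")
except TypeError as e:
    print(e)                    # Method ascii([str]) does not exist
print(new_ascii("value").expr)  # ascii(value)
```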
Yes this is a potential source of confusion, but it's more complicated to fix because it would require breaking the current API. |
What do you mean? doesn't this PR target to have ones that take strings as column names across PySpark function API? |
Yes, this had to be identified and whitelisted before asking if we need something else at this point. |
No, this is a different problem. It's not about the API not supporting a column name, it's about it supporting a literal instead. It's outside the scope here, which I kept restricted specifically so we wouldn't block on other problems like this. |
The whole problem we discussed at this PR is, if we're going to support strings as column names in general, no? And there looks a few cases of ambiguity here and there. |
Yes. I was saying the previous implementation was problematic because of that. |
Then why didn't we handle all the other cases I listed above? It's something you missed. I don't know why you try to justify your mistakes. |
I didn't miss those, I deliberately didn't tackle them because 1) they were a different kind of problem; 2) the fix is not as simple, as it requires breaking the API. The only thing I did miss was the problem with the math functions, and I'm happy to have those fixed. I'm tired of arguing this with you, the PR was open for weeks and I tried to fix every problem you pointed out. What's the point of complaining now that it has been merged? I don't know why you antagonise me so much. |
Because I was deciding whether I should revert or make a followup. If you revert it, the reasons had better be left, and I am arguing because you're justifying the reasons that I listed. This PR targets supporting strings as columns in general, and a few missing cases were found, which seem not to have been mentioned or discussed. The main reason I was initially worried was that we should see if it makes sense to support strings as columns in PySpark's API (and further, in the other language APIs). I think other people also expressed this concern. It doesn't look like this concern was addressed properly. |
And, for this PR in particular, there had to be a lot of effort. We shouldn't fix something in a just-work-for-now way anymore, in particular when it repeatedly becomes an issue. |
It's definitely better after the merge than before, so I see no reason to revert. Though due to the missing math functions fix, I agree a follow up is required. As for cases like
That's a valid concern, but, again, it's better to be consistent for now. It's not a "just-work-for-now" situation, either, it's a full fix for the consistency problem. String support itself is a related, but separate discussion, too. |
I already explained the reasons to revert. If we see some ambiguity, we should consider not supporting strings as columns officially, or such a discussion should be concluded. I made a followup to fix the missing instances here, and to conclude the discussion. If that's getting longer, I'm going to revert this PR. It's confusing what a string means in the PySpark function API because it has two semantically different meanings here. |
I think this change was fine and will review the follow on too. I wouldn't revert this. I think we decided to keep this to a narrow fix for consistency. Anything else can be a future change or discussion. It would return this to a worse state if it's reverted. |
…ke string as columns as well

## What changes were proposed in this pull request?

This is a followup of #23882 to handle binary math/string functions. For instance, see the cases below:

**Before:**

```python
>>> from pyspark.sql.functions import lit, ascii
>>> spark.range(1).select(lit('a').alias("value")).select(ascii("value"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/functions.py", line 51, in _
    jc = getattr(sc._jvm.functions, name)(col._jc if isinstance(col, Column) else col)
  File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1286, in __call__
  File "/.../spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py", line 332, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.ascii. Trace:
py4j.Py4JException: Method ascii([class java.lang.String]) does not exist
	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:339)
	at py4j.Gateway.invoke(Gateway.java:276)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
```

```python
>>> from pyspark.sql.functions import atan2
>>> spark.range(1).select(atan2("id", "id"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/functions.py", line 78, in _
    jc = getattr(sc._jvm.functions, name)(col1._jc if isinstance(col1, Column) else float(col1),
ValueError: could not convert string to float: id
```

**After:**

```python
>>> from pyspark.sql.functions import lit, ascii
>>> spark.range(1).select(lit('a').alias("value")).select(ascii("value"))
DataFrame[ascii(value): int]
```

```python
>>> from pyspark.sql.functions import atan2
>>> spark.range(1).select(atan2("id", "id"))
DataFrame[ATAN2(id, id): double]
```

Note that,

- This PR causes a slight behaviour change for math functions. For instance, numbers as strings (e.g., `"1"`) were supported as arguments of binary math functions before. After this PR, they are recognised as column names.
- I also intentionally didn't document this behaviour change since we're going ahead for Spark 3.0 and I don't think numbers as strings make much sense in math functions.
- There is another exception `when`, which takes a string as a literal value, as below. This PR doesn't fix this ambiguity.

```python
>>> spark.range(1).select(when(lit(True), col("id"))).show()
```

```
+--------------------------+
|CASE WHEN true THEN id END|
+--------------------------+
|                         0|
+--------------------------+
```

```python
>>> spark.range(1).select(when(lit(True), "id")).show()
```

```
+--------------------------+
|CASE WHEN true THEN id END|
+--------------------------+
|                        id|
+--------------------------+
```

This PR also makes a change as below. #23882 fixed it to:

- Rename `_create_function` to `_create_name_function`
- Define a new `_create_function` to take strings as column names.

In this PR, I propose to:

- Revert the `_create_name_function` name to `_create_function`.
- Define a new `_create_function_over_column` to take strings as column names.

## How was this patch tested?

Some unit tests were added for binary math / string functions.

Closes #24121 from HyukjinKwon/SPARK-26979.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
Sorry if I missed a discussion thread where we agreed upon fixing this partially, and there was an explicit conclusion here. If not, I don't think we should fix such a thing partially without an explicit conclusion - it's been raised multiple times. |
What changes were proposed in this pull request?
Most SQL functions defined in `spark.sql.functions` have two calling patterns: one with a Column object as input, and another with a string representing a column name, which is then converted into a Column object internally. There are, however, a few notable exceptions:
While this doesn't break anything, as you can easily create a Column object yourself prior to passing it to one of these functions, it has two undesirable consequences:
It is surprising - it breaks coders' expectations when they are first starting with Spark. Every API should be as consistent as possible, so as to make the learning curve smoother and to reduce causes for human error;
It gets in the way of stylistic conventions. Most of the time it makes Python code more readable to use literal names, and the API provides ample support for that, but these few exceptions prevent this pattern from being universally applicable.
This patch is meant to fix the aforementioned problem.
Effect
This patch enables support for passing column names as input to those functions mentioned above.
Side effects
This PR also fixes an issue with some functions being defined multiple times by using `_create_function()`.

How it works

`_create_function()` was redefined to always convert the argument to a Column object. The old implementation has been kept under `_create_name_function()`, and is still being used to generate the following special functions:

This is because these functions can only take a column name as their argument. This is not a problem, as their semantics require it.
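The two factories can be sketched without the SparkContext/py4j wiring — a simplified model of the design described above, not the actual implementation:

```python
class Column:
    def __init__(self, expr):
        self.expr = expr

def _create_function(name):
    # New default: normalise the argument to a Column before the call.
    def _(col):
        c = col if isinstance(col, Column) else Column(col)
        return Column("%s(%s)" % (name, c.expr))
    _.__name__ = name
    return _

def _create_name_function(name):
    # Kept for the special functions: the argument is a column name
    # (or, for lit(), a literal) and is passed through as-is.
    def _(col_name):
        return Column("%s(%s)" % (name, col_name))
    _.__name__ = name
    return _

upper = _create_function("upper")
asc = _create_name_function("asc")

print(upper(Column("name")).expr)  # upper(name)
print(upper("name").expr)          # upper(name)
print(asc("name").expr)            # asc(name)
```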
How was this patch tested?
Ran ./dev/run-tests and tested it manually.