[SPARK-26979][SQL] Add missing column name support for SQL functions #23879
Conversation
Most SQL functions already had support for taking column names in place of Column objects. This change enables the same functionality for the following functions:
- lower()
- upper()
- abs()
- bitwiseNOT()
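As a rough illustration of what the PR adds, here is a minimal, self-contained sketch of the accept-a-name-or-a-Column pattern. The `Column` class and `lower` function below are toy stand-ins, not Spark's actual implementation:

```python
class Column:
    """Toy stand-in for Spark's Column: just records an expression string."""
    def __init__(self, expr):
        self.expr = expr
    def __repr__(self):
        return f"Column<{self.expr}>"

def lower(col):
    """Accept either a Column or a column name, like the proposed overload."""
    if isinstance(col, str):  # the added convenience: name -> Column
        col = Column(col)
    return Column(f"lower({col.expr})")

# Both call styles now produce the same expression:
assert lower("name").expr == lower(Column("name")).expr == "lower(name)"
```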
Can one of the admins verify this patch?
It can be easily worked around by, for instance,
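The workaround elided here is presumably constructing the Column yourself before calling the Column-only variant. Sketched with the same toy stand-ins as above (not Spark's API):

```python
class Column:
    """Toy stand-in for Spark's Column."""
    def __init__(self, expr):
        self.expr = expr

def lower(col):
    # Imagine this only accepts a Column, as before the PR.
    return Column(f"lower({col.expr})")

# Instead of lower("name"), which would fail, wrap the name explicitly:
assert lower(Column("name")).expr == "lower(name)"
```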
I also think we don't need these APIs ...
Forgive me, but that doesn't seem like a reason not to add it at all. It's like saying "don't invent a cart, we've always carried the bags on our backs". As I've mentioned in the JIRA ticket, this does cause some attrition for new learners and breaks consistency, which I do think is good enough reason to include the change. Is there a reason not to, other than your opinion that we "don't need it"? Also, I will stress that almost all SQL functions already work like this, so clearly someone thought this was a good idea and no one disagreed. I really don't see why these functions should be exceptions.
Yes, it costs maintenance overhead - the exposed API surface keeps growing. And you're not inventing a cart here; you can already get the same result by constructing the Column explicitly. Not all SQL functions work like this - there are many similar cases. The Spark community had a discussion about this before, and people tended toward removing the string-overloaded versions instead.
Ok, now this is a fair point and one I really can't dispute. But for such a trivial addition, I find it hard to believe the maintenance cost will ever exceed what it costs API users not to have it.
All math functions do, save for a few exceptions. I ran into this issue from PySpark - there, these four functions are pretty much the only ones that can't take a column name. But I see now that that's because PySpark does the conversion itself, under the hood, most of the time. So I think it's probably better to apply this change there.
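The "conversion under the hood" can be pictured like this: most PySpark wrappers funnel their argument through a helper that turns a string into a Column, so string names work almost everywhere, and the few functions that skip the helper are exactly the ones that break. This is a hedged sketch modeled on that idea; the names and helper are illustrative, not Spark's actual code:

```python
class Column:
    """Toy stand-in for Spark's Column."""
    def __init__(self, expr):
        self.expr = expr

def _to_column(col):
    """Illustrative helper: accept a Column or a column name."""
    return col if isinstance(col, Column) else Column(col)

def sqrt(col):
    # A wrapper that uses the helper: strings and Columns both work.
    return Column(f"sqrt({_to_column(col).expr})")

def upper(col):
    # A wrapper that skips the helper: passing a string here fails,
    # mirroring the handful of functions discussed in this PR.
    return Column(f"upper({col.expr})")

assert sqrt("x").expr == "sqrt(x)"
try:
    upper("x")
except AttributeError:
    pass  # a plain str has no .expr, so the call breaks
```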
Again, this problem surfaced from using PySpark, so I can see why you don't think it exists in core Spark. But it's very real. Java likes to keep things verbose and repetitive, and maybe that spilled over to Scala a little bit, but Python is strongly biased towards readability over protections (type, access, etc.). I suspect that is why PySpark developers made almost all functions support column names. It falls in line with the language's philosophy, making things easier to read, and it helps you avoid repeating yourself.
This, I think, is a bad idea: adding overloads is not a breaking change, but removing them is. It's been inconsistent like this for a long time, so I'm sure loads of code already rely on using column names where allowed. I find it hard to justify such a big downside for the sake of consistency.
This is an exaggeration. All I proposed was something already done elsewhere in the same codebase, not a new use case.
If it ever happens, ok - at least it will make the API consistent. Still, there is a reason those overloads exist: they make code less verbose, which I don't think is a bad thing. With all of this in mind, I'll be happy to close this PR and reapply the change on the PySpark side. Does that sound reasonable?
Of course, that means deprecating (and later removing) them, not just removing them outright. This itself is a trivial change, but for consistency we shouldn't make an exception unless there's a special reason for one. Spark is now heading toward a major version bump to Spark 3, so we should fix things we postponed in earlier versions. PySpark has the same problem as well. I wouldn't change what's already there from Spark 1.x - if it were a new API, I would have been okay, since we already have the inconsistency there. We actually should fix it on the Scala side and see what happens. If Spark is going to remove string arguments, we should also consider removing string-as-column support in the PySpark APIs.
I get the reasoning, but like I said, in Python it aligns better with the language to keep the overloads. It's already 99.9% of the way to fully consistent, too - it would take a huge refactor to make the subtractive change, but it only takes a few more lines for the additive one. I have no say in this decision, but I'd be deeply disappointed if PySpark lost the API sugar.
Are you 100% sure that a few lines would make all the function APIs consistent? I think not, IIRC (I already took a look before). BTW, why is supporting other types Python-specific API sugar? A good Python API is explicit about what it expects as input and what it returns. Duck typing is orthogonal here.
I'm very sure, because most functions follow the same general coding pattern.
The exceptions are the ones defined automatically from their names. So unless there's a very sneaky definition somewhere, I don't think any other inconsistencies exist.
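The "defined automatically from their name" pattern mentioned above can be sketched as a factory that stamps out wrappers from a list of names. This is an illustrative toy modeled on that idea (PySpark historically generated many of its functions this way, but the helper and names here are stand-ins, not Spark's actual code):

```python
class Column:
    """Toy stand-in for Spark's Column."""
    def __init__(self, expr):
        self.expr = expr

def _create_function(name):
    """Illustrative factory: build a wrapper function from a name."""
    def _(col):
        expr = col.expr if isinstance(col, Column) else str(col)
        return Column(f"{name}({expr})")
    _.__name__ = name
    return _

# Generate several functions from a single list of names:
for _name in ["sin", "cos", "sqrt"]:
    globals()[_name] = _create_function(_name)

assert sin(Column("x")).expr == "sin(x)"  # noqa: F821 (defined dynamically)
assert cos("y").expr == "cos(y)"          # noqa: F821
```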
I believe this is something that should be made very clear by the documentation, but otherwise I don't see a problem. The built-in functions themselves are full of examples of duck typing like this.
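For instance, several Python built-ins already accept more than one shape of input, which is the flexibility being appealed to here (standard-library behavior, shown for illustration):

```python
# max() accepts either a single iterable or the values directly:
assert max([3, 1, 2]) == 3
assert max(3, 1, 2) == 3

# sorted() accepts any iterable, not just lists:
assert sorted("bca") == ["a", "b", "c"]

# range() varies its meaning by argument count:
assert list(range(3)) == [0, 1, 2]
assert list(range(1, 3)) == [1, 2]
```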
These generics are only applied to some of the functions, and I see at least one more instance of the inconsistency.
I had whitelisted them one by one. Still, the point stands.
This is a different matter altogether.
What kind of impact does this have there?
The same can be said about that case. BTW, I've found more than one related issue. I'll close this PR now, as I'm convinced the change should be made on PySpark's side.
What changes were proposed in this pull request?
Most SQL functions already had support for taking column names in place of Column objects. This change enables the same functionality for the following functions:
- lower()
- upper()
- abs()
- bitwiseNOT()
How was this patch tested?
Ran ./dev/run-tests