
[SPARK-26979][SQL] Add missing column name support for SQL functions #23879

Closed
wants to merge 2 commits

Conversation

@asmello commented Feb 23, 2019

What changes were proposed in this pull request?

Most SQL functions already had support for taking column names in
place of Column objects. This change enables the same functionality
for the following functions:

  • lower()
  • upper()
  • abs()
  • bitwiseNOT()
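
For illustration, a minimal sketch of the calling convention these overloads enable, shown here in PySpark terms (the DataFrame df and its columns "name" and "delta" are hypothetical):

from pyspark.sql import functions as F

# Without the String overloads, these four functions require an explicit Column:
df.select(F.lower(F.col("name")), F.abs(F.col("delta")))

# With column-name support they accept the name directly,
# matching most other SQL functions:
df.select(F.lower("name"), F.abs("delta"))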

How was this patch tested?

Ran /dev/run-tests

@AmplabJenkins

Can one of the admins verify this patch?

@HyukjinKwon (Member)

It can be easily worked around by, for instance, lower(col(...)). Let's not add it.
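
For example, a minimal PySpark sketch of that workaround (the DataFrame df and the column name are hypothetical):

from pyspark.sql.functions import col, lower

# wrap the column name in col() before passing it to lower()
df.select(lower(col("name")))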

@maropu (Member) commented Feb 24, 2019

I also think we don't need these APIs ...

@asmello (Author) commented Feb 24, 2019

It can be easily worked around

Forgive me, but that doesn't seem like a reason not to add it at all. It's like saying "don't invent a cart, we've always carried the bags on our backs". As I've mentioned in the JIRA ticket, this does cause some attrition for new learners and breaks consistency, which I do think is a good enough reason to include the change. Is there a reason not to, other than your opinion that we "don't need it"?

Also, I will stress that almost all SQL functions already work like this, so clearly someone thought this was a good idea and no one disagreed. I really don't see why these functions should be exceptions.

@HyukjinKwon (Member) commented Feb 24, 2019

Yes, it costs maintenance overhead - the exposed API keeps getting larger. You're not reinventing the cart by calling lower(col(...)), which already leverages the Spark SQL engine.

Not all SQL functions work like this - there are many similar cases. See functions.scala. Spark already has tons of APIs. What kind of attrition does it cause to new learners? Inconsistency? In that case we should rather focus on removing the string-overload versions. We should focus on reducing APIs rather than adding them. Otherwise, every system ends up accumulating APIs for everything, down to sending email.

The Spark community has discussed this before, and people tended to lean toward removing the string-overload versions instead.

@asmello (Author) commented Feb 24, 2019

Yes, it costs maintenance overhead - the exposed API keeps getting larger.

Ok, now this is a fair point and one I really can't dispute. But with such a trivial addition, I find it hard to believe it will ever cost more than it costs people using the API not to have it.

Not all SQL functions work like this.

All math functions do, except abs(). Other string functions too, like concat_ws(). But you are actually right, now that I looked more closely, there are many more exceptions than I realised.

I ran into this issue from PySpark - there, these four functions are pretty much the only ones that can't take a column name, but I see now that that's because PySpark does the conversion itself, under the hood, most of the time. So it may be better to apply this change there.

What kind of attrition does it cause to new learners? Inconsistency?

Again, this problem surfaced from using PySpark, so I can see why you don't think it exists in core Spark. But it's very real. Java likes to keep things verbose and repetitive, and maybe that spilled over to Scala a little bit, but Python is strongly biased towards readability over protections (type checks, access control, etc.).

I suspect that is why PySpark developers made almost all functions support column names. It falls in line with the language's philosophy, making things easier to read, and it helps you not repeat yourself writing col() over and over again. So people learning PySpark quickly get into the habit of using column names, but then they face these few exceptions where they can't do that. Then what was once a clear pattern is broken, and without a clear guideline people will sometimes use col() and sometimes not, making code stylistically inconsistent and confusing.

In that case we should rather focus on removing the string-overload versions. We should focus on reducing APIs rather than adding them.

This, I think, is a bad idea: adding overloads is not a breaking change, but removing them is. It's been inconsistent like this for a long time, so I'm sure loads of code already relies on using column names where they are allowed. I find it hard to justify such a big downside for the sake of consistency.

Otherwise, every system ends up accumulating APIs for everything, down to sending email.

This is an exaggeration. All I proposed was something already done elsewhere in the same codebase, not a new use case.

The Spark community has discussed this before, and people tended to lean toward removing the string-overload versions instead.

If it ever happens, OK - at least it will make the API consistent. Still, there is a reason those overloads exist: they make code less verbose, which I don't think is a bad thing.

With all of this in mind, I'll be happy to close this PR and reapply the change in the PySpark side. Does that sound reasonable?

@HyukjinKwon (Member) commented Feb 24, 2019

Of course, the plan would be to deprecate them first (and remove them later), not just remove them outright. This change itself is trivial, but for consistency's sake we shouldn't make exceptions unless there's a special reason.

Spark is now heading toward a major version bump to Spark 3. We should fix the things we have postponed in earlier versions.

PySpark has the same problem as well. I wouldn't change what's already there from Spark 1.x - if it were a new API, I would have been okay with it, since the inconsistency is already there.

We should actually fix this on the Scala side first and see what happens. If Spark is going to remove string arguments, removing strings as column references in the PySpark APIs should be considered as well.

@asmello (Author) commented Feb 24, 2019

If Spark is going to remove string arguments, removing strings as column references in the PySpark APIs should be considered as well.

I get the reasoning, but like I said, in Python it aligns better with the language to keep the overloads. It's already 99.9% of the way to fully consistent, too - it would take a huge refactor to make the subtractive change, but only a few more lines for the additive one. I have no say in this decision, but I'd be deeply disappointed if PySpark lost the API sugar.

@HyukjinKwon (Member)

Are you 100% sure that a few lines would make all the function APIs consistent? IIRC it isn't so (I already took a look before). BTW, why is supporting other types Python-specific API sugar? A good Python API is explicit about what it expects as input and what it returns as output. Duck typing is orthogonal here.

@asmello (Author) commented Feb 24, 2019

I'm very sure, because the general coding pattern used for most functions is

@since(X.Y)
def function_name(col):
    """
    Documentation
    """
    sc = SparkContext._active_spark_context
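    # _to_java_column accepts either a Column or a column-name string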
    return Column(sc._jvm.functions.function_name(_to_java_column(col)))

The exceptions are those that are defined automatically from their name with _create_function(), and of those, only these four functions (lower(), upper(), abs(), bitwiseNOT()) have no string support on the JVM side.
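
For reference, a rough sketch (paraphrased, not the verbatim Spark source) of the _create_function pattern described above, which hands the argument straight to the JVM function:

from pyspark import SparkContext
from pyspark.sql.column import Column

def _create_function(name, doc=""):
    """Create a wrapper that calls the JVM SQL function of the same name."""
    def _(col):
        sc = SparkContext._active_spark_context
        # The argument is passed through as-is: the underlying Java column if
        # given a Column, otherwise the raw value (e.g. a string). A string
        # therefore only works when the JVM function has a String overload,
        # which lower/upper/abs/bitwiseNOT lack.
        jc = getattr(sc._jvm.functions, name)(col._jc if isinstance(col, Column) else col)
        return Column(jc)
    _.__name__ = name
    _.__doc__ = doc
    return _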

So unless there's a very sneaky definition somewhere, I don't think any other inconsistencies exist.

A good Python API is explicit about what it expects as input and what it returns as output. Duck typing is orthogonal here.

I believe this is something that should be made very clear by the documentation, but otherwise I don't see a problem. The built-in functions themselves are full of examples of duck-typing like this. See max(), dict(), and range(), just to name a few.

@HyukjinKwon (Member) commented Feb 24, 2019

These generics only apply to functions.py. Can you then whitelist them one by one?

I see at least one more instance: trim, via _create_function. Also, there are a few exceptions, like from_json, which takes the schema as a Column but also as a DataType. Plus, we should make it consistent in dataframe.py as well.

Column and string are completely different types. This isn't an iterable or anything related to duck typing. See help(max) in Python.

@asmello (Author) commented Feb 24, 2019

These generics only apply to functions.py. Can you then whitelist them one by one?

I had whitelisted them one by one at the _create_function() uses between lines 257-273, but I forgot to do the same for line 1451. My bad, I truly apologise. It turns out all the functions defined there are exceptions too.

Still, the point stands that uses of _create_function() are the only cases where the JVM function is called directly without argument conversion, and with those functions also taken care of, the SQL API should be fully consistent in this regard.
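
One hypothetical sketch of what that PySpark-side change could look like (the helper name _create_name_function is my own for illustration, not Spark's): route the argument through _to_java_column so that both Column objects and column-name strings are accepted, just like the hand-written wrappers:

from pyspark import SparkContext
from pyspark.sql.column import Column, _to_java_column

def _create_name_function(name, doc=""):
    """Hypothetical variant of _create_function that also accepts column names."""
    def _(col):
        sc = SparkContext._active_spark_context
        # _to_java_column converts either a Column or a column-name string
        # into a Java Column, so the JVM function only ever sees a Column.
        jc = getattr(sc._jvm.functions, name)(_to_java_column(col))
        return Column(jc)
    _.__name__ = name
    _.__doc__ = doc
    return _

# e.g. redefining one of the four exceptions with the converting variant:
lower = _create_name_function("lower", "Converts a string expression to lower case.")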

Also, there are a few exceptions, like from_json, which takes the schema as a Column but also as a DataType

This is a different matter altogether. schema is not the column being operated on in this case; it's a schema specification, so there's no reason it should take a column name here. It could, I suppose, but that sounds like an anti-pattern to me.

Plus, we should make it consistent in dataframe.py as well.

What kind of impact does this have there?

Column and string are completely different types. This isn't an iterable or anything related to duck typing.

The same can be said about max(1, 2, 3) and max([1, 2, 3]), you know. As long as they are related semantically, Python encourages different calling patterns.

BTW, I've found more than one instance of _create_function() being used to define the same function more than once. This is dangerous. I feel it should be removed altogether.

I'll close this PR now, as I'm convinced the change should be made on PySpark's side.
