
[SPARK-26979][PYTHON] Add missing string column name support for some SQL functions #23882

Closed
wants to merge 4 commits into from

Conversation


@asmello asmello commented Feb 24, 2019

What changes were proposed in this pull request?

Most SQL functions defined in spark.sql.functions have two calling patterns, one with a Column object as input, and another with a string representing a column name, which is then converted into a Column object internally.

There are, however, a few notable exceptions:

  • lower()
  • upper()
  • abs()
  • bitwiseNOT()
  • ltrim()
  • rtrim()
  • trim()
  • ascii()
  • base64()
  • unbase64()

While this doesn't break anything, as you can easily create a Column object yourself prior to passing it to one of these functions, it has two undesirable consequences:

  1. It is surprising - it breaks coders' expectations when they are first starting with Spark. Every API should be as consistent as possible, so as to make the learning curve smoother and to reduce causes of human error;

  2. It gets in the way of stylistic conventions. Most of the time it makes Python code more readable to use literal names, and the API provides ample support for that, but these few exceptions prevent this pattern from being universally applicable.

This patch is meant to fix the aforementioned problem.
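
For illustration, a minimal sketch of the inconsistency before this patch (run in a pyspark shell where `spark` is already defined; the DataFrame is just an example):

```python
from pyspark.sql.functions import col, length, lower

df = spark.createDataFrame([("Alice",)], ["name"])

df.select(length("name"))      # works: length() accepts a column name string
df.select(lower(col("name")))  # works: lower() accepts a Column object
df.select(lower("name"))       # fails before this patch: the JVM lower() has no String overload
```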

Effect

This patch enables support for passing column names as input to those functions mentioned above.

Side effects

This PR also fixes an issue where some functions were defined multiple times via _create_function().

How it works

_create_function() was redefined to always convert the argument to a Column object. The old implementation has been kept under _create_name_function(), and is still being used to generate the following special functions:

  • lit()
  • col()
  • column()
  • asc()
  • desc()
  • asc_nulls_first()
  • asc_nulls_last()
  • desc_nulls_first()
  • desc_nulls_last()

This is because these functions can only take a column name as their argument. This is not a problem, as their semantics require it.
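
Roughly, the two helpers look like the sketch below (simplified, not the exact merged code). _create_function now routes its argument through _to_java_column, so both Column objects and name strings work, while _create_name_function keeps the old pass-through behaviour for the name-only functions above:

```python
from pyspark import SparkContext
from pyspark.sql.column import Column, _to_java_column


def _create_function(name, doc=""):
    """Create a function that accepts either a Column or a column name string."""
    def _(col):
        sc = SparkContext._active_spark_context
        # _to_java_column turns a string into a JVM column reference and
        # passes an existing Column's _jc through unchanged.
        jc = getattr(sc._jvm.functions, name)(_to_java_column(col))
        return Column(jc)
    _.__name__ = name
    _.__doc__ = doc
    return _


def _create_name_function(name, doc=""):
    """Create a function that passes its argument (a name or literal) through as-is."""
    def _(col):
        sc = SparkContext._active_spark_context
        jc = getattr(sc._jvm.functions, name)(col._jc if isinstance(col, Column) else col)
        return Column(jc)
    _.__name__ = name
    _.__doc__ = doc
    return _
```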

How was this patch tested?

Ran ./dev/run-tests and tested it manually.

Almost all functions in the SQL API have support for taking in
column names in place of Column objects. This patch eliminates the
few exceptions to make the API more consistent.

Affected functions are:
- lower()
- upper()
- abs()
- bitwiseNOT()

As a side-effect, this fixes the redefinition of lower() and
upper(), which were being defined at lines 95-96 and again at
1442-1443.
As a side-effect, fixed double definition of initcap().
@asmello
Author

asmello commented Feb 24, 2019

@HyukjinKwon let's continue the discussion from PR #23879 here.

@HyukjinKwon
Member

I still think we should first decide whether we're going to drop the string variants on the Scala side. Although removing the string variants looks preferable, there's a compatibility concern, for instance.

Ideally we should either consistently allow or consistently disallow this across APIs. Yes, I do think it's something we should fix if we see consensus either way. Would you be interested in starting a discussion thread on the mailing list?

@HyukjinKwon
Member

HyukjinKwon commented Feb 24, 2019

Adding @cloud-fan and @jsnowacki, whom I remember discussing this with before, @maropu from the previous PR, and @srowen and @rxin, who have better insight from an API perspective (imho).

It's about whether we should consistently allow strings as columns in the functions API or not.

@asmello
Author

asmello commented Feb 24, 2019

Would you be interested in starting a discussion thread on the mailing list?

I can do that. 👍

@HyukjinKwon
Member

Adding @MaxGekk who also I talked with before.

@srowen
Member

srowen commented Feb 24, 2019

So this removes support for column names as strings in these functions? Or adds some support? It looks like the former, but the description suggests the latter. Why remove this overload?
Either way, why do the functions need to be declared by hand rather than with the function mechanism? I was missing that part. I see some are declared twice; which ones?

@asmello
Author

asmello commented Feb 24, 2019

@srowen I edited the description to make this clearer. The patch adds support for column names as arguments to those functions. The function mechanism was removed in those instances because it provided no support for this, so those functions had to be defined explicitly instead.

The ones I've seen declared twice were lower(), upper() and initcap(). Since the second definition of initcap() was explicit and better documented, I only removed the implicit instance in that case, but otherwise I replaced both implicit definitions with a new explicit one.

The function mechanism is quite dangerous in that it allows this kind of duplication. Is there a good reason for it to be there at all?

@srowen
Member

srowen commented Feb 24, 2019

Agree about removing the duplicates from these lists of course.
I agree this method of declaring can be error-prone; I fixed another one recently. It's there to avoid repeating a bunch of code. I don't think it's so bad or worth changing here.

So, is the difference between, say, abs and sqrt that the Scala side has overloads for Column and String only for sqrt, not abs? Makes sense.

If _create_function made use of _to_java_column, would this be resolved for all these functions without breaking them out? Do the mapping on the Python side, always.

I see, the inconsistency still exists on the Scala side. That can be considered separately.

@asmello
Author

asmello commented Feb 24, 2019

I agree this method of declaring can be error-prone; I fixed another one recently. It's there to avoid repeating a bunch of code. I don't think it's so bad or worth changing here.

I can see the reasoning for the mechanism, but since this duplication has happened several times already, maybe it would be worth adding tests for this? I don't think the explicit approach is much worse anyway, especially if better (multiline, with examples) documentation is added. I'm not particularly bothered by this, but I can also see it causing more problems in the future.

If _create_function made use of _to_java_column, would this be resolved for all these functions without breaking them out?

I think so, except we then have to break out lit(), col(), column(), asc() and desc()...

@srowen
Member

srowen commented Feb 24, 2019

I'm not against breaking out all the functions, just might be a lot of work for you.
Yes, we do need tests, ideally. And I suppose that's what doctest is good for, and that would require breaking these out.
If you want to go all the way to break out with brief doctests, I think that's OK. Otherwise I'd leave it for now.

I like the idea of solving it in one go by modifying _create_function as it would be more consistent. Yes you're right that also requires special-casing.

I'm missing why asc and desc are special; they also operate on columns so can take strings too?

@asmello
Author

asmello commented Feb 24, 2019

I'm not against breaking out all the functions, just might be a lot of work for you.

I will not make that change now, especially since it's not directly relevant to this issue, but good to know that you're supportive.

I'm missing why asc and desc are special; they also operate on columns so can take strings too?

They only take strings, that is the problem. If we modify _create_function to always convert to Column, it would break those functions - and maybe others too; I haven't fully checked. Alternatively we could define another version of _create_function just for those, but I really question whether that would be any better.

@HyukjinKwon
Member

I actually think we should consider both together while we're here. Some Scala APIs have string versions, and most of the Python ones do too. I've already seen many requests about this across APIs.

Also, if we're going to change this in Python alone, let's make very sure that all column references can be done with strings, for instance in dataframe.py too. I'm pretty sure there are holes here and there.

@HyukjinKwon
Member

It should really be fully checked. As I said, some functions like from_json take a Column but also a DataType.

@srowen
Member

srowen commented Feb 24, 2019

I see. Maybe there's a good reason asc/desc should only take a named column, not sure. Well, we can leave that alone for now if in doubt.

I think it's OK to focus here on fixing the duplicated definition, and the small change to _create_function that lets it work consistently with other similar Pyspark methods.

On the Scala side, on second thought, I'm not sure whether it's better to add all the String-based methods, or deprecate them. On the Scala side the syntax is $"col" so there's already just 1 character of difference for the caller. On the Pyspark side it happens to be easy to support both versions with no additional declarations, and there's an argument that supporting strings is its best attempt to mimic the "$" syntax. Still I take the point about difference across APIs not being great, but there are going to be bigger ones anyway. This I could see taking up separately, too.

For dataframe.py, I actually didn't see many places this comes up. I found sampleBy, which supports string or Column, but approxQuantile doesn't seem to. drop might make sense taking only strings, as it does now. It could be reasonable to fix approxQuantile here, or in a follow up.

@asmello
Author

asmello commented Feb 24, 2019

I think it's OK to focus here on fixing the duplicated definition, and the small change to _create_function that lets it work consistently with other similar Pyspark methods.

Ok, I will change this PR so that we turn lit(), col(), column(), asc() and desc() into explicit forms instead of the string/binary functions and abs(). I will also double check whether the _create_function change affects anything else.

I found sampleBy, which supports string or Column, but approxQuantile doesn't seem to.

Thanks for looking into this, I haven't had the time. I will also update the PR to include a change there so that it supports Column objects too. When I have time I will also look for other cases like this, but it might take me a day or two before I can work on these changes at all.

@HyukjinKwon
Member

HyukjinKwon commented Feb 25, 2019

Yes, anyway, let's whitelist what we're fixing in this PR. Also, it should really be fully checked. Please check the other APIs in functions.py and dataframe.py. What I am worried about is whether there's a place where supporting both Column and string doesn't make sense - we should really whitelist. If it doesn't make sense, we should consider removing that string argument support.

On the PySpark side this can also easily be done with lower(df.col). It doesn't look too bothersome to me, actually. The "$" syntax is Scala specific and it should stay in Scala. Python specific stuff stays in Python alone.

@HyukjinKwon
Member

Also, we should stick to either all strings or all columns on the Scala side in the end. If we don't think it's worth fixing the Scala side for now, I don't see that it's worth fixing the Python side either, to be honest.

@srowen
Member

srowen commented Feb 25, 2019

If we did it over again, I think Scala should support Column only for simplicity and 'support' String with the $ implicit. Making Python also support both is a step in the right direction, so not worried about this partial fix.

@asmello
Author

asmello commented Feb 25, 2019

On the PySpark side this can also easily be done with lower(df.col)

I consider this an anti-pattern, actually. Defining an explicit dependency on the df dataframe has several problems (see the snippet after this list):

  • You might rename the dataframe variable, and then this breaks - once for every column you use;
  • You rely on the variable name being very short for this to be convenient to write;
  • If your column name has spaces or other unsupported characters, you have to access it by df["foo bar"], which is just as bad as col("foo bar");
  • In some contexts, like in a select statement, newly defined columns cannot be accessed this way;
  • Assigning an abstract expression like col("prod_status") == 'Delivered' to a variable makes it reusable for multiple dataframes, while df.prod_status == 'Delivered' is bound to df;
  • This will never be as clean and readable as simply passing the column name as a string.
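
A small sketch of the reusability point (the orders and returns DataFrames here are hypothetical, just for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
orders = spark.createDataFrame([("o1", "Delivered")], ["order_id", "prod_status"])
returns = spark.createDataFrame([("r1", "Pending")], ["return_id", "prod_status"])

# An abstract expression is not bound to any particular DataFrame...
is_delivered = col("prod_status") == "Delivered"
orders.filter(is_delivered).show()
returns.filter(is_delivered).show()

# ...whereas an attribute reference only works for the DataFrame it came from.
orders.filter(orders.prod_status == "Delivered").show()
```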

@cloud-fan
Contributor

I haven't read through the entire thread, just want to provide some background.

At the beginning, having a string version of a function was considered good for UX. However, as we added more and more functions, problems occurred.

When a function takes a column, the column argument can be one of 3 types:

  1. a Column
  2. a String
  3. a java type of the column literal, like double, int, etc.

2 and 3 may conflict if it's a string column.

I think a reasonable rule is giving up 3. In my experience, passing a column name string is more common than passing a literal for the column.
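
The when() example that comes up later in this thread shows the 2-vs-3 conflict concretely; a condensed version (in a pyspark shell where `spark` is already defined):

```python
from pyspark.sql.functions import when, lit, col

# Is the string "id" the literal 'id' or a reference to the column named id?
# Today PySpark treats it as a literal here:
spark.range(1).select(when(lit(True), "id")).show()       # yields the literal value "id"
spark.range(1).select(when(lit(True), col("id"))).show()  # yields the column value 0
```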

@srowen
Member

srowen commented Feb 25, 2019

Side comment: I also don't find the df.col syntax very usable in practice for the reasons above. It's necessary when the column name by itself is ambiguous (e.g. in a join), but df["col"] is more reliable there IMHO.

@srowen
Member

srowen commented Mar 2, 2019

@asmello I think you're clear to proceed as per your last comment.

@asmello
Author

asmello commented Mar 2, 2019

@srowen ok, I'm whitelisting dataframe.py now, and I'll post an update when I'm done double checking functions.py too.

@asmello
Author

asmello commented Mar 2, 2019

Ok, this was tedious. I've whitelisted all functions defined in functions.py, save for the previously identified exceptions, which were all being defined automatically.

All these functions take columns as arguments, and they explicitly handle Column/name duality:

approx_count_distinct
coalesce
corr
covar_pop
covar_samp
countDistinct
first
grouping
grouping_id
isnan
isnull
last
nanvl
round
bround
shiftLeft
shiftRight
shiftRightUnsigned
struct
greatest
least
when
log
log2
conv
factorial
lag
col
year
quarter
month
dayofweek
dayofmonth
dayofyear
hour
minute
second
weekofyear
date_add
date_sub
datediff
add_months
months_between
to_date
to_timestamp
trunc
date_trunc
next_day
last_day
from_unixtime
unix_timestamp
from_utc_timestamp
to_utc_timestamp
window
crc32
md5
sha1
sha2
hash
concat_ws
decode
encode
format_number
format_string
instr
substring
substring_index
levenshtein
locate
lpad
rpad
repeat
split
regexp_extract
regexp_replace
soundex
bin
hex
unhex
length
translate
create_map
map_from_arrays
array
array_contains
arrays_overlap
slice
concat
array_position
element_at
array_remove
array_distinct
array_intersect
array_union
array_except
explode
posexplode
explode_outer
posexplode_outer
get_json_object
json_tuple
from_json
to_json
schema_of_json
schema_of_csv
col
size
array_min
array_max
sort_array
array_sort
shuffle
reverse
flatten
map_keys
map_values
map_entries
map_from_entries
array_repeat
arrays_zip
map_concat
sequence
from_csv

@HyukjinKwon you mentioned from_json(), but that's not a true exception. The schema argument is not meant to be an actual column, it's a column specification, so it's a completely different case. That same function also takes an actual column to operate on, which it explicitly accepts as a name or instance. So there's absolutely no problem there.

As for the dataframe methods in dataframe.py, these functions also take column names or instances and do explicit conversion:

join
sampleBy
sort
select
groupBy
rollup
cube
dropna
fillna
replace
drop
toDF
describe
dropDuplicates
sortWithinPartitions

I was even a little surprised by drop(), I thought it would only take names, but it does handle both cases. EDIT: Sorry, I double checked and it does support both, but only when you give a single argument. If you drop more than one column, it enforces a check for strings. The reason is that there is no Seq[Column] variant on the Scala side.
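
A quick sketch of that drop() behaviour (in a pyspark shell where `spark` is already defined; the exact error wording is approximate):

```python
from pyspark.sql.functions import lit

df = spark.range(3).withColumn("a", lit(1)).withColumn("b", lit(2))

df.drop("a")          # ok: a single column name
df.drop(df.a)         # ok: a single Column object is also accepted
df.drop("a", "b")     # ok: multiple names
df.drop(df.a, df.b)   # error: with multiple arguments, only strings are accepted
```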

The following function is also 100% compliant with both API styles, but it relies on the jvm backend to do the conversion:

__getitem__

This is something that should be kept in mind should the Scala API change. EDIT: this list was larger, but I realised I'd mistaken _jseq for a jvm function when it's really not.

The following functions are violations, they only take strings:

approxQuantile
corr
cov
crosstab
freqItems

The reason these are exceptions is that they're implemented in org.apache.spark.sql.DataFrameStatFunctions, and for some reason almost all functions there only take name arguments. The PySpark side never converts from a Column object to a string, so it also only supports the name spec. This is something that should be fixed on the Scala side IMO.
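
For instance, a sketch with approxQuantile standing in for the other DataFrameStatFunctions wrappers (in a pyspark shell where `spark` is already defined):

```python
from pyspark.sql.functions import col

df = spark.range(100).withColumn("x", col("id") * 1.0)

df.approxQuantile("x", [0.5], 0.01)        # ok: only a column name (or list of names) is accepted
df.approxQuantile(col("x"), [0.5], 0.01)   # fails: a Column object is rejected by the type check
```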

Since this patch is about making name support universal, however, I don't think it falls under scope to do anything about the drop() and stats violations for now.

One other special case I should mention is colRegex, which also only takes strings. But in that case, it just wouldn't make sense to take Column objects (how would you specify a regex that way?), so that is fine.

With these checks in place, I'm confident the list of exceptions in this PR's description is exhaustive. I will work on solving those now as agreed (by redefining the _create_function argument passing strategy to do explicit conversion).

This reverses the previous strategy of breaking out inconsistent
functions from the automated mechanism, and solves the inconsistency
problem by instead always passing a Column object to the jvm callee.

A few auto-gen functions, however, can only take a column name as
input - and that is as it should be, semantically:

- lit()
- col()
- column()
- asc()
- desc()
- asc_nulls_first()
- asc_nulls_last()
- desc_nulls_first()
- desc_nulls_last()

To avoid breaking those functions, a different kind of automation
mechanism is introduced, and applied to those only. This is just
the previous implementation renamed.

The reason the original exceptions aren't being handled specially
instead is that the original mechanism was inconsistent with how
most jvm functions are called. The functions listed above are the
true exceptions that should be handled specially, since it would
not make sense for them to accept column objects like the others.
@asmello
Author

asmello commented Mar 2, 2019

Alright, as noted in the commit, I've restored the original exceptions to be handled by the mechanism, and handled these separately:

  • lit()
  • col()
  • column()
  • asc()
  • desc()
  • asc_nulls_first()
  • asc_nulls_last()
  • desc_nulls_first()
  • desc_nulls_last()

The original mechanism has been renamed and applied to those only. The new mechanism is consistent with how most functions are handled outside the mechanism - it converts the argument to a Column object by default.

@asmello asmello changed the title from "[SPARK-26979][PySpark][WIP] Add missing column name support for some SQL functions" to "[SPARK-26979][PySpark] Add missing column name support for some SQL functions" on Mar 2, 2019
@asmello
Author

asmello commented Mar 3, 2019

@srowen @HyukjinKwon we should be good to go now.

@srowen
Member

srowen commented Mar 14, 2019

I ran another test build to be sure, and it failed, but I'm pretty sure it's spurious. @shaneknapp, for some weird reason builds sometimes fail because the Java style checker refers to old DTDs that have disappeared. We fixed that in master a week or two ago. I don't know why.

I'll trigger another build. If it passes I think this is good to go.

@SparkQA

SparkQA commented Mar 14, 2019

Test build #4623 has finished for PR 23882 at commit 23d0222.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shaneknapp
Contributor

shaneknapp commented Mar 14, 2019 via email

@@ -85,13 +96,16 @@ def _():
>>> df.select(lit(5).alias('height')).withColumn('spark_user', lit(True)).take(1)
[Row(height=5, spark_user=True)]
"""
_functions = {
_name_functions = {
# name functions take a column name as their argument
'lit': _lit_doc,
Member

Does lit take the string as a column name? How do we create a string literal?

Member

And what really is a "name function"?

Author

No, lit() takes a literal value and creates a column with that literal. That's what it's for. It doesn't make sense for it to accept a column name, given its nature. So if you give it a string it will create a column with that string literal.

The name "name function" is something I came up with just to distinguish these functions from the ones that take columns as input. They are defined by that distinction - they are "functions that take a column name as their argument", exclusively.

Member

lit doesn't take a column name as its argument. Why did we use the "name function" category for lit?

Author

To be fair, lit is such a unique function semantically that it would warrant its own category. But since the implementation is exactly the same as those "name functions", I left it there for practical purposes.

Member

One (lit) out of five (col, column, asc, desc, lit) doesn't sound like a special case, though. It should have had a better category if 20% of the items don't fit the category.

@srowen srowen closed this in f9180f8 Mar 17, 2019
@srowen
Member

srowen commented Mar 17, 2019

Merged to master

@HyukjinKwon
Member

What happened to math functions?

>>> from pyspark.sql.functions import atan2
>>> spark.range(1).select(atan2("id", "id"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/functions.py", line 78, in _
    jc = getattr(sc._jvm.functions, name)(col1._jc if isinstance(col1, Column) else float(col1),
ValueError: could not convert string to float: id
>>> from pyspark.sql.functions import atan2, col
>>> spark.range(1).select(atan2(col("id"), col("id")))
DataFrame[ATAN2(id, id): double]

@HyukjinKwon
Member

@srowen, I haven't finished my review yet. It's only been 2 days, and those were the weekend.

@HyukjinKwon
Member

Math cases are difficult to fix because of the existing support:

def _create_binary_mathfunction(name, doc=""):
    """ Create a binary mathfunction by name"""
    def _(col1, col2):
        sc = SparkContext._active_spark_context
        # users might write ints for simplicity. This would throw an error on the JVM side.
        jc = getattr(sc._jvm.functions, name)(col1._jc if isinstance(col1, Column) else float(col1),
                                              col2._jc if isinstance(col2, Column) else float(col2))
        return Column(jc)
    return _

Therefore, the col1 argument could be a string number, for instance "1", but it could also be the name of a column. Here's the ambiguity. Since we're going ahead with Spark 3.0, I am going to disallow strings as numbers in a followup.
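
A sketch of the ambiguity (hypothetical call, in a pyspark shell where `spark` is already defined):

```python
from pyspark.sql.functions import atan2, col

# With the existing behaviour (still in place after this PR), a string argument
# is coerced with float(...), so "2" means the number 2.0:
spark.range(1).select(atan2("2", col("id")))   # today: atan2(2.0, id)

# If strings become column names instead, the very same call would mean
# "look up a column named 2" -- that is the ambiguity being discussed.
```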

@HyukjinKwon
Member

There is another exception, when() which takes columns and literals:

>>> spark.range(1).select(when(lit(True), col("id"))).show()
+--------------------------+
|CASE WHEN true THEN id END|
+--------------------------+
|                         0|
+--------------------------+
>>> spark.range(1).select(when(lit(True), "id")).show()
+--------------------------+
|CASE WHEN true THEN id END|
+--------------------------+
|                        id|
+--------------------------+

Here's another ambiguity.

'initcap': 'Returns a new string column by converting the first letter of each word to ' +
'uppercase. Words are delimited by whitespace.',
'lower': 'Converts a string column to lower case.',
'upper': 'Converts a string column to upper case.',
Member

This had to stay in string functions!

Author

initcap was also defined explicitly below, with better documentation, so I preserved that instance. lower and upper were introduced in the 1.3 API, and were being defined there as well, so I prioritised that instance because it used the correct @since number. Arguably we could've used a separate mapping like string_functions_1_3 or something, but that didn't seem appropriate since the point of these is to minimise code.

Member

I was saying lower and upper. It looks like they had been overwritten by this, so if we keep them here, it keeps the previous definition. Did you check which PR and JIRAs added lower and upper?

Also, strictly speaking, this should not have been removed in this PR, as it doesn't target removing overwritten functions. As you said, we should avoid such a way of defining functions later.

Member

Also, if it was an exception, it should have been described specifically.

    # name functions take a column name as their argument
    'lit': _lit_doc,

This doesn't make sense if you read the code from scratch.

Author

The java/scala API documentation says it was added in 1.3, but I just tracked down the JIRA/PR and it seems it actually was 1.0.

https://issues.apache.org/jira/browse/SPARK-1995
#936

As for removing overwritten functions, maybe it would've been better to make a separate PR, but the first fix did require removing them. When I changed the approach it seemed reasonable to keep the change, since the problem was obvious and easy to fix.

@HyukjinKwon
Member

Why was ascii in the string functions an exception?

>>> spark.range(1).select(lit('a').alias("value")).select(ascii(col("value"))).show()
+------------+
|ascii(value)|
+------------+
|          97|
+------------+
>>> spark.range(1).select(lit('a').alias("value")).select(ascii("value")).show()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/functions.py", line 51, in _
    jc = getattr(sc._jvm.functions, name)(col._jc if isinstance(col, Column) else col)
  File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1286, in __call__
  File "/.../spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py", line 332, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.ascii. Trace:
py4j.Py4JException: Method ascii([class java.lang.String]) does not exist
	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:339)
	at py4j.Gateway.invoke(Gateway.java:276)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

'initcap': 'Returns a new string column by converting the first letter of each word to ' +
'uppercase. Words are delimited by whitespace.',
'lower': 'Converts a string column to lower case.',
'upper': 'Converts a string column to upper case.',
'ltrim': 'Trim the spaces from left end for the specified string value.',
'rtrim': 'Trim the spaces from right end for the specified string value.',
'trim': 'Trim the spaces from both ends for the specified string column.',
Member

Did you not add the column name support because the Scala side has these signatures below?

  def trim(e: Column): Column
  def trim(e: Column, trimString: String): Column

That's not allowed in Python:

>>> from pyspark.sql.functions import trim, lit
>>> spark.range(1).select(lit('a').alias("value")).select(trim("value", "a"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: _() takes exactly 1 argument (2 given)

Member

@HyukjinKwon HyukjinKwon Mar 18, 2019

This is what I initially expected when I asked to whitelist them, @asmello

Author

I don't understand what you're saying here. trim with 2 arguments was never supported in PySpark, and that's a separate issue. What this patch changes is that trim("value") is now supported, when it wasn't previously.

Member

I was saying this because I didn't get why you excluded string functions.

@asmello
Author

asmello commented Mar 18, 2019

Why was ascii in the string functions an exception?

Same reason as the other string functions; they were being defined using the automatic mechanism, which assumed the argument was either a Column object or a literal. So if it was a string (column name) and the Scala API didn't have an overload for that, it would fail.

@asmello
Author

asmello commented Mar 18, 2019

There is another exception, when() which takes columns and literals:

Yes this is a potential source of confusion, but it's more complicated to fix because it would require breaking the current API.

@HyukjinKwon
Member

Same reason as the other string functions; they were being defined using the automatic mechanism, which assumed the argument was either a Column object or a literal. So if it was a string (column name) and the Scala API didn't have an overload for that, it would fail.

What do you mean? Doesn't this PR aim to have functions that take strings as column names across the PySpark function API?

@HyukjinKwon
Member

Yes this is a potential source of confusion, but it's more complicated to fix because it would require breaking the current API.

Yes, this had to be identified and whitelisted before asking if we need something else at this point.

@asmello
Author

asmello commented Mar 18, 2019

Yes, this had to be identified and whitelisted before asking if we need something else at this point.

No, this is a different problem. It's not about the API not supporting a column name, it's about it supporting a literal instead. It's outside the scope here, which I kept restricted specifically so we wouldn't block on other problems like this.

@HyukjinKwon
Member

No, this is a different problem. It's not about the API not supporting a column name, it's about it supporting a literal instead.

The whole problem we discussed in this PR is whether we're going to support strings as column names in general, no? And there look to be a few cases of ambiguity here and there.

@asmello
Author

asmello commented Mar 18, 2019

What do you mean? Doesn't this PR aim to have functions that take strings as column names across the PySpark function API?

Yes. I was saying the previous implementation was problematic because of that.

@HyukjinKwon
Member

Then why didn't we handle all the other cases I listed above? It's something you missed. I don't know why you're trying to justify your mistakes.

@asmello
Author

asmello commented Mar 18, 2019

Then why didn't we handle all the other cases I listed above? It's something you missed. I don't know why you're trying to justify your mistakes.

I didn't miss those, I deliberately didn't tackle them because 1) they were a different kind of problem, and 2) the fix is not simple, as it requires breaking the API. The only thing I did miss was the problem with the math functions, and I'm happy to have those fixed.

I'm tired of arguing this with you, the PR was open for weeks and I tried to fix every problem you pointed out. What's the point of complaining now that it has been merged? I don't know why you antagonise me so much.

@HyukjinKwon
Member

Because I was deciding whether I should revert or make a followup. If it gets reverted, the reasons should be left on record, and I am arguing because you're trying to justify the issues that I listed.

This PR targets supporting strings as columns in general, and some missing cases were found that don't seem to have been mentioned or discussed.

The main reason I was initially worried is that we should first see whether it makes sense to support strings as columns in PySpark's API (and, further, in the other languages' APIs). I think other people also expressed this concern. It doesn't look like this concern has been addressed properly.

@HyukjinKwon
Member

And this PR in particular needed a lot of effort. We shouldn't fix something as just-work-for-now anymore, in particular when it repeatedly becomes an issue.

@asmello
Author

asmello commented Mar 18, 2019

It's definitely better after the merge than before, so I see no reason to revert. Though due to the missing math functions fix, I agree a follow up is required.

As for cases like when() and array_contains(), that warrants a deeper discussion. Again, it's not that support for column names is missing there, it's just that those accept string literals, which take precedence in the API. In a way, supporting columns at all is an after-thought there, as their original formulation only supported literals. So we either remove columns from there entirely (at the cost of losing functionality) or we drop the literal support (which redefines the API in a big way). But neither has anything to do with supporting column names, so this should be reserved for another PR.

The main reason I was initially worried is that we should first see whether it makes sense to support strings as columns in PySpark's API

That's a valid concern, but, again, it's better to be consistent for now. It's not a "just-work-for-now" situation, either, it's a full fix for the consistency problem. String support itself is a related, but separate discussion, too.

@HyukjinKwon
Member

HyukjinKwon commented Mar 18, 2019

I already explained the reasons to revert. If we see some ambiguity, we should consider not officially supporting strings as columns, or such a discussion should be concluded first.

I made a followup to fix the missing instances here and to conclude the discussion. If that drags on, I'm going to revert this PR.

It's confusing what a string means in the PySpark functions because it has two semantically different meanings here.

@srowen
Member

srowen commented Mar 19, 2019

I think this change was fine and will review the follow on too. I wouldn't revert this. I think we decided to keep this to a narrow fix for consistency. Anything else can be a future change or discussion. It would return this to a worse state if it's reverted.

HyukjinKwon added a commit that referenced this pull request Mar 19, 2019
…ke string as columns as well

## What changes were proposed in this pull request?

This is a followup of #23882 to handle binary math/string functions. For instance, see the cases below:

**Before:**

```python
>>> from pyspark.sql.functions import lit, ascii
>>> spark.range(1).select(lit('a').alias("value")).select(ascii("value"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/functions.py", line 51, in _
    jc = getattr(sc._jvm.functions, name)(col._jc if isinstance(col, Column) else col)
  File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1286, in __call__
  File "/.../spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py", line 332, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.ascii. Trace:
py4j.Py4JException: Method ascii([class java.lang.String]) does not exist
	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:339)
	at py4j.Gateway.invoke(Gateway.java:276)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
```

```python
>>> from pyspark.sql.functions import atan2
>>> spark.range(1).select(atan2("id", "id"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/functions.py", line 78, in _
    jc = getattr(sc._jvm.functions, name)(col1._jc if isinstance(col1, Column) else float(col1),
ValueError: could not convert string to float: id
```

**After:**

```python
>>> from pyspark.sql.functions import lit, ascii
>>> spark.range(1).select(lit('a').alias("value")).select(ascii("value"))
DataFrame[ascii(value): int]
```

```python
>>> from pyspark.sql.functions import atan2
>>> spark.range(1).select(atan2("id", "id"))
DataFrame[ATAN2(id, id): double]
```

Note that,

- This PR causes a slight behaviour change for math functions. For instance, numbers as strings (e.g., `"1"`) were supported as arguments of binary math functions before. After this PR, they are recognised as column names.

- I also intentionally didn't document this behaviour change since we're going ahead with Spark 3.0 and I don't think numbers as strings make much sense in math functions.

- There is another exception, `when`, which takes strings as literal values, as below. This PR doesn't fix this ambiguity.
  ```python
  >>> spark.range(1).select(when(lit(True), col("id"))).show()
  ```

  ```
  +--------------------------+
  |CASE WHEN true THEN id END|
  +--------------------------+
  |                         0|
  +--------------------------+
  ```

  ```python
  >>> spark.range(1).select(when(lit(True), "id")).show()
  ```

  ```
  +--------------------------+
  |CASE WHEN true THEN id END|
  +--------------------------+
  |                        id|
  +--------------------------+
  ```

This PR also makes the following change:

#23882 changed the code to:

- Rename `_create_function` to `_create_name_function`
- Define a new `_create_function` that takes strings as column names.

In this PR, I propose to:

- Revert the `_create_name_function` name back to `_create_function`.
- Define a new `_create_function_over_column` that takes strings as column names.

## How was this patch tested?

Some unit tests were added for binary math / string functions.

Closes #24121 from HyukjinKwon/SPARK-26979.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
@HyukjinKwon
Member

Sorry if I missed a discussion thread where we agreed on fixing this partially and reached an explicit conclusion. If not, I don't think we should fix such a thing partially without an explicit conclusion - it's been raised multiple times.
