
[SPARK-26979][PYTHON] Add missing string column name support for some SQL functions #23882

Closed
wants to merge 4 commits into from

Conversation


@asmello asmello commented Feb 24, 2019

What changes were proposed in this pull request?

Most SQL functions defined in spark.sql.functions have two calling patterns, one with a Column object as input, and another with a string representing a column name, which is then converted into a Column object internally.

There are, however, a few notable exceptions:

  • lower()
  • upper()
  • abs()
  • bitwiseNOT()
  • ltrim()
  • rtrim()
  • trim()
  • ascii()
  • base64()
  • unbase64()

While this doesn't break anything, as you can easily create a Column object yourself prior to passing it to one of these functions, it has two undesirable consequences:

  1. It is surprising - it breaks coders' expectations when they are first starting with Spark. Every API should be as consistent as possible, so as to make the learning curve smoother and to reduce causes of human error;

  2. It gets in the way of stylistic conventions. Most of the time it makes Python code more readable to use literal names, and the API provides ample support for that, but these few exceptions prevent this pattern from being universally applicable.

This patch is meant to fix the aforementioned problem.
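
For illustration, a minimal sketch of the inconsistency before this patch (run in a pyspark shell where `spark` is already defined; the DataFrame is just an example):

```python
from pyspark.sql.functions import col, length, lower

df = spark.createDataFrame([("Alice",)], ["name"])

df.select(length("name"))      # works: length() accepts a column name string
df.select(lower(col("name")))  # works: lower() accepts a Column object
df.select(lower("name"))       # fails before this patch: the JVM lower() has no String overload
```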

Effect

This patch enables support for passing column names as input to those functions mentioned above.

Side effects

This PR also fixes an issue where some functions were defined multiple times via _create_function().

How it works

_create_function() was redefined to always convert the argument to a Column object. The old implementation has been kept under _create_name_function(), and is still being used to generate the following special functions:

  • lit()
  • col()
  • column()
  • asc()
  • desc()
  • asc_nulls_first()
  • asc_nulls_last()
  • desc_nulls_first()
  • desc_nulls_last()

This is because these functions can only take a column name as their argument. This is not a problem, as their semantics require it.
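
Roughly, the two helpers look like the sketch below (simplified, not the exact merged code). _create_function now routes its argument through _to_java_column, so both Column objects and name strings work, while _create_name_function keeps the old pass-through behaviour for the name-only functions above:

```python
from pyspark import SparkContext
from pyspark.sql.column import Column, _to_java_column


def _create_function(name, doc=""):
    """Create a function that accepts either a Column or a column name string."""
    def _(col):
        sc = SparkContext._active_spark_context
        # _to_java_column turns a string into a JVM column reference and
        # passes an existing Column's _jc through unchanged.
        jc = getattr(sc._jvm.functions, name)(_to_java_column(col))
        return Column(jc)
    _.__name__ = name
    _.__doc__ = doc
    return _


def _create_name_function(name, doc=""):
    """Create a function that passes its argument (a name or literal) through as-is."""
    def _(col):
        sc = SparkContext._active_spark_context
        jc = getattr(sc._jvm.functions, name)(col._jc if isinstance(col, Column) else col)
        return Column(jc)
    _.__name__ = name
    _.__doc__ = doc
    return _
```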

How was this patch tested?

Ran ./dev/run-tests and tested it manually.

Almost all functions in the SQL API have support for taking in
column names in place of Column objects. This patch eliminates the
few exceptions to make the API more consistent.

Affected functions are:
- lower()
- upper()
- abs()
- bitwiseNOT()

As a side-effect, this fixes the redefinition of lower() and
upper(), which were being defined at lines 95-96 and again at
1442-1443.
As a side-effect, fixed double definition of initcap().
@asmello
Author

asmello commented Feb 24, 2019

@HyukjinKwon let's continue the discussion from PR #23879 here.

@HyukjinKwon
Member

I still think we should first decide whether we're going to drop the string variants on the Scala side. Although removing the string variants looks preferable, there's a compatibility concern, for instance.

Ideally we should either consistently allow or consistently disallow this across APIs. Yes, I do think it's something we should fix if we see consensus either way. Would you be interested in starting a discussion thread on the mailing list?

@HyukjinKwon
Member

HyukjinKwon commented Feb 24, 2019

Adding @cloud-fan and @jsnowacki, whom I remember discussing this with before, @maropu from the previous PR, and @srowen and @rxin, who have better insight from an API perspective (imho).

It's about whether we should consistently allow strings as columns in the functions API or not.

@asmello
Author

asmello commented Feb 24, 2019

Would you be interested in starting a discussion thread on the mailing list?

I can do that. 👍

@HyukjinKwon
Member

Adding @MaxGekk who also I talked with before.

@srowen
Member

srowen commented Feb 24, 2019

So this removes support for column names as strings in these functions? Or adds some support? It looks like the former, but the description suggests the latter. Why remove this overload?
Either way, why do the functions need to be declared by hand rather than with the function mechanism? I was missing that part. I see some are declared twice; which ones?

@asmello
Author

asmello commented Feb 24, 2019

@srowen I edited the description to make this clearer. The patch adds support for column names as arguments to those functions. The function mechanism was removed in those instances because it provided no support for this, so those functions had to be defined explicitly instead.

The ones I've seen declared twice were lower(), upper() and initcap(). Since the second definition of initcap() was explicit and better documented, I only removed the implicit instance in that case, but otherwise I replaced both implicit definitions with a new explicit one.

The function mechanism is quite dangerous in that it allows this kind of duplication. Is there a good reason for it to be there at all?

@srowen
Member

srowen commented Feb 24, 2019

Agree about removing the duplicates from these lists of course.
I agree this method of declaring can be error-prone; I fixed another one recently. It's there to avoid repeating a bunch of code. I don't think it's so bad or worth changing here.

So, is the difference between, say, abs and sqrt that the Scala side has overloads for Column and String only for sqrt, not abs? Makes sense.

If _create_function made use of _to_java_column, would this be resolved for all these functions without breaking them out? Do the mapping on the Python side, always.

I see, the inconsistency still exists on the Scala side. That can be considered separately.

@asmello
Author

asmello commented Feb 24, 2019

I agree this method of declaring can be error-prone; I fixed another one recently. It's there to avoid repeating a bunch of code. I don't think it's so bad or worth changing here.

I can see the reasoning for the mechanism, but since this duplication has happened several times already, maybe it would be worth adding tests for this? I don't think the explicit approach is much worse anyway, especially if better (multiline, with examples) documentation is added. I'm not particularly bothered by this, but I can also see it causing more problems in the future.

If _create_function made use of _to_java_column, would this be resolved for all these functions without breaking them out?

I think so, except we then have to break out lit(), col(), column(), asc() and desc()...

@srowen
Member

srowen commented Feb 24, 2019

I'm not against breaking out all the functions, just might be a lot of work for you.
Yes, we do need tests, ideally. And I suppose that's what doctest is good for, and that would require breaking these out.
If you want to go all the way to break out with brief doctests, I think that's OK. Otherwise I'd leave it for now.

I like the idea of solving it in one go by modifying _create_function as it would be more consistent. Yes you're right that also requires special-casing.

I'm missing why asc and desc are special; they also operate on columns so can take strings too?

@asmello
Author

asmello commented Feb 24, 2019

I'm not against breaking out all the functions, just might be a lot of work for you.

I will not make that change now, especially since it's not directly relevant to this issue, but good to know that you're supportive.

I'm missing why asc and desc are special; they also operate on columns so can take strings too?

They only take strings, that is the problem. If we modify _create_function to always convert to Column, it would break those functions - and maybe others too; I haven't fully checked. Alternatively we could define another version of _create_function just for those, but I really question whether that would be any better.

@HyukjinKwon
Member

I actually think we should consider both together while we're here. Some Scala APIs have string versions, and most of the Python ones do too. I've already seen many requests about this across APIs.

Also, if we're going to change this in Python alone, let's make very sure that all column references can be done with strings, for instance in dataframe.py too. I'm pretty sure there are holes here and there.

@HyukjinKwon
Member

It should really be fully checked. As I said, some functions like from_json take a Column but also a DataType.

@srowen
Member

srowen commented Feb 24, 2019

I see. Maybe there's a good reason asc/desc should only take a named column, not sure. Well, we can leave that alone for now if in doubt.

I think it's OK to focus here on fixing the duplicated definition, and the small change to _create_function that lets it work consistently with other similar Pyspark methods.

On the Scala side, on second thought, I'm not sure whether it's better to add all the String-based methods, or deprecate them. On the Scala side the syntax is $"col" so there's already just 1 character of difference for the caller. On the Pyspark side it happens to be easy to support both versions with no additional declarations, and there's an argument that supporting strings is its best attempt to mimic the "$" syntax. Still I take the point about difference across APIs not being great, but there are going to be bigger ones anyway. This I could see taking up separately, too.

For dataframe.py, I actually didn't see many places this comes up. I found sampleBy, which supports string or Column, but approxQuantile doesn't seem to. drop might make sense taking only strings, as it does now. It could be reasonable to fix approxQuantile here, or in a follow up.

@asmello
Author

asmello commented Feb 24, 2019

I think it's OK to focus here on fixing the duplicated definition, and the small change to _create_function that lets it work consistently with other similar Pyspark methods.

Ok, I will change this PR so that we turn lit(), col(), column(), asc() and desc() into explicit forms instead of the string/binary functions and abs(). I will also double check whether the _create_function change affects anything else.

I found sampleBy, which supports string or Column, but approxQuantile doesn't seem to.

Thanks for looking into this, I haven't had the time. I will also update the PR to include a change there so that it supports Column objects too. When I have time I will also look for other cases like this, but it might take me a day or two before I can work on these changes at all.

@HyukjinKwon
Member

HyukjinKwon commented Feb 25, 2019

Yes, anyway, let's whitelist what we're fixing in this PR. Also, it should really be fully checked. Please check the other APIs in functions.py and dataframe.py. What I am worried about is whether there's a place where supporting both Column and string doesn't make sense - we should really whitelist. If it doesn't make sense, we should consider removing that string argument support.

On the PySpark side this can also easily be done with lower(df.col). It doesn't look too bothersome to me, actually. The "$" syntax is Scala specific and it should stay in Scala. Python specific stuff stays in Python alone.

@HyukjinKwon
Member

Also, we should stick to either all strings or all columns on the Scala side in the end. If we don't think it's worth fixing the Scala side for now, I don't see that it's worth fixing the Python side either, to be honest.

@srowen
Member

srowen commented Feb 25, 2019

If we did it over again, I think Scala should support Column only for simplicity and 'support' String with the $ implicit. Making Python also support both is a step in the right direction, so not worried about this partial fix.

@asmello
Author

asmello commented Feb 25, 2019

On the PySpark side this can also easily be done with lower(df.col)

I consider this an anti-pattern, actually. Defining an explicit dependency on the df dataframe has several problems (see the snippet after this list):

  • You might rename the dataframe variable, and then this breaks - once for every column you use;
  • You rely on the variable name being very short for this to be convenient to write;
  • If your column name has spaces or other unsupported characters, you have to access it by df["foo bar"], which is just as bad as col("foo bar");
  • In some contexts, like in a select statement, newly defined columns cannot be accessed this way;
  • Assigning an abstract expression like col("prod_status") == 'Delivered' to a variable makes it reusable for multiple dataframes, while df.prod_status == 'Delivered' is bound to df;
  • This will never be as clean and readable as simply passing the column name as a string.
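
A small sketch of the reusability point (the orders and returns DataFrames here are hypothetical, just for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
orders = spark.createDataFrame([("o1", "Delivered")], ["order_id", "prod_status"])
returns = spark.createDataFrame([("r1", "Pending")], ["return_id", "prod_status"])

# An abstract expression is not bound to any particular DataFrame...
is_delivered = col("prod_status") == "Delivered"
orders.filter(is_delivered).show()
returns.filter(is_delivered).show()

# ...whereas an attribute reference only works for the DataFrame it came from.
orders.filter(orders.prod_status == "Delivered").show()
```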

@cloud-fan
Contributor

I haven't read through the entire thread, just want to provide some background.

At the beginning, having a string version of a function was considered good for UX. However, as we added more and more functions, problems occurred.

When a function takes a column, the column argument can be one of 3 types:

  1. a Column
  2. a String
  3. a java type of the column literal, like double, int, etc.

2 and 3 may conflict if it's a string column.

I think a reasonable rule is giving up 3. In my experience, passing a column name string is more common than passing a literal for the column.
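
The when() example that comes up later in this thread shows the 2-vs-3 conflict concretely; a condensed version (in a pyspark shell where `spark` is already defined):

```python
from pyspark.sql.functions import when, lit, col

# Is the string "id" the literal 'id' or a reference to the column named id?
# Today PySpark treats it as a literal here:
spark.range(1).select(when(lit(True), "id")).show()       # yields the literal value "id"
spark.range(1).select(when(lit(True), col("id"))).show()  # yields the column value 0
```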

@srowen
Member

srowen commented Feb 25, 2019

Side comment: I also don't find the df.col syntax very usable in practice for the reasons above. It's necessary when the column name by itself is ambiguous (e.g. in a join), but df["col"] is more reliable there IMHO.

@srowen
Member

srowen commented Mar 2, 2019

@asmello I think you're clear to proceed as per your last comment.

@asmello
Author

asmello commented Mar 2, 2019

@srowen ok, I'm whitelisting dataframe.py now, and I'll post an update when I'm done double checking functions.py too.

@asmello
Author

asmello commented Mar 2, 2019

Ok, this was tedious. I've whitelisted all functions defined in functions.py, save for the previously identified exceptions, which were all being defined automatically.

All these functions take columns as arguments, and they explicitly handle Column/name duality:

approx_count_distinct
coalesce
corr
covar_pop
covar_samp
countDistinct
first
grouping
grouping_id
isnan
isnull
last
nanvl
round
bround
shiftLeft
shiftRight
shiftRightUnsigned
struct
greatest
least
when
log
log2
conv
factorial
lag
col
year
quarter
month
dayofweek
dayofmonth
dayofyear
hour
minute
second
weekofyear
date_add
date_sub
datediff
add_months
months_between
to_date
to_timestamp
trunc
date_trunc
next_day
last_day
from_unixtime
unix_timestamp
from_utc_timestamp
to_utc_timestamp
window
crc32
md5
sha1
sha2
hash
concat_ws
decode
encode
format_number
format_string
instr
substring
substring_index
levenshtein
locate
lpad
rpad
repeat
split
regexp_extract
regexp_replace
soundex
bin
hex
unhex
length
translate
create_map
map_from_arrays
array
array_contains
arrays_overlap
slice
concat
array_position
element_at
array_remove
array_distinct
array_intersect
array_union
array_except
explode
posexplode
explode_outer
posexplode_outer
get_json_object
json_tuple
from_json
to_json
schema_of_json
schema_of_csv
col
size
array_min
array_max
sort_array
array_sort
shuffle
reverse
flatten
map_keys
map_values
map_entries
map_from_entries
array_repeat
arrays_zip
map_concat
sequence
from_csv

@HyukjinKwon you mentioned from_json(), but that's not a true exception. The schema argument is not meant to be an actual column, it's a column specification, so it's a completely different case. That same function also takes an actual column to operate on, which it explicitly accepts as a name or instance. So there's absolutely no problem there.

As for the dataframe methods in dataframe.py, these functions also take column names or instances and do explicit conversion:

join
sampleBy
sort
select
groupBy
rollup
cube
dropna
fillna
replace
drop
toDF
describe
dropDuplicates
sortWithinPartitions

I was even a little surprised by drop(), I thought it would only take names, but it does handle both cases. EDIT: Sorry, I double checked and it does support both, but only when you give a single argument. If you drop more than one column, it enforces a check for strings. The reason is that there is no Seq[Column] variant on the Scala side.
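
A quick sketch of that drop() behaviour (in a pyspark shell where `spark` is already defined; the exact error wording is approximate):

```python
from pyspark.sql.functions import lit

df = spark.range(3).withColumn("a", lit(1)).withColumn("b", lit(2))

df.drop("a")          # ok: a single column name
df.drop(df.a)         # ok: a single Column object is also accepted
df.drop("a", "b")     # ok: multiple names
df.drop(df.a, df.b)   # error: with multiple arguments, only strings are accepted
```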

The following function is also 100% compliant with both API styles, but it relies on the jvm backend to do the conversion:

__getitem__

This is something that should be kept in mind should the Scala API change. EDIT: this list was larger, but I realised I'd mistaken _jseq for a jvm function when it's really not.

The following functions are violations, they only take strings:

approxQuantile
corr
cov
crosstab
freqItems

The reason these are exceptions is that they're implemented in org.apache.spark.sql.DataFrameStatFunctions, and for some reason almost all functions there only take name arguments. The PySpark side never converts from a Column object to a string, so it also only supports the name spec. This is something that should be fixed on the Scala side IMO.
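
For instance, a sketch with approxQuantile standing in for the other DataFrameStatFunctions wrappers (in a pyspark shell where `spark` is already defined):

```python
from pyspark.sql.functions import col

df = spark.range(100).withColumn("x", col("id") * 1.0)

df.approxQuantile("x", [0.5], 0.01)        # ok: only a column name (or list of names) is accepted
df.approxQuantile(col("x"), [0.5], 0.01)   # fails: a Column object is rejected by the type check
```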

Since this patch is about making name support universal, however, I don't think it falls under scope to do anything about the drop() and stats violations for now.

One other special case I should mention is colRegex, which also only takes strings. But in that case, it just wouldn't make sense to take Column objects (how would you specify a regex that way?), so that is fine.

With these checks in place, I'm confident the list of exceptions in this PR's description is exhaustive. I will work on solving those now as agreed (by redefining the _create_function argument passing strategy to do explicit conversion).

This reverses the previous strategy of breaking out inconsistent
functions from the automated mechanism, and solves the inconsistency
problem by instead always passing a Column object to the jvm callee.

A few auto-gen functions, however, can only take a column name as
input - and that is as it should be, semantically:

- lit()
- col()
- column()
- asc()
- desc()
- asc_nulls_first()
- asc_nulls_last()
- desc_nulls_first()
- desc_nulls_last()

To avoid breaking those functions, a different kind of automation
mechanism is introduced, and applied to those only. This is just
the previous implementation renamed.

The reason the original exceptions aren't being handled specially
instead is that the original mechanism was inconsistent with how
most jvm functions are called. The functions listed above are the
true exceptions that should be handled specially, since it would
not make sense for them to accept column objects like the others.
@asmello
Author

asmello commented Mar 2, 2019

Alright, as noted in the commit, I've restored the original exceptions to be handled by the mechanism, and handled these separately:

  • lit()
  • col()
  • column()
  • asc()
  • desc()
  • asc_nulls_first()
  • asc_nulls_last()
  • desc_nulls_first()
  • desc_nulls_last()

The original mechanism has been renamed and applied to those only. The new mechanism is consistent with how most functions are handled outside the mechanism - it converts the argument to a Column object by default.

@asmello asmello changed the title from "[SPARK-26979][PySpark][WIP] Add missing column name support for some SQL functions" to "[SPARK-26979][PySpark] Add missing column name support for some SQL functions" on Mar 2, 2019
@asmello
Author

asmello commented Mar 3, 2019

@srowen @HyukjinKwon we should be good to go now.

@srowen
Member

srowen commented Mar 14, 2019

I ran another test build to be sure, and it failed, but I'm pretty sure it's spurious. @shaneknapp, for some weird reason builds sometimes fail because the Java style checker refers to old DTDs that have disappeared. We fixed that in master a week or two ago. I don't know why.

I'll trigger another build. If it passes I think this is good to go.

@SparkQA

SparkQA commented Mar 14, 2019

Test build #4623 has finished for PR 23882 at commit 23d0222.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shaneknapp
Contributor

shaneknapp commented Mar 14, 2019 via email

@@ -85,13 +96,16 @@ def _():
>>> df.select(lit(5).alias('height')).withColumn('spark_user', lit(True)).take(1)
[Row(height=5, spark_user=True)]
"""
_functions = {
_name_functions = {
# name functions take a column name as their argument
'lit': _lit_doc,
Member

Does lit take the string as a column name? How do we create a string literal?

Member

And what really is a "name function"?

Author

No, lit() takes a literal value and creates a column with that literal. That's what it's for. It doesn't make sense for it to accept a column name, given its nature. So if you give it a string it will create a column with that string literal.

The name "name function" is something I came up with just to distinguish these functions from the ones that take columns as input. They are defined by that distinction - they are "functions that take a column name as their argument", exclusively.

Member

lit doesn't take a column name as its argument. Why did we use the "name function" category for lit?

Author

To be fair, lit is such a unique function semantically that it would warrant its own category. But since the implementation is exactly the same as those "name functions", I left it there for practical purposes.

Member

One (lit) out of five (col, column, asc, desc, lit) doesn't sound like a special case, though. It should have had a better category if 20% of the items don't fit the category.

@srowen srowen closed this in f9180f8 Mar 17, 2019
@srowen
Member

srowen commented Mar 17, 2019

Merged to master

@HyukjinKwon
Member

What happened to math functions?

>>> from pyspark.sql.functions import atan2
>>> spark.range(1).select(atan2("id", "id"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/functions.py", line 78, in _
    jc = getattr(sc._jvm.functions, name)(col1._jc if isinstance(col1, Column) else float(col1),
ValueError: could not convert string to float: id
>>> from pyspark.sql.functions import atan2, col
>>> spark.range(1).select(atan2(col("id"), col("id")))
DataFrame[ATAN2(id, id): double]

@HyukjinKwon
Member

@srowen, I haven't finished my review yet. It's only been 2 days, and those were the weekend.

@HyukjinKwon
Member

Math cases are difficult to fix because of the existing support:

def _create_binary_mathfunction(name, doc=""):
    """ Create a binary mathfunction by name"""
    def _(col1, col2):
        sc = SparkContext._active_spark_context
        # users might write ints for simplicity. This would throw an error on the JVM side.
        jc = getattr(sc._jvm.functions, name)(col1._jc if isinstance(col1, Column) else float(col1),
                                              col2._jc if isinstance(col2, Column) else float(col2))
        return Column(jc)
    return _

Therefore, the col1 argument could be a string number, for instance "1", but it could also be the name of a column. Here's the ambiguity. Since we're going ahead with Spark 3.0, I am going to disallow strings as numbers in a followup.
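
A sketch of the ambiguity (hypothetical call, in a pyspark shell where `spark` is already defined):

```python
from pyspark.sql.functions import atan2, col

# With the existing behaviour (still in place after this PR), a string argument
# is coerced with float(...), so "2" means the number 2.0:
spark.range(1).select(atan2("2", col("id")))   # today: atan2(2.0, id)

# If strings become column names instead, the very same call would mean
# "look up a column named 2" -- that is the ambiguity being discussed.
```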

@HyukjinKwon
Member

There is another exception, when() which takes columns and literals:

>>> spark.range(1).select(when(lit(True), col("id"))).show()
+--------------------------+
|CASE WHEN true THEN id END|
+--------------------------+
|                         0|
+--------------------------+
>>> spark.range(1).select(when(lit(True), "id")).show()
+--------------------------+
|CASE WHEN true THEN id END|
+--------------------------+
|                        id|
+--------------------------+

Here's another ambiguity.

'initcap': 'Returns a new string column by converting the first letter of each word to ' +
'uppercase. Words are delimited by whitespace.',
'lower': 'Converts a string column to lower case.',
'upper': 'Converts a string column to upper case.',
Member

This had to stay in string functions!

Author

initcap was also defined explicitly below, with better documentation, so I preserved that instance. lower and upper were introduced in the 1.3 API, and were being defined there as well, so I prioritised that instance because it used the correct @since number. Arguably we could've used a separate mapping like string_functions_1_3 or something, but that didn't seem appropriate since the point of these is to minimise code.

Member

I was saying lower and upper. It looks like they had been overwritten by this, so if we keep them here, it keeps the previous definition. Did you check which PR and JIRAs added lower and upper?

Also, strictly speaking, this should not have been removed in this PR, as it doesn't target removing overwritten functions. As you said, we should avoid such a way of defining functions later.

Member

Also, if it was an exception, it should have been described specifically.

    # name functions take a column name as their argument
    'lit': _lit_doc,

This doesn't make sense if you read the code from scratch.

Author

The java/scala API documentation says it was added in 1.3, but I just tracked down the JIRA/PR and it seems it actually was 1.0.

https://issues.apache.org/jira/browse/SPARK-1995
#936

As for removing overwritten functions, maybe it would've been better to make a separate PR, but the first fix did require removing them. When I changed the approach it seemed reasonable to keep the change, since the problem was obvious and easy to fix.

@HyukjinKwon
Member

Why was ascii in the string functions an exception?

>>> spark.range(1).select(lit('a').alias("value")).select(ascii(col("value"))).show()
+------------+
|ascii(value)|
+------------+
|          97|
+------------+
>>> spark.range(1).select(lit('a').alias("value")).select(ascii("value")).show()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/functions.py", line 51, in _
    jc = getattr(sc._jvm.functions, name)(col._jc if isinstance(col, Column) else col)
  File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1286, in __call__
  File "/.../spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py", line 332, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.ascii. Trace:
py4j.Py4JException: Method ascii([class java.lang.String]) does not exist
	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:339)
	at py4j.Gateway.invoke(Gateway.java:276)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

'initcap': 'Returns a new string column by converting the first letter of each word to ' +
'uppercase. Words are delimited by whitespace.',
'lower': 'Converts a string column to lower case.',
'upper': 'Converts a string column to upper case.',
'ltrim': 'Trim the spaces from left end for the specified string value.',
'rtrim': 'Trim the spaces from right end for the specified string value.',
'trim': 'Trim the spaces from both ends for the specified string column.',
Member

Did you not add the column name support because the Scala side has these signatures below?

  def trim(e: Column): Column
  def trim(e: Column, trimString: String): Column

That's not allowed in Python:

>>> from pyspark.sql.functions import trim, lit
>>> spark.range(1).select(lit('a').alias("value")).select(trim("value", "a"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: _() takes exactly 1 argument (2 given)

Member

@HyukjinKwon HyukjinKwon Mar 18, 2019

This is what I initially expected when I asked to whitelist them, @asmello

Author

I don't understand what you're saying here. trim with 2 arguments was never supported in PySpark, and that's a separate issue. What this patch changes is that trim("value") is now supported, when it wasn't previously.

Member

I was saying this because I didn't get why you excluded string functions.

@asmello
Author

asmello commented Mar 18, 2019

Why was ascii in the string functions an exception?

Same reason as the other string functions; they were being defined using the automatic mechanism, which assumed the argument was either a Column object or a literal. So if it was a string (column name) and the Scala API didn't have an overload for that, it would fail.

@asmello
Author

asmello commented Mar 18, 2019

There is another exception, when() which takes columns and literals:

Yes this is a potential source of confusion, but it's more complicated to fix because it would require breaking the current API.

@HyukjinKwon
Member

Same reason as the other string functions; they were being defined using the automatic mechanism, which assumed the argument was either a Column object or a literal. So if it was a string (column name) and the Scala API didn't have an overload for that, it would fail.

What do you mean? Doesn't this PR aim to have functions that take strings as column names across the PySpark function API?

@HyukjinKwon
Member

Yes this is a potential source of confusion, but it's more complicated to fix because it would require breaking the current API.

Yes, this had to be identified and whitelisted before asking if we need something else at this point.

@asmello
Author

asmello commented Mar 18, 2019

Yes, this had to be identified and whitelisted before asking if we need something else at this point.

No, this is a different problem. It's not about the API not supporting a column name, it's about it supporting a literal instead. It's outside the scope here, which I kept restricted specifically so we wouldn't block on other problems like this.

@HyukjinKwon
Member

No, this is a different problem. It's not about the API not supporting a column name, it's about it supporting a literal instead.

The whole problem we discussed in this PR is whether we're going to support strings as column names in general, no? And there look to be a few cases of ambiguity here and there.

@asmello
Author

asmello commented Mar 18, 2019

What do you mean? Doesn't this PR aim to have functions that take strings as column names across the PySpark function API?

Yes. I was saying the previous implementation was problematic because of that.

@HyukjinKwon
Member

Then why didn't we handle all the other cases I listed above? It's something you missed. I don't know why you're trying to justify your mistakes.

@asmello
Author

asmello commented Mar 18, 2019

Then why didn't we handle all the other cases I listed above? It's something you missed. I don't know why you're trying to justify your mistakes.

I didn't miss those, I deliberately didn't tackle them because 1) they were a different kind of problem, and 2) the fix is not simple, as it requires breaking the API. The only thing I did miss was the problem with the math functions, and I'm happy to have those fixed.

I'm tired of arguing this with you, the PR was open for weeks and I tried to fix every problem you pointed out. What's the point of complaining now that it has been merged? I don't know why you antagonise me so much.

@HyukjinKwon
Member

Because I was deciding whether I should revert or make a followup. If it gets reverted, the reasons should be left on record, and I am arguing because you're trying to justify the issues that I listed.

This PR targets supporting strings as columns in general, and some missing cases were found that don't seem to have been mentioned or discussed.

The main reason I was initially worried is that we should first see whether it makes sense to support strings as columns in PySpark's API (and, further, in the other languages' APIs). I think other people also expressed this concern. It doesn't look like this concern has been addressed properly.

@HyukjinKwon
Member

And this PR in particular needed a lot of effort. We shouldn't fix something as just-work-for-now anymore, in particular when it repeatedly becomes an issue.

@asmello
Author

asmello commented Mar 18, 2019

It's definitely better after the merge than before, so I see no reason to revert. Though due to the missing math functions fix, I agree a follow up is required.

As for cases like when() and array_contains(), that warrants a deeper discussion. Again, it's not that support for column names is missing there, it's just that those accept string literals, which take precedence in the API. In a way, supporting columns at all is an after-thought there, as their original formulation only supported literals. So we either remove columns from there entirely (at the cost of losing functionality) or we drop the literal support (which redefines the API in a big way). But neither has anything to do with supporting column names, so this should be reserved for another PR.

The main reason I was initially worried is that we should first see whether it makes sense to support strings as columns in PySpark's API

That's a valid concern, but, again, it's better to be consistent for now. It's not a "just-work-for-now" situation, either, it's a full fix for the consistency problem. String support itself is a related, but separate discussion, too.

@HyukjinKwon
Member

HyukjinKwon commented Mar 18, 2019

I already explained the reasons to revert. If we see some ambiguity, we should consider not officially supporting strings as columns, or such a discussion should be concluded first.

I made a followup to fix the missing instances here and to conclude the discussion. If that drags on, I'm going to revert this PR.

It's confusing what a string means in the PySpark functions because it has two semantically different meanings here.

@srowen
Member

srowen commented Mar 19, 2019

I think this change was fine and will review the follow on too. I wouldn't revert this. I think we decided to keep this to a narrow fix for consistency. Anything else can be a future change or discussion. It would return this to a worse state if it's reverted.

HyukjinKwon added a commit that referenced this pull request Mar 19, 2019
…ke string as columns as well

## What changes were proposed in this pull request?

This is a followup of #23882 to handle binary math/string functions. For instance, see the cases below:

**Before:**

```python
>>> from pyspark.sql.functions import lit, ascii
>>> spark.range(1).select(lit('a').alias("value")).select(ascii("value"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/functions.py", line 51, in _
    jc = getattr(sc._jvm.functions, name)(col._jc if isinstance(col, Column) else col)
  File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1286, in __call__
  File "/.../spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py", line 332, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.ascii. Trace:
py4j.Py4JException: Method ascii([class java.lang.String]) does not exist
	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:339)
	at py4j.Gateway.invoke(Gateway.java:276)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
```

```python
>>> from pyspark.sql.functions import atan2
>>> spark.range(1).select(atan2("id", "id"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/functions.py", line 78, in _
    jc = getattr(sc._jvm.functions, name)(col1._jc if isinstance(col1, Column) else float(col1),
ValueError: could not convert string to float: id
```

**After:**

```python
>>> from pyspark.sql.functions import lit, ascii
>>> spark.range(1).select(lit('a').alias("value")).select(ascii("value"))
DataFrame[ascii(value): int]
```

```python
>>> from pyspark.sql.functions import atan2
>>> spark.range(1).select(atan2("id", "id"))
DataFrame[ATAN2(id, id): double]
```

Note that,

- This PR causes a slight behaviour change for math functions. For instance, numbers as strings (e.g., `"1"`) were supported as arguments of binary math functions before. After this PR, they are recognised as column names.

- I also intentionally didn't document this behaviour change since we're going ahead with Spark 3.0 and I don't think numbers as strings make much sense in math functions.

- There is another exception, `when`, which takes strings as literal values, as below. This PR doesn't fix this ambiguity.
  ```python
  >>> spark.range(1).select(when(lit(True), col("id"))).show()
  ```

  ```
  +--------------------------+
  |CASE WHEN true THEN id END|
  +--------------------------+
  |                         0|
  +--------------------------+
  ```

  ```python
  >>> spark.range(1).select(when(lit(True), "id")).show()
  ```

  ```
  +--------------------------+
  |CASE WHEN true THEN id END|
  +--------------------------+
  |                        id|
  +--------------------------+
  ```

This PR also makes the following change:

#23882 changed the code to:

- Rename `_create_function` to `_create_name_function`
- Define a new `_create_function` that takes strings as column names.

In this PR, I propose to:

- Revert the `_create_name_function` name back to `_create_function`.
- Define a new `_create_function_over_column` that takes strings as column names.

## How was this patch tested?

Some unit tests were added for binary math / string functions.

Closes #24121 from HyukjinKwon/SPARK-26979.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
@HyukjinKwon
Member

Sorry if I missed a discussion thread where we agreed on fixing this partially and reached an explicit conclusion. If not, I don't think we should fix such a thing partially without an explicit conclusion - it's been raised multiple times.
