[SPARK-19701][SQL][PYTHON] Throws a correct exception for 'in' operator against column #17160
cc @cloud-fan, @davies and @holdenk.
-    __contains__ = _bin_op("contains")
+    def __contains__(self, item):
+        raise ValueError("Cannot apply 'in' operator against a column: please use 'contains' "
+                         "in a string column or 'array_contains' function for an array column.")
What I meant here is to use
>>> df = spark.range(1)
>>> df.select(df.id.contains(0)).show()
+---------------+
|contains(id, 0)|
+---------------+
|           true|
+---------------+
or
>>> from pyspark.sql.functions import array_contains
>>> df = spark.createDataFrame([[[0]]], ["id"])
>>> df.select(array_contains(df.id, 0)).show()
+---------------------+
|array_contains(id, 0)|
+---------------------+
|                 true|
+---------------------+
Test build #73891 has finished for PR 17160 at commit
what if we just remove |
I tested with it. IIRC, the error messages were not useful. As we are overriding some operators, it may be better to say explicitly that such operators are not supported against a column. I am outside right now. Let me post some test results here when I get to my computer.
without
class Column(object): pass
>>> 1 in Column()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: argument of type 'Column' is not iterable
>>> bool(Column())
True
without
class Column(object):
    def __contains__(self, item):
        print "I am contains"
        return Column()
>>> 1 in Column()
I am contains
True
>>> bool(Column())
True
without
class Column(object):
    def __nonzero__(self):
        print "I am nonzero"
        return Column()
>>> 1 in Column()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: argument of type 'Column' is not iterable
>>> bool(Column())
I am nonzero
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: __nonzero__ should return bool or int, returned Column
FWIW, In case of
So, it seems the choice is between
TypeError: argument of type 'Column' is not iterable
vs
ValueError: Cannot apply 'in' operator against a column: please use 'contains' in a string column or 'array_contains' function for an array column.
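For anyone re-running the experiment above on Python 3, here is a minimal sketch of the same three cases. The class names (Plain, WithContains, WithBool) and the outcome helper are hypothetical stand-ins, not Spark code, and __bool__ is the Python 3 spelling of __nonzero__.

```python
class Plain:
    """No special methods, so 'in' has nothing to call."""


class WithContains:
    def __contains__(self, item):
        # Deliberately return a non-bool object.
        return WithContains()


class WithBool:
    def __bool__(self):
        # Deliberately return a non-bool object; Python rejects this.
        return WithBool()


def outcome(thunk):
    """Return thunk()'s result, or the exception type name if it raises."""
    try:
        return thunk()
    except Exception as exc:
        return type(exc).__name__


# Without __contains__ (and without iteration support), 'in' fails.
no_contains = outcome(lambda: 1 in Plain())

# With __contains__, the returned object is truth-tested, so the caller
# only ever sees a bool, never the object that was returned.
coerced = outcome(lambda: 1 in WithContains())

# __bool__ must return a real bool; anything else raises.
bad_bool = outcome(lambda: bool(WithBool()))

print(no_contains, coerced, bad_bool)
```

The middle case is the important one: whatever __contains__ returns, the 'in' expression evaluates to a plain bool, so a Column-valued result cannot come back to the caller.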
so do we still need |
It does not need for
For example,
>>> not spark.range(1).id
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File ".../spark/python/pyspark/sql/column.py", line 452, in __nonzero__
raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
If we remove this, then
>>> not spark.range(1).id
False
I think we want
>>> 1 < spark.range(1).id
Column<(id > 1)>
but for
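To make the contrast concrete, here is a rough sketch with a toy class. ToyColumn and error_of are hypothetical and only build expression strings, and the messages are shortened; this is not the pyspark implementation. It shows why comparison operators can hand back a column expression, while 'not' and 'in' can only raise a helpful error.

```python
class ToyColumn:
    """Hypothetical stand-in for a column; it only builds expression strings."""

    def __init__(self, expr):
        self.expr = expr

    def __gt__(self, other):
        return ToyColumn("({} > {})".format(self.expr, other))

    def __lt__(self, other):
        return ToyColumn("({} < {})".format(self.expr, other))

    def __bool__(self):
        # 'not col', 'col and x', 'if col:' all end up here.
        raise ValueError("Cannot convert column into bool")

    __nonzero__ = __bool__  # Python 2 name for the same hook

    def __contains__(self, item):
        # Any return value would be coerced to bool, so raise instead.
        raise ValueError("Cannot apply 'in' operator against a column")

    def __repr__(self):
        return "Column<{}>".format(self.expr)


def error_of(thunk):
    """Return the ValueError message raised by thunk(), or None."""
    try:
        thunk()
    except ValueError as exc:
        return str(exc)
    return None


col = ToyColumn("id")

# int.__lt__ returns NotImplemented for a ToyColumn, so Python falls back
# to the reflected method col.__gt__(1) and the result stays a ToyColumn.
reflected = 1 < col

not_error = error_of(lambda: not col)
in_error = error_of(lambda: 1 in col)

print(reflected, not_error, in_error)
```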
thanks, merging to master!
@cloud-fan Thank you sincerely so much for looking into this deeper and asking the details.
What changes were proposed in this pull request?
This PR proposes to remove the incorrect implementation of the 'in' operator, which has never actually been executed (at least since Spark 1.5.2), and to throw a correct exception rather than one about converting the column into a bool. I tested this in 1.5.2, 1.6.3, 2.1.0 and in the master branch as below:
1.5.2
1.6.3
2.1.0
Current Master
After
In more detail,
It seems the implementation was intended to support this
However, currently, it throws an exception as below:
What happens here is as below:
It seems it calls __contains__ first, and then __nonzero__ or __bool__ is called against Column() to turn the result into a bool (or an int, to be specific).
It seems __nonzero__ (for Python 2), __bool__ (for Python 3) and __contains__ force the return value into a bool, unlike other operators. There are a few references about this:
https://bugs.python.org/issue16011
http://stackoverflow.com/questions/12244074/python-source-code-for-built-in-in-operator/12244378#12244378
http://stackoverflow.com/questions/38542543/functionality-of-python-in-vs-contains/38542777
It seems we can't override __nonzero__ or __bool__ as a workaround to make this work, because these force the return type to be a bool, as below:
How was this patch tested?
Added unit tests in tests.py.
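As a rough illustration of what such a test could look like (a sketch only; ToyColumn and InOperatorTest are hypothetical and this is not the test added in tests.py), the check boils down to asserting that 'in' raises ValueError with the new message:

```python
import unittest


class ToyColumn:
    """Hypothetical stand-in; only the 'in' behaviour is modelled."""

    def __contains__(self, item):
        raise ValueError(
            "Cannot apply 'in' operator against a column: please use 'contains' "
            "in a string column or 'array_contains' function for an array column.")


class InOperatorTest(unittest.TestCase):
    def test_in_raises_value_error(self):
        with self.assertRaisesRegex(
                ValueError, "Cannot apply 'in' operator against a column"):
            1 in ToyColumn()


# Run the single test case programmatically and keep the result object.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(InOperatorTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```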