Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-19701][SQL][PYTHON] Throws a correct exception for 'in' operator against column #17160

Closed
wants to merge 1 commit into from

Conversation

HyukjinKwon
Copy link
Member

@HyukjinKwon HyukjinKwon commented Mar 4, 2017

What changes were proposed in this pull request?

This PR proposes to remove incorrect implementation that has been not executed so far (at least from Spark 1.5.2) for in operator and throw a correct exception rather than saying it is a bool. I tested the codes above in 1.5.2, 1.6.3, 2.1.0 and in the master branch as below:

1.5.2

>>> df = sqlContext.createDataFrame([[1]])
>>> 1 in df._1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark-1.5.2-bin-hadoop2.6/python/pyspark/sql/column.py", line 418, in __nonzero__
    raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

1.6.3

>>> 1 in sqlContext.range(1).id
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark-1.6.3-bin-hadoop2.6/python/pyspark/sql/column.py", line 447, in __nonzero__
    raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

2.1.0

>>> 1 in spark.range(1).id
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/column.py", line 426, in __nonzero__
    raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

Current Master

>>> 1 in spark.range(1).id
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/column.py", line 452, in __nonzero__
    raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

After

>>> 1 in spark.range(1).id
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/column.py", line 184, in __contains__
    raise ValueError("Cannot apply 'in' operator against a column: please use 'contains' "
ValueError: Cannot apply 'in' operator against a column: please use 'contains' in a string column or 'array_contains' function for an array column.

In more details,

It seems the implementation intended to support this

1 in df.column

However, currently, it throws an exception as below:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/column.py", line 426, in __nonzero__
    raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

What happens here is as below:

class Column(object):
    def __contains__(self, item):
        print "I am contains"
        return Column()
    def __nonzero__(self):
        raise Exception("I am nonzero.")

>>> 1 in Column()
I am contains
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 6, in __nonzero__
Exception: I am nonzero.

It seems it calls __contains__ first and then __nonzero__ or __bool__ is being called against Column() to make this a bool (or int to be specific).

It seems __nonzero__ (for Python 2), __bool__ (for Python 3) and __contains__ forcing the the return into a bool unlike other operators. There are few references about this as below:

https://bugs.python.org/issue16011
http://stackoverflow.com/questions/12244074/python-source-code-for-built-in-in-operator/12244378#12244378
http://stackoverflow.com/questions/38542543/functionality-of-python-in-vs-contains/38542777

It seems we can't overwrite __nonzero__ or __bool__ as a workaround to make this working because these force the return type as a bool as below:

class Column(object):
    def __contains__(self, item):
        print "I am contains"
        return Column()
    def __nonzero__(self):
        return "a"

>>> 1 in Column()
I am contains
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: __nonzero__ should return bool or int, returned str

How was this patch tested?

Added unit tests in tests.py.

@HyukjinKwon
Copy link
Member Author

cc @cloud-fan, @davies and @holdenk.

__contains__ = _bin_op("contains")
def __contains__(self, item):
raise ValueError("Cannot apply 'in' operator against a column: please use 'contains' "
"in a string column or 'array_contains' function for an array column.")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I meant here is use

>>> df = spark.range(1)
>>> df.select(df.id.contains(0)).show()
+---------------+
|contains(id, 0)|
+---------------+
|           true|
+---------------+

or

>>> from pyspark.sql.functions import array_contains
>>> df = spark.createDataFrame([[[0]]], ["id"])
>>> df.select(array_contains(df.id, 0)).show()
+---------------------+
|array_contains(id, 0)|
+---------------------+
|                 true|
+---------------------+

@SparkQA
Copy link

SparkQA commented Mar 4, 2017

Test build #73891 has finished for PR 17160 at commit 509747c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

what if we just remove __contains__, __nonzero__ and __bool__?

@HyukjinKwon
Copy link
Member Author

I tested with it. IIRC, the error messages looked not useful. As we are overwriting some operators, maybe, it would be better to say explicitly such operators are not supported against the column. I am now outside. Let me post some test results here when I get to my computer.

@HyukjinKwon
Copy link
Member Author

HyukjinKwon commented Mar 4, 2017

without __contains__, __nonzero__ and __bool__

class Column(object): pass

>>> 1 in Column()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: argument of type 'Column' is not iterable
>>> bool(Column())
True

without __nonzero__ and __bool__

class Column(object):
    def __contains__(self, item):
        print "I am contains"
        return Column()

>>> 1 in Column()
I am contains
True
>>> bool(Column())
True

without __contains__

class Column(object):
    def __nonzero__(self):
        print "I am nonzero"
        return Column()

>>> 1 in Column()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: argument of type 'Column' is not iterable
>>> bool(Column())
I am nonzero
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: __nonzero__ should return bool or int, returned Column

FWIW, In case of __contains__, please refer https://docs.python.org/2/reference/datamodel.html#object.__contains__

@HyukjinKwon
Copy link
Member Author

HyukjinKwon commented Mar 4, 2017

So.. it seems

TypeError: argument of type 'Column' is not iterable

vs

ValueError: Cannot apply 'in' operator against a column: please use 'contains' in a string column or 'array_contains' function for an array column.

@cloud-fan
Copy link
Contributor

so do we still need __non_zero__?

@HyukjinKwon
Copy link
Member Author

HyukjinKwon commented Mar 6, 2017

It does not need for __contains__ but need for bool because I guess we would not want to return bool as other operators return Column.

For example,

>>> not spark.range(1).id
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/column.py", line 452, in __nonzero__
    raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

If we remove this, then

>>> not spark.range(1).id
False

I think we want Column as below:

>>> 1 < spark.range(1).id
Column<(id > 1)>

but for bool, it seems not easily possible.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@HyukjinKwon
Copy link
Member Author

@cloud-fan Thank you sincerely so much for looking into this deeper and asking the details.

@asfgit asfgit closed this in 224e0e7 Mar 6, 2017
@HyukjinKwon HyukjinKwon deleted the SPARK-19701 branch January 2, 2018 03:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants