[SPARK-19701][SQL][PYTHON] Throws a correct exception for 'in' operator against column #17160

HyukjinKwon · 2017-03-04T03:57:06Z

What changes were proposed in this pull request?

This PR proposes to remove the incorrect implementation of the 'in' operator, which has never worked as intended (at least since Spark 1.5.2), and to throw a correct exception instead of one complaining that the column cannot be converted into a bool. I tested the code below in 1.5.2, 1.6.3, 2.1.0 and the master branch:

1.5.2

>>> df = sqlContext.createDataFrame([[1]])
>>> 1 in df._1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark-1.5.2-bin-hadoop2.6/python/pyspark/sql/column.py", line 418, in __nonzero__
    raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

1.6.3

>>> 1 in sqlContext.range(1).id
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark-1.6.3-bin-hadoop2.6/python/pyspark/sql/column.py", line 447, in __nonzero__
    raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

2.1.0

>>> 1 in spark.range(1).id
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/column.py", line 426, in __nonzero__
    raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

Current Master

>>> 1 in spark.range(1).id
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/column.py", line 452, in __nonzero__
    raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

After

>>> 1 in spark.range(1).id
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/column.py", line 184, in __contains__
    raise ValueError("Cannot apply 'in' operator against a column: please use 'contains' "
ValueError: Cannot apply 'in' operator against a column: please use 'contains' in a string column or 'array_contains' function for an array column.

In more detail:

It seems the implementation was intended to support this:

1 in df.column

However, currently, it throws an exception as below:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/column.py", line 426, in __nonzero__
    raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

What happens here is as below:

class Column(object):
    def __contains__(self, item):
        print "I am contains"
        return Column()
    def __nonzero__(self):
        raise Exception("I am nonzero.")

>>> 1 in Column()
I am contains
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 6, in __nonzero__
Exception: I am nonzero.

It seems __contains__ is called first, and then __nonzero__ (or __bool__) is called against the returned Column() to turn the result into a bool (or an int, to be specific).

It seems that, unlike other operators, 'in' forces the value returned by __contains__ into a bool via __nonzero__ (Python 2) or __bool__ (Python 3). There are a few references about this:

https://bugs.python.org/issue16011
http://stackoverflow.com/questions/12244074/python-source-code-for-built-in-in-operator/12244378#12244378
http://stackoverflow.com/questions/38542543/functionality-of-python-in-vs-contains/38542777
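For readers on Python 3, the difference between 'in' and an ordinary operator can be reproduced without Spark. This is only a minimal sketch: FakeColumn is a hypothetical stand-in invented for illustration, not the real pyspark.sql.Column.

```python
class FakeColumn:
    """Hypothetical stand-in for a column expression (not pyspark's Column)."""

    def __init__(self, expr):
        self.expr = expr

    def __lt__(self, other):
        # Ordinary operators may return any object, so the expression survives.
        return FakeColumn("(%s < %r)" % (self.expr, other))

    def __contains__(self, item):
        # The interpreter converts whatever is returned here into a bool.
        return FakeColumn("contains(%s, %r)" % (self.expr, item))


col = FakeColumn("id")
lt_result = col < 1    # still a FakeColumn
in_result = 1 in col   # already a plain bool; the FakeColumn is discarded

print(type(lt_result).__name__, type(in_result).__name__)
```

Because FakeColumn defines neither __bool__ nor __len__ here, the object returned by __contains__ is simply truthy, so in_result is True regardless of the data.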

It seems we can't override __nonzero__ or __bool__ as a workaround to make this work, because they are themselves required to return a bool (or an int in Python 2), as below:

class Column(object):
    def __contains__(self, item):
        print "I am contains"
        return Column()
    def __nonzero__(self):
        return "a"

>>> 1 in Column()
I am contains
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: __nonzero__ should return bool or int, returned str
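The same restriction applies on Python 3, where the hook is named __bool__ and must return a bool. A minimal sketch, again with a hypothetical FakeColumn class rather than the real Column:

```python
class FakeColumn:
    """Hypothetical stand-in; only shows the restriction on __bool__."""

    def __contains__(self, item):
        return FakeColumn()

    def __bool__(self):
        # Returning anything other than a bool is rejected by the interpreter.
        return "a"


message = None
try:
    1 in FakeColumn()
except TypeError as exc:
    message = str(exc)

print(message)
```

So there is no way to smuggle a column-like object out of 'in' by customizing the truthiness hook.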

How was this patch tested?

Added unit tests in tests.py.

HyukjinKwon · 2017-03-04T03:57:26Z

cc @cloud-fan, @davies and @holdenk.

HyukjinKwon · 2017-03-04T04:03:25Z

python/pyspark/sql/column.py

-    __contains__ = _bin_op("contains")
+    def __contains__(self, item):
+        raise ValueError("Cannot apply 'in' operator against a column: please use 'contains' "
+                         "in a string column or 'array_contains' function for an array column.")


What I meant here is to use

>>> df = spark.range(1)
>>> df.select(df.id.contains(0)).show()
+---------------+
|contains(id, 0)|
+---------------+
|           true|
+---------------+

or

>>> from pyspark.sql.functions import array_contains
>>> df = spark.createDataFrame([[[0]]], ["id"])
>>> df.select(array_contains(df.id, 0)).show()
+---------------------+
|array_contains(id, 0)|
+---------------------+
|                 true|
+---------------------+
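The effect of the change can also be imitated without a Spark session. The sketch below assumes a hypothetical FakeColumn class and a hypothetical error_for helper; only the two error messages are taken from this PR.

```python
class FakeColumn:
    """Hypothetical stand-in that mirrors the two error messages in this PR."""

    def __contains__(self, item):
        # Raising here means the truthiness hook is never reached for 'in'.
        raise ValueError(
            "Cannot apply 'in' operator against a column: please use 'contains' "
            "in a string column or 'array_contains' function for an array column.")

    def __bool__(self):
        raise ValueError(
            "Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
            "'~' for 'not' when building DataFrame boolean expressions.")

    __nonzero__ = __bool__  # Python 2 name for the same hook


def error_for(action):
    """Run action() and return the ValueError message, or None if it succeeds."""
    try:
        action()
    except ValueError as exc:
        return str(exc)
    return None


in_error = error_for(lambda: 1 in FakeColumn())
bool_error = error_for(lambda: bool(FakeColumn()))
print(in_error)
print(bool_error)
```

With __contains__ raising directly, 'in' reports the message about 'contains' and 'array_contains' instead of the unrelated message about bool conversion.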

SparkQA · 2017-03-04T04:26:58Z

Test build #73891 has finished for PR 17160 at commit 509747c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-03-04T05:53:23Z

what if we just remove __contains__, __nonzero__ and __bool__?

HyukjinKwon · 2017-03-04T05:57:24Z

I tested that. IIRC, the error messages did not look useful. As we are overriding some operators, it may be better to say explicitly that such operators are not supported against a column. I am outside right now. Let me post some test results here when I get to my computer.

HyukjinKwon · 2017-03-04T09:26:21Z

without __contains__, __nonzero__ and __bool__

class Column(object): pass

>>> 1 in Column()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: argument of type 'Column' is not iterable
>>> bool(Column())
True

without __nonzero__ and __bool__

class Column(object):
    def __contains__(self, item):
        print "I am contains"
        return Column()

>>> 1 in Column()
I am contains
True
>>> bool(Column())
True

without __contains__

class Column(object):
    def __nonzero__(self):
        print "I am nonzero"
        return Column()

>>> 1 in Column()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: argument of type 'Column' is not iterable
>>> bool(Column())
I am nonzero
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: __nonzero__ should return bool or int, returned Column

FWIW, for __contains__, please refer to https://docs.python.org/2/reference/datamodel.html#object.__contains__
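The first two cases above can be rechecked on Python 3 with the following sketch. Bare, ContainsOnly and outcome are hypothetical names, and only the exception type is recorded because the exact wording of the TypeError may differ between Python versions.

```python
class Bare:
    """No __contains__ and no __bool__."""


class ContainsOnly:
    """__contains__ only; truthiness falls back to the default."""

    def __contains__(self, item):
        return ContainsOnly()


def outcome(action):
    """Return the result of action(), or the exception class name if it raises."""
    try:
        return action()
    except Exception as exc:
        return type(exc).__name__


bare_in = outcome(lambda: 1 in Bare())              # no membership protocol at all
bare_bool = outcome(lambda: bool(Bare()))           # default truthiness
contains_in = outcome(lambda: 1 in ContainsOnly())  # result coerced to bool

print(bare_in, bare_bool, contains_in)
```

In other words, removing the hooks either gives a generic TypeError or silently yields True, neither of which tells the user what to call instead.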

HyukjinKwon · 2017-03-04T09:31:36Z

So it seems the choice is between

TypeError: argument of type 'Column' is not iterable

vs

ValueError: Cannot apply 'in' operator against a column: please use 'contains' in a string column or 'array_contains' function for an array column.

cloud-fan · 2017-03-06T00:29:13Z

so do we still need __nonzero__?

HyukjinKwon · 2017-03-06T01:03:51Z

It is not needed for __contains__, but it is still needed for bool conversion, because I guess we would not want to silently return a bool when other operators return a Column.

For example,

>>> not spark.range(1).id
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/column.py", line 452, in __nonzero__
    raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

If we remove this, then

>>> not spark.range(1).id
False

I think we would want a Column, as below:

>>> 1 < spark.range(1).id
Column<(id > 1)>

but for bool, that does not seem easily possible.
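The reason is that Python's 'not' cannot be overloaded to return an arbitrary object, whereas '~' can, via __invert__. A minimal sketch with a hypothetical FakeColumn class (not the real Column) shows why the guard in __nonzero__ / __bool__ is still useful:

```python
class FakeColumn:
    """Hypothetical stand-in for a column expression."""

    def __init__(self, expr):
        self.expr = expr

    def __invert__(self):
        # '~' is an ordinary operator, so it can build a new expression.
        return FakeColumn("(NOT %s)" % self.expr)

    def __bool__(self):
        # 'not' always goes through truthiness, so the best option is to fail loudly.
        raise ValueError("Cannot convert column into bool")

    def __repr__(self):
        return "FakeColumn<%s>" % self.expr


col = FakeColumn("id")
negated = ~col

try:
    not col
    not_failed = False
except ValueError:
    not_failed = True

print(negated, not_failed)
```

Without the guard, 'not col' would quietly evaluate to False, just like the "not spark.range(1).id" example above.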

cloud-fan · 2017-03-06T02:05:20Z

thanks, merging to master!

HyukjinKwon · 2017-03-06T02:06:55Z

@cloud-fan Thank you so much for looking into this more deeply and asking about the details.

Throws a correct exception for in operator against column

509747c

asfgit closed this in 224e0e7 Mar 6, 2017

HyukjinKwon deleted the SPARK-19701 branch January 2, 2018 03:44
