[SPARK-29627][PYTHON][SQL] Allow array_contains to take column instances #26288

HyukjinKwon · 2019-10-29T03:21:29Z

What changes were proposed in this pull request?

This PR proposes to allow array_contains to take column instances.

Why are the changes needed?

For consistent support across the Scala and Python APIs. Scala already allows column instances in array_contains.

Scala:

import org.apache.spark.sql.functions._
val df = Seq(Array("a", "b", "c"), Array.empty[String]).toDF("data")
df.select(array_contains($"data", lit("a"))).show()

Python:

from pyspark.sql.functions import array_contains, lit
df = spark.createDataFrame([(["a", "b", "c"],), ([],)], ['data'])
df.select(array_contains(df.data, lit("a"))).show()

However, the PySpark side does not allow this.

Does this PR introduce any user-facing change?

Yes.

from pyspark.sql.functions import array_contains, lit
df = spark.createDataFrame([(["a", "b", "c"],), ([],)], ['data'])
df.select(array_contains(df.data, lit("a"))).show()

Before:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/functions.py", line 1950, in array_contains
    return Column(sc._jvm.functions.array_contains(_to_java_column(col), value))
  File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1277, in __call__
  File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1241, in _build_args
  File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1228, in _get_args
  File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_collections.py", line 500, in convert
  File "/.../spark/python/pyspark/sql/column.py", line 344, in __iter__
    raise TypeError("Column is not iterable")
TypeError: Column is not iterable

After:

+-----------------------+
|array_contains(data, a)|
+-----------------------+
|                   true|
|                  false|
+-----------------------+
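
The before/after behaviour can be mimicked without a Spark session. The sketch below is only an illustration: `FakeColumn`, `FakeJavaColumn`, and `fake_jvm_array_contains` are hypothetical stand-ins (not PySpark or Py4J APIs), and unwrapping the value to its `_jc` handle is an assumed approach, since the actual diff is not shown in this thread. It assumes the argument converter treats any object exposing `__iter__` as a collection, which is what the traceback suggests.

```python
class FakeJavaColumn:
    """Hypothetical stand-in for the JVM-side column handle."""

    def __init__(self, expr):
        self.expr = expr


class FakeColumn:
    """Hypothetical stand-in for pyspark.sql.Column."""

    def __init__(self, expr):
        self._jc = FakeJavaColumn(expr)

    def __iter__(self):
        # Mirrors the error at the bottom of the traceback.
        raise TypeError("Column is not iterable")


def fake_jvm_array_contains(jcol, value):
    """Hypothetical stand-in for the JVM call.

    Assumption: like the converter in the traceback, it tries to turn any
    non-string object that exposes __iter__ into a collection.
    """
    if not isinstance(value, (str, FakeJavaColumn)) and hasattr(value, "__iter__"):
        value = list(value)  # raises TypeError for a FakeColumn
    shown = value.expr if isinstance(value, FakeJavaColumn) else value
    return "array_contains({}, {})".format(jcol.expr, shown)


def array_contains_before(col, value):
    # The Python-side wrapper is passed straight through to the "JVM".
    return fake_jvm_array_contains(col._jc, value)


def array_contains_after(col, value):
    # Assumed fix: unwrap a column to its JVM handle, leave literals untouched.
    value = value._jc if isinstance(value, FakeColumn) else value
    return fake_jvm_array_contains(col._jc, value)


data, a = FakeColumn("data"), FakeColumn("a")
try:
    array_contains_before(data, a)
except TypeError as exc:
    print("before:", exc)
print("after: ", array_contains_after(data, a))
print("after: ", array_contains_after(data, "a"))  # plain literals still work
```

The point of the sketch is that the wrapper only needs to branch on the type of the value argument; plain Python literals keep going through the existing path.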

How was this patch tested?

Manually tested and added a doctest.

SparkQA · 2019-10-29T03:55:07Z

Test build #112815 has finished for PR 26288 at commit 7ab0c76.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MaxGekk

Should we tell users that a column is supported too? https://github.com/apache/spark/pull/26288/files#diff-f5295f69bfbdbf6e161aed54057ea36dR1943

SparkQA · 2019-10-29T15:48:09Z

Test build #112851 has finished for PR 26288 at commit 5f8b160.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2019-10-30T00:45:02Z

Thanks, @MaxGekk

HyukjinKwon · 2019-10-30T00:45:10Z

Merged to master.

dongjoon-hyun · 2019-10-30T05:20:23Z

Late LGTM!

Allow array_contains to take column instances

7ab0c76

MaxGekk reviewed Oct 29, 2019

Fix doc

5f8b160

MaxGekk approved these changes Oct 29, 2019

HyukjinKwon closed this in 8682bb1 Oct 30, 2019

zero323 mentioned this pull request Jan 7, 2020

Sync with changes merged after 6378d4bc06cd1bb1a209bd5fb63d10ef52d75eb4 zero323/pyspark-stubs#230

Closed

47 tasks

dongjoon-hyun added the SQL label Feb 5, 2020

HyukjinKwon deleted the SPARK-29627 branch March 3, 2020 01:17
