[SPARK-23942][PYTHON][SQL] Makes collect in PySpark as action for a query executor listener #21007
Conversation
Test build #89053 has finished for PR 21007 at commit
cc @cloud-fan and @viirya (from checking the history).
add test?
Yup, will add. I was just hesitant because it takes some complexity to write an actual test (as described in the PR description), whereas the fix itself is quite straightforward.
Got it. Sorry, I missed the last sentence. Maybe a JVM-only test?
I'll give the Python one a shot first to show how it looks. I have an incomplete one locally.
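For reference, a minimal, Spark-free sketch of the assertion shape such a Python test could take. All names here (`FakeListener`, `FakeDataFrame`) are hypothetical stand-ins; the real PySpark test would go through a JVM-side `QueryExecutionListener` instead.

```python
# Spark-free sketch of the test's assertion pattern: after calling
# collect(), the listener's success callback must have fired.

class FakeListener:
    def __init__(self):
        self.on_success_called = False

    def on_success(self, func_name):
        self.on_success_called = True


class FakeDataFrame:
    """Stand-in for a DataFrame whose collect() notifies the listener."""
    def __init__(self, rows, listener):
        self._rows = rows
        self._listener = listener

    def collect(self):
        result = list(self._rows)
        # This notification is exactly what the fix makes happen for
        # PySpark's collect path.
        self._listener.on_success("collect")
        return result


listener = FakeListener()
df = FakeDataFrame([0], listener)

assert not listener.on_success_called
rows = df.collect()
assert listener.on_success_called, (
    "The callback from the query execution listener should be called after 'collect'")
assert rows == [0]
```

The real test additionally has to poll the JVM-side listener's state through py4j, which is where the extra complexity mentioned above comes from.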
This fix looks good.
```diff
@@ -3189,10 +3189,10 @@ class Dataset[T] private[sql](
   private[sql] def collectToPython(): Int = {
     EvaluatePython.registerPicklers()
-    withNewExecutionId {
+    withAction("collect", queryExecution) { plan =>
```
`collect` or `collectToPython`?
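For context, the `withAction` helper the fix switches to is, roughly, a wrapper that notifies the registered query execution listeners with the action's name on success or failure. A Spark-free Python sketch of that pattern (all names are hypothetical stand-ins, not the actual Scala internals):

```python
import time

# Sketch of the withAction pattern: run the body of an action, then notify
# every registered listener with the action name and duration on success,
# or with the exception on failure.

listeners = []

def with_action(name, body):
    start = time.time()
    try:
        result = body()
    except Exception as exc:
        for listener in listeners:
            listener.on_failure(name, exc)
        raise
    duration = time.time() - start
    for listener in listeners:
        listener.on_success(name, duration)
    return result


class RecordingListener:
    def __init__(self):
        self.calls = []

    def on_success(self, name, duration):
        self.calls.append(("success", name))

    def on_failure(self, name, exc):
        self.calls.append(("failure", name))


listener = RecordingListener()
listeners.append(listener)

rows = with_action("collectToPython", lambda: [0])
assert rows == [0]
assert listener.calls == [("success", "collectToPython")]
```

This is why wrapping `collectToPython` in `withAction` (instead of only `withNewExecutionId`) makes the PySpark `collect` visible to `QueryExecutionListener` implementations.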
```diff
@@ -3312,10 +3313,15 @@ class Dataset[T] private[sql](
   /** Convert to an RDD of ArrowPayload byte arrays */
   private[sql] def toArrowPayload: RDD[ArrowPayload] = {
+    // This is only used in tests, for now.
```
Should this comment be moved above `def toArrowPayload: RDD[ArrowPayload]`?
Will address the comments together soon.
python/pyspark/sql/tests.py
Outdated
```diff
@@ -3062,6 +3062,73 @@ def test_sparksession_with_stopped_sparkcontext(self):
         sc.stop()

+class SQLTests3(unittest.TestCase):
```
I manually tested different conditions with this test for sure:
Before:
test_query_execution_listener_on_collect (pyspark.sql.tests.SQLTests3) ... FAIL
test_query_execution_listener_on_collect_with_arrow (pyspark.sql.tests.SQLTests3) ... FAIL
======================================================================
FAIL: test_query_execution_listener_on_collect (pyspark.sql.tests.SQLTests3)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/.../spark/python/pyspark/sql/tests.py", line 3105, in test_query_execution_listener_on_collect
"The callback from the query execution listener should be called after 'collect'")
AssertionError: The callback from the query execution listener should be called after 'collect'
======================================================================
FAIL: test_query_execution_listener_on_collect_with_arrow (pyspark.sql.tests.SQLTests3)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/.../spark/python/pyspark/sql/tests.py", line 3122, in test_query_execution_listener_on_collect_with_arrow
"The callback from the query execution listener should be called after 'toPandas'")
AssertionError: The callback from the query execution listener should be called after 'toPandas'
After:
test_query_execution_listener_on_collect (pyspark.sql.tests.SQLTests3) ... ok
test_query_execution_listener_on_collect_with_arrow (pyspark.sql.tests.SQLTests3) ... ok
Missing 'org.apache.spark.sql.TestQueryExecutionListener'
skipped "'org.apache.spark.sql.TestQueryExecutionListener' is not available. Skipping the related tests."
Missing Pandas
test_query_execution_listener_on_collect (pyspark.sql.tests.SQLTests3) ... ok
test_query_execution_listener_on_collect_with_arrow (pyspark.sql.tests.SQLTests3) ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.'
Missing PyArrow
test_query_execution_listener_on_collect (pyspark.sql.tests.SQLTests3) ... ok
test_query_execution_listener_on_collect_with_arrow (pyspark.sql.tests.SQLTests3) ... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.'
BTW, I don't feel strongly about this test. Let me know if you guys think it's better to just take it out. I am fine either way.
The test seems good to me, but would it be more appropriate to call it TestQueryExecutionListener instead?
Test build #89109 has finished for PR 21007 at commit
retest this please
Test build #89108 has finished for PR 21007 at commit
Test build #89127 has finished for PR 21007 at commit
Just a couple of questions, but overall looks good.
python/pyspark/sql/tests.py
Outdated
```python
    not _have_pandas or not _have_pyarrow,
    _pandas_requirement_message or _pyarrow_requirement_message)
def test_query_execution_listener_on_collect_with_arrow(self):
    # Here, it duplicates code in the ReusedSQLTestCase.sql_conf context manager.
```
I think it would be fine to refactor sql_conf a little so it can be used here; it makes things much clearer.
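A sketch of what a reusable `sql_conf` helper could look like after such a refactor, assuming a conf object with `get`/`set`/`unset` methods; a dict-backed stub stands in for `spark.conf` here so the example is self-contained.

```python
from contextlib import contextmanager

class StubConf:
    """Dict-backed stand-in for spark.conf, for illustration only."""
    def __init__(self):
        self._values = {}

    def get(self, key, default=None):
        return self._values.get(key, default)

    def set(self, key, value):
        self._values[key] = value

    def unset(self, key):
        self._values.pop(key, None)


@contextmanager
def sql_conf(conf, pairs):
    """Temporarily set the given confs, restoring old values on exit."""
    keys = list(pairs.keys())
    old_values = [conf.get(key) for key in keys]
    try:
        for key, value in pairs.items():
            conf.set(key, value)
        yield
    finally:
        # Restore each conf: re-set the previous value, or unset it
        # if it was not set before entering the block.
        for key, old in zip(keys, old_values):
            if old is None:
                conf.unset(key)
            else:
                conf.set(key, old)


conf = StubConf()
with sql_conf(conf, {"spark.sql.execution.arrow.enabled": "true"}):
    assert conf.get("spark.sql.execution.arrow.enabled") == "true"
assert conf.get("spark.sql.execution.arrow.enabled") is None
```

Factoring the save/restore logic into one context manager avoids duplicating it in each Arrow-related test.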
```scala
  def isCalled(): Boolean = isOnSuccessCalled.get()

  def clear(): Unit = isOnSuccessCalled.set(false)
}
```
does this need a newline at the end?
Nope, it already has one. GitHub shows a warning mark in this UI if it doesn't, IIRC.
```diff
@@ -0,0 +1,45 @@
+/*
```
Is it possible to modify this slightly and reuse it? https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/util/ExecutionListenerManagerSuite.scala#L48
I think it's possible. I just took a look; however, mind if I keep a separate one as-is, specifically for the Python test? Maybe I am worrying too much, but I am a bit hesitant to take a dependency on a class defined in another suite.
Yeah, I think that's fine. Thanks for putting a comment in the class explaining what it is for.
Will address the comments soon.
Test build #89163 has finished for PR 21007 at commit
retest this please
Test build #89173 has finished for PR 21007 at commit
retest this please
```scala
import org.apache.spark.sql.util.QueryExecutionListener


class TestQueryExecutionListener extends QueryExecutionListener with Logging {
```
No need for `with Logging` now?
oops true.
Test build #89182 has finished for PR 21007 at commit
Test build #89193 has finished for PR 21007 at commit
LGTM
```scala
import java.util.concurrent.atomic.AtomicBoolean

import org.apache.spark.internal.Logging
```
We should get rid of this import too. :)
D'oh.
"sql/core/target/scala-*/test-classes/org/apache/spark/sql/" | ||
"TestQueryExecutionListener.class") | ||
if not glob.glob(os.path.join(SPARK_HOME, filename_pattern)): | ||
raise unittest.SkipTest( |
I'm not sure about this part. In what case can't we find the class? When TestQueryExecutionListener.scala has been removed or moved? If that happens, should we just silently skip this test like this?
Ah, nope. It's when we build with `sbt package`, according to https://spark.apache.org/docs/latest/building-spark.html#building-with-sbt. In that case, test files are not actually compiled, so running the tests would hit some exceptions.
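A self-contained sketch of the skip guard being discussed, with a temporary directory standing in for `SPARK_HOME` (the real test checks the actual Spark home and raises `unittest.SkipTest` so the run reports a skip instead of erroring out):

```python
import glob
import os
import tempfile

def find_test_listener_class(spark_home):
    # The class file exists only when test sources were compiled;
    # it is absent after a plain `sbt package` build.
    pattern = os.path.join(
        spark_home,
        "sql/core/target/scala-*/test-classes/org/apache/spark/sql/"
        "TestQueryExecutionListener.class")
    return glob.glob(pattern)

with tempfile.TemporaryDirectory() as fake_spark_home:
    # Empty directory, so the glob finds nothing, which is the
    # condition under which the real test skips itself:
    #     raise unittest.SkipTest(
    #         "'org.apache.spark.sql.TestQueryExecutionListener' is not "
    #         "available. Skipping the related tests.")
    assert find_test_listener_class(fake_spark_home) == []
```

Raising `SkipTest` here means the suite shows up as "skipped" with a reason, rather than failing with an opaque py4j exception when the class is missing.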
I admit it's rare, but I believe this is more correct. In fact, there are a few test cases that actually take care of this.
And, for "If it happens, should we just silently skip this test like this?":
Yea, ideally we should warn explicitly in the console. The problem is our own testing script. We could make some changes to warn explicitly, but it seems we would need some duplicated changes.
There are some discussions / changes going on here - #20909
Ok. I see. Makes sense.
Thank you @viirya. I know this one is rather tricky to judge. I will maybe cc you when we actually discuss this further. I believe some people could think differently and I might need more discussion, but for now I feel sure about this.
LGTM
Test build #89267 has finished for PR 21007 at commit
retest this please
Test build #89279 has finished for PR 21007 at commit
retest this please
Test build #89290 has finished for PR 21007 at commit
retest this please
Test build #89304 has finished for PR 21007 at commit
Merged to master. Thanks for reviewing this @felixcheung, @viirya and @BryanCutler.
…uery executor listener

This PR proposes to add `collect` to a query executor as an action.

Seems `collect` / `collect` with Arrow are not recognised via `QueryExecutionListener` as an action. For example, if we have a custom listener as below:

```scala
package org.apache.spark.sql

import org.apache.spark.internal.Logging
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener


class TestQueryExecutionListener extends QueryExecutionListener with Logging {
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit = {
    logError("Look at me! I'm 'onSuccess'")
  }

  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = { }
}
```

and set `spark.sql.queryExecutionListeners` to `org.apache.spark.sql.TestQueryExecutionListener`.

Other operations on the PySpark or Scala side seem fine:

```python
>>> sql("SELECT * FROM range(1)").show()
```
```
18/04/09 17:02:04 ERROR TestQueryExecutionListener: Look at me! I'm 'onSuccess'
+---+
| id|
+---+
|  0|
+---+
```

```scala
scala> sql("SELECT * FROM range(1)").collect()
```
```
18/04/09 16:58:41 ERROR TestQueryExecutionListener: Look at me! I'm 'onSuccess'
res1: Array[org.apache.spark.sql.Row] = Array([0])
```

but ..

**Before**

```python
>>> sql("SELECT * FROM range(1)").collect()
```
```
[Row(id=0)]
```

```python
>>> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
>>> sql("SELECT * FROM range(1)").toPandas()
```
```
   id
0   0
```

**After**

```python
>>> sql("SELECT * FROM range(1)").collect()
```
```
18/04/09 16:57:58 ERROR TestQueryExecutionListener: Look at me! I'm 'onSuccess'
[Row(id=0)]
```

```python
>>> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
>>> sql("SELECT * FROM range(1)").toPandas()
```
```
18/04/09 17:53:26 ERROR TestQueryExecutionListener: Look at me! I'm 'onSuccess'
   id
0   0
```

I have manually tested as described above and a unit test was added.
Author: hyukjinkwon <gurwls223@apache.org>

Closes apache#21007 from HyukjinKwon/SPARK-23942.

(cherry picked from commit ab7b961)
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
```diff
@@ -3189,10 +3189,10 @@ class Dataset[T] private[sql](
   private[sql] def collectToPython(): Int = {
     EvaluatePython.registerPicklers()
-    withNewExecutionId {
+    withAction("collectToPython", queryExecution) { plan =>
```
These changes can cause behavior changes. Please submit a PR to document it.
@HyukjinKwon @BryanCutler @viirya @felixcheung The first sentence of this PR really scares me. After reading the PR description, I found it is wrong. The PR description will be part of our change log, so please be careful to ensure it is right.
What's wrong in the description and PR title, and what should be documented? Do you mean the first sentence? It's already documented - spark/sql/core/src/main/scala/org/apache/spark/sql/util/QueryExecutionListener.scala, line 44 in bd4eb9c
Should we document
@gatorsmile I think the PR description here is great and very detailed; what exactly is wrong and scary?