[SPARK-25680][SQL] SQL execution listener shouldn't happen on execution thread #22674

cloud-fan · 2018-10-08T17:27:27Z

What changes were proposed in this pull request?

The SQL execution listener framework was created from scratch (see #9078). It didn't leverage what we already have in the Spark listener framework. One major problem is that the listener runs on the Spark execution thread, which means a bad listener can block Spark's query processing.

This PR re-implements the SQL execution listener framework. Now ExecutionListenerManager is just a normal Spark listener, which watches the SparkListenerSQLExecutionEnd events and posts events to the user-provided SQL execution listeners.
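
For illustration, the asynchronous dispatch described above could look roughly like the following. This is a minimal, stdlib-only sketch and not the code in this PR: `AsyncBus`, `ExecutionEnd`, and `QueryListener` are hypothetical stand-ins for the listener bus, the SQL execution end event, and the user-provided listener interface.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import scala.util.control.NonFatal

// Hypothetical stand-in for the "SQL execution ended" event.
final case class ExecutionEnd(funcName: String, durationNs: Long, failure: Option[Throwable])

// Hypothetical stand-in for a user-provided listener.
trait QueryListener {
  def onSuccess(funcName: String, durationNs: Long): Unit
  def onFailure(funcName: String, error: Throwable): Unit
}

// Delivers events on a dedicated thread, so a slow or broken listener
// never blocks the thread that posts the event.
final class AsyncBus {
  private val executor = Executors.newSingleThreadExecutor()
  @volatile private var listeners = Vector.empty[QueryListener]

  def register(l: QueryListener): Unit = synchronized { listeners = listeners :+ l }

  // Returns immediately; the event is handled later on the bus thread.
  def post(e: ExecutionEnd): Unit = executor.execute(() => dispatch(e))

  private def dispatch(e: ExecutionEnd): Unit = listeners.foreach { l =>
    try {
      e.failure match {
        case Some(ex) => l.onFailure(e.funcName, ex)
        case None => l.onSuccess(e.funcName, e.durationNs)
      }
    } catch {
      // A failing listener must not prevent delivery to the remaining ones.
      case NonFatal(_) => ()
    }
  }

  // Drains already-posted events, then stops the bus thread.
  def stop(): Boolean = {
    executor.shutdown()
    executor.awaitTermination(5, TimeUnit.SECONDS)
  }
}
```

In this sketch the query thread only pays the cost of `post`; listener callbacks run on the bus thread.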

How was this patch tested?

Existing tests.

cloud-fan · 2018-10-08T17:27:43Z

cc @gatorsmile @brkyvz

cloud-fan · 2018-10-08T17:34:10Z

sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLListener.scala

-  extends SparkListenerEvent
+  extends SparkListenerEvent {
+
+  @JsonIgnore private[sql] var executionName: Option[String] = None


For backward compatibility, I made these new fields vars.

Why do we want to be backwards compatible here? SHS?

It's a developer API, which is public. Its backward-compatibility guarantee is weaker than that of end-user public APIs, but we should still keep it unchanged if that's not too hard.

That said, a developer can write a Spark listener and catch this event.

cloud-fan · 2018-10-08T17:35:37Z

sql/core/src/main/scala/org/apache/spark/sql/util/QueryExecutionListener.scala

-  override def clone(): ExecutionListenerManager = writeLock {
-    val newListenerManager = new ExecutionListenerManager
-    listeners.foreach(newListenerManager.register)
+  def clone(session: SparkSession): ExecutionListenerManager = {


I don't know why this method was public in the first place... I have to break it here.

Could you add a MiMa exclusion rule?

SparkQA · 2018-10-08T17:46:40Z

Test build #97122 has finished for PR 22674 at commit 1701f3b.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-10-08T17:53:32Z

Test build #97123 has finished for PR 22674 at commit 28f64d0.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-10-09T05:51:01Z

Test build #97140 has finished for PR 22674 at commit a456226.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-10-09T07:05:02Z

Test build #97145 has finished for PR 22674 at commit 436197b.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

jiangxb1987 · 2018-10-09T13:15:02Z

retest this please

jiangxb1987

looks good

jiangxb1987 · 2018-10-09T14:37:25Z

sql/core/src/main/scala/org/apache/spark/sql/util/QueryExecutionListener.scala

@@ -75,95 +76,69 @@ trait QueryExecutionListener {
 */
 @Experimental
 @InterfaceStability.Evolving
-class ExecutionListenerManager private extends Logging {
+class ExecutionListenerManager private[sql](session: SparkSession, loadExtensions: Boolean)


nit: we should add param comments.

SparkQA · 2018-10-09T17:40:52Z

Test build #97159 has finished for PR 22674 at commit 436197b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-10-09T19:03:01Z

Test build #97161 has finished for PR 22674 at commit 642ddd3.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

hvanhovell · 2018-10-09T21:41:26Z

sql/core/src/main/scala/org/apache/spark/sql/util/QueryExecutionListener.scala

-
-  private[sql] def this(conf: SparkConf) = {
-    this()
+// The `session` is used to indicate which session carries this listener manager, and we only


Why is this not a class doc?

The constructor is private, so we should not make it visible in the class doc

hvanhovell · 2018-10-09T22:33:04Z

sql/core/src/main/scala/org/apache/spark/sql/util/QueryExecutionListener.scala

-      wl.unlock()
-    }
+  private def shouldCatchEvent(e: SparkListenerSQLExecutionEnd): Boolean = {
+    // Only catch SQL execution with a name, and triggered by the same spark session that this


So this is what bugs me. You are adding separation between the SparkSession and its listeners, only to undo that here. It seems like a bit of a hassle to go through because you basically need async execution.

Yes. Assume we have many Spark sessions running queries at the same time. Each session sends query execution events to the central event bus, and sets up a listener to watch its own query execution events asynchronously.

To make it work, the most straightforward way is to carry the session identifier in the events, so that the listener only watches events with the expected session identifier.

Maybe a better way is to introduce the session concept into Spark core, so the listener framework can dispatch events per session automatically. But that's a lot of work.

We had the same problem in the StreamingQueryListener. You can check how we solved it in StreamExecution. Since each SparkSession will have its own ExecutionListenerManager, you may be able to have only the proper ExecutionListenerManager deal with its own messages.

@brkyvz thanks for the information! It seems the StreamingQueryListener framework picks the same idea but the implementation is better. I'll update my PR accordingly.
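
As a rough illustration of the "carry the session identifier in the events" idea discussed in this thread, here is a stdlib-only sketch. It is not the PR's implementation: `SqlExecutionEnd` and `SessionListenerManager` are hypothetical names, and a `UUID` is assumed as the session identifier.

```scala
import java.util.UUID

// Hypothetical event: carries the ID of the session that ran the query and an
// optional execution name (unnamed executions are not reported to listeners).
final case class SqlExecutionEnd(sessionId: UUID, executionName: Option[String])

// One manager per session. All managers receive every event from the shared
// bus, and each one keeps only the events of its own session.
final class SessionListenerManager(ownSessionId: UUID) {
  private var handled = Vector.empty[String]

  // Only catch executions that have a name and were triggered by the owning session.
  def shouldCatchEvent(e: SqlExecutionEnd): Boolean =
    e.executionName.isDefined && e.sessionId == ownSessionId

  def onEvent(e: SqlExecutionEnd): Unit =
    if (shouldCatchEvent(e)) handled = handled :+ e.executionName.get

  def handledNames: Vector[String] = handled
}
```

The filter costs one comparison per event per session, which is the overhead the "dispatch per session in Spark core" alternative would avoid.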

hvanhovell · 2018-10-09T22:38:05Z

sql/core/src/main/scala/org/apache/spark/sql/util/QueryExecutionListener.scala

+      val funcName = e.executionName.get
+      e.executionFailure match {
+        case Some(ex) =>
+          listeners.iterator().asScala.foreach(_.onFailure(funcName, e.qe, ex))


This is a bit of a high-level thought: you could consider making the calling event queue responsible for the dispatch of these events. That way you can leverage any improvement to the underlying event bus.

ExecutionListenerManager is already a listener, which runs in a separate thread and receives events from LiveListenerBus.

brkyvz

This is a much larger change than I was expecting, but definitely a better one than the one I had imagined. Left minor comments.

brkyvz · 2018-10-10T09:23:43Z

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

-      case e: Exception =>
-        sparkSession.listenerManager.onFailure(name, qe, e)
-        throw e
+    qe.executedPlan.foreach { plan =>


Can this throw an exception? Imagine that df.count() threw an exception, and then you run it again.
Won't this be a behavior change in that case?

I don't think resetMetrics can throw an exception...

Can't executedPlan throw an exception? I thought it could if the original Spark plan failed.

Ah, I see your point here.
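
To make the concern concrete, here is a small sketch under the assumption that `executedPlan` behaves like a Scala `lazy val` whose initializer can throw. In Scala, such an initializer is re-run on the next access, so touching the plan outside the guarded region can throw again on a retry without the failure callback being invoked. `FakeQueryExecution` and `SafeAction` are hypothetical names for illustration only.

```scala
import scala.util.{Failure, Success, Try}

// Hypothetical stand-in for a query execution whose planning may fail.
final class FakeQueryExecution(fail: Boolean) {
  var planAttempts = 0
  lazy val executedPlan: Seq[String] = {
    planAttempts += 1
    if (fail) throw new IllegalStateException("planning failed")
    Seq("Scan", "Aggregate")
  }
}

object SafeAction {
  // Keeps every access to `executedPlan` inside the guarded region, so a
  // planning failure is reported through `onFailure` instead of escaping.
  def run(qe: FakeQueryExecution)(onFailure: Throwable => Unit): Try[Int] = {
    val result = Try {
      qe.executedPlan.foreach(_ => ()) // stands in for resetting per-node metrics
      qe.executedPlan.size
    }
    result match {
      case Failure(ex) => onFailure(ex)
      case Success(_) => ()
    }
    result
  }
}
```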

brkyvz · 2018-10-10T09:25:42Z

sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecution.scala

+          // can specify the execution name in more places in the future, so that
+          // `QueryExecutionListener` can track more cases.
+          event.executionName = name
+          event.duration = endTime - startTime


duration used to be reported in nanos. Now it's millis. I would still report it as nanos if possible.

ah good catch!
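
A minimal sketch of keeping the reported duration in nanoseconds, assuming the timestamps can be taken from a monotonic nanosecond clock; if only millisecond timestamps are available, the difference has to be converted explicitly. `DurationUnits` and its methods are hypothetical helpers, not part of this PR.

```scala
import java.util.concurrent.TimeUnit

object DurationUnits {
  // Runs `body` and returns its result together with the elapsed time in nanoseconds.
  def timedNanos[A](body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    (result, System.nanoTime() - start)
  }

  // Explicit conversions, so the unit of a reported duration is never ambiguous.
  def millisToNanos(ms: Long): Long = TimeUnit.MILLISECONDS.toNanos(ms)
  def nanosToMillis(ns: Long): Long = TimeUnit.NANOSECONDS.toMillis(ns)
}
```

Note that converting a millisecond difference to nanoseconds keeps the unit consistent for listeners but does not restore sub-millisecond precision.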

SparkQA · 2018-10-10T14:10:02Z

Test build #97200 has finished for PR 22674 at commit a25524b.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-10-10T18:18:11Z

Test build #97203 has finished for PR 22674 at commit 3ffa536.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-10-11T02:48:17Z

retest this please

SparkQA · 2018-10-11T06:48:42Z

Test build #97231 has finished for PR 22674 at commit 3ffa536.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2018-10-11T07:04:54Z

hmm, seems it failed at the same test.

cloud-fan · 2018-10-11T11:31:07Z

I couldn't reproduce it locally, let me try again

cloud-fan · 2018-10-11T11:31:14Z

retest this please

brkyvz · 2018-10-11T11:54:25Z

I would just increase the timeout in that suite. Now that we're pushing a lot more events to the LiveListenerBus, it may not be draining quickly enough. On a slow Jenkins, that could cause flakiness.

SparkQA · 2018-10-11T15:29:06Z

Test build #97251 has finished for PR 22674 at commit 3ffa536.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-10-11T20:33:23Z

Test build #97269 has finished for PR 22674 at commit b3e546d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-10-12T00:50:34Z

Test build #97277 has finished for PR 22674 at commit 0bfc240.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-10-12T07:00:29Z

Test build #97291 has finished for PR 22674 at commit 6e3a345.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

brkyvz · 2018-10-12T11:37:06Z

sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLListener.scala

+  // The following 3 fields are only accessed when `executionName` is defined.
+
+  // The duration of the SQL execution, in nanoseconds.
+  @JsonIgnore private[sql] var duration: Long = 0L


Did you verify that the JsonIgnore annotation actually works? For some reason, I actually needed to annotate the class as

@JsonIgnoreProperties(Array("a", "b", "c"))
class SomeClass {
  @JsonProperty("a") val a: ...
  @JsonProperty("b") val b: ...
}

The reason is that Json4s understands that API better. I believe we use Json4s for all of these events.

There is a test to verify it: https://github.com/apache/spark/pull/22674/files#diff-6fa1d00d1cb20554dda238f2a3bc3ecbR55

I also used @JsonIgnoreProperties before, when I put these fields in the case class constructor. It seems we don't need @JsonIgnoreProperties when they are private vars.

SparkQA · 2018-10-15T19:12:19Z

Test build #97397 has finished for PR 22674 at commit c42b499.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jiangxb1987 · 2018-10-16T01:58:54Z

LGTM, do you have any other concerns @hvanhovell @brkyvz @dongjoon-hyun ?

cloud-fan · 2018-10-17T08:04:40Z

since there is no objection, I'm merging it to master, thanks!

…on thread

## What changes were proposed in this pull request?

The SQL execution listener framework was created from scratch(see apache#9078). It didn't leverage what we already have in the spark listener framework, and one major problem is, the listener runs on the spark execution thread, which means a bad listener can block spark's query processing.

This PR re-implements the SQL execution listener framework. Now `ExecutionListenerManager` is just a normal spark listener, which watches the `SparkListenerSQLExecutionEnd` events and post events to the user-provided SQL execution listeners.

## How was this patch tested?

existing tests.

Closes apache#22674 from cloud-fan/listener.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

cloud-fan force-pushed the listener branch from 1701f3b to 28f64d0 Compare October 8, 2018 17:33

SQL execution listener shouldn't happen on execution thread

a456226

cloud-fan force-pushed the listener branch from 28f64d0 to a456226 Compare October 9, 2018 02:17

fix tests

436197b

jiangxb1987 approved these changes Oct 9, 2018

address comment

642ddd3

hvanhovell reviewed Oct 9, 2018

brkyvz suggested changes Oct 10, 2018

address comments

a25524b

add back Logging

3ffa536

fix a mistake

0bfc240

cloud-fan force-pushed the listener branch from b3e546d to 0bfc240 Compare October 11, 2018 18:11

Merge branch 'master' into listener

6e3a345

brkyvz reviewed Oct 12, 2018

Merge branch 'master' into listener

c42b499

asfgit closed this in 9690eba Oct 17, 2018
