[SPARK-34806][SQL] Add Observation helper for Dataset.observe #33422

EnricoMi · 2021-07-19T15:09:22Z

What changes were proposed in this pull request?

This pull request introduces a helper class that simplifies usage of Dataset.observe() for batch datasets:

val observation = Observation("name")
val observed = ds.observe(observation, max($"id").as("max_id"))
observed.count()
val metrics = observation.get

Why are the changes needed?

Currently, users are required to implement the QueryExecutionListener interface to retrieve the metrics, as well as apply some knowledge on threading and locking to pull the metrics over to the main thread. With the helper class, metrics can be retrieved from batch dataset processing with three lines of code (the action on the observed dataset does not count as a line of code here).
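To illustrate the "threading and locking" part of that burden, here is a minimal, Spark-free sketch of the hand-off a listener-based solution needs: one thread delivers the metrics, another blocks until they arrive. `MetricsHolder`, `onFinish` and `get` are hypothetical names chosen for this sketch; it is not the `Observation` implementation from this PR.

```scala
// Hypothetical stand-in for the hand-off between a listener thread and the
// calling thread. Not the Observation class added by this PR.
final class MetricsHolder[T] {
  // Written by the listener thread, read by the caller; guarded by `this`.
  private var value: Option[T] = None

  // Called from the listener thread once metrics are available.
  def onFinish(metrics: T): Unit = synchronized {
    if (value.isEmpty) { // keep the first result, ignore repeated callbacks
      value = Some(metrics)
      notifyAll()
    }
  }

  // Blocks the caller until onFinish has delivered a value.
  def get: T = synchronized {
    while (value.isEmpty) wait() // the loop guards against spurious wakeups
    value.get
  }
}

object MetricsHolderDemo {
  def main(args: Array[String]): Unit = {
    val holder = new MetricsHolder[Map[String, Long]]
    // Simulates the listener callback arriving on another thread.
    val listener = new Thread(() => holder.onFinish(Map("max_id" -> 99L)))
    listener.start()
    val metrics = holder.get
    println(metrics("max_id"))
  }
}
```

With a real listener, the call to `onFinish` would sit inside the `QueryExecutionListener` callback, and the listener would also have to be registered and unregistered by hand.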

Does this PR introduce any user-facing change?

Yes, one new class and one `Dataset` method.

How was this patch tested?

Adds a unit test to DataFrameSuite, similar to "get observable metrics by callback" in DataFrameCallbackSuite.

EnricoMi · 2021-07-19T15:13:19Z

@cloud-fan @HyukjinKwon

HyukjinKwon · 2021-07-20T00:26:38Z

ok to test

HyukjinKwon · 2021-07-20T00:27:12Z

sql/core/src/main/scala/org/apache/spark/sql/Observation.scala

+}
+
+/**
+ * (Scala-specific) Create a named or anonymous instance of Observation.


Let's also add since

* @since 3.3.0

HyukjinKwon · 2021-07-20T00:27:48Z

cc @hvanhovell too

SparkQA · 2021-07-20T01:44:26Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45799/

SparkQA · 2021-07-20T02:20:51Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45799/

cloud-fan · 2021-07-20T03:20:06Z

sql/core/src/main/scala/org/apache/spark/sql/Observation.scala

+  /**
+   * Observation constructor for creating an anonymous observation.
+   */
+  def apply(): Observation = new Observation(UUID.randomUUID().toString)


shall we also add a default parameter in the constructor? then java users can also get benefits.

Nah, Observation companion won't work anyway, and the default parameter doesn't work in Java side.

Oh, we can define another constructor at the class though for Java side.

Alright, I have added the unnamed constructor to the Observation class.

I have added a test to JavaDataFrameSuite to test the interaction from Java with observations. It shows that the Dataset.observe API is not really Java-friendly (see JavaDataFrameSuite.testObservation), but this shouldn't prevent us from making Observation Java-friendly.

The Dataset.observe methods could be made Java-friendly (by adding @varargs) in a separate PR. @cloud-fan @HyukjinKwon @hvanhovell What is your opinion on that effort?
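The two Java-friendliness points raised in this thread can be sketched without Spark: an auxiliary constructor in place of a default parameter or companion `apply()`, and `@varargs` on a method with a repeated parameter. `Named` and `describe` are hypothetical names used only for illustration; the method shape merely mirrors `observe(Observation, Column, Column*)`.

```scala
import java.util.UUID
import scala.annotation.varargs

// Hypothetical class, for illustration only; not part of Spark.
class Named(val name: String) {
  // Auxiliary constructor: Java callers can write `new Named()`,
  // which a Scala default parameter would not give them.
  def this() = this(UUID.randomUUID().toString)

  // @varargs asks the Scala compiler to also emit a Java-style
  // varargs forwarder, so Java can call describe("a", "b", "c").
  @varargs
  def describe(first: String, rest: String*): String =
    (first +: rest).mkString(name + ": ", ", ", "")
}

object NamedDemo {
  def main(args: Array[String]): Unit = {
    val n = new Named("metrics")
    println(n.describe("a", "b", "c"))
    println(new Named().name) // a random UUID string
  }
}
```

Without the annotation, a Java caller would have to build a Scala `Seq` for the trailing arguments.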

SparkQA · 2021-07-20T05:10:02Z

Test build #141285 has finished for PR 33422 at commit 2a21bb3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2021-07-20T09:56:40Z

I have no more comments on that otherwise.

SparkQA · 2021-07-20T10:15:02Z

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45833/

SparkQA · 2021-07-20T10:18:35Z

Test build #141322 has finished for PR 33422 at commit 5644aba.

This patch fails Java style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-07-20T11:51:53Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45837/

SparkQA · 2021-07-20T12:27:02Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45837/

cloud-fan · 2021-07-20T13:03:39Z

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

+   * @group typedrel
+   * @since 3.3.0
+   */
+  def observe(observation: Observation, expr: Column, exprs: Column*): Dataset[T] = {


if it's simply adding an annotation @varargs, shall we just do it in this PR?

what about observe(String, Column, Column*)?

That's fine. Let's not add it for now.

SparkQA · 2021-07-20T13:41:15Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45845/

SparkQA · 2021-07-20T14:15:17Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45845/

SparkQA · 2021-07-20T16:19:45Z

Test build #141330 has finished for PR 33422 at commit 4f620a9.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-07-21T06:25:15Z

Test build #141389 has finished for PR 33422 at commit 4d78f0c.

This patch fails Java style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-07-21T07:09:35Z

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45907/

SparkQA · 2021-07-21T10:48:04Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45917/

SparkQA · 2021-07-21T11:31:57Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45917/

SparkQA · 2021-07-21T15:36:57Z

Test build #141399 has finished for PR 33422 at commit 3b2d9ae.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

EnricoMi · 2021-07-21T17:28:05Z

This patch passes all tests.

This patch merges cleanly.

This patch adds no public classes.

I hope this now really means we are good to go.

HyukjinKwon · 2021-07-22T08:09:31Z

Yeah, looks like Javadoc build passed too at https://github.com/G-Research/spark/runs/3122095226?check_suite_focus=true

cloud-fan · 2021-07-22T08:57:00Z

thanks, merging to master!

SparkQA · 2021-07-22T12:03:44Z

Test build #141478 has finished for PR 33422 at commit 3b2d9ae.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

EnricoMi · 2021-07-22T18:33:39Z

@cloud-fan @HyukjinKwon thanks for your time and valuable input!

EnricoMi added 19 commits July 16, 2021 20:02

Add Observation helper for Dataset.observe

c0ccf3d

Move column expressions from Observation() to Dataset.observe()

6a8dacf

An observation can only be used with a Dataset once

db36a65

Turn case class into class, reword reset() docstring

4cabd90

Reduce Observation API to a get and waitComplete, remove reset and close

36b34e0

Replace lock/condition by synchronized, add asserts, make thread-safe

3dc8a4b

Always call notify, but assert row is defined

62add3f

Remove unused function argument

fafcd26

Make public methods Java compatible

1cd466a

Finish scaladoc

ca67030

Simplify private waitCompleted method

a881630

Handle millis=Some(0) and multiple calls to onFinish

a3e459a

Mark var sparkSession volatile

a32bb9c

Removed waitCompleted, notify only when metric is retrieved

6ea55e5

Handle spurious wakeups

ed18ad2

Replace IllegalStateException with IllegalArgumentException

4632670

Move this to @since 3.3.0

4cc44d9

Minor docstring changes

b7fd9b3

Fixing unidoc errors

2a21bb3

github-actions bot added the SQL label Jul 19, 2021

EnricoMi mentioned this pull request Jul 19, 2021

[SPARK-34806][SQL] Add Observation helper for Dataset.observe #31905

Closed

HyukjinKwon reviewed Jul 20, 2021

cloud-fan reviewed Jul 20, 2021

Add constructor for anonymous Observation to class

53c7986

Fix import and indentation in JavaDataFrameSuite

5644aba

EnricoMi added 2 commits July 20, 2021 13:05

Fix Java style errors in unit tests

db8b02c

Fix type of empty Seq in java test

4f620a9

cloud-fan reviewed Jul 20, 2021

Add @varargs annotation to Dataset.observe(Observation, Column, Column*)

3b2d9ae

EnricoMi force-pushed the branch-observation branch from 4d78f0c to 3b2d9ae Compare July 21, 2021 08:39

cloud-fan approved these changes Jul 22, 2021

cloud-fan closed this in 4e9c1b8 Jul 22, 2021

EnricoMi mentioned this pull request Jul 22, 2021

[SPARK-36263][SQL][PYTHON] Add Dataframe.observation to PySpark #33484

Closed

EnricoMi mentioned this pull request Sep 22, 2021

Reliable Accumulators G-Research/spark-extension#3

Closed

EnricoMi deleted the branch-observation branch September 9, 2022 08:12
