Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-34806][SQL] Add Observation helper for Dataset.observe #33422

Closed
wants to merge 26 commits into from

Conversation

EnricoMi
Copy link
Contributor

What changes were proposed in this pull request?

This pull request introduces a helper class that simplifies usage of Dataset.observe() for batch datasets:

val observation = Observation("name")
val observed = ds.observe(observation, max($"id").as("max_id"))
observed.count()
val metrics = observation.get

Why are the changes needed?

Currently, users are required to implement the QueryExecutionListener interface to retrieve the metrics, as well as apply some knowledge on threading and locking to pull the metrics over to the main thread. With the helper class, metrics can be retrieved from batch dataset processing with three lines of code (the action on the observed dataset does not count as a line of code here).

Does this PR introduce any user-facing change?

Yes, one new class and one `Dataset`` method.

How was this patch tested?

Adds a unit test to DataFrameSuite, similar to "get observable metrics by callback" in DataFrameCallbackSuite.

@EnricoMi
Copy link
Contributor Author

@cloud-fan @HyukjinKwon

@HyukjinKwon
Copy link
Member

ok to test

}

/**
* (Scala-specific) Create a named or anonymous instance of Observation.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's also add since

 * @since 3.3.0

@HyukjinKwon
Copy link
Member

cc @hvanhovell too

@SparkQA
Copy link

SparkQA commented Jul 20, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45799/

@SparkQA
Copy link

SparkQA commented Jul 20, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45799/

/**
* Observation constructor for creating an anonymous observation.
*/
def apply(): Observation = new Observation(UUID.randomUUID().toString)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we also add a default parameter in the constructor? then java users can also get benefits.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nah, Observation companion won't work anyway, and the default parameter doesn't work in Java side.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, we can define another constructor at the class though for Java side.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, I have added the unnamed constructor to the Observation class.

I have added a test to JavaDataFrameSuite to test the interaction from Java with observations. It shows that the Dataset.observe API is not really Java-friendly (see JavaDataFrameSuite.testObservation), but this shouldn't prevent us from making Observation Java-friendly.

The Dataset.observe methods could be made Java-friendly (by adding @varargs) in a separate PR. @cloud-fan @HyukjinKwon @hvanhovell What is you opinion on that effort?

@SparkQA
Copy link

SparkQA commented Jul 20, 2021

Test build #141285 has finished for PR 33422 at commit 2a21bb3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

I have no more comments on that otherwise.

@SparkQA
Copy link

SparkQA commented Jul 20, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45833/

@SparkQA
Copy link

SparkQA commented Jul 20, 2021

Test build #141322 has finished for PR 33422 at commit 5644aba.

  • This patch fails Java style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 20, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45837/

@SparkQA
Copy link

SparkQA commented Jul 20, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45837/

* @group typedrel
* @since 3.3.0
*/
def observe(observation: Observation, expr: Column, exprs: Column*): Dataset[T] = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if it's simply adding an annotation @varargs, shall we just do it in this PR?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh yeah!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about observe(String, Column, Column*)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine. Let's don't add it for now

@SparkQA
Copy link

SparkQA commented Jul 20, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45845/

@SparkQA
Copy link

SparkQA commented Jul 20, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45845/

@SparkQA
Copy link

SparkQA commented Jul 20, 2021

Test build #141330 has finished for PR 33422 at commit 4f620a9.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 21, 2021

Test build #141389 has finished for PR 33422 at commit 4d78f0c.

  • This patch fails Java style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 21, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45907/

@SparkQA
Copy link

SparkQA commented Jul 21, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45917/

@SparkQA
Copy link

SparkQA commented Jul 21, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45917/

@SparkQA
Copy link

SparkQA commented Jul 21, 2021

Test build #141399 has finished for PR 33422 at commit 3b2d9ae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@EnricoMi
Copy link
Contributor Author

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

I hope this now really means we are good to go.

@HyukjinKwon
Copy link
Member

Yeah, looks like Javadoc build passed too at https://github.com/G-Research/spark/runs/3122095226?check_suite_focus=true

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 4e9c1b8 Jul 22, 2021
@SparkQA
Copy link

SparkQA commented Jul 22, 2021

Test build #141478 has finished for PR 33422 at commit 3b2d9ae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@EnricoMi
Copy link
Contributor Author

@cloud-fan @HyukjinKwon thanks for your time and valuable input!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
4 participants