
[SPARK-47545][CONNECT] Dataset observe support for the Scala client #45701

Closed
wants to merge 24 commits

Conversation

xupefei
Contributor

@xupefei xupefei commented Mar 25, 2024

What changes were proposed in this pull request?

This PR adds support for `Dataset.observe` to the Spark Connect Scala client. Note that the support here does not include listeners, as those run on the server side.

This PR includes a small refactoring of the `Observation` helper class. We extracted the methods that are not bound to the SparkSession into `spark-api`, and added two subclasses, one in `spark-core` and one in `spark-jvm-client`.
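
The refactoring described above can be sketched roughly like this (class and method names are illustrative, not the actual Spark source): the session-agnostic logic lives in an abstract base class, and each module adds a thin subclass bound to its own session type.

```scala
// Hypothetical sketch of the described refactoring. The session-agnostic
// logic (storing metrics, notifying waiters) sits in an abstract base;
// spark-core and spark-jvm-client would each add a thin subclass.
abstract class ObservationBase(val name: String) {
  private val lock = new Object
  @volatile private var metrics: Option[Map[String, Any]] = None

  // Store the metrics and wake any waiting thread; report success only
  // when some non-empty metrics were actually delivered.
  def setMetricsAndNotify(m: Option[Map[String, Any]]): Boolean = lock.synchronized {
    metrics = m
    lock.notifyAll()
    m.exists(_.nonEmpty)
  }

  // Block (up to timeoutMs) until metrics arrive, then return them.
  def getOrWait(timeoutMs: Long): Option[Map[String, Any]] = lock.synchronized {
    if (metrics.isEmpty) lock.wait(timeoutMs)
    metrics
  }
}

// A stub stands in here for the Connect client subclass, which in the PR
// would additionally bind the base to the Connect SparkSession.
final class ConnectObservation(name: String) extends ObservationBase(name)
```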

Why are the changes needed?

Before this PR, the `DF.observe` method was only supported in the Python client.

Does this PR introduce any user-facing change?

Yes. The user can now call `DF.observe(name, metrics...)` or `DF.observe(observationObject, metrics...)` to collect statistics over the columns of a DataFrame.

How was this patch tested?

Added new e2e tests.

Was this patch authored or co-authored using generative AI tooling?

Nope.

Member

@HyukjinKwon HyukjinKwon left a comment


Let's make sure this is matched with the Python version.

@xupefei
Contributor Author

xupefei commented Mar 26, 2024

Let's make sure this is matched with the Python version.

Thanks Hyukjin! Yes, the behavior largely matches the Python version (both internals and user-facing APIs). One difference, though, is the `df.collectObservations()` API, which in Python has a different syntax: `df.attrs["observed_metrics"]`.

@xupefei xupefei marked this pull request as ready for review March 26, 2024 15:34
@xupefei xupefei changed the title [WIP] [SPARK-47545][Connect] Dataset observe support for the Scala client [SPARK-47545][Connect] Dataset observe support for the Scala client Mar 26, 2024
@HyukjinKwon HyukjinKwon changed the title [SPARK-47545][Connect] Dataset observe support for the Scala client [SPARK-47545][CONNECT] Dataset observe support for the Scala client Mar 28, 2024
@ueshin
Member

ueshin commented Mar 28, 2024

So df.collectObservations() seems to be a new API available only in Spark Connect Scala client?

@xupefei
Contributor Author

xupefei commented Apr 2, 2024

So df.collectObservations() seems to be a new API available only in Spark Connect Scala client?

Yes, similar to `df.attrs["observed_metrics"]`, which is only in the Python client.

@xupefei xupefei requested a review from ueshin April 2, 2024 15:21
@xupefei xupefei requested a review from hvanhovell April 17, 2024 07:42
@xupefei xupefei requested a review from hvanhovell April 18, 2024 16:40
metrics: Option[Map[String, Any]]): Unit = {
observationRegistry.get(planId).map { observation =>
if (observation.setMetricsAndNotify(metrics)) {
observationRegistry.remove(planId)
Contributor


Should this be tied to whether or not the observation has been successfully updated? Another question: under what circumstances can the metrics be empty?

Contributor Author

@xupefei xupefei Apr 19, 2024


I had the same question when I looked at the code. In Spark Core we only de-register the Observation when some non-empty metrics are set, so I decided to keep the behavior the same in Connect. I am not sure under which circumstances the metrics can be empty.

Contributor Author


I looked at the code. It seems it's valid to check for non-empty metrics in Spark Core:

private[spark] def onFinish(qe: QueryExecution): Unit = {
  ...
  val row: Option[Row] = qe.observedMetrics.get(name)
  val metrics: Option[Map[String, Any]] = row.map(r => r.getValuesMap[Any](r.schema.fieldNames.toImmutableArraySeq))
  if (setMetricsAndNotify(metrics)) {
    unregister()
  }
}

The `Option` is for the case where the query finishes without metrics. Is this possible?

Contributor Author


Nevertheless, we don't need to handle this case in Connect, because we look up observations by plan ID.
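
The plan-ID lookup can be modeled in miniature (a hypothetical sketch, not the actual Connect client code — the registry and callback names are illustrative): the registry entry is removed only when `setMetricsAndNotify` reports that non-empty metrics were delivered.

```scala
import scala.collection.concurrent.TrieMap

// Minimal stand-in for the client-side Observation: remembers the metrics
// and reports whether non-empty metrics were delivered.
final class Observation {
  @volatile var metrics: Option[Map[String, Any]] = None
  def setMetricsAndNotify(m: Option[Map[String, Any]]): Boolean = {
    metrics = m
    m.exists(_.nonEmpty)
  }
}

// Observations are registered under the plan ID of the observed Dataset.
val observationRegistry = new TrieMap[Long, Observation]

// Server-pushed metrics are routed by plan ID; the observation is
// de-registered only once real (non-empty) metrics have arrived.
def onObservedMetrics(planId: Long, metrics: Option[Map[String, Any]]): Unit =
  observationRegistry.get(planId).foreach { observation =>
    if (observation.setMetricsAndNotify(metrics)) {
      observationRegistry.remove(planId)
    }
  }
```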

val observedDf = df.observe(observation, min("id"), avg("id"), max("id"))

// Start a new thread to get the observation
val future = Future(observation.get)(ExecutionContext.global)
Contributor


For the record: IMO the Observation class should have been using a Future from the get-go.
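
That suggestion could look roughly like this (a hypothetical sketch, not what the PR implements — the class and method names are made up): expose the observed metrics as a `Future` that is completed exactly once, instead of a blocking `get`.

```scala
import scala.concurrent.{Future, Promise}

// Hypothetical Future-based variant of Observation: the first delivery of
// non-empty metrics completes the promise; later deliveries are no-ops.
final class FutureObservation(val name: String) {
  private val promise = Promise[Map[String, Any]]()

  // Returns true only if this call completed the future.
  def setMetrics(metrics: Map[String, Any]): Boolean =
    metrics.nonEmpty && promise.trySuccess(metrics)

  def future: Future[Map[String, Any]] = promise.future
}
```

Callers could then `Await.result(obs.future, timeout)` or compose with `map`/`onComplete`, instead of spawning a thread around a blocking `observation.get` as the test above does.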

@xupefei xupefei requested a review from hvanhovell April 19, 2024 13:48
(0 until metric.getKeysCount).map { i =>
val key = metric.getKeys(i)
val value = LiteralValueProtoConverter.toCatalystValue(metric.getValues(i))
schema = schema.add(key, LiteralValueProtoConverter.toDataType(value.getClass))
Contributor


There is a bit of a twist here: `LiteralValueProtoConverter` returns a Tuple for a nested struct, which is not really expected in a Row. We can address this in a follow-up.

@xupefei xupefei requested a review from hvanhovell April 30, 2024 13:56
Contributor

@hvanhovell hvanhovell left a comment


LGTM

@hvanhovell
Contributor

@xupefei there is a genuine test failure. Can you check what is going on?

@xupefei
Contributor Author

xupefei commented May 1, 2024

@xupefei there is a genuine test failure. Can you check what is going on?

It seems the test is flaky, even after the previous attempt to fix it: #45173

@xupefei
Contributor Author

xupefei commented May 1, 2024

@xupefei there is a genuine test failure. Can you check what is going on?

It seems the test is flaky, even after the previous attempt to fix it: #45173

I re-ran the test and it did pass. The remaining failure is doc generation: `sh: 1: python: not found`.

@hvanhovell
Contributor

Merging!

@hvanhovell hvanhovell closed this in 21548a8 May 8, 2024
@xupefei xupefei deleted the scala-observe branch May 10, 2024 06:55
JacobZheng0927 pushed a commit to JacobZheng0927/spark that referenced this pull request May 11, 2024

Closes apache#45701 from xupefei/scala-observe.

Authored-by: Paddy Xu <xupaddy@gmail.com>
Signed-off-by: Herman van Hovell <herman@databricks.com>