[SPARK-47545][CONNECT] Dataset observe support for the Scala client #45701
Conversation
Let's make sure this is matched with the Python version.
Thanks Hyukjin! Yes, the behavior largely matches the Python version (internals and user-facing APIs).
Yes, similar to |
```scala
    metrics: Option[Map[String, Any]]): Unit = {
  observationRegistry.get(planId).map { observation =>
    if (observation.setMetricsAndNotify(metrics)) {
      observationRegistry.remove(planId)
```
Should this be tied to whether or not the observation has been successfully updated? Another question: under what circumstances can the metrics be empty?
I had the same question when I looked at the code. In Spark Core we only de-register the Observation when some non-empty metrics are set, so I decided to keep it the same in Connect. I am not sure under which circumstances the metrics can be empty.
I looked at the code. It seems it's valid to check for non-empty metrics in Spark Core:

```scala
private[spark] def onFinish(qe: QueryExecution): Unit = {
  ...
  val row: Option[Row] = qe.observedMetrics.get(name)
  val metrics: Option[Map[String, Any]] =
    row.map(r => r.getValuesMap[Any](r.schema.fieldNames.toImmutableArraySeq))
  if (setMetricsAndNotify(metrics)) {
    unregister()
  }
}
```
The option covers the case where the query finishes without producing the metric. Is this possible?
Nevertheless, we don't need to handle this case in Connect because we look up observations using the plan ID.
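The registry pattern under discussion can be sketched in plain Scala. This is a simplified stand-in for the real Spark Connect classes (`SimpleObservation`, `RegistrySketch`, and `onMetricsArrived` are illustrative names, not the actual API): the observation is removed from the registry only when non-empty metrics were actually delivered.

```scala
import java.util.concurrent.ConcurrentHashMap

// Simplified stand-in for the real Observation class: notification
// succeeds only when some non-empty metrics are delivered.
class SimpleObservation {
  @volatile private var metrics: Option[Map[String, Any]] = None

  def setMetricsAndNotify(m: Option[Map[String, Any]]): Boolean = synchronized {
    if (m.exists(_.nonEmpty)) {
      metrics = m
      notifyAll()
      true
    } else {
      false
    }
  }

  def get: Map[String, Any] = metrics.getOrElse(Map.empty)
}

object RegistrySketch {
  private val observationRegistry = new ConcurrentHashMap[Long, SimpleObservation]()

  def register(planId: Long, obs: SimpleObservation): Unit =
    observationRegistry.put(planId, obs)

  def isRegistered(planId: Long): Boolean =
    observationRegistry.containsKey(planId)

  // De-register only when the observation was successfully updated,
  // mirroring the behavior discussed above.
  def onMetricsArrived(planId: Long, metrics: Option[Map[String, Any]]): Unit = {
    Option(observationRegistry.get(planId)).foreach { observation =>
      if (observation.setMetricsAndNotify(metrics)) {
        observationRegistry.remove(planId)
      }
    }
  }
}
```

With this shape, an empty-metrics notification leaves the entry registered, so a later non-empty notification for the same plan ID can still complete it.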
```scala
val observedDf = df.observe(observation, min("id"), avg("id"), max("id"))

// Start a new thread to get the observation
val future = Future(observation.get)(ExecutionContext.global)
```
For the record: IMO the Observation class should have been using a Future from the get-go.
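A Future-based design like the one suggested here could look like the following plain-Scala sketch (`FutureObservation` and its members are illustrative names, not the actual Spark API): the result is a `Promise` completed at most once, so callers can pick between async and blocking consumption without spawning a thread just to wait.

```scala
import scala.concurrent.{Await, Future, Promise}
import scala.concurrent.duration._

// Illustrative only: an observation whose result is a Future from the
// start, instead of a wait/notify-based blocking accessor.
class FutureObservation {
  private val promise = Promise[Map[String, Any]]()

  // Completes the future at most once; returns whether this call did it.
  def setMetrics(metrics: Map[String, Any]): Boolean =
    promise.trySuccess(metrics)

  // Async handle: no extra thread needed to wait on the result.
  def future: Future[Map[String, Any]] = promise.future

  // Blocking accessor, analogous to Observation.get.
  def get(timeout: Duration = Duration.Inf): Map[String, Any] =
    Await.result(promise.future, timeout)
}
```

Callers who want the blocking style keep `get`, while test code can compose on `future` directly instead of wrapping `get` in `Future(...)`.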
```scala
(0 until metric.getKeysCount).map { i =>
  val key = metric.getKeys(i)
  val value = LiteralValueProtoConverter.toCatalystValue(metric.getValues(i))
  schema = schema.add(key, LiteralValueProtoConverter.toDataType(value.getClass))
```
There is a bit of a twist here: LiteralValueProtoConverter returns a tuple for a nested struct, which is not really expected in a Row. We can address this in a follow-up.
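The loop above pairs parallel key and value arrays from the proto message by index. A minimal plain-Scala sketch of that decode pattern, with `MetricMessage` as a hypothetical stand-in for the proto type (the real code additionally infers a Catalyst `DataType` per value to grow the Row schema):

```scala
// Hypothetical stand-in for the proto metrics message: parallel
// sequences of keys and already-decoded values.
final case class MetricMessage(keys: Seq[String], values: Seq[Any])

object MetricDecodeSketch {
  // Pair each key with its value by index, like the
  // (0 until metric.getKeysCount) loop above.
  def decode(metric: MetricMessage): Map[String, Any] = {
    require(metric.keys.length == metric.values.length,
      "keys and values must be parallel arrays")
    metric.keys.zip(metric.values).toMap
  }
}
```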
LGTM
@xupefei there is a genuine test failure. Can you check what is going on?
Merging!
### What changes were proposed in this pull request?
This PR adds support for `Dataset.observe` to the Spark Connect Scala client. Note that the support here does not include listener support, as that runs on the server side. This PR includes a small refactoring of the `Observation` helper class: methods that are not bound to the SparkSession were extracted to `spark-api`, and two subclasses were added in `spark-core` and `spark-jvm-client`.

### Why are the changes needed?
Before this PR, the `DF.observe` method was only supported in the Python client.

### Does this PR introduce _any_ user-facing change?
Yes. The user can now issue `DF.observe(name, metrics...)` or `DF.observe(observationObject, metrics...)` to get stats of columns of a DataFrame.

### How was this patch tested?
Added new e2e tests.

### Was this patch authored or co-authored using generative AI tooling?
Nope.

Closes apache#45701 from xupefei/scala-observe.

Authored-by: Paddy Xu <xupaddy@gmail.com>
Signed-off-by: Herman van Hovell <herman@databricks.com>