
[SPARK-41527][CONNECT][PYTHON] Implement DataFrame.observe #39091

Closed
wants to merge 27 commits

Conversation

beliefer
Contributor

What changes were proposed in this pull request?

Implement DataFrame.observe with a proto message

Implement DataFrame.observe for the Scala API
Implement DataFrame.observe for the Python API
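For reference, a minimal usage sketch of the API being mirrored (plain Dataset.observe in Scala; illustrative only, not code from this PR):

```scala
import org.apache.spark.sql.{functions => F, SparkSession}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
val df = spark.range(10).toDF("id")

// Attach named metrics; they are computed when an action materializes the query.
val observed = df.observe("my_metrics", F.count(F.lit(1)).as("rows"), F.max("id").as("max_id"))
observed.collect()
```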

Why are the changes needed?

For Connect API coverage.

Does this PR introduce any user-facing change?

'No'. New API

How was this patch tested?

New test cases.

Column(transformExpression(expr))
}

if (rel.getIsObservation) {
Contributor

What is the difference between the code paths?

Contributor Author

The Observation registers an ObservationListener on the ExecutionListenerManager.

Contributor

This explanation is not fully correct. The Observation class uses event listeners to fetch the metrics as soon as they appear, without waiting for the DS command to finish. Since we don't have an event listener at this point in time, it's not the same thing. Please simplify the PR accordingly.
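For context, a rough sketch of the listener mechanism being described (simplified, with assumed names; not the PR's code):

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

// Classic behaviour in a nutshell: register a QueryExecutionListener and read the
// named entry from QueryExecution.observedMetrics as soon as the query finishes.
class ObservationSketch(name: String) {
  @volatile private var metrics: Option[Row] = None

  def register(spark: SparkSession): Unit = {
    spark.listenerManager.register(new QueryExecutionListener {
      override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit = {
        qe.observedMetrics.get(name).foreach(row => metrics = Some(row))
      }
      override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = ()
    })
  }

  def get: Option[Row] = metrics
}
```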

Contributor

We don't need Observation here. We just need to send the observed metrics as part of the response stream.

Contributor Author

We don't need Observation here. We just need to send the observed metrics as part of the response stream.

But we should maintain consistency of behavior between the Spark Connect API and the PySpark API. PySpark's observe supports passing an Observation as a parameter, and the doctest checks this consistency.

Maybe we could keep the Connect API accepting an Observation while not using it on the server side, and instead use CollectMetrics directly.

Contributor

Please remove the is_observation code path.

Contributor Author

@hvanhovell is_observation has been removed.

@hvanhovell
Contributor

@beliefer thanks for working on this. I have one question how are we going to get the observed metrics to the client? This seems to be missing from the implementation. One of the approaches would be to send it in a similar way as the metrics in the result code path.

@@ -126,6 +126,20 @@ package object dsl {
Expression.UnresolvedFunction.newBuilder().setFunctionName("min").addArguments(e))
.build()

def proto_max(e: Expression): Expression =
Contributor

Is there a need to add the proto_ prefix? Just call it max?

Contributor

Same for below

Contributor Author

@beliefer beliefer Dec 17, 2022

I just followed the existing proto_min.

@@ -45,7 +45,7 @@ import org.apache.spark.sql.util.QueryExecutionListener
* @param name name of the metric
* @since 3.3.0
*/
class Observation(name: String) {
class Observation(val name: String) {
Contributor

ah without val, name is treated as a method. Nice catch on this.
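A tiny illustration of the difference (not from the PR):

```scala
class WithoutVal(name: String)   // `name` is only a constructor parameter: new WithoutVal("x").name does not compile
class WithVal(val name: String)  // `val` generates a public accessor: new WithVal("x").name == "x"
```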

@beliefer
Contributor Author

beliefer commented Dec 17, 2022

@beliefer thanks for working on this. I have one question how are we going to get the observed metrics to the client? This seems to be missing from the implementation. One of the approaches would be to send it in a similar way as the metrics in the result code path.

Good question. The result of a Dataset can be passed back by the gRPC server, but the ObservationListener runs on the server, so it seems we need another way to get the metrics.

@beliefer beliefer changed the title [SPARK-41527][CONNECT][PYTHON] Implement DataFrame.observe [SPARK-41527][CONNECT][PYTHON] Implement DataFrame.observe Dec 17, 2022
@grundprinzip
Contributor

@beliefer thanks for working on this. I have one question how are we going to get the observed metrics to the client? This seems to be missing from the implementation. One of the approaches would be to send it in a similar way as the metrics in the result code path.

Good question. The result of a Dataset can be passed back by the gRPC server, but the ObservationListener runs on the server, so it seems we need another way to get the metrics.

I think it would be possible to add another result batch type for observed metrics and simply pass them at the end.

python/pyspark/sql/connect/dataframe.py (outdated review threads, resolved)
Notes
-----
When ``observation`` is :class:`Observation`, this method only supports batch queries.
When ``observation`` is a string, this method works for both batch and streaming queries.
Contributor

Streaming queries are out of scope for now.


if isinstance(observation, Observation):
    return DataFrame.withPlan(
        plan.CollectMetrics(self._plan, str(observation._name), list(exprs), True),
Contributor

is the implementation in pyspark equivalent to this?

def _on(self, df: DataFrame, *exprs: Column) -> DataFrame:
    """Attaches this observation to the given :class:`DataFrame` to observe aggregations.

    Parameters
    ----------
    df : :class:`DataFrame`
        the :class:`DataFrame` to be observed
    exprs : list of :class:`Column`
        column expressions (:class:`Column`).

    Returns
    -------
    :class:`DataFrame`
        the observed :class:`DataFrame`.
    """
    assert self._jo is None, "an Observation can be used with a DataFrame only once"
    self._jvm = df._sc._jvm
    assert self._jvm is not None
    cls = self._jvm.org.apache.spark.sql.Observation
    self._jo = cls(self._name) if self._name is not None else cls()
    observed_df = self._jo.on(
        df._jdf, exprs[0]._jc, column._to_seq(df._sc, [c._jc for c in exprs[1:]])
    )
    return DataFrame(observed_df, df.sparkSession)

@beliefer
Contributor Author

I think it would be possible to add another result batch type for observed metrics and simply pass them at the end.

I have an idea:

  1. Cache the Observation on the server.
  2. Create a new relation GetObservation for getting the Observation from the cache with a timeout. If we can get the metrics successfully, wrap the metrics in a local relation and return it to the client.

@grundprinzip
Contributor

2. Create a new relation GetObservation for getting the Observation from the cache with a timeout. If we can get the metrics successfully, wrap the metrics in a local relation and return it to the client.

Today we simply stuff the metrics into the pandas DataFrame's pdf['attrs'] property. I'm wondering if we can just do the same here.

I don't have enough experience to say whether it's worth doing another full round trip to the server for that. Can we experiment for now with just immediately returning them? The observed metrics should be relatively small, so it should not be a big deal?

@hvanhovell
Contributor

@beliefer can we just send them as part of the ExecutePlanResponse at the end of the query? Doing another RPC seems a bit wasteful, and it means we have to track query state in the server side session.

@beliefer
Contributor Author

@beliefer can we just send them as part of the ExecutePlanResponse at the end of the query? Doing another RPC seems a bit wasteful, and it means we have to track query state in the server side session.

This has been done.

@beliefer
Contributor Author

I don't have enough experience to say whether it's worth doing another full round trip to the server for that. Can we experiment for now with just immediately returning them? The observed metrics should be relatively small, so it should not be a big deal?

This has been done.

@beliefer
Contributor Author

beliefer commented Dec 21, 2022

Contributor

@grundprinzip grundprinzip left a comment

I think we're getting there. The PR, while big, is becoming really nice. There are still a couple of things to clarify, but we're closing in.

@@ -45,6 +45,7 @@
UnresolvedRegex,
)
from pyspark.sql.connect.functions import col, lit
from pyspark.sql import Observation
Contributor

I'm not sure this is going to work. The tricky part is that the Observation instance heavily depends on the _jvm object, so I'm worried about using the type here because it will not be possible to even construct it when using Spark Connect.

For now, we have two ways out here:

  1. For now we just use a string and add the Observation class as a follow-up.
  2. We add an Observation class now.

Personally, to get this PR moving, I would vote for 1.

Contributor Author

@beliefer beliefer Dec 22, 2022

First, users are used to the Observation from pyspark.sql, so I think we'd better use it too.
Second, this PR only uses the name of the Observation, not any of its actions, so the use is safe here.

Contributor

You can't use it for type annotations here if it's not legal to construct the type. In addition you're using it to access the name IIRC.


@@ -158,6 +158,9 @@ message ExecutePlanResponse {
// batch of results and then represent the overall state of the query execution.
Metrics metrics = 4;

// The metrics observed during the execution of the query plan.
ObservedMetrics observed_metrics = 5;
Contributor

Suggested change
ObservedMetrics observed_metrics = 5;
optional ObservedMetrics observed_metrics = 5;

@@ -181,6 +184,16 @@ message ExecutePlanResponse {
string metric_type = 3;
}
}

message ObservedMetrics {
Contributor

Doc?


message ObservedMetricsObject {
string name = 1;
repeated string values = 2;
Contributor

Is this equivalent to what we do for regular observations?

Contributor Author

Not the same.

Contributor

I'm not sure I understand, can you please expand your answer a bit?

Contributor Author

SQLMetrics have name, value and metricType, while ObservedMetrics have name and values.

Contributor

Why use values to represent the data? You can just use literals. We may need to add a schema as well.

Contributor Author

A metric may have multiple different values.
I will add the schema.

repeated Expression metrics = 3;

// (Optional) Indicates whether an Observation is used.
bool is_observation = 4;
Contributor

What happens if this is false?

Suggested change
bool is_observation = 4;
optional bool is_observation = 4;

.observe(observation, metrics.head, metrics.tail: _*)
.logicalPlan
} else {
CollectMetrics(rel.getName, metrics.map(_.named), transformRelation(rel.getInput))
Contributor

Does this actually make sense? Looking at the way this class is used, it seems that only .observe actually creates an instance of CollectMetrics. I'm not sure we're actually exposing this operator as such.

@hvanhovell what's your perspective?

Contributor Author

Many other APIs expose operators directly.

Contributor

Can you please give me an example of where we're exposing the catalyst operator directly in our API (in particular in the Dataset API)?

Contributor Author

For example, Unpivot, UnresolvedHint and so on.

Contributor

As outlined above, only this branch is needed; their behavior is identical.

Contributor

I am sorry, but where is the logicalplan cached in all of this?

Contributor Author

@hvanhovell I'm sorry, the reply just now was confused; I am also working on df.randomSplit. df.observe has no caching behavior, so we can only support the string form for now.
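For reference, a minimal sketch of the single-branch planner transform the reviewers are converging on (assuming the surrounding SparkConnectPlanner helpers transformExpression and transformRelation; not necessarily the exact merged code):

```scala
import scala.collection.JavaConverters._
import org.apache.spark.connect.proto
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.plans.logical.{CollectMetrics, LogicalPlan}

// Sketch: convert every metric expression and wrap the child plan in CollectMetrics,
// exactly as both Dataset.observe overloads do on the classic code path.
private def transformCollectMetrics(rel: proto.CollectMetrics): LogicalPlan = {
  val metrics = rel.getMetricsList.asScala.toSeq.map { expr =>
    Column(transformExpression(expr))
  }
  CollectMetrics(rel.getName, metrics.map(_.named), transformRelation(rel.getInput))
}
```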

@@ -179,6 +183,29 @@ class SparkConnectStreamHandler(responseObserver: StreamObserver[ExecutePlanResp
.build()
}

def sendObservedMetricsToResponse(
Contributor

Suggested change
def sendObservedMetricsToResponse(
private def sendObservedMetricsToResponse(

doc?
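A hedged sketch of what such a helper might look like; the builder and field names follow the proto sketched in this thread, and toLiteralProto is a hypothetical stand-in for the value-to-proto conversion, so treat this as an assumption rather than the merged code:

```scala
import scala.collection.JavaConverters._
import org.apache.spark.connect.proto.ExecutePlanResponse
import org.apache.spark.sql.DataFrame

// Sketch: after execution, copy QueryExecution.observedMetrics (one Row per
// observation name) into a trailing ExecutePlanResponse message.
private def sendObservedMetricsToResponse(
    clientId: String,
    dataframe: DataFrame): ExecutePlanResponse = {
  val observed = dataframe.queryExecution.observedMetrics.map { case (name, row) =>
    ExecutePlanResponse.ObservedMetrics
      .newBuilder()
      .setName(name)
      // toLiteralProto is a placeholder for whatever conversion the PR settles on.
      .addAllValues(row.toSeq.map(toLiteralProto).asJava)
      .build()
  }
  ExecutePlanResponse
    .newBuilder()
    .setClientId(clientId)
    .addAllObservedMetrics(observed.toSeq.asJava)
    .build()
}
```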

@@ -619,6 +619,50 @@ class SparkConnectProtoSuite extends PlanTest with SparkConnectPlanTest {
comparePlans(connectPlan1, sparkPlan1)
}

test("Test observe") {
Contributor

Please add negative tests for throwing analysis exceptions when submitting non-aggregation functions.

Contributor Author

OK
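A hedged sketch of such a negative test; the DSL helpers (connectTestRelation, observe, protoAttr) and the analyzePlan helper are assumed to mirror the positive test above and are not verified here:

```scala
// Sketch: a bare attribute is not an aggregate expression, so analyzing the
// CollectMetrics node produced by observe should fail with an AnalysisException.
test("Test observe with non-aggregation expression") {
  intercept[AnalysisException] {
    analyzePlan(
      transform(connectTestRelation.observe("my_metric", "id".protoAttr)))
  }
}
```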

repeated Expression metrics = 3;

// (Optional) Indicates whether an Observation is used.
optional bool is_observation = 4;
Contributor

So I have checked the code for how DF.observe works. In Scala it has two different overloads, one for Observation and one for String. Both end up calling the same underlying method on the DataFrame, and both end up using CollectMetrics and wrapping it around the logical plan.

There is no need to have this special type for using the Observation. The special handling for Observation should be done on the client side.
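For reference, a condensed sketch of the two Scala overloads being described (simplified from Dataset; close to, but not verbatim, the actual source):

```scala
// String overload: wraps the logical plan in CollectMetrics directly.
def observe(name: String, expr: Column, exprs: Column*): Dataset[T] = withTypedPlan {
  CollectMetrics(name, (expr +: exprs).map(_.named), logicalPlan)
}

// Observation overload: delegates to the same mechanism, and additionally
// registers a listener so that Observation.get can return the metrics later.
def observe(observation: Observation, expr: Column, exprs: Column*): Dataset[T] = {
  observation.on(this, expr, exprs: _*)
}
```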

Contributor Author

@beliefer beliefer Dec 23, 2022

If we don't use the Observation, the Connect API will not be consistent with the PySpark API.
cc @zhengruifeng @HyukjinKwon @cloud-fan

Contributor

Yeah you should be able to remove this.

Contributor Author

OK

@beliefer beliefer force-pushed the SPARK-41527 branch 4 times, most recently from 814a940 to a951c2c Compare December 27, 2022 01:17
@beliefer
Contributor Author

beliefer commented Mar 6, 2023

@beliefer can you please remove the is_observation code path? And take another look at the protocol. Otherwise I think it looks good.

The is_observation code path has been removed.

Contributor

@hvanhovell hvanhovell left a comment

LGTM

@hvanhovell
Contributor

Merging to master/3.4

@hvanhovell hvanhovell closed this in 0ce63f3 Mar 6, 2023
hvanhovell pushed a commit that referenced this pull request Mar 6, 2023
### What changes were proposed in this pull request?
Implement `DataFrame.observe` with a proto message

Implement `DataFrame.observe` for scala API
Implement `DataFrame.observe` for python API

### Why are the changes needed?
for Connect API coverage

### Does this PR introduce _any_ user-facing change?
'No'. New API

### How was this patch tested?
New test cases.

Closes #39091 from beliefer/SPARK-41527.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Herman van Hovell <herman@databricks.com>
(cherry picked from commit 0ce63f3)
Signed-off-by: Herman van Hovell <herman@databricks.com>
@beliefer
Contributor Author

beliefer commented Mar 6, 2023

snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023