[SPARK-41527][CONNECT][PYTHON] Implement DataFrame.observe
#39091
Conversation
Column(transformExpression(expr))
}

if (rel.getIsObservation) {
What is the difference between the code paths?
The `Observation` registers an `ObservationListener` on the `ExecutionListenerManager`.
This explanation is not fully correct. The `Observation` class uses event listeners to be able to fetch the metrics as soon as they appear, without waiting for the DS command to finish. Since we don't have an event listener at this point in time, it's not the same thing. Please simplify the PR accordingly.
We don't need Observation here. We just need to send the observed metrics as part of the response stream.
We don't need Observation here. We just need to send the observed metrics as part of the response stream.
But we should maintain consistency of behavior between the Spark Connect API and the PySpark API. PySpark's `observe` supports passing an `Observation` as a parameter, and the doctest checks this consistency.
Maybe we could keep the Connect API supporting `Observation`; it would not be used on the server side, which would use `CollectMetrics` directly.
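For reference, here is the shape of the classic PySpark API being discussed, which accepts both an `Observation` object and a plain string; a minimal sketch with made-up metric names:

```python
from pyspark.sql import SparkSession, Observation
from pyspark.sql.functions import count, lit

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)

# Form 1: an Observation object; metrics can be read back via obs.get.
obs = Observation("my_metrics")
df.observe(obs, count(lit(1)).alias("cnt")).collect()
print(obs.get)  # {'cnt': 10}

# Form 2: a plain string name; metrics are delivered to query listeners instead.
df.observe("my_metrics2", count(lit(1)).alias("cnt")).collect()
```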
Please remove the `is_observation` code path.
@hvanhovell `is_observation` has been removed.
@beliefer thanks for working on this. I have one question: how are we going to get the observed metrics to the client? This seems to be missing from the implementation. One approach would be to send them in a similar way as the metrics in the result code path.
@@ -126,6 +126,20 @@ package object dsl {
      Expression.UnresolvedFunction.newBuilder().setFunctionName("min").addArguments(e))
    .build()

def proto_max(e: Expression): Expression =
Is there a need to add the `proto_` prefix? Just call it `max`?
Same for below
I just followed the existing `proto_min`.
@@ -45,7 +45,7 @@ import org.apache.spark.sql.util.QueryExecutionListener
 * @param name name of the metric
 * @since 3.3.0
 */
class Observation(name: String) {
class Observation(val name: String) {
Ah, without `val`, `name` is just a constructor parameter with no public accessor. Nice catch on this.
Good question. The result of datasets can be passed back by the gRPC server, but not the metrics of `DataFrame.observe`.
I think it would be possible to add another result batch type for observed metrics and simply pass them at the end.
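A minimal, self-contained sketch of that idea (all names here are illustrative, not the actual Connect protocol): the server streams result batches and appends one trailing message carrying the observed metrics, which the client splits out while consuming the stream.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple

@dataclass
class Response:
    batch: Optional[bytes] = None            # a chunk of result data
    observed_metrics: Optional[Dict] = None  # sent once, at the end of the stream

def consume(stream: Iterable[Response]) -> Tuple[List[bytes], Dict]:
    batches: List[bytes] = []
    metrics: Dict = {}
    for resp in stream:
        if resp.batch is not None:
            batches.append(resp.batch)
        if resp.observed_metrics is not None:
            metrics.update(resp.observed_metrics)
    return batches, metrics

# Two data batches followed by a trailing observed-metrics message.
stream = [Response(batch=b"rows-1"), Response(batch=b"rows-2"),
          Response(observed_metrics={"my_metrics": {"cnt": 2}})]
print(consume(stream))
```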
Notes
-----
When ``observation`` is :class:`Observation`, this method only supports batch queries.
When ``observation`` is a string, this method works for both batch and streaming queries.
`streaming queries` is out of scope now.
if isinstance(observation, Observation):
    return DataFrame.withPlan(
        plan.CollectMetrics(self._plan, str(observation._name), list(exprs), True),
is the implementation in pyspark equivalent to this?
spark/python/pyspark/sql/observation.py
Lines 86 to 110 in c014fa2
def _on(self, df: DataFrame, *exprs: Column) -> DataFrame:
    """Attaches this observation to the given :class:`DataFrame` to observe aggregations.

    Parameters
    ----------
    df : :class:`DataFrame`
        the :class:`DataFrame` to be observed
    exprs : list of :class:`Column`
        column expressions (:class:`Column`).

    Returns
    -------
    :class:`DataFrame`
        the observed :class:`DataFrame`.
    """
    assert self._jo is None, "an Observation can be used with a DataFrame only once"
    self._jvm = df._sc._jvm
    assert self._jvm is not None
    cls = self._jvm.org.apache.spark.sql.Observation
    self._jo = cls(self._name) if self._name is not None else cls()
    observed_df = self._jo.on(
        df._jdf, exprs[0]._jc, column._to_seq(df._sc, [c._jc for c in exprs[1:]])
    )
    return DataFrame(observed_df, df.sparkSession)
I have an idea:
Today we simply stuff the metrics into the pandas data frame. I don't have enough experience to say whether it's worth doing another full round trip to the server for that. Can we experiment for now with just immediately returning them? The observed metrics should be relatively small, so it should not be a big deal?
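A hedged sketch of the "stuff them into the pandas data frame" approach (the key name below is illustrative, not the actual client's): pandas' `attrs` dict allows attaching such out-of-band metadata.

```python
import pandas as pd

pdf = pd.DataFrame({"id": range(3)})
# Attach observed metrics as metadata rather than as rows or columns.
pdf.attrs["observed_metrics"] = {"my_metrics": {"cnt": 3}}
print(pdf.attrs["observed_metrics"])  # {'my_metrics': {'cnt': 3}}
```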
@beliefer can we just send them as part of the response stream?
This job was finished.
I think we're getting there. The PR, while big, is becoming really nice. I think there are still a couple of things to clarify, but we're closing in.
@@ -45,6 +45,7 @@
    UnresolvedRegex,
)
from pyspark.sql.connect.functions import col, lit
from pyspark.sql import Observation
I'm not sure this is going to work. The tricky part is that the Observation instance heavily depends on the `_jvm` object, so I'm worried about using the type here, because it will not even be possible to construct it when using Spark Connect.
For now, we have two ways out:
1. We just use a string and do the Observation class as a follow-up.
2. We add an Observation class now.
Personally, to get this PR moving, I would vote for 1.
First, users are accustomed to the `Observation` from pyspark.sql, so I think we had better use it too.
Second, this PR only uses the name of the `Observation`, not any of its behavior, so using it here is safe.
You can't use it for type annotations here if it's not legal to construct the type. In addition, you're using it to access the name, IIRC.
@@ -158,6 +158,9 @@ message ExecutePlanResponse {
  // batch of results and then represent the overall state of the query execution.
  Metrics metrics = 4;

  // The metrics observed during the execution of the query plan.
  ObservedMetrics observed_metrics = 5;
Suggested change:
ObservedMetrics observed_metrics = 5;
optional ObservedMetrics observed_metrics = 5;
@@ -181,6 +184,16 @@ message ExecutePlanResponse {
    string metric_type = 3;
  }
}

message ObservedMetrics {
Doc?
message ObservedMetricsObject {
  string name = 1;
  repeated string values = 2;
Is this equivalent to what we do for regular observations?
Not the same.
I'm not sure I understand, can you please expand your answer a bit?
`SQLMetrics` have name, value, and metricType, while `ObservedMetrics` have only name and value.
Why use values to represent the data? You can just use literals. We may need to add a schema as well.
A metric may have multiple different values. I will add the schema.
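To illustrate why a single named metric can carry several values, here is a sketch against the standard PySpark API (only the metric and column names are made up):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import count, lit, max as max_

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)

# One named metric ("stats") carrying two values: cnt and max_id.
observed = df.observe("stats", count(lit(1)).alias("cnt"), max_("id").alias("max_id"))
observed.collect()
```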
repeated Expression metrics = 3;

// (Optional) Whether an Observation is used.
bool is_observation = 4;
What happens if this is false?
Suggested change:
bool is_observation = 4;
optional bool is_observation = 4;
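The suggestion matters because proto3 scalar fields without `optional` do not track presence: an unset `bool` is indistinguishable from `false`. A hedged sketch, assuming generated Python bindings existed for this (later-removed) field under `relations_pb2`:

```python
# Hypothetical: assumes bindings generated from the proto at this point of the review;
# the is_observation field was removed before the PR merged.
from pyspark.sql.connect.proto import relations_pb2

rel = relations_pb2.CollectMetrics()
# With `optional bool is_observation`, presence is tracked explicitly:
print(rel.HasField("is_observation"))  # False: never set
rel.is_observation = False
print(rel.HasField("is_observation"))  # True: set, even though the value is false
```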
    .observe(observation, metrics.head, metrics.tail: _*)
    .logicalPlan
} else {
  CollectMetrics(rel.getName, metrics.map(_.named), transformRelation(rel.getInput))
Does this actually make sense? Looking at the way this class is used, it seems that only `.observe` actually creates an instance of `CollectMetrics`. I'm not sure we're actually exposing this operator as such.
@hvanhovell what's your perspective?
Many other APIs expose the operator directly.
Can you please give me an example of where we're exposing the catalyst operator directly in our API (in particular in the Dataset API)?
For example, `Unpivot`, `UnresolvedHint`, and so on.
As outlined above, only this branch is needed; their behavior is identical.
I am sorry, but where is the logical plan cached in all of this?
@hvanhovell I'm sorry, the reply just now was confusing. I was also thinking of `df.randomSplit`; `df.observe` has no caching behavior. We can only support the string form now.
@@ -179,6 +183,29 @@ class SparkConnectStreamHandler(responseObserver: StreamObserver[ExecutePlanResp
      .build()
  }

def sendObservedMetricsToResponse(
Suggested change:
def sendObservedMetricsToResponse(
private def sendObservedMetricsToResponse(

Doc?
@@ -619,6 +619,50 @@ class SparkConnectProtoSuite extends PlanTest with SparkConnectPlanTest {
    comparePlans(connectPlan1, sparkPlan1)
  }

  test("Test observe") {
Please add negative tests for throwing analysis exceptions when submitting non-aggregate functions.
OK
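A hedged PySpark-level illustration of such a negative test (the Scala suite would make the analogous check; metric and column names are made up):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.utils import AnalysisException

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)

try:
    # `col("id") + 1` is neither an aggregate function nor a literal,
    # so analysis of the observe call is expected to fail.
    df.observe("bad_metrics", col("id") + 1).collect()
except AnalysisException as e:
    print("Analysis failed as expected:", e)
```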
repeated Expression metrics = 3;

// (Optional) Whether an Observation is used.
optional bool is_observation = 4;
So I have checked the code for how `DF.observe` works. In Scala it has two different overloads, one for `Observation` and one for a string. Both end up calling the same underlying method on the DataFrame, and both end up using `CollectMetrics` and wrapping it around the logical plan.
There is no need to have this special type for using the `Observation`. The special handling for `Observation` should happen on the client side.
If we don't use the `Observation`, the Connect API will not be consistent with the PySpark API.
cc @zhengruifeng @HyukjinKwon @cloud-fan
Yeah you should be able to remove this.
OK
`is_observation` code path has been removed.
LGTM
Merging to master/3.4
@hvanhovell @grundprinzip @HyukjinKwon @zhengruifeng @amaliujia Thank you.
What changes were proposed in this pull request?
Implement `DataFrame.observe` with a proto message.
Implement `DataFrame.observe` for the Scala API.
Implement `DataFrame.observe` for the Python API.

Why are the changes needed?
For Connect API coverage.

Does this PR introduce any user-facing change?
'No'. New API.

How was this patch tested?
New test cases.