Skip to content

Conversation

@hvanhovell
Copy link
Contributor

What changes were proposed in this pull request?

This PR moves Observation into sql/api. For classic I moved the wiring into the observe method itself, and the required listener is now part of the org.apache.spark.sql.internal package. I have also take the liberty to get rid of most of the homegrown locking, and I have replaced it with a promise.

Why are the changes needed?

We are creating a shared interface for the classic and connect Scala DataFrame API. This class is part of that API.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests.

Was this patch authored or co-authored using generative AI tooling?

No.

}
}
private[spark] def setMetricsAndNotify(metrics: Row): Boolean = {
val metricsMap = metrics.getValuesMap(metrics.schema.map(_.name))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am still wondering why this needed to be a map...

@hvanhovell
Copy link
Contributor Author

@xupefei can you take a look?

@github-actions github-actions bot added the BUILD label Aug 28, 2024
@hvanhovell
Copy link
Contributor Author

Merging to master

@asfgit asfgit closed this in 528ba0e Aug 30, 2024
gengliangwang pushed a commit that referenced this pull request Oct 20, 2025
### What changes were proposed in this pull request?

This patch proposes a fix to a deadlock bug in `Observation`. It replaces `synchronized` locks with a promise to avoid deadlock happened between `get` and `onFinish` methods

### Why are the changes needed?

`Observation` class has been evolved a few times during Spark 3.5 to Spark 4.0.0. Previously it uses locking mechanism (`synchronized`) between `get` and `onFinish` methods to coordinate metrics update and retrieval.

But it has a potential deadlocking bug. If `get` is called before `ObservationListener` is triggered to call `onFinish`, `get` will forever be waiting for metrics because it locks the observation object by `synchronized` so later `onFinish` call is locked out from updating the metrics.

This locking mechanism was replaced by a promise by #47921 that is a large refacroring on observation feature. But in the PR, I don’t see the deadlock bug was mentioned, and there is no bug fix PR proposed to earlier versions. So I think that the bug was not known and the fix is unintentional in Spark 4.0.0. The bug is still in Spark 3.5 branch.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

The deadlock bug was hit by customer and is tricky to reproducing by unit test. This patch should pass existing tests.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #52657 from viirya/fix_observation_deadlock.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants