[SPARK-36319][SQL][Python] Make Observation return Map instead of Row #33545

EnricoMi · 2021-07-27T19:44:51Z

What changes were proposed in this pull request?

The Observation API (Scala, Java, PySpark) now returns a Map / Dict. Before, it returned Row simply because the metrics are (internal to Observation) retrieved from the listener as rows. Since that is hidden from the user by the Observation API, there is no need to return Row.
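
As a rough illustration of why the row can stay internal, the following plain-Python sketch models the idea. It is not Spark's implementation: `ObservationModel` and `on_finish` are hypothetical names, the listener is simulated by a direct method call from another thread, and the metric values are made up. The metrics arrive row-shaped (parallel field names and values), and `get` blocks until they exist, then returns only a dict.

```python
import threading
from typing import Any, Dict, Optional, Sequence


class ObservationModel:
    """Hypothetical stand-in for an observation; not Spark's class."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._ready = threading.Event()
        self._metrics: Optional[Dict[str, Any]] = None

    def on_finish(self, field_names: Sequence[str], values: Sequence[Any]) -> None:
        # Simulated listener callback: metrics arrive row-shaped,
        # as parallel field names and values.
        self._metrics = dict(zip(field_names, values))
        self._ready.set()

    def get(self) -> Dict[str, Any]:
        # Block until the simulated listener has delivered the metrics,
        # then return a copy so callers cannot mutate internal state.
        self._ready.wait()
        assert self._metrics is not None
        return dict(self._metrics)


observation = ObservationModel("my_metrics")
# Deliver made-up metrics from another thread, as a listener would.
worker = threading.Thread(
    target=observation.on_finish,
    args=(["min_val", "max_val", "sum_val"], [0, 99, 4950]),
)
worker.start()
metrics = observation.get()
worker.join()
print(metrics["max_val"])
```

The caller only ever sees a name-to-value mapping; the row shape exists solely between the simulated listener and the model.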

This PR also moves the unit tests from DataFrameSuite.scala to DatasetSuite.scala and from JavaDataFrameSuite.java to JavaDatasetSuite.java, which are better places for them.

Why are the changes needed?

This simplifies the API and access to the metrics, especially in Java. There is no need for the concept of a Row when retrieving the observation result.

Does this PR introduce any user-facing change?

Yes, it changes the return type of get from Row to Map (Scala) / Dict (Python) and introduces getAsJavaMap (Java).

How was this patch tested?

This is tested in DatasetSuite.SPARK-34806: observation on datasets, JavaDatasetSuite.testObservation and test_dataframe.test_observe.

EnricoMi · 2021-07-27T19:45:56Z

@HyukjinKwon @cloud-fan if there is no value in getAsRow, then this could be removed entirely.

HyukjinKwon · 2021-07-27T22:25:47Z

Yeah I think we don't necessarily have to use Row here. Cc @hvanhovell too fyi

HyukjinKwon · 2021-07-27T22:26:01Z

OK to test

sql/core/src/main/scala/org/apache/spark/sql/Observation.scala

SparkQA · 2021-07-27T22:52:33Z

Test build #141736 has finished for PR 33545 at commit 8d7aa97.

This patch fails Java style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-07-27T23:37:40Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46249/

SparkQA · 2021-07-28T00:17:44Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46249/

EnricoMi · 2021-07-28T07:30:59Z

@HyukjinKwon get now returns the Scala Map, and getAsRow is removed. I would still like to keep getAsJavaMap as it simplifies usage in Java (see JavaDatasetSuite.java).

There is one issue with returning a Map, though: the map will lose some metrics when the aggregation expressions have duplicate column names:

> df.observe(observation, lit(1).as("a"), lit(2).as("a")).count
> observation.get
Map(a -> 2)

Should I add some suffix to the duplicate column names?

Map(a_1 -> 1, a_2 -> 2)
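
The collapse and the proposed renaming can be mimicked in plain Python. This is only a sketch: `to_map` and `dedupe_names` are hypothetical helpers, not part of Spark or of this patch, and the numbering scheme simply follows the `a_1`, `a_2` example above.

```python
from collections import Counter
from typing import Any, Dict, List, Sequence


def to_map(names: Sequence[str], values: Sequence[Any]) -> Dict[str, Any]:
    # Later entries overwrite earlier ones that share a name.
    return dict(zip(names, values))


def dedupe_names(names: Sequence[str]) -> List[str]:
    # Append _1, _2, ... only to names that occur more than once.
    totals = Counter(names)
    seen: Counter = Counter()
    result = []
    for name in names:
        if totals[name] > 1:
            seen[name] += 1
            result.append(f"{name}_{seen[name]}")
        else:
            result.append(name)
    return result


names, values = ["a", "a"], [1, 2]
collapsed = to_map(names, values)
renamed = to_map(dedupe_names(names), values)
print(collapsed)  # {'a': 2}
print(renamed)    # {'a_1': 1, 'a_2': 2}
```

One caveat of such a scheme: a generated name like `a_1` could itself clash with a column the user already called `a_1`, so renaming only moves the ambiguity rather than removing it.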

SparkQA · 2021-07-28T08:16:35Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46270/

SparkQA · 2021-07-28T08:52:08Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46270/

SparkQA · 2021-07-28T11:34:16Z

Test build #141759 has finished for PR 33545 at commit 9f81b32.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-07-28T14:09:35Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46288/

SparkQA · 2021-07-28T14:43:01Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46288/

SparkQA · 2021-07-28T17:47:57Z

Test build #141776 has finished for PR 33545 at commit 32c6ad5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

python/pyspark/sql/observation.py

sql/core/src/main/scala/org/apache/spark/sql/Observation.scala

python/pyspark/sql/observation.py

sql/core/src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java

python/pyspark/sql/observation.py

sql/core/src/main/scala/org/apache/spark/sql/Observation.scala

HyukjinKwon

Looks good otherwise

SparkQA · 2021-07-29T14:27:24Z

Test build #141844 has finished for PR 33545 at commit 0b863e0.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-07-29T15:28:56Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46359/

SparkQA · 2021-07-29T16:21:24Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46359/

HyukjinKwon · 2021-07-30T00:06:58Z

sql/core/src/main/scala/org/apache/spark/sql/Observation.scala

   */
  @throws[InterruptedException]
-  def get: Row = {
+  def get: Map[String, Any] = {


Sorry, last comment: can we use something like Map[String, _]?

HyukjinKwon · 2021-07-30T00:09:18Z

sql/core/src/main/scala/org/apache/spark/sql/Observation.scala

+   * @throws InterruptedException interrupted while waiting
+   */
+  @throws[InterruptedException]
+  def getAsJava: java.util.Map[String, Object] = {


Here too. BTW, I remember AnyRef corresponds to Object. If Map[String, _] doesn't work, can we switch to AnyRef?

Java expects a Map<String, ?> with Map[String, _] and Map<String, Object> with Map[String, AnyRef], so I'd go for the latter.

SparkQA · 2021-07-30T10:23:14Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46401/

SparkQA · 2021-07-30T10:58:54Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46401/

SparkQA · 2021-07-30T14:14:19Z

Test build #141892 has finished for PR 33545 at commit 80e35d9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2021-08-01T01:40:02Z

Merged to master.

EnricoMi · 2021-08-01T08:59:16Z

@HyukjinKwon thanks a lot!

cloud-fan · 2021-08-02T04:35:16Z

sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala

+        namedObservation,
+        min($"id").as("min_val"),
+        max($"id").as("max_val"),
+        sum($"id").as("sum_val"),


one minor comment: what happens if there are duplicated names like min(...).as("a"), max(...).as("a")? Do we silently drop one value or do we fail at runtime?

It silently drops values: it behaves identically to Row.getValuesMap(Row.schema.fieldNames), which drops all but the last occurrence of a column name.
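
For comparison, the other option raised in the question, failing at runtime, could look like the sketch below. It is plain Python with a hypothetical helper `to_map_strict`; neither Spark nor this patch behaves this way, since the behavior described above keeps the last occurrence.

```python
from collections import Counter
from typing import Any, Dict, Sequence


def to_map_strict(names: Sequence[str], values: Sequence[Any]) -> Dict[str, Any]:
    # Refuse to build the map if any name would be overwritten.
    duplicates = sorted(n for n, count in Counter(names).items() if count > 1)
    if duplicates:
        raise ValueError(f"duplicate metric names: {duplicates}")
    return dict(zip(names, values))


try:
    to_map_strict(["a", "a"], [1, 2])
    failed = False
except ValueError:
    failed = True
print(failed)  # True
```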

EnricoMi added 3 commits July 27, 2021 21:34

Move testObservation from JavaDataFrameSuite into JavaDatasetSuite

736c328

Move Observation tests from DataFrameSuite to DatasetSuite

bc33be9

Test for non-empty Observation name

fdcca6f

github-actions bot added CORE PYTHON SQL labels Jul 27, 2021

Add Observation.getAsMap and variants

8d7aa97

EnricoMi force-pushed the branch-observation-returns-map branch from 28333b6 to 8d7aa97 Compare July 27, 2021 20:55

HyukjinKwon reviewed Jul 27, 2021

Have Observation.get return Scala Map, don't return Row

9f81b32

Fix example, import warning and long Java line

32c6ad5

Remove unused _to_row in Python

88e10c1

HyukjinKwon reviewed Jul 29, 2021

HyukjinKwon approved these changes Jul 29, 2021

HyukjinKwon changed the title ~~[SPARK-36319][SQL][PySpark] Have Observation return Map instead of Row~~ [SPARK-36319][SQL][Python] Make Observation return Map instead of Row Jul 29, 2021

EnricoMi force-pushed the branch-observation-returns-map branch from dbfcbc2 to 88eacd5 Compare July 29, 2021 13:15

Address PR comments

0b863e0

EnricoMi force-pushed the branch-observation-returns-map branch from 88eacd5 to 0b863e0 Compare July 29, 2021 13:18

HyukjinKwon reviewed Jul 30, 2021

Change map value types

80e35d9

HyukjinKwon approved these changes Aug 1, 2021

HyukjinKwon closed this in a65eb36 Aug 1, 2021

cloud-fan reviewed Aug 2, 2021

EnricoMi deleted the branch-observation-returns-map branch August 2, 2021 05:59
