Skip to content

Conversation

@EnricoMi
Copy link
Contributor

@EnricoMi EnricoMi commented Jul 27, 2021

What changes were proposed in this pull request?

The Observation API (Scala, Java, PySpark) now returns a Map / Dict. Before, it returned Row simply because the metrics are (internal to Observation) retrieved from the listener as rows. Since that is hidden from the user by the Observation API, there is no need to return Row.

While touching this code, this moves the unit tests from DataFrameSuite,scala to DatasetSuite.scala and from JavaDataFrameSuite.java to JavaDatasetSuite.java, which is a better place.

Why are the changes needed?

This simplifies the API and accessing the metrics, especially in Java. There is no need for the concept Row when retrieving the observation result.

Does this PR introduce any user-facing change?

Yes, it changes the return type of get from Row to Map (Scala) / Dict (Python) and introduces getAsJavaMap (Java).

How was this patch tested?

This is tested in DatasetSuite.SPARK-34806: observation on datasets, JavaDatasetSuite.testObservation and test_dataframe.test_observe.

@EnricoMi
Copy link
Contributor Author

@HyukjinKwon @cloud-fan if there is no value in getAsRow, then this could be removed entirely.

@EnricoMi EnricoMi force-pushed the branch-observation-returns-map branch from 28333b6 to 8d7aa97 Compare July 27, 2021 20:55
@HyukjinKwon
Copy link
Member

Yeah I think we don't necessarily have to use Row here. Cc @hvanhovell too fyi

@HyukjinKwon
Copy link
Member

OK to test

@SparkQA
Copy link

SparkQA commented Jul 27, 2021

Test build #141736 has finished for PR 33545 at commit 8d7aa97.

  • This patch fails Java style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 27, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46249/

@SparkQA
Copy link

SparkQA commented Jul 28, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46249/

@EnricoMi
Copy link
Contributor Author

@HyukjinKwon get now returns the Scala Map, getAsRow is removed. I would still like to keep getAsJavaMap as it simplifies usage in Java (see JavaDatasetSuite.java`.

There is one issue with returning a Map though: The map will lose some metrics when the aggregation expressions have duplicate column names:

> df.observe(observation, lit(1).as("a"), lit(2).as("a")).count
> observation.get
Map(a -> 2)

Should I add some prefix to the duplicate column names?

Map(a_1 -> 1, a_2 -> 2)

@SparkQA
Copy link

SparkQA commented Jul 28, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46270/

@SparkQA
Copy link

SparkQA commented Jul 28, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46270/

@SparkQA
Copy link

SparkQA commented Jul 28, 2021

Test build #141759 has finished for PR 33545 at commit 9f81b32.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 28, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46288/

@SparkQA
Copy link

SparkQA commented Jul 28, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46288/

@SparkQA
Copy link

SparkQA commented Jul 28, 2021

Test build #141776 has finished for PR 33545 at commit 32c6ad5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good otherwise

@HyukjinKwon HyukjinKwon changed the title [SPARK-36319][SQL][PySpark] Have Observation return Map instead of Row [SPARK-36319][SQL][Python] Make Observation return Map instead of Row Jul 29, 2021
@EnricoMi EnricoMi force-pushed the branch-observation-returns-map branch from dbfcbc2 to 88eacd5 Compare July 29, 2021 13:15
@EnricoMi EnricoMi force-pushed the branch-observation-returns-map branch from 88eacd5 to 0b863e0 Compare July 29, 2021 13:18
@SparkQA
Copy link

SparkQA commented Jul 29, 2021

Test build #141844 has finished for PR 33545 at commit 0b863e0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 29, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46359/

@SparkQA
Copy link

SparkQA commented Jul 29, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46359/

*/
@throws[InterruptedException]
def get: Row = {
def get: Map[String, Any] = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry last comment. can we use sth like Map[String, _]?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

* @throws InterruptedException interrupted while waiting
*/
@throws[InterruptedException]
def getAsJava: java.util.Map[String, Object] = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here too. BTW, I remember AnyRef corresponds to Object. If Map[String, _] doesn't work, can we switch to AnyRef?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Java expects a Map<String, ?> with Map[String, _] and Map<String, Object> with Map[String, AnyRef], so I'd go for the latter.

@SparkQA
Copy link

SparkQA commented Jul 30, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46401/

@SparkQA
Copy link

SparkQA commented Jul 30, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46401/

@SparkQA
Copy link

SparkQA commented Jul 30, 2021

Test build #141892 has finished for PR 33545 at commit 80e35d9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

Merged to master.

@EnricoMi
Copy link
Contributor Author

EnricoMi commented Aug 1, 2021

@HyukjinKwon thanks a lot!

namedObservation,
min($"id").as("min_val"),
max($"id").as("max_val"),
sum($"id").as("sum_val"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one minor comment: what happens if there are duplicated names like min(...).as("a"), max(...).as("a")? Do we silently drop one value or do we fail at runtime?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, It behaves identical to Row.getValuesMap(Row.schema.fieldNames), which drops all but the last occurrence of a column name.

@EnricoMi EnricoMi deleted the branch-observation-returns-map branch August 2, 2021 05:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants