[SPARK-36319][SQL][Python] Make Observation return Map instead of Row #33545
@HyukjinKwon @cloud-fan if there is no value in
28333b6 to
8d7aa97
Compare
Yeah I think we don't necessarily have to use Row here. Cc @hvanhovell too fyi
OK to test
Test build #141736 has finished for PR 33545 at commit
Kubernetes integration test starting
Kubernetes integration test status success
@HyukjinKwon There is one issue with returning a Should I add some prefix to the duplicate column names?
Kubernetes integration test starting
Kubernetes integration test status success
Test build #141759 has finished for PR 33545 at commit
Kubernetes integration test starting
Kubernetes integration test status success
Test build #141776 has finished for PR 33545 at commit
sql/core/src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java
Looks good otherwise
dbfcbc2 to
88eacd5
Compare
88eacd5 to
0b863e0
Compare
Test build #141844 has finished for PR 33545 at commit
Kubernetes integration test starting
Kubernetes integration test status failure
   */
  @throws[InterruptedException]
- def get: Row = {
+ def get: Map[String, Any] = {
Sorry, last comment: can we use something like Map[String, _]?
done
   * @throws InterruptedException interrupted while waiting
   */
  @throws[InterruptedException]
  def getAsJava: java.util.Map[String, Object] = {
Here too. BTW, I remember AnyRef corresponds to Object. If Map[String, _] doesn't work, can we switch to AnyRef?
Java expects a Map<String, ?> with Map[String, _] and Map<String, Object> with Map[String, AnyRef], so I'd go for the latter.
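To illustrate the difference described in the comment above, here is a minimal Java sketch. It does not use the Observation API: `asObjectMap` and `asWildcardMap` are hypothetical stand-ins for how Java would see a method returning `Map[String, AnyRef]` and `Map[String, _]` respectively, and the metric names and values are made up for the example.

```java
import java.util.HashMap;
import java.util.Map;

public class WildcardVsObjectMap {
    // Hypothetical stand-in for a Scala method declared with Map[String, AnyRef],
    // which Java sees as Map<String, Object>.
    public static Map<String, Object> asObjectMap() {
        Map<String, Object> metrics = new HashMap<>();
        metrics.put("min_val", 0L);
        metrics.put("max_val", 99L);
        return metrics;
    }

    // Hypothetical stand-in for a Scala method declared with Map[String, _],
    // which Java sees as Map<String, ?>.
    public static Map<String, ?> asWildcardMap() {
        return asObjectMap();
    }

    public static long maxFromObjectMap() {
        // The result can be assigned to the type most Java callers expect.
        Map<String, Object> metrics = asObjectMap();
        return (Long) metrics.get("max_val");
    }

    public static long maxFromWildcardMap() {
        // Callers have to carry the wildcard type around:
        // Map<String, Object> bad = asWildcardMap(); // does not compile
        Map<String, ?> metrics = asWildcardMap();
        return (Long) metrics.get("max_val");
    }
}
```

Reading values works in both cases, but only the `Map<String, Object>` variant can be passed on to code that expects that exact type, which seems to be the motivation for preferring `AnyRef` here.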
Kubernetes integration test starting
Kubernetes integration test status success
Test build #141892 has finished for PR 33545 at commit
Merged to master.
@HyukjinKwon thanks a lot!
  namedObservation,
  min($"id").as("min_val"),
  max($"id").as("max_val"),
  sum($"id").as("sum_val"),
one minor comment: what happens if there are duplicated names like min(...).as("a"), max(...).as("a")? Do we silently drop one value or do we fail at runtime?
Yes, it behaves identically to Row.getValuesMap(Row.schema.fieldNames), which drops all but the last occurrence of a column name.
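The "last occurrence wins" behavior described above can be sketched in plain Java. This is not Spark's implementation: `toMap` is a hypothetical helper that pairs field names with values by position, on the assumption that the real conversion ends up inserting entries into a map in field order.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DuplicateMetricNames {
    // Hypothetical helper (not Spark code): pairs names with values by position.
    // A later duplicate name overwrites the value stored for the earlier one.
    public static Map<String, Object> toMap(String[] names, Object[] values) {
        if (names.length != values.length) {
            throw new IllegalArgumentException("names and values must have the same length");
        }
        Map<String, Object> result = new LinkedHashMap<>();
        for (int i = 0; i < names.length; i++) {
            result.put(names[i], values[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        // Two metrics share the name "a", as in min(...).as("a"), max(...).as("a").
        Map<String, Object> metrics = toMap(
            new String[] {"a", "a", "sum_val"},
            new Object[] {0L, 99L, 4950L});
        System.out.println(metrics); // prints {a=99, sum_val=4950}
    }
}
```

No error is raised for the duplicate name; the earlier value is simply gone, so callers who need both metrics have to give them distinct aliases.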
What changes were proposed in this pull request?
The Observation API (Scala, Java, PySpark) now returns a Map/Dict. Before, it returned Row simply because the metrics are (internally to Observation) retrieved from the listener as rows. Since that is hidden from the user by the Observation API, there is no need to return Row.

While touching this code, this moves the unit tests from DataFrameSuite.scala to DatasetSuite.scala and from JavaDataFrameSuite.java to JavaDatasetSuite.java, which is a better place.

Why are the changes needed?
This simplifies the API and accessing the metrics, especially in Java. There is no need for the concept of Row when retrieving the observation result.

Does this PR introduce any user-facing change?
Yes, it changes the return type of get from Row to Map (Scala) / Dict (Python) and introduces getAsJavaMap (Java).

How was this patch tested?
This is tested in DatasetSuite.SPARK-34806: observation on datasets, JavaDatasetSuite.testObservation and test_dataframe.test_observe.