[SPARK-47290][SQL] Extend CustomTaskMetric to allow metric values from multiple sources #45505
parthchandra wants to merge 2 commits into apache:master
Conversation
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import org.apache.spark.annotation.Evolving;
We need a new line before this.
 * A custom file task metric. This allows file based data source V2 implementations
 * to use a single PartitionReader with multiple file readers. Each file reader can
 * provide its own metrics values and they can be added into the parent PartitionReader
 * @since 4.0.0
nit. Please add a new empty line before this.
/*
 Merge(add) the values of the corresponding CustomTaskMetric from src array into target array
 adding a new element if it doesn't already exist. Returns a new array without modifying the
 target array
*/
static List<CustomFileTaskMetric> mergeMetricValues(List<CustomTaskMetric> src,
Shall we move this static method outside of this interface? I guess you couldn't find a proper existing utility class for this, right?
That's right. This is a utility method and there really isn't any other good place to put this. Anyway, I've moved it to org.apache.spark.sql.execution.metric.CustomMetrics. One difference is that the function was in a Java class earlier and is now in a Scala class. In general, that makes it slightly harder for a DSV2 implementation to use it because calling Scala from Java is still not perfect.
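For reference, here is a minimal Scala sketch of the merge-by-name semantics being discussed, written only against the public CustomTaskMetric interface. The names MergeSketch and NamedValue are illustrative; this is not the actual code added to CustomMetrics.

```scala
import org.apache.spark.sql.connector.metric.CustomTaskMetric

object MergeSketch {
  // Simple holder used to report a (name, value) pair back through the metric API.
  private case class NamedValue(metricName: String, metricValue: Long) extends CustomTaskMetric {
    override def name(): String = metricName
    override def value(): Long = metricValue
  }

  // Add each source metric's value into the target metric with the same name,
  // appending a new entry when the name is not present in the target.
  // Returns a new sequence; neither input is modified.
  def mergeMetricValues(
      src: Seq[CustomTaskMetric],
      target: Seq[CustomTaskMetric]): Seq[CustomTaskMetric] = {
    val srcByName: Map[String, Long] =
      src.groupBy(_.name()).map { case (n, ms) => n -> ms.map(_.value()).sum }
    val merged = target.map(t => NamedValue(t.name(), t.value() + srcByName.getOrElse(t.name(), 0L)))
    val added = srcByName.collect {
      case (n, v) if !target.exists(_.name() == n) => NamedValue(n, v)
    }
    merged ++ added
  }
}
```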
cc @viirya
/**
 * Merge(add) the values of the corresponding CustomTaskMetric from src array into target array
 * adding a new element if it doesn't already exist.
 */

Suggested change:

/**
 * Update the values of the corresponding CustomTaskMetric with the same metric name from the src array into the target array.
 * If a source metric doesn't already exist in the target, add it to the target metric array.
 */
 * @since 4.0.0
 */
@Evolving
public interface CustomFileTaskMetric extends CustomTaskMetric {
Hmm, I am wondering why we need this interface? I think we can definitely do the same thing with CustomTaskMetric?
I wasn't sure if it is OK to change an existing public interface. If it is acceptable to simply add this to CustomTaskMetric that would be perfect.
I mean it seems feasible to use CustomTaskMetric to achieve the same goal without adding a new method.
The new API here is the update method. You can definitely update the current value of a CustomTaskMetric without a problem. CustomTaskMetric is simply an interface used by the instances collecting metrics to report task-level metrics.
DSV1 uses SQLMetric, which has a set(v: Long) and an add(v: Long), so a data source can set an initial value for a metric and then keep updating it. DSV2 uses CustomTaskMetric, which really should have an equivalent but doesn't. That's what this interface is trying to provide.
In this case, FilePartitionReader needs to update the value of a CustomTaskMetric every time a file reader completes. For example, FilePartitionReader can have a metric, say ReadTime, which it updates every time a file reader completes. But only the Parquet file reader provides this metric, so FilePartitionReader can query the file reader and get the metrics that are supported, yet it has no way to update them. FilePartitionReader can of course create its own set of metrics corresponding to the supported ones, keep its own set of values, and update them as needed, but having an interface to update the value on the metric itself seems so much more elegant.
Here's the usage of this interface - https://github.com/parthchandra/spark/blob/SPARK-47291/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FilePartitionReader.scala
And the corresponding metrics -
https://github.com/parthchandra/spark/blob/SPARK-47291/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetMetricsV2.scala
If you still think this is not reasonable, I will try to re-work it without this interface.
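To make the shape of the idea concrete, here is a rough Scala sketch of an update-capable metric along these lines. It is only an illustration: the names UpdatableTaskMetric, update, and readTime are placeholders, not the actual CustomFileTaskMetric API proposed in this PR.

```scala
import org.apache.spark.sql.connector.metric.CustomTaskMetric

// A metric whose value can be updated in place, so the parent partition reader
// can fold in values from each completed file reader.
trait UpdatableTaskMetric extends CustomTaskMetric {
  def update(delta: Long): Unit
}

// Hypothetical "read time" metric that accumulates the deltas it is given.
class ReadTimeMetric extends UpdatableTaskMetric {
  private var total = 0L
  override def name(): String = "readTime"
  override def value(): Long = total
  override def update(delta: Long): Unit = total += delta
}
```

With something like this, the parent reader could look up its metric by name when a file reader finishes and update it with that reader's reported value, instead of keeping its own parallel counters.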
Actually, let me rework it without this interface
Closing this PR. As @viirya pointed out, it is possible to achieve the update to CustomTaskMetric without a new interface.
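For completeness, a minimal sketch, assuming only the existing CustomTaskMetric API, of the alternative described above: the parent reader keeps its own running counter, folds in each file reader's reported value, and exposes the total through a plain CustomTaskMetric returned from its currentMetricsValues(). The metric name readTime and the class and method names here are illustrative.

```scala
import org.apache.spark.sql.connector.metric.CustomTaskMetric

// The parent reader owns the running total; no new metric interface is needed.
class ReadTimeAccumulator {
  private var totalReadTime = 0L

  // Called by the parent reader whenever one of its file readers completes.
  def addFrom(fileReaderMetrics: Array[CustomTaskMetric]): Unit = {
    fileReaderMetrics.filter(_.name() == "readTime").foreach(m => totalReadTime += m.value())
  }

  // Reported through the parent reader's currentMetricsValues().
  def asTaskMetric(): CustomTaskMetric = new CustomTaskMetric {
    override def name(): String = "readTime"
    override def value(): Long = totalReadTime
  }
}
```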
Thank you @parthchandra
What changes were proposed in this pull request?
Provides a new interface CustomFileTaskMetric that extends CustomTaskMetric and allows updating of values.

Why are the changes needed?
The current interface to provide custom metrics does not work for adding file-based metrics for the Parquet reader, where a single FilePartitionReader may need to collect metrics from multiple Parquet file readers.

Does this PR introduce any user-facing change?
No
How was this patch tested?
This is just adding the interface. The implementation and tests will be done in a follow-up PR that addresses https://issues.apache.org/jira/browse/SPARK-47291
Was this patch authored or co-authored using generative AI tooling?
No