[SPARK-47290][SQL] Extend CustomTaskMetric to allow metric values from multiple sources #45505
parthchandra wants to merge 2 commits into apache:master
Conversation
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import org.apache.spark.annotation.Evolving;
We need a new line before this.
 * A custom file task metric. This allows file based data source V2 implementations
 * to use a single PartitionReader with multiple file readers. Each file reader can
 * provide its own metrics values and they can be added into the parent PartitionReader
 * @since 4.0.0
nit. Please add a new empty line before this.
/*
 Merge(add) the values of the corresponding CustomTaskMetric from src array into target array
 adding a new element if it doesn't already exist. Returns a new array without modifying the
 target array
*/
static List<CustomFileTaskMetric> mergeMetricValues(List<CustomTaskMetric> src,
Shall we move this static method outside of this interface? I guess you couldn't find a proper existing utility class for this, right?
That's right. This is a utility method and there really isn't any other good place to put this. Anyway, I've moved it to org.apache.spark.sql.execution.metric.CustomMetrics. One difference is that the function was in a Java class earlier and is now in a Scala class. In general, that makes it slightly harder for a DSV2 implementation to use it because calling Scala from Java is still not perfect.
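For reference, here is a minimal Scala sketch of the merge-by-name semantics being discussed, written only against the public CustomTaskMetric interface. The names MergeSketch and NamedValue are illustrative; this is not the actual code added to CustomMetrics.

```scala
import org.apache.spark.sql.connector.metric.CustomTaskMetric

object MergeSketch {
  // Simple holder used to report a (name, value) pair back through the metric API.
  private case class NamedValue(metricName: String, metricValue: Long) extends CustomTaskMetric {
    override def name(): String = metricName
    override def value(): Long = metricValue
  }

  // Add each source metric's value into the target metric with the same name,
  // appending a new entry when the name is not present in the target.
  // Returns a new sequence; neither input is modified.
  def mergeMetricValues(
      src: Seq[CustomTaskMetric],
      target: Seq[CustomTaskMetric]): Seq[CustomTaskMetric] = {
    val srcByName: Map[String, Long] =
      src.groupBy(_.name()).map { case (n, ms) => n -> ms.map(_.value()).sum }
    val merged = target.map(t => NamedValue(t.name(), t.value() + srcByName.getOrElse(t.name(), 0L)))
    val added = srcByName.collect {
      case (n, v) if !target.exists(_.name() == n) => NamedValue(n, v)
    }
    merged ++ added
  }
}
```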
cc @viirya
/**
 * Merge(add) the values of the corresponding CustomTaskMetric from src array into target array
 * adding a new element if it doesn't already exist.
 */

Suggested change:

/**
 * Update the values of the corresponding CustomTaskMetric with the same metric name from the src array into the target array.
 * If a source metric doesn't already exist in the target, add it to the target metric array.
 */
 * @since 4.0.0
 */
@Evolving
public interface CustomFileTaskMetric extends CustomTaskMetric {
Hmm, I am wondering why we need this interface? I think we can definitely do the same thing with CustomTaskMetric?
I wasn't sure if it is OK to change an existing public interface. If it is acceptable to simply add this to CustomTaskMetric that would be perfect.
I mean it seems feasible to use CustomTaskMetric to achieve the same goal without adding a new method.
The new API here is the update method. You can definitely update the current value of a CustomTaskMetric without a problem. CustomTaskMetric is simply an interface used by the instances collecting metrics to report task-level metrics.
DSV1 uses SQLMetric, which has a set(v: Long) and an add(v: Long), so a data source can set an initial value for a metric and then keep updating it. DSV2 uses CustomTaskMetric, which really should have an equivalent but doesn't. That's what this interface is trying to provide.
In this case, FilePartitionReader needs to update the value of a CustomTaskMetric every time a file reader completes. For example, FilePartitionReader can have a metric, say ReadTime, which it updates every time a file reader completes. But only the Parquet file reader provides this metric, so FilePartitionReader can query the file reader and get the metrics that are supported, yet it has no way to update them. FilePartitionReader can of course create its own set of metrics corresponding to the supported ones, keep its own set of values, and update them as needed, but having an interface to update the value on the metric itself seems so much more elegant.
Here's the usage of this interface - https://github.com/parthchandra/spark/blob/SPARK-47291/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FilePartitionReader.scala
And the corresponding metrics -
https://github.com/parthchandra/spark/blob/SPARK-47291/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetMetricsV2.scala
If you still think this is not reasonable, I will try to re-work it without this interface.
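To make the shape of the idea concrete, here is a rough Scala sketch of an update-capable metric along these lines. It is only an illustration: the names UpdatableTaskMetric, update, and readTime are placeholders, not the actual CustomFileTaskMetric API proposed in this PR.

```scala
import org.apache.spark.sql.connector.metric.CustomTaskMetric

// A metric whose value can be updated in place, so the parent partition reader
// can fold in values from each completed file reader.
trait UpdatableTaskMetric extends CustomTaskMetric {
  def update(delta: Long): Unit
}

// Hypothetical "read time" metric that accumulates the deltas it is given.
class ReadTimeMetric extends UpdatableTaskMetric {
  private var total = 0L
  override def name(): String = "readTime"
  override def value(): Long = total
  override def update(delta: Long): Unit = total += delta
}
```

With something like this, the parent reader could look up its metric by name when a file reader finishes and update it with that reader's reported value, instead of keeping its own parallel counters.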
Actually, let me rework it without this interface
Closing this PR. As @viirya pointed out, it is possible to achieve the update to CustomTaskMetric without a new interface.
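For completeness, a minimal sketch, assuming only the existing CustomTaskMetric API, of the alternative described above: the parent reader keeps its own running counter, folds in each file reader's reported value, and exposes the total through a plain CustomTaskMetric returned from its currentMetricsValues(). The metric name readTime and the class and method names here are illustrative.

```scala
import org.apache.spark.sql.connector.metric.CustomTaskMetric

// The parent reader owns the running total; no new metric interface is needed.
class ReadTimeAccumulator {
  private var totalReadTime = 0L

  // Called by the parent reader whenever one of its file readers completes.
  def addFrom(fileReaderMetrics: Array[CustomTaskMetric]): Unit = {
    fileReaderMetrics.filter(_.name() == "readTime").foreach(m => totalReadTime += m.value())
  }

  // Reported through the parent reader's currentMetricsValues().
  def asTaskMetric(): CustomTaskMetric = new CustomTaskMetric {
    override def name(): String = "readTime"
    override def value(): Long = totalReadTime
  }
}
```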
Thank you @parthchandra
What changes were proposed in this pull request?
Provides a new interface CustomFileTaskMetric that extends CustomTaskMetric and allows updating of values.

Why are the changes needed?
The current interface to provide custom metrics does not work for adding file-based metrics for the Parquet reader, where a single FilePartitionReader may need to collect metrics from multiple Parquet file readers.

Does this PR introduce any user-facing change?
No
How was this patch tested?
This is just adding the interface. The implementation and tests will be done in a follow-up PR that addresses https://issues.apache.org/jira/browse/SPARK-47291
Was this patch authored or co-authored using generative AI tooling?
No