Ability to add multiple metrics reporters to scan #6919

karuppayya · 2023-02-23T14:26:08Z

Adds the ability to add multiple reporters to the Scan.

karuppayya · 2023-02-23T14:26:52Z

@aokolnychyi @RussellSpitzer @flyrain @anuragmantri @szehon-ho

nastra

We shouldn't be exposing ScanMetrics but rather introduce a custom MetricsReporter that receives a ScanReport once a scan is complete. This custom reporter can then add the results from the scan to the Spark UI.

api/src/main/java/org/apache/iceberg/BatchScanAdapter.java

core/src/main/java/org/apache/iceberg/SnapshotScan.java

api/src/main/java/org/apache/iceberg/Table.java

aokolnychyi · 2023-03-22T04:29:23Z

api/src/main/java/org/apache/iceberg/Scan.java

@@ -171,4 +172,7 @@ default ThisT select(String... columns) {

  /** Returns the split open file cost for this scan. */
  long splitOpenFileCost();
+
+  /** Create a new scan that will report the scan metrics to the {@code reporter} */
+  ThisT withMetricsReporter(Collection<MetricsReporter> reporter);


I don't think we use with prefixes throughout the Scan API. Also, we should accept a single reporter rather than a collection given the method name. Under the hood, it would be a list and we should be able to call this method more than once to add different reporters.

scan
    .filter(cold)
    .metricsReporter(reporter1)
    .metricsReporter(reporter2)
    .planFiles()

We may add a new method that accepts an iterable in the future if we have a use case for it.

+1 to that suggestion
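A minimal sketch of the chained, single-reporter style suggested above. It does not use the Iceberg classes: `SketchReporter`, `SketchScan`, and `FluentReporterDemo` are hypothetical stand-ins, and the report is reduced to a plain string. The only point illustrated is that each `metricsReporter(...)` call returns a new scan whose internal list holds one more reporter.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a metrics reporter; not the Iceberg interface.
interface SketchReporter {
  void report(String scanReport);
}

// Hypothetical immutable scan: every metricsReporter(...) call returns a new
// scan whose reporter list is the previous list plus the new reporter.
final class SketchScan {
  private final List<SketchReporter> reporters;

  SketchScan() {
    this(Collections.emptyList());
  }

  private SketchScan(List<SketchReporter> reporters) {
    this.reporters = reporters;
  }

  SketchScan metricsReporter(SketchReporter reporter) {
    List<SketchReporter> copy = new ArrayList<>(reporters);
    copy.add(reporter);
    return new SketchScan(Collections.unmodifiableList(copy));
  }

  int reporterCount() {
    return reporters.size();
  }

  // Stands in for planning: notifies every registered reporter once.
  void planFiles() {
    for (SketchReporter reporter : reporters) {
      reporter.report("scan finished");
    }
  }
}

final class FluentReporterDemo {
  // Registers two counting reporters and returns how many of them fired.
  static int run() {
    AtomicInteger reported = new AtomicInteger();
    SketchScan scan =
        new SketchScan()
            .metricsReporter(report -> reported.incrementAndGet())
            .metricsReporter(report -> reported.incrementAndGet());
    scan.planFiles();
    return reported.get();
  }
}
```

Accepting one reporter per call keeps the method name accurate while still allowing any number of reporters to be attached.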

aokolnychyi · 2023-03-22T04:31:31Z

core/src/main/java/org/apache/iceberg/TableScanContext.java

  }

-  MetricsReporter metricsReporter() {
-    return metricsReporter;
+  Collection<MetricsReporter> metricsReporter() {


Should the method name change too?

aokolnychyi · 2023-03-22T04:37:04Z

core/src/main/java/org/apache/iceberg/TableScanContext.java

  }

  TableScanContext reportWith(MetricsReporter reporter) {
+    metricsReporter().add(reporter);


If I remember correctly, this class is immutable and we create a new instance on every call. We should probably follow what is done for updating properties in withOption, where we copy existing options into a new list before adding.

TableScanContext reportWith(MetricsReporter reporter) {
  ImmutableList.Builder<MetricsReporter> builder = ImmutableList.builder();
  builder.addAll(metricsReporters);
  builder.add(reporter);
  List<MetricsReporter> newMetricsReporters = builder.build();
  return new TableScanContext(
      snapshotId,
      rowFilter,
      ignoreResiduals,
      caseSensitive,
      colStats,
      projectedSchema,
      selectedColumns,
      options,
      fromSnapshotId,
      toSnapshotId,
      planExecutor,
      fromSnapshotInclusive,
      newMetricsReporters);
}

I agree that this seems like the better approach
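To illustrate why copying before adding matters for an immutable context, here is a small sketch using only the JDK. `SketchContext` and `CopyOnAddDemo` are hypothetical names, and reporters are represented as strings purely for brevity. The real `TableScanContext` carries many more fields.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical context holding only a reporter list (strings stand in for reporters).
final class SketchContext {
  private final List<String> reporters;

  SketchContext(List<String> reporters) {
    // Defensive copy so later changes to the caller's list cannot leak in.
    this.reporters = Collections.unmodifiableList(new ArrayList<>(reporters));
  }

  // Copy-on-add: builds a new context and leaves the receiver untouched.
  SketchContext reportWith(String reporter) {
    List<String> copy = new ArrayList<>(reporters);
    copy.add(reporter);
    return new SketchContext(copy);
  }

  List<String> reporters() {
    return reporters;
  }
}

final class CopyOnAddDemo {
  // Returns true when deriving a context does not modify the original one.
  static boolean parentUnchanged() {
    SketchContext base = new SketchContext(Collections.singletonList("default"));
    SketchContext derived = base.reportWith("custom");
    return base.reporters().size() == 1 && derived.reporters().size() == 2;
  }
}
```

Had `reportWith` appended to the shared list instead, every scan derived from the same context would see the extra reporter.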

aokolnychyi · 2023-03-22T04:37:43Z

core/src/main/java/org/apache/iceberg/TableScanContext.java

+        metricsReporter());
+  }
+
+  TableScanContext reportWith(Collection<MetricsReporter> reporters) {


I think it is OK to just have a method that accepts one reporter for now.

aokolnychyi · 2023-03-22T04:38:21Z

.palantir/revapi.yml

@@ -385,10 +388,6 @@ acceptedBreaks:
      old: "method void org.apache.iceberg.SnapshotProducer<ThisT>::validate(org.apache.iceberg.TableMetadata)\
        \ @ org.apache.iceberg.StreamingDelete"
      justification: "Removing deprecations for 1.2.0"
-    - code: "java.method.returnTypeChangedCovariantly"


Why is this changed?

aokolnychyi · 2023-03-22T04:40:00Z

.palantir/revapi.yml

@@ -66,6 +66,9 @@ acceptedBreaks:
      old: "method void org.apache.iceberg.io.DataWriter<T>::add(T)"
      justification: "Removing deprecated method"
  "1.1.0":
+    org.apache.iceberg:iceberg-api:
+    - code: "java.method.addedToInterface"
+      justification: "Add metricsreporter to Scan"


Instead of modifying the checks, we should throw UnsupportedOperationException in a default implementation, just as we do today for some TableScan methods that were added later.
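The default-method approach can be sketched as follows. `SketchScanApi`, `LegacyScan`, and `DefaultMethodDemo` are hypothetical names used only for illustration, and `Runnable` stands in for the reporter type. The sketch shows that an implementation written before the new method existed still compiles and fails only if the new method is actually called.

```java
// Hypothetical interface standing in for a public scan API.
interface SketchScanApi {
  String planFiles();

  // Added later: the default body keeps existing implementations compiling.
  default SketchScanApi metricsReporter(Runnable reporter) {
    throw new UnsupportedOperationException(
        this.getClass().getName() + " doesn't implement metricsReporter");
  }
}

// An implementation written before metricsReporter existed; it needs no changes.
final class LegacyScan implements SketchScanApi {
  @Override
  public String planFiles() {
    return "planned";
  }
}

final class DefaultMethodDemo {
  // Returns true when the legacy implementation rejects the call with a clear message.
  static boolean legacyRejectsReporter() {
    try {
      new LegacyScan().metricsReporter(() -> {});
      return false;
    } catch (UnsupportedOperationException e) {
      return e.getMessage().contains("metricsReporter");
    }
  }
}
```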

aokolnychyi · 2023-03-22T04:43:08Z

core/src/main/java/org/apache/iceberg/SnapshotScan.java

@@ -144,7 +144,9 @@ public CloseableIterable<T> planFiles() {
                  .scanMetrics(ScanMetricsResult.fromScanMetrics(scanMetrics()))
                  .metadata(metadata)
                  .build();
-          context().metricsReporter().report(scanReport);
+          context()


nit: What about an explicit for each in this case? I am not sure such statements are easy to read since they are split on multiple lines.

for (MetricsReporter reporter : context().metricsReporters()) {
  reporter.report(scanReport);
}

aokolnychyi · 2023-03-22T04:45:19Z

spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkMetricsReporter.java

+    this.metricsReport = report;
+  }
+
+  MetricsReport getMetricsReport() {


nit: We don't use getXXX prefixes for getters, just metricsReport.

I also don't think this reporter is specific to Spark. We may think of a good name and put it in core. I believe it would be a common use case to intercept scan reports.

I believe @karuppayya meant to report from here directly to Spark metrics. @karuppayya were you planning to add that functionality as part of this PR?

nastra

@karuppayya it's a bit unclear to me whether you're planning to do the metrics reporting to Spark as part of this PR or not. Could you clarify please?
Also, it would be good to add some tests that make sure reporting via multiple metrics reporters works properly.

nastra · 2023-03-22T13:58:57Z

core/src/main/java/org/apache/iceberg/TableScanContext.java

  }

  TableScanContext reportWith(MetricsReporter reporter) {
+    metricsReporter().add(reporter);


I agree that this seems like the better approach

nastra · 2023-03-22T14:01:14Z

.palantir/revapi.yml

@@ -66,6 +66,9 @@ acceptedBreaks:
      old: "method void org.apache.iceberg.io.DataWriter<T>::add(T)"
      justification: "Removing deprecated method"
  "1.1.0":
+    org.apache.iceberg:iceberg-api:
+    - code: "java.method.addedToInterface"
+      justification: "Add metricsreporter to Scan"


nastra · 2023-03-22T14:01:45Z

api/src/main/java/org/apache/iceberg/Scan.java

@@ -171,4 +172,7 @@ default ThisT select(String... columns) {

  /** Returns the split open file cost for this scan. */
  long splitOpenFileCost();
+
+  /** Create a new scan that will report the scan metrics to the {@code reporter} */
+  ThisT withMetricsReporter(Collection<MetricsReporter> reporter);


+1 to that suggestion

nastra · 2023-03-22T14:05:03Z

spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkMetricsReporter.java

+    this.metricsReport = report;
+  }
+
+  MetricsReport getMetricsReport() {


I believe @karuppayya meant to report from here directly to Spark metrics. @karuppayya were you planning to add that functionality as part of this PR?

karuppayya · 2023-03-22T14:11:41Z

I believe @karuppayya meant to report from here directly to Spark metrics. @karuppayya were you planning to add that functionality as part of this PR?

Yes, that was the idea: to access the metrics in SparkScanBuilder and use them here.

karuppayya · 2023-03-22T14:13:06Z

I am not sure of the reason for workflow needing approval to start. @RussellSpitzer @aokolnychyi @nastra @rdblue any idea why this would happen, any recent change?

nastra · 2023-03-22T14:25:22Z

I believe @karuppayya meant to report from here directly to Spark metrics. @karuppayya were you planning to add that functionality as part of this PR?

Yes, that was the idea: to access the metrics in SparkScanBuilder and use them here.

Ok I thought that the SparkMetricsReporter would have some custom code to report to Spark directly, but if that's not required, then I don't think we need that class. In that case you could just have a lambda similar to

iceberg/core/src/main/java/org/apache/iceberg/rest/RESTSessionCatalog.java

Line 361 in e340ad5

report -> reportMetrics(tableIdentifier, report, session::headers));

Something like

private void reportMetricsToSpark(MetricsReport report) {
  // report to spark
}

BatchScan scan =
        table
            .newBatchScan()
            .caseSensitive(caseSensitive)
            .filter(filterExpression())
            .project(expectedSchema)
            .withMetricsReporter(this::reportMetricsToSpark);
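The method-reference idea can be tried in isolation with a small sketch. It assumes the reporter type has a single abstract method, which is what allows a lambda or method reference to be passed. `SketchMetricsReporter`, `SketchScanBuilder`, and `reportMetricsToEngine` are hypothetical names, and the report is a plain string rather than a real metrics report.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical single-method reporter type, so lambdas and method references fit.
@FunctionalInterface
interface SketchMetricsReporter {
  void report(String report);
}

// Hypothetical builder that forwards reports through a private method,
// so no dedicated reporter class is needed.
final class SketchScanBuilder {
  private final List<String> forwarded = new ArrayList<>();

  private void reportMetricsToEngine(String report) {
    // A real integration would publish to the engine's metrics here.
    forwarded.add(report);
  }

  SketchMetricsReporter reporter() {
    return this::reportMetricsToEngine;
  }

  List<String> forwarded() {
    return forwarded;
  }
}

final class MethodRefDemo {
  // Sends one report through the method reference and returns how many arrived.
  static int run() {
    SketchScanBuilder builder = new SketchScanBuilder();
    SketchMetricsReporter reporter = builder.reporter();
    reporter.report("scan-1");
    return builder.forwarded().size();
  }
}
```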

aokolnychyi · 2023-03-22T22:48:36Z

Since we don't know what our metrics reporting will look like until we support Spark 3.4, what about focusing only on the core logic for adding custom reporters?

aokolnychyi · 2023-03-22T22:49:41Z

api/src/main/java/org/apache/iceberg/Scan.java

+  /** Create a new scan that will report the scan metrics to the {@code reporter} */
+  default ThisT metricsReporter(MetricsReporter reporter) {
+    throw new UnsupportedOperationException(
+        this.getClass().getName() + " doesn't implement metricReporter");


Typo in the method name inside the exception message (should be metricsReporter).

Looks like this comment was missed. The exception message should include the correct method name.

aokolnychyi · 2023-03-22T22:50:37Z

api/src/main/java/org/apache/iceberg/Scan.java

@@ -171,4 +172,10 @@ default ThisT select(String... columns) {

  /** Returns the split open file cost for this scan. */
  long splitOpenFileCost();
+
+  /** Create a new scan that will report the scan metrics to the {@code reporter} */


The doc should indicate that this adds a reporter to the list of existing reporters rather than overriding it.

aokolnychyi · 2023-03-22T22:51:07Z

.palantir/revapi.yml

@@ -67,45 +67,9 @@ acceptedBreaks:
      justification: "Removing deprecated method"
  "1.1.0":
    org.apache.iceberg:iceberg-core:
-    - code: "java.class.noLongerImplementsInterface"


Why are these changes needed?

aokolnychyi · 2023-03-22T22:52:52Z

spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java

@@ -420,12 +421,14 @@ private Scan buildBatchScan() {
  private Scan buildBatchScan(Long snapshotId, Long asOfTimestamp, String branch, String tag) {
    Schema expectedSchema = schemaWithMetadataColumns();

+    SparkMetricsReporter reporter = new SparkMetricsReporter();


I don't think there is any value in this class at this point. Let's remove Spark classes from this PR and add them when we know what the reporting logic will look like.

nastra

almost there, just a few small things to address and then I think we can get this merged

nastra · 2023-03-28T13:12:42Z

api/src/main/java/org/apache/iceberg/Scan.java

@@ -171,4 +172,10 @@ default ThisT select(String... columns) {

  /** Returns the split open file cost for this scan. */
  long splitOpenFileCost();
+
+  /** Create a new scan that will report the scan metrics to the {@code reporter} */


nastra · 2023-03-28T13:17:49Z

core/src/test/java/org/apache/iceberg/TestScanPlanningAndReporting.java

@@ -42,6 +43,33 @@ public TestScanPlanningAndReporting() {
    super(2);
  }

+  @Test
+  public void scanningWithMutipleReporters() throws IOException {


Suggested change

-  public void scanningWithMutipleReporters() throws IOException {
+  public void scanningWithMultipleReporters() throws IOException {

nastra · 2023-03-28T13:18:19Z

core/src/test/java/org/apache/iceberg/TestScanPlanningAndReporting.java

+            .metricsReporter(
+                (MetricsReporter) -> {
+                  reportedCount.getAndIncrement();
+                })
+            .metricsReporter(
+                (MetricsReporter) -> {
+                  reportedCount.getAndIncrement();
+                });


Suggested change

-            .metricsReporter(
-                (MetricsReporter) -> {
-                  reportedCount.getAndIncrement();
-                })
-            .metricsReporter(
-                (MetricsReporter) -> {
-                  reportedCount.getAndIncrement();
-                });
+            .metricsReporter((MetricsReporter) -> reportedCount.getAndIncrement())
+            .metricsReporter((MetricsReporter) -> reportedCount.getAndIncrement());

nastra · 2023-03-28T13:21:16Z

core/src/test/java/org/apache/iceberg/TestScanPlanningAndReporting.java

+    try (CloseableIterable<FileScanTask> fileScanTasks = tableScan.planFiles()) {
+      fileScanTasks.forEach(task -> {});
+    }
+    assertThat(reportedCount.get()).isEqualTo(2);


Suggested change

-    try (CloseableIterable<FileScanTask> fileScanTasks = tableScan.planFiles()) {
-      fileScanTasks.forEach(task -> {});
-    }
-    assertThat(reportedCount.get()).isEqualTo(2);
+    try (CloseableIterable<FileScanTask> fileScanTasks = tableScan.planFiles()) {
+      fileScanTasks.forEach(task -> {});
+    }
+    assertThat(reportedCount.get()).isEqualTo(2);
+    // make sure default metrics reporter is still reporting
+    ScanReport scanReport = reporter.lastReport();
+    assertThat(scanReport).isNotNull();
+    assertThat(scanReport.tableName()).isEqualTo(tableName);
+    assertThat(scanReport.snapshotId()).isEqualTo(1L);
+    ScanMetricsResult result = scanReport.scanMetrics();
+    assertThat(result.totalPlanningDuration().totalDuration()).isGreaterThan(Duration.ZERO);
+    assertThat(result.resultDataFiles().value()).isEqualTo(1);

this is also to make sure the default metrics reporter still reports

aokolnychyi · 2023-03-29T06:07:16Z

api/src/main/java/org/apache/iceberg/Scan.java

@@ -171,4 +172,13 @@ default ThisT select(String... columns) {

  /** Returns the split open file cost for this scan. */
  long splitOpenFileCost();
+
+  /**
+   * Create a new scan that will report the scan metrics to the {@code reporter} {@code reporter} is


nit: Duplicate {@code reporter}?

I'd actually consider adapting the message as below.

Create a new scan that will report scan metrics to the provided reporter in addition to reporters maintained by the scan.

aokolnychyi

Two minor nits and should be good to go.

nastra

LGTM once the two nits are fixed. Also we should probably adjust the PR title / commit msg when merging to better reflect the scope of this work

aokolnychyi · 2023-03-29T19:18:05Z

Thank you, @karuppayya! Thanks for reviewing, @nastra!

github-actions bot added API core labels Feb 23, 2023

nastra requested changes Feb 23, 2023

api/src/main/java/org/apache/iceberg/BatchScanAdapter.java (outdated)

core/src/main/java/org/apache/iceberg/SnapshotScan.java (outdated)

github-actions bot added the spark label Mar 13, 2023

aokolnychyi reviewed Mar 15, 2023

api/src/main/java/org/apache/iceberg/Table.java (outdated)

karuppayya force-pushed the scan_metrics branch from 21dd74c to 9202fb3 Compare March 20, 2023 16:22

aokolnychyi reviewed Mar 22, 2023

nastra reviewed Mar 22, 2023

aokolnychyi reviewed Mar 22, 2023

karuppayya added 6 commits March 27, 2023 06:18

Add HasScan metrics interface

63f0f5f

Revert "Add HasScan metrics interface"

1d39c8d

This reverts commit 531467d.

Use scanmetrics

b44dba5

Fix test failure

b7b3ee1

Add metricsreporter to scan api

95a1e7f

Remove unnecessary changes

9ff31be

karuppayya added 3 commits March 27, 2023 06:19

Address review commenst

7a93226

Add basic test

17063b4

Remove Spark changes

c87d5d8

karuppayya force-pushed the scan_metrics branch from 3f32214 to c87d5d8 Compare March 27, 2023 13:20

Revert revapi to that of master

d4e4778

nastra reviewed Mar 28, 2023

Address review commenst

03f0d0e

aokolnychyi reviewed Mar 29, 2023

aokolnychyi approved these changes Mar 29, 2023

nastra approved these changes Mar 29, 2023

Address review comments

dc083e9

karuppayya changed the title ~~Add HasScan metrics interface~~ Ability to add multiple metrics reporters to scan Mar 29, 2023

aokolnychyi merged commit f536c84 into apache:master Mar 29, 2023
33 checks passed
