Flink: Implement enumerator metrics for pending splits, pending records, and split discovery #9524

mas-chen · 2024-01-19T20:03:19Z

Enumerator metrics are now supported in Flink 1.18. NOTE: this will not be backported.

pvary · 2024-01-20T06:57:26Z

flink/v1.18/flink/src/main/java/org/apache/iceberg/flink/source/assigner/SplitAssigner.java

@@ -115,4 +115,7 @@ default void onCompletedSplits(Collection<String> completedSplitIds) {}
   * snapshots and splits, which defeats the purpose of throttling.
   */
  int pendingSplitCount();
+
+  /** Pending records count */


How accurate is split.task().estimatedRowsCount(), which we are using for the calculations? Shall we state in the comments, or in the method name, that the value is estimated? 🤔

I wouldn't worry too much. The estimation only comes in when a large file is split: there is no record count for each split/chunk, hence the record count is estimated from the ratio of split bytes to file bytes.
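For illustration, the byte-ratio estimate described above can be sketched in Java as follows. This is a hypothetical standalone helper (SplitRowEstimate and its parameters are not from the PR or Iceberg's API), assuming the split knows its own byte length plus the parent file's size and record count. Because rows are rarely spread perfectly evenly across a file's bytes, the result is an approximation, which is the concern raised in this thread.

```java
/**
 * Hypothetical sketch: estimate the rows in a split that covers only part of a data file
 * by scaling the file's record count by the fraction of the file's bytes the split covers.
 */
final class SplitRowEstimate {
  private SplitRowEstimate() {}

  /**
   * @param fileRecordCount total records in the data file (from file metadata)
   * @param fileSizeInBytes total size of the data file
   * @param splitLengthInBytes number of bytes of the file covered by this split
   * @return estimated number of records in the split (truncated toward zero)
   */
  static long estimatedRows(long fileRecordCount, long fileSizeInBytes, long splitLengthInBytes) {
    if (fileSizeInBytes <= 0) {
      return 0L; // nothing to scale against
    }
    double scannedFraction = (double) splitLengthInBytes / fileSizeInBytes;
    return (long) (scannedFraction * fileRecordCount);
  }

  public static void main(String[] args) {
    // A 1,000,000-record, 512 MiB file read as 128 MiB chunks: each chunk is 1/4 of the file.
    long fileSize = 512L * 1024 * 1024;
    long chunk = 128L * 1024 * 1024;
    System.out.println(estimatedRows(1_000_000L, fileSize, chunk)); // prints 250000
  }
}
```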

I still think that the method name should contain "estimated".

I think of whether it is estimated or not as more of an implementation detail; perhaps there could be a more accurate way of computing this value in the future. Rather than leave the comment in the interface, are you OK if we just leave the comment on the implementing method?

We can leave more hints on how to use this method in the interface.

This is a public-facing interface.
What about:

/** Pending records count. Could be an estimation if exact numbers are not available. */

I am not comfortable stating one thing in the comments/docs and doing other things in the implementation.
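A minimal sketch of how an assigner could pair pendingSplitCount() with a pendingRecords() method documented as proposed above, assuming each pending split carries a precomputed row estimate. SimplePendingAssigner and PendingSplit are hypothetical names for illustration, not the classes changed in this PR.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical minimal split: only the id and estimated row count matter here. */
final class PendingSplit {
  final String splitId;
  final long estimatedRows;

  PendingSplit(String splitId, long estimatedRows) {
    this.splitId = splitId;
    this.estimatedRows = estimatedRows;
  }
}

/** Hypothetical FIFO assigner exposing both pending-work counters. */
final class SimplePendingAssigner {
  private final Deque<PendingSplit> pending = new ArrayDeque<>();

  synchronized void addSplit(PendingSplit split) {
    pending.addLast(split);
  }

  /** Hands out the oldest pending split, or returns null when none are pending. */
  synchronized PendingSplit assignNext() {
    return pending.pollFirst();
  }

  synchronized int pendingSplitCount() {
    return pending.size();
  }

  /** Pending records count. Could be an estimation if exact numbers are not available. */
  synchronized long pendingRecords() {
    long total = 0L;
    for (PendingSplit split : pending) {
      total += split.estimatedRows; // per-split value may itself be an estimate
    }
    return total;
  }
}
```

Summing on every call keeps the sketch simple; if the gauge is read often, an implementation might instead maintain a running total updated on add and assign.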

pvary · 2024-01-20T07:05:14Z

flink/v1.18/flink/src/test/java/org/apache/iceberg/flink/MiniClusterResource.java

@@ -50,4 +51,18 @@ public static MiniClusterWithClientResource createWithClassloaderCheckDisabled()
            .setConfiguration(DISABLE_CLASSLOADER_CHECK_CONFIG)
            .build());
  }
+
+  public static MiniClusterWithClientResource createWithClassloaderCheckDisabled(


There's an ongoing effort to move from JUnit4 to JUnit5 tests. It would be good to add new features to the JUnit5 tests only.

In this case, no new test class is added. Maybe the JUnit5 migration should be handled separately?

I am concerned that we are adding a new feature to the old testing harness here.

See comments below

pvary · 2024-01-20T07:06:23Z

...k/v1.18/flink/src/test/java/org/apache/iceberg/flink/source/TestIcebergSourceContinuous.java

  @ClassRule
  public static final MiniClusterWithClientResource MINI_CLUSTER_RESOURCE =
-      MiniClusterResource.createWithClassloaderCheckDisabled();
+      MiniClusterResource.createWithClassloaderCheckDisabled(METRIC_REPORTER);


Could we use JUnit5 for testing?

That would require refactoring this whole class. I can refactor it, but that is outside the scope of this PR. Otherwise, I can separate the metric testing into a new class and use JUnit5 there.

It's just that these metrics rely on logic similar to the continuous Iceberg source testing, so I added them here so as not to duplicate tests/code.

How about moving the existing test to JUnit5 in another PR, and rebasing this one on top of that one?

@nastra: Is there any ongoing work to move this test to JUnit5?

I haven't seen a PR that would include migrating this class here to JUnit5

@pvary @stevenzwu @nastra I'm fine with that, and I can volunteer to migrate this class to JUnit5 after this PR is merged. My bandwidth is limited this week, so I can address it next week, but I would like these metrics to land in the upcoming Iceberg release.

I agree with @mas-chen that the JUnit5 migration can be tackled as a separate PR. This PR doesn't add any new test classes.

@mas-chen: It seems more complicated this way (modify, migrate, remove), but I am mostly interested in the final result 😀

pvary · 2024-01-20T07:08:04Z

...k/v1.18/flink/src/test/java/org/apache/iceberg/flink/source/TestIcebergSourceContinuous.java

@@ -367,6 +382,8 @@ public void testSpecificSnapshotTimestamp() throws Exception {

      List<Row> result3 = waitForResult(iter, 2);
      TestHelpers.assertRecords(result3, batch3, tableResource.table().schema());
+
+      assertThatIcebergEnumeratorMetricsExist();


Do we want tests for asserting the metrics values too?

I guess it is probably difficult to reliably assert on the values of unassignedSplits and pendingRecords due to timing, unless we can add a listener to the metric reporter to track all value changes for a metric.

We might be able to wait for the expected metric values, like we do here:

iceberg/flink/v1.18/flink/src/test/java/org/apache/iceberg/flink/source/TestIcebergSourceWithWatermarkExtractor.java (line 338 in 13d2160)

Awaitility.await()

Polling won't work because the value may change from 0 to 1 and back to 0 within a single polling interval.

Yeah, polling is a challenge. I can write fine-grained unit tests to have better control over when it is invoked. However, I don't think it is possible to assert on a distinct value.

Makes sense. Thanks for the explanation!
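The timing problem can be illustrated with a hypothetical in-memory metric that records every value it is set to. This is not a Flink MetricReporter; RecordingLongMetric is an assumed name used only to contrast a point-in-time poll with the full history of changes a listener would see.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical metric holder that keeps every value it was set to, so a test could assert on
 * transitions (e.g. 0 -> 1 -> 0) that a periodic poll would miss.
 */
final class RecordingLongMetric {
  private long current;
  private final List<Long> history = new ArrayList<>();

  /** Updates the value and appends it to the history atomically. */
  synchronized void set(long value) {
    current = value;
    history.add(value);
  }

  /** What a poller sees: only the latest value at the moment it reads. */
  synchronized long poll() {
    return current;
  }

  /** Copy of every value that was ever set, in order. */
  synchronized List<Long> history() {
    return new ArrayList<>(history);
  }

  public static void main(String[] args) {
    RecordingLongMetric unassignedSplits = new RecordingLongMetric();
    unassignedSplits.set(0);
    long before = unassignedSplits.poll(); // poll #1
    unassignedSplits.set(1); // a split is discovered...
    unassignedSplits.set(0); // ...and assigned before the next poll
    long after = unassignedSplits.poll(); // poll #2
    System.out.println(before + " " + after); // prints "0 0": the transient 1 is never polled
    System.out.println(unassignedSplits.history()); // prints "[0, 1, 0]"
  }
}
```

Both polls observe 0, while the recorded history exposes the short-lived 1, which is why a listener, rather than polling, would be needed to assert on such values.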

stevenzwu · 2024-01-20T05:43:45Z

flink/v1.18/flink/src/main/java/org/apache/iceberg/flink/ElapsedTimeGauge.java

+ * ElapsedTimeGauge#refreshLastRecordedTime()}.
+ */
+@Internal
+public class ElapsedTimeGauge implements Gauge<Long> {


nit: maybe move it to the util package?
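Only the declaration of ElapsedTimeGauge and a reference to refreshLastRecordedTime() are visible in this hunk. A minimal sketch of such a gauge, assuming it reports the time elapsed since the last refresh in a configurable unit (as the elapsedSecondsSinceLastSplitDiscovery metric name suggests). SimpleGauge is a local stand-in for Flink's Gauge<T> (which exposes T getValue()) so the sketch compiles without Flink; the injectable clock and constructor shape are assumptions, not the PR's code.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

/** Local stand-in for Flink's Gauge<T>, which exposes T getValue(). */
interface SimpleGauge<T> {
  T getValue();
}

/** Hypothetical gauge reporting time since refreshLastRecordedTime() was last called. */
final class ElapsedTimeGaugeSketch implements SimpleGauge<Long> {
  private final TimeUnit reportUnit;
  private final LongSupplier nanoClock; // injectable so tests can control time
  private volatile long lastRecordedTimeNanos;

  ElapsedTimeGaugeSketch(TimeUnit reportUnit) {
    this(reportUnit, System::nanoTime);
  }

  ElapsedTimeGaugeSketch(TimeUnit reportUnit, LongSupplier nanoClock) {
    this.reportUnit = reportUnit;
    this.nanoClock = nanoClock;
    this.lastRecordedTimeNanos = nanoClock.getAsLong(); // count from construction
  }

  /** Call whenever the tracked event happens, e.g. after a split discovery completes. */
  void refreshLastRecordedTime() {
    lastRecordedTimeNanos = nanoClock.getAsLong();
  }

  /** Elapsed time since the last refresh, truncated to the report unit. */
  @Override
  public Long getValue() {
    long elapsedNanos = nanoClock.getAsLong() - lastRecordedTimeNanos;
    return reportUnit.convert(elapsedNanos, TimeUnit.NANOSECONDS);
  }
}
```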

stevenzwu · 2024-01-21T04:45:43Z

flink/v1.18/flink/src/main/java/org/apache/iceberg/flink/source/assigner/SplitAssigner.java

@@ -115,4 +115,7 @@ default void onCompletedSplits(Collection<String> completedSplitIds) {}
   * snapshots and splits, which defeats the purpose of throttling.
   */
  int pendingSplitCount();
+
+  /** Pending records count */


I wouldn't worry too much. The estimation only comes in when a large file is split: there is no record count for each split/chunk, hence the record count is estimated from the ratio of split bytes to file bytes.

stevenzwu · 2024-01-21T04:49:04Z

flink/v1.18/flink/src/test/java/org/apache/iceberg/flink/MiniClusterResource.java

@@ -50,4 +51,18 @@ public static MiniClusterWithClientResource createWithClassloaderCheckDisabled()
            .setConfiguration(DISABLE_CLASSLOADER_CHECK_CONFIG)
            .build());
  }
+
+  public static MiniClusterWithClientResource createWithClassloaderCheckDisabled(


In this case, no new test class is added. Maybe the JUnit5 migration should be handled separately?

stevenzwu · 2024-01-21T04:50:00Z

...k/v1.18/flink/src/test/java/org/apache/iceberg/flink/source/TestIcebergSourceContinuous.java

+  private static void assertThatIcebergEnumeratorMetricsExist() {
+    assertThatIcebergSourceMetricExists(
+        "enumerator", "coordinator.enumerator.elapsedSecondsSinceLastSplitDiscovery");
+    assertThatIcebergSourceMetricExists("enumerator", "coordinator.enumerator.unassignedSplits");


What about pendingRecords?
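A hypothetical sketch of the existence check extended to cover pendingRecords, written against a plain set of reported metric identifiers rather than the test's actual reporter. The first two names come from the test above; coordinator.enumerator.pendingRecords is an assumed name, and suffix matching assumes the full identifier is prefixed with job and operator scope.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: lists expected enumerator metrics that were not reported. */
final class EnumeratorMetricCheck {
  // First two names appear in the test above; the pendingRecords name is an assumption.
  static final List<String> EXPECTED_SUFFIXES =
      Arrays.asList(
          "coordinator.enumerator.elapsedSecondsSinceLastSplitDiscovery",
          "coordinator.enumerator.unassignedSplits",
          "coordinator.enumerator.pendingRecords");

  private EnumeratorMetricCheck() {}

  /** Returns the expected suffixes that no reported identifier ends with, in declared order. */
  static List<String> missing(Set<String> reportedIdentifiers) {
    List<String> missing = new ArrayList<>();
    for (String suffix : EXPECTED_SUFFIXES) {
      // Match the whole name or a dot-separated suffix, so "xcoordinator..." does not count.
      boolean found =
          reportedIdentifiers.stream()
              .anyMatch(id -> id.equals(suffix) || id.endsWith("." + suffix));
      if (!found) {
        missing.add(suffix);
      }
    }
    return missing;
  }
}
```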

stevenzwu · 2024-01-21T04:56:08Z

...k/v1.18/flink/src/test/java/org/apache/iceberg/flink/source/TestIcebergSourceContinuous.java

@@ -367,6 +382,8 @@ public void testSpecificSnapshotTimestamp() throws Exception {

      List<Row> result3 = waitForResult(iter, 2);
      TestHelpers.assertRecords(result3, batch3, tableResource.table().schema());
+
+      assertThatIcebergEnumeratorMetricsExist();


I guess it is probably difficult to reliably assert on the values of unassignedSplits and pendingRecords due to timing, unless we can add a listener to the metric reporter to track all value changes for a metric.

pvary · 2024-01-30T08:23:43Z

@mas-chen: Please backport this to 1.17, 1.18 and continue with the JUnit5 migration PR.
Thanks @mas-chen for the PR and @stevenzwu for the review!

mas-chen · 2024-01-30T19:51:50Z

@pvary this is only supported by 1.18 and a 1.17 impl would cause a runtime error. As mentioned in the PR description, I don't think backporting is necessary here

pvary · 2024-01-31T13:17:03Z

@pvary this is only supported by 1.18 and a 1.17 impl would cause a runtime error. As mentioned in the PR description, I don't think backporting is necessary here

Got it. Thanks @mas-chen!

github-actions bot added the flink label Jan 19, 2024

pvary reviewed Jan 20, 2024

stevenzwu reviewed Jan 21, 2024

mas-chen force-pushed the enumerator-metrics branch from 4c6c7f1 to 0c8da06, January 26, 2024 07:21

Flink: Implement enumerator metrics for pending splits, pending records, and split discovery

341f9f4

mas-chen force-pushed the enumerator-metrics branch from 0c8da06 to 341f9f4, January 26, 2024 07:24

stevenzwu approved these changes Jan 29, 2024

pvary approved these changes Jan 30, 2024

pvary merged commit e5dc5ec into apache:main Jan 30, 2024
13 checks passed

adnanhemani pushed a commit to adnanhemani/iceberg that referenced this pull request Jan 30, 2024

Flink: Implement enumerator metrics for pending splits, pending records, and split discovery (apache#9524)

ff77d86

devangjhabakh pushed a commit to cdouglas/iceberg that referenced this pull request Apr 22, 2024

Flink: Implement enumerator metrics for pending splits, pending records, and split discovery (apache#9524)

219fcc0
