Skip to content

PHOENIX-7877 More granular metrics for ReplicationLogWriter#2492

Merged
tkhurana merged 1 commit into
apache:PHOENIX-7562-feature-newfrom
tkhurana:PHOENIX-7877
Jun 4, 2026
Merged

PHOENIX-7877 More granular metrics for ReplicationLogWriter#2492
tkhurana merged 1 commit into
apache:PHOENIX-7562-feature-newfrom
tkhurana:PHOENIX-7877

Conversation

@tkhurana
Copy link
Copy Markdown
Contributor

@tkhurana tkhurana commented Jun 2, 2026

Summary

Adds finer-grained histograms in MetricsReplicationLogGroupSource to give better visibility into where time is spent inside ReplicationLogGroup:

  • fsSyncTime — wall time of the underlying filesystem sync (fsync) inside processPendingSyncs.
  • pendingSyncWaitTime — per-pending-sync wait between consumer pickup of a SYNC event and the start of fsync. Helps quantify within-batch drain (events the consumer processes between picking up the SYNC marker and reaching endOfBatch).
  • pendingSyncCount — number of sync futures coalesced into a single fsync. Indicates how well the disruptor batches sync calls.
  • batchSize — number of events drained per disruptor batch. Counter is reset in a finally block on endOfBatch so the value never leaks across batches when an exception path bypasses processPendingSyncs.
  • rotationTime — was declared but never populated; now instrumented in ReplicationLog.LogRotationTask.run and recorded on both success and failure paths.

Test plan

  • mvn test -pl phoenix-core -Dtest=ReplicationLogGroupTest passes
  • mvn verify -pl phoenix-core -Dit.test=ReplicationLogGroupIT passes
  • mvn spotless:check passes
  • After deployment, verify new histograms appear under RegionServer,sub=ReplicationLogGroup,haGroup=... JMX context with non-zero values during sustained workload

Adds histograms to decompose ReplicationLogGroup sync latency: pendingSyncWaitTime
measures wait between consumer pickup of a SYNC event and start of fsync;
fsSyncTime (renamed from modeSyncTime) measures the underlying filesystem sync.
Also instruments rotationTime in ReplicationLog so existing metric is actually
emitted. Introduces a builder for ReplicationLogMetricValues to avoid positional
argument hazards as the metric set grows.
@tkhurana tkhurana requested a review from apurtell June 2, 2026 20:03
}
metrics.updatePendingSyncCount(pendingSyncCount);
// Record per-event wait between SYNC pickup and fsync start.
long syncStartNs = System.nanoTime();
Copy link
Copy Markdown
Contributor

@apurtell apurtell Jun 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move store to syncStartNs below the histogram updates to only time the actual sync. Just a nit.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds finer-grained metrics to ReplicationLogGroup/ReplicationLogWriter to improve visibility into disruptor batching behavior, pending SYNC coalescing, and underlying fsync/rotation timing, plus an IT sanity check to ensure metrics are emitted.

Changes:

  • Instrument ReplicationLogGroup consumer to record fsSyncTime, pendingSyncWaitTime, pendingSyncCount, and per-batch batchSize.
  • Record rotationTime from ReplicationLog.LogRotationTask.run() on both success and failure paths.
  • Extend metric snapshot plumbing (MetricsReplicationLogGroupSource*, ReplicationLogMetricValues) and add an IT assertion that metrics were emitted.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
phoenix-core/src/it/java/org/apache/phoenix/replication/ReplicationLogGroupIT.java Adds a metrics-emission sanity check to guard against declared-but-never-emitted metrics.
phoenix-core-server/src/main/java/org/apache/phoenix/replication/ReplicationLogGroup.java Adds disruptor-consumer instrumentation for per-batch sizing and pending-sync timing/coalescing.
phoenix-core-server/src/main/java/org/apache/phoenix/replication/ReplicationLog.java Instruments rotation task wall time for rotationTime.
phoenix-core-server/src/main/java/org/apache/phoenix/replication/metrics/ReplicationLogMetricValues.java Extends metric snapshot structure to include new histograms via a builder.
phoenix-core-server/src/main/java/org/apache/phoenix/replication/metrics/MetricsReplicationLogGroupSourceImpl.java Registers new histograms and includes them in snapshot values.
phoenix-core-server/src/main/java/org/apache/phoenix/replication/metrics/MetricsReplicationLogGroupSource.java Defines new metric keys/descriptions (and adjusts existing ones).
Comments suppressed due to low confidence (1)

phoenix-core-server/src/main/java/org/apache/phoenix/replication/ReplicationLogGroup.java:1181

  • The early return on fatalException != null happens before the try { ... } finally { ... } block, so the endOfBatch cleanup in finally never runs in that path. With the newly added batchEventCount, this can leak the counter across batches (and potentially grow unbounded) once a fatal exception is set.
        // so producer threads unblock without waiting for the sync timeout.
        if (event.type == EVENT_TYPE_SYNC) {
          event.syncFuture.completeExceptionally(fatalException);
        }
        return;

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +37 to +41
// All time histograms in this source are nanoseconds.
String APPEND_TIME = "appendTime";
String APPEND_TIME_DESC = "Histogram of time taken for append operations in nanoseconds";

String SYNC_TIME = "syncTimeMs";
String SYNC_TIME = "syncTime";
Comment on lines +194 to +195
assertTrue("pendingSyncWaitTime should be > 0, got " + values.getPendingSyncWaitTime(),
values.getPendingSyncWaitTime() > 0);
@tkhurana tkhurana merged commit 97d34ae into apache:PHOENIX-7562-feature-new Jun 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants