
Do not emit negative lag because of stale offsets #14292

Merged · 8 commits · Jul 5, 2023

Conversation

AmatyaAvadhanula (Contributor):

Do not emit negative lag because of stale offsets.

Description

The latest topic offsets are polled frequently and used to determine the lag relative to the current offsets. However, when the fetched offsets are stale (commonly due to connection issues), the computed lag may be negative.

This PR prevents emission of metrics when the offsets are stale and at least one of the partitions has a negative lag.
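The guard described above can be sketched as follows. This is a minimal illustration based on the diff excerpts in this conversation, not the exact Druid code; the method name `shouldEmitLag` and the parameter shapes are assumptions.

```java
import java.util.Map;

public class LagEmissionGuard
{
  /**
   * Sketch of the check this PR adds: skip emitting lag when the fetched
   * latest offsets are stale AND at least one partition shows a negative lag.
   * A negative lag means the "latest" offsets lag behind the current offsets,
   * so the computed value is meaningless.
   */
  public static boolean shouldEmitLag(
      Map<Integer, Long> partitionLags,
      long sequenceLastUpdatedMillis,
      long nowMillis,
      long offsetFetchPeriodMillis
  )
  {
    // Offsets are considered stale when they were last updated more than one
    // fetch period ago, mirroring the sequenceLastUpdated check in the diff.
    boolean areOffsetsStale =
        sequenceLastUpdatedMillis < nowMillis - offsetFetchPeriodMillis;
    boolean anyNegative = partitionLags.values().stream().anyMatch(x -> x < 0);
    return !(areOffsetsStale && anyNegative);
  }
}
```

Note that stale offsets alone do not suppress emission; only the combination of staleness and a negative partition lag does, so lag keeps flowing during brief connectivity blips that do not corrupt the computation.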

Release note

This PR prevents emission of negative streaming ingestion lag when the fetched latest offsets are stale.


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

System.currentTimeMillis() - tuningConfig.getOffsetFetchPeriod().getMillis()
);
if (areOffsetsStale && partitionLags.values().stream().anyMatch(x -> x < 0)) {
log.warn("Skipping negative lag emission as fetched offsets are stale");
Contributor:

Let's rephrase this so it is more informative for someone reading the log, e.g.:
"Lag is negative and will not be emitted because topic offsets have become stale. This will not impact data processing. Offsets become stale because..."

&& sequenceLastUpdated.getMillis()
< System.currentTimeMillis() - tuningConfig.getOffsetFetchPeriod().getMillis();
if (areOffsetsStale && partitionLags.values().stream().anyMatch(x -> x < 0)) {
log.warn("Lag is negative and will not be emitted because topic offsets have become stale. "
Contributor:

For troubleshooting, I think it'll also be good to log the topic:partition info where the offsets may potentially be stale

Contributor:

That info can bloat the log a lot. We can just say "Check the task report for more details around lag".

@@ -4220,6 +4220,18 @@ protected void emitLag()
return;
}

// Try emitting lag even with stale metrics provided that none of the partitions has negative lag
Contributor:

Suggested change
// Try emitting lag even with stale metrics provided that none of the partitions has negative lag
// Try emitting lag even with stale metrics provided that none of the partitions have negative lag

log.warn("Lag is negative and will not be emitted because topic offsets have become stale. "
+ "This will not impact data processing. "
+ "Offsets may become stale because of connectivity issues.");
return;
abhishekrb19 (Contributor), May 19, 2023:

Should we skip emitting lag metrics only for the stale partitions? I think in general, it'll be helpful to emit metrics for partitions that have non-negative lag. For example, if a topic's partitions are spread across multiple brokers and only some have connectivity issues. Or for a topic where some partitions receive little to no data, those may selectively be considered "stale".

Contributor:

If we do that, it will be very easy to get into a wrong debugging trail where the overall lag might appear lower than it actually is. I am in favor of not emitting lag for any partition at all. The partition level lag would still be available in the task reports.
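The concern above can be made concrete with a small example. This is purely illustrative; the lag values and the idea of filtering out stale partitions are hypothetical, not Druid code.

```java
import java.util.Map;

public class PartialLagExample
{
  public static void main(String[] args)
  {
    // Hypothetical per-partition lags; partition 1's offset is stale and
    // shows a bogus negative value.
    Map<Integer, Long> partitionLags = Map.of(0, 100L, 1, -5L, 2, 200L);

    // Emitting only the non-negative partitions would report a total of 300...
    long partialTotal = partitionLags.values().stream()
        .filter(lag -> lag >= 0)
        .mapToLong(Long::longValue)
        .sum();

    // ...while partition 1's real lag (unknown, possibly large) is silently
    // excluded, so the overall lag appears lower than it actually is.
    System.out.println(partialTotal); // prints 300
  }
}
```

This is why the discussion settles on suppressing the topic-level metric entirely when offsets are stale: a missing data point is easier to alert on than a present-but-understated one.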

Contributor:

I think there is a separate metric which we can emit for partition-level lag, without actually reporting/affecting the overall lag at all. But I guess having them in the report should be enough too.

abhishekrb19 (Contributor), May 22, 2023:

Yeah, a per-partition lag metric would complement the existing metrics. My main concern with not reporting any lag for a topic in this scenario is that we'd have periods of missing lag data for as long as there's at least one stale partition in the topic. Missing metrics can hide problems silently and affect how existing downstream consumers of the data alert, present metrics for visualization, etc. What do you think?

Contributor:

Missing metric data for a topic is easier to detect and be notified about than metric data missing some partitions. @AmatyaAvadhanula - do we already emit a lag metric for each partition in the topic?

Contributor (Author):

Yes, we do emit metrics for every partition

Contributor:

These partitions usually go stale because the supervisor can't connect to Kafka. We can revisit this later if not having any metric becomes a pain point. Ideally, users should also be alerting on missing metrics.

if (areOffsetsStale && partitionLags.values().stream().anyMatch(x -> x < 0)) {
log.warn("Lag is negative and will not be emitted because topic offsets have become stale. "
+ "This will not impact data processing. "
+ "Offsets may become stale because of connectivity issues.");
Contributor:

"Offsets may become stale because of connectivity issues." - This isn't very helpful.

if (areOffsetsStale && partitionLags.values().stream().anyMatch(x -> x < 0)) {
log.warn("Lag is negative and will not be emitted because topic offsets have become stale. "
+ "This will not impact data processing. "
+ "Offsets may become stale because of connectivity issues.");
Contributor:

Suggested change
+ "Offsets may become stale because of connectivity issues.");
+ "Offsets usually become stale when tasks cannot connect to Kafka cluster.");


abhishekagarwal87 merged commit 609833c into apache:master on Jul 5, 2023
73 checks passed
AmatyaAvadhanula added this to the 27.0 milestone on Jul 19, 2023
sergioferragut pushed a commit to sergioferragut/druid that referenced this pull request Jul 21, 2023