Skip to content

Fixes Kafka Supervisor Lag Report#13380

Merged
kfaraz merged 3 commits into
apache:masterfrom
tejaswini-imply:fix-kakfa_supervisor_lags_report
Nov 17, 2022
Merged

Fixes Kafka Supervisor Lag Report#13380
kfaraz merged 3 commits into
apache:masterfrom
tejaswini-imply:fix-kakfa_supervisor_lags_report

Conversation

@tejaswini-imply
Copy link
Copy Markdown
Member

@tejaswini-imply tejaswini-imply commented Nov 17, 2022

Fixes inclusion of all stream partitions in all tasks.

Description

The PR (Adds Idle feature to SeekableStreamSupervisor for inactive stream) - #13144 updates the resulting lag calculation map in KafkaSupervisor to include all the latest partitions from the stream to set the idle state accordingly rather than the previous way of lag calculation only for the partitions actively being read from the stream. This led to an explosion of metrics in lag reports in cases where 1000s of tasks per supervisor are present.

This PR creates a new method to generate lags for only those partitions a single task is actively reading from while updating the Supervisor reports.

Key changed/added classes in this PR
  • KafkaSupervisor

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

return currentOffsets
.entrySet()
.stream()
.filter(e -> latestSequenceFromStream.get(e.getKey()) != null)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old method reports the lag for every partition present in latestSequenceFromStream.
But this new one also seems to be doing the same thing because you are filtering the entries using the latestSequenceFromStream.

I am perhaps missing something but not sure how the two methods are different.
Please also add a test to verify the difference between the two results.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous method includes lags for all the latest partitions of the stream whereas this method only includes the required partitions (as passed by currentOffsets) lags.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also will add a unit test for the reports.

@tejaswini-imply
Copy link
Copy Markdown
Member Author

@kfaraz Updated the PR with a unit test. PTAL.

Copy link
Copy Markdown
Contributor

@kfaraz kfaraz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding the tests, @tejaswini-imply !

@kfaraz kfaraz merged commit bf10ff7 into apache:master Nov 17, 2022
@gianm gianm added this to the 25.0 milestone Nov 17, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants