
Fix kinesis IT flakiness#12821

Merged
abhishekagarwal87 merged 1 commit into apache:master from AmatyaAvadhanula:feature-kinesis_flaky_tests on Aug 3, 2022

Conversation

@AmatyaAvadhanula
Contributor

Kinesis ITs run with a task duration less than or equal to the supervisor period, and this can cause flakiness.
This is fixed by reverting the task duration in the test supervisors to 120 seconds.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.
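The change described above is essentially a one-line config revert. As a hedged sketch (the field name `taskDuration` follows Druid's Kinesis supervisor spec; the stream name and helper function here are illustrative, not the actual test code):

```python
# Illustrative sketch of the fix: the Kinesis test supervisor's taskDuration
# is reverted from 30 seconds back to 120 seconds, so a task outlives a
# temporary overlord outage during the integration test.

OLD_TASK_DURATION = "PT30S"   # flaky: at or below the supervisor period
NEW_TASK_DURATION = "PT120S"  # reverted value from this PR

def build_test_io_config(task_duration: str = NEW_TASK_DURATION) -> dict:
    """Return a minimal, hypothetical Kinesis supervisor ioConfig fragment."""
    return {
        "stream": "kinesis_index_test",   # illustrative stream name
        "taskDuration": task_duration,    # ISO-8601 period, as in Druid specs
        "useEarliestSequenceNumber": True,
    }

if __name__ == "__main__":
    print(build_test_io_config()["taskDuration"])  # prints PT120S
```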

@paul-rogers
Contributor

Thanks much for fixing this flaky test!

I wonder, is a time-based duration the right strategy? If the test runs on a busy system, won't the task still time out? Is there a way to determine that things are running (perhaps slowly) rather than having hung, etc.?

@AmatyaAvadhanula
Contributor Author

Hi @paul-rogers

The supervisor continues to create tasks until the integration test duration elapses (or until success). So multiple tasks may be created in both cases.
I think the issue may have been caused by the task duration previously being set to a very low value of 30s, which leaves insufficient time for supervisor management. (The overlord is killed before the supervisor gets a single run for the tasks, and by the time it is back up, the task duration has already elapsed.)
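The race described above can be sketched in a few lines. All numbers are illustrative (the overlord downtime in particular is a hypothetical figure, not measured from the IT), but they show why a 30s task duration is fragile while 120s is not:

```python
# Hedged timeline sketch of the race: if the overlord's restart window is
# longer than the task duration, the task's window elapses before the
# supervisor ever gets a chance to manage it.

OLD_TASK_DURATION_S = 30    # old, flaky value
NEW_TASK_DURATION_S = 120   # value after the revert
OVERLORD_DOWNTIME_S = 45    # hypothetical overlord restart window in the IT

def race_possible(task_duration_s: int, overlord_downtime_s: int) -> bool:
    """True if the task duration can elapse entirely while the overlord is down."""
    return overlord_downtime_s > task_duration_s

if __name__ == "__main__":
    print(race_possible(OLD_TASK_DURATION_S, OVERLORD_DOWNTIME_S))  # True
    print(race_possible(NEW_TASK_DURATION_S, OVERLORD_DOWNTIME_S))  # False
```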

@paul-rogers
Contributor

@AmatyaAvadhanula, thanks for the explanation. I think the key point is still valid: tests should not assume anything about how long things take (except perhaps to impose an overall timeout). For this test, it seems we should poll to detect completion. Do we have the API we'd need to do the polling?

@AmatyaAvadhanula
Contributor Author

AmatyaAvadhanula commented Jul 28, 2022

@paul-rogers, thank you for the feedback.

until the integration test duration elapses (or until success)

We are polling for a condition to determine success: the queried count of ingested rows matching the number of records in the stream. The test duration here is the timeout you are referring to.
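The poll-with-overall-timeout pattern described above can be sketched as follows. The function names are illustrative, not Druid's actual test helpers; the sketch only shows the structure (repeatedly check a condition, succeed early, give up at the deadline):

```python
import time

def poll_until(condition, timeout_s: float, interval_s: float = 1.0) -> bool:
    """Poll `condition` until it returns True or `timeout_s` elapses.

    Mirrors the pattern above: the test repeatedly queries the ingested row
    count and succeeds as soon as it matches the record count in the stream;
    the test duration acts as the overall timeout.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval_s)
    return condition()  # one final check at the deadline

if __name__ == "__main__":
    # Hypothetical condition: "ingestion" completes after a few polls.
    state = {"polls": 0}
    def ingested_count_matches():
        state["polls"] += 1
        return state["polls"] >= 3
    print(poll_until(ingested_count_matches, timeout_s=10, interval_s=0.01))  # True
```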

tests should not assume anything about how long things take

Reverting the ingestion task duration to the older value of 2 minutes fixes a possible "misconfiguration" in the tests' ingestion spec, which may not have been ideal for Druid. The test itself doesn't make any assumptions about the task duration.

