
Fix kinesis IT flakiness#12821

Merged
abhishekagarwal87 merged 1 commit into apache:master from AmatyaAvadhanula:feature-kinesis_flaky_tests on Aug 3, 2022

Conversation

@AmatyaAvadhanula
Contributor

Kinesis ITs run with a task duration less than or equal to the supervisor period, and this can cause flakiness.
This is fixed by reverting the task duration in the test supervisors to 120 seconds.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.
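The change described above is essentially a one-line config revert. As a hedged sketch (the field name `taskDuration` follows Druid's Kinesis supervisor spec; the stream name and helper function here are illustrative, not the actual test code):

```python
# Illustrative sketch of the fix: the Kinesis test supervisor's taskDuration
# is reverted from 30 seconds back to 120 seconds, so a task outlives a
# temporary overlord outage during the integration test.

OLD_TASK_DURATION = "PT30S"   # flaky: at or below the supervisor period
NEW_TASK_DURATION = "PT120S"  # reverted value from this PR

def build_test_io_config(task_duration: str = NEW_TASK_DURATION) -> dict:
    """Return a minimal, hypothetical Kinesis supervisor ioConfig fragment."""
    return {
        "stream": "kinesis_index_test",   # illustrative stream name
        "taskDuration": task_duration,    # ISO-8601 period, as in Druid specs
        "useEarliestSequenceNumber": True,
    }

if __name__ == "__main__":
    print(build_test_io_config()["taskDuration"])  # prints PT120S
```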

@paul-rogers
Contributor

Thanks much for fixing this flaky test!

I wonder, is a time-based duration the right strategy? If the test runs on a busy system, won't the task still time out? Is there a way to determine that things are running (perhaps slowly) rather than having hung, etc.?

@AmatyaAvadhanula
Contributor Author

Hi @paul-rogers

The supervisor continues to create tasks until the integration test duration elapses (or until success). So multiple tasks may be created in both cases.
I think the issue may have been caused by the task duration previously being set to a very low value of 30s, which leaves insufficient time for supervisor management. (The overlord is killed before the supervisor gets a single run for the tasks, and by the time it is back up, the task duration has already elapsed.)
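The race described above can be sketched in a few lines. All numbers are illustrative (the overlord downtime in particular is a hypothetical figure, not measured from the IT), but they show why a 30s task duration is fragile while 120s is not:

```python
# Hedged timeline sketch of the race: if the overlord's restart window is
# longer than the task duration, the task's window elapses before the
# supervisor ever gets a chance to manage it.

OLD_TASK_DURATION_S = 30    # old, flaky value
NEW_TASK_DURATION_S = 120   # value after the revert
OVERLORD_DOWNTIME_S = 45    # hypothetical overlord restart window in the IT

def race_possible(task_duration_s: int, overlord_downtime_s: int) -> bool:
    """True if the task duration can elapse entirely while the overlord is down."""
    return overlord_downtime_s > task_duration_s

if __name__ == "__main__":
    print(race_possible(OLD_TASK_DURATION_S, OVERLORD_DOWNTIME_S))  # True
    print(race_possible(NEW_TASK_DURATION_S, OVERLORD_DOWNTIME_S))  # False
```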

@paul-rogers
Contributor

@AmatyaAvadhanula, thanks for the explanation. I think the key point is still valid: tests should not assume anything about how long things take (except perhaps to impose an overall timeout). For this test, it seems we should poll to detect completion. Do we have the API we'd need to do the polling?

@AmatyaAvadhanula
Contributor Author

AmatyaAvadhanula commented Jul 28, 2022

@paul-rogers, thank you for the feedback.

until the integration test duration elapses (or until success)

We are polling for a condition to determine success: the queried count of ingested rows matching the number of records in the stream. The test duration here is the timeout you are referring to.
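The poll-with-overall-timeout pattern described above can be sketched as follows. The function names are illustrative, not Druid's actual test helpers; the sketch only shows the structure (repeatedly check a condition, succeed early, give up at the deadline):

```python
import time

def poll_until(condition, timeout_s: float, interval_s: float = 1.0) -> bool:
    """Poll `condition` until it returns True or `timeout_s` elapses.

    Mirrors the pattern above: the test repeatedly queries the ingested row
    count and succeeds as soon as it matches the record count in the stream;
    the test duration acts as the overall timeout.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval_s)
    return condition()  # one final check at the deadline

if __name__ == "__main__":
    # Hypothetical condition: "ingestion" completes after a few polls.
    state = {"polls": 0}
    def ingested_count_matches():
        state["polls"] += 1
        return state["polls"] >= 3
    print(poll_until(ingested_count_matches, timeout_s=10, interval_s=0.01))  # True
```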

tests should not assume anything about how long things take

Reverting the ingestion task duration to the older value of 2 minutes fixes a possible "misconfiguration" in the tests' ingestion spec, which may not have been ideal for Druid. The test itself doesn't make any assumptions about the task duration.

