Fix flaky IT: ITPerfectRollupParallelBatchIndexTest #12737

Merged
abhishekagarwal87 merged 3 commits into apache:master from kfaraz:fix_flaky_index_test
Jul 9, 2022

@kfaraz kfaraz commented Jul 4, 2022

Partially fixes #12692

Description

Due to the race conditions described in this comment,
the ITPerfectRollupParallelBatchIndexTest and possibly other ITs can exhibit flaky behaviour.
This typically happens when the middle manager/indexer cleans up the partitions of a
supervisor task while a different peon is trying to list those partitions (to account for remaining
space in a storage location).

This PR tries to eliminate this behaviour by increasing the partition timeout, so that partitions are
effectively not cleaned up during the course of the test, but only after.

This is not a permanent fix. A better fix would improve how partition cleanup is coordinated
between the peons and the middle manager.
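
To illustrate why a longer timeout keeps partitions around for the whole test, here is a minimal sketch of a time-based expiry check. It is not Druid's actual cleanup code: the class name, the `isExpired` helper, the `PT5M` comparison value, and the 10-minute test duration are all hypothetical, and it uses `java.time.Duration` purely for illustration.

```java
import java.time.Duration;
import java.time.Instant;

public class PartitionExpirySketch {

    // Hypothetical helper: partitions count as expired once they have been
    // idle for strictly longer than the configured timeout.
    static boolean isExpired(Instant lastActive, Instant now, Duration timeout) {
        return Duration.between(lastActive, now).compareTo(timeout) > 0;
    }

    public static void main(String[] args) {
        // Illustrative short timeout (an assumption, not taken from the PR).
        Duration shortTimeout = Duration.parse("PT5M");
        // Timeout value named in this PR description.
        Duration longTimeout = Duration.parse("PT30M");

        // Assume the partitions were last touched 10 minutes before the check.
        Instant lastActive = Instant.parse("2022-07-04T00:00:00Z");
        Instant now = lastActive.plus(Duration.ofMinutes(10));

        System.out.println("PT5M expired:  " + isExpired(lastActive, now, shortTimeout));
        System.out.println("PT30M expired: " + isExpired(lastActive, now, longTimeout));
    }
}
```

Under these assumptions the short timeout lets cleanup run while a peon may still be listing the partitions, whereas the longer one defers cleanup until well after the test has finished.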

Changes

  • Add druid_worker_intermediaryPartitionTimeout = PT30M in integration-test env configs
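
For readers unfamiliar with the env-config naming, the sketch below shows how an entry such as `druid_worker_intermediaryPartitionTimeout` could map to a dotted runtime property. It assumes the integration-test setup derives property names by replacing underscores with dots; the class and method names are hypothetical and this is not the project's own script.

```java
public class EnvToPropertySketch {

    // Hypothetical helper: turn an env-style key into a dotted property name,
    // assuming every underscore in the key stands for a dot.
    static String toPropertyName(String envName) {
        return envName.replace('_', '.');
    }

    public static void main(String[] args) {
        String envKey = "druid_worker_intermediaryPartitionTimeout";
        System.out.println(envKey + " -> " + toPropertyName(envKey));
    }
}
```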

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

kfaraz commented Jul 4, 2022

First build passed without incident.
Re-triggered a second build.

kfaraz commented Jul 5, 2022

First build failed on "indexing module test (jdk11)" but the failure does not seem related.
Re-triggering.

@kfaraz kfaraz closed this Jul 5, 2022
@kfaraz kfaraz reopened this Jul 5, 2022
kfaraz commented Jul 5, 2022

Some tests failed in second build, due to an unrelated download issue.

Get:1 http://security.debian.org/debian-security buster/updates InRelease [65.4 kB]
Get:2 http://security.debian.org/debian-security buster/updates/main amd64 Packages [334 kB]
Err:3 http://deb.debian.org/debian buster InRelease
  Could not connect to deb.debian.org:80 (199.232.98.132), connection timed out
Err:4 http://deb.debian.org/debian buster-updates InRelease
  Unable to connect to deb.debian.org:http:
Fetched 399 kB in 30s (13.2 kB/s)
Reading package lists...
W: Failed to fetch http://deb.debian.org/debian/dists/buster/InRelease  Could not connect to deb.debian.org:80 (199.232.98.132), connection timed out
W: Failed to fetch http://deb.debian.org/debian/dists/buster-updates/InRelease  Unable to connect to deb.debian.org:http:
W: Some index files failed to download. They have been ignored, or old ones used instead.
Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package wget
The command '/bin/sh -c APACHE_ARCHIVE_MIRROR_HOST=${APACHE_ARCHIVE_MIRROR_HOST} /root/base-setup.sh && rm -f /root/base-setup.sh' returned a non-zero code: 100
[ERROR] Command execution failed.

druid_startup_logging_logProperties=true
druid_server_https_crlPath=/tls/revocations.crl
druid_worker_capacity=10
druid_worker_intermediaryPartitionTimeout=PT60M
Contributor

Did you notice a failure in the indexer as well? I don't think it's a problem there since there is only one LocalIntermediaryDataManager instance.

Contributor Author

I was not sure about indexer. Let me try without it.

kfaraz commented Jul 6, 2022

Build failed on "indexing modules test (sql compatibility) (jdk11)"

@abhishekagarwal87
Contributor

oops. Forgot to approve before merging. It was a +1 from me. The failures are unrelated.

@abhishekagarwal87 abhishekagarwal87 added this to the 24.0.0 milestone Aug 26, 2022
@kfaraz kfaraz deleted the fix_flaky_index_test branch September 30, 2022 16:06
Development

Successfully merging this pull request may close these issues.

Flaky IT: perfect rollup parallel batch index integration test

2 participants