CI failures due to indices not being cleaned up as expected #46091
Pinging @elastic/es-core-infra
Some doc test failures that are probably the same / related:
Also #45600
Have a look at the server-side logs. Do they show something like:
If so, then it's not just the test client cleanup code that's to blame, but the same server code that customers will be running in production, and this is looking like a big worry for 7.4. See #45600 (comment)
@DaveCTurner I believe you're working on this now as well. So ping :)
Yep, I'm seeing if I can reproduce this, specifically the test failure #45956. I did see that test fail on my machine reporting a similar failure.
This failure appears to be related: https://gradle-enterprise.elastic.co/s/6q3xz77vvfo7m/console-log#L3863
Duplicate of #45605?
Another build failure, seems to be related: https://gradle-enterprise.elastic.co/s/wbwkzcamo7a4e
@original-brownbear looks like this is still causing builds to fail. Any thoughts?
@atorok the team is looking into this this morning.
In order to track down #46091: enables debug logging in REST tests for the `master` and `coordination` packages, since we suspect that the issues are caused by failed and then retried publications.
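As an aside, package-level debug logging like this can also be raised on a running cluster via the dynamic logger settings in the cluster settings API. This is a hedged sketch for reference only — the PR itself changed the REST test clusters' logging config, not cluster settings, and the exact package names here are assumptions:

```shell
# Hypothetical sketch: turn on DEBUG logging for the coordination and
# master-service packages on a running cluster. The package names below
# are assumed, not taken from the PR.
curl -X PUT "localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' -d '
{
  "transient": {
    "logger.org.elasticsearch.cluster.coordination": "DEBUG",
    "logger.org.elasticsearch.cluster.service": "DEBUG"
  }
}'
```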
Further investigation into elastic#46091, expanding on elastic#46363, to add even more detailed logging around the retry behaviour during index creation.
We have an explanation for why index creation might result in a ResourceAlreadyExistsException. We can reproduce this by waiting, after the index creation has completed, for the client to time out:
diff --git a/server/src/main/java/org/elasticsearch/cluster/metadata/MetaDataCreateIndexService.java b/server/src/main/java/org/elasticsearch/cluster/metadata/MetaDataCreateIndexService.java
index 9bfbec9..1dc3163 100644
--- a/server/src/main/java/org/elasticsearch/cluster/metadata/MetaDataCreateIndexService.java
+++ b/server/src/main/java/org/elasticsearch/cluster/metadata/MetaDataCreateIndexService.java
@@ -215,6 +215,11 @@ public class MetaDataCreateIndexService {
} else {
logger.trace("[{}] index created and shards acknowledged", request.index());
}
+ try {
+ Thread.sleep(30000);
+ } catch (InterruptedException e) {
+ throw new AssertionError(e);
+ }
listener.onResponse(new CreateIndexClusterStateUpdateResponse(response.isAcknowledged(), shardsAcknowledged));
}, listener::onFailure);
} else {

We can extend the client's timeout to work around this, and I think it would be good to stop it from retrying at all, to save us from going through the same deeply confusing investigation the next time round. However, the fundamental issue is why it is taking over 30 seconds to create an index in CI. The one that we investigated in detail took ~6 sec to write the metadata to disk when the index was first created (confirmed by …)
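The race described above can be sketched without any Elasticsearch code at all. Below is a minimal, self-contained simulation (all names are invented for illustration, and the millisecond delays stand in for the real 30-second timeout): the server-side create succeeds but outlasts the client's timeout, so the client's retry observes "already exists" even though nothing actually failed.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical simulation (not Elasticsearch code) of the timeout-then-retry
// failure mode described in this issue.
public class RetryRaceSketch {
    static final Set<String> indices = ConcurrentHashMap.newKeySet();

    // Simulated server-side index creation; 'delayMs' stands in for the slow
    // cluster-state publication seen in CI.
    static String createIndex(String name, long delayMs) {
        try {
            Thread.sleep(delayMs);
        } catch (InterruptedException e) {
            throw new AssertionError(e);
        }
        return indices.add(name) ? "created" : "resource_already_exists_exception";
    }

    public static void main(String[] args) throws Exception {
        ExecutorService server = Executors.newSingleThreadExecutor();
        // First attempt takes 200 ms server-side.
        Future<String> first = server.submit(() -> createIndex("test-idx", 200));
        String outcome;
        try {
            // Client only waits 50 ms (stand-in for the default 30 s timeout).
            outcome = first.get(50, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // The client gives up and retries, but the first attempt still
            // completes server-side, so the retry fails even though the index
            // was created successfully.
            first.get(); // let the slow first attempt finish
            outcome = createIndex("test-idx", 0);
        }
        System.out.println(outcome); // prints "resource_already_exists_exception"
        server.shutdown();
    }
}
```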
We are seeing requests take more than the default 30 s, which leads to requests being retried and returning unexpected failures such as "index already exists", because the initial requests that timed out actually worked out functionally anyway. => Double the timeout to reduce the likelihood of the failures described in elastic#46091. => As suggested in the issue, we should probably turn off retrying altogether in a follow-up.
Cache pushing happens at the end of each task's execution, so it happens as the build progresses. Because it's part of task execution, it's effectively limited by existing parallelism. With the local cache disabled, the additional IO here should be minimal: we've already built the outputs, and Gradle streams the result to the remote server. The only IO overhead is in the form of reads, and those are probably going to be bottlenecked by network IO to the remote cache anyhow. From what I can tell it's mostly write IOPS hurting tests. In other words, I suspect the addition of the remote build cache amounts to negligible IO overhead compared to the build/tests themselves. I've done experiments in the past that confirm this suspicion as well.
Another relevant failure: https://gradle-enterprise.elastic.co/s/drahj2lhx6rvu/console-log?task=:docs:integTestRunner
Not sure those are the same failure, @matriv; it's more likely something specific to that particular test. I suggest we open a different issue for it and mute it.
I'm going ahead and closing this issue to avoid confusion.
The trace logging was added for #46091. Now that it's closed we can remove it.
Older versions don't support component / composable index templates and/or data streams, yet the test base class tries to remove these objects after each test, which adds a significant number of lines to the log files (and slows the tests down). The ESRestTestCase will now check whether all nodes are on a specific version and then decide whether data streams and component / composable index templates should be deleted. Also removed the old debug log config that was enabled to investigate a build failure (elastic#46091), which has since been closed; the debug logging added many log lines to the log files. Relates to elastic#69973
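The "check whether all nodes have a specific version" gate can be sketched as follows. This is a hypothetical illustration with invented names, not the actual ESRestTestCase code:

```java
import java.util.List;

// Hypothetical sketch (names invented; not the actual ESRestTestCase code):
// only attempt cleanup of data streams / composable templates when every
// node in the cluster is on a version that supports them.
public class VersionGatedCleanup {
    // Compare a "major.minor[.patch]" version string against a required
    // minimum; the patch component, if present, is ignored.
    static boolean atLeast(String version, int major, int minor) {
        String[] parts = version.split("\\.");
        int ma = Integer.parseInt(parts[0]);
        int mi = Integer.parseInt(parts[1]);
        return ma > major || (ma == major && mi >= minor);
    }

    // Clean up only if ALL nodes meet the minimum version; a single older
    // node means the API may not exist, and issuing the DELETE would just
    // add noise to the logs.
    static boolean allNodesAtLeast(List<String> nodeVersions, int major, int minor) {
        return nodeVersions.stream().allMatch(v -> atLeast(v, major, minor));
    }
}
```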
…0361) Backport of the testing-related changes from #70314: Older versions don't support component / composable index templates and/or data streams, yet the test base class tries to remove these objects after each test, which adds a significant number of lines to the log files (and slows the tests down). The ESRestTestCase will now check whether all nodes are on a specific version and then decide whether data streams and component / composable index templates should be deleted. Also ensured that the logstash-index-template and security-index-template aren't deleted between tests: these are builtin templates that ES will install if missing, so if tests remove them ES will add them back almost immediately, causing many log lines and a lot of cluster state updates that slow tests down. Also removed the old debug log config that was enabled to investigate a build failure (#46091), which has since been closed; the debug logging added many log lines to the log files. Note this change wasn't part of #70314. Relates to #69973
…70364) Backporting #70361 to the 7.11 branch (same description as #70361 above).
…0363) Backport of #70361 to the 7.12 branch (same description as #70361 above).
There appears to be some kind of problem with the cleanup code that ensures all indices are cleaned up between tests. It's difficult to be sure, but there seems to be a trend of a wide variety of tests failing because of a ResourceAlreadyExistsException when creating an index, with no visible cause in the test itself. I noticed this in this master intake build, which is IndexLifecycleIT.testStartStopILM, a test that creates an index as one of the first operations it does; the cluster should be blank at this point. Looking at build stats for failures with messages containing resource_already_exists_exception, we can see it was very quiet prior to July 23 and has built up more and more since, suggesting this may be related to recent build changes. [Note: the graph excludes a 10-minute period on Aug. 15 which had 608 failures, to make the scale clearer.]
There is no clear relation between the tests which fail due to indices already existing that shouldn't and a specific functional area. It appears to happen most often to client and YAML tests, although not exclusively. In a few spot checks I've seen:
IndexLifecycleIT.testStartStopILM
CCSDuelIT.testPagination
test {yaml=search/240_date_nanos/date_nanos requires dates after 1970 and before 2262}
IndicesClientIT.testCloseExistingIndex
DatafeedJobsRestIT.testLookbackWithGeo
All of these appear to have failed while creating indices when the cluster should be a blank slate.
I believe this is also the cause of #45605 and #45805.