@ywangd
Member

@ywangd ywangd commented Nov 8, 2024

As the title says, the archived setting needs to be moved out so that it does not interfere with other test methods in the same test class.

Resolves: #111798
Resolves: #111777
Resolves: #111774
Resolves: #111799

@ywangd ywangd added >test Issues or PRs that are addressing/adding tests :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v9.0.0 v8.17.0 labels Nov 8, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination Meta label for Distributed Coordination team label Nov 8, 2024
@ywangd
Member Author

ywangd commented Nov 8, 2024

This PR should definitely fix the failures with the upgraded cluster, i.e. #111799 and #111798.
The failures with the old cluster (#111774, #111777) are caused by version bump fallout. I believe these are no longer issues, so this PR unmutes all of them.

@ywangd
Member Author

ywangd commented Nov 8, 2024

Oh joy. Looks like the fix did not work. Let me take another look.

@ywangd
Member Author

ywangd commented Nov 8, 2024

I pushed more changes for the fix 6eaca3c

  • It is not sufficient to just remove the archived settings at the end of the test, because other tests may run first in the upgraded cluster and fail. So I made the test for the archived setting always run first.
  • The version number fallout for 8.15 is still an issue. I think the automated commit (b8688b3) missed 8.15.4 in the csv files. I manually added it. Please let me know if this needs to be done differently.

Comment on lines 28 to 34
 public int compare(TestMethodAndParams o1, TestMethodAndParams o2) {
-    return Integer.compare(
-        o1.getTestMethod().getAnnotation(Order.class).value(),
-        o2.getTestMethod().getAnnotation(Order.class).value()
-    );
+    return Integer.compare(getOrderValue(o1.getTestMethod()), getOrderValue(o2.getTestMethod()));
 }
+
+private int getOrderValue(Method method) {
+    return method.isAnnotationPresent(Order.class) ? method.getAnnotation(Order.class).value() : Integer.MAX_VALUE;
+}
Member Author

This may be controversial. It essentially allows annotating a subset of tests with Order instead of all of them. A missing Order annotation defaults to Integer.MAX_VALUE, i.e. those tests run last. I am trying to avoid annotating every method in FullClusterRestartIT, since ordering is really needed for just one method. The rest can run in whatever order. I also considered splitting the single method into a different test class. But CoreFullClusterRestartIT extends FullClusterRestartIT, which means we would need yet another test class to extend the new test class. That seems not worth it, so I went with this approach.
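As an illustration of the ordering rule described in this comment, here is a minimal, self-contained sketch. It is not the code from the PR: the `Order` annotation, the `SampleTests` class and its method names are hypothetical stand-ins, and the tie-break on method name is an addition made only so the output is deterministic.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class OptionalOrderSketch {

    // Hypothetical stand-in for the @Order annotation discussed above.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface Order {
        int value();
    }

    // Hypothetical test class: only one method carries an explicit order.
    static class SampleTests {
        public void testSnapshotRestore() {}

        @Order(1)
        public void testArchivedSettings() {}

        public void testWatcher() {}
    }

    // Methods without @Order get Integer.MAX_VALUE, so they sort after annotated ones.
    static int getOrderValue(Method method) {
        Order order = method.getAnnotation(Order.class);
        return order != null ? order.value() : Integer.MAX_VALUE;
    }

    static List<String> orderedTestNames(Class<?> testClass) {
        List<Method> methods = new ArrayList<>();
        for (Method m : testClass.getDeclaredMethods()) {
            if (m.getName().startsWith("test")) {
                methods.add(m);
            }
        }
        // Tie-break on name (an assumption of this sketch): getDeclaredMethods()
        // returns methods in no guaranteed order.
        methods.sort(
            Comparator.<Method>comparingInt(OptionalOrderSketch::getOrderValue)
                .thenComparing(Method::getName)
        );
        List<String> names = new ArrayList<>();
        for (Method m : methods) {
            names.add(m.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(orderedTestNames(SampleTests.class));
    }
}
```

With this rule the single annotated method sorts ahead of every unannotated one, which is the property the comment relies on.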

Member Author

Never mind. I just noticed the full-cluster-restart package has its own FullClusterRestartTestOrdering annotation. So I have moved the changes into it, which should be less of an issue. The tests also need to respect the ordering determined by the old and new cluster.

@ywangd
Member Author

ywangd commented Nov 8, 2024

@DaveCTurner Now I am getting a new failure which I suspect is a legitimate issue.

The important bit of the exception is

[2024-11-08T04:54:27,351][WARN ][o.e.s.RestoreService     ] [test-cluster-0] [repo:old_snap/ZwUuFTIlQ1-qeTsp8Cehcg] failed to restore snapshot java.lang.IllegalArgumentException: illegal value can't update [cluster.routing.allocation.balance.threshold] from [1.0] to [0.999]
	at org.elasticsearch.server@9.0.0-SNAPSHOT/org.elasticsearch.common.settings.Setting$Updater.getValue(Setting.java:1304)
	at org.elasticsearch.server@9.0.0-SNAPSHOT/org.elasticsearch.common.settings.AbstractScopedSettings.validateUpdate(AbstractScopedSettings.java:139)
	at org.elasticsearch.server@9.0.0-SNAPSHOT/org.elasticsearch.snapshots.RestoreService$RestoreSnapshotStateTask.applyGlobalStateRestore(RestoreService.java:1546)
	at org.elasticsearch.server@9.0.0-SNAPSHOT/org.elasticsearch.snapshots.RestoreService$RestoreSnapshotStateTask.execute(RestoreService.java:1477)
	at org.elasticsearch.server@9.0.0-SNAPSHOT/org.elasticsearch.cluster.service.MasterService$UnbatchedExecutor.execute(MasterService.java:573)

IIUC, this means users cannot restore a snapshot taken on an old cluster if it has a value below 1.0 for the persistent setting cluster.routing.allocation.balance.threshold. While we allow index settings to be ignored during restore, the same does not seem to be available for cluster settings. The workaround is to not restore the cluster state, which may not always be acceptable, or to tamper with the on-disk snapshot metadata, which is not ideal either. We may want to reconsider whether the strict validation is viable in 9.0.
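To make the concern concrete, here is a toy model of the behavior described above. It is a sketch under stated assumptions and not Elasticsearch code: the class and method names are hypothetical, the lower bound of 1.0 is inferred only from the quoted error message, and `restoreIgnoring` models an option that, per this comment, does not appear to exist for cluster settings.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class RestoreValidationSketch {

    static final String THRESHOLD_KEY = "cluster.routing.allocation.balance.threshold";
    // Assumed lower bound, inferred only from the error message quoted above.
    static final double MIN_THRESHOLD = 1.0;

    // Toy validator: rejects a threshold below the assumed minimum.
    static void validate(String key, String value) {
        if (THRESHOLD_KEY.equals(key) && Double.parseDouble(value) < MIN_THRESHOLD) {
            throw new IllegalArgumentException("illegal value for [" + key + "]: [" + value + "]");
        }
    }

    // Strict restore: a single invalid persistent setting fails the whole operation.
    static Map<String, String> restoreStrict(Map<String, String> snapshotSettings) {
        snapshotSettings.forEach(RestoreValidationSketch::validate);
        return new LinkedHashMap<>(snapshotSettings);
    }

    // Hypothetical lenient restore: drop caller-named settings before validating,
    // analogous to ignoring index settings during restore.
    static Map<String, String> restoreIgnoring(Map<String, String> snapshotSettings, Set<String> ignored) {
        Map<String, String> kept = new LinkedHashMap<>(snapshotSettings);
        kept.keySet().removeAll(ignored);
        return restoreStrict(kept);
    }

    public static void main(String[] args) {
        Map<String, String> fromOldCluster = Map.of(THRESHOLD_KEY, "0.999");
        try {
            restoreStrict(fromOldCluster);
        } catch (IllegalArgumentException e) {
            System.out.println("strict restore failed: " + e.getMessage());
        }
        System.out.println("lenient restore kept: " + restoreIgnoring(fromOldCluster, Set.of(THRESHOLD_KEY)));
    }
}
```

In this model the strict path rejects the whole restore because of one out-of-range value, while the lenient path succeeds by discarding it.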

Contributor

@DaveCTurner DaveCTurner left a comment

Ah this is unfortunate. I think my preferred solution would be to run this specific test case in a separate Gradle project so it does not interfere with these other tests.

In the meantime, I'd suggest muting the problematic test case and unmuting the ones that were muted in error.

@ywangd
Member Author

ywangd commented Nov 10, 2024

Once testBalancedShardsAllocatorThreshold is separated out into its own test class, all tests should run successfully. So I'll leave all of them running and raise a separate issue for the strict validation.

@ywangd ywangd changed the title from "[Test] Remove archived setting at the end of test" to "[Test] Move archived setting test into its own test class" Nov 11, 2024
@ywangd
Member Author

ywangd commented Nov 11, 2024

Raised #116558 for the validation issue

@ywangd ywangd requested a review from DaveCTurner November 11, 2024 00:28
* version is started with the same data directories and then this is rerun
* with {@code tests.is_old_cluster} set to {@code false}.
*/
public class FullClusterRestartArchivedSettingsIT extends ParameterizedFullClusterRestartTestCase {
Member Author

I didn't add a subclass for this new test class in x-pack because it does not seem necessary. The archived settings do not depend on whether security is enabled. But please let me know if you think we should have one to match the coverage.

8.15.1,8512000
8.15.2,8512000
8.15.3,8512000
8.15.4,8512000
Contributor

Is this needed here?

Member Author

Yes. Otherwise the test would fail with a message similar to the following

java.lang.AssertionError: 
Expected: (<[8.14.4]> or <[8512000]> or <[8.14.0-8.14.4]>)
     but: was <[8.14.0-8.14.3]>

See #111777 for an example.
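A gap like this can be spotted before the tests run. The following sketch parses lines of the `release,id` shape shown in the diff above and reports expected releases that have no entry. It is an illustration only: the class name, method names and the list of expected releases are hypothetical, and it does not reproduce how the test itself derives the version it asserts on.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class VersionsCsvSketch {

    // Parses "release,id" lines (the shape shown in the diff) into an ordered map.
    static Map<String, Integer> parse(List<String> lines) {
        Map<String, Integer> byRelease = new LinkedHashMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split(",");
            if (parts.length == 2) {
                byRelease.put(parts[0], Integer.parseInt(parts[1]));
            }
        }
        return byRelease;
    }

    // Returns the expected releases that have no entry in the parsed file.
    static List<String> missing(Map<String, Integer> known, List<String> expectedReleases) {
        List<String> result = new ArrayList<>();
        for (String release : expectedReleases) {
            if (!known.containsKey(release)) {
                result.add(release);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // File contents before the manual fix: 8.15.4 is absent.
        List<String> csv = List.of("8.15.1,8512000", "8.15.2,8512000", "8.15.3,8512000");
        List<String> expected = List.of("8.15.1", "8.15.2", "8.15.3", "8.15.4");
        System.out.println("missing: " + missing(parse(csv), expected));
    }
}
```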

@mark-vieira Could you please take a look at this change and make sure it is OK to update it manually like this?

Contributor

I think I understand why this change needs to be made, I just don't understand why it's happening in this PR. Shouldn't this be added by the release process?

Member Author

> Shouldn't this be added by the release process?

I have the same question. That's the reason I requested @mark-vieira's review. Maybe I should have tagged @elastic/es-delivery.
Also pinging @thecoop since #111777 is assigned to you. Thank you!

Member

@thecoop thecoop Nov 12, 2024

The build automation I think is not quite working properly with the multiple release branches we have going on at the moment - there's also #114972. I need to look at this properly, but these should be added in a separate PR so we can confirm that it is actually the right thing to do at the right time (and backport it appropriately too)

Member Author

Thanks @thecoop !
I also noticed the missing versions are now added in main 85b2bab
So these changes are indeed obsolete. I have removed them with e14dcb0

Member Author

Errr, the tests now complain about 8.16.1, which is not yet in the versions file. 😿

Expected: (<[8.16.1]> or <[8518000]> or <[8.16.0]>)
     but: was <[8.16.0-8.16.1]>

I cannot really move this PR forward until the versions are fixed.

@ywangd ywangd requested a review from DaveCTurner November 12, 2024 01:27

@ywangd ywangd requested a review from DaveCTurner November 12, 2024 22:55
Contributor

@DaveCTurner DaveCTurner left a comment

LGTM2

I expect the bwc test failures relate to the recent release of 8.16.0 - they should be addressed in main (soon, if not already, I haven't checked) and then this can be merged in its current state.

@thecoop
Member

thecoop commented Nov 14, 2024

The version check failure has been resolved by #116727

@ywangd ywangd added auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) auto-backport Automatically create backport pull requests when merged labels Nov 15, 2024
@elasticsearchmachine elasticsearchmachine merged commit a7878a9 into elastic:main Nov 15, 2024
16 checks passed
@elasticsearchmachine
Collaborator

💔 Backport failed

The backport operation could not be completed due to the following error:

An unexpected error occurred when attempting to backport this PR.

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 116460

@ywangd ywangd deleted the es-111798-fix branch November 15, 2024 14:19
ywangd added a commit to ywangd/elasticsearch that referenced this pull request Nov 17, 2024
…6460)

As title says, the archived setting needs to be moved out to not
interfere with other test methods in the same test class.

Resolves: elastic#111798 Resolves: elastic#111777 Resolves: elastic#111774 Resolves: elastic#111799
(cherry picked from commit a7878a9)

# Conflicts:
#	muted-tests.yml
#	qa/full-cluster-restart/src/javaRestTest/java/org/elasticsearch/upgrades/FullClusterRestartIT.java
@ywangd
Member Author

ywangd commented Nov 17, 2024

💚 All backports created successfully

Status Branch Result
8.x

Questions ?

Please refer to the Backport tool documentation

@ywangd ywangd removed the v8.17.0 label Nov 17, 2024
salvatore-campagna pushed a commit to salvatore-campagna/elasticsearch that referenced this pull request Nov 18, 2024
alexey-ivanov-es pushed a commit to alexey-ivanov-es/elasticsearch that referenced this pull request Nov 28, 2024