@jfreden
Contributor

@jfreden jfreden commented Nov 4, 2025

This PR adds resilience to the metadata_flattened security migration, which was reported to fail on clusters where roles were modified concurrently while the migration was running. This is not expected in the normal case, but with a very large number of roles or very frequent role updates a version conflict can occur.

The change adds logic to:

  • Handle version conflicts
  • Handle shard failures
  • Handle timeouts
  • Trigger immediate retries in the framework if a failure occurs
  • Bump the number of retries

Resolves: #110532
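
As a rough illustration of the behaviour listed above (a sketch under stated assumptions, not the code in this PR): any version conflict, shard failure, or timeout marks a pass as failed, and the pass is retried immediately up to a cap. `MigrationRetrySketch`, `BatchResult`, and `runWithRetries` are hypothetical names that do not appear in the change.

```java
import java.util.function.Supplier;

// Hypothetical sketch, not the PR's implementation.
final class MigrationRetrySketch {

    /** Outcome of one migration pass (hypothetical type). */
    record BatchResult(long versionConflicts, int failedShards, boolean timedOut) {
        boolean isClean() {
            return versionConflicts == 0 && failedShards == 0 && timedOut == false;
        }
    }

    /**
     * Runs the pass until it comes back clean or maxAttempts is reached.
     * Returns the number of attempts used, or -1 if every attempt failed.
     */
    static int runWithRetries(Supplier<BatchResult> pass, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (pass.get().isClean()) {
                return attempt;
            }
            // Not clean: loop around and retry immediately.
        }
        return -1;
    }
}
```

The cap exists only so that a migration that can never succeed is not retried forever.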

@jfreden jfreden added the test-full-bwc Trigger full BWC version matrix tests label Nov 4, 2025
waitForMigrationCompletion(SecurityMigrations.CLEANUP_ROLE_MAPPING_DUPLICATES_MIGRATION_VERSION);
// First migration is on a new index, so should skip all migrations. If we reset, it should re-trigger and run all migrations
resetMigration();
// Wait for the first migration to finish
Contributor Author

This is now the first migration so we don't need this line anymore.

masterNode,
SecurityMigrations.CLEANUP_ROLE_MAPPING_DUPLICATES_MIGRATION_VERSION
);
CountDownLatch awaitMigrations = awaitMigrationVersionUpdates(masterNode, SecurityMigrations.MIGRATIONS_BY_VERSION.lastKey());
Contributor Author

The order of migrations changed; this should have been lastKey from the start.

project,
migrationsVersion
);
var persistentTaskCustomMetadata = PersistentTasksCustomMetadata.get(project.metadata());
Contributor Author

While a migration is running, its persistent task is present in cluster state; otherwise it is absent. When a persistent task completes (whether it fails or succeeds), it's removed from cluster state. We want an index state change to be triggered when a persistent task fails so that the migration is retried immediately, which is why we need this state here.
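
A minimal sketch of that decision, assuming the only inputs are whether the migration task was present in the previous and current cluster state and whether the recorded migration version advanced. Treating "task gone, version unchanged" as a failure is an assumption made for illustration, and every name below is hypothetical.

```java
// Hypothetical sketch, not the PR's implementation.
final class MigrationTaskStateSketch {

    enum Action { NONE, RETRY_NOW }

    /**
     * A task that was present and is now gone has completed. If the recorded
     * migration version did not advance, this sketch assumes it completed
     * with a failure and asks the caller to retry right away.
     */
    static Action onClusterStateChange(boolean taskWasPresent, boolean taskIsPresent,
                                       int previousVersion, int currentVersion) {
        boolean taskCompleted = taskWasPresent && taskIsPresent == false;
        boolean versionAdvanced = currentVersion > previousVersion;
        return taskCompleted && versionAdvanced == false ? Action.RETRY_NOW : Action.NONE;
    }
}
```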


- public static final Integer ROLE_METADATA_FLATTENED_MIGRATION_VERSION = 1;
  public static final Integer CLEANUP_ROLE_MAPPING_DUPLICATES_MIGRATION_VERSION = 2;
+ public static final Integer ROLE_METADATA_FLATTENED_MIGRATION_VERSION = 3;
Contributor Author

I'm bumping the version here so that this migration runs again with proper error handling. I'm also "removing" the old migration since we don't need it anymore.

if (response.getHits().getTotalHits().value() > 0) {
logger.info("Preparing to migrate [" + response.getHits().getTotalHits().value() + "] roles");
updateRolesByQuery(indexManager, client, filterQuery, listener);
if (response.isTimedOut() == false && response.getFailedShards() == 0) {
Contributor Author

Added error handling here to make sure we don't mark as migrated if this initial search fails silently for some reason.
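
The check can be sketched roughly as follows, assuming the three values are read from the search response as in the snippet above (`isTimedOut()`, `getFailedShards()`, and the total hit count). The `Decision` enum and the `decide` helper are hypothetical and only illustrate the idea that a partial result must not be treated as "nothing to migrate".

```java
// Hypothetical sketch, not the PR's implementation.
final class InitialSearchCheckSketch {

    enum Decision { MIGRATE, MARK_DONE, FAIL_AND_RETRY }

    static Decision decide(boolean timedOut, int failedShards, long totalHits) {
        if (timedOut || failedShards > 0) {
            // Partial results cannot prove there is nothing left to migrate.
            return Decision.FAIL_AND_RETRY;
        }
        // The search covered every shard, so the hit count can be trusted.
        return totalHits > 0 ? Decision.MIGRATE : Decision.MARK_DONE;
    }
}
```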

@jfreden jfreden force-pushed the add_cleanup_metadata_flattened branch from 436f987 to 37737b7 Compare November 4, 2025 15:12

public static final IndexVersion REENABLED_TIMESTAMP_DOC_VALUES_SPARSE_INDEX = def(9_042_0_00, Version.LUCENE_10_3_1);
public static final IndexVersion SKIPPERS_ENABLED_BY_DEFAULT = def(9_043_0_00, Version.LUCENE_10_3_1);
public static final IndexVersion SECURITY_MIGRATIONS_METADATA = def(9_044_0_00, Version.LUCENE_10_3_1);
Contributor Author

A new index version is needed to make sure we skip the migration for a brand-new index.
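
A sketch of how such a gate could look, assuming the skip is decided by comparing the version the index was created with against the new constant. The plain `int` id and the `shouldSkipMigrations` helper are hypothetical simplifications; the real code works with `IndexVersion` objects.

```java
// Hypothetical sketch, not the PR's implementation.
final class NewIndexSkipSketch {

    // Mirrors the id from the snippet above, treated here as a plain int.
    static final int SECURITY_MIGRATIONS_METADATA = 9_044_0_00;

    /** An index created on or after the new version has nothing to migrate (assumption). */
    static boolean shouldSkipMigrations(int indexCreatedVersionId) {
        return indexCreatedVersionId >= SECURITY_MIGRATIONS_METADATA;
    }
}
```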

@jfreden jfreden force-pushed the add_cleanup_metadata_flattened branch from 37737b7 to 682da2d Compare November 4, 2025 15:14
public static class Manager {

- private static final int MAX_SECURITY_MIGRATION_ATTEMPT_COUNT = 10;
+ private static final int MAX_SECURITY_MIGRATION_ATTEMPT_COUNT = 1000;
Contributor Author

This is a pretty significant bump because we have no idea how many times the migration would need to be retried before it's successful. In the extreme case where we have 2M roles and frequent updates, 1000 doesn't feel like a crazy number, but it's also very difficult to verify this.

There is no good reason not to allow this to be very large. The point of the limit is to make sure that security migrations are not retried forever.

@jfreden jfreden force-pushed the add_cleanup_metadata_flattened branch from 682da2d to e5f599a Compare November 4, 2025 15:25
@jfreden jfreden force-pushed the add_cleanup_metadata_flattened branch from e5f599a to 77ac083 Compare November 4, 2025 15:25
@jfreden jfreden added the :Security/Security Security issues without another label label Nov 4, 2025
@jfreden jfreden marked this pull request as ready for review November 4, 2025 15:27
@elasticsearchmachine elasticsearchmachine added the Team:Security Meta label for security team label Nov 4, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-security (Team:Security)

@jfreden
Contributor Author

jfreden commented Nov 14, 2025

Good thing the tests caught this. Setting dynamic=false on the metadata mapping without specifying any properties of the object results in that field not being indexed for search, although it is still present in the source. IIRC this was one of the issues we ran into when trying to fix this with other approaches.

That means we can't use must(QueryBuilders.existsQuery("metadata")), because the query will never match: the field is not indexed for search.

Unfortunately I don't think there is a good way around this, but it's probably acceptable to just reprocess all docs with empty metadata on every retry. WDYT @richard-dennehy?

@richard-dennehy
Contributor

That's unfortunate, but maybe if we keep an eye on the security QA project as this rolls out, we won't be caught totally unaware of any performance impact?

@jfreden
Contributor Author

jfreden commented Nov 14, 2025

Yes, it would be part of upgrading the security cluster, which would happen in QA before prod, so any problems would be caught there. Since that cluster has metadata for the roles, this logic should work for that cluster.

We can't really keep an eye on it though since we don't control that process (it's not a serverless project).

@richard-dennehy
Contributor

Would it be more accurate to say that because we have the security QA project with 3 million roles, someone inside Elastic will shout at us if the new migration accidentally unleashes demons, before this gets out to customers?

@jfreden
Contributor Author

jfreden commented Nov 14, 2025

Yes, but not sure if we should rely on that. I'll think about that a little. Might be good to check with the control-plane team.

@jfreden jfreden added the auto-backport Automatically create backport pull requests when merged label Nov 24, 2025
@jfreden jfreden merged commit c779717 into elastic:main Nov 24, 2025
35 checks passed
@elasticsearchmachine
Collaborator

💔 Backport failed

Status Branch Result
9.2 Commit could not be cherrypicked due to conflicts
8.19 Commit could not be cherrypicked due to conflicts
9.1 Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 137558

jfreden added a commit to jfreden/elasticsearch that referenced this pull request Nov 24, 2025
…lastic#137558)

* Improve security migration resilience by handling version conflicts

(cherry picked from commit c779717)

# Conflicts:
#	server/src/main/java/org/elasticsearch/index/IndexVersions.java
jfreden added a commit to jfreden/elasticsearch that referenced this pull request Nov 24, 2025
…lastic#137558)

* Improve security migration resilience by handling version conflicts

(cherry picked from commit c779717)

# Conflicts:
#	server/src/main/java/org/elasticsearch/index/IndexVersions.java
#	x-pack/plugin/security/src/main/java/org/elasticsearch/xpack/security/support/SecurityMigrations.java
@jfreden
Contributor Author

jfreden commented Nov 24, 2025

💚 All backports created successfully

Status Branch Result
9.2
9.1
8.19

Questions?

Please refer to the Backport tool documentation

jfreden added a commit to jfreden/elasticsearch that referenced this pull request Nov 24, 2025
…lastic#137558)

* Improve security migration resilience by handling version conflicts

(cherry picked from commit c779717)

# Conflicts:
#	server/src/main/java/org/elasticsearch/index/IndexVersions.java
#	x-pack/plugin/security/src/main/java/org/elasticsearch/xpack/security/support/SecurityIndexManager.java
#	x-pack/plugin/security/src/main/java/org/elasticsearch/xpack/security/support/SecurityMigrations.java
#	x-pack/plugin/security/src/test/java/org/elasticsearch/xpack/security/authc/AuthenticationServiceTests.java
#	x-pack/plugin/security/src/test/java/org/elasticsearch/xpack/security/authc/esnative/NativeRealmTests.java
#	x-pack/plugin/security/src/test/java/org/elasticsearch/xpack/security/authc/support/mapper/NativeRoleMappingStoreTests.java
#	x-pack/plugin/security/src/test/java/org/elasticsearch/xpack/security/authz/store/CompositeRolesStoreTests.java
#	x-pack/plugin/security/src/test/java/org/elasticsearch/xpack/security/authz/store/NativePrivilegeStoreTests.java
#	x-pack/plugin/security/src/test/java/org/elasticsearch/xpack/security/support/CacheInvalidatorRegistryTests.java
ncordon pushed a commit to ncordon/elasticsearch that referenced this pull request Nov 26, 2025
…lastic#137558)

* Improve security migration resilience by handling version conflicts
jfreden added a commit that referenced this pull request Nov 27, 2025
…icts (#137558) (#138476)

* Improve security migration resilience by handling version conflicts (#137558)

* Improve security migration resilience by handling version conflicts

(cherry picked from commit c779717)

# Conflicts:
#	server/src/main/java/org/elasticsearch/index/IndexVersions.java
#	x-pack/plugin/security/src/main/java/org/elasticsearch/xpack/security/support/SecurityMigrations.java

* fixup! Merge issue

* fixup! Typo

* fixup! BWC test
jfreden added a commit that referenced this pull request Nov 27, 2025
…icts (#137558) (#138475)

* Improve security migration resilience by handling version conflicts (#137558)

* Improve security migration resilience by handling version conflicts

(cherry picked from commit c779717)

# Conflicts:
#	server/src/main/java/org/elasticsearch/index/IndexVersions.java

* fixup! Index version

* fixup! Typo

* fixup! BWC test
jfreden added a commit that referenced this pull request Nov 27, 2025
…licts (#137558) (#138477)

* Improve security migration resilience by handling version conflicts (#137558)

* Improve security migration resilience by handling version conflicts

(cherry picked from commit c779717)

# Conflicts:
#	server/src/main/java/org/elasticsearch/index/IndexVersions.java
#	x-pack/plugin/security/src/main/java/org/elasticsearch/xpack/security/support/SecurityIndexManager.java
#	x-pack/plugin/security/src/main/java/org/elasticsearch/xpack/security/support/SecurityMigrations.java
#	x-pack/plugin/security/src/test/java/org/elasticsearch/xpack/security/authc/AuthenticationServiceTests.java
#	x-pack/plugin/security/src/test/java/org/elasticsearch/xpack/security/authc/esnative/NativeRealmTests.java
#	x-pack/plugin/security/src/test/java/org/elasticsearch/xpack/security/authc/support/mapper/NativeRoleMappingStoreTests.java
#	x-pack/plugin/security/src/test/java/org/elasticsearch/xpack/security/authz/store/CompositeRolesStoreTests.java
#	x-pack/plugin/security/src/test/java/org/elasticsearch/xpack/security/authz/store/NativePrivilegeStoreTests.java
#	x-pack/plugin/security/src/test/java/org/elasticsearch/xpack/security/support/CacheInvalidatorRegistryTests.java

* fixup! Merge issue

* fixup! Backported interface different

* fixup! Backport interface

* fixup! Backport...

* fixup! BWC test

Labels

auto-backport Automatically create backport pull requests when merged backport pending >enhancement :Security/Security Security issues without another label Team:Security Meta label for security team test-full-bwc Trigger full BWC version matrix tests v8.19.8 v9.1.8 v9.2.2 v9.3.0

Development

Successfully merging this pull request may close these issues.

Improve .security index migration resiliency

4 participants