-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Improve security migration resilience by handling version conflicts #137558
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve security migration resilience by handling version conflicts #137558
Conversation
| waitForMigrationCompletion(SecurityMigrations.CLEANUP_ROLE_MAPPING_DUPLICATES_MIGRATION_VERSION); | ||
| // First migration is on a new index, so should skip all migrations. If we reset, it should re-trigger and run all migrations | ||
| resetMigration(); | ||
| // Wait for the first migration to finish |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is now the first migration so we don't need this line anymore.
| masterNode, | ||
| SecurityMigrations.CLEANUP_ROLE_MAPPING_DUPLICATES_MIGRATION_VERSION | ||
| ); | ||
| CountDownLatch awaitMigrations = awaitMigrationVersionUpdates(masterNode, SecurityMigrations.MIGRATIONS_BY_VERSION.lastKey()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Order of migrations changed, this should have been lastKey from the start.
| project, | ||
| migrationsVersion | ||
| ); | ||
| var persistentTaskCustomMetadata = PersistentTasksCustomMetadata.get(project.metadata()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When a migration is running, its persistent task will be present in cluster state, when it's not it will not be present in cluster state. When a persistent task completes (failure or success) it's removed from cluster state. We want to make sure that an index state change is triggered when a persistent task fails to make sure it's retried immediately, that's why we need this state here.
|
|
||
| public static final Integer ROLE_METADATA_FLATTENED_MIGRATION_VERSION = 1; | ||
| public static final Integer CLEANUP_ROLE_MAPPING_DUPLICATES_MIGRATION_VERSION = 2; | ||
| public static final Integer ROLE_METADATA_FLATTENED_MIGRATION_VERSION = 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm bumping the version here to make sure this migration runs again with proper error handling, I'm also "removing" the old migration since we don't need it anymore.
| if (response.getHits().getTotalHits().value() > 0) { | ||
| logger.info("Preparing to migrate [" + response.getHits().getTotalHits().value() + "] roles"); | ||
| updateRolesByQuery(indexManager, client, filterQuery, listener); | ||
| if (response.isTimedOut() == false && response.getFailedShards() == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added error handling here to make sure we don't mark as migrated if this initial search fails silently for some reason.
436f987 to
37737b7
Compare
|
|
||
| public static final IndexVersion REENABLED_TIMESTAMP_DOC_VALUES_SPARSE_INDEX = def(9_042_0_00, Version.LUCENE_10_3_1); | ||
| public static final IndexVersion SKIPPERS_ENABLED_BY_DEFAULT = def(9_043_0_00, Version.LUCENE_10_3_1); | ||
| public static final IndexVersion SECURITY_MIGRATIONS_METADATA = def(9_044_0_00, Version.LUCENE_10_3_1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New index version is needed to make sure we skip migration for brand new index.
37737b7 to
682da2d
Compare
| public static class Manager { | ||
|
|
||
| private static final int MAX_SECURITY_MIGRATION_ATTEMPT_COUNT = 10; | ||
| private static final int MAX_SECURITY_MIGRATION_ATTEMPT_COUNT = 1000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a pretty significant bump because we have no idea how many times the migration would need to be retried before it's successful. In the extreme case where we have 2M roles and frequent updates 1000 doesn't feel like a crazy number, but it's also very difficult to verify this.
There is no good reason to not allow this to be very large. The point of this is to make sure that security migrations are not retried forever.
682da2d to
e5f599a
Compare
e5f599a to
77ac083
Compare
|
Pinging @elastic/es-security (Team:Security) |
|
Good thing the tests caught this. From the mapping of the metadata setting the That means we can't use Unfortunately I don't think there is a good way around this, but it's probably acceptable to just reprocess all docs with empty metadata for every retry. WDYT @richard-dennehy ? |
|
That's unfortunate, but maybe if we keep an eye on the security QA project as this rolls out, we won't be caught totally unaware of any performance impact? |
|
Yes, it would be part of upgrading the security cluster, which would happen in qa before prod so if there are any problems it would be caught there. Since that cluster has metadata for the roles, this logic should work for that cluster. We can't really keep an eye on it though since we don't control that process (it's not a serverless project). |
|
Would it be more accurate to say that because we have the security QA project with 3 million roles, someone inside Elastic will shout at us if the new migration accidentally unleashes demons, before this gets out to customers? |
|
Yes, but not sure if we should rely on that. I'll think about that a little. Might be good to check with the control-plane team. |
💔 Backport failed
You can use sqren/backport to manually backport by running |
…lastic#137558) * Improve security migration resilience by handling version conflicts (cherry picked from commit c779717) # Conflicts: # server/src/main/java/org/elasticsearch/index/IndexVersions.java
…lastic#137558) * Improve security migration resilience by handling version conflicts (cherry picked from commit c779717) # Conflicts: # server/src/main/java/org/elasticsearch/index/IndexVersions.java # x-pack/plugin/security/src/main/java/org/elasticsearch/xpack/security/support/SecurityMigrations.java
💚 All backports created successfully
Questions ?Please refer to the Backport tool documentation |
…lastic#137558) * Improve security migration resilience by handling version conflicts (cherry picked from commit c779717) # Conflicts: # server/src/main/java/org/elasticsearch/index/IndexVersions.java # x-pack/plugin/security/src/main/java/org/elasticsearch/xpack/security/support/SecurityIndexManager.java # x-pack/plugin/security/src/main/java/org/elasticsearch/xpack/security/support/SecurityMigrations.java # x-pack/plugin/security/src/test/java/org/elasticsearch/xpack/security/authc/AuthenticationServiceTests.java # x-pack/plugin/security/src/test/java/org/elasticsearch/xpack/security/authc/esnative/NativeRealmTests.java # x-pack/plugin/security/src/test/java/org/elasticsearch/xpack/security/authc/support/mapper/NativeRoleMappingStoreTests.java # x-pack/plugin/security/src/test/java/org/elasticsearch/xpack/security/authz/store/CompositeRolesStoreTests.java # x-pack/plugin/security/src/test/java/org/elasticsearch/xpack/security/authz/store/NativePrivilegeStoreTests.java # x-pack/plugin/security/src/test/java/org/elasticsearch/xpack/security/support/CacheInvalidatorRegistryTests.java
…lastic#137558) * Improve security migration resilience by handling version conflicts
…icts (#137558) (#138476) * Improve security migration resilience by handling version conflicts (#137558) * Improve security migration resilience by handling version conflicts (cherry picked from commit c779717) # Conflicts: # server/src/main/java/org/elasticsearch/index/IndexVersions.java # x-pack/plugin/security/src/main/java/org/elasticsearch/xpack/security/support/SecurityMigrations.java * fixup! Merge issue * fixup! Typo * fixup! BWC test
…icts (#137558) (#138475) * Improve security migration resilience by handling version conflicts (#137558) * Improve security migration resilience by handling version conflicts (cherry picked from commit c779717) # Conflicts: # server/src/main/java/org/elasticsearch/index/IndexVersions.java * fixup! Index version * fixup! Typo * fixup! BWC test
…licts (#137558) (#138477) * Improve security migration resilience by handling version conflicts (#137558) * Improve security migration resilience by handling version conflicts (cherry picked from commit c779717) # Conflicts: # server/src/main/java/org/elasticsearch/index/IndexVersions.java # x-pack/plugin/security/src/main/java/org/elasticsearch/xpack/security/support/SecurityIndexManager.java # x-pack/plugin/security/src/main/java/org/elasticsearch/xpack/security/support/SecurityMigrations.java # x-pack/plugin/security/src/test/java/org/elasticsearch/xpack/security/authc/AuthenticationServiceTests.java # x-pack/plugin/security/src/test/java/org/elasticsearch/xpack/security/authc/esnative/NativeRealmTests.java # x-pack/plugin/security/src/test/java/org/elasticsearch/xpack/security/authc/support/mapper/NativeRoleMappingStoreTests.java # x-pack/plugin/security/src/test/java/org/elasticsearch/xpack/security/authz/store/CompositeRolesStoreTests.java # x-pack/plugin/security/src/test/java/org/elasticsearch/xpack/security/authz/store/NativePrivilegeStoreTests.java # x-pack/plugin/security/src/test/java/org/elasticsearch/xpack/security/support/CacheInvalidatorRegistryTests.java * fixup! Merge issue * fixup! Backported interface different * fixup! Backport interface * fixup! Backport... * fixup! BWC test
This PR adds resilience to the
metadata_flattenedsecurity migration that was reported to have failed on clusters where concurrent role modifications happened while the migration was running. In the normal case this is not expected to happen, but for a very large number of roles or very frequent role updates a version conflict could occur.The change adds logic to:
Resolves: #110532