[Fleet] Use multiple saved object type for agent and package policies with opt-in migration #189387
Conversation
Overall seems to be heading in the right direction. Nice work @nchaulet
I'm not sure if this was intended to address all aspects of the discussion. But here are a few initial thoughts categorized as what I think are "requirements" vs "robustness":
Requirements
- Concurrency: from the current implementation I'm not sure I understand how Kibana will know this migration is in progress and should lock all actions against policies
- Concurrency: how will other Kibanas know this migration is in progress and lock their (HTTP) APIs?
- Idempotent: if I hit the migrate endpoint multiple times we should always end in a well-defined end state. This is a one-off endpoint, so calling it should `400` "while in progress" and "when done". Of course, if something errors we should be able to retry.
IMO it would be worth de-risking these aspects in your PoC too. There is some prior art for this kind of thing in the upgrade assistant (`x-pack/plugins/upgrade_assistant/server/lib/reindexing`).
Robustness
- Scale/speed: bulk loading objects can create a lot of memory pressure. We need to balance speed with robustness and ensure we bulk/bulkUpdate in batches for the largest sets of objects in a performant but memory-safe way; better to err on the safe side than the performant side in this case.
- Scale/concurrency: have we considered making this a task with concurrency 1? For serverless/cloud offerings this will offload work to BT nodes so UI nodes can do other UI node things.
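A rough sketch of what a single-concurrency background task could look like with Kibana Task Manager; the task type name and the `runMigrationBatch` helper are placeholders, not part of this PR, and the exact Task Manager API surface should be checked before relying on this:

```ts
// Sketch only: a Task Manager definition capped at one concurrent instance,
// so only one node runs the migration at a time. Names below are illustrative.
taskManager.registerTaskDefinitions({
  'fleet:space-awareness-migration': {
    title: 'Fleet space awareness migration',
    maxConcurrency: 1, // never run more than one migration task across the cluster
    createTaskRunner: ({ taskInstance }) => ({
      async run() {
        // copy legacy saved objects to the new space-aware types in batches
        await runMigrationBatch(taskInstance.state);
      },
      async cancel() {
        // the migration is designed to be safe to retry, so cancel is best-effort
      },
    }),
  },
});
```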
The other aspect is planning for how we want to surface this state in the UI but I guess that's coming later.
Here you mentioned that complexity is added to the querying (as expected) but I'm not sure I see this in the PoC?
Thanks @jloleysens for taking a look. We recommend a maximum of 500 agent policies, so I expect the migration to be relatively fast; I am not sure we should really block the Fleet API during the migration process, since it seems to me this will add a lot of extra complexity on every endpoint (maybe this could be covered by documenting the migration process). From the scale perspective, we could easily batch the saved object creation; do you have a recommendation on the number of saved objects we should use per batch? For the idempotency part, I am not sure there is too much of an issue: as we create saved objects with overrides during the migration, it should be idempotent. We could eventually add a pending state to avoid useless requests, and it should be easy to return a 400 if the migration is already done.
It's mostly in the
Yeah for sure, the pain of migrating data 😄
It's tricky to say. Will these migrations always be under, e.g., 5s? If "yes", do we know that in those 5s none of these objects will be updated (see complications)?
That's a good point and might be a sweet spot for this change. Build the "lock" into end users by informing them. I can see this being a less convenient UX and it feels a bit scary, but I guess it could work 🤷🏻 Automated updates to objects might still interfere.
Aiming for a flat memory pressure target like max 2MB could be a good starting point for finding the right batch size. Depends on largest expected object size. Batch size ~= pressure target in bytes / largest expected object in bytes.
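As a concrete illustration of that rule of thumb (the target and object size below are assumed numbers, not measurements):

```ts
// Sketch: derive a batch size from a flat memory-pressure target.
const MEMORY_PRESSURE_TARGET_BYTES = 2 * 1024 * 1024; // assumed ~2MB in flight at once
const LARGEST_EXPECTED_OBJECT_BYTES = 20 * 1024; // assumed worst-case serialized policy

// batchSize ~= pressure target / largest expected object, with a floor of 1
const batchSize = Math.max(
  1,
  Math.floor(MEMORY_PRESSURE_TARGET_BYTES / LARGEST_EXPECTED_OBJECT_BYTES)
); // => 102 objects per bulk call with these assumptions
```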
I'd be curious to hear what @kpollich thinks too :). By not locking, here are some complications we could see in the wild:
Unfortunately, the above complications get more complicated with multiple Kibana nodes. One other approach I thought could work, given the relatively low object volume, to contain complexity to the data layer: if opted-in, add logic to the data layer that, on read/update/delete per object:
We'll still have to update our query/read logic to work with both legacy and new indices until 9.0. In addition, we could have a BT task (concurrency 1) that gets scheduled at the moment you opt in. Basically it does these steps in bulk: tombstoning, writing new, and deleting tombstoned policies. This would be safe to retry (avoids desync) but carries the risk that we tombstone objects indefinitely if there is a bug in our migration logic (assuming retries for intermittent ES issues). [made a few edits to my proposed algo]
I think after speaking with @nchaulet I'm in favor of not introducing a locking mechanism around Fleet while this migration is pending. The overall time to migrate should be quite low (an assumption we should validate ASAP during implementation), and because we'll have a "snapshot" of the pre-migration state I don't feel the risk is very high for this migration. Thanks for summarizing the risks we might observe, @jloleysens - I'll offer some specific thoughts on these below.
Don't we have this problem even with a lock? There could be pending writes while we write the lock to ES, and those writes could complete after the migration begins, resulting in data written to the previous indices that wouldn't be picked up by the migration. We'd have to build some kind of bake period into the migration that allows any pending requests to complete before beginning the migration process. Even then, would that be perfect unless we have a way to check for pending requests? Could we simply check counts of object types before the migration, then check again afterwards and re-run the migration on the objects we "missed"? Since it's idempotent this shouldn't be overly risky, but there is a risk that we always have pending objects written while the migration is running and thus the migration never completes.
We'd write a record to ES when the migration completes that indicates the new SO types are ready to be used. If this record is never written (someone closed the lid on the laptop that runs production Kibana, pulled the plug, pod killed, etc), we'd prompt the user to retry the migration.
Yes, I think this is discussed a bit above as well, but missing deletes is more problematic than missing writes. Curious if you have any thoughts on this point specifically, @nchaulet
All I'm saying: with this approach AND without a cross-Kibana lock during migration there is risk of data loss.
With a lock you'd know there are no pending requests: you'd block them.
This is quite hacky and would not mitigate potential for data loss. Per your point "...there is risk that we always have pending": you'll never know the state of the world "now".
You still need to track that a migration is in progress (not just completed) to avoid servicing two migration requests at the same time or having other Kibanas continue to perform actions against the old SO. Idempotent seemed like the best term, but I should rather have used the term "atomic" from the user's perspective to capture my meaning.
If we store the pending state it seems it will not be too hard to add a lock to all of our routes; it will add an extra request to retrieve a saved object if the feature flag is enabled. I am going to try to implement this.
@@ -82,6 +84,31 @@ export function makeRouterWithFleetAuthz<TContext extends FleetRequestHandlerCon
      logger.info(`User does not have required fleet authz to access path: ${request.route.path}`);
      return response.forbidden();
    }

    // No need to check for EPM requests
    if (request.route.path.includes('api/fleet')) {
This is what a lock would look like; it will introduce an extra request to retrieve a saved object on every API request.
I guess now that we lock at the client level we can remove this HTTP layer lock? Since all operations go through the client, right?
@kpollich @jloleysens I pushed a few modifications to store a pending state, use a batch size, and potentially introduce a lock mechanism for the API (I still have concerns about the gain of that vs the cost of the extra query). Let me know if you think this is the direction we want to go; if it is, I will work on a more production-ready PR with tests, etc.
This approach LGTM - I think we should move forward with applying this to all our SO types. Let's continue to measure the performance impact along the way. We should also consider a UX approach for prompting users to opt into the migration. Seems like it should be something simple to build, though.
    const settings = await getSettings(appContextService.getInternalUserSOClient()).catch(
      () => {
        // TODO handle not found
      }
    );

    if (
      // Move condition to an helper
      settings?.use_space_awareness_migration_started_at &&
      new Date(settings?.use_space_awareness_migration_started_at).getTime() >
        Date.now() - 60 * 60 * 100
    ) {
      return response.conflict({
        body: {
          message: 'Space awareness migration pending',
        },
      });
    }
  }
}
I'm curious to hear what you thought about this comment:
maybe this could be covered by documenting the migration process
That's a good point and might be a sweet spot for this change. Build the "lock" into end users by informing them.
Is this something that might be acceptable? Are there automated updates that might interfere? If you can describe this as an experimental migration and ask users to back up data before starting and not perform any updates to Fleet ("lock" users), could that be good enough? It's a slight product question, but I think it could avoid [some of] the complexity @nchaulet raised.
use_space_awareness_migration_started_at ... Date.now() - 60 * 60 * 100
WDYT about tracking migration status, not just when it started? We could have `not_started | in_progress | error | complete`. Based on that we could make better decisions here?
Also could this logic live "lower" in the stack? Perhaps just before calls to SO client?
Will also avoid relying on macOS benchmarks that will not translate to serverless or other lower resource envs.
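A minimal sketch of what tracking an explicit status could look like; the field and value names follow the suggestion above and are illustrative, not what the PR ships:

```ts
// Sketch: an explicit migration status instead of only a "started at" timestamp.
type SpaceAwarenessMigrationStatus = 'not_started' | 'in_progress' | 'error' | 'complete';

interface FleetSettingsSketch {
  use_space_awareness_migration_status: SpaceAwarenessMigrationStatus;
}

// Block writes only while the migration is actively running; 'error' lets the
// user retry, and 'complete' switches reads/writes over to the new SO types.
function shouldBlockFleetWrites(settings?: FleetSettingsSketch): boolean {
  return settings?.use_space_awareness_migration_status === 'in_progress';
}
```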
Also could this logic live "lower" in the stack? Perhaps just before calls to SO client?
I'm not sure I totally understand this one; do you mean having a state that is not stored in a saved object? What does Kibana provide to do a shared state?
If you can describe this as an experimental migration and ask users to back up data before starting and not perform any updates to Fleet ("lock" users), could that be good enough? It's a slight product question, but I think it could avoid [some of] the complexity @nchaulet raised.
@nimarezainia would it be acceptable to document and ask the user not to make Fleet changes when they opt in for space awareness? (It should be a relatively fast migration, probably less than 30s, to be confirmed after more tests.)
it should be a relatively fast migration probably less than 30s
If this is the optimistic lower bound, what is the worst-case upper bound?
I'm not sure I totally understand this one; do you mean having a state that is not stored in a saved object?
The current approach will lock interactions at the HTTP API layer only. Is this the only source of updates to all SOs?
For example we have this `.create` call for making a package policy. I was thinking we could wrap `PackagePolicyClientImpl` or other such clients with something that wraps calls to create/update/etc. and blocks any/all interactions during migration at a "lower" level. This wrapper could be written once and just maintain the contract. This way we can have the same logic shared by all clients.
Then again, with your current code we are already pointing this code to write/read/delete from the new index once migration starts so non-request related writes/reads/deletes will already happen against the new index.
One concern: it's possible that if you allow partial updates, any non-blocked code may write a partial object to the new index that will 409 with any migrated object (causing some data loss because we will "lose" the object we are trying to migrate).
The current approach will lock interactions at the HTTP API layer only. Is this the only source of updates to all SOs?
No, you are right, other plugins interact with Fleet through our services; it probably makes sense to block the package policy client too 👍
Then again, with your current code we are already pointing this code to write/read/delete from the new index once migration starts so non-request related writes/reads/deletes will already happen against the new index.
Actually we only point to the new index when the migration is complete, so anything that happens during the migration would be lost; it probably makes sense to block the package policy client too.
If you can describe this as an experimental migration and ask users to back up data before starting and not perform any updates to Fleet ("lock" users), could that be good enough? It's a slight product question, but I think it could avoid [some of] the complexity @nchaulet raised.
@nimarezainia would it be acceptable to document and ask the user not to make Fleet changes when they opt in for space awareness? (It should be a relatively fast migration, probably less than 30s, to be confirmed after more tests.)
We certainly should/will document this. However I don't want to leave it to the user. Any way we could lock it down?
With these changes we are allowing a lot more multi-tenanted usage of the platform, so I am not sure word of mouth or just documentation would suffice.
I think it's more than acceptable to lock it down as it's a one time operation. Maybe a progress bar (even if it's a bit fake)
👍 I implemented a lock at the package policy service for the write operations, so API users will not be able to write package policies during the migration to multiple spaces.
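For readers following along, a minimal sketch of blocking writes at the client/service level, assuming a hypothetical `isSpaceAwarenessMigrationPending` check (the PR's actual implementation may differ):

```ts
// Sketch: wrap write methods of the package policy client so they fail fast
// while the space awareness migration is pending. All names are illustrative.
class FleetMigrationInProgressError extends Error {}

interface MigrationStateReader {
  isSpaceAwarenessMigrationPending(): Promise<boolean>;
}

function withMigrationLock<Args extends unknown[], R>(
  state: MigrationStateReader,
  writeFn: (...args: Args) => Promise<R>
): (...args: Args) => Promise<R> {
  return async (...args: Args) => {
    if (await state.isSpaceAwarenessMigrationPending()) {
      // Surfaced up to the HTTP layer, where the Fleet error handler maps it to a 4xx.
      throw new FleetMigrationInProgressError('Space awareness migration pending');
    }
    return writeFn(...args);
  };
}
```

Reads can stay unguarded since, per the discussion above, the service only points at the new saved object types once the migration is complete.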
LGTM!
Changes under OSQuery and Security Solution look good 👍
Defend Workflows review:
- Left one question, if you're sure it was intended - I'll approve :)
Thanks!
@@ -686,7 +686,7 @@ export class ManifestManager {
        },
      });

-      for await (const policies of this.fetchAllPolicies()) {
+      for await (const policies of await this.fetchAllPolicies()) {
This syntax looks strange to me, are you sure this works?
@tomsonpl Yes it works, we now have some similar syntax in Fleet code too; `fetchAllPolicies` is now async, as we dynamically retrieve the saved object type, and it resolves to an async iterator.
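For clarity, a standalone sketch of why the extra `await` is needed when an `async` function resolves to an async iterable (illustrative only, not the Fleet code):

```ts
// fetchAllPolicies() is async because it first resolves the saved object type
// dynamically; it then returns an async iterable that yields pages of policies.
async function fetchAllPolicies(): Promise<AsyncIterable<string[]>> {
  const soType = await resolveSavedObjectType(); // hypothetical async lookup
  async function* pages() {
    yield [`${soType}:policy-1`, `${soType}:policy-2`];
    yield [`${soType}:policy-3`];
  }
  return pages();
}

async function resolveSavedObjectType(): Promise<string> {
  return 'agent-policy-type'; // placeholder value for the sketch
}

async function main() {
  // A Promise is not itself async-iterable, so it must be awaited first,
  // and then `for await` consumes the iterator it resolves to.
  for await (const policies of await fetchAllPolicies()) {
    console.log(policies);
  }
}
void main();
```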
Didn't test but the code changes look good to me. 🚀
Defend Workflows LGTM 👍
Great work @nchaulet, I limited my review to the migration and left a few non-blocking comments. I did not test the code locally; let me know if you need any further help!
Overall I do still have a concern about resilience during the migration: will any/all updates that occur during migration window be handled gracefully (i.e. not crash Kibana)?
      const logs = loggerMock.collect(logger);
      expect(res.filter((p) => p.status === 'fulfilled')).toHaveLength(1);
      // It should start and complete the migration only once
Nice! TBH I was expecting 3 of the `enableSpaceAwarenessMigration` calls to error out, but if we know it's completing once, that LGTM
@@ -82,6 +80,7 @@ export function makeRouterWithFleetAuthz<TContext extends FleetRequestHandlerCon
      logger.info(`User does not have required fleet authz to access path: ${request.route.path}`);
Non-blocker: one thing I thought we could do in the router (now that you've removed the lock) is to map `FleetError('migration in progress')` to `res.badRequest(...)`.
We already handle that error in our error handler with a 400: https://github.com/elastic/kibana/blob/main/x-pack/plugins/fleet/server/errors/handlers.ts#L178-L184
We currently have a try/catch in all of our handlers; it would be great to refactor that to have it at the router level.
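A rough sketch of what hoisting the per-handler try/catch into the router could look like; the error class and response shape here are simplified placeholders, not Fleet's actual handler code:

```ts
// Sketch: central error mapping at the router level instead of a try/catch in
// every handler. Names and the response shape are illustrative.
class FleetMigrationInProgressError extends Error {}

type HandlerResult = { status: number; body: unknown };
type Handler = (...args: unknown[]) => Promise<HandlerResult>;

function withFleetErrorHandling(handler: Handler): Handler {
  return async (...args) => {
    try {
      return await handler(...args);
    } catch (error) {
      if (error instanceof FleetMigrationInProgressError) {
        return { status: 400, body: { message: error.message } };
      }
      throw error; // unknown errors keep bubbling up to the framework
    }
  };
}
```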
@jloleysens from what I know/tested nothing should crash Kibana; some API requests could fail, and if a new Kibana instance starts, the initial Fleet setup may fail, but this will not crash Kibana.
Description
Related to #188420
Add an opt-in migration to enable space awareness, and use new saved object types to get multiple-space affinity for agent and package policies.
Details
Storing feature state
The opt-in is done via the Fleet settings saved object field `use_space_awareness_migration_status`. I introduced a new helper `isSpaceAwarenessEnabled` to know if a user opted in for space awareness; that method returns true if the feature flag is set and the migration is complete.
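A minimal sketch of the helper's logic; the settings shape follows the description above, and the exact status values are an assumption:

```ts
// Sketch: space awareness counts as enabled only when the feature flag is on
// AND the opt-in migration has completed. Status values are assumed.
interface FleetSettingsSketch {
  use_space_awareness_migration_status?: 'pending' | 'success' | 'error';
}

function isSpaceAwarenessEnabled(featureFlagOn: boolean, settings?: FleetSettingsSketch): boolean {
  return featureFlagOn && settings?.use_space_awareness_migration_status === 'success';
}
```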
Saved object
I introduced a new saved object type `.fleet-agent-policies` that is space aware (`multiple`). The service that queries that saved object uses either the space-aware type or `ingest-agent-policies`, based on the settings state. I did the same for the package policies.
I also use the dynamic saved object type in the various methods that directly access the saved objects.
The kuery is rewritten to match the used object type, so both the old and the new saved object type now work.
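A sketch of the dynamic type selection described above; the constant names are illustrative and the type strings are taken as written in this description:

```ts
// Sketch: pick the saved object type at call time based on the opt-in state.
const LEGACY_AGENT_POLICY_SO_TYPE = 'ingest-agent-policies';
const SPACE_AWARE_AGENT_POLICY_SO_TYPE = '.fleet-agent-policies';

function getAgentPolicySavedObjectType(spaceAwarenessEnabled: boolean): string {
  return spaceAwarenessEnabled ? SPACE_AWARE_AGENT_POLICY_SO_TYPE : LEGACY_AGENT_POLICY_SO_TYPE;
}
// Queries and kueries are rewritten against whichever type is returned here,
// so filters keep working before and after opting in.
```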
Migration
I introduced a new API endpoint, `/internal/fleet/enable_space_awareness`. That endpoint sets `use_space_awareness_status` and copies the agent policies from `.ingest-agent-policies` to `.fleet-agent-policies`, and the same for the package policies. API access to `/api/fleet` is locked during the migration, as are the package policy write methods, to avoid users losing data.
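A condensed sketch of the flow that endpoint performs; helper names and the batch size are illustrative placeholders for the behavior described above, not the PR's actual functions (see the batching and locking discussion earlier in the thread):

```ts
// Sketch of the opt-in migration flow; all names are illustrative.
async function enableSpaceAwareness(deps: {
  setMigrationStatus: (s: 'pending' | 'success' | 'error') => Promise<void>;
  fetchLegacyPoliciesPage: (page: number, perPage: number) => Promise<object[]>;
  bulkCreateSpaceAwarePolicies: (docs: object[]) => Promise<void>;
}) {
  const BATCH_SIZE = 100; // assumed value; see the batching discussion above
  await deps.setMigrationStatus('pending'); // Fleet writes are locked while pending
  try {
    for (let page = 1; ; page++) {
      const batch = await deps.fetchLegacyPoliciesPage(page, BATCH_SIZE);
      if (batch.length === 0) break;
      // Created with overrides, so re-running the migration stays idempotent.
      await deps.bulkCreateSpaceAwarePolicies(batch);
    }
    await deps.setMigrationStatus('success'); // reads/writes switch to the new types
  } catch (e) {
    await deps.setMigrationStatus('error'); // the endpoint can be retried
    throw e;
  }
}
```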
How to test
With the feature flag `useSpaceAwareness` enabled, you can enable space awareness. Then verify that policies created in different spaces are not available in other spaces. You can also check there is no issue when not enabling space awareness.
Test sharing
You can test sharing a saved object between multiple spaces by updating the `namespaces` field of the saved object and checking it is visible in both spaces; it seems to work as expected.
Todo
- `isSpaceAwareness` helper
Not done in that PR