
[Fleet] Use multiple saved object type for agent and package policies with opt-in migration #189387

Merged
merged 49 commits into elastic:main on Aug 14, 2024

Conversation

nchaulet
Member

@nchaulet nchaulet commented Jul 29, 2024

Description

Related to #188420

Add an opt-in migration to enable space awareness, and use new saved object types that support multiple spaces for agent and package policies.

Details

Storing feature state

The opt-in state is stored in the Fleet settings saved object, in the use_space_awareness_migration_status field.

I introduced a new helper, isSpaceAwarenessEnabled, to check whether the user has opted in to space awareness. That method returns true only if the feature flag is set and the migration is complete.
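For illustration, a minimal sketch of what such a helper can look like (the settings shape and status values below are assumptions, not the exact Fleet implementation):

```typescript
interface FleetSettings {
  use_space_awareness_migration_status?: 'pending' | 'success' | 'error';
}

// Space awareness is considered enabled only when the feature flag is on AND
// the opt-in migration has completed successfully.
export function isSpaceAwarenessEnabled(
  featureFlagEnabled: boolean,
  settings: FleetSettings | undefined
): boolean {
  return (
    featureFlagEnabled &&
    settings?.use_space_awareness_migration_status === 'success'
  );
}
```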

Saved object

I introduced a new saved object type, .fleet-agent-policies, that is space aware (multiple namespaces). The service that queries that saved object uses either the space-aware type or ingest-agent-policies, depending on the settings state.

I did the same for the package policies.

I also use the dynamically resolved saved object type in the various methods that directly access the saved objects.

The kuery is rewritten to match the saved object type in use, so queries written against either the old or the new saved object type keep working.
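As a rough sketch of that dynamic type selection and kuery rewriting (constant and function names here are illustrative, not the exact Fleet code):

```typescript
const LEGACY_AGENT_POLICY_TYPE = 'ingest-agent-policies';
const SPACE_AWARE_AGENT_POLICY_TYPE = '.fleet-agent-policies';

// Pick the saved object type based on the opt-in state.
function getAgentPolicySavedObjectType(spaceAwarenessEnabled: boolean): string {
  return spaceAwarenessEnabled ? SPACE_AWARE_AGENT_POLICY_TYPE : LEGACY_AGENT_POLICY_TYPE;
}

// Rewrite a kuery written against either type so it targets the type in use,
// e.g. 'ingest-agent-policies.name:test' -> '.fleet-agent-policies.name:test'.
function rewriteKueryForType(kuery: string, targetType: string): string {
  return kuery
    .split(`${LEGACY_AGENT_POLICY_TYPE}.`).join(`${targetType}.`)
    .split(`${SPACE_AWARE_AGENT_POLICY_TYPE}.`).join(`${targetType}.`);
}
```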

Migration

I introduced a new API endpoint, /internal/fleet/enable_space_awareness.

That endpoint sets use_space_awareness_migration_status and copies the agent policies from ingest-agent-policies to .fleet-agent-policies, and does the same for the package policies.

Access to the /api/fleet API is locked during the migration, as are the package policy write methods, to avoid users losing data.
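Roughly, the opt-in flow can be pictured like this (helper names, status values, and error handling below are assumptions used for illustration, not the PR's exact code):

```typescript
type MigrationStatus = 'pending' | 'success' | 'error';

interface MigrationDeps {
  setStatus(status: MigrationStatus): Promise<void>;
  copyPolicies(fromType: string, toType: string): Promise<void>;
}

async function enableSpaceAwareness(deps: MigrationDeps): Promise<void> {
  // Mark the migration as pending so other code paths can lock writes.
  await deps.setStatus('pending');
  try {
    // Copy agent policies from the legacy type to the new space-aware type,
    // keeping the same ids; the PR applies the same pattern to package policies.
    await deps.copyPolicies('ingest-agent-policies', '.fleet-agent-policies');
    await deps.setStatus('success');
  } catch (error) {
    await deps.setStatus('error');
    throw error;
  }
}
```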

How to test

With the feature flag useSpaceAwareness enabled, you can enable space awareness:

 curl -u elastic:changeme -XPOST "http://localhost:5601/internal/fleet/enable_space_awareness" -H "kbn-xsrf: reporting" -H 'elastic-api-version: 1'

Then verify that policies created in one space are not visible in other spaces.

You can also check that everything still works when space awareness is not enabled.

Test sharing

You can test sharing a saved object across multiple spaces by updating the namespaces field of the saved object and checking that it is visible in both spaces; it seems to work as expected.

"namespaces": [
            "default",
            "test"
          ],

Todo

  • Replace uses of the feature flag with the isSpaceAwarenessEnabled helper
  • Lock package policy writes during the migration
  • API integration tests
    • Enable space awareness for all existing space tests
    • Test the migration

Not done in this PR

@nchaulet nchaulet added the release_note:skip (Skip the PR/issue when compiling release notes) and Team:Fleet (Team label for Observability Data Collection Fleet team) labels on Jul 29, 2024
@obltmachine

🤖 GitHub comments


Just comment with:

  • /oblt-deploy : Deploy a Kibana instance using the Observability test environments.
  • run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)

Contributor

@jloleysens jloleysens left a comment

Overall seems to be heading in the right direction. Nice work @nchaulet


I'm not sure if this was intended to address all aspects of the discussion. But here are a few initial thoughts categorized as what I think are "requirements" vs "robustness":

Requirements

  1. Concurrency: from the current implementation I'm not sure I understand how Kibana will know this migration is in progress and should lock all actions against policies
  2. Concurrency: how will other Kibanas know this migration is in progress and lock their (HTTP) APIs?
  3. Idempotency: if I hit the migrate endpoint multiple times, we should always end in a well-defined end state. This is a one-off endpoint, so calling it should return a 400 "while in progress" and "when done". Of course, if something errors we should be able to retry.

IMO it would be worth de-risking these aspects in your PoC too. There is some prior art for this kind of thing in the Upgrade Assistant (x-pack/plugins/upgrade_assistant/server/lib/reindexing).

Robustness

  1. Scale/speed: bulk loading objects can create a lot of memory pressure. We need to balance speed with robustness and ensure we bulk/bulkUpdate in batches for the largest sets of objects in a performant but memory-safe way; better to err on the safe side than the performant side in this case.
  2. Scale/concurrency: have we considered making this a task with concurrency 1? For serverless/cloud offerings this will offload work to BT nodes so UI nodes can do other UI node things.

The other aspect is planning for how we want to surface this state in the UI but I guess that's coming later.


Here you mentioned that complexity is added to the querying (as expected) but I'm not sure I see this in the PoC?

@nchaulet
Member Author

Thanks @jloleysens for taking a look,

We recommend a maximum of 500 agent policies, so I expect the migration to be relatively fast. I am not sure we should really block the Fleet API during the migration process; it seems to me this will add a lot of extra complexity on every endpoint (maybe this could be covered by documenting the migration process).
Do you have an idea of how many saved objects we would need to create before it becomes problematic (the migration taking more than a minute)?

From the scale perspective, we could easily batch the saved object creation. Do you have a recommendation on how many saved objects we should use per batch?

For the idempotency part, I am not sure there is much of an issue: since we create saved objects with overwrite during the migration, it should be idempotent? We could eventually add a pending state to avoid useless requests.

It should be easy to return a 400 if the migration is already done.

#188420 (comment) you mentioned that complexity is added to the querying (as expected) but I'm not sure I see this in the PoC?

It's mostly in services/agent_policy.ts that some extra complexity is added, but it's not that bad, as all the queries to the agent policy saved object are grouped in that service.

@jloleysens
Contributor

jloleysens commented Jul 31, 2024

it seems to me this will add a lot of extra complexity on every endpoint (maybe this could be covered by documenting the migration process)

Yeah for sure, the pain of migrating data 😄

Do you have some ideas how much saved object creation could be problematic (the migration taking more than a minute)?

It's tricky to say. Will these migrations always be under, e.g., 5s? If "yes", do we know that in those 5s none of these objects will be updated (see complications)?

maybe this could be covered by documenting the migration process

That's a good point and might be a sweet spot for this change. Build the "lock" into end users by informing them. I can see this being a less convenient UX and it feels a bit scary, but I guess it could work 🤷🏻. Automated updates to objects might still interfere.

do you have recommendation on the number of saved object we should use per batch?

Aiming for a flat memory pressure target like max 2MB could be a good starting point for finding the right batch size. Depends on largest expected object size. Batch size ~= pressure target in bytes / largest expected object in bytes.
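As a quick worked example of that rule of thumb (both numbers below are illustrative assumptions):

```typescript
const memoryPressureTargetBytes = 2 * 1024 * 1024; // ~2MB flat pressure target
const largestExpectedObjectBytes = 20 * 1024;      // assume ~20KB per policy document
const batchSize = Math.floor(memoryPressureTargetBytes / largestExpectedObjectBytes); // ~102 objects per batch
```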

I expect that migration to be relatively fast, so I am not sure we should really block the Fleet API during the migration process

I'd be curious to hear what @kpollich thinks too :). By not locking, here are some complications we could see in the wild:

  • a request to start migrating arrives, shortly followed by another request to create/update/delete: where will the create/update/delete show up?
  • how do we know when to start reading from the new index? i.e., when is it ready/fully synced? The legacy index essentially becomes a snapshot at this point
  • if something goes wrong, we have to retry the migration; this increases migration time and the likelihood of desync between indices, i.e. this will not be idempotent, as you may "miss" deletes

Unfortunately, above complications get more complicated with multiple Kibana nodes


One other approach I thought could work, given the relatively low object volume, to contain the complexity to the data layer:

If opted in, add logic to the data layer that does the following per object on read/update/delete (see the sketch after this list):

  1. Object from legacy?
  2. Write tombstone on legacy object (i.e., lock it, likely a new field on the old SOs)
    a. Already tombstoned and attempting update/delete? Throw error
  3. If update, perform update, if delete skip to (5)
  4. Write object to new index
  5. Delete legacy, tombstoned object
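A condensed sketch of the per-object flow above for the update case (the stores, the tombstoned field, and all names are illustrative assumptions, not Fleet code):

```typescript
interface LegacyPolicyStore {
  get(id: string): Promise<{ attributes: Record<string, unknown>; tombstoned?: boolean } | undefined>;
  markTombstoned(id: string): Promise<void>;
  delete(id: string): Promise<void>;
}

interface SpaceAwarePolicyStore {
  create(id: string, attributes: Record<string, unknown>): Promise<void>;
}

async function updatePolicy(
  id: string,
  update: Record<string, unknown>,
  legacy: LegacyPolicyStore,
  spaceAware: SpaceAwarePolicyStore
): Promise<void> {
  const legacyObject = await legacy.get(id); // (1) is the object still in the legacy type?
  if (!legacyObject) {
    // Already migrated: the update can go straight to the new space-aware type.
    return;
  }
  if (legacyObject.tombstoned) {
    // (2a) already locked by an in-flight migration of this object
    throw new Error(`Policy ${id} is locked by a pending migration`);
  }
  await legacy.markTombstoned(id); // (2) lock the legacy object
  await spaceAware.create(id, { ...legacyObject.attributes, ...update }); // (3)+(4) apply the update, write to the new type
  await legacy.delete(id); // (5) delete the tombstoned legacy object
}
```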

We'll still have to update our query/read logic to work with both legacy and new indices until 9.0.

In addition, we could have a background task (concurrency 1) that gets scheduled at the moment you opt in. It would basically do these steps in bulk: tombstoning, writing to the new index, and deleting the tombstoned policies. This would be safe to retry (avoids desync) but carries the risk that we tombstone objects indefinitely if there is a bug in our migration logic (assuming retries for intermittent ES issues).

[made a few edits to my proposed algo]

@kpollich
Member

I'd be curious to hear what @kpollich thinks too :). By not locking, here are some complications we could see in the wild:

I think after speaking with @nchaulet I'm in favor of not introducing a locking mechanism around Fleet while this migration is pending.

The overall time to migrate should be quite low (an assumption we should validate ASAP during implementation), and because we'll have a "snapshot" of the pre-migration state I don't feel the risk is very high for this migration.

Thanks for summarizing the risks we might observe, @jloleysens - I'll offer some specific thoughts on these below.

a request to start migrating, shortly after another a request to create/update/delete, where will the create/updated/delete show up?

Don't we have this problem even with a lock? There could be pending writes while we write the lock to ES, and those writes could complete after the migration begins, resulting in data written to the previous indices that wouldn't be picked up by the migration. We'd have to build some kind of grace period into the migration that allows any pending requests to complete before beginning the migration process. Even then, would that be perfect unless we had a way to check for pending requests?

Could we simply check counts of object types before the migration, then check again afterwards and re-run the migration on the objects we "missed"? Since it's idempotent this shouldn't be overly risky, but there is risk that we always have pending objects written while the migration is running and thus the migration never completes.

how do we know when to start reading from the new index? i.e., when is it ready/fully synced? The legacy index becomes essentially a snapshot at this point

We'd write a record to ES when the migration completes that indicates the new SO types are ready to be used. If this record is never written (someone closed the lid on the laptop that runs production Kibana, pulled the plug, pod killed, etc), we'd prompt the user to retry the migration.

If something goes wrong, we have to retry the migration, this increases migration time and likelihood of desync between indices, i.e. this will not be idempotent as you may "miss" deletes

Yes I think this is discussed a bit above as well, but missing deletes is more problematic than missing writes. Curious if you have any thoughts on this point specifically, @nchaulet

@jloleysens
Contributor

jloleysens commented Jul 31, 2024

I'm in favor of not introducing a locking mechanism around Fleet while this migration is pending.

All I'm saying: with this approach AND without a cross-Kibana lock during migration there is risk of data loss.

Don't we have this problem even with a lock?

With a lock you'd know there are no pending requests: you'd block them.

Could we simply check counts of object types before the migration...

This is quite hacky and would not mitigate potential for data loss. Per your point "...there is risk that we always have pending": you'll never know the state of the world "now".

We'd write a record to ES when the migration completes that indicates the new SO types are ready to be used

You still need to track that a migration is in progress (not just completed) to avoid servicing two migration requests at the same time or having other Kibanas continue to perform actions against the old SO. "Idempotent" seemed like the best term, but I should rather have used "atomic" from the user's perspective to capture my meaning.

@nchaulet
Member Author

nchaulet commented Aug 1, 2024

If we store the pending state, it seems it will not be too hard to add a lock to all of our routes; it will add an extra request to retrieve a saved object if the feature flag is enabled. I am going to try to implement this.

@nchaulet
Member Author

nchaulet commented Aug 1, 2024

I just did some local benchmarks of the migration with larger datasets: I added the package policy migration and, for every agent policy, created two system package policies.

With 500 agent policies (the max we officially recommend) the migration is really fast: ~3s.
With 10,000 agent policies it is still reasonable and takes ~15s.

@@ -82,6 +84,31 @@ export function makeRouterWithFleetAuthz<TContext extends FleetRequestHandlerCon
logger.info(`User does not have required fleet authz to access path: ${request.route.path}`);
return response.forbidden();
}

// No need to check for EPM requests
if (request.route.path.includes('api/fleet')) {
Member Author

This is what a lock would look like; it will introduce an extra request to retrieve a saved object on every API request.

Contributor

I guess now that we lock at the client level we can remove this HTTP layer lock? Since all operations go through the client, right?

@nchaulet
Member Author

nchaulet commented Aug 1, 2024

@kpollich @jloleysens I pushed a few modifications to store a pending state, use a batch size, and potentially introduce a lock mechanism for the API (I still have concerns about the gain of that vs. the cost of the extra query).

Let me know if you think this is the direction we want to go; if it is, I will work on a more production-ready PR with tests, etc.

@kpollich
Member

kpollich commented Aug 2, 2024

This approach LGTM - I think we should move forward with applying this to all our SO types. Let's continue to measure the performance impact along the way.

We should also consider a UX approach for prompting users to opt into the migration. Seems like it should be something simple to build, though.

@nchaulet
Member Author

nchaulet commented Aug 2, 2024

/ci

x-pack/plugins/fleet/server/services/agent_policy_watch.ts (review thread outdated, resolved)
Comment on lines 89 to 108
const settings = await getSettings(appContextService.getInternalUserSOClient()).catch(
() => {
// TODO handle not found
}
);

if (
// Move condition to an helper
settings?.use_space_awareness_migration_started_at &&
new Date(settings?.use_space_awareness_migration_started_at).getTime() >
Date.now() - 60 * 60 * 100
) {
return response.conflict({
body: {
message: 'Space awareness migration pending',
},
});
}
}
}
Contributor

@jloleysens jloleysens Aug 5, 2024

I'm curious to hear what you thought about this comment:

maybe this could be covered by documenting the migration process

That's a good point and might be a sweet spot for this change. Build the "lock" into end users by informing them.

Is this something that might be acceptable? Are there automated updates that might interfere? If you can describe this as an experimental migration and ask users to back up data before starting and not perform any updates to Fleet ("lock" users), could that be good enough? It's a slight product question, but I think it could avoid [some of] the complexity @nchaulet raised.


use_space_awareness_migration_started_at ... Date.now() - 60 * 60 * 100

WDYT about tracking migration status, not just when it started? We could have not_started | in_progress | error | complete. Based on that we could make better decisions here?
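For example, a small sketch of that status-based tracking (the values are the ones suggested in this comment, not necessarily what the PR ends up shipping):

```typescript
type SpaceAwarenessMigrationStatus = 'not_started' | 'in_progress' | 'error' | 'complete';

// Only allow starting a run when nothing has started yet or a previous run failed.
function canStartMigration(status: SpaceAwarenessMigrationStatus): boolean {
  return status === 'not_started' || status === 'error';
}
```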

Also could this logic live "lower" in the stack? Perhaps just before calls to SO client?

That will also avoid relying on macOS benchmarks that will not translate to serverless or other lower-resource environments.

Member Author

Also could this logic live "lower" in the stack? Perhaps just before calls to SO client?

I'm not sure I totally understand this one. Do you mean having a state that is not stored in a saved object? What does Kibana provide for shared state?

Member Author

If you can describe this as an experimental migration and ask users to backup data before starting and not perform any updates to fleet ("lock" users), could that be a good enough? It's a slight product q, but I think it could avoid [some of] the complexity @nchaulet raised.

@nimarezainia Would it be acceptable to document this and ask the user not to make Fleet changes when they opt in to space awareness? (It should be a relatively fast migration, probably less than 30s, to be confirmed after more tests.)

Contributor

@jloleysens jloleysens Aug 6, 2024

it should be a relatively fast migration probably less than 30s

If this is the optimistic lower bound, what is the worst-case upper bound?

Not sure to totally understand this one, do you mean having a state that is not stored in a saved object?

The current approach will lock interactions at the HTTP API layer only. Is this the only source of updates to all SOs?

For example, we have this .create call for making a package policy. I was thinking we could wrap PackagePolicyClientImpl or other such clients with something that intercepts calls to create/update/etc. and blocks any/all interactions during the migration at a "lower" level. This wrapper could be written once and just maintain the contract. This way we can have the same logic shared by all clients.
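A minimal sketch of that wrapper idea, assuming a generic client and a hypothetical isMigrationPending check (the PR ultimately guards the package policy service's write methods directly instead):

```typescript
function withMigrationWriteLock<T extends object>(
  client: T,
  writeMethods: ReadonlyArray<keyof T>,
  isMigrationPending: () => Promise<boolean>
): T {
  return new Proxy(client, {
    get(target, prop, receiver) {
      const original = Reflect.get(target, prop, receiver);
      if (typeof original !== 'function' || !writeMethods.includes(prop as keyof T)) {
        return original;
      }
      // Intercept write calls: check the migration state before delegating.
      return async (...args: unknown[]) => {
        if (await isMigrationPending()) {
          throw new Error('Space awareness migration pending: write operations are locked');
        }
        return original.apply(target, args);
      };
    },
  });
}
```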

Then again, with your current code we are already pointing reads/writes/deletes to the new index once the migration starts, so non-request-related writes/reads/deletes will already happen against the new index.

One concern: if you allow partial updates, it's possible that non-blocked code writes a partial object to the new index that will 409 with a migrated object (causing some data loss, because we will "lose" the object we are trying to migrate).

Member Author

The current approach will lock interactions at the HTTP API layer only. Is this the only source of updates to all SOs?

No, you are right: other plugins interact with Fleet through our services, so it probably makes sense to block the package policy client too 👍

Then again, with your current code we are already pointing this code to write/read/delete from the new index once migration starts so non-request related writes/reads/deletes will already happen against the new index.

Actually, we only point to the new index when the migration is complete, so anything that happens during the migration would be lost; it probably makes sense to block the package policy client too.

Contributor

If you can describe this as an experimental migration and ask users to backup data before starting and not perform any updates to fleet ("lock" users), could that be a good enough? It's a slight product q, but I think it could avoid [some of] the complexity @nchaulet raised.

@nimarezainia Will it be acceptable to document and ask the user to not do Fleet change when they opt-in for space awareness, (it should be a relatively fast migration probably less than 30s, to confirm after more tests)

We certainly should/will document this. However I don't want to leave it to the user. Any way we could lock it down?
With these changes we are allowing a lot more multi-tenanted usage of the platform, so I am not sure word of mouth or just documentation would suffice.

I think it's more than acceptable to lock it down, as it's a one-time operation. Maybe a progress bar (even if it's a bit fake).

Member Author

👍 I implemented a lock at the package policy service for the write operations, so API users will not be able to write package policies during the migration to multiple spaces.

@nchaulet
Member Author

@elasticmachine merge upstream

@kc13greiner kc13greiner self-requested a review August 12, 2024 13:17
Contributor

@kc13greiner kc13greiner left a comment

LGTM!

Contributor

@paul-tavares paul-tavares left a comment

Changes under OSQuery and Security Solution look good 👍

Contributor

@tomsonpl tomsonpl left a comment

Defend Workflows review:

  • Left one question; if you're sure it was intended, I'll approve :) Thanks!

@@ -686,7 +686,7 @@ export class ManifestManager {
},
});

-    for await (const policies of this.fetchAllPolicies()) {
+    for await (const policies of await this.fetchAllPolicies()) {
Contributor

This syntax looks strange to me, are you sure this works?

Member Author

@nchaulet nchaulet Aug 13, 2024

@tomsonpl Yes, it works; we now have some similar syntax in Fleet code too. fetchAllPolicies is now async because we dynamically retrieve the saved object type, and it resolves to an async iterator.
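For anyone unfamiliar with the pattern, here is a self-contained illustration (not Fleet code) of an async function that resolves to an async iterable, consumed with `for await (... of await ...)`:

```typescript
async function* paginate(pages: string[][]): AsyncGenerator<string[]> {
  for (const page of pages) {
    yield page;
  }
}

// The saved object type is resolved asynchronously first, so the function is
// async and resolves to an async iterator, mirroring fetchAllPolicies above.
async function fetchAllPoliciesExample(): Promise<AsyncIterable<string[]>> {
  return paginate([['policy-1', 'policy-2'], ['policy-3']]);
}

async function run(): Promise<void> {
  for await (const policies of await fetchAllPoliciesExample()) {
    console.log(policies);
  }
}
```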

Member

@ashokaditya ashokaditya left a comment

Didn't test but the code changes look good to me. 🚀

Contributor

@tomsonpl tomsonpl left a comment

Defend Workflows LGTM 👍

Contributor

@jloleysens jloleysens left a comment

Great work @nchaulet, I limited my review to the migration and left a few non-blocking comments. I did not test the code locally; let me know if you need any further help!


Overall I do still have a concern about resilience during the migration: will any/all updates that occur during the migration window be handled gracefully (i.e. not crash Kibana)?


const logs = loggerMock.collect(logger);
expect(res.filter((p) => p.status === 'fulfilled')).toHaveLength(1);
// It should start and complete the migration only once
Contributor

Nice! TBH I was expecting 3 of the enableSpaceAwarenessMigration calls to error out, but if we know it only completes once, that LGTM.

@@ -82,6 +80,7 @@ export function makeRouterWithFleetAuthz<TContext extends FleetRequestHandlerCon
logger.info(`User does not have required fleet authz to access path: ${request.route.path}`);
Contributor

Non-blocker: one thing I thought we could do in the router (now that you've removed the lock) is to map FleetError('migration in progress') to res.badRequest(...).

Member Author

We already handle that error in our error handler with a 400: https://github.com/elastic/kibana/blob/main/x-pack/plugins/fleet/server/errors/handlers.ts#L178-L184
We currently have a try/catch in each of our handlers; it would be great to refactor that so it lives at the router level.

@nchaulet
Member Author

Overall I do still have a concern about resilience during the migration: will any/all updates that occur during migration window be handled gracefully (i.e. not crash Kibana)?

@jloleysens From what I know/tested, nothing should crash Kibana. Some API requests could fail, and if a new Kibana instance starts, the initial Fleet setup may fail, but this will not crash Kibana.

@kibana-ci
Collaborator

💛 Build succeeded, but was flaky

Failed CI Steps

Metrics [docs]

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

id before after diff
fleet 1235 1236 +1

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id before after diff
fleet 1.8MB 1.8MB +90.0B

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

id before after diff
fleet 169.0KB 169.3KB +276.0B
Unknown metric groups

API count

id before after diff
fleet 1358 1359 +1

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@nchaulet nchaulet merged commit 708fcd2 into elastic:main Aug 14, 2024
43 checks passed
@nchaulet nchaulet deleted the poc-agent-policy-so-multiple-optin branch August 14, 2024 21:22
@kibanamachine kibanamachine added the v8.16.0 and backport:skip (This commit does not require backporting) labels on Aug 14, 2024