[Fleet] Prevent concurrent runs of Fleet setup #183636
Conversation
Pinging @elastic/fleet (Team:Fleet)
```ts
  return await awaitIfPending(async () => createSetupSideEffects(soClient, esClient));
} catch (error) {
  apm.captureError(error);
  t.setOutcome('failure');
  throw error;
} finally {
  t.end();
  try {
    await settingsService.saveSettings(soClient, {
      fleet_setup_status: 'completed',
```
Added writing out the `completed` status in the `finally` branch to make sure it is written out in case of any errors. Leaving this flag `in_progress` would prevent any subsequent Fleet setup calls from running. We might want to have an escape hatch if this happens, e.g. a `force` flag to be able to run Fleet setup if the status is stuck in `in_progress`.
Adding a `force` flag makes sense for edge cases and troubleshooting, imo.
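The escape-hatch idea above could be sketched roughly like this. This is a minimal illustration, not Kibana code: `shouldRunSetup` and the `Settings` shape are hypothetical names for the guard discussed in the comments.

```typescript
// Hypothetical sketch of a `force` escape hatch: when force is set, skip the
// in_progress guard so a stuck status can't block Fleet setup forever.
interface Settings {
  fleet_setup_status?: 'in_progress' | 'completed';
}

function shouldRunSetup(settings: Settings | undefined, force = false): boolean {
  if (force) return true; // escape hatch for a status stuck in in_progress
  // run setup unless another instance has already marked it in_progress
  return !(settings && settings.fleet_setup_status === 'in_progress');
}

console.log(shouldRunSetup({ fleet_setup_status: 'in_progress' })); // false
console.log(shouldRunSetup({ fleet_setup_status: 'in_progress' }, true)); // true
console.log(shouldRunSetup(undefined)); // true
```

The flag only bypasses the guard; it does not clean up the stale status, which setup itself would rewrite on completion.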
```ts
// check if fleet setup is already started
const settings = await settingsService.getSettingsOrUndefined(soClient);

if (settings && settings.fleet_setup_status === 'in_progress') {
```
I think there is still a possibility for a race condition, no? I am wondering if we should introduce a real distributed lock for the setup. Not sure how easy it will be to implement this with a saved object; maybe something like this with no retry on conflict. Happy to discuss this more:
```ts
if (!settings.fleet_setup_status) {
  const setupId = uuid();
  await settingsService.saveSettings(soClient, {
    fleet_setup_status: 'in_progress',
    fleet_setup_id: setupId, // to check who got the lock
    fleet_setup_started_at: Date.now(), // to get a TTL for the lock
  }, {
    retryOnConflict: false
  });
}
```
+1, what are the exact concurrency semantics here? Is this actually atomic? A distributed lock would work if you had a perfect one, but those can also run into problems around retries depending on exactly how it works.
A better solution would be to make it so that concurrent attempts to create the policy don't matter by making the action idempotent. What if the requests to create the policy went through a queue as a way of serializing the action and getting a consistent result each time?
What if each instance of Kibana writes to its own copy of the same object with a unique name, and we pick the first one as the one that gets used, or an arbitrary one, assuming they are all the same?
There might be other options as well.
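The idempotency option above can be sketched with a deterministic document id: concurrent creation attempts either succeed once or hit a conflict that can be treated as success. This is an illustrative model only; `PolicyRepo`, `ensurePolicy`, and the id scheme are made-up stand-ins for the saved objects client, not Fleet APIs.

```typescript
// Sketch of idempotent policy creation: derive the same id from the policy
// name on every instance, create with "no overwrite" semantics, and treat a
// conflict as "another instance already created it".
class AlreadyExists extends Error {}

class PolicyRepo {
  private docs = new Map<string, { name: string }>();

  // create() refuses to overwrite, like SO create with overwrite: false
  create(id: string, doc: { name: string }): void {
    if (this.docs.has(id)) throw new AlreadyExists(id);
    this.docs.set(id, doc);
  }

  count(): number {
    return this.docs.size;
  }
}

function ensurePolicy(repo: PolicyRepo, name: string): void {
  const id = `preconfigured-${name}`; // deterministic, not random per attempt
  try {
    repo.create(id, { name });
  } catch (e) {
    if (!(e instanceof AlreadyExists)) throw e; // conflict means work is done
  }
}

const repo = new PolicyRepo();
ensurePolicy(repo, 'cloud-policy');
ensurePolicy(repo, 'cloud-policy'); // simulated concurrent duplicate attempt
console.log(repo.count()); // 1 — creation is idempotent, no duplicates
```

With this shape, concurrent setup runs don't need coordination at all for the create step; only mutating updates (like revision bumps) still need conflict handling.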
My thinking with the in progress flag was to avoid the unnecessary work of each kibana instance creating the preconfigured policies. We are using a similar approach in package install, where the package is set to `installing` status and can't be installed concurrently until completed.
Something like selecting a leader kibana instance could work, though I'm not sure how easy it is to implement.
Using a queue would need a separate component that executes it, wouldn't it?
Using a unique policy id for each attempt would potentially generate a lot of documents, as Fleet setup runs every time Kibana restarts or the Fleet UI is visited. We are also relying on a specific name for the cloud policy.
I think a less intrusive change would be to catch version conflict errors and check the existence of the policy before retrying/rolling back.
Also, the way we are bumping the revision field now is not atomic: it is read and updated separately. We could improve it with an update painless script, to make sure two instances are not overwriting the same revision doc.
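The lost-update hazard described above can be shown with a small model. This is a sketch, not Elasticsearch client code: `PolicyStore` stands in for the document store, and `writeIfSeq` models compare-and-set semantics (the effect of an update with no conflict retry, or an atomic script bump), using a sequence number like ES's `_seq_no`.

```typescript
// Two instances read revision 1 concurrently; with blind read-modify-write
// both write 2 and one bump is lost. With compare-and-set, the second
// writer gets a version conflict instead of silently overwriting.
class VersionConflict extends Error {}

class PolicyStore {
  revision = 1;
  private seq = 0; // internal document version, like ES _seq_no

  read(): { revision: number; seq: number } {
    return { revision: this.revision, seq: this.seq };
  }

  // naive write: blindly overwrites, as a separate read + update does
  writeBlind(revision: number): void {
    this.revision = revision;
    this.seq++;
  }

  // compare-and-set: fails if the doc changed since it was read
  writeIfSeq(revision: number, expectedSeq: number): void {
    if (this.seq !== expectedSeq) throw new VersionConflict('doc changed');
    this.revision = revision;
    this.seq++;
  }
}

const blind = new PolicyStore();
const a = blind.read();
const b = blind.read();
blind.writeBlind(a.revision + 1);
blind.writeBlind(b.revision + 1);
console.log(blind.revision); // 2 — two bumps happened but one was lost

const cas = new PolicyStore();
const c = cas.read();
const d = cas.read();
cas.writeIfSeq(c.revision + 1, c.seq);
let conflicted = false;
try {
  cas.writeIfSeq(d.revision + 1, d.seq);
} catch (e) {
  conflicted = e instanceof VersionConflict;
}
console.log(cas.revision, conflicted); // 2 true — the lost update is detected
```

A painless script bump (`ctx._source.revision += 1`) goes one step further: the increment happens server-side, so neither instance needs to re-read and retry.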
@juliaElastic If we want to avoid concurrent calls, could we catch the version conflict when updating the settings object (and use that document as the lock)?

> Using a queue would need a separate component that executes it, wouldn't it?

We could eventually use the task manager for that; that is the available queue in Kibana. But it will add some delay to bumping the policy, not sure if that's acceptable for us. It could also solve a few scalability issues when bumping agent policy revisions.
```ts
// Retry Fleet setup w/ backoff
await backOff(
  async () => {
    await setupFleet(
      new SavedObjectsClient(core.savedObjects.createInternalRepository()),
      core.elasticsearch.client.asInternalUser,
      { useLock: true }
```
Using this option to only take the lock when setup is called from the plugin start logic, not from the API.
```ts
/**
 * Verifies that multiple Kibana instances running in parallel will not create duplicate preconfiguration objects.
 */
describe.skip('Fleet setup preconfiguration with multiple instances Kibana', () => {
```
I've used these tests locally to start ES and run 2 Kibana instances as separate processes. I've tried to run it as a CI task, but didn't see any output; not sure how to check the result of processes running in the background.
Alternatively, we can use multiple zones in the scheduled scale tests to make sure the bug doesn't surface again (the test would fail if fleet-server doesn't come online) - this is done in the daily 10k and daily 50vm runs, which use 3 zones.
Can we check the log files of these Kibana instances maybe?
Another issue is that the jest integration step is picking up these tests if not skipped, and they fail because the `fleet_setup.test.ts` test requires ES to be started separately in `es.test.ts`. I tried moving the files to another folder, but then the `jest_integration` script doesn't find them, and it doesn't seem easy to add a new script.
A few nitpick comments, but otherwise LGTM 🚀
Closes elastic/ingest-dev#3346

- [x] Unit and integration tests are created or updated
- [x] Turn down info logging

The linked issue seems to be caused by multiple Kibana instances running Fleet setup at the same time, trying to create the preconfigured cloud policy concurrently. In case of failures, the agent policy is left with a revision that has no inputs, which prevents fleet-server from starting properly.

See the concurrent errors in the logs: https://platform-logging.kb.us-west2.gcp.elastic-cloud.com/app/r/s/tUpMP

This fix introduces a `fleet-setup-lock` SO type, which Fleet setup uses to create a document as a lock; the document is deleted when setup completes. Concurrent calls to Fleet setup will return early if this document exists.

To verify: run the test `./run_fleet_setup_parallel.sh` from local kibana, and verify from the generated logs that only one instance ran Fleet setup.

---------

Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
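The lock pattern the PR describes (create a `fleet-setup-lock` document, return early if it already exists, delete it when done) can be sketched in isolation. This is an illustrative model, not the merged implementation: `FleetSetupLockStore`, `ConflictError`, and `tryAcquireLock` are made-up names, and the in-memory map stands in for the saved objects client.

```typescript
// Sketch of the fleet-setup-lock flow: acquire by creating a doc with a
// fixed id; a conflict means another instance holds the lock and the
// caller returns early; the winner deletes the doc after setup completes.
class ConflictError extends Error {}

class FleetSetupLockStore {
  private docs = new Map<string, { setupId: string; startedAt: number }>();

  // create() fails on an existing id instead of overwriting, like SO create
  create(id: string, doc: { setupId: string; startedAt: number }): void {
    if (this.docs.has(id)) throw new ConflictError(`[${id}]: conflict`);
    this.docs.set(id, doc);
  }

  delete(id: string): void {
    this.docs.delete(id);
  }
}

function tryAcquireLock(store: FleetSetupLockStore, setupId: string): boolean {
  try {
    store.create('fleet-setup-lock', { setupId, startedAt: Date.now() });
    return true; // this instance won the lock and should run setup
  } catch (e) {
    if (e instanceof ConflictError) return false; // another instance holds it
    throw e;
  }
}

const store = new FleetSetupLockStore();
console.log(tryAcquireLock(store, 'instance-a')); // true
console.log(tryAcquireLock(store, 'instance-b')); // false — returns early
store.delete('fleet-setup-lock'); // released in a finally block after setup
console.log(tryAcquireLock(store, 'instance-c')); // true — lock is free again
```

The `startedAt` field is what would back a TTL, so a crashed instance that never deletes the doc doesn't block setup forever, as discussed in the review thread.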