
[Fleet] ActionId for Tag removal completes but the tag is still in the list of tags #144161

Closed
pjbertels opened this issue Oct 20, 2022 · 27 comments · Fixed by #147594
Labels: Project:FleetScaling, Team:Fleet (Team label for Observability Data Collection Fleet team)

Comments

@pjbertels

image_tag:8.5.0-792499b4-SNAPSHOT

During a 50k run where we add and remove tags, we found that the ActionId for the tag removal completes but the tag is still in the list of tags, and this causes the test to fail.

```
process_bulk_response
{'actionId': '8ba23269-b8f3-4f6f-9869-b6fe68f83886'}
got actiodId
[12:53:16] INFO     Remove tag: test_tag_d54827c0-508d-11ed-8ccc-39cdadc9db68                   test_perf02.py:493
[12:53:16]          UPDATE_TAGS COMPLETE 0/100000/100000/0                                      perf_lib.py:32
                    UPDATE_TAGS COMPLETE 0/100000/100000/0 0.01 min                             perf_lib.py:32
[12:53:17] INFO     Tags: [AgentTag(name='0.2.276-6196b58'), AgentTag(name='test_tag_d54827c0-508d-11ed-8ccc-39cdadc9db68'), AgentTag(name='Hit4brWvjkiUNAFwMedLrD'),  test_perf02.py:498
                    AgentTag(name='jVpauSgi5csmZ3L3GsG38g')]
           INFO     Looking for "online" greaterThanOrEqualTo 50000                             perf_lib.py:169
[12:53:17]          "online" 50000/50000 (current/desired) 0.01 min                             perf_lib.py:32
```

@pjbertels pjbertels added the Team:Fleet and Project:FleetScaling labels on Oct 20, 2022
@juliaElastic
Contributor

juliaElastic commented Oct 24, 2022

@pjbertels Could you give some more info about the order of events here? Which tags were added/removed?
We have a known issue with multiple tags being removed quickly; I am wondering if this is the same root cause.

@pjbertels
Author

We add the tag test_tag_d54827c0-508d-11ed-8ccc-39cdadc9db68 (test_tag_) to all the agents. When the actionId completes, we verify that the tag is in the list of tags. We then remove the tag, wait for the actionId to complete, and check the list of tags to confirm it has been removed.

@pjbertels
Author

An example from a 50K run on 8.5.0-bdb8ff4d. https://apm-ci.elastic.co/job/perf/job/observability-perf-mbp/job/main/191/consoleText

@juliaElastic
Contributor

juliaElastic commented Oct 28, 2022

@pjbertels I couldn't reproduce this so far. Could you share the admin link to this cluster and the password to log in to Kibana? I would like to check if there are any errors in the Kibana logs.

One thing that could be happening is that the ack count calculation in /action_status is not accurate, and the action is not yet finished when you check the tags. Could we try adding a wait to the test to see if the tags are removed after some time (e.g. a few seconds)?
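For reference, a minimal sketch (not the perf test's actual code) of how a caller could poll Fleet's action status before asserting on the tag list; the `/api/fleet/agents/action_status` path and the `items`/`status` field names are assumptions based on the endpoint mentioned above:

```ts
// Poll the Fleet action status until the given action reports COMPLETE,
// after which it should be safe to assert on the agents' tag lists.
// Endpoint path and response shape are assumptions for illustration.
async function waitForActionComplete(
  kibanaUrl: string,
  authHeader: string,
  actionId: string,
  timeoutMs = 60_000
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const res = await fetch(`${kibanaUrl}/api/fleet/agents/action_status`, {
      headers: { Authorization: authHeader, 'kbn-xsrf': 'true' },
    });
    const body = await res.json();
    const action = (body.items ?? []).find((a: any) => a.actionId === actionId);
    if (action?.status === 'COMPLETE') return;
    await new Promise((r) => setTimeout(r, 5_000)); // wait a few seconds between checks
  }
  throw new Error(`Action ${actionId} did not complete within ${timeoutMs} ms`);
}
```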

I found an issue locally where the ack count was not correct (it showed less than the real count) and submitted a PR for that.

@juliaElastic juliaElastic transferred this issue from elastic/fleet-server Oct 28, 2022
@juliaElastic
Contributor

juliaElastic commented Nov 3, 2022

@pjbertels Could you try to reproduce again with the latest snapshot that includes the linked fix? If my theory is correct, the issue shouldn't be happening again.

@kpollich kpollich changed the title ActionId for Tag removal completes but the tag is still in the list of tags [Fleet] ActionId for Tag removal completes but the tag is still in the list of tags Nov 3, 2022
@pjbertels
Author

pjbertels commented Nov 8, 2022

Will retest. Based on some checking... I think 8.5.1 is where we want to pick this up.

@jlind23
Contributor

jlind23 commented Nov 8, 2022

As discussed, this should rather be tested on 8.6.0.

@jlind23
Contributor

jlind23 commented Nov 10, 2022

Closing this for now as fixed. Will reopen if it occurs again.

@jlind23 jlind23 closed this as completed Nov 10, 2022
@ablnk

ablnk commented Dec 6, 2022

I retested it with 8.6.0-8cf9e954; the issue is still reproducible.

@ablnk ablnk reopened this Dec 6, 2022
@juliaElastic
Contributor

juliaElastic commented Dec 7, 2022

I could reproduce this on a cloud instance with horde. I found one problem where the retry task keeps retrying the action even after the 3rd retry has failed (Kibana Task Manager retries the task after 5m if it is not removed or does not throw an error).
I think this might be the root cause, though I haven't been able to reproduce it locally with mock agent documents.
I'll test the fix on cloud.

juliaElastic added a commit that referenced this issue Dec 8, 2022
## Summary

Related to #144161

Found that on a bulk update tags task failure, the task didn't stop after 3 retries (it should be over in less than a minute); the retries kept happening for 2 hours. This change removes the retry task once 3 retries are reached.

Also testing in a cloud deployment to see if the tags error can be reproduced with this fix. I could reproduce the reported error locally and saw it go away with this fix.

To verify:
- Add at least 50k agents with the `create_agents` script in the kibana repo.
- Open Kibana, select the 50k agents, and open Actions / Add tags.
- Within a few seconds, add 2 new tags and remove one of them.
- Wait about 30s; the agents should reflect the changes.
- Check the logs to see that the tasks are removed after the 3rd retry is reached or the action succeeds.
- Check that there are no more running tasks. Any running task can be found in Kibana Console by running this query: `GET .kibana_task_manager/_search?q=task.taskType:"fleet:update_agent_tags:retry"`

Locally simulated an error to test that the retry (and check) task is removed:

```
[2022-12-07T15:52:16.415+01:00][ERROR][plugins.fleet] Retry #3 of task fleet:update_agent_tags:retry:848984ab-c11d-4ebe-8d1f-606143dd656b failed: failing task
[2022-12-07T15:52:16.416+01:00][WARN ][plugins.fleet] Stopping after 3rd retry. Error: failing task
[2022-12-07T15:52:16.416+01:00][INFO ][plugins.fleet] Removing task fleet:update_agent_tags:retry:check:848984ab-c11d-4ebe-8d1f-606143dd656b
[2022-12-07T15:52:16.416+01:00][INFO ][plugins.fleet] Removing task fleet:update_agent_tags:retry:848984ab-c11d-4ebe-8d1f-606143dd656b
```
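For context, a minimal sketch of the retry-cap behavior shown in those logs; the helper names and task id layout are assumptions for illustration, and only `taskManager.remove()` reflects Kibana Task Manager's task removal call:

```ts
// Cap the bulk-action retries at 3 and remove the Task Manager tasks afterwards,
// so Task Manager does not keep rescheduling the retry every 5 minutes.
const MAX_RETRIES = 3;

interface TaskManagerLike {
  remove(taskId: string): Promise<void>;
}

async function runRetryTask(
  taskManager: TaskManagerLike,
  logger: { info(msg: string): void; warn(msg: string): void },
  retryTaskId: string,
  checkTaskId: string,
  retryCount: number,
  runBulkUpdateTags: () => Promise<void> // hypothetical bulk-action callback
): Promise<void> {
  try {
    await runBulkUpdateTags();
  } catch (error: any) {
    if (retryCount >= MAX_RETRIES) {
      logger.warn(`Stopping after 3rd retry. Error: ${error.message}`);
      logger.info(`Removing task ${checkTaskId}`);
      await taskManager.remove(checkTaskId);
      logger.info(`Removing task ${retryTaskId}`);
      await taskManager.remove(retryTaskId);
      return; // stop here instead of throwing, so no further retry is scheduled
    }
    throw error; // earlier retries: let Task Manager schedule the next attempt
  }
}
```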
kibanamachine pushed a commit to kibanamachine/kibana that referenced this issue Dec 8, 2022
(cherry picked from commit 431c32b)
@juliaElastic
Contributor

Merged a bugfix, though I can still reproduce the issue when I add/remove multiple tags quickly; it happens less frequently now. Will test more to see if I can fix the remaining issue.

kibanamachine added a commit that referenced this issue Dec 8, 2022
# Backport

This will backport the following commits from `main` to `8.6`:
- [[Fleet] cancel tasks when 3rd retry failed (#147190)](#147190)


### Questions?
Please refer to the [Backport tool documentation](https://github.com/sqren/backport)

<!--BACKPORT [{"author":{"name":"Julia
Bardi","email":"90178898+juliaElastic@users.noreply.github.com"},"sourceCommit":{"committedDate":"2022-12-08T08:14:33Z","message":"[Fleet]
cancel tasks when 3rd retry failed (#147190)\n\n##
Summary\r\n\r\nRelated to
#144161 that on a
bulk update tags task failure, the task didn't stop\r\nafter 3 retries
(should be over in less then a minute), the retries kept\r\nhappening
for 2 hours.\r\nThis change removes the retry task if 3 retries are
reached.\r\n\r\nAlso testing in cloud deployment to see if the tags
error can be\r\nreproduced with this fix.\r\nI could reproduce the
reported error locally, and seeing it goes away\r\nwith this
fix.\r\n\r\nTo verify:\r\n- Add at least 50k agents with the
`create_agents` script in kibana repo\r\n- open Kibana, select the 50k
agents, and open Actions / Add tags\r\n- Try this in a few seconds: add
2 new tags, and remove one of them\r\n- Wait about 30s, the agents
should reflect the changes\r\n- Check the logs to see that the tasks are
removed after 3rd retry is\r\nreached or successful.\r\n- Check that
there are no more running tasks. Any running task can be\r\nfound in
Kibana Console by running this query:
`GET\r\n.kibana_task_manager/_search?q=task.taskType:\"fleet:update_agent_tags:retry\"`\r\n\r\nLocally
simulated an error to test that the retry (and check) task
is\r\nremoved:\r\n\r\n```\r\n[2022-12-07T15:52:16.415+01:00][ERROR][plugins.fleet]
Retry #3 of task
fleet:update_agent_tags:retry:848984ab-c11d-4ebe-8d1f-606143dd656b
failed: failing task\r\n[2022-12-07T15:52:16.416+01:00][WARN
][plugins.fleet] Stopping after 3rd retry. Error: failing
task\r\n[2022-12-07T15:52:16.416+01:00][INFO ][plugins.fleet] Removing
task
fleet:update_agent_tags:retry:check:848984ab-c11d-4ebe-8d1f-606143dd656b\r\n[2022-12-07T15:52:16.416+01:00][INFO
][plugins.fleet] Removing task
fleet:update_agent_tags:retry:848984ab-c11d-4ebe-8d1f-606143dd656b\r\n```","sha":"431c32b894077fc5910380252086442083734fce","branchLabelMapping":{"^v8.7.0$":"main","^v(\\d+).(\\d+).\\d+$":"$1.$2"}},"sourcePullRequest":{"labels":["release_note:skip","Team:Fleet","v8.7.0","v8.6.1"],"number":147190,"url":"#147190
cancel tasks when 3rd retry failed (#147190)\n\n##
Summary\r\n\r\nRelated to
#144161 that on a
bulk update tags task failure, the task didn't stop\r\nafter 3 retries
(should be over in less then a minute), the retries kept\r\nhappening
for 2 hours.\r\nThis change removes the retry task if 3 retries are
reached.\r\n\r\nAlso testing in cloud deployment to see if the tags
error can be\r\nreproduced with this fix.\r\nI could reproduce the
reported error locally, and seeing it goes away\r\nwith this
fix.\r\n\r\nTo verify:\r\n- Add at least 50k agents with the
`create_agents` script in kibana repo\r\n- open Kibana, select the 50k
agents, and open Actions / Add tags\r\n- Try this in a few seconds: add
2 new tags, and remove one of them\r\n- Wait about 30s, the agents
should reflect the changes\r\n- Check the logs to see that the tasks are
removed after 3rd retry is\r\nreached or successful.\r\n- Check that
there are no more running tasks. Any running task can be\r\nfound in
Kibana Console by running this query:
`GET\r\n.kibana_task_manager/_search?q=task.taskType:\"fleet:update_agent_tags:retry\"`\r\n\r\nLocally
simulated an error to test that the retry (and check) task
is\r\nremoved:\r\n\r\n```\r\n[2022-12-07T15:52:16.415+01:00][ERROR][plugins.fleet]
Retry #3 of task
fleet:update_agent_tags:retry:848984ab-c11d-4ebe-8d1f-606143dd656b
failed: failing task\r\n[2022-12-07T15:52:16.416+01:00][WARN
][plugins.fleet] Stopping after 3rd retry. Error: failing
task\r\n[2022-12-07T15:52:16.416+01:00][INFO ][plugins.fleet] Removing
task
fleet:update_agent_tags:retry:check:848984ab-c11d-4ebe-8d1f-606143dd656b\r\n[2022-12-07T15:52:16.416+01:00][INFO
][plugins.fleet] Removing task
fleet:update_agent_tags:retry:848984ab-c11d-4ebe-8d1f-606143dd656b\r\n```","sha":"431c32b894077fc5910380252086442083734fce"}},"sourceBranch":"main","suggestedTargetBranches":["8.6"],"targetPullRequestStates":[{"branch":"main","label":"v8.7.0","labelRegex":"^v8.7.0$","isSourceBranch":true,"state":"MERGED","url":"#147190
cancel tasks when 3rd retry failed (#147190)\n\n##
Summary\r\n\r\nRelated to
#144161 that on a
bulk update tags task failure, the task didn't stop\r\nafter 3 retries
(should be over in less then a minute), the retries kept\r\nhappening
for 2 hours.\r\nThis change removes the retry task if 3 retries are
reached.\r\n\r\nAlso testing in cloud deployment to see if the tags
error can be\r\nreproduced with this fix.\r\nI could reproduce the
reported error locally, and seeing it goes away\r\nwith this
fix.\r\n\r\nTo verify:\r\n- Add at least 50k agents with the
`create_agents` script in kibana repo\r\n- open Kibana, select the 50k
agents, and open Actions / Add tags\r\n- Try this in a few seconds: add
2 new tags, and remove one of them\r\n- Wait about 30s, the agents
should reflect the changes\r\n- Check the logs to see that the tasks are
removed after 3rd retry is\r\nreached or successful.\r\n- Check that
there are no more running tasks. Any running task can be\r\nfound in
Kibana Console by running this query:
`GET\r\n.kibana_task_manager/_search?q=task.taskType:\"fleet:update_agent_tags:retry\"`\r\n\r\nLocally
simulated an error to test that the retry (and check) task
is\r\nremoved:\r\n\r\n```\r\n[2022-12-07T15:52:16.415+01:00][ERROR][plugins.fleet]
Retry #3 of task
fleet:update_agent_tags:retry:848984ab-c11d-4ebe-8d1f-606143dd656b
failed: failing task\r\n[2022-12-07T15:52:16.416+01:00][WARN
][plugins.fleet] Stopping after 3rd retry. Error: failing
task\r\n[2022-12-07T15:52:16.416+01:00][INFO ][plugins.fleet] Removing
task
fleet:update_agent_tags:retry:check:848984ab-c11d-4ebe-8d1f-606143dd656b\r\n[2022-12-07T15:52:16.416+01:00][INFO
][plugins.fleet] Removing task
fleet:update_agent_tags:retry:848984ab-c11d-4ebe-8d1f-606143dd656b\r\n```","sha":"431c32b894077fc5910380252086442083734fce"}},{"branch":"8.6","label":"v8.6.1","labelRegex":"^v(\\d+).(\\d+).\\d+$","isSourceBranch":false,"state":"NOT_CREATED"}]}]
BACKPORT-->

Co-authored-by: Julia Bardi <90178898+juliaElastic@users.noreply.github.com>
@ablnk

ablnk commented Dec 12, 2022

Retested this against the most recent RC build of 8.6.0 (2207fc20). Adding or removing tags on 75k agents took a very long time (>2 hours), so I'm keeping this ticket open. @juliaElastic

@juliaElastic
Contributor

The fix is not included in this build (BC6); we have to wait for BC7 or check the latest SNAPSHOT build.

@juliaElastic
Contributor

juliaElastic commented Dec 13, 2022

As discussed with @ablnk, the issue is still reproducible on the latest snapshot build.
I checked the logs of the test cluster and noticed that the update tags action fails with version conflicts, even if there is only one action running.
I think this could be because the agent documents are continuously updated by checkin, so there is a high chance of hitting a conflict in a large batch like 75k at any given time.
During development I tested with dummy agent documents (as I can't enroll 50-75k agents locally), so I didn't have any checkin updates; that's why I didn't come across this issue. When I quickly updated two agent tags locally, the retries eventually completed.

I think we need a different implementation here. Previously I changed the logic to abort on version conflict, because that fixed the concurrent update tags scenario.
If we go back to continuing the update on errors, we have to add retry logic to update the remaining agent documents. This would reduce the chance of conflicts (as we update fewer documents on each retry), though in theory this can still result in conflicts after 3 retries.

Do we want to go down this path? I think there are plans to change how the .fleet-agents index is updated in the long term. cc @joshdover @jlind23

https://admin.found.no/deployments/828e5ad0562547d1bf29647c925b35fd/kibana

@jlind23
Contributor

jlind23 commented Dec 13, 2022

I defer to Josh on the long-term plan here, but I believe a retry mechanism might solve most of our issues for now without having to change the entire logic we rely on today.

@pjbertels
Author

Just an FYI: I'm seeing this in the latest 8.6.0 BC (8.6.0-75d87829), on both add and remove with 5000 agents. The issue is that I never get back actionIds on add or remove.

@juliaElastic
Contributor

juliaElastic commented Dec 16, 2022

I am working on a fix here that solves the version conflict errors that Andrei reported a few days ago on 75k agents.
5000 agents is interesting, as that should finish synchronously with the API call; I'll check it.

@pjbertels can you share the admin link/Kibana logs where you experienced this issue? It might be the same root cause as Andrei reported.

EDIT: I could reproduce the issue with 5k agents; there can be conflicts, and the logic currently doesn't retry on <10k. I can change this to retry update tags even for smaller agent counts.

juliaElastic added a commit that referenced this issue Dec 20, 2022
## Summary

Fixes #144161

As discussed [here](#144161 (comment)), the existing implementation of update tags doesn't work well with real agents, as there are many conflicts with checkin, even when trying to add/remove one tag.
Refactored the logic to make retries more efficient:
- Instead of aborting the whole bulk action on conflicts, changed the conflict strategy to 'proceed'. This means that if an action on 50k agents has 1k conflicts, not all 50k are retried, only the 1k conflicts; this makes it less likely to conflict on retry (see the sketch after this list).
- Because of this, on retry we have to know which agents don't yet have the tag added/removed. For this, added an additional filter to the `updateByQuery` request. The filter is only added if there is exactly one `tagsToAdd` or one `tagsToRemove`. This is the main use case from the UI, and handling other cases would complicate the logic more (each additional tag to add/remove would result in another OR query, which would match more agents, making conflicts more likely).
- Added this additional query on the initial request as well (not only on retries) to save unnecessary work, e.g. if the user tries to add a tag on 50k agents but 48k already have it, it is enough to update the remaining 2k agents.
- This improvement has the effect that 'Agent activity' shows the real updated agent count, not the total selected. I think this is not really a problem for update tags.
- Cleaned up some of the UI logic, because the conflicts are now fully handled on the backend.
- Locally I couldn't reproduce the conflict with agent checkins, even with 1k horde agents. I'll try to test in cloud with more real agents.
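As referenced in the first bullet, a minimal sketch of the retry-friendly update-by-query idea using the Elasticsearch JS client; the painless script, field names, and `kueryFilter` parameter are illustrative assumptions, not the PR's actual code:

```ts
import { Client } from '@elastic/elasticsearch';

// Add one tag to the selected agents: proceed on version conflicts and skip
// agents that already carry the tag, so a retry only touches the leftovers.
async function addTagToAgents(es: Client, kueryFilter: any, tagToAdd: string) {
  const res = await es.updateByQuery({
    index: '.fleet-agents',
    conflicts: 'proceed', // report version_conflicts instead of aborting the whole action
    query: {
      bool: {
        filter: [kueryFilter],
        must_not: [{ term: { tags: tagToAdd } }], // only agents that don't have the tag yet
      },
    },
    script: {
      lang: 'painless',
      source: 'if (!ctx._source.tags.contains(params.tag)) { ctx._source.tags.add(params.tag) }',
      params: { tag: tagToAdd },
    },
    refresh: true,
  });
  // res.version_conflicts tells us how many agents still need a retry.
  return res;
}
```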

To verify:
- Enroll 50k agents (I used 50k with the create_agents script, and 1k with horde). Enroll 50k with horde if possible.
- Select all in the UI and try to add/remove one or more tags.
- Expect the changes to propagate quickly (up to 1m). It might take a few refreshes to see the result on the agent list and tags list, because the UI polls the agents every 30s. It is expected that the tags list temporarily shows incorrect data because the action is async.

E.g. removed the `test3` tag and added the `add` tag quickly:
<img width="1776" alt="image" src="https://user-images.githubusercontent.com/90178898/207824481-411f0f70-d7e8-42a6-b73f-ed80e77b7700.png">
<img width="422" alt="image" src="https://user-images.githubusercontent.com/90178898/207824550-582d43fc-87db-45e1-ba58-15915447fefd.png">

The logs show how many `version_conflicts` there were, and the count decreased with retries.

```
[2022-12-15T10:32:12.937+01:00][INFO ][plugins.fleet] Running action asynchronously, actionId: 90acd54-19ac-4738-b3d3-db32789233de, total agents: 52000
[2022-12-15T10:32:12.981+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:check:90acd541-19ac-4738-b3d3-db32789233de
[2022-12-15T10:32:16.477+01:00][INFO ][plugins.fleet] Running action asynchronously, actionId: 29e9da70-7194-4e52-8004-2c1b19f6dfd5, total agents: 52000
[2022-12-15T10:32:16.537+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:check:29e9da70-7194-4e52-8004-2c1b19f6dfd5
[2022-12-15T10:32:22.893+01:00][DEBUG][plugins.fleet] {"took":9886,"timed_out":false,"total":52000,"updated":41143,"deleted":0,"batches":52,"version_conflicts":10857,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1,"throttled_until_millis":0,"failures":[]}
[2022-12-15T10:32:26.066+01:00][DEBUG][plugins.fleet] {"took":9518,"timed_out":false,"total":52000,"updated":25755,"deleted":0,"batches":52,"version_conflicts":26245,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1,"throttled_until_millis":0,"failures":[]}
[2022-12-15T10:32:27.401+01:00][ERROR][plugins.fleet] Action failed: version conflict of 10857 agents
[2022-12-15T10:32:27.461+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de
[2022-12-15T10:32:27.462+01:00][INFO ][plugins.fleet] Retrying in task: fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de
[2022-12-15T10:32:29.274+01:00][ERROR][plugins.fleet] Action failed: version conflict of 26245 agents
[2022-12-15T10:32:29.353+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:29e9da70-7194-4e52-8004-2c1b19f6dfd5
[2022-12-15T10:32:29.353+01:00][INFO ][plugins.fleet] Retrying in task: fleet:update_agent_tags:retry:29e9da70-7194-4e52-8004-2c1b19f6dfd5
[2022-12-15T10:32:31.480+01:00][INFO ][plugins.fleet] Running bulk action retry task
[2022-12-15T10:32:31.481+01:00][DEBUG][plugins.fleet] Retry #1 of task fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de
[2022-12-15T10:32:31.481+01:00][INFO ][plugins.fleet] Running action asynchronously, actionId: 90acd54-19ac-4738-b3d3-db32789233de, total agents: 52000
[2022-12-15T10:32:31.481+01:00][INFO ][plugins.fleet] Completed bulk action retry task
[2022-12-15T10:32:31.485+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:check:90acd541-19ac-4738-b3d3-db32789233de
[2022-12-15T10:32:33.841+01:00][DEBUG][plugins.fleet] {"took":2347,"timed_out":false,"total":10857,"updated":9857,"deleted":0,"batches":11,"version_conflicts":1000,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1,"throttled_until_millis":0,"failures":[]}
[2022-12-15T10:32:34.556+01:00][INFO ][plugins.fleet] Running bulk action retry task
[2022-12-15T10:32:34.557+01:00][DEBUG][plugins.fleet] Retry #1 of task fleet:update_agent_tags:retry:29e9da70-7194-4e52-8004-2c1b19f6dfd5
[2022-12-15T10:32:34.557+01:00][INFO ][plugins.fleet] Running action asynchronously, actionId: 29e9da70-7194-4e52-8004-2c1b19f6dfd5, total agents: 52000
[2022-12-15T10:32:34.557+01:00][INFO ][plugins.fleet] Completed bulk action retry task
[2022-12-15T10:32:34.560+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:check:29e9da70-7194-4e52-8004-2c1b19f6dfd5
[2022-12-15T10:32:35.388+01:00][ERROR][plugins.fleet] Retry #1 of task fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de failed: version conflict of 1000 agents
[2022-12-15T10:32:35.468+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de
[2022-12-15T10:32:35.468+01:00][INFO ][plugins.fleet] Retrying in task: fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de
{"took":5509,"timed_out":false,"total":26245,"updated":26245,"deleted":0,"batches":27,"version_conflicts":0,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1,"throttled_until_millis":0,"failures":[]}
[2022-12-15T10:32:42.722+01:00][INFO ][plugins.fleet] processed 26245 agents, took 5509ms
[2022-12-15T10:32:42.723+01:00][INFO ][plugins.fleet] Removing task fleet:update_agent_tags:retry:check:29e9da70-7194-4e52-8004-2c1b19f6dfd5
[2022-12-15T10:32:46.705+01:00][INFO ][plugins.fleet] Running bulk action retry task
[2022-12-15T10:32:46.706+01:00][DEBUG][plugins.fleet] Retry #2 of task fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de
[2022-12-15T10:32:46.707+01:00][INFO ][plugins.fleet] Running action asynchronously, actionId: 90acd54-19ac-4738-b3d3-db32789233de, total agents: 52000
[2022-12-15T10:32:46.707+01:00][INFO ][plugins.fleet] Completed bulk action retry task
[2022-12-15T10:32:46.711+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:check:90acd541-19ac-4738-b3d3-db32789233de
[2022-12-15T10:32:47.099+01:00][DEBUG][plugins.fleet] {"took":379,"timed_out":false,"total":1000,"updated":1000,"deleted":0,"batches":1,"version_conflicts":0,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1,"throttled_until_millis":0,"failures":[]}
[2022-12-15T10:32:47.623+01:00][INFO ][plugins.fleet] processed 1000 agents, took 379ms
[2022-12-15T10:32:47.623+01:00][INFO ][plugins.fleet] Removing task fleet:update_agent_tags:retry:check:90acd541-19ac-4738-b3d3-db32789233de
```

### Checklist

- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
kibanamachine pushed a commit to kibanamachine/kibana that referenced this issue Dec 20, 2022
(cherry picked from commit 687987a)
@juliaElastic
Contributor

Merged the latest fix and backported it to 8.6. I think it will be available in the BC on Dec 27, as we missed today's build: https://github.com/elastic/dev/issues/2162

kibanamachine added a commit that referenced this issue Dec 20, 2022
# Backport

This will backport the following commits from `main` to `8.6`:
- [[Fleet] refactored bulk update tags retry (#147594)](#147594)


### Questions?
Please refer to the [Backport tool documentation](https://github.com/sqren/backport)

<!--BACKPORT [{"author":{"name":"Julia
Bardi","email":"90178898+juliaElastic@users.noreply.github.com"},"sourceCommit":{"committedDate":"2022-12-20T09:36:36Z","message":"[Fleet]
refactored bulk update tags retry (#147594)\n\n## Summary\r\n\r\nFixes
#144161
discussed\r\n[here](#144161 (comment)
existing implementation of update tags doesn't work well with
real\r\nagents, as there are many conflicts with checkin, even when
trying to\r\nadd/remove one tag.\r\nRefactored the logic to make retries
more efficient:\r\n- Instead of aborting the whole bulk action on
conflicts, changed the\r\nconflict strategy to 'proceed'. This means, if
an action of 50k agents\r\nhas 1k conflicts, not all 50k is retried, but
only the 1k conflicts,\r\nthis makes it less likely to conflict on
retry.\r\n- Because of this, on retry we have to know which agents don't
yet have\r\nthe tag added/removed. For this, added an additional filter
to the\r\n`updateByQuery` request. Only adding the filter if there is
exactly one\r\n`tagsToAdd` or one `tagsToRemove`. This is the main use
case from the\r\nUI, and handling other cases would complicate the logic
more (each\r\nadditional tag to add/remove would result in another OR
query, which\r\nwould match more agents, making conflicts more
likely).\r\n- Added this additional query on the initial request as well
(not only\r\nretries) to save on unnecessary work e.g. if the user tries
to add a tag\r\non 50k agents, but 48k already have it, it is enough to
update the\r\nremaining 2k agents.\r\n- This improvement has the effect
that 'Agent activity' shows the real\r\nupdated agent count, not the
total selected. I think this is not really\r\na problem for update
tags.\r\n- Cleaned up some of the UI logic, because the conflicts are
fully\r\nhandled now on the backend.\r\n- Locally I couldn't reproduce
the conflict with agent checkins, even\r\nwith 1k horde agents. I'll try
to test in cloud with more real agents.\r\n\r\nTo verify:\r\n- Enroll
50k agents (I used 50k with create_agents script, and 1k with\r\nhorde).
Enroll 50k with horde if possible.\r\n- Select all on UI and try to
add/remove one or more tags\r\n- Expect the changes to propagate quickly
(up to 1m). It might take a\r\nfew refreshes to see the result on agent
list and tags list, because the\r\nUI polls the agents every 30s. It is
expected that the tags list\r\ntemporarily shows incorrect data because
the action is async.\r\n\r\nE.g. removed `test3` tag and added `add` tag
quickly:\r\n<img width=\"1776\"
alt=\"image\"\r\nsrc=\"https://user-images.githubusercontent.com/90178898/207824481-411f0f70-d7e8-42a6-b73f-ed80e77b7700.png\">\r\n<img
width=\"422\"
alt=\"image\"\r\nsrc=\"https://user-images.githubusercontent.com/90178898/207824550-582d43fc-87db-45e1-ba58-15915447fefd.png\">\r\n\r\nThe
logs show the details of how many `version_conflicts` were there,\r\nand
it decreased with
retries.\r\n\r\n```\r\n[2022-12-15T10:32:12.937+01:00][INFO
][plugins.fleet] Running action asynchronously, actionId:
90acd54-19ac-4738-b3d3-db32789233de, total agents:
52000\r\n[2022-12-15T10:32:12.981+01:00][INFO ][plugins.fleet]
Scheduling task
fleet:update_agent_tags:retry:check:90acd541-19ac-4738-b3d3-db32789233de\r\n[2022-12-15T10:32:16.477+01:00][INFO
][plugins.fleet] Running action asynchronously, actionId:
29e9da70-7194-4e52-8004-2c1b19f6dfd5, total agents:
52000\r\n[2022-12-15T10:32:16.537+01:00][INFO ][plugins.fleet]
Scheduling task
fleet:update_agent_tags:retry:check:29e9da70-7194-4e52-8004-2c1b19f6dfd5\r\n[2022-12-15T10:32:22.893+01:00][DEBUG][plugins.fleet]
{\"took\":9886,\"timed_out\":false,\"total\":52000,\"updated\":41143,\"deleted\":0,\"batches\":52,\"version_conflicts\":10857,\"noops\":0,\"retries\":{\"bulk\":0,\"search\":0},\"throttled_millis\":0,\"requests_per_second\":-1,\"throttled_until_millis\":0,\"failures\":[]}\r\n[2022-12-15T10:32:26.066+01:00][DEBUG][plugins.fleet]
{\"took\":9518,\"timed_out\":false,\"total\":52000,\"updated\":25755,\"deleted\":0,\"batches\":52,\"version_conflicts\":26245,\"noops\":0,\"retries\":{\"bulk\":0,\"search\":0},\"throttled_millis\":0,\"requests_per_second\":-1,\"throttled_until_millis\":0,\"failures\":[]}\r\n[2022-12-15T10:32:27.401+01:00][ERROR][plugins.fleet]
Action failed: version conflict of 10857
agents\r\n[2022-12-15T10:32:27.461+01:00][INFO ][plugins.fleet]
Scheduling task
fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de\r\n[2022-12-15T10:32:27.462+01:00][INFO
][plugins.fleet] Retrying in task:
fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de\r\n[2022-12-15T10:32:29.274+01:00][ERROR][plugins.fleet]
Action failed: version conflict of 26245
agents\r\n[2022-12-15T10:32:29.353+01:00][INFO ][plugins.fleet]
Scheduling task
fleet:update_agent_tags:retry:29e9da70-7194-4e52-8004-2c1b19f6dfd5\r\n[2022-12-15T10:32:29.353+01:00][INFO
][plugins.fleet] Retrying in task:
fleet:update_agent_tags:retry:29e9da70-7194-4e52-8004-2c1b19f6dfd5\r\n[2022-12-15T10:32:31.480+01:00][INFO
][plugins.fleet] Running bulk action retry
task\r\n[2022-12-15T10:32:31.481+01:00][DEBUG][plugins.fleet] Retry #1
of task
fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de\r\n[2022-12-15T10:32:31.481+01:00][INFO
][plugins.fleet] Running action asynchronously, actionId:
90acd54-19ac-4738-b3d3-db32789233de, total agents:
52000\r\n[2022-12-15T10:32:31.481+01:00][INFO ][plugins.fleet] Completed
bulk action retry task\r\n[2022-12-15T10:32:31.485+01:00][INFO
][plugins.fleet] Scheduling task
fleet:update_agent_tags:retry:check:90acd541-19ac-4738-b3d3-db32789233de\r\n[2022-12-15T10:32:33.841+01:00][DEBUG][plugins.fleet]
{\"took\":2347,\"timed_out\":false,\"total\":10857,\"updated\":9857,\"deleted\":0,\"batches\":11,\"version_conflicts\":1000,\"noops\":0,\"retries\":{\"bulk\":0,\"search\":0},\"throttled_millis\":0,\"requests_per_second\":-1,\"throttled_until_millis\":0,\"failures\":[]}\r\n[2022-12-15T10:32:34.556+01:00][INFO
][plugins.fleet] Running bulk action retry
task\r\n[2022-12-15T10:32:34.557+01:00][DEBUG][plugins.fleet] Retry #1
of task
fleet:update_agent_tags:retry:29e9da70-7194-4e52-8004-2c1b19f6dfd5\r\n[2022-12-15T10:32:34.557+01:00][INFO
][plugins.fleet] Running action asynchronously, actionId:
29e9da70-7194-4e52-8004-2c1b19f6dfd5, total agents:
52000\r\n[2022-12-15T10:32:34.557+01:00][INFO ][plugins.fleet] Completed
bulk action retry task\r\n[2022-12-15T10:32:34.560+01:00][INFO
][plugins.fleet] Scheduling task
fleet:update_agent_tags:retry:check:29e9da70-7194-4e52-8004-2c1b19f6dfd5\r\n[2022-12-15T10:32:35.388+01:00][ERROR][plugins.fleet]
Retry #1 of task
fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de
failed: version conflict of 1000
agents\r\n[2022-12-15T10:32:35.468+01:00][INFO ][plugins.fleet]
Scheduling task
fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de\r\n[2022-12-15T10:32:35.468+01:00][INFO
][plugins.fleet] Retrying in task:
fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de\r\n{\"took\":5509,\"timed_out\":false,\"total\":26245,\"updated\":26245,\"deleted\":0,\"batches\":27,\"version_conflicts\":0,\"noops\":0,\"retries\":{\"bulk\":0,\"search\":0},\"throttled_millis\":0,\"requests_per_second\":-1,\"throttled_until_millis\":0,\"failures\":[]}\r\n[2022-12-15T10:32:42.722+01:00][INFO
][plugins.fleet] processed 26245 agents, took
5509ms\r\n[2022-12-15T10:32:42.723+01:00][INFO ][plugins.fleet] Removing
task
fleet:update_agent_tags:retry:check:29e9da70-7194-4e52-8004-2c1b19f6dfd5\r\n[2022-12-15T10:32:46.705+01:00][INFO
][plugins.fleet] Running bulk action retry
task\r\n[2022-12-15T10:32:46.706+01:00][DEBUG][plugins.fleet] Retry #2
of task
fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de\r\n[2022-12-15T10:32:46.707+01:00][INFO
][plugins.fleet] Running action asynchronously, actionId:
90acd54-19ac-4738-b3d3-db32789233de, total agents:
52000\r\n[2022-12-15T10:32:46.707+01:00][INFO ][plugins.fleet] Completed
bulk action retry task\r\n[2022-12-15T10:32:46.711+01:00][INFO
][plugins.fleet] Scheduling task
fleet:update_agent_tags:retry:check:90acd541-19ac-4738-b3d3-db32789233de\r\n[2022-12-15T10:32:47.099+01:00][DEBUG][plugins.fleet]
{\"took\":379,\"timed_out\":false,\"total\":1000,\"updated\":1000,\"deleted\":0,\"batches\":1,\"version_conflicts\":0,\"noops\":0,\"retries\":{\"bulk\":0,\"search\":0},\"throttled_millis\":0,\"requests_per_second\":-1,\"throttled_until_millis\":0,\"failures\":[]}\r\n[2022-12-15T10:32:47.623+01:00][INFO
][plugins.fleet] processed 1000 agents, took
379ms\r\n[2022-12-15T10:32:47.623+01:00][INFO ][plugins.fleet] Removing
task
fleet:update_agent_tags:retry:check:90acd541-19ac-4738-b3d3-db32789233de\r\n```\r\n\r\n###
Checklist\r\n\r\n- [x] [Unit or
functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere
updated or added to match the most common
scenarios\r\n\r\nCo-authored-by: Kibana Machine
<42973632+kibanamachine@users.noreply.github.com>","sha":"687987aa9ce56ce359f722485330179a4807d79a","branchLabelMapping":{"^v8.7.0$":"main","^v(\\d+).(\\d+).\\d+$":"$1.$2"}},"sourcePullRequest":{"labels":["release_note:skip","Team:Fleet","v8.7.0","v8.6.1"],"number":147594,"url":"#147594
refactored bulk update tags retry (#147594)\n\n## Summary\r\n\r\nFixes
#144161
discussed\r\n[here](#144161 (comment)
existing implementation of update tags doesn't work well with
real\r\nagents, as there are many conflicts with checkin, even when
trying to\r\nadd/remove one tag.\r\nRefactored the logic to make retries
more efficient:\r\n- Instead of aborting the whole bulk action on
conflicts, changed the\r\nconflict strategy to 'proceed'. This means, if
an action of 50k agents\r\nhas 1k conflicts, not all 50k is retried, but
only the 1k conflicts,\r\nthis makes it less likely to conflict on
retry.\r\n- Because of this, on retry we have to know which agents don't
yet have\r\nthe tag added/removed. For this, added an additional filter
to the\r\n`updateByQuery` request. Only adding the filter if there is
exactly one\r\n`tagsToAdd` or one `tagsToRemove`. This is the main use
case from the\r\nUI, and handling other cases would complicate the logic
more (each\r\nadditional tag to add/remove would result in another OR
query, which\r\nwould match more agents, making conflicts more
likely).\r\n- Added this additional query on the initial request as well
(not only\r\nretries) to save on unnecessary work e.g. if the user tries
to add a tag\r\non 50k agents, but 48k already have it, it is enough to
update the\r\nremaining 2k agents.\r\n- This improvement has the effect
that 'Agent activity' shows the real\r\nupdated agent count, not the
total selected. I think this is not really\r\na problem for update
tags.\r\n- Cleaned up some of the UI logic, because the conflicts are
fully\r\nhandled now on the backend.\r\n- Locally I couldn't reproduce
the conflict with agent checkins, even\r\nwith 1k horde agents. I'll try
to test in cloud with more real agents.\r\n\r\nTo verify:\r\n- Enroll
50k agents (I used 50k with create_agents script, and 1k with\r\nhorde).
Enroll 50k with horde if possible.\r\n- Select all on UI and try to
add/remove one or more tags\r\n- Expect the changes to propagate quickly
(up to 1m). It might take a\r\nfew refreshes to see the result on agent
list and tags list, because the\r\nUI polls the agents every 30s. It is
expected that the tags list\r\ntemporarily shows incorrect data because
the action is async.\r\n\r\nE.g. removed `test3` tag and added `add` tag
quickly:\r\n<img width=\"1776\"
alt=\"image\"\r\nsrc=\"https://user-images.githubusercontent.com/90178898/207824481-411f0f70-d7e8-42a6-b73f-ed80e77b7700.png\">\r\n<img
width=\"422\"
alt=\"image\"\r\nsrc=\"https://user-images.githubusercontent.com/90178898/207824550-582d43fc-87db-45e1-ba58-15915447fefd.png\">\r\n\r\nThe
logs show the details of how many `version_conflicts` were there,\r\nand
it decreased with
retries.\r\n\r\n```\r\n[2022-12-15T10:32:12.937+01:00][INFO
][plugins.fleet] Running action asynchronously, actionId:
90acd54-19ac-4738-b3d3-db32789233de, total agents:
52000\r\n[2022-12-15T10:32:12.981+01:00][INFO ][plugins.fleet]
Scheduling task
fleet:update_agent_tags:retry:check:90acd541-19ac-4738-b3d3-db32789233de\r\n[2022-12-15T10:32:16.477+01:00][INFO
][plugins.fleet] Running action asynchronously, actionId:
29e9da70-7194-4e52-8004-2c1b19f6dfd5, total agents:
52000\r\n[2022-12-15T10:32:16.537+01:00][INFO ][plugins.fleet]
Scheduling task
fleet:update_agent_tags:retry:check:29e9da70-7194-4e52-8004-2c1b19f6dfd5\r\n[2022-12-15T10:32:22.893+01:00][DEBUG][plugins.fleet]
{\"took\":9886,\"timed_out\":false,\"total\":52000,\"updated\":41143,\"deleted\":0,\"batches\":52,\"version_conflicts\":10857,\"noops\":0,\"retries\":{\"bulk\":0,\"search\":0},\"throttled_millis\":0,\"requests_per_second\":-1,\"throttled_until_millis\":0,\"failures\":[]}\r\n[2022-12-15T10:32:26.066+01:00][DEBUG][plugins.fleet]
{\"took\":9518,\"timed_out\":false,\"total\":52000,\"updated\":25755,\"deleted\":0,\"batches\":52,\"version_conflicts\":26245,\"noops\":0,\"retries\":{\"bulk\":0,\"search\":0},\"throttled_millis\":0,\"requests_per_second\":-1,\"throttled_until_millis\":0,\"failures\":[]}\r\n[2022-12-15T10:32:27.401+01:00][ERROR][plugins.fleet]
Action failed: version conflict of 10857
agents\r\n[2022-12-15T10:32:27.461+01:00][INFO ][plugins.fleet]
Scheduling task
fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de\r\n[2022-12-15T10:32:27.462+01:00][INFO
][plugins.fleet] Retrying in task:
fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de\r\n[2022-12-15T10:32:29.274+01:00][ERROR][plugins.fleet]
Action failed: version conflict of 26245
agents\r\n[2022-12-15T10:32:29.353+01:00][INFO ][plugins.fleet]
Scheduling task
fleet:update_agent_tags:retry:29e9da70-7194-4e52-8004-2c1b19f6dfd5\r\n[2022-12-15T10:32:29.353+01:00][INFO
][plugins.fleet] Retrying in task:
fleet:update_agent_tags:retry:29e9da70-7194-4e52-8004-2c1b19f6dfd5\r\n[2022-12-15T10:32:31.480+01:00][INFO
][plugins.fleet] Running bulk action retry
task\r\n[2022-12-15T10:32:31.481+01:00][DEBUG][plugins.fleet] Retry #1
of task
fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de\r\n[2022-12-15T10:32:31.481+01:00][INFO
][plugins.fleet] Running action asynchronously, actionId:
90acd54-19ac-4738-b3d3-db32789233de, total agents:
52000\r\n[2022-12-15T10:32:31.481+01:00][INFO ][plugins.fleet] Completed
bulk action retry task\r\n[2022-12-15T10:32:31.485+01:00][INFO
][plugins.fleet] Scheduling task
fleet:update_agent_tags:retry:check:90acd541-19ac-4738-b3d3-db32789233de\r\n[2022-12-15T10:32:33.841+01:00][DEBUG][plugins.fleet]
{\"took\":2347,\"timed_out\":false,\"total\":10857,\"updated\":9857,\"deleted\":0,\"batches\":11,\"version_conflicts\":1000,\"noops\":0,\"retries\":{\"bulk\":0,\"search\":0},\"throttled_millis\":0,\"requests_per_second\":-1,\"throttled_until_millis\":0,\"failures\":[]}\r\n[2022-12-15T10:32:34.556+01:00][INFO
][plugins.fleet] Running bulk action retry
task\r\n[2022-12-15T10:32:34.557+01:00][DEBUG][plugins.fleet] Retry #1
of task
fleet:update_agent_tags:retry:29e9da70-7194-4e52-8004-2c1b19f6dfd5\r\n[2022-12-15T10:32:34.557+01:00][INFO
][plugins.fleet] Running action asynchronously, actionId:
29e9da70-7194-4e52-8004-2c1b19f6dfd5, total agents:
52000\r\n[2022-12-15T10:32:34.557+01:00][INFO ][plugins.fleet] Completed
bulk action retry task\r\n[2022-12-15T10:32:34.560+01:00][INFO
][plugins.fleet] Scheduling task
fleet:update_agent_tags:retry:check:29e9da70-7194-4e52-8004-2c1b19f6dfd5\r\n[2022-12-15T10:32:35.388+01:00][ERROR][plugins.fleet]
Retry #1 of task
fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de
failed: version conflict of 1000
agents\r\n[2022-12-15T10:32:35.468+01:00][INFO ][plugins.fleet]
Scheduling task
fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de\r\n[2022-12-15T10:32:35.468+01:00][INFO
][plugins.fleet] Retrying in task:
fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de\r\n{\"took\":5509,\"timed_out\":false,\"total\":26245,\"updated\":26245,\"deleted\":0,\"batches\":27,\"version_conflicts\":0,\"noops\":0,\"retries\":{\"bulk\":0,\"search\":0},\"throttled_millis\":0,\"requests_per_second\":-1,\"throttled_until_millis\":0,\"failures\":[]}\r\n[2022-12-15T10:32:42.722+01:00][INFO
][plugins.fleet] processed 26245 agents, took
5509ms\r\n[2022-12-15T10:32:42.723+01:00][INFO ][plugins.fleet] Removing
task
fleet:update_agent_tags:retry:check:29e9da70-7194-4e52-8004-2c1b19f6dfd5\r\n[2022-12-15T10:32:46.705+01:00][INFO
][plugins.fleet] Running bulk action retry
task\r\n[2022-12-15T10:32:46.706+01:00][DEBUG][plugins.fleet] Retry #2
of task
fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de\r\n[2022-12-15T10:32:46.707+01:00][INFO
][plugins.fleet] Running action asynchronously, actionId:
90acd54-19ac-4738-b3d3-db32789233de, total agents:
52000\r\n[2022-12-15T10:32:46.707+01:00][INFO ][plugins.fleet] Completed
bulk action retry task\r\n[2022-12-15T10:32:46.711+01:00][INFO
][plugins.fleet] Scheduling task
fleet:update_agent_tags:retry:check:90acd541-19ac-4738-b3d3-db32789233de\r\n[2022-12-15T10:32:47.099+01:00][DEBUG][plugins.fleet]
{\"took\":379,\"timed_out\":false,\"total\":1000,\"updated\":1000,\"deleted\":0,\"batches\":1,\"version_conflicts\":0,\"noops\":0,\"retries\":{\"bulk\":0,\"search\":0},\"throttled_millis\":0,\"requests_per_second\":-1,\"throttled_until_millis\":0,\"failures\":[]}\r\n[2022-12-15T10:32:47.623+01:00][INFO
][plugins.fleet] processed 1000 agents, took
379ms\r\n[2022-12-15T10:32:47.623+01:00][INFO ][plugins.fleet] Removing
task
fleet:update_agent_tags:retry:check:90acd541-19ac-4738-b3d3-db32789233de\r\n```\r\n\r\n###
Checklist\r\n\r\n- [x] [Unit or
functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere
updated or added to match the most common
scenarios\r\n\r\nCo-authored-by: Kibana Machine
<42973632+kibanamachine@users.noreply.github.com>","sha":"687987aa9ce56ce359f722485330179a4807d79a"}},"sourceBranch":"main","suggestedTargetBranches":["8.6"],"targetPullRequestStates":[{"branch":"main","label":"v8.7.0","labelRegex":"^v8.7.0$","isSourceBranch":true,"state":"MERGED","url":"#147594
refactored bulk update tags retry (#147594)\n\n## Summary\r\n\r\nFixes
#144161
discussed\r\n[here](#144161 (comment)
existing implementation of update tags doesn't work well with
real\r\nagents, as there are many conflicts with checkin, even when
trying to\r\nadd/remove one tag.\r\nRefactored the logic to make retries
more efficient:\r\n- Instead of aborting the whole bulk action on
conflicts, changed the\r\nconflict strategy to 'proceed'. This means, if
an action of 50k agents\r\nhas 1k conflicts, not all 50k is retried, but
only the 1k conflicts,\r\nthis makes it less likely to conflict on
retry.\r\n- Because of this, on retry we have to know which agents don't
yet have\r\nthe tag added/removed. For this, added an additional filter
to the\r\n`updateByQuery` request. Only adding the filter if there is
exactly one\r\n`tagsToAdd` or one `tagsToRemove`. This is the main use
case from the\r\nUI, and handling other cases would complicate the logic
more (each\r\nadditional tag to add/remove would result in another OR
query, which\r\nwould match more agents, making conflicts more
likely).\r\n- Added this additional query on the initial request as well
(not only\r\nretries) to save on unnecessary work e.g. if the user tries
to add a tag\r\non 50k agents, but 48k already have it, it is enough to
update the\r\nremaining 2k agents.\r\n- This improvement has the effect
that 'Agent activity' shows the real\r\nupdated agent count, not the
total selected. I think this is not really\r\na problem for update
tags.\r\n- Cleaned up some of the UI logic, because the conflicts are
fully\r\nhandled now on the backend.\r\n- Locally I couldn't reproduce
the conflict with agent checkins, even\r\nwith 1k horde agents. I'll try
to test in cloud with more real agents.\r\n\r\nTo verify:\r\n- Enroll
50k agents (I used 50k with create_agents script, and 1k with\r\nhorde).
Enroll 50k with horde if possible.\r\n- Select all on UI and try to
add/remove one or more tags\r\n- Expect the changes to propagate quickly
(up to 1m). It might take a\r\nfew refreshes to see the result on agent
list and tags list, because the\r\nUI polls the agents every 30s. It is
expected that the tags list\r\ntemporarily shows incorrect data because
the action is async.\r\n\r\nE.g. removed `test3` tag and added `add` tag
quickly:\r\n<img width=\"1776\"
alt=\"image\"\r\nsrc=\"https://user-images.githubusercontent.com/90178898/207824481-411f0f70-d7e8-42a6-b73f-ed80e77b7700.png\">\r\n<img
width=\"422\"
alt=\"image\"\r\nsrc=\"https://user-images.githubusercontent.com/90178898/207824550-582d43fc-87db-45e1-ba58-15915447fefd.png\">\r\n\r\nThe
logs show the details of how many `version_conflicts` were there,\r\nand
it decreased with
retries.\r\n\r\n```\r\n[2022-12-15T10:32:12.937+01:00][INFO
][plugins.fleet] Running action asynchronously, actionId:
90acd54-19ac-4738-b3d3-db32789233de, total agents:
52000\r\n[2022-12-15T10:32:12.981+01:00][INFO ][plugins.fleet]
Scheduling task
fleet:update_agent_tags:retry:check:90acd541-19ac-4738-b3d3-db32789233de\r\n[2022-12-15T10:32:16.477+01:00][INFO
][plugins.fleet] Running action asynchronously, actionId:
29e9da70-7194-4e52-8004-2c1b19f6dfd5, total agents:
52000\r\n[2022-12-15T10:32:16.537+01:00][INFO ][plugins.fleet]
Scheduling task
fleet:update_agent_tags:retry:check:29e9da70-7194-4e52-8004-2c1b19f6dfd5\r\n[2022-12-15T10:32:22.893+01:00][DEBUG][plugins.fleet]
{\"took\":9886,\"timed_out\":false,\"total\":52000,\"updated\":41143,\"deleted\":0,\"batches\":52,\"version_conflicts\":10857,\"noops\":0,\"retries\":{\"bulk\":0,\"search\":0},\"throttled_millis\":0,\"requests_per_second\":-1,\"throttled_until_millis\":0,\"failures\":[]}\r\n[2022-12-15T10:32:26.066+01:00][DEBUG][plugins.fleet]
{\"took\":9518,\"timed_out\":false,\"total\":52000,\"updated\":25755,\"deleted\":0,\"batches\":52,\"version_conflicts\":26245,\"noops\":0,\"retries\":{\"bulk\":0,\"search\":0},\"throttled_millis\":0,\"requests_per_second\":-1,\"throttled_until_millis\":0,\"failures\":[]}\r\n[2022-12-15T10:32:27.401+01:00][ERROR][plugins.fleet]
Action failed: version conflict of 10857
agents\r\n[2022-12-15T10:32:27.461+01:00][INFO ][plugins.fleet]
Scheduling task
fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de\r\n[2022-12-15T10:32:27.462+01:00][INFO
][plugins.fleet] Retrying in task:
fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de\r\n[2022-12-15T10:32:29.274+01:00][ERROR][plugins.fleet]
Action failed: version conflict of 26245
agents\r\n[2022-12-15T10:32:29.353+01:00][INFO ][plugins.fleet]
Scheduling task
fleet:update_agent_tags:retry:29e9da70-7194-4e52-8004-2c1b19f6dfd5\r\n[2022-12-15T10:32:29.353+01:00][INFO
][plugins.fleet] Retrying in task:
fleet:update_agent_tags:retry:29e9da70-7194-4e52-8004-2c1b19f6dfd5\r\n[2022-12-15T10:32:31.480+01:00][INFO
][plugins.fleet] Running bulk action retry
task\r\n[2022-12-15T10:32:31.481+01:00][DEBUG][plugins.fleet] Retry #1
of task
fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de\r\n[2022-12-15T10:32:31.481+01:00][INFO
][plugins.fleet] Running action asynchronously, actionId:
90acd54-19ac-4738-b3d3-db32789233de, total agents:
52000\r\n[2022-12-15T10:32:31.481+01:00][INFO ][plugins.fleet] Completed
bulk action retry task\r\n[2022-12-15T10:32:31.485+01:00][INFO
][plugins.fleet] Scheduling task
fleet:update_agent_tags:retry:check:90acd541-19ac-4738-b3d3-db32789233de\r\n[2022-12-15T10:32:33.841+01:00][DEBUG][plugins.fleet]
{\"took\":2347,\"timed_out\":false,\"total\":10857,\"updated\":9857,\"deleted\":0,\"batches\":11,\"version_conflicts\":1000,\"noops\":0,\"retries\":{\"bulk\":0,\"search\":0},\"throttled_millis\":0,\"requests_per_second\":-1,\"throttled_until_millis\":0,\"failures\":[]}\r\n[2022-12-15T10:32:34.556+01:00][INFO
][plugins.fleet] Running bulk action retry
task\r\n[2022-12-15T10:32:34.557+01:00][DEBUG][plugins.fleet] Retry #1
of task
fleet:update_agent_tags:retry:29e9da70-7194-4e52-8004-2c1b19f6dfd5\r\n[2022-12-15T10:32:34.557+01:00][INFO
][plugins.fleet] Running action asynchronously, actionId:
29e9da70-7194-4e52-8004-2c1b19f6dfd5, total agents:
52000\r\n[2022-12-15T10:32:34.557+01:00][INFO ][plugins.fleet] Completed
bulk action retry task\r\n[2022-12-15T10:32:34.560+01:00][INFO
][plugins.fleet] Scheduling task
fleet:update_agent_tags:retry:check:29e9da70-7194-4e52-8004-2c1b19f6dfd5\r\n[2022-12-15T10:32:35.388+01:00][ERROR][plugins.fleet]
Retry #1 of task
fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de
failed: version conflict of 1000
agents\r\n[2022-12-15T10:32:35.468+01:00][INFO ][plugins.fleet]
Scheduling task
fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de\r\n[2022-12-15T10:32:35.468+01:00][INFO
][plugins.fleet] Retrying in task:
fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de\r\n{\"took\":5509,\"timed_out\":false,\"total\":26245,\"updated\":26245,\"deleted\":0,\"batches\":27,\"version_conflicts\":0,\"noops\":0,\"retries\":{\"bulk\":0,\"search\":0},\"throttled_millis\":0,\"requests_per_second\":-1,\"throttled_until_millis\":0,\"failures\":[]}\r\n[2022-12-15T10:32:42.722+01:00][INFO
][plugins.fleet] processed 26245 agents, took
5509ms\r\n[2022-12-15T10:32:42.723+01:00][INFO ][plugins.fleet] Removing
task
fleet:update_agent_tags:retry:check:29e9da70-7194-4e52-8004-2c1b19f6dfd5\r\n[2022-12-15T10:32:46.705+01:00][INFO
][plugins.fleet] Running bulk action retry
task\r\n[2022-12-15T10:32:46.706+01:00][DEBUG][plugins.fleet] Retry #2
of task
fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de\r\n[2022-12-15T10:32:46.707+01:00][INFO
][plugins.fleet] Running action asynchronously, actionId:
90acd54-19ac-4738-b3d3-db32789233de, total agents:
52000\r\n[2022-12-15T10:32:46.707+01:00][INFO ][plugins.fleet] Completed
bulk action retry task\r\n[2022-12-15T10:32:46.711+01:00][INFO
][plugins.fleet] Scheduling task
fleet:update_agent_tags:retry:check:90acd541-19ac-4738-b3d3-db32789233de\r\n[2022-12-15T10:32:47.099+01:00][DEBUG][plugins.fleet]
{\"took\":379,\"timed_out\":false,\"total\":1000,\"updated\":1000,\"deleted\":0,\"batches\":1,\"version_conflicts\":0,\"noops\":0,\"retries\":{\"bulk\":0,\"search\":0},\"throttled_millis\":0,\"requests_per_second\":-1,\"throttled_until_millis\":0,\"failures\":[]}\r\n[2022-12-15T10:32:47.623+01:00][INFO
][plugins.fleet] processed 1000 agents, took
379ms\r\n[2022-12-15T10:32:47.623+01:00][INFO ][plugins.fleet] Removing
task
fleet:update_agent_tags:retry:check:90acd541-19ac-4738-b3d3-db32789233de\r\n```\r\n\r\n###
Checklist\r\n\r\n- [x] [Unit or
functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere
updated or added to match the most common
scenarios\r\n\r\nCo-authored-by: Kibana Machine
<42973632+kibanamachine@users.noreply.github.com>","sha":"687987aa9ce56ce359f722485330179a4807d79a"}},{"branch":"8.6","label":"v8.6.1","labelRegex":"^v(\\d+).(\\d+).\\d+$","isSourceBranch":false,"state":"NOT_CREATED"}]}]
BACKPORT-->

Co-authored-by: Julia Bardi <90178898+juliaElastic@users.noreply.github.com>
@ablnk
Copy link

ablnk commented Dec 21, 2022

Retested against the latest snapshot 8.7.0-19f3018; no changes observed. Will recheck when the next build is available.

@joshdover
Copy link
Contributor

It looks like this morning's snapshot failed, which would explain this. Let's wait until a new build is available.

crespocarlos pushed a commit to crespocarlos/kibana that referenced this issue Dec 23, 2022
## Summary

Fixes elastic#144161

As discussed
[here](elastic#144161 (comment)),
the existing implementation of update tags doesn't work well with real
agents, as there are many conflicts with checkin, even when trying to
add/remove one tag.
Refactored the logic to make retries more efficient:
- Instead of aborting the whole bulk action on conflicts, changed the
conflict strategy to 'proceed'. This means that if an action on 50k agents
has 1k conflicts, only the 1k conflicting agents are retried rather than
all 50k, which makes conflicts on the retry less likely.
- Because of this, on retry we have to know which agents don't yet have
the tag added/removed. For this, added an additional filter to the
`updateByQuery` request (see the sketch after this list). The filter is
only added if there is exactly one `tagsToAdd` or one `tagsToRemove`. This
is the main use case from the UI, and handling other cases would complicate
the logic more (each additional tag to add/remove would result in another
OR query, which would match more agents, making conflicts more likely).
- Added this additional query on the initial request as well (not only on
retries) to save unnecessary work, e.g. if the user tries to add a tag to
50k agents but 48k already have it, it is enough to update the remaining
2k agents.
- This improvement has the effect that 'Agent activity' shows the real
updated agent count, not the total selected. I think this is not really
a problem for update tags.
- Cleaned up some of the UI logic, because the conflicts are fully
handled now on the backend.
- Locally I couldn't reproduce the conflict with agent checkins, even
with 1k horde agents. I'll try to test in cloud with more real agents.
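
For illustration, here is a minimal TypeScript sketch of the query shape described in the list above, written directly against the Elasticsearch JS client. The index name, the painless script, and the `bulkAddTag` helper are assumptions for the example, not the actual Fleet implementation:

```ts
import { Client } from '@elastic/elasticsearch';

const es = new Client({ node: 'http://localhost:9200' }); // assumed local cluster

// Sketch: add a single tag to the selected agents, skipping agents that
// already carry it, and proceeding past version conflicts instead of aborting.
async function bulkAddTag(selection: Record<string, any>, tagToAdd: string) {
  const res = await es.updateByQuery({
    index: '.fleet-agents', // assumed agents index name
    conflicts: 'proceed',   // conflicting docs are skipped, not fatal
    query: {
      bool: {
        must: [selection],                        // the user's agent selection
        must_not: [{ term: { tags: tagToAdd } }], // only agents missing the tag
      },
    },
    script: {
      lang: 'painless',
      source: `
        if (ctx._source.tags == null) { ctx._source.tags = [params.tag]; }
        else if (!ctx._source.tags.contains(params.tag)) { ctx._source.tags.add(params.tag); }
      `,
      params: { tag: tagToAdd },
    },
  });
  // res.version_conflicts is how many agents still need a retry;
  // res.updated is the count of agents that were really changed.
  return res;
}
```

Because the `must_not` clause already excludes agents that have the tag, a retry only touches the conflicting remainder, which is what keeps repeated attempts cheap.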

To verify:
- Enroll 50k agents (I used 50k with create_agents script, and 1k with
horde). Enroll 50k with horde if possible.
- Select all on UI and try to add/remove one or more tags
- Expect the changes to propagate quickly (up to 1m). It might take a
few refreshes to see the result on agent list and tags list, because the
UI polls the agents every 30s. It is expected that the tags list
temporarily shows incorrect data because the action is async.

E.g. removed `test3` tag and added `add` tag quickly:
<img width="1776" alt="image"
src="https://user-images.githubusercontent.com/90178898/207824481-411f0f70-d7e8-42a6-b73f-ed80e77b7700.png">
<img width="422" alt="image"
src="https://user-images.githubusercontent.com/90178898/207824550-582d43fc-87db-45e1-ba58-15915447fefd.png">

The logs show the details of how many `version_conflicts` were there,
and it decreased with retries.

```
[2022-12-15T10:32:12.937+01:00][INFO ][plugins.fleet] Running action asynchronously, actionId: 90acd54-19ac-4738-b3d3-db32789233de, total agents: 52000
[2022-12-15T10:32:12.981+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:check:90acd541-19ac-4738-b3d3-db32789233de
[2022-12-15T10:32:16.477+01:00][INFO ][plugins.fleet] Running action asynchronously, actionId: 29e9da70-7194-4e52-8004-2c1b19f6dfd5, total agents: 52000
[2022-12-15T10:32:16.537+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:check:29e9da70-7194-4e52-8004-2c1b19f6dfd5
[2022-12-15T10:32:22.893+01:00][DEBUG][plugins.fleet] {"took":9886,"timed_out":false,"total":52000,"updated":41143,"deleted":0,"batches":52,"version_conflicts":10857,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1,"throttled_until_millis":0,"failures":[]}
[2022-12-15T10:32:26.066+01:00][DEBUG][plugins.fleet] {"took":9518,"timed_out":false,"total":52000,"updated":25755,"deleted":0,"batches":52,"version_conflicts":26245,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1,"throttled_until_millis":0,"failures":[]}
[2022-12-15T10:32:27.401+01:00][ERROR][plugins.fleet] Action failed: version conflict of 10857 agents
[2022-12-15T10:32:27.461+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de
[2022-12-15T10:32:27.462+01:00][INFO ][plugins.fleet] Retrying in task: fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de
[2022-12-15T10:32:29.274+01:00][ERROR][plugins.fleet] Action failed: version conflict of 26245 agents
[2022-12-15T10:32:29.353+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:29e9da70-7194-4e52-8004-2c1b19f6dfd5
[2022-12-15T10:32:29.353+01:00][INFO ][plugins.fleet] Retrying in task: fleet:update_agent_tags:retry:29e9da70-7194-4e52-8004-2c1b19f6dfd5
[2022-12-15T10:32:31.480+01:00][INFO ][plugins.fleet] Running bulk action retry task
[2022-12-15T10:32:31.481+01:00][DEBUG][plugins.fleet] Retry #1 of task fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de
[2022-12-15T10:32:31.481+01:00][INFO ][plugins.fleet] Running action asynchronously, actionId: 90acd54-19ac-4738-b3d3-db32789233de, total agents: 52000
[2022-12-15T10:32:31.481+01:00][INFO ][plugins.fleet] Completed bulk action retry task
[2022-12-15T10:32:31.485+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:check:90acd541-19ac-4738-b3d3-db32789233de
[2022-12-15T10:32:33.841+01:00][DEBUG][plugins.fleet] {"took":2347,"timed_out":false,"total":10857,"updated":9857,"deleted":0,"batches":11,"version_conflicts":1000,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1,"throttled_until_millis":0,"failures":[]}
[2022-12-15T10:32:34.556+01:00][INFO ][plugins.fleet] Running bulk action retry task
[2022-12-15T10:32:34.557+01:00][DEBUG][plugins.fleet] Retry #1 of task fleet:update_agent_tags:retry:29e9da70-7194-4e52-8004-2c1b19f6dfd5
[2022-12-15T10:32:34.557+01:00][INFO ][plugins.fleet] Running action asynchronously, actionId: 29e9da70-7194-4e52-8004-2c1b19f6dfd5, total agents: 52000
[2022-12-15T10:32:34.557+01:00][INFO ][plugins.fleet] Completed bulk action retry task
[2022-12-15T10:32:34.560+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:check:29e9da70-7194-4e52-8004-2c1b19f6dfd5
[2022-12-15T10:32:35.388+01:00][ERROR][plugins.fleet] Retry #1 of task fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de failed: version conflict of 1000 agents
[2022-12-15T10:32:35.468+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de
[2022-12-15T10:32:35.468+01:00][INFO ][plugins.fleet] Retrying in task: fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de
{"took":5509,"timed_out":false,"total":26245,"updated":26245,"deleted":0,"batches":27,"version_conflicts":0,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1,"throttled_until_millis":0,"failures":[]}
[2022-12-15T10:32:42.722+01:00][INFO ][plugins.fleet] processed 26245 agents, took 5509ms
[2022-12-15T10:32:42.723+01:00][INFO ][plugins.fleet] Removing task fleet:update_agent_tags:retry:check:29e9da70-7194-4e52-8004-2c1b19f6dfd5
[2022-12-15T10:32:46.705+01:00][INFO ][plugins.fleet] Running bulk action retry task
[2022-12-15T10:32:46.706+01:00][DEBUG][plugins.fleet] Retry #2 of task fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de
[2022-12-15T10:32:46.707+01:00][INFO ][plugins.fleet] Running action asynchronously, actionId: 90acd54-19ac-4738-b3d3-db32789233de, total agents: 52000
[2022-12-15T10:32:46.707+01:00][INFO ][plugins.fleet] Completed bulk action retry task
[2022-12-15T10:32:46.711+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:check:90acd541-19ac-4738-b3d3-db32789233de
[2022-12-15T10:32:47.099+01:00][DEBUG][plugins.fleet] {"took":379,"timed_out":false,"total":1000,"updated":1000,"deleted":0,"batches":1,"version_conflicts":0,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1,"throttled_until_millis":0,"failures":[]}
[2022-12-15T10:32:47.623+01:00][INFO ][plugins.fleet] processed 1000 agents, took 379ms
[2022-12-15T10:32:47.623+01:00][INFO ][plugins.fleet] Removing task fleet:update_agent_tags:retry:check:90acd541-19ac-4738-b3d3-db32789233de
```

### Checklist

- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
@ablnk
Copy link

ablnk commented Dec 28, 2022

Retested in the latest 8.6.0 BC 0410b9b5.
With 75k agents there are no more issues.
With 100k, some agents remain untouched when adding or removing tags: the tag was not removed from 948 agents in the removal case, and not added to 33 agents in the addition case.

From logs:
Action failed: version conflict of 45962 agents
Stopping after 3rd retry. Error: version conflict of 948 agents

@ablnk ablnk reopened this Dec 28, 2022
@juliaElastic
Copy link
Contributor

juliaElastic commented Dec 29, 2022

Retested in the latest 8.6.0 BC 0410b9b5 With 75k agents no more issues. With 100k, some agents remain untouched when adding or removing tags. In case of removing, tag has not been removed from 948 agents. In case of adding, tag has not been added to 33 agents.

From logs: Action failed: version conflict of 45962 agents Stopping after 3rd retry. Error: version conflict of 948 agents

Yes, I expected this to happen eventually. The logic retries 3 times on version conflict and updates the remaining agents each time, so some conflicts can still occur on the last retry. One thing we can do is increase the number of retries to, say, 5, though it is still not guaranteed that there will be no conflicts left after n retries.
Maybe this will be fully resolved if the checkin logic changes and does not update agents continuously.

At least the improvement works, so most of the agents are updated successfully.
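
As a rough sketch of the bounded-retry idea discussed above (the `MAX_RETRIES` constant and the `runOnce` callback are made up for the example; this is not the actual Fleet task code):

```ts
const MAX_RETRIES = 5; // raising the cap from 3 gives large fleets more chances

// runOnce re-issues the updateByQuery for the action; because the query already
// excludes agents that have the tag applied, each attempt only touches the
// agents that conflicted last time.
async function updateTagsWithRetry(
  runOnce: () => Promise<{ version_conflicts?: number }>
): Promise<void> {
  for (let attempt = 0; attempt <= MAX_RETRIES; attempt++) {
    const { version_conflicts: conflicts = 0 } = await runOnce();
    if (conflicts === 0) {
      return; // all matching agents updated
    }
  }
  throw new Error(`version conflicts remained after ${MAX_RETRIES} retries`);
}
```

Even with a higher cap, a fleet that checks in continuously can still produce conflicts on the final attempt, as noted above.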

@ablnk
Copy link

ablnk commented Dec 29, 2022

Maybe this will be fully resolved if the checkin logic changes and does not update agents continuously.

@juliaElastic to test that theory I conducted another test on a QA environment where the checkin time was increased to 30 minutes. I can confirm that this helped for adding/removing a single tag to 100k agents. When adding/removing multiple tags at once, I still see the problem that tags are not applied to some of the agents. However, it seems like increasing the retry attempts could really help: on each retry, the number of agents to which the tag was not applied shrinks, and with more retries the tag would perhaps be applied to all agents.

juliaElastic added a commit that referenced this issue Dec 29, 2022
## Summary

Increase the retry count to 5 to help retry on agent doc version conflicts.
It looks like 3 retries are not enough for updating tags on 100k agents.
#144161

This can be tested on an ECE high memory instance with 100k horde
agents.

### Checklist

- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios
kibanamachine pushed a commit to kibanamachine/kibana that referenced this issue Dec 29, 2022
## Summary

Increase the retry count to 5 to help retry on agent doc version conflicts.
It looks like 3 retries are not enough for updating tags on 100k agents.
elastic#144161

This can be tested on an ECE high memory instance with 100k horde
agents.

### Checklist

- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios

(cherry picked from commit a9ac5ae)
kibanamachine added a commit that referenced this issue Dec 29, 2022
# Backport

This will backport the following commits from `main` to `8.6`:
- [increase bulk action retry to 5 (#148169)](#148169)


### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)

<!--BACKPORT [{"author":{"name":"Julia
Bardi","email":"90178898+juliaElastic@users.noreply.github.com"},"sourceCommit":{"committedDate":"2022-12-29T14:28:36Z","message":"increase
bulk action retry to 5 (#148169)\n\n## Summary\r\n\r\nIncrease retry
count to 5 to help retry on agent doc version conflict.\r\nIt looks like
3 retries are not enough for 100k agents update
tags.\r\nhttps://github.com//issues/144161\r\n\r\nThis can
be tested on an ECE high memory instance with 100k
horde\r\nagents.\r\n\r\n### Checklist\r\n\r\n- [x] [Unit or
functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere
updated or added to match the most common
scenarios","sha":"a9ac5aeb1eac631a2c365004c3f38fdca5c33291","branchLabelMapping":{"^v8.7.0$":"main","^v(\\d+).(\\d+).\\d+$":"$1.$2"}},"sourcePullRequest":{"labels":["release_note:skip","Team:Fleet","ci:cloud-deploy","v8.7.0","v8.6.1"],"number":148169,"url":"#148169
bulk action retry to 5 (#148169)\n\n## Summary\r\n\r\nIncrease retry
count to 5 to help retry on agent doc version conflict.\r\nIt looks like
3 retries are not enough for 100k agents update
tags.\r\nhttps://github.com//issues/144161\r\n\r\nThis can
be tested on an ECE high memory instance with 100k
horde\r\nagents.\r\n\r\n### Checklist\r\n\r\n- [x] [Unit or
functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere
updated or added to match the most common
scenarios","sha":"a9ac5aeb1eac631a2c365004c3f38fdca5c33291"}},"sourceBranch":"main","suggestedTargetBranches":["8.6"],"targetPullRequestStates":[{"branch":"main","label":"v8.7.0","labelRegex":"^v8.7.0$","isSourceBranch":true,"state":"MERGED","url":"#148169
bulk action retry to 5 (#148169)\n\n## Summary\r\n\r\nIncrease retry
count to 5 to help retry on agent doc version conflict.\r\nIt looks like
3 retries are not enough for 100k agents update
tags.\r\nhttps://github.com//issues/144161\r\n\r\nThis can
be tested on an ECE high memory instance with 100k
horde\r\nagents.\r\n\r\n### Checklist\r\n\r\n- [x] [Unit or
functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere
updated or added to match the most common
scenarios","sha":"a9ac5aeb1eac631a2c365004c3f38fdca5c33291"}},{"branch":"8.6","label":"v8.6.1","labelRegex":"^v(\\d+).(\\d+).\\d+$","isSourceBranch":false,"state":"NOT_CREATED"}]}]
BACKPORT-->

Co-authored-by: Julia Bardi <90178898+juliaElastic@users.noreply.github.com>
@juliaElastic
Copy link
Contributor

Found a UI bug, reported here. It causes bulk update tags not to work with the default status filters.

@jlind23
Copy link
Contributor

jlind23 commented Jan 3, 2023

Closing this issue in order to keep only: #148233

@jlind23 jlind23 closed this as completed Jan 3, 2023
@juliaElastic
Copy link
Contributor

@ablnk please test with the latest snapshot to confirm whether the increase to 5 retries helped.

@ablnk
Copy link

ablnk commented Jan 3, 2023

@juliaElastic will retest as soon as #148233 is fixed
