fix for action_seq_no being overwritten with 0 #2582

juliaElastic · 2023-05-12T12:03:16Z

Bugfix

What is the problem this PR solves?

Fix for a bug noticed in upgrades, where the upgrade action is the first action (_seq_no:0), and the agent doesn't get the action delivered on the next checkin if it misses the first checkin.

How does this PR solve the problem?

In resolveSeqNo, the agent's action_seq_no was incorrectly overwritten with 0 if the ackToken was not provided, which is the case for the first action.
In this case the agent doesn't get the action with seq_no:0, as Fleet Server considers it already delivered.
The agent should keep action_seq_on:-1 until it gets an action delivered.

The overwrite is also wrong after Fleet Server restart, where the ackToken is empty (if I understand this correctly).

How to test this PR locally

Start a fresh ES, kibana, fleet-server (no actions created before)
Enroll 1 drone with horde
restart fleet-server
trigger an upgrade action, the action is pesristed with _seq_no:0
wait for the drone to checkin, it should receive the action and upgrade successfully.

Agent (drone) after enroll:

        "_source": {
          "action_seq_no": [
            -1
          ],
          "local_metadata": {
            "elastic": {
              "agent": {
                "id": "e28a2fa7-346f-42da-a7e9-76d8f0fe17fd",
                "version": "8.2.0"
              }
            },

After upgrade:

          "action_seq_no": [
            0
          ],
          "local_metadata": {
            "elastic": {
              "agent": {
                "id": "e28a2fa7-346f-42da-a7e9-76d8f0fe17fd",
                "version": "8.2.1"
              }
            },

Action with seq_no:

 {
        "_index": ".fleet-actions-7",
        "_id": "5347065f-2b26-4c5b-a253-9f34e6fcddc5",
        "_seq_no": 0,
        "_primary_term": 1,
        "_score": 1,
        "_source": {
          "@timestamp": "2023-05-12T12:12:27.247Z",
          "expiration": "2023-06-11T12:12:27.247Z",
          "agents": [
            "e28a2fa7-346f-42da-a7e9-76d8f0fe17fd"
          ],
          "action_id": "3d8ade67-388a-48ba-aa15-842bd8dd64a8",
          "data": {
            "version": "8.2.1"
          },
          "type": "UPGRADE",
          "traceparent": "00-330370b9dfdceb2b46606abefe21224d-7600178b8a212d24-01"
        }
      }

Fleet server delivered the action:

12:13:21.535 INF Action delivered to agent on checkin ackToken=5347065f-2b26-4c5b-a253-9f34e6fcddc5 createdAt=2023-05-12T12:12:27.247Z ecs.version=1.6.0 fleet.access.apikey.id=Fs7fD4gBC7gel4XV8JLp fleet.agent.id=e28a2fa7-346f-42da-a7e9-76d8f0fe17fd http.request.id=aea58b3c-18f8-4675-9ad5-74077d5a3228 id=3d8ade67-388a-48ba-aa15-842bd8dd64a8 inputType= server.address= service.name=fleet-server timeout=0 type=UPGRADE

Design Checklist

I have ensured my design is stateless and will work when multiple fleet-server instances are behind a load balancer.
I have or intend to scale test my changes, ensuring it will work reliably with 100K+ agents connected.
I have included fail safe mechanisms to limit the load on fleet-server: rate limiting, circuit breakers, caching, load shedding, etc.

Checklist

I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I have made corresponding change to the default configuration files
I have added tests that prove my fix is effective or that my feature works
I have added an entry in ./changelog/fragments using the changelog tool

Related issues

Relates #2519

juliaElastic · 2023-05-12T12:04:25Z

internal/pkg/api/handleCheckin.go

-	span.Context.SetLabel("action_count", len(fromPtr(resp.Actions)))
-	span.Context.SetLabel("agent_id", agent.Id)
-	defer span.End()
+	if len(fromPtr(resp.Actions)) > 0 {


this is just moving the trace into a condition to reduce noise ("action deliver" span was showing up even though there were no actions)

juliaElastic · 2023-05-12T12:05:30Z

internal/pkg/api/handleCheckin.go

@@ -384,6 +386,8 @@ func (ct *CheckinT) resolveSeqNo(ctx context.Context, zlog zerolog.Logger, req C
 			if errors.Is(err, dl.ErrNotFound) {
 				zlog.Debug().Str("token", *ackToken).Msg("revision token not found")
 				err = nil
+				// should be left the ActionSeqNo if no ackToken, otherwise would be overwritten with 0 on a Fleet Server restart
+				return seqno, nil


without the return, the code returned seqno = []int64{sn}, and sn had the value 0 if ackToken not defined.

juliaElastic · 2023-05-12T12:06:50Z

Plan to add an automated test for this.

elasticmachine · 2023-05-12T12:31:31Z

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS

Expand to view the summary

Build stats

Start Time: 2023-05-16T07:27:37.793+0000
Duration: 26 min 23 sec

Test stats 🧪

Test	Results
Failed	0
Passed	277
Skipped	1
Total	278

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

/test : Re-trigger the build.

kpollich

Code LGTM - thanks for tracking this down

juliaElastic · 2023-05-15T13:47:02Z

internal/pkg/checkin/bulk.go

+					// This helps eliminate the case where the agent is stuck in upgrading after it's already upgraded if the ack
+					// is never received.
+					dl.FieldUpgradeStartedAt: nil,
+					dl.FieldUpgradedAt:       nowTimestamp,


Revived this draft as a safety net, in case the ack is lost and the agent checks in with a new version, we can safely reset updating status (upgrade_started_at)

juliaElastic · 2023-05-15T16:03:49Z

internal/pkg/api/handleCheckin_test.go

+		t.Run(tc.name, func(t *testing.T) {
+			// setup mock CheckinT
+			logger := testlog.SetLogger(t)
+			ctx, cancel := context.WithCancel(context.Background())


I took most of the setup code from server_test.go, might be simplified for a unit test

juliaElastic · 2023-05-15T16:19:27Z

internal/pkg/checkin/bulk.go

+					// This helps eliminate the case where the agent is stuck in upgrading after it's already upgraded if the ack
+					// is never received.
+					dl.FieldUpgradeStartedAt: nil,
+					dl.FieldUpgradedAt:       nowTimestamp,


it's not really clear how to assert in the unit test that extra fields were set, can we capture the args that the mock was called with in go?

fleet-server/internal/pkg/checkin/bulk_test.go

Line 184 in c8966ed

mockBulk.On("MUpdate", mock.Anything, mock.MatchedBy(matchOp(t, c, start)), mock.Anything).Return([]bulk.BulkIndexerResponseItem{}, nil).Once()

juliaElastic · 2023-05-15T16:28:06Z

Ran a few rounds of test with the pr fleet-server image with 25k drones, and didn't see issues so far. Plan to merge this tomorrow.

juliaElastic · 2023-05-16T07:25:07Z

internal/pkg/checkin/bulk.go

+				// This helps eliminate the case where the agent is stuck in upgrading after it's already upgraded if the ack
+				// is never received.
+				fields[dl.FieldUpgradeStartedAt] = nil
+				fields[dl.FieldUpgradedAt] = nowTimestamp


Did some local testing, and I think this doesn't work well, the version is coming in every checkin, and the upgraded_at field keeps updating, which can be misleading.
Added logging in line 212:

07:19:07.751 INF new version ecs.version=1.6.0 service.name=fleet-server ver=8.2.1 07:19:37.751 INF new version ecs.version=1.6.0 service.name=fleet-server ver=8.2.1 07:20:07.750 INF new version ecs.version=1.6.0 service.name=fleet-server ver=8.2.1 07:20:37.749 INF new version ecs.version=1.6.0 service.name=fleet-server ver=8.2.1 07:21:07.748 INF new version ecs.version=1.6.0 service.name=fleet-server ver=8.2.1

I think the right thing to do here would be to read the existing action doc, and only update the upgrade fields if the version is newer, and the upgrade_started_at field is set.
I'll revert for now.

I also think it might mess up the action status tracking in the UI if we don't create the action result document?

yes, the action would stay in progress

fix for action_seq_no being overwritten with 0

1a64d09

juliaElastic self-assigned this May 12, 2023

juliaElastic requested review from a team as code owners May 12, 2023 12:03

juliaElastic requested review from blakerouse and michel-laterman May 12, 2023 12:03

juliaElastic commented May 12, 2023

View reviewed changes

fix linter

fde1b6d

juliaElastic added the v8.9.0 label May 12, 2023

juliaElastic mentioned this pull request May 12, 2023

Upgrade action sometimes leaves agent in Updating state #2519

Closed

kpollich approved these changes May 12, 2023

View reviewed changes

added changelog, resetting updating state on new version

5ffa247

juliaElastic commented May 15, 2023

View reviewed changes

added unit test on resolveSeqNo change

c8966ed

juliaElastic commented May 15, 2023

View reviewed changes

Merge branch 'main' into fix-upgrade

85f102b

juliaElastic added 2 commits May 16, 2023 08:24

unit test, fixed bug

63da1f7

fix lint

905d073

juliaElastic commented May 16, 2023

View reviewed changes

reverted changes in bulk.go

836b0fd

juliaElastic mentioned this pull request May 16, 2023

Update upgrade fields on version change #1731

Closed

5 tasks

juliaElastic merged commit e7d04c9 into elastic:main May 16, 2023
17 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix for action_seq_no being overwritten with 0 #2582

fix for action_seq_no being overwritten with 0 #2582

juliaElastic commented May 12, 2023 •

edited

juliaElastic May 12, 2023

juliaElastic May 12, 2023

juliaElastic commented May 12, 2023

elasticmachine commented May 12, 2023 •

edited

Build stats

Test stats 🧪

kpollich left a comment

juliaElastic May 15, 2023 •

edited

juliaElastic May 15, 2023

juliaElastic May 15, 2023

juliaElastic commented May 15, 2023

juliaElastic May 16, 2023

joshdover May 16, 2023

juliaElastic May 16, 2023

fix for action_seq_no being overwritten with 0 #2582

fix for action_seq_no being overwritten with 0 #2582

Conversation

juliaElastic commented May 12, 2023 • edited

What is the problem this PR solves?

How does this PR solve the problem?

How to test this PR locally

Design Checklist

Checklist

Related issues

juliaElastic May 12, 2023

Choose a reason for hiding this comment

juliaElastic May 12, 2023

Choose a reason for hiding this comment

juliaElastic commented May 12, 2023

elasticmachine commented May 12, 2023 • edited

💚 Build Succeeded

Build stats

Test stats 🧪

💚 Flaky test report

🤖 GitHub comments

kpollich left a comment

Choose a reason for hiding this comment

juliaElastic May 15, 2023 • edited

Choose a reason for hiding this comment

juliaElastic May 15, 2023

Choose a reason for hiding this comment

juliaElastic May 15, 2023

Choose a reason for hiding this comment

juliaElastic commented May 15, 2023

juliaElastic May 16, 2023

Choose a reason for hiding this comment

joshdover May 16, 2023

Choose a reason for hiding this comment

juliaElastic May 16, 2023

Choose a reason for hiding this comment

juliaElastic commented May 12, 2023 •

edited

elasticmachine commented May 12, 2023 •

edited

juliaElastic May 15, 2023 •

edited