New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix for action_seq_no being overwritten with 0 #2582
Conversation
span.Context.SetLabel("action_count", len(fromPtr(resp.Actions))) | ||
span.Context.SetLabel("agent_id", agent.Id) | ||
defer span.End() | ||
if len(fromPtr(resp.Actions)) > 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is just moving the trace into a condition to reduce noise ("action deliver" span was showing up even though there were no actions)
internal/pkg/api/handleCheckin.go
Outdated
@@ -384,6 +386,8 @@ func (ct *CheckinT) resolveSeqNo(ctx context.Context, zlog zerolog.Logger, req C | |||
if errors.Is(err, dl.ErrNotFound) { | |||
zlog.Debug().Str("token", *ackToken).Msg("revision token not found") | |||
err = nil | |||
// should be left the ActionSeqNo if no ackToken, otherwise would be overwritten with 0 on a Fleet Server restart | |||
return seqno, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
without the return, the code returned seqno = []int64{sn}
, and sn
had the value 0 if ackToken
not defined.
Plan to add an automated test for this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code LGTM - thanks for tracking this down
internal/pkg/checkin/bulk.go
Outdated
// This helps eliminate the case where the agent is stuck in upgrading after it's already upgraded if the ack | ||
// is never received. | ||
dl.FieldUpgradeStartedAt: nil, | ||
dl.FieldUpgradedAt: nowTimestamp, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revived this draft as a safety net, in case the ack is lost and the agent checks in with a new version, we can safely reset updating status (upgrade_started_at
)
t.Run(tc.name, func(t *testing.T) { | ||
// setup mock CheckinT | ||
logger := testlog.SetLogger(t) | ||
ctx, cancel := context.WithCancel(context.Background()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took most of the setup code from server_test.go
, might be simplified for a unit test
internal/pkg/checkin/bulk.go
Outdated
// This helps eliminate the case where the agent is stuck in upgrading after it's already upgraded if the ack | ||
// is never received. | ||
dl.FieldUpgradeStartedAt: nil, | ||
dl.FieldUpgradedAt: nowTimestamp, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's not really clear how to assert in the unit test that extra fields were set, can we capture the args that the mock was called with in go?
mockBulk.On("MUpdate", mock.Anything, mock.MatchedBy(matchOp(t, c, start)), mock.Anything).Return([]bulk.BulkIndexerResponseItem{}, nil).Once() |
Ran a few rounds of test with the pr fleet-server image with 25k drones, and didn't see issues so far. Plan to merge this tomorrow. |
internal/pkg/checkin/bulk.go
Outdated
// This helps eliminate the case where the agent is stuck in upgrading after it's already upgraded if the ack | ||
// is never received. | ||
fields[dl.FieldUpgradeStartedAt] = nil | ||
fields[dl.FieldUpgradedAt] = nowTimestamp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did some local testing, and I think this doesn't work well, the version is coming in every checkin, and the upgraded_at
field keeps updating, which can be misleading.
Added logging in line 212:
07:19:07.751 INF new version ecs.version=1.6.0 service.name=fleet-server ver=8.2.1
07:19:37.751 INF new version ecs.version=1.6.0 service.name=fleet-server ver=8.2.1
07:20:07.750 INF new version ecs.version=1.6.0 service.name=fleet-server ver=8.2.1
07:20:37.749 INF new version ecs.version=1.6.0 service.name=fleet-server ver=8.2.1
07:21:07.748 INF new version ecs.version=1.6.0 service.name=fleet-server ver=8.2.1
I think the right thing to do here would be to read the existing action doc, and only update the upgrade fields if the version is newer, and the upgrade_started_at
field is set.
I'll revert for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also think it might mess up the action status tracking in the UI if we don't create the action result document?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, the action would stay in progress
Bugfix
What is the problem this PR solves?
Fix for a bug noticed in upgrades, where the upgrade action is the first action (
_seq_no:0
), and the agent doesn't get the action delivered on the next checkin if it misses the first checkin.How does this PR solve the problem?
In
resolveSeqNo
, the agent'saction_seq_no
was incorrectly overwritten with 0 if theackToken
was not provided, which is the case for the first action.In this case the agent doesn't get the action with
seq_no:0
, as Fleet Server considers it already delivered.The agent should keep
action_seq_on:-1
until it gets an action delivered.The overwrite is also wrong after Fleet Server restart, where the
ackToken
is empty (if I understand this correctly).How to test this PR locally
_seq_no:0
Agent (drone) after enroll:
After upgrade:
Action with seq_no:
Fleet server delivered the action:
Design Checklist
Checklist
./changelog/fragments
using the changelog toolRelated issues
Relates #2519