Fix flaky TestEnqueueMDMCommand test by sgress454 · Pull Request #24697 · fleetdm/fleet

sgress454 · 2024-12-12T05:35:02Z

FYI this was diagnosed and fixed using the RandoKiller.

This PR fixes the TestEnqueueMDMCommand, which has been failing intermittently here. Most of the time the /api/latest/fleet/mdm/apple/commands API is returning one result as expected, but occasionally it returns 2, for example:

[
  {
    "device_id": "B11F1FC1-F176-48CF-88A4-CB7A3DFEF987",
    "command_uuid": "63bb4313-ccbf-4647-ac07-7d15df5f92d7",
    "updated_at": "2024-12-12T02:41:36Z",
    "request_type": "ProfileList",
    "status": "Acknowledged",
    "hostname": "test-host"
  },
  {
    "device_id": "B11F1FC1-F176-48CF-88A4-CB7A3DFEF987",
    "command_uuid": "7de9d712-7524-4443-a20a-7127e6064f6e",
    "updated_at": "2024-12-12T02:41:36.141498Z",
    "request_type": "InstallEnterpriseApplication",
    "status": "Pending",
    "hostname": "test-host"
  }
]

It seems that the second command is related to trying to install a bootstrap package (uploaded by a previous test) to the newly-enrolled host.

The fix in this PR is to filter the API response to only the command we're verifying the presence of. It's a decent solve, but leaves open the edge case of a bug that causes multiple commands to be sent unexpectedly. The ideal solution would be to remove the interaction between the two tests, perhaps by deleting any created bootstraps before those tests complete, or by re-initializing the state in some other way. I don't currently have enough context to easily implement a solution like that (i.e. I know there's a "delete bootstrap" API, but not sure if that's enough to solve this issue).

codecov · 2024-12-12T05:55:30Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 63.57%. Comparing base (e361073) to head (79c1573).
Report is 3 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #24697      +/-   ##
==========================================
- Coverage   63.58%   63.57%   -0.01%     
==========================================
  Files        1601     1601              
  Lines      151666   151666              
  Branches     3898     3898              
==========================================
- Hits        96431    96422       -9     
- Misses      47571    47577       +6     
- Partials     7664     7667       +3

Flag	Coverage Δ
backend	`64.37% <ø> (-0.01%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

jahzielv · 2024-12-12T14:29:08Z

-	require.Len(t, listCmdResp.Results, 1)
-	require.NotZero(t, listCmdResp.Results[0].UpdatedAt)
-	listCmdResp.Results[0].UpdatedAt = time.Time{}
+	results, err := json.Marshal(listCmdResp.Results)


tiny linting issue here (needs an error check)

jahzielv

Nice solve! LGTM (aside from the linting thing) because this does fix it, but to your point maybe some cleanup would be a better long-term solution?

We do have this cleanup in the TearDownTest method for the MDM suite. Perhaps we also need to delete from the host_mdm_apple_bootstrap_packages table? Or also the nano_commands table, seems like we're not cleaning that one out either.

sgress454 · 2024-12-12T14:56:14Z

Perhaps we also need to delete from the host_mdm_apple_bootstrap_packages table?

That sounds promising, I can try that!

sgress454 · 2024-12-12T17:38:00Z

Perhaps we also need to delete from the host_mdm_apple_bootstrap_packages table?

@jahzielv alas this did not fix the issue -- I tried clearing host_mdm_apple_bootstrap_packages, nano_commands and nano_cert_auth_associations (see main...refs/heads/sgress454/enqueue-mdm-command-randokiller#diff-2bd8ca2ddaad7aac0c438a2afd76a26872378249f757c9c81a31005d0e57cf1fR688-R702) but still got a run where there were two commands returned 😞

I fixed the lint issue if you're comfortable 👍 with the current approach, or have other suggestions.

jahzielv · 2024-12-12T17:56:22Z

@sgress454 aw nuts! Thanks for trying that anyway!

I'd say let's go with this fix for now, and we can do a deeper investigation later.

This PR adds a new workflow called "Stress Test Go Test" (aka the RandoKiller) that allows running one or more tests repeatedly up to a set number of times, or until a test fails. This is useful for: * Trying to diagnose and debug a flaky test * Verifying that a proposed fix for a flaky test actually works. To use: 1. Create a branch whose name ends with "-randokiller" 2. Modify the .github/workflows/config/randokiller.json file to your specifications (choosing the packages and tests to run, the mysql matrix, and the number of runs to do) 3. Push up the branch Since the stress test is intended to run a branch that you'll never merge, you should feel free to add whatever logs to your tests or code that will help diagnose failures. I used this to diagnose and fix #24697!

sgress454 · 2024-12-12T18:30:31Z

lol i accidentally closed this by saying that i used the randokiller to "fix" this issue 🤦

fix test

175969d

sgress454 requested a review from a team as a code owner December 12, 2024 05:35

sgress454 temporarily deployed to Docker Hub December 12, 2024 05:35 — with GitHub Actions Inactive

sgress454 mentioned this pull request Dec 12, 2024

Create RandoKiller™ to help diagnose and fix flaky tests #24696

Merged

jahzielv reviewed Dec 12, 2024

View reviewed changes

lint

52c94fe

sgress454 had a problem deploying to Docker Hub December 12, 2024 17:36 — with GitHub Actions Error

Merge branch 'main' into sgress454/fix-flaky-enqueuemdmcommand-test

79c1573

sgress454 temporarily deployed to Docker Hub December 12, 2024 17:37 — with GitHub Actions Inactive

jahzielv approved these changes Dec 12, 2024

View reviewed changes

sgress454 closed this in #24696 Dec 12, 2024

sgress454 reopened this Dec 12, 2024

sgress454 temporarily deployed to Docker Hub December 12, 2024 18:30 — with GitHub Actions Inactive

sgress454 merged commit cdae174 into main Dec 12, 2024

sgress454 deleted the sgress454/fix-flaky-enqueuemdmcommand-test branch December 12, 2024 18:30

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix flaky TestEnqueueMDMCommand test#24697

Fix flaky TestEnqueueMDMCommand test#24697
sgress454 merged 3 commits intomainfrom
sgress454/fix-flaky-enqueuemdmcommand-test

sgress454 commented Dec 12, 2024

Uh oh!

codecov Bot commented Dec 12, 2024 •

edited

Loading

Uh oh!

jahzielv Dec 12, 2024 •

edited

Loading

Uh oh!

jahzielv left a comment

Uh oh!

sgress454 commented Dec 12, 2024

Uh oh!

sgress454 commented Dec 12, 2024

Uh oh!

jahzielv commented Dec 12, 2024

Uh oh!

sgress454 commented Dec 12, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

sgress454 commented Dec 12, 2024

Uh oh!

codecov Bot commented Dec 12, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

jahzielv Dec 12, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jahzielv left a comment

Choose a reason for hiding this comment

Uh oh!

sgress454 commented Dec 12, 2024

Uh oh!

sgress454 commented Dec 12, 2024

Uh oh!

jahzielv commented Dec 12, 2024

Uh oh!

sgress454 commented Dec 12, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codecov Bot commented Dec 12, 2024 •

edited

Loading

jahzielv Dec 12, 2024 •

edited

Loading