Skip to content

Fix flaky TestEnqueueMDMCommand test#24697

Merged
sgress454 merged 3 commits intomainfrom
sgress454/fix-flaky-enqueuemdmcommand-test
Dec 12, 2024
Merged

Fix flaky TestEnqueueMDMCommand test#24697
sgress454 merged 3 commits intomainfrom
sgress454/fix-flaky-enqueuemdmcommand-test

Conversation

@sgress454
Copy link
Copy Markdown
Contributor

FYI this was diagnosed and fixed using the RandoKiller.


This PR fixes the TestEnqueueMDMCommand, which has been failing intermittently here. Most of the time the /api/latest/fleet/mdm/apple/commands API is returning one result as expected, but occasionally it returns 2, for example:

[
  {
    "device_id": "B11F1FC1-F176-48CF-88A4-CB7A3DFEF987",
    "command_uuid": "63bb4313-ccbf-4647-ac07-7d15df5f92d7",
    "updated_at": "2024-12-12T02:41:36Z",
    "request_type": "ProfileList",
    "status": "Acknowledged",
    "hostname": "test-host"
  },
  {
    "device_id": "B11F1FC1-F176-48CF-88A4-CB7A3DFEF987",
    "command_uuid": "7de9d712-7524-4443-a20a-7127e6064f6e",
    "updated_at": "2024-12-12T02:41:36.141498Z",
    "request_type": "InstallEnterpriseApplication",
    "status": "Pending",
    "hostname": "test-host"
  }
]

It seems that the second command is related to trying to install a bootstrap package (uploaded by a previous test) to the newly-enrolled host.

The fix in this PR is to filter the API response to only the command we're verifying the presence of. It's a decent solve, but leaves open the edge case of a bug that causes multiple commands to be sent unexpectedly. The ideal solution would be to remove the interaction between the two tests, perhaps by deleting any created bootstraps before those tests complete, or by re-initializing the state in some other way. I don't currently have enough context to easily implement a solution like that (i.e. I know there's a "delete bootstrap" API, but not sure if that's enough to solve this issue).

@codecov
Copy link
Copy Markdown

codecov Bot commented Dec 12, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 63.57%. Comparing base (e361073) to head (79c1573).
Report is 3 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #24697      +/-   ##
==========================================
- Coverage   63.58%   63.57%   -0.01%     
==========================================
  Files        1601     1601              
  Lines      151666   151666              
  Branches     3898     3898              
==========================================
- Hits        96431    96422       -9     
- Misses      47571    47577       +6     
- Partials     7664     7667       +3     
Flag Coverage Δ
backend 64.37% <ø> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

require.Len(t, listCmdResp.Results, 1)
require.NotZero(t, listCmdResp.Results[0].UpdatedAt)
listCmdResp.Results[0].UpdatedAt = time.Time{}
results, err := json.Marshal(listCmdResp.Results)
Copy link
Copy Markdown
Contributor

@jahzielv jahzielv Dec 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tiny linting issue here (needs an error check)

Copy link
Copy Markdown
Contributor

@jahzielv jahzielv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice solve! LGTM (aside from the linting thing) because this does fix it, but to your point maybe some cleanup would be a better long-term solution?

We do have this cleanup in the TearDownTest method for the MDM suite. Perhaps we also need to delete from the host_mdm_apple_bootstrap_packages table? Or also the nano_commands table, seems like we're not cleaning that one out either.

@sgress454
Copy link
Copy Markdown
Contributor Author

Perhaps we also need to delete from the host_mdm_apple_bootstrap_packages table?

That sounds promising, I can try that!

@sgress454
Copy link
Copy Markdown
Contributor Author

Perhaps we also need to delete from the host_mdm_apple_bootstrap_packages table?

@jahzielv alas this did not fix the issue -- I tried clearing host_mdm_apple_bootstrap_packages, nano_commands and nano_cert_auth_associations (see main...refs/heads/sgress454/enqueue-mdm-command-randokiller#diff-2bd8ca2ddaad7aac0c438a2afd76a26872378249f757c9c81a31005d0e57cf1fR688-R702) but still got a run where there were two commands returned 😞

I fixed the lint issue if you're comfortable 👍 with the current approach, or have other suggestions.

@jahzielv
Copy link
Copy Markdown
Contributor

@sgress454 aw nuts! Thanks for trying that anyway!

I'd say let's go with this fix for now, and we can do a deeper investigation later.

sgress454 added a commit that referenced this pull request Dec 12, 2024
This PR adds a new workflow called "Stress Test Go Test" (aka the
RandoKiller) that allows running one or more tests repeatedly up to a
set number of times, or until a test fails. This is useful for:

* Trying to diagnose and debug a flaky test
* Verifying that a proposed fix for a flaky test actually works.

To use:

1. Create a branch whose name ends with "-randokiller"
2. Modify the .github/workflows/config/randokiller.json file to your
specifications (choosing the packages and tests to run, the mysql
matrix, and the number of runs to do)
3. Push up the branch

Since the stress test is intended to run a branch that you'll never
merge, you should feel free to add whatever logs to your tests or code
that will help diagnose failures.

I used this to diagnose and fix
#24697!
@sgress454
Copy link
Copy Markdown
Contributor Author

lol i accidentally closed this by saying that i used the randokiller to "fix" this issue 🤦

@sgress454 sgress454 reopened this Dec 12, 2024
@sgress454 sgress454 merged commit cdae174 into main Dec 12, 2024
@sgress454 sgress454 deleted the sgress454/fix-flaky-enqueuemdmcommand-test branch December 12, 2024 18:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants