Add critical policy and resolution data to device health API #16206

dherder · 2024-01-18T20:44:59Z

Goal

User story
As an endpoint operator,
I want to get a count of failing critical policies and resolution steps in Fleet's device health API (`GET /hosts/:id/health`)
so that I can block end users' access to third-party tools if they're failing one or more critical policies and show them the resolution steps.

Context

Requestor(s): @dherder
Product designer: @noahtalerman

Changes

Product

REST API changes: API design is included in the PR to the REST API docs: API design: Add critical policy and resolution data to device health API #16206 #16982
Outdated documentation changes: Covered by the PR to the REST API docs: API design: Add critical policy and resolution data to device health API #16206 #16982
Changes to paid features or tiers: The new failing_critical_policies and critical properties are only available in Fleet Premium.

Engineering

Database schema migrations: TODO
Load testing: TODO

ℹ️ Please read this issue carefully and understand it. Pay special attention to UI wireframes, especially "dev notes".

QA

Risk assessment

Requires load testing: TODO
Risk level: Low / High TODO
Risk description: TODO

Manual testing steps

Step 1
Step 2
Step 3

Testing notes

Confirmation

Engineer (@____): Added comment to user story confirming successful completion of QA.
QA (@____): Added comment to user story confirming successful completion of QA.

harrisonravazzolo · 2024-01-19T20:55:39Z

+1 🙇🏼

dherder · 2024-02-12T23:02:41Z

bringing back to Feature Fest as per @mikermcneil

harrisonravazzolo · 2024-02-15T20:09:07Z

Ideally, the response would look something like this when hitting /api/v1/fleet/hosts/{deviceID}/health:

Note the device_attestation value now in the payload.
{
  "host_id": 1,
  "health": {
    "updated_at": "2023-09-16T18:52:19Z",
    "os_version": "MacOS 14.1.2",
    "disk_encryption_enabled": true,
    "device_attestation": "passing",
    "failing_policies": [
      {
        "id": 123,
        "name": "Google Chrome is not up to date"
      }
    ],
    "vulnerable_software": [
      {
        "id": 321,
        "name": "Firefox.app",
        "version": "116.0.3"
      }
    ]
  }
}

Because we want remediation to be self-service, we would use failing_policies and vulnerable_software to craft the Slack message to the end user.
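A minimal sketch of how such a message could be assembled from the payload above. The function name `build_remediation_message` and the message wording are hypothetical, and the payload shape is the one proposed in this comment, not a confirmed Fleet response. Sending the text to Slack is left out.

```python
def build_remediation_message(report: dict) -> str:
    """Turn a device health payload into plain text for a chat message.

    Assumes the proposed shape: a top-level "host_id" and a "health" object
    holding "failing_policies" and "vulnerable_software" lists.
    """
    health = report.get("health", {})
    lines = [f"Host {report.get('host_id')} needs attention:"]

    # Either list may be missing or null when nothing is failing.
    for policy in health.get("failing_policies") or []:
        lines.append(f"- Failing policy: {policy['name']}")

    for software in health.get("vulnerable_software") or []:
        lines.append(
            f"- Vulnerable software: {software['name']} {software['version']}"
        )

    return "\n".join(lines)


# Sample data copied from the payload in this comment.
sample = {
    "host_id": 1,
    "health": {
        "failing_policies": [
            {"id": 123, "name": "Google Chrome is not up to date"}
        ],
        "vulnerable_software": [
            {"id": 321, "name": "Firefox.app", "version": "116.0.3"}
        ],
    },
}

if __name__ == "__main__":
    print(build_remediation_message(sample))
```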

dherder · 2024-02-15T21:06:00Z

@noahtalerman can we do this first (and then later do the suggestion @harrisonravazzolo notes in the above comment): If anything from the "critical" policies are failing, the count of critical policies will be included at the top level of the device health response in a new key.

harrisonravazzolo · 2024-02-15T21:14:38Z

Yes, there are a couple of iterations of this flowing in my head - ask @mikermcneil about the 'weighted' concept per policy we talked about.

But as @dherder stated, as a start, any policy marked as critical that is failing makes the device_attestation value go from passing to failing. Or a boolean value, or whatever the smarter-than-me people think is the best verbiage.

noahtalerman · 2024-02-16T14:28:37Z

Hey @harrisonravazzolo for the first iteration, would a failing_policies_count work for your use case?

cc @dherder

harrisonravazzolo · 2024-02-16T18:27:29Z

Hey @noahtalerman - not really, I can explain.

We already have this value in the /hosts endpoint, which I currently use for the device trust flows, but it requires a bit of computation on my side to work out the threshold of failing policies that decides whether a device is considered passing or failing in my environment. For example, I know we have 40 policies (this can change often), so from that number I need to see, per device, how many are failing and which ones. Not all policies are critical.

What I'm looking for is a new key in the health endpoint that is calculated by policies we mark as 'critical' - no critical failures = pass, critical failures = fail

For example, let's say that a zero-day in Chrome is patched and I want to create a basic policy that checks if Chrome.app > 121.0.6167.184. Create the policy, mark it as critical and now that is part of the calculation to the new key value returned.

If I wanted to do this now, I would have to hard-code this new policy into my automation: fetch the device health for this host, iterate through the list of failing policies, do a lookup, and if the policy appears among the failing policies, act on it. It's a lot of computation that I think Fleet should be able to do easily to allow this to scale up.
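To make the workaround concrete, here is a rough sketch of the client-side lookup described above. `CRITICAL_POLICY_IDS`, `failing_critical_policies`, and `device_passes` are hypothetical names for illustration, and the payload shape is the one from the earlier example in this thread.

```python
# Hypothetical, hand-maintained set of policy IDs that the automation treats
# as critical. Every new critical policy has to be added here by hand, which
# is the scaling problem described above.
CRITICAL_POLICY_IDS = {123}


def failing_critical_policies(report: dict, critical_ids=CRITICAL_POLICY_IDS) -> list:
    """Return the failing policies whose IDs are in the hard-coded critical set."""
    failing = report.get("health", {}).get("failing_policies") or []
    return [policy for policy in failing if policy["id"] in critical_ids]


def device_passes(report: dict, critical_ids=CRITICAL_POLICY_IDS) -> bool:
    """No critical failures = pass, any critical failure = fail."""
    return not failing_critical_policies(report, critical_ids)
```

With a server-side count, the hard-coded set and the lookup would both go away.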

noahtalerman · 2024-02-19T17:40:47Z

What I'm looking for is a new key in the health endpoint that is calculated by policies we mark as 'critical' - no critical failures = pass, critical failures = fail

@harrisonravazzolo ah, ok! Thanks.

If I'm understanding correctly, if a host is failing > 0 critical policies it's considered "unhealthy." At this point, the end user is blocked from third-party tools until they resolve the critical policies.

Would a failing_critical_policies_count work? If > 0, then the end user is blocked.

I imagine it would also be useful to add a critical property to each policy in the failing_policies array (GET /hosts/:id/health) so you can show the resolution steps for these policies to the user.
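A minimal sketch of how a client might consume these two proposed fields. The field names `failing_critical_policies_count` and `critical` come from this comment; their exact location in the response is an assumption (here they are read from whatever object is passed in), and `is_blocked` and `critical_failures` are hypothetical helper names.

```python
def is_blocked(health: dict) -> bool:
    """Block the end user when the host is failing more than 0 critical policies."""
    # A missing count is treated as 0, which is an assumption of this sketch.
    return health.get("failing_critical_policies_count", 0) > 0


def critical_failures(health: dict) -> list:
    """Return the failing policies flagged with the proposed `critical` property."""
    failing = health.get("failing_policies") or []
    return [policy for policy in failing if policy.get("critical")]
```

The count alone is enough for the allow/deny decision; the filtered list is only needed when showing the user what to fix.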

harrisonravazzolo · 2024-02-19T18:56:27Z

@noahtalerman I think this would work as a launching-off point!

I think for this use case it's best to keep it in the /health endpoint, as most of the other hosts endpoints return too much data.

I also like your suggestion of adding the critical property to the array, would be very helpful for presenting the resolution steps for sure.

noahtalerman · 2024-02-20T14:36:39Z

Hey @dherder I moved your original issue description here:

As a user using the device health api endpoint in my okta workflow, I want to have access to the number of failing policies at the top level of the json so that I don't have to use as many okta steps to determine whether or not to allow access

Problem

Today, I can only get the policy failure count from the hosts endpoint, not the device health endpoint. I want to make a single call to the device health endpoint and get the "Percent policy failure" and "Count policy failure" per host.

https://fleetdm.com/docs/rest-api/rest-api#get-hosts-device-health-report

If anything from the "critical" policies are failing, the count of critical policies will be included at the top level of the device health response in a new key.

noahtalerman · 2024-02-20T14:42:40Z

I think for this use case it's best to keep it in the /health endpoint, as most of the other hosts endpoints return too much data.

I also like your suggestion of adding the critical property to the array, would be very helpful for presenting the resolution steps for sure.

@harrisonravazzolo re including the info in the /health endpoint: agreed 💯

I also think this endpoint should return the resolution instructions so that you can show them to the user.

This pull request includes the proposed API changes: #16982

What do you think? Does this work for you?

harrisonravazzolo · 2024-02-20T18:47:15Z

noahtalerman · 2024-02-22T16:07:40Z

I reviewed this air guitar with @mikermcneil.

Let's move this user story forward with the formal drafting process leading to engineering.

@sharon-fdm heads up, I assigned this user story to you and moved it over to settled. I think it's ready for specs + estimation.

sharon-fdm · 2024-02-22T17:13:47Z

Sounds good @noahtalerman.
I'll catch up on the thread here.

sharon-fdm · 2024-02-28T21:11:43Z

@noahtalerman long conversation here.
What I understand is that this is the TL;DR:

As a user using the device health api endpoint in my okta workflow, I want to have access to the number of failing policies at the top level of the json so that I don't have to use as many okta steps to determine whether or not to allow access

Problem
Today, I can only get the policy failure count from the hosts endpoint, not the device health endpoint. I want to make a single call to the device health endpoint and get the "Percent policy failure" and "Count policy failure" per host.
https://fleetdm.com/docs/rest-api/rest-api#get-hosts-device-health-report

If anything from the "critical" policies are failing, the count of critical policies will be included at the top level of the device health response in a new key.

noahtalerman · 2024-02-29T14:42:46Z

I want to make a single call to the device health endpoint and get the "Percent policy failure" and "Count policy failure" per host.

@sharon-fdm not quite. Instead, I want to make a single call to the device health endpoint and get the count of failing critical policies and the count of failing policies per host.

For user stories, the comment section is used for ideating during drafting/design. We don't clean it up when a story is "Settled."

When a story is "Settled," please use the issue description for the summary of what we want to change: #16206 (comment)

mostlikelee · 2024-03-20T14:34:38Z

waiting on comments in draft PR

noahtalerman · 2024-04-01T19:53:50Z

Hey @harrisonravazzolo, heads up, this improvement won't make it into the upcoming 4.48 release.

Plan is to ship this in the 4.49 release.

cc @sharon-fdm @dherder @Patagonia121

noahtalerman · 2024-04-01T20:26:47Z

this feature won't make it into the upcoming 4.48 release.

Plan is to ship this in the 4.49 release.

Also, FYI @spokanemac for the release article.

noahtalerman · 2024-04-25T18:46:17Z

Hey @dherder and @Patagonia121, heads up, this customer request was shipped in 4.49 🎉

Docs are still TODO. PR is here: #16982

rachaelshaw · 2024-05-02T18:37:23Z

New PR here: #18715 (to avoid messing with PR open time KPI)

rachaelshaw · 2024-05-02T21:29:59Z

Docs are merged

fleet-release · 2024-05-02T21:30:02Z

Policy data clear,
Device health API secure,
Fleet's path shines bright here.

dherder added the ~feature fest Will be reviewed at next Feature Fest label Jan 18, 2024

noahtalerman removed the ~feature fest Will be reviewed at next Feature Fest label Jan 29, 2024

dherder added ~feature fest Will be reviewed at next Feature Fest customer-denlea labels Feb 8, 2024

noahtalerman changed the title ~~Add policy failure count and percentage to device health report~~ 🎸Add policy failure count and percentage to device health report Feb 15, 2024

noahtalerman added the ~air-guitar label Feb 15, 2024

noahtalerman added :product Product Design department (shows up on 🦢 Drafting board) and removed ~feature fest Will be reviewed at next Feature Fest labels Feb 16, 2024

noahtalerman self-assigned this Feb 16, 2024

noahtalerman changed the title ~~🎸Add policy failure count and percentage to device health report~~ Add critical policy data to device health API Feb 20, 2024

noahtalerman changed the title ~~Add critical policy data to device health API~~ Add critical policy and resolution data to device health API Feb 20, 2024

noahtalerman added a commit that referenced this issue Feb 20, 2024

API design: Add critical policy and resolution data to device health …

5ceff32

…API #16206 - API changes for the following story: Add critical policy and resolution data to device health API (#16206)

noahtalerman mentioned this issue Feb 20, 2024

API design: Add critical policy and resolution data to device health API #16206 #16982

Closed

noahtalerman added the story A user story defining an entire feature label Feb 20, 2024

noahtalerman changed the title ~~Add critical policy and resolution data to device health API~~ 🎸Add critical policy and resolution data to device health API Feb 20, 2024

noahtalerman changed the title ~~🎸Add critical policy and resolution data to device health API~~ Add critical policy and resolution data to device health API Feb 22, 2024

noahtalerman removed the ~air-guitar label Feb 22, 2024

noahtalerman assigned sharon-fdm Feb 22, 2024

noahtalerman removed their assignment Feb 22, 2024

sharon-fdm added the #g-endpoint-ops Endpoint ops product group label Feb 22, 2024

sharon-fdm assigned mostlikelee and unassigned sharon-fdm Mar 11, 2024

sharon-fdm added :release Ready to write code. Scheduled in a release. See "Making changes" in handbook. and removed :product Product Design department (shows up on 🦢 Drafting board) labels Mar 11, 2024

sharon-fdm added this to the 4.48.0-tentative milestone Mar 11, 2024

spokanemac mentioned this issue Mar 11, 2024

Release article: v4.48.0 #17531

Closed

1 task

mostlikelee mentioned this issue Mar 21, 2024

Add Failing Policy Counts to Health API #17758

Merged

4 tasks

sharon-fdm modified the milestones: 4.48.0-tentative, 4.49.0-tentative Apr 1, 2024

spokanemac mentioned this issue Apr 2, 2024

Release article: v4.49.0 #18018

Closed

1 task

lukeheath added :product Product Design department (shows up on 🦢 Drafting board) #g-endpoint-ops Endpoint ops product group and removed #g-endpoint-ops Endpoint ops product group labels Apr 24, 2024

rachaelshaw closed this as completed May 2, 2024
