Add critical policy and resolution data to device health API #16206

Closed
7 tasks
dherder opened this issue Jan 18, 2024 · 23 comments
Assignees
Labels
customer-denlea #g-endpoint-ops Endpoint ops product group :product Product Design department (shows up on 🦢 Drafting board) :release Ready to write code. Scheduled in a release. See "Making changes" in handbook. story A user story defining an entire feature
Milestone

Comments

@dherder
Contributor

dherder commented Jan 18, 2024

Goal

User story
As an endpoint operator,
I want to get a count of failing critical policies and resolution steps in Fleet's device health API (GET /hosts/:id/health)
so that I can block end users' access to third-party tools if they're failing at least one critical policy and show them the resolution steps.

Context

Changes

Product

Engineering

  • Database schema migrations: TODO
  • Load testing: TODO

ℹ️  Please read this issue carefully and understand it. Pay special attention to UI wireframes, especially "dev notes".

QA

Risk assessment

  • Requires load testing: TODO
  • Risk level: Low / High TODO
  • Risk description: TODO

Manual testing steps

  1. Step 1
  2. Step 2
  3. Step 3

Testing notes

Confirmation

  1. Engineer (@____): Added comment to user story confirming successful completion of QA.
  2. QA (@____): Added comment to user story confirming successful completion of QA.
@dherder dherder added the ~feature fest Will be reviewed at next Feature Fest label Jan 18, 2024
@harrisonravazzolo
Member

+1 🙇🏼

@noahtalerman noahtalerman removed the ~feature fest Will be reviewed at next Feature Fest label Jan 29, 2024
@dherder dherder added ~feature fest Will be reviewed at next Feature Fest customer-denlea labels Feb 8, 2024
@dherder
Contributor Author

dherder commented Feb 12, 2024

bringing back to Feature Fest as per @mikermcneil

@harrisonravazzolo
Member

harrisonravazzolo commented Feb 15, 2024

Ideally, the response would look something like this when hitting /api/v1/fleet/hosts/{deviceID}/health:

Note the device_attestation value now in the payload.
{
  "host_id": 1,
  "health": {
    "updated_at": "2023-09-16T18:52:19Z",
    "os_version": "MacOS 14.1.2",
    "disk_encryption_enabled": true,
    "device_attestation": "passing",
    "failing_policies": [
      {
        "id": 123,
        "name": "Google Chrome is not up to date"
      }
    ],
    "vulnerable_software": [
      {
        "id": 321,
        "name": "Firefox.app",
        "version": "116.0.3"
      }
    ]
  }
}

As we want this to be self-service remediation, we would use the failing_policies and vulnerable_software arrays to craft the Slack message to the end user, like so:

slack_msg
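
A minimal Python sketch of how a message like that could be assembled from the payload above. The helper name `build_slack_message` and the message wording are hypothetical; only the `failing_policies` and `vulnerable_software` keys come from the example response, and actually posting to Slack is left out.

```python
def build_slack_message(report: dict) -> str:
    """Format a self-service remediation message from a device health report.

    Assumes the response shape shown in the example payload above.
    """
    health = report.get("health", {})
    lines = []

    # One bullet per failing policy, using the policy name from the payload.
    policies = health.get("failing_policies") or []
    if policies:
        lines.append("Failing policies:")
        lines.extend(f"- {p['name']}" for p in policies)

    # One bullet per vulnerable software item, with its installed version.
    software = health.get("vulnerable_software") or []
    if software:
        lines.append("Vulnerable software:")
        lines.extend(f"- {s['name']} {s['version']}" for s in software)

    if not lines:
        return "Your device is healthy."
    return "\n".join(lines)
```

The returned text could then be handed to whatever step in the workflow sends the Slack message.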

@noahtalerman noahtalerman changed the title Add policy failure count and percentage to device health report 🎸Add policy failure count and percentage to device health report Feb 15, 2024
@dherder
Contributor Author

dherder commented Feb 15, 2024

@noahtalerman can we do this first (and then later do the suggestion @harrisonravazzolo notes in the above comment): if any of the "critical" policies are failing, the count of failing critical policies will be included at the top level of the device health response in a new key.

@harrisonravazzolo
Member

Yes, there are a couple of iterations of this flowing in my head - ask @mikermcneil about the 'weighted' concept per policy we talked about.

But as @dherder stated, as a start, any policy marked as critical that is failing makes the device_attestation value go from passing to failing. Or it could be a boolean value, or whatever the smarter-than-me people think is the best verbiage.

@noahtalerman
Member

Hey @harrisonravazzolo for the first iteration, would a failing_policies_count work for your use case?

cc @dherder

@noahtalerman noahtalerman added :product Product Design department (shows up on 🦢 Drafting board) and removed ~feature fest Will be reviewed at next Feature Fest labels Feb 16, 2024
@noahtalerman noahtalerman self-assigned this Feb 16, 2024
@harrisonravazzolo
Member

Hey @noahtalerman - not really, I can explain.

We already have this value in the /hosts endpoint, which I am currently leveraging for the device trust flows, but it requires a bit of computation on my side to work out the threshold of policies that determines whether a device is considered passing or failing in my environment. For example, I know we have 40 policies (this can change often), so from that number I need to see, per device, how many are failing and which ones. Not all policies are critical.

What I'm looking for is a new key in the health endpoint that is calculated by policies we mark as 'critical' - no critical failures = pass, critical failures = fail

For example, let's say that a zero-day in Chrome is patched and I want to create a basic policy that checks if Chrome.app > 121.0.6167.184. Create the policy, mark it as critical and now that is part of the calculation to the new key value returned.

If I wanted to do this now, I would have to hard-code this new policy into my automation: fetch the device health for this host, iterate through the list of failing policies, do a lookup, and if this value exists in the failing policies, act on it. It's a lot of computation that I think Fleet should be able to do easily to allow this to scale up.
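
To make that workaround concrete, here is a rough Python sketch of the client-side check. The set `CRITICAL_POLICY_NAMES` and the function `is_device_passing` are hypothetical names for illustration, and the report is assumed to have the shape of the health response shown earlier in this thread.

```python
# Hypothetical hard-coded list that the automation has to keep in sync by hand
# every time a policy is added, renamed, or reclassified.
CRITICAL_POLICY_NAMES = {"Google Chrome is not up to date"}


def is_device_passing(report: dict) -> bool:
    """Return False if any failing policy is one we hard-coded as critical."""
    failing = report.get("health", {}).get("failing_policies") or []
    return not any(p["name"] in CRITICAL_POLICY_NAMES for p in failing)
```

The pain point is the hard-coded set: the automation, not Fleet, is the source of truth for which policies are critical.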

@noahtalerman
Member

noahtalerman commented Feb 19, 2024

What I'm looking for is a new key in the health endpoint that is calculated by policies we mark as 'critical' - no critical failures = pass, critical failures = fail

@harrisonravazzolo ah, ok! Thanks.

If I'm understanding correctly, if a host is failing > 0 critical policies it's considered "unhealthy." At this point, the end user is blocked from third-party tools until they resolve the critical policies.

Would a failing_critical_policies_count work? If > 0, then the end user is blocked.

I imagine it would also be useful to add a critical property to each policy in the failing_policies array (GET /hosts/:id/health) so you can show the resolution steps for these policies to the user.
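
As a sketch of how a consumer might use the proposed count, assuming it ends up as a `failing_critical_policies_count` key inside the `health` object (the exact placement is an assumption here, and `should_block` is a hypothetical helper name):

```python
def should_block(report: dict) -> bool:
    """Block the end user when the host fails one or more critical policies.

    Assumes a `failing_critical_policies_count` key inside `health`;
    a missing key is treated as zero failures.
    """
    count = report.get("health", {}).get("failing_critical_policies_count", 0)
    return count > 0
```

With this, the automation no longer needs its own list of critical policies; it only compares one number against zero.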

@harrisonravazzolo
Member

@noahtalerman I think this would work as a launching-off point!

I think for this use case it's best to keep it in the /health endpoint, as most of the other hosts endpoints return too much data.

I also like your suggestion of adding the critical property to the array, would be very helpful for presenting the resolution steps for sure.

@noahtalerman noahtalerman changed the title 🎸Add policy failure count and percentage to device health report Add critical policy data to device health API Feb 20, 2024
@noahtalerman noahtalerman changed the title Add critical policy data to device health API Add critical policy and resolution data to device health API Feb 20, 2024
noahtalerman added a commit that referenced this issue Feb 20, 2024
…API #16206

- API changes for the following story: Add critical policy and resolution data to device health API (#16206)
@noahtalerman
Member

Hey @dherder I moved your original issue description here:

As a user using the device health API endpoint in my Okta workflow, I want to have access to the number of failing policies at the top level of the JSON so that I don't have to use as many Okta steps to determine whether or not to allow access.

Problem

Today, I can only get the policy failure count from the hosts endpoint, not the device health endpoint. I want to make a single call to the device health endpoint and get the "Percent policy failure" and "Count policy failure" per host.

https://fleetdm.com/docs/rest-api/rest-api#get-hosts-device-health-report

If any of the "critical" policies are failing, the count of failing critical policies will be included at the top level of the device health response in a new key.

@noahtalerman
Member

I think for this use case it's best to keep it in the /health endpoint, as most of the other hosts endpoints return too much data.

I also like your suggestion of adding the critical property to the array, would be very helpful for presenting the resolution steps for sure.

@harrisonravazzolo re including the info in the /health endpoint: agreed 💯

I also think this endpoint should return the resolution instructions so that you can show them to the user.
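
For illustration, a minimal Python sketch of turning that into user-facing text. It assumes each entry in `failing_policies` carries a boolean `critical` property and a `resolution` string; both field names are assumptions in this sketch, and `critical_resolution_steps` is a hypothetical helper.

```python
def critical_resolution_steps(report: dict) -> list[str]:
    """List "<policy name>: <resolution>" for each failing critical policy.

    Assumes `critical` and `resolution` fields on each failing policy;
    policies without a resolution get a placeholder message.
    """
    failing = report.get("health", {}).get("failing_policies") or []
    return [
        f"{p['name']}: {p.get('resolution') or 'No resolution steps provided.'}"
        for p in failing
        if p.get("critical")
    ]
```

Non-critical failures are skipped, so the end user only sees the steps that unblock access.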

This pull request includes the proposed API changes: #16982

What do you think? Does this work for you?

@noahtalerman noahtalerman added the story A user story defining an entire feature label Feb 20, 2024
@noahtalerman noahtalerman changed the title Add critical policy and resolution data to device health API 🎸Add critical policy and resolution data to device health API Feb 20, 2024
@harrisonravazzolo
Member

giphy

@noahtalerman noahtalerman changed the title 🎸Add critical policy and resolution data to device health API Add critical policy and resolution data to device health API Feb 22, 2024
@noahtalerman noahtalerman removed their assignment Feb 22, 2024
@noahtalerman
Member

I reviewed this air guitar with @mikermcneil.

Let's move this user story forward with the formal drafting process leading to engineering.

@sharon-fdm heads up, I assigned this user story to you and moved it over to settled. I think it's ready for specs + estimation.

@sharon-fdm sharon-fdm added the #g-endpoint-ops Endpoint ops product group label Feb 22, 2024
@sharon-fdm
Collaborator

sharon-fdm commented Feb 22, 2024

Sounds good @noahtalerman.
I'll catch up on the thread here.

@sharon-fdm
Collaborator

sharon-fdm commented Feb 28, 2024

@noahtalerman long conversation here.
What I understand is that this is the TL;DR:

As a user using the device health API endpoint in my Okta workflow, I want to have access to the number of failing policies at the top level of the JSON so that I don't have to use as many Okta steps to determine whether or not to allow access.

Problem
Today, I can only get the policy failure count from the hosts endpoint, not the device health endpoint. I want to make a single call to the device health endpoint and get the "Percent policy failure" and "Count policy failure" per host.
https://fleetdm.com/docs/rest-api/rest-api#get-hosts-device-health-report

If any of the "critical" policies are failing, the count of failing critical policies will be included at the top level of the device health response in a new key.

@noahtalerman
Member

noahtalerman commented Feb 29, 2024

I want to make a single call to the device health endpoint and get the "Percent policy failure" and "Count policy failure" per host.

@sharon-fdm not quite. Instead, I want to make a single call to the device health endpoint and get the count of failing critical policies and the count of failing policies per host.

For user stories, the comment section is used for ideating during drafting/design. We don't clean it up when a story is "Settled."

When a story is "Settled," please use the issue description for the summary of what we want to change: #16206 (comment)

@sharon-fdm sharon-fdm assigned mostlikelee and unassigned sharon-fdm Mar 11, 2024
@sharon-fdm sharon-fdm added :release Ready to write code. Scheduled in a release. See "Making changes" in handbook. and removed :product Product Design department (shows up on 🦢 Drafting board) labels Mar 11, 2024
@sharon-fdm sharon-fdm added this to the 4.48.0-tentative milestone Mar 11, 2024
@mostlikelee
Contributor

waiting on comments in draft PR

@noahtalerman
Member

Hey @harrisonravazzolo, heads up, this improvement won't make it into the upcoming 4.48 release.

Plan is to ship this in the 4.49 release.

cc @sharon-fdm @dherder @Patagonia121

@noahtalerman
Member

this feature won't make it into the upcoming 4.48 release.

Plan is to ship this in the 4.49 release.

Also, FYI @spokanemac for the release article.

@lukeheath lukeheath added :product Product Design department (shows up on 🦢 Drafting board) #g-endpoint-ops Endpoint ops product group and removed #g-endpoint-ops Endpoint ops product group labels Apr 24, 2024
@noahtalerman
Member

noahtalerman commented Apr 25, 2024

Hey @dherder and @Patagonia121, heads up, this customer request was shipped in 4.49 🎉

Docs are still TODO. PR is here: #16982

@rachaelshaw
Member

New PR here: #18715 (to avoid messing with PR open time KPI)

@rachaelshaw
Member

Docs are merged

@fleet-release
Contributor

Policy data clear,
Device health API secure,
Fleet's path shines bright here.

Development

No branches or pull requests

8 participants