Find hosts with the most issues #18115

Closed
5 of 9 tasks
noahtalerman opened this issue Apr 8, 2024 · 22 comments
Labels: customer-schur · #g-endpoint-ops (Endpoint ops product group) · :product (Product Design department, shows up on 🦢 Drafting board) · story (A user story defining an entire feature)

@noahtalerman
Member

noahtalerman commented Apr 8, 2024

Goal

User story
As a security leader,
I want to sort hosts by issues (# failing policies + # critical vulns, i.e. CVSS > 8.9)
so that I can ask for the owners of these hosts to focus on fixing the hosts w/ the most issues.
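
A minimal sketch of the count and sort described in this user story. It is written in Go for illustration only; the `Host` type, its fields, and the function names are hypothetical and are not taken from Fleet's codebase.

```go
package main

import (
	"fmt"
	"sort"
)

// Host is a hypothetical, simplified host record used only for illustration.
type Host struct {
	Name            string
	FailingPolicies int       // number of policies the host is currently failing
	CVSSScores      []float64 // CVSS score of each vulnerability found on the host
}

// criticalCVSS is the threshold from the user story: scores above 8.9 are critical.
const criticalCVSS = 8.9

// IssueCount returns # failing policies + # vulnerabilities with CVSS > 8.9.
func IssueCount(h Host) int {
	n := h.FailingPolicies
	for _, s := range h.CVSSScores {
		if s > criticalCVSS {
			n++
		}
	}
	return n
}

// SortByIssues orders hosts in place from most issues to fewest.
// sort.SliceStable keeps the input order for hosts with equal counts.
func SortByIssues(hosts []Host) {
	sort.SliceStable(hosts, func(i, j int) bool {
		return IssueCount(hosts[i]) > IssueCount(hosts[j])
	})
}

func main() {
	hosts := []Host{
		{Name: "laptop-a", FailingPolicies: 1, CVSSScores: []float64{7.5, 9.8}},
		{Name: "laptop-b", FailingPolicies: 3, CVSSScores: []float64{9.1, 9.0}},
		{Name: "laptop-c", FailingPolicies: 0, CVSSScores: []float64{8.9}},
	}
	SortByIssues(hosts)
	for _, h := range hosts {
		fmt.Printf("%s: %d issues\n", h.Name, IssueCount(h))
	}
}
```

With this sample data, laptop-b (5 issues) sorts first, then laptop-a (2), then laptop-c (0, because a score of exactly 8.9 is not above the threshold).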

Context

Changes

Product

  • UI changes: Figma wireframes
  • REST API changes: API design PR
  • Permissions changes: All roles can access the number of issues
  • Outdated documentation changes: If documented, update the definition of issues in Fleet. UPDATE: No mention of these "issues" found (noahtalerman 2024-07-01)
  • Changes to paid features or tiers: Failing policies count and Issues count is available in Fleet Free and Fleet Premium. Critical vulns count is only available in Fleet Premium

Engineering

  • Database schema migrations: TODO
  • Load testing: TODO

ℹ️  Please read this issue carefully and understand it. Pay special attention to UI wireframes, especially "dev notes".

QA

Risk assessment

  • Requires load testing: Yes
  • Risk level: High
  • Risk description: We added an additional DB write each time policy results from a host are processed.

Load testing plan

For the below scenarios, monitor latency and DB performance.

  1. Start with 100K hosts failing a policy. Modify the SQL of that policy.

  2. Start with 100K hosts failing a policy. Modify the platforms of that policy (like uncheck "windows").

  3. Start with 100K hosts failing a policy. Transfer them to a different team.

  4. Start with 100K hosts failing a policy. Delete that policy.

Testing notes

Confirmation

  1. Engineer (@____): Added comment to user story confirming successful completion of QA.
  2. QA (@____): Added comment to user story confirming successful completion of QA.
@noahtalerman noahtalerman added story A user story defining an entire feature customer-schur :product Product Design department (shows up on 🦢 Drafting board) labels Apr 8, 2024
@noahtalerman noahtalerman self-assigned this Apr 8, 2024
@noahtalerman
Member Author

Add sort to the "Issues" column on the Hosts page. Update issues count = # failing critical policies + # of vulnerabilities w/ known exploits (CISA KEV)

@noahtalerman
Member Author

noahtalerman commented Apr 11, 2024

Hey @cjwalton this story covers the next iteration of Fleet's version of a host "risk score."

Our understanding is that y'all are looking for a way to prioritize hosts that need fixing/updating/patching.

The plan is to allow y'all to sort hosts by "issues" in Fleet: # critical policies failed + # vulns w/ known exploits (from CISA KEV)

Jason: EPSS takes CISA KEV into account. Maybe let's start w/ EPSS > 70% and/or CVSS > 8

We want to start simple so that we move quickly for y'all while leaving the door open for future iterations.

I recorded a Loom video that walks through the improvement in more detail here: https://www.loom.com/share/d594151980ec47298efafb159f0e91b1?sid=c4465470-cc23-47d9-9d86-b2542898774f

What do you think?

@noahtalerman noahtalerman added ~feature fest Will be reviewed at next Feature Fest and removed :product Product Design department (shows up on 🦢 Drafting board) labels Apr 18, 2024
@noahtalerman noahtalerman removed their assignment Apr 19, 2024
@noahtalerman noahtalerman removed the ~feature fest Will be reviewed at next Feature Fest label Apr 19, 2024
@noahtalerman noahtalerman added ~feature fest Will be reviewed at next Feature Fest #g-endpoint-ops Endpoint ops product group and removed ~feature fest Will be reviewed at next Feature Fest labels May 9, 2024
@noahtalerman noahtalerman self-assigned this May 10, 2024
@noahtalerman noahtalerman added the :product Product Design department (shows up on 🦢 Drafting board) label May 10, 2024
@noahtalerman
Member Author

noahtalerman commented May 22, 2024

Jason: EPSS takes CISA KEV into account. Maybe let's start w/ EPSS > 70% and/or CVSS > 8

Hey @cjwalton, based on your feedback (above) we tweaked the "Issues" count to include critical vulns (CVEs w/ CVSS score > 8.9):
Screenshot 2024-05-22 at 9 55 35 AM

The plan is to start with this.

In future iterations we can add the ability to customize the "Issues" count. For example:

  • Only include critical vulns (no failing policies)
  • Include vulns w/ EPSS > 70%. With or w/o critical vulns. With or w/o failing policies
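
To make the customization idea above concrete, here is a rough sketch of a configurable count. Everything in it is an assumption for illustration: the `Vuln` and `IssueOptions` types, the option names, and the choice to count a vulnerability only once even when it matches both the CVSS and EPSS rules. None of it reflects an actual Fleet API.

```go
package main

import "fmt"

// Vuln is a hypothetical vulnerability record; field names are illustrative.
type Vuln struct {
	CVSS float64 // CVSS base score, 0.0 to 10.0
	EPSS float64 // EPSS probability, 0.0 to 1.0
}

// IssueOptions is a hypothetical set of toggles for what counts as an issue.
type IssueOptions struct {
	IncludeFailingPolicies bool
	IncludeCriticalVulns   bool // vulns with CVSS > 8.9
	IncludeHighEPSS        bool // vulns with EPSS > 70%
}

// CountIssues applies the selected rules. A vulnerability that matches both
// the CVSS rule and the EPSS rule is counted once (an assumption, not a spec).
func CountIssues(failingPolicies int, vulns []Vuln, opts IssueOptions) int {
	n := 0
	if opts.IncludeFailingPolicies {
		n += failingPolicies
	}
	for _, v := range vulns {
		critical := opts.IncludeCriticalVulns && v.CVSS > 8.9
		highEPSS := opts.IncludeHighEPSS && v.EPSS > 0.70
		if critical || highEPSS {
			n++
		}
	}
	return n
}

func main() {
	vulns := []Vuln{
		{CVSS: 9.8, EPSS: 0.10},
		{CVSS: 7.0, EPSS: 0.85},
		{CVSS: 9.5, EPSS: 0.90},
		{CVSS: 5.0, EPSS: 0.05},
	}
	planned := IssueOptions{IncludeFailingPolicies: true, IncludeCriticalVulns: true}
	vulnsOnly := IssueOptions{IncludeCriticalVulns: true}
	everything := IssueOptions{IncludeFailingPolicies: true, IncludeCriticalVulns: true, IncludeHighEPSS: true}

	fmt.Println("planned:", CountIssues(2, vulns, planned))
	fmt.Println("critical vulns only:", CountIssues(2, vulns, vulnsOnly))
	fmt.Println("policies + critical + EPSS:", CountIssues(2, vulns, everything))
}
```

For a host with 2 failing policies and the four sample vulnerabilities, the three configurations give 4, 2, and 5 respectively.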

Does that work for you?

@sharon-fdm
Contributor

sharon-fdm commented May 29, 2024

BE 5
FE 5

@RachelElysia
Member

@sharon-fdm

@jacobshandling mentioned the scope of this for FE is probably larger than anticipated.

TL;DR: It looks like the device user page and the host details page use the same code, the HostSummary card, to render issues. Make that into a reusable component for ManageHostsPage, then add sort and an empty state, and basic tests for all of it?

To be thorough, this might be a 5.

@sharon-fdm sharon-fdm added :release Ready to write code. Scheduled in a release. See "Making changes" in handbook. and removed :product Product Design department (shows up on 🦢 Drafting board) labels Jun 3, 2024
@sharon-fdm sharon-fdm added this to the 4.52.0-tentative milestone Jun 3, 2024
@lukeheath lukeheath removed this from the 4.52.0-tentative milestone Jun 7, 2024
jacobshandling pushed a commit that referenced this issue Jun 19, 2024
## Issue
Unreleased fix for #18115 

## Description
- BE shows `0` count for empty state so FE needs to account for `0`
instead of `undefined`

## Screenshot of fix
<img width="1219" alt="Screenshot 2024-06-18 at 5 00 04 PM"
src="https://github.com/fleetdm/fleet/assets/71795832/cd6ec944-ce99-4f8e-a630-9bf037abd0b9">


# Checklist for submitter

If some of the following don't apply, delete the relevant line.

<!-- Note that API documentation changes are now addressed by the
product design team. -->

- [x] Manual QA for all new/changed functionality
@getvictor
Member

@xpkoala The regular QA was done by @RachelElysia

I added a few test scenarios for load testing in the description, and moved the issue back to Awaiting QA.

@xpkoala
Contributor

xpkoala commented Jun 21, 2024

Thanks @getvictor!

@xpkoala
Contributor

xpkoala commented Jun 24, 2024

Found an issue when modifying a policy that affects 50k+ hosts with 100k+ hosts enrolled.

level=error ts=2024-06-24T15:33:47.706507287Z component=http user=tomas@fleetdm.com method=PATCH uri=/api/latest/fleet/policies/1 took=3.712123448s name="Q1 (1)" sql="SELECT * FROM osquery_info" err="saving policy: update failing policies in host issues: Error 1436 (HY000): Thread stack overrun: 242191 bytes used of a 262144 byte stack, and 20000 bytes needed. Use 'mysqld --thread_stack=#' to specify a bigger stack."

Reproduce:

  • With a policy that affects a large number of hosts (50k in this scenario)
  • Choose to edit the affected OS's for the policy (I removed 'mac' which would have set the # of hosts affected by the policy to 0)
  • Error banner "Something went wrong"

A 422 HTTP error is recorded in the web console.
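
This thread does not record how the error was fixed. As a general illustration only, one common way to keep a bulk update over tens of thousands of hosts bounded is to process the affected host IDs in fixed-size batches, so that no single statement has to reference all of them. The sketch below shows just the batching step in Go; `batchIDs` and the batch size of 10,000 are hypothetical and are not taken from Fleet's code.

```go
package main

import "fmt"

// batchIDs splits ids into consecutive slices of at most size elements.
// It returns nil when size is not positive or ids is empty.
func batchIDs(ids []uint, size int) [][]uint {
	if size <= 0 {
		return nil
	}
	var batches [][]uint
	for start := 0; start < len(ids); start += size {
		end := start + size
		if end > len(ids) {
			end = len(ids)
		}
		batches = append(batches, ids[start:end])
	}
	return batches
}

func main() {
	// 50,000 hypothetical host IDs, matching the scale in the report above.
	ids := make([]uint, 0, 50000)
	for i := uint(1); i <= 50000; i++ {
		ids = append(ids, i)
	}
	for i, b := range batchIDs(ids, 10000) {
		// In a real system each batch would drive one bounded UPDATE;
		// here we only print the batch sizes.
		fmt.Printf("batch %d: %d host IDs\n", i, len(b))
	}
}
```

Run as written, this splits the 50,000 IDs into five batches of 10,000 each.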

@sharon-fdm
Contributor

Remaining work reset to 1 point.

@xpkoala
Contributor

xpkoala commented Jun 24, 2024

Docker image being used to test this fix is 4530loadtestA

getvictor added a commit that referenced this issue Jun 25, 2024
#18115 
Fixing unreleased bug found when load testing host issues update.
@xpkoala xpkoala modified the milestones: 4.54.0-tentative, 4.53.0 Jun 25, 2024
getvictor added a commit that referenced this issue Jun 25, 2024
#18115 
Fixing issue seen in load test:
```
level=error ts=2024-06-25T17:09:08.230514976Z cron=vulnerabilities schedule=vulnerabilities instanceID="5boTc/PamsSp8Jsh4kiEOpECmPu+bmOAJaVX4XV7ZOG4vgO4U6peHyxH8mFQhBXYJt+roRpwNuGmUoEI8n/otg==" err="running job" details="get critical vulnerabilities count: Error 1114 (HY000): The table '/rdsdbdata/tmp/#sql127_6b4b_ad107' is full" jobID=update_host_issues_vulnerabilities_counts
```
getvictor added a commit that referenced this issue Jun 25, 2024
#18115
Fixing unreleased bug found when load testing host issues update.

(cherry picked from commit 246c6d1)
getvictor added a commit that referenced this issue Jun 25, 2024
#18115
Fixing issue seen in load test:
```
level=error ts=2024-06-25T17:09:08.230514976Z cron=vulnerabilities schedule=vulnerabilities instanceID="5boTc/PamsSp8Jsh4kiEOpECmPu+bmOAJaVX4XV7ZOG4vgO4U6peHyxH8mFQhBXYJt+roRpwNuGmUoEI8n/otg==" err="running job" details="get critical vulnerabilities count: Error 1114 (HY000): The table '/rdsdbdata/tmp/#sql127_6b4b_ad107' is full" jobID=update_host_issues_vulnerabilities_counts
```

(cherry picked from commit 918773b)
@lukeheath lukeheath added :product Product Design department (shows up on 🦢 Drafting board) and removed :release Ready to write code. Scheduled in a release. See "Making changes" in handbook. labels Jun 26, 2024
@marko-lisica
Member

Hey @pintomi1989 this story has shipped.

@noahtalerman There are TODOs in issue description to solve before moving to closed.

noahtalerman added a commit that referenced this issue Jul 1, 2024
API changes for the "Find hosts with the most issues" story
- #18115
@noahtalerman
Member Author

There are TODOs in issue description to solve before moving to closed.

Docs are merged!

@fleet-release
Contributor

Sorting hosts by flaws,
A beacon in the cloud haze,
Security evolves.


8 participants