Regularly benchmarking and stress-testing the alerting framework and rule types #119845

mikecote · 2021-11-29T14:23:38Z

The alerting system must be regularly benchmarked and stress-tested before every production release. Preferably mirroring known complex customer environments. This ensures we do not introduce any regressions by benchmarking and comparing key health metrics.

There are various ongoing performance testing & framework / tool creation efforts that relate to Kibana, some research has been done to ensure the pros/cons and applicability of each so we can invest where we see the best value proposition balanced with quickest roi we can get. As research continues it seems clear we'll plan to extend one or more tools or frameworks into a given solution. So, while we may start with one tool as an incremental first-step or as a starting point, we're developing this to a set of requirements, foremost.

Front-runner for starting-point tool/library: The Kibana Alerting team / ResponseOps kbn-alert-load Alert / Rule testing tool

It is known that this repo is forked and under current usage/dev by several Security side team members, we will research and sync on the current state / capability.
... see below for options that were declined for now.

Here are some of the WIP Requirements we are evaluating and building out:

Stretch / next goals:

Confirm/enable tool to allow testing over different Rule type needs (some WIP by Security team)
Confirm/enable tool to allow testing over Cases needs
Confirm/enable tool to allow testing over one or more connector 3rd party needs (bulk updates etc)
- focus on email connector next?

FYI: Frameworks/Tools that have been researched and ruled out for immediate purposes:

Kibana-QA team created an API load testing tool - kibana-load-testing. It was researched by Patrick M in 2020 and Alert/Rules team did not end up collaborating on it, it uses the Kibana HTTP API and so isn't best suited to assess the (background process) Task Manager at the moment
Kibana Working group's coming tool - (including folks like Spencer A / Tyler S / Daniel M / Liza K - they are discussing and working on a performance testing tool and CI integration for Kibana needs.

Eric is bringing requirements / context and generally participating with the Kib Perf Working group (v2) to benefit both groups.
Their timeline is cited as TBD for when Kibana Task Manager centric automation support will be focused, the UI is where they are investing first (as of Feb 2022). This is partly done knowing that kbn-alert-load tool exists and is sufficient for teams (based on its usasge).

elasticmachine · 2021-11-29T14:23:41Z

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

alexfrancoeur · 2021-12-06T15:00:36Z

Dropping this in here, but if we aren't already talking to the rally team, we may be able to use the dataset from these upcoming tracks: elastic/rally-tracks#222, elastic/apm-server#6731

mikecote · 2022-01-17T16:40:35Z

I will remove this issue (and assignees) from our iteration plan for now, as we would like for @EricDavisX to pick this up in the coming weeks with the research that is done so far.

EricDavisX · 2022-02-07T15:44:03Z

I'm researching this and hoping to finish evaluating what usage the ResponseOps and Security side teams have done in the next few days. With that done I'll be able to come up with a list of requirements and then also a modest plan for what I'll do next/further here.

EricDavisX · 2022-02-10T22:09:58Z

Still researching the kbn-alert-load tool - thanks all for the help. Also Finishing a first draft of a requirements document that QA will assess (with Engineering too) - then we'll form a plan and adjust the bullet points above

EricDavisX · 2022-02-23T21:54:30Z

MLR-QA team is wrapping up a prototype jenkins job to run kbn-alert-load tool (while security team has a prototype done in build-kite, fyi!) - I'll post details in slack for RespOps team

EricDavisX · 2022-03-23T20:50:35Z

I can update where we are. we did a proof of concept in jenkins and have decided to continue iterating on it from the machine-learning-qa-infra jenkins server:

https://ci.ml-qa.com/job/stack-testing/job/kibana/job/performance/job/response-ops-performance-test/
this job runs nightly and runs the 4-cluster 'versions-1000' suite that runs a cluster with 1000 rules every 5 second-run interval with other standard kibana defaults against 'main' branch and next-latest main branch and 7.17.x-snapshot latest and 7.16.3 (as a stable reference).

we've enhanced the jenkins run to always delete the ecctl deploys. we'll continue updating this periodically with progress.

EricDavisX · 2022-05-06T20:19:08Z

We have achieved an MVP that includes the checked metrics above, it runs nightly against several versions via cloud (CFT region) and reports pass/fail into our slack channel - I'm going to focus on other work, though may help drive QA implementing a few small remaining low-hanging fruit items.

mikecote added the Meta label Dec 1, 2021

mikecote assigned pmuellr, mikecote, ymao1 and YulNaumenko Dec 2, 2021

mikecote added this to In Progress in Kibana Alerting Dec 2, 2021

mikecote removed their assignment Dec 2, 2021

mikecote mentioned this issue Dec 9, 2021

Move kbn-alert-load tool into Kibana alerting #88389

Closed

YulNaumenko removed their assignment Dec 16, 2021

mikecote removed this from In Progress in Kibana Alerting Jan 6, 2022

mikecote unassigned pmuellr and ymao1 Jan 17, 2022

kobelb added the needs-team Issues missing a team label label Jan 31, 2022

botelastic bot removed the needs-team Issues missing a team label label Jan 31, 2022

EricDavisX self-assigned this Feb 3, 2022

EricDavisX mentioned this issue Feb 9, 2022

[CTI] Move kbn-alert-load to elastic org #124718

Closed

3 tasks

EricDavisX mentioned this issue Mar 23, 2022

[Alerting] Performance testing tool for alerting need to have the ability to create/clean ecctl Kibana deployment. #121457

Closed

EricDavisX removed their assignment May 6, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Regularly benchmarking and stress-testing the alerting framework and rule types #119845

Regularly benchmarking and stress-testing the alerting framework and rule types #119845

mikecote commented Nov 29, 2021 •

edited by EricDavisX

elasticmachine commented Nov 29, 2021

alexfrancoeur commented Dec 6, 2021 •

edited

mikecote commented Jan 17, 2022 •

edited

EricDavisX commented Feb 7, 2022 •

edited

EricDavisX commented Feb 10, 2022

EricDavisX commented Feb 23, 2022

EricDavisX commented Mar 23, 2022

EricDavisX commented May 6, 2022

Regularly benchmarking and stress-testing the alerting framework and rule types #119845

Regularly benchmarking and stress-testing the alerting framework and rule types #119845

Comments

mikecote commented Nov 29, 2021 • edited by EricDavisX

elasticmachine commented Nov 29, 2021

alexfrancoeur commented Dec 6, 2021 • edited

mikecote commented Jan 17, 2022 • edited

EricDavisX commented Feb 7, 2022 • edited

EricDavisX commented Feb 10, 2022

EricDavisX commented Feb 23, 2022

EricDavisX commented Mar 23, 2022

EricDavisX commented May 6, 2022

mikecote commented Nov 29, 2021 •

edited by EricDavisX

alexfrancoeur commented Dec 6, 2021 •

edited

mikecote commented Jan 17, 2022 •

edited

EricDavisX commented Feb 7, 2022 •

edited