Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RAM] [Flapping] Add Flapping Rules Settings #147774

Merged
merged 29 commits into from
Jan 18, 2023

Conversation

JiaweiWu
Copy link
Contributor

@JiaweiWu JiaweiWu commented Dec 19, 2022

Summary

Resolves: #143529

Initial PR for adding flapping rule settings.

This PR adds a new saved object rules-settings with the schema:

  properties: {
    flapping: {
      properties: {
        enabled: {
          type: 'boolean',
        },
        lookBackWindow: {
          type: 'long',
        },
        statusChangeThreshold: {
          type: 'long',
        },
        createdBy: {
          type: 'keyword',
        },
        updatedBy: {
          type: 'keyword',
        },
        createdAt: {
          type: 'date',
        },
        updatedAt: {
          type: 'date',
        },
      },
    },
  },

It also adds 2 new endpoints:
GET /rules/settings/_flapping
POST /rules/settings/_flapping

The new rules settings saved object is instantiated per space, using a predetermined ID to enable OCC. This new saved object allows the user to control rules flapping settings for a given space. Access control to the new saved object is done through the kibana features API. A new RulesSettingsClient was created and can be used to interact with the settings saved object. This saved object is instantiated lazily. When the code calls rulesSettingsClient.flapping().get or rulesSettingsClient.flapping().update, we will lazily create a new saved object if one does not exist for the current space. (I have explored bootstrapping this saved object elsewhere but I think this is the easiest solution, I am open to change on this).

We have set up the rules settings to support future rule settings sections by making the settings client and permissions modular. Since permission control can be easily extended by using sub features.

This PR doesn't contain integration for the task_runner to use the flapping settings, but I can do that in this PR if needed.

Rules settings feature and sub feature (under management)

rulessettingsprivileges

Rules settings settings button

with_permission_rules_config

Rules settings modal

rule_config_modal

Disabled

rules_config_modal_disabled

Rules settings settings button with insufficient permissions

no_permission_rules_config

Rules settings modal with insufficient write subfeature permissions

no_flapping_permission

Rules settings modal with insufficient read subfeature permissions

Screenshot from 2023-01-03 23-01-48

TODO: Integration testing, both API and E2E

@JiaweiWu JiaweiWu added Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) Feature:Alerting/RulesManagement Issues related to the Rules Management UX v8.7.0 labels Dec 19, 2022
@JiaweiWu
Copy link
Contributor Author

@elasticmachine merge upstream

@JiaweiWu
Copy link
Contributor Author

@elasticmachine merge upstream

@JiaweiWu
Copy link
Contributor Author

JiaweiWu commented Jan 3, 2023

@elasticmachine merge upstream

@JiaweiWu JiaweiWu added the release_note:skip Skip the PR/issue when compiling release notes label Jan 4, 2023
@JiaweiWu JiaweiWu changed the title [RAM] [Flapping] Add Flapping Rules Configuration (POC) [RAM] [Flapping] Add Flapping Rules Settings (POC) Jan 4, 2023
@JiaweiWu
Copy link
Contributor Author

JiaweiWu commented Jan 4, 2023

@elasticmachine merge upstream

@JiaweiWu JiaweiWu marked this pull request as ready for review January 4, 2023 07:03
@JiaweiWu JiaweiWu requested a review from a team as a code owner January 4, 2023 07:03
@elasticmachine
Copy link
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)

export const RULES_SETTINGS_SAVED_OBJECT_TYPE = 'rules_settings';
export const RULES_SETTINGS_SAVED_OBJECT_ID = 'rules-settings';

export const DEFAULT_FLAPPING_SETTINGS = {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't find the existing default values, so this will obviously need to be changed 🙂

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pmuellr
Copy link
Member

pmuellr commented Jan 4, 2023

Since the flapping settings are space-specific, did we consider storing these in Advanced Settings / uiSettings instead of creating our own saved object? Seems like a perfect fit, though I'm not sure how well you can deal with feature controls with those ...

@JiaweiWu
Copy link
Contributor Author

JiaweiWu commented Jan 4, 2023

@pmuellr Thanks for the feedback. I did look at advanced settings, In my mind, I think the 3 complications with using uiSettings are:

  1. The designs for rules settings has the user manage their settings on the rules page. I'm not sure we have a pattern for modifying uiSettings outside of the advanced settings page.
  2. Like you mentioned, we want to be able to extend rules settings for future settings and have them individually controllable via sub features. AFAIK advanced settings only has 1 privilege that lets the user modify all settings or not.
  3. This one is sort of a nitpick but uiSettings does imply UI settings, I feel like rules settings are going to bit more comprehensive and involved.

Although it is interesting to think about whether or not we want rules settings to be configurable via kibana.yml, which uiSettings does provide implicitly

@pmuellr
Copy link
Member

pmuellr commented Jan 4, 2023

  1. The designs for rules settings has the user manage their settings on the rules page. I'm not sure we have a pattern for modifying uiSettings outside of the advanced settings page.

Ya, we had a brief thought about figuring out how we could use an Advanced Setting for some of our config: #132183 - I think in some Slack chatter, we were thinking that perhaps we could make use of AS, but do the UX in the rules pages - as well as supporting the AS page. I think I saw someone doing that, but not completely sure.

  1. Like you mentioned, we want to be able to extend rules settings for future settings and have them individually controllable via sub features. AFAIK advanced settings only has 1 privilege that lets the user modify all settings or not.

Ya, I think that's the killer. I thought I had seen some setting that was conditionally enabled for editing, but can't see that now.

  1. This one is sort of a nitpick but uiSettings does imply UI settings, I feel like rules settings are going to bit more comprehensive and involved.

They actually recently renamed uiSettings to settings :-)

@XavierM
Copy link
Contributor

XavierM commented Jan 4, 2023

@pmuellr, we were considering using kibana settings but if I am correct , we will need to save the full json blob of all the kibana advanced setting each time there is an update. (we did not think that was super cool) and then we won't have been able to create specific sub feature and show who/when updated this setting.

Copy link
Contributor

@mikecote mikecote left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing a drive by review, will let others do the more thorough review. I'm loving the new UI and the concept of settings! I left a few questions as I think about how rule executions will load these settings in the background.

<EuiText size="s">
<FormattedMessage
id="xpack.triggersActionsUI.rulesSettings.flapping.flappingSettingsDescription"
defaultMessage="An alert will be considered flapping if it changes status {lookBackWindow} within the last {statusChangeThreshold}."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
defaultMessage="An alert will be considered flapping if it changes status {lookBackWindow} within the last {statusChangeThreshold}."
defaultMessage="An alert will be considered flapping if it changes status {statusChangeThreshold} within the last {lookBackWindow}."

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like it's also missing nouns associated with the numbers, for example "if it changes status X times within the last Y runs ...

Comment on lines +17 to +18
lookBackWindow: schema.number(),
statusChangeThreshold: schema.number(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: should we add UI / API validation to ensure lookBackWindow >= statusChangeThreshold?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep we can do that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added backend validation for this case, @mdefazio let me know if we want to have any frontend validation

@JiaweiWu JiaweiWu requested a review from a team as a code owner January 5, 2023 23:28
Copy link
Contributor

@mdefazio mdefazio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking good. I had some comments prepped but it looks like you've solved many of them already.

Looks like the copy has some mistakes (apologies if pulled directly from the mocks). Should be Alerts that quickly go between active and recovered...

image

I think the toggle label should just be Enable flapping detection (recommended)

We can also remove the spacer between the form description and the toggle so these are closer together.

@lcawl Do you have any thoughts on the tooltips for lookback and change threshold?

Re: validation question. Looks like the sliders were updated so the threshold is controlled by the lookback window.

Last point, which is perhaps more of a question: Should we keep the 'Documentation' and 'Settings' links in the top right even if we don't have rules installed? Would someone want to update settings/flapping detection before adding rules?

Comment on lines 10 to 38
export const rulesSettingsMappings: SavedObjectsTypeMappingDefinition = {
properties: {
flapping: {
properties: {
enabled: {
type: 'boolean',
},
lookBackWindow: {
type: 'long',
},
statusChangeThreshold: {
type: 'long',
},
createdBy: {
type: 'keyword',
},
updatedBy: {
type: 'keyword',
},
createdAt: {
type: 'date',
},
updatedAt: {
type: 'date',
},
},
},
},
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: Do we need all these fields indexed, or can we add index: false to any of those?

Copy link
Contributor Author

@JiaweiWu JiaweiWu Jan 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think at the very least we can add index: false to the metadata fields (createdBy, etc). Since these rules settings saved objects are created once per space, we are only querying by the ID anyways.

@mikecote @XavierM curious to hear if there's any reason to index the metadata fields?

Copy link
Contributor

@XavierM XavierM Jan 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes I agree with you @JiaweiWu because this SO is global per space so there is no need to index it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, loading settings by ID is all we'll need at this time.

Copy link
Member

@pmuellr pmuellr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking we may want to pull out the task_runner parts of this PR, since they seem to be non-operational. And made some other comments in that section, if we decide to keep it in. Kinda feels like we should wait till we start threading the settings through all the different parts of task_runner, before making some changes there.

<EuiText size="s">
<FormattedMessage
id="xpack.triggersActionsUI.rulesSettings.flapping.flappingSettingsDescription"
defaultMessage="An alert will be considered flapping if it changes status {lookBackWindow} within the last {statusChangeThreshold}."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like it's also missing nouns associated with the numbers, for example "if it changes status X times within the last Y runs ...

@@ -110,6 +115,7 @@ export async function getRuleAttributes<Params extends RuleTypeParams>(

const fakeRequest = getFakeKibanaRequest(context, spaceId, rawRule.attributes.apiKey);
const rulesClient = context.getRulesClientWithRequest(fakeRequest);
const rulesSettingsClient = context.getRulesSettingsClientWithRequest(fakeRequest);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are a few problems with this.

  1. Seems like an odd place to get the ruleSettingsClient. Presumably it has to be done somewhere in the rule task runner, to get the current values to pass along internally. If I was looking for it, I'd never look here though! I think the TaskRunner constructor would be a better place. I think loadRule() is used in other places as well, that probably don't need that client (just the SO attrs).

  2. I don't think we want the fakeRequest here, since that's the user that created the rule. They may not have privs to read the flapping config (as I understand it), and so presumably use of this client by such a user would end up throwing an error. At least that's my general understanding of what's going on. We need to do auth when users are reading/writing the settings in the UX (and HTTP APIs), but we will also need to get these settings for every rule run, and so the client for THOSE usages will need to be one of the superuser things.

Maybe since the settings aren't threaded all the way through yet, we could just remove these bits, since it doesn't appear they're doing anything.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For # 2: Good point, In fact, this made me realize that I've completely neglected to use the checkPrivileges APIs to enforce savedObject privileges at the rulesSettingsClient level since I've been relying on the router tag property to gate APIs by features. So I will implement checkPrivileges shortly here as well.

In that case, what is our preferred method of "faking" a superuser? Is there a fakeRequest as a superuser? Or should we just have a programmatic bypass of the privilege checks (seems kind of...gross, haha)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, what is our preferred method of "faking" a superuser? Is there a fakeRequest as a superuser? Or should we just have a programmatic bypass of the privilege checks (seems kind of...gross, haha)?

The recommended option is to create a scoped client that excludes the security extension. This will skip all the RBAC and pretty much uncouple a user from the SO client. You can find an example here: https://github.com/elastic/kibana/blob/main/x-pack/plugins/alerting/server/rules_client_factory.ts#L94-L97

If that doesn't work and you can make the rules settings client work with a savedObjectsClient | savedObjectsRepository, you should be able by bypass RBAC whenever there isn't a user by creating a SO repository instead. The repository doesn't require a user and bypasses all the wrappers (encrypted saved objects, security, etc.) so it's less favourable but still works.

@XavierM
Copy link
Contributor

XavierM commented Jan 13, 2023

Hey @pmuellr about that I'm thinking we may want to pull out the task_runner parts of this PR, since they seem to be non-operational. @JiaweiWu and I were looking at the way to do that mostly on the rule_registry side since it seems to be harder than the task runner in the alerting plugin. So we are thinking to pass an observable from the alerting plugin about flapping settings, we will kind of follow the same technique that core did with routing in kibana. We still need to investigate how we will get the space from the create_lifecycle_executor. So we are thinking to use an observer to avoid to do a request at each execution to get the flapping settings since we can not really put it in memory since user can change it. The observer will act to be the informer of the rule registry and we will wrap it as singleton around the ruleClient settings. We talked a little bit around it with @doakalexi and we are still determining what is our best options here since it is a little bit more complicated. It seems that it will work, What do you think? or do you have a better technique/idea?

@mikecote
Copy link
Contributor

So we are thinking to use an observer to avoid to do a request at each execution to get the flapping settings since we can not really put it in memory since user can change it.

This idea sounds good. Let's make sure the settings get updated across all the Kibanas when more than one Kibana is running (because only 1 Kibana will process the update settings request and the others need to be aware). So we may need some kind of per-Kibana polling / cache w/ TTL if we don't load it on every rule run..

@pmuellr
Copy link
Member

pmuellr commented Jan 13, 2023

So we are thinking to use an observer to avoid to do a request at each execution to get the flapping settings since we can not really put it in memory since user can change it.

This idea sounds good. Let's make sure the settings get updated across all the Kibanas when more than one Kibana is running (because only 1 Kibana will process the update settings request and the others need to be aware). So we may need some kind of per-Kibana polling / cache w/ TTL if we don't load it on every rule run..

Ya, we need to either load it every rule run (ugh!) or refresh it from the SO on an interval.

Seems like Advanced Settings has the same challenge. I wonder if they have some way of avoiding doing the GETs every time ...

@XavierM
Copy link
Contributor

XavierM commented Jan 13, 2023

So we are thinking to use an observer to avoid to do a request at each execution to get the flapping settings since we can not really put it in memory since user can change it.

This idea sounds good. Let's make sure the settings get updated across all the Kibanas when more than one Kibana is running (because only 1 Kibana will process the update settings request and the others need to be aware). So we may need some kind of per-Kibana polling / cache w/ TTL if we don't load it on every rule run..

Ya, we need to either load it every rule run (ugh!) or refresh it from the SO on an interval.

Seems like Advanced Settings has the same challenge. I wonder if they have some way of avoiding doing the GETs every time ...

No they ask the user to refresh the page :(

@JiaweiWu
Copy link
Contributor Author

In my last commit I addressed the following review feedback:

  • Design feedback from @mdefazio
  • Use index: false for rules-settings SO properties from @afharo
  • Remove modifications to task_runner, we will do the integration in another PR
  • Add option to create an authed and unauthed rules settings client (rulesSettingsClientFactory.createWithAuthorization and rulesSettingsClientFactory.create).
  • Move flapping client specific unit tests to be under rules_settings_flapping_client.test.ts
  • rename persist to getOrCreate

Copy link
Member

@pmuellr pmuellr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Member

@afharo afharo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Core changes LGTM

@JiaweiWu
Copy link
Contributor Author

@elasticmachine merge upstream

@XavierM XavierM dismissed mdefazio’s stale review January 17, 2023 22:43

has been done + validation on the API side

@JiaweiWu JiaweiWu enabled auto-merge (squash) January 17, 2023 22:55
@kibana-ci
Copy link
Collaborator

kibana-ci commented Jan 17, 2023

💛 Build succeeded, but was flaky

Failed CI Steps

Test Failures

  • [job] [logs] FTR Configs #39 / discover feature controls discover feature controls security no discover privileges "before all" hook for "shows 403"

Metrics [docs]

Module Count

Fewer modules leads to a faster build time

id before after diff
alerting 28 29 +1
triggersActionsUi 485 492 +7
total +8

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

id before after diff
alerting 425 456 +31

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id before after diff
triggersActionsUi 700.6KB 709.5KB +8.9KB

Public APIs missing exports

Total count of every type that is part of your API that should be exported but is not. This will cause broken links in the API documentation system. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats exports for more detailed information.

id before after diff
alerting 37 38 +1

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

id before after diff
alerting 39.6KB 40.7KB +1.0KB
triggersActionsUi 114.6KB 114.9KB +328.0B
total +1.4KB

Saved Objects .kibana field count

Every field in each saved object type adds overhead to Elasticsearch. Kibana needs to keep the total field count below Elasticsearch's default limit of 1000 fields. Only specify field mappings for the fields you wish to search on or query. See https://www.elastic.co/guide/en/kibana/master/saved-objects-service.html#_mappings

id before after diff
rules-settings - 9 +9
Unknown metric groups

API count

id before after diff
alerting 434 465 +31

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@JiaweiWu JiaweiWu merged commit dc28138 into elastic:main Jan 18, 2023
@kibanamachine kibanamachine added the backport:skip This commit does not require backporting label Jan 18, 2023
const flappingEnableLabel = i18n.translate(
'xpack.triggersActionsUI.rulesSettings.modal.enableFlappingLabel',
{
defaultMessage: 'Enabled flapping detection (recommended)',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the Advanced Settings UI, it uses "on" and "off" instead of "enabled" and "disabled". Can we do the same here (i.e. the text changes from "On" to "Off" depending on the state of the toggle)?:

Suggested change
defaultMessage: 'Enabled flapping detection (recommended)',
defaultMessage: 'On (recommended)',

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:skip This commit does not require backporting ci:cloud-deploy Create or update a Cloud deployment Feature:Alerting/RulesManagement Issues related to the Rules Management UX release_note:skip Skip the PR/issue when compiling release notes Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v8.7.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[ResponseOps][flapping] add flapping configuration