[ResponseOps] consider making some plugin config available in the UX via Advanced Settings #132183

Open
pmuellr opened this issue May 12, 2022 · 4 comments
Labels
Feature:Actions/Framework Issues related to the Actions Framework Feature:Alerting/RulesFramework Issues related to the Alerting Rules Framework Feature:Task Manager research Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams)

pmuellr commented May 12, 2022

The alerting, actions, and task_manager plugins have a lot of configuration options:

config code from the plugins (alerting, actions, and task_manager, in that order)

const ruleTypeSchema = schema.object({
  id: schema.string(),
  timeout: schema.maybe(schema.string({ validate: validateDurationSchema })),
});
const connectorTypeSchema = schema.object({
  id: schema.string(),
  max: schema.maybe(schema.number({ max: 100000 })),
});
const rulesSchema = schema.object({
  minimumScheduleInterval: schema.object({
    value: schema.string({
      validate: (duration: string) => {
        const validationResult = validateDurationSchema(duration);
        if (validationResult) {
          return validationResult;
        }
        const parsedDurationMs = parseDuration(duration);
        if (parsedDurationMs > ONE_DAY_IN_MS) {
          return 'duration cannot exceed one day';
        }
      },
      defaultValue: '1m',
    }),
    enforce: schema.boolean({ defaultValue: false }), // if enforce is false, only warnings will be shown
  }),
  run: schema.object({
    timeout: schema.maybe(schema.string({ validate: validateDurationSchema })),
    actions: schema.object({
      max: schema.number({ defaultValue: 100000, max: 100000 }),
      connectorTypeOverrides: schema.maybe(schema.arrayOf(connectorTypeSchema)),
    }),
    ruleTypeOverrides: schema.maybe(schema.arrayOf(ruleTypeSchema)),
  }),
});
export const DEFAULT_MAX_EPHEMERAL_ACTIONS_PER_ALERT = 10;
export const configSchema = schema.object({
  healthCheck: schema.object({
    interval: schema.string({ validate: validateDurationSchema, defaultValue: '60m' }),
  }),
  invalidateApiKeysTask: schema.object({
    interval: schema.string({ validate: validateDurationSchema, defaultValue: '5m' }),
    removalDelay: schema.string({ validate: validateDurationSchema, defaultValue: '1h' }),
  }),
  maxEphemeralActionsPerAlert: schema.number({
    defaultValue: DEFAULT_MAX_EPHEMERAL_ACTIONS_PER_ALERT,
  }),
  cancelAlertsOnRuleTimeout: schema.boolean({ defaultValue: true }),
  rules: rulesSchema,
});

const preconfiguredActionSchema = schema.object({
  name: schema.string({ minLength: 1 }),
  actionTypeId: schema.string({ minLength: 1 }),
  config: schema.recordOf(schema.string(), schema.any(), { defaultValue: {} }),
  secrets: schema.recordOf(schema.string(), schema.any(), { defaultValue: {} }),
});
const customHostSettingsSchema = schema.object({
  url: schema.string({ minLength: 1 }),
  smtp: schema.maybe(
    schema.object({
      ignoreTLS: schema.maybe(schema.boolean()),
      requireTLS: schema.maybe(schema.boolean()),
    })
  ),
  ssl: schema.maybe(
    schema.object({
      /**
       * @deprecated in favor of `verificationMode`
       **/
      rejectUnauthorized: schema.maybe(schema.boolean()),
      verificationMode: schema.maybe(
        schema.oneOf(
          [schema.literal('none'), schema.literal('certificate'), schema.literal('full')],
          { defaultValue: 'full' }
        )
      ),
      certificateAuthoritiesFiles: schema.maybe(
        schema.oneOf([
          schema.string({ minLength: 1 }),
          schema.arrayOf(schema.string({ minLength: 1 }), { minSize: 1 }),
        ])
      ),
      certificateAuthoritiesData: schema.maybe(schema.string({ minLength: 1 })),
    })
  ),
});
export type CustomHostSettings = TypeOf<typeof customHostSettingsSchema>;
export const configSchema = schema.object({
  allowedHosts: schema.arrayOf(
    schema.oneOf([schema.string({ hostname: true }), schema.literal(AllowedHosts.Any)]),
    {
      defaultValue: [AllowedHosts.Any],
    }
  ),
  enabledActionTypes: schema.arrayOf(
    schema.oneOf([schema.string(), schema.literal(EnabledActionTypes.Any)]),
    {
      defaultValue: [AllowedHosts.Any],
    }
  ),
  preconfiguredAlertHistoryEsIndex: schema.boolean({ defaultValue: false }),
  preconfigured: schema.recordOf(schema.string(), preconfiguredActionSchema, {
    defaultValue: {},
    validate: validatePreconfigured,
  }),
  proxyUrl: schema.maybe(schema.string()),
  proxyHeaders: schema.maybe(schema.recordOf(schema.string(), schema.string())),
  /**
   * @deprecated in favor of `ssl.proxyVerificationMode`
   **/
  proxyRejectUnauthorizedCertificates: schema.boolean({ defaultValue: true }),
  proxyBypassHosts: schema.maybe(schema.arrayOf(schema.string({ hostname: true }))),
  proxyOnlyHosts: schema.maybe(schema.arrayOf(schema.string({ hostname: true }))),
  /**
   * @deprecated in favor of `ssl.verificationMode`
   **/
  rejectUnauthorized: schema.boolean({ defaultValue: true }),
  ssl: schema.maybe(
    schema.object({
      verificationMode: schema.maybe(
        schema.oneOf(
          [schema.literal('none'), schema.literal('certificate'), schema.literal('full')],
          { defaultValue: 'full' }
        )
      ),
      proxyVerificationMode: schema.maybe(
        schema.oneOf(
          [schema.literal('none'), schema.literal('certificate'), schema.literal('full')],
          { defaultValue: 'full' }
        )
      ),
    })
  ),
  maxResponseContentLength: schema.byteSize({ defaultValue: '1mb' }),
  responseTimeout: schema.duration({ defaultValue: '60s' }),
  customHostSettings: schema.maybe(schema.arrayOf(customHostSettingsSchema)),
  cleanupFailedExecutionsTask: schema.object({
    enabled: schema.boolean({ defaultValue: true }),
    cleanupInterval: schema.duration({ defaultValue: '5m' }),
    idleInterval: schema.duration({ defaultValue: '1h' }),
    pageSize: schema.number({ defaultValue: 100 }),
  }),
  microsoftGraphApiUrl: schema.maybe(schema.string()),
  email: schema.maybe(
    schema.object({
      domain_allowlist: schema.arrayOf(schema.string()),
    })
  ),
});

export const taskExecutionFailureThresholdSchema = schema.object(
  {
    error_threshold: schema.number({
      defaultValue: 90,
      min: 0,
    }),
    warn_threshold: schema.number({
      defaultValue: 80,
      min: 0,
    }),
  },
  {
    validate(config) {
      if (config.error_threshold < config.warn_threshold) {
        return `warn_threshold (${config.warn_threshold}) must be less than, or equal to, error_threshold (${config.error_threshold})`;
      }
    },
  }
);
const eventLoopDelaySchema = schema.object({
  monitor: schema.boolean({ defaultValue: true }),
  warn_threshold: schema.number({
    defaultValue: 5000,
    min: 10,
  }),
});
export const configSchema = schema.object(
  {
    /* The maximum number of times a task will be attempted before being abandoned as failed */
    max_attempts: schema.number({
      defaultValue: 3,
      min: 1,
    }),
    /* How often, in milliseconds, the task manager will look for more work. */
    poll_interval: schema.number({
      defaultValue: DEFAULT_POLL_INTERVAL,
      min: 100,
    }),
    /* How many poll interval cycles can work take before it's timed out. */
    max_poll_inactivity_cycles: schema.number({
      defaultValue: DEFAULT_MAX_POLL_INACTIVITY_CYCLES,
      min: 1,
    }),
    /* How many requests can Task Manager buffer before it rejects new requests. */
    request_capacity: schema.number({
      // a nice round contrived number, feel free to change as we learn how it behaves
      defaultValue: 1000,
      min: 1,
    }),
    /* The maximum number of tasks that this Kibana instance will run simultaneously. */
    max_workers: schema.number({
      defaultValue: DEFAULT_MAX_WORKERS,
      // disable the task manager rather than trying to specify it with 0 workers
      min: 1,
    }),
    /* The threshold percentage for workers experiencing version conflicts for shifting the polling interval. */
    version_conflict_threshold: schema.number({
      defaultValue: DEFAULT_VERSION_CONFLICT_THRESHOLD,
      min: 50,
      max: 100,
    }),
    /* The rate at which we emit fresh monitored stats. By default we'll use the poll_interval (+ a slight buffer) */
    monitored_stats_required_freshness: schema.number({
      defaultValue: (config?: unknown) =>
        ((config as { poll_interval: number })?.poll_interval ?? DEFAULT_POLL_INTERVAL) + 1000,
      min: 100,
    }),
    /* The rate at which we refresh monitored stats that require aggregation queries against ES. */
    monitored_aggregated_stats_refresh_rate: schema.number({
      defaultValue: DEFAULT_MONITORING_REFRESH_RATE,
      /* don't run monitored stat aggregations any faster than once every 5 seconds */
      min: 5000,
    }),
    /* The size of the running average window for monitored stats. */
    monitored_stats_running_average_window: schema.number({
      defaultValue: DEFAULT_MONITORING_STATS_RUNNING_AVERGAE_WINDOW,
      max: 100,
      min: 10,
    }),
    /* Task Execution result warn & error thresholds. */
    monitored_task_execution_thresholds: schema.object({
      default: taskExecutionFailureThresholdSchema,
      custom: schema.recordOf(schema.string(), taskExecutionFailureThresholdSchema, {
        defaultValue: {},
      }),
    }),
    monitored_stats_health_verbose_log: schema.object({
      enabled: schema.boolean({ defaultValue: false }),
      /* The amount of seconds we allow a task to delay before printing a warning server log */
      warn_delayed_task_start_in_seconds: schema.number({
        defaultValue: DEFAULT_MONITORING_STATS_WARN_DELAYED_TASK_START_IN_SECONDS,
      }),
    }),
    ephemeral_tasks: schema.object({
      enabled: schema.boolean({ defaultValue: false }),
      /* How many requests can Task Manager buffer before it rejects new requests. */
      request_capacity: schema.number({
        // a nice round contrived number, feel free to change as we learn how it behaves
        defaultValue: 10,
        min: 1,
        max: DEFAULT_MAX_EPHEMERAL_REQUEST_CAPACITY,
      }),
    }),
    event_loop_delay: eventLoopDelaySchema,
    /* These are not designed to be used by most users. Please use caution when changing these */
    unsafe: schema.object({
      exclude_task_types: schema.arrayOf(schema.string(), { defaultValue: [] }),
    }),
  },
  {
    validate: (config) => {
      if (
        config.monitored_stats_required_freshness &&
        config.poll_interval &&
        config.monitored_stats_required_freshness < config.poll_interval
      ) {
        return `The specified monitored_stats_required_freshness (${config.monitored_stats_required_freshness}) is invalid, as it is below the poll_interval (${config.poll_interval})`;
      }
    },
  }
);

During a retrospective, it was noted that keeping all the meta-data around the config up-to-date is a chore. We have keys that are allowed to be used as env vars in Docker in one file, an allow-list for cloud usage in another file in another repo, and asciidoc documentation in several places, some duplicated between cloud and Kibana docs.

One thought was to try to move some of this configuration so that it could be edited "live", via Advanced Settings (AS).

pros:

  • no reboot required to change a setting
  • don't need to coordinate config keys across multiple repos / files

cons:

  • seems like we would have to support both traditional config and AS, for a long time
  • we're not sure yet which settings we would want to allow to be updated this way - is there some chance that none of our current settings would be applicable / reasonable / safe to allow as AS?

I think we'd need to do the following:

  • identify what config makes sense to be available in AS; some presumably won't be (for example, preconfigured connectors)
  • decide whether new privileges are required, or whether AS already has a notion of "superuser"-only editing, for any settings that should be changed only by an admin (for example, probably anything in task manager config)
  • would we want per-space AS config - doesn't seem like it, but maybe something like enabledActionTypes would make sense, to prevent some connectors from being used in some spaces
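
To make the first two bullets a bit more concrete, here is a minimal TypeScript sketch. It is not Kibana code. It assumes a hypothetical allow-list (`liveSettings`) of keys that may be edited live, each with its own validator and an `adminOnly` flag; every other key keeps its static config value. The key strings, `liveSettings`, `canEdit`, and `resolveSetting` are all made up for illustration; only the `min: 100` bound is taken from the `poll_interval` schema quoted above.

```typescript
// Hypothetical sketch: none of these names are Kibana APIs.
type SettingValue = string | number | boolean;

interface LiveSettingSpec {
  // True when only an admin-like user should be able to edit the setting.
  adminOnly: boolean;
  // Returns an error message, or undefined when the value is acceptable.
  validate: (value: SettingValue) => string | undefined;
}

// Allow-list of keys that may be overridden at runtime. Keys that are not
// listed (for example preconfigured connectors) stay config-only.
const liveSettings: Record<string, LiveSettingSpec> = {
  'alerting.cancelAlertsOnRuleTimeout': {
    adminOnly: false,
    validate: (v) => (typeof v === 'boolean' ? undefined : 'expected a boolean'),
  },
  'task_manager.poll_interval': {
    adminOnly: true,
    // Mirrors the `min: 100` bound in the task_manager schema.
    validate: (v) =>
      typeof v === 'number' && v >= 100 ? undefined : 'expected a number >= 100',
  },
};

interface ResolvedSetting {
  value: SettingValue | undefined;
  source: 'config' | 'advancedSettings';
}

// A key is editable only if it is allow-listed and the user is permitted.
function canEdit(key: string, isAdmin: boolean): boolean {
  const spec = liveSettings[key];
  if (spec === undefined) return false;
  return isAdmin || !spec.adminOnly;
}

// A valid override on an allow-listed key wins; anything else falls back
// to the value from static config.
function resolveSetting(
  key: string,
  staticConfig: Record<string, SettingValue>,
  overrides: Record<string, SettingValue>
): ResolvedSetting {
  const fallback: ResolvedSetting = { value: staticConfig[key], source: 'config' };
  const spec = liveSettings[key];
  const override = overrides[key];
  if (spec === undefined || override === undefined) return fallback;
  return spec.validate(override) === undefined
    ? { value: override, source: 'advancedSettings' }
    : fallback;
}
```

In this sketch an invalid override is ignored rather than treated as fatal, so a bad value entered in the UI cannot stop the server from using its configured value. Whether that is the right behavior is one of the open questions.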
elasticmachine commented

Pinging @elastic/response-ops (Team:ResponseOps)

mikecote commented

Linking with #137303.

pmuellr commented Jan 3, 2023

One thing I didn't realize when originally creating this issue is that Advanced Settings / uiSettings only supported space-specific settings at the time. So, this became a non-starter.

Since then, support for global settings has been added; here's one of the referenced PRs around it: #147229. So, seems do-able again ...
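
To illustrate why the scope matters, here is a small TypeScript sketch of a store that keeps global and space-specific values apart. It is an assumption-laden toy, not how the uiSettings service is implemented: `ScopedSettingsStore`, the `Scope` type, and the key strings are all hypothetical.

```typescript
// Hypothetical sketch: this does not reflect the real uiSettings service.
type Scope = 'global' | 'space';

class ScopedSettingsStore {
  private readonly scopes: Record<string, Scope>;
  private readonly globalValues = new Map<string, unknown>();
  private readonly spaceValues = new Map<string, Map<string, unknown>>();

  constructor(scopes: Record<string, Scope>) {
    this.scopes = scopes;
  }

  // Global keys ignore spaceId; space-scoped keys require one.
  set(key: string, value: unknown, spaceId?: string): void {
    const scope = this.scopes[key];
    if (scope === undefined) throw new Error(`unknown setting: ${key}`);
    if (scope === 'global') {
      this.globalValues.set(key, value);
      return;
    }
    if (spaceId === undefined) throw new Error(`${key} requires a space id`);
    let values = this.spaceValues.get(spaceId);
    if (values === undefined) {
      values = new Map<string, unknown>();
      this.spaceValues.set(spaceId, values);
    }
    values.set(key, value);
  }

  // A global key returns the same value no matter which space asks.
  get(key: string, spaceId: string): unknown {
    return this.scopes[key] === 'global'
      ? this.globalValues.get(key)
      : this.spaceValues.get(spaceId)?.get(key);
  }
}
```

With only space-scoped storage, something like a task manager setting could end up with a different value in every space, even though the server can only honor one. A global scope gives such settings a single value, while a key like enabledActionTypes could still be space-scoped if per-space behavior is wanted.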

pmuellr commented Jan 3, 2023

It's been a while since we looked / thought about this one, so I'm going to set it up for triage again, especially since now it might be possible to implement (where it wasn't before, with space-specific advanced settings).
