[Task Manager] adds basic observability into Task Manager's runtime operations #77868

Merged
merged 87 commits into from
Oct 27, 2020

Conversation

gmmorris
Contributor

@gmmorris gmmorris commented Sep 18, 2020

Summary

closes #77456

This PR adds an internal monitoring mechanism in Task Manager, which keeps track of a variety of metrics, and a health API endpoint, which makes the monitored statistics accessible.

Exposed Metrics

There are three different sections to the stats returned by the health api.

  • configuration: Summarizes Task Manager's current configuration.
  • workload: Summarizes the workload in the current deployment.
  • runtime: Tracks Task Manager's performance.

Configuring the Stats

There are four new configurations:

  • xpack.task_manager.monitored_stats_required_freshness - The required freshness of critical "Hot" stats, which means that if key stats (last polling cycle time, for example) haven't been refreshed within the specified duration, the _health endpoint and service will report an Error status. By default this is inferred from the configured poll_interval and is set to poll_interval plus a 1s buffer.
  • xpack.task_manager.monitored_aggregated_stats_refresh_rate - Dictates how often we refresh the "Cold" metrics. These metrics require an aggregation against Elasticsearch and add load to the system, so we want to limit how often we execute them. We also infer the required freshness of these "Cold" metrics from this configuration: if these stats have not been updated within the required duration, the _health endpoint and service will report an Error status. This covers the entire workload section of the stats. By default this is configured to 60s, and as a result the required freshness defaults to 61s (refresh rate plus a 1s buffer).
  • xpack.task_manager.monitored_stats_running_average_window - Dictates the size of the window used to calculate the running average of various "Hot" stats, such as the time it takes to run a task and the drift that tasks experience. These stats are collected throughout the lifecycle of tasks, and this window dictates how large the queue we keep in memory is and how many values we calculate the average against. We do not calculate the average on every new value, but only when the time comes to summarize the stats before logging them or returning them to the API endpoint.
  • xpack.task_manager.monitored_task_execution_thresholds - Configures the threshold of failed task executions at which the warn or error health status will be set, either at a default level or at a custom level for specific task types. This allows you to mark the health as error when any task type fails 90% of the time, but set it to error at 50% of the time for task types that you consider critical. This value can be set to any number between 0 and 100, and a threshold is hit when the value exceeds this number. This means that you can avoid setting the status to error by setting the threshold at 100, or hit error the moment any task fails by setting the threshold to 0 (as it will exceed 0 once a single failure occurs).
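
The freshness rule described for the first two settings can be sketched as follows. This is a minimal TypeScript illustration, not the code in this PR: the helper names are hypothetical, and treating a stat as stale only once its age strictly exceeds the required freshness is an assumption.

```typescript
// Hypothetical helpers; only the "plus a 1s buffer" rule comes from the description above.
const FRESHNESS_BUFFER_MS = 1000;

/** Default required freshness for "Hot" stats: poll_interval plus a 1s buffer. */
function defaultHotFreshness(pollIntervalMs: number): number {
  return pollIntervalMs + FRESHNESS_BUFFER_MS;
}

/** Required freshness for "Cold" stats: the aggregated refresh rate plus a 1s buffer. */
function coldFreshness(refreshRateMs: number): number {
  return refreshRateMs + FRESHNESS_BUFFER_MS;
}

/** Assumption: a stat is stale once its age strictly exceeds the required freshness. */
function isStale(lastUpdate: Date, now: Date, requiredFreshnessMs: number): boolean {
  return now.getTime() - lastUpdate.getTime() > requiredFreshnessMs;
}
```

With the defaults mentioned above, a 60000ms refresh rate gives a 61000ms required freshness for the workload section.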

For example, in your kibana.yml:

xpack.task_manager.monitored_stats_required_freshness: 5000
xpack.task_manager.monitored_aggregated_stats_refresh_rate: 60000
xpack.task_manager.monitored_stats_running_average_window: 50
xpack.task_manager.monitored_task_execution_thresholds:
  default:
    error_threshold: 70
    warn_threshold: 50
  custom:
    "alerting:always-firing":
      error_threshold: 50
      warn_threshold: 0
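
To make the threshold semantics concrete, here is a hedged TypeScript sketch of how a failure percentage could be mapped to a status using the configuration above. The type and function names are hypothetical, and the status strings ("OK", "warn", "error") are an assumption. The one rule taken from the description is that a threshold is hit only when the value exceeds it.

```typescript
// Hypothetical types mirroring the YAML example above.
interface Thresholds {
  error_threshold: number;
  warn_threshold: number;
}

interface ThresholdConfig {
  default: Thresholds;
  custom: Record<string, Thresholds>;
}

type TaskTypeStatus = "OK" | "warn" | "error";

const exampleConfig: ThresholdConfig = {
  default: { error_threshold: 70, warn_threshold: 50 },
  custom: {
    "alerting:always-firing": { error_threshold: 50, warn_threshold: 0 },
  },
};

/** Picks the custom thresholds for a task type if present, else the defaults. */
function statusForTaskType(
  taskType: string,
  failedPercent: number,
  config: ThresholdConfig
): TaskTypeStatus {
  const thresholds = config.custom[taskType] || config.default;
  // A threshold is hit only when the failure percentage exceeds it.
  if (failedPercent > thresholds.error_threshold) return "error";
  if (failedPercent > thresholds.warn_threshold) return "warn";
  return "OK";
}
```

Under this sketch, a single failure of an `alerting:always-firing` task already yields `warn`, because any non-zero failure percentage exceeds a `warn_threshold` of 0.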

Consuming Health Stats

Task Manager exposes a /api/task_manager/_health api which returns the latest stats.
Calling this API is designed to be fast and doesn't actually perform any checks; rather, it returns the result of the latest stats in the system. It is designed in such a way that you could call it from an external service on a regular basis without worrying that you'll be adding substantial load to the system.
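
As an illustration of calling the endpoint from an external service, here is a small TypeScript sketch. Everything except the `/api/task_manager/_health` path is an assumption: the `Fetcher` type stands in for whatever HTTP client you use (authentication is not shown), and the response shape is reduced to the fields shown later in this description.

```typescript
// Minimal, assumed view of the health response; `stats` may be absent when no data exists yet.
interface SectionSummary {
  timestamp: string;
  status: string;
}

interface HealthResponse {
  timestamp: string;
  status: string;
  last_update?: string;
  stats?: Record<string, SectionSummary>;
}

// Hypothetical abstraction over an HTTP client, so the sketch has no dependencies.
type Fetcher = (url: string) => Promise<{
  ok: boolean;
  status: number;
  json(): Promise<unknown>;
}>;

/** Fetches the latest stats; the endpoint only returns already-collected data. */
async function fetchHealth(baseUrl: string, fetcher: Fetcher): Promise<HealthResponse> {
  const response = await fetcher(`${baseUrl}/api/task_manager/_health`);
  if (!response.ok) {
    throw new Error(`Health request failed with HTTP ${response.status}`);
  }
  return (await response.json()) as HealthResponse;
}

/** Lists the sections whose status is anything other than OK. */
function unhealthySections(health: HealthResponse): string[] {
  const stats = health.stats || {};
  return Object.keys(stats).filter((name) => stats[name].status !== "OK");
}
```

An external monitor could call `fetchHealth` on a timer and alert when `unhealthySections` returns a non-empty list.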

Additionally, the metrics are logged out into Task Manager's DEBUG logger at a regular cadence (dictated by the Polling Interval).
If you wish to enable DEBUG logging in your Kibana instance, you will need to add the following to your kibana.yml:

logging:
  loggers:
      - context: plugins.taskManager
        appenders: [console]
        level: debug

Please bear in mind that these stats are logged as often as your poll_interval configuration, which means it could add substantial noise to your logs.
We would recommend only enabling this level of logging temporarily.

Understanding the Exposed Stats

As mentioned above, the health api exposes three sections: configuration, workload and runtime.
Each section has a timestamp and a status, which indicate when the last update to this section took place and whether the health of this section was evaluated as OK, Warning or Error.

The root has its own status, which indicates the state of the system overall as inferred from the statuses of the sections.
An Error status in any section will cause the whole system to display as Error.
A Warning status in any section will cause the whole system to display as Warning.
An OK status will only be displayed when all sections are marked as OK.
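
The roll-up rule above amounts to "the worst section status wins". A minimal TypeScript sketch, assuming three status strings (the exact values used by the API are not confirmed here) and a hypothetical function name:

```typescript
// Assumed status values; the sections above call them OK, Warning and Error.
type Status = "OK" | "warn" | "error";

/** Derives the root status from the per-section statuses. */
function overallStatus(sectionStatuses: Status[]): Status {
  // Any Error makes the whole system Error.
  if (sectionStatuses.some((status) => status === "error")) return "error";
  // Otherwise any Warning makes the whole system Warning.
  if (sectionStatuses.some((status) => status === "warn")) return "warn";
  // OK only when every section is OK.
  return "OK";
}
```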

The root timestamp is the time in which the summary was exposed (either to the DEBUG logger or the http api) and the last_update is the last time any one of the sections was updated.

The Configuration Section

The configuration section summarizes Task Manager's current configuration, including dynamic configurations which change over time, such as poll_interval and max_workers which adjust in reaction to changing load on the system.

These are "Hot" stats which are updated whenever a change happens in the configuration.

The Workload Section

The workload section summarizes the workload in the current deployment, listing the tasks in the system, their types and their current status.

It includes three sub sections:

  • The number of tasks scheduled in the system, broken down by type and status.
  • The number of idle overdue tasks, whose runAt has expired.
  • Execution density in the next minute or so (configurable), which shows how many tasks are scheduled to execute in the scope of each polling interval. This can give us an idea of how much load there is on the current Kibana deployment.

These are "Cold" stats which are updated at a regular cadence, configured by the monitored_aggregated_stats_refresh_rate config.
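
The execution density can be pictured as a histogram of upcoming runAt times, with one bucket per polling interval. The PR computes this with an Elasticsearch aggregation; the TypeScript sketch below does the equivalent bucketing in memory purely for illustration. The function name and parameters are hypothetical.

```typescript
/**
 * Counts how many tasks are due within each upcoming polling interval.
 * Bucket 0 covers [now, now + pollInterval), bucket 1 the interval after that, and so on.
 */
function estimateScheduleDensity(
  runAts: Date[],
  now: Date,
  pollIntervalMs: number,
  bucketCount: number
): number[] {
  const density: number[] = [];
  for (let i = 0; i < bucketCount; i++) {
    density.push(0);
  }
  for (const runAt of runAts) {
    const offsetMs = runAt.getTime() - now.getTime();
    // Overdue tasks are reported separately, so they are skipped here.
    if (offsetMs < 0) continue;
    const bucket = Math.floor(offsetMs / pollIntervalMs);
    // Tasks scheduled beyond the window are ignored.
    if (bucket < bucketCount) density[bucket] += 1;
  }
  return density;
}
```

A bucket whose count exceeds max_workers suggests that some of those tasks will have to wait for a later polling cycle.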

The Runtime Section

The runtime section tracks Task Manager's performance as it runs, making note of task execution time, drift etc.
These include:

  • The time it takes a task to run (mean and median, using a configurable running average window, 50 by default)
  • The average drift that tasks experience (mean and median, using the same configurable running average window as above). Drift tells us how long after its scheduled time a task typically executes.
  • The polling rate (the timestamp of the last time a polling cycle completed) and the frequency of each result [No tasks | Filled task pool | Unexpectedly ran out of workers] over the past 50 polling cycles (using the same window size as the one used for running averages)
  • The Success | Retry | Failure ratio by task type. This is different from the workload stats, which tell you what's in the queue but can't keep track of retries or of non-recurring tasks, as these are wiped off the index when completed.

These are "Hot" stats which are updated reactively as Tasks are executed and interacted with.
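
The running average window can be sketched as a bounded queue whose mean and median are computed only when the stats are summarized, as described for monitored_stats_running_average_window. The PR itself imports the stats-lite package for its statistics; the TypeScript sketch below hand-rolls the math to stay self-contained, and the class name is hypothetical.

```typescript
/** Keeps the most recent `size` values and summarizes them on demand. */
class RunningWindow {
  private readonly size: number;
  private values: number[] = [];

  constructor(size: number) {
    this.size = size;
  }

  /** Records a value, evicting the oldest once the window is full. */
  push(value: number): void {
    this.values.push(value);
    if (this.values.length > this.size) {
      this.values.shift();
    }
  }

  /** Computes mean and median lazily, only when a summary is requested. */
  summarize(): { mean: number; median: number } {
    if (this.values.length === 0) {
      return { mean: 0, median: 0 };
    }
    const sum = this.values.reduce((total, value) => total + value, 0);
    const sorted = this.values.slice().sort((a, b) => a - b);
    const mid = Math.floor(sorted.length / 2);
    const median =
      sorted.length % 2 === 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid];
    return { mean: sum / sorted.length, median };
  }
}
```

For example, a window of size 50 that has seen durations of 14ms and 15ms would summarize to a mean and median of 14.5.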

Example Stats

For example, if you curl the /api/task_manager/_health endpoint, you might get these stats:

{
     /* the time these stats were returned by the api */
    "timestamp": "2020-10-05T18:26:11.346Z",
     /* the overall status of the system */
    "status": "OK",
     /* last time any stat was updated in this output */
    "last_update": "2020-10-05T17:57:55.411Z",    
    "stats": {
        "configuration": {      /* current configuration of TM */
            "timestamp": "2020-10-05T17:56:06.507Z",
            "status": "OK",
            "value": {
                "max_workers": 10,
                "poll_interval": 3000,
                "request_capacity": 1000,
                "max_poll_inactivity_cycles": 10,
                "monitored_aggregated_stats_refresh_rate": 60000,
                "monitored_stats_running_average_window": 50
            }
        },
        "workload": {  /* The workload of this deployment */
            "timestamp": "2020-10-05T17:57:06.534Z",
            "status": "OK",
            "value": {
                "count": 6,        /* count of tasks in the system */
                "task_types": {   /* what tasks are there and what status are they in */
                    "actions_telemetry": {
                        "count": 1,
                        "status": {
                            "idle": 1
                        }
                    },
                    "alerting_telemetry": {
                        "count": 1,
                        "status": {
                            "idle": 1
                        }
                    },
                    "apm-telemetry-task": {
                        "count": 1,
                        "status": {
                            "idle": 1
                        }
                    },
                    "endpoint:user-artifact-packager": {
                        "count": 1,
                        "status": {
                            "idle": 1
                        }
                    },
                    "lens_telemetry": {
                        "count": 1,
                        "status": {
                            "idle": 1
                        }
                    },
                    "session_cleanup": {
                        "count": 1,
                        "status": {
                            "idle": 1
                        }
                    }
                },

                /* Frequency of recurring tasks schedules */
                "schedule": [  
                    ["60s", 1],   /* 1 task, every 60s */
                    ["3600s", 3],  /* 3 tasks every hour */
                    ["720m", 1]
                ],
                /* There are no overdue tasks in this system at the moment */
                "overdue": 0, 
                /* This is the schedule density, it shows a histogram of all the  polling intervals in the next minute (or, if 
                    pollInterval is configured unusually high it will show a min of 2 refresh intervals into the future, and a max of 50 buckets).
                    Here we see that on the 3rd polling interval from *now* (which is ~9 seconds from now, as pollInterval is `3s`) there is one task due to run.
                    We also see that there are 5 due two intervals later, which is fine as we have a max workers of `10`
                 */
                "estimated_schedule_density": [0, 0, 1, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
            }
        },
        "runtime": {
            "timestamp": "2020-10-05T17:57:55.411Z",
            "status": "OK",
            "value": {
                "polling": {
                        /* When was the last polling cycle? */
                    "last_successful_poll": "2020-10-05T17:57:55.411Z",
                        /* What is the frequency of polling cycle result?
                            Here we see 94% of "NoTasksClaimed" and 6% "PoolFilled" */
                    "result_frequency_percent_as_number": {
                        "NoTasksClaimed": 94,
                        "RanOutOfCapacity": 0, /* This is a legacy result, we might want to rename - it tells us when a polling cycle resulted in claiming more tasks than we had workers for, but the name doesn't make much sense outside of the context of the code */
                        "PoolFilled": 6
                    }
                },
                /* on average, the tasks in this deployment run 1.7s after their scheduled time */
                "drift": {
                    "mean": 1720,
                    "median": 2276
                },
                "execution": {
                    "duration": {
                               /* on average, the `endpoint:user-artifact-packager` tasks take 15ms to run */
                        "endpoint:user-artifact-packager": {
                            "mean": 15,
                            "median": 14.5
                        },
                        "session_cleanup": {
                            "mean": 28,
                            "median": 28
                        },
                        "lens_telemetry": {
                            "mean": 100,
                            "median": 100
                        },
                        "actions_telemetry": {
                            "mean": 135,
                            "median": 135
                        },
                        "alerting_telemetry": {
                            "mean": 197,
                            "median": 197
                        },
                        "apm-telemetry-task": {
                            "mean": 1347,
                            "median": 1347
                        }
                    },
                    "result_frequency_percent_as_number": {
                               /* and 100% of `endpoint:user-artifact-packager` tasks have completed successfully within the running average window (the past 50 runs by default, configurable by `monitored_stats_running_average_window`) */
                        "endpoint:user-artifact-packager": {
                            "status": "OK",
                            "Success": 100,
                            "RetryScheduled": 0,
                            "Failed": 0
                        },
                        "session_cleanup": {
                            /* `error` status as 90% of results are `Failed` */
                            "status": "error",
                            "Success": 5,
                            "RetryScheduled": 5,
                            "Failed": 90
                        },
                        "lens_telemetry": {
                            "status": "OK",
                            "Success": 100,
                            "RetryScheduled": 0,
                            "Failed": 0
                        },
                        "actions_telemetry": {
                            "status": "OK",
                            "Success": 100,
                            "RetryScheduled": 0,
                            "Failed": 0
                        },
                        "alerting_telemetry": {
                            "status": "OK",
                            "Success": 100,
                            "RetryScheduled": 0,
                            "Failed": 0
                        },
                        "apm-telemetry-task": {
                            "status": "OK",
                            "Success": 100,
                            "RetryScheduled": 0,
                            "Failed": 0
                        }
                    }
                }
            }
        }
    }
}

* master: (92 commits)
  [ILM] Data tiers for 7.10 (elastic#76126)
  [ML] Transforms: Fixes styling of preview grid pagination in summary step (elastic#77789)
  [Drilldowns] Beta badge support. Mark URL Drilldown as Beta (elastic#75654)
  Re-enable session lifespan, idle timeout api integration tests and use unique names for the security test reports. (elastic#77746)
  [Alerting] renames code in alerting RBAC exemption to make it easier to maintain (elastic#77598)
  [Alerting & Actions] Overwrite SOs when updating instead of partially updating (elastic#73688)
  fixed react warning in Suspense in alert flyout (elastic#77777)
  [APM] Track usage of Gold+ features (elastic#77630)
  Visualize: Bad request when working with histogram aggregation (elastic#77684)
  remove legacy ES plugin (elastic#77703)
  [Lens] change name of custom query to filters (elastic#77725)
  skip flaky suite (elastic#76239)
  remove visual aspects of baseline job (elastic#77815)
  skip flaky suite (elastic#77835)
  Fixes typo in data recognizer text (elastic#77691)
  management/update trusted_apps jest snapshot
  [build] Use Elastic hosted UBI minimal base image (elastic#77776)
  [APM] Add transaction error rate alert (elastic#76933)
  [Security Solution] [Detections] Remove file validation on import route (elastic#77770)
  [Enterprise Search][tech debt] Add Kea logic paths for easier debugging/defaults (elastic#77698)
  ...
* master: (226 commits)
  [Enterprise Search] Added Logic for the Credentials View (elastic#77626)
  [CSM] Js errors (elastic#77919)
  Add the @kbn/apm-config-loader package (elastic#77855)
  [Security Solution] Refactor useSelector (elastic#75297)
  Implement tagcloud renderer (elastic#77910)
  [APM] Alerting: Add global option to create all alert types (elastic#78151)
  [Ingest pipelines] Upload indexed document to test a pipeline (elastic#77939)
  TypeScript cleanup in visualizations plugin (elastic#78428)
  Lazy load metric & mardown visualizations (elastic#78391)
  [Detections][EQL] EQL rule execution in detection engine (elastic#77419)
  Update tutorial-full-experience.asciidoc (elastic#75836)
  Update tutorial-define-index.asciidoc (elastic#75754)
  Add support for runtime field types to mappings editor. (elastic#77420)
  [Monitoring] Usage collection (elastic#75878)
  [Docs][Actions] Add docs for Jira and IBM Resilient (elastic#78316)
  [Security Solution][Resolver] Update @timestamp formatting (elastic#78166)
  [Security Solution] Fix app layout (elastic#76668)
  [Security Solution][Resolver] 2 new functions to DAL (elastic#78477)
  Adds new elasticsearch client to telemetry plugin (elastic#78046)
  skip flaky suite (elastic#78512) (elastic#78511) (elastic#78510) (elastic#78509) (elastic#78508) (elastic#78507) (elastic#78506) (elastic#78505) (elastic#78504) (elastic#78503) (elastic#78502) (elastic#78501) (elastic#78500)
  ...
* master:
  Fix APM lodash imports (elastic#78438)
  Add deprecated message to tile_map and region_map visualizations. (elastic#77683)
  Fix Lens smokescreen flaky tests (elastic#78566)
  updated discover with alt text (elastic#77660)
  Fix types (elastic#78619)
  Update tutorial-visualizing.asciidoc (elastic#76977)
  Update tutorial-discovering.asciidoc (elastic#76976)
  [Search] Error notification alignment (elastic#77788)
  Update tutorial-define-index.asciidoc (elastic#76975)
  [Lens] Fieldless operations (elastic#78080)
  [Usage Collection] [schema] Explicit "array" definition (elastic#78141)
  Update tutorial-define-index.asciidoc (elastic#76973)
  Fix --no-basepath references in doc (elastic#78570)
  Move StubIndexPattern to data plugin and convert to TS. (elastic#78518)
  Index pattern class - remove unused methods (elastic#78538)
  [Security Solution] [ALL] Eliminates all console.error and console.warn from Jest output (elastic#78523)
  [Actions] avoids setting a default dedupKey on PagerDuty (elastic#77773)
  First stab at developer-focussed saved objects docs (elastic#71430)
  remove unnecessary config validations (elastic#78527)
* master: (288 commits)
  add core-js production dependency (elastic#79395)
  Add support for sharing saved objects to all spaces (elastic#76132)
  [Alerting UI] Display a banner to users when some alerts have failures, added alert statuses column and filters (elastic#79038)
  load js-yaml lazily (elastic#79092)
  skip flaky suite (elastic#77278)
  Fix agentPolicyUpdateEventHandler() to use app context soClient for creation of actions (elastic#79341)
  [Security Solution] Untitled Timeline created when first action is to add note (elastic#78988)
  [Security Solutions][Detection Engine] Updates the edit rules page to only have what is selected for editing (elastic#79233)
  Cleanup yarn.lock from duplicates (elastic#66617)
  [kbn/optimizer] implement more efficient auto transpilation for node (elastic#79052)
  [Ingest Manager] Rename Fleet setup and requirement, Fleet => Central… (elastic#79291)
  [core/server/plugins] don't run discovery in dev server parent process (take 2) (elastic#79358)
  [babel/register] remove from build (take 2) (elastic#79379)
  [Security Solution] Changes rules table tag display (elastic#77102)
  define integrationTestRoot in config file and use to define screensho… (elastic#79247)
  Revert "[babel/register] remove from build (elastic#79176)"
  skip flaky suite (elastic#75241)
  [Uptime] Synthetics UI (elastic#77960)
  [Security Solution] [Detections] Only display actions options if user has "read" privileges (elastic#78812)
  [babel/register] remove from build (elastic#79176)
  ...
Member

@pmuellr pmuellr left a comment

LGTM; I was hoping this wouldn't be quite so complex, but obviously it's going to be somewhat complex, so ... it is what it is :-)

I gave this a whirl on my 100 alerts w/4 active instances load, ran it for a while, checking the health, looked good, deleted all the alerts, leaving the queued actions to error (since the alert is no longer available) to look at the "error" side as well. Works as expected!

@@ -145,6 +145,15 @@ export interface AggregationOptionsByType {
>;
keyed?: boolean;
} & AggregationSourceOptions;
range: {
Member

are these changes needed? I'm wondering if we will need to pre-req the apm plugin, if not now, in some future where the compilation depends on plugin deps.

Contributor

We are investigating moving these types into the JS client (see #77720.) No timeframe on that, but worth considering.

I'm always in favor of explicitly declaring dependencies, though in this case we would create a circular one since APM has Task Manager as an optional dependency.

As for the type changes here I'd like to defer to @dgieselaar on that one before approving.

Contributor Author

Yeah I wasn't crazy about this, but checked with @dgieselaar and he gave a thumbsup.
He said he'd review on the PR 👍

* you may not use this file except in compliance with the Elastic License.
*/

import stats from 'stats-lite';
Member

fun factoid - stats-lite is from my former NodeSource colleague and all-around great person, Bryce Baril.

👋 @brycebaril

return res.ok({
body: lastMonitoredStats
? calculateStatus(lastMonitoredStats)
: { id: taskManagerId, timestamp: new Date().toISOString(), status: HealthStatus.Error },
Member

This is really more of a "NO DATA" condition, rather than an error, right? Probably worth indicating that somehow, not sure it's worthy of a new status if instead it could just be in a message somehow.

Contributor Author

I considered it, but arguably it could also be an error if you consider that the goal of monitoring is to confirm that TM is working and having no data here indicates that TM is not working.

My thinking was that when a Kibana starts up it goes from Error to OK, and that makes sense to me from a monitoring perspective...
I'd opt to keep it this way until (unless) a customer complains :)

@botelastic botelastic bot added the Team:APM All issues that need APM UI Team support label Oct 21, 2020
@elasticmachine
Contributor

Pinging @elastic/apm-ui (Team:apm)

@gmmorris gmmorris added Feature:Task Manager release_note:enhancement Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v7.11.0 v8.0.0 labels Oct 22, 2020
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

* master: (63 commits)
  [KP] Fix Headers timeout issue (elastic#81140)
  [ML] Functional tests - stabilize typing with checks service method (elastic#81338)
  tabify - support docs (elastic#80351)
  [Security Solution][Detections] Look-back time logic fix (elastic#81383)
  [Workplace Search] Add top-level tests for Groups (elastic#81215)
  [Fleet] Fix agent action observable for long polling (elastic#81376)
  [Maps] fix feature tooltip remains open when zoom level change hides layer (elastic#81373)
  skip flaky suite (elastic#78689)
  chore(NA): add spec-to-console and plugin-helpers as devOnly dependencies (elastic#81357)
  Ensure some data is returned (elastic#81375)
  Change dumb-init to tini (elastic#81126)
  [Reporting/Tech Debt] Convert PdfMaker class to TypeScript (elastic#81242)
  Use Storybook Controls instead of Knobs (elastic#80705)
  [junit] make sure that report paths are unique (elastic#81255)
  bump elastic/elasticsearch-js version to 7.10.0-rc1 (elastic#81288)
  run ssl tests on CI (elastic#81320)
  Fix alert defaults (elastic#81207)
  [ML] DF Analytics wizard: ensure user can set mml manually or select to use given estimate (elastic#81078)
  Add UI notifier to indicate secret fields and to remember / reenter values (elastic#80657)
  [Monitoring] Use async/await (elastic#81200)
  ...
Member

@dgieselaar dgieselaar left a comment

Looks good, great to see more teams using these types! Left a small suggestion for type inference in aggregate.

@gmmorris
Contributor Author

@elasticmachine merge upstream

Contributor

@mikecote mikecote left a comment

LGTM 👍 This will be good to have insight into!

* master: (37 commits)
  [ILM] Migrate Warm phase to Form Lib (elastic#81323)
  [Security Solutions][Detection Engine] Fixes critical bug with error reporting that was doing a throw (elastic#81549)
  [Detection Rules] Add 7.10 rules (elastic#81676)
  [kbn/optimizer] ignore missing metrics when updating limits with --focus (elastic#81696)
  [SECURITY SOLUTIONS] Bugs overview page + investigate eql in timeline (elastic#81550)
  [Maps] fix unable to edit cluster vector styles styled by count when switching to super fine grid resolution (elastic#81525)
  Fixed migration issue for case specific actions, by extending email action migrator checks (elastic#81673)
  [CI] Preparation for APM tracking on CI (elastic#80399)
  [Home] Fixes Kibana app description order on home page and updates Canvas copy (elastic#80057)
  Make sure `to` is 'now' and not the same as `from` (elastic#81524)
  Nitpicking the 8.0 Breaking Change issue template (elastic#81678)
  [SECURITY_SOLUTION] Fix text on onboarding screen (elastic#81672)
  [data.search] Skip async search tests in build candidates and production builds (elastic#81547)
  Fix previousStartedAt by not changing when execution fails (elastic#81388)
  [Monitoring] Fix a couple of issues with the cpu usage alert (elastic#80737)
  Telemetry collection xpack to ts project references (elastic#81269)
  Elasticsearch: don't use url authentication for new client (elastic#81564)
  [App Search] Credentials: implement working flyout form (elastic#81541)
  Properly encode links to edit user page (elastic#81562)
  [Alerting UI] Don't wait for health check before showing Create Alert flyout (elastic#80996)
  ...
* master:
  [Security Solution][Endpoint][Admin] Malware Protections Notify User Version (elastic#81415)
  Revert "[Actions] Adding `hasAuth` to Webhook Configuration to avoid confusing UX (elastic#81390)"
  [Maps] Use default format when proxying EMS-files (elastic#79760)
  [Discover] Deangularize context.html (elastic#81442)
  Use the displayName property in default editor (elastic#73311)
  adds bug label to Bug report for Security Solution template (elastic#81643)
  [ML] Transforms: Remove index field limitation for custom query. (elastic#81467)
  [Actions] Adding `hasAuth` to Webhook Configuration to avoid confusing UX (elastic#81390)
  [Task Manager] Mark task as failed if maxAttempts has been met. (elastic#80681)
  [Lens] Fix URL query loss on redirect (elastic#81475)
  Log reason for 404 in field existence check (elastic#81315)
@kibanamachine
Contributor

💚 Build Succeeded

Metrics [docs]

distributable file count

id before after diff
default 48089 48100 +11

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@gmmorris gmmorris merged commit 5dfa45d into elastic:master Oct 27, 2020
gmmorris added a commit to gmmorris/kibana that referenced this pull request Oct 27, 2020
…perations (elastic#77868)

This PR adds an internal monitoring mechanism in Task Manager, which keeps track of a variety of metrics, and a health API endpoint, which makes the monitored statistics accessible.
# Conflicts:
#	x-pack/test/plugin_api_integration/test_suites/task_manager/index.ts
gmmorris added a commit to gmmorris/kibana that referenced this pull request Oct 27, 2020
* master: (87 commits)
  [Actions] Adding `hasAuth` to Webhook Configuration to avoid confusing UX (elastic#81778)
  [i18n] add get_kibana_translation_paths tests (elastic#81624)
  [UX] Fix search term reset from url (elastic#81654)
  [Lens] Improved range formatter (elastic#80132)
  [Resolver] `SideEffectContext` changes, remove `@testing-library` uses (elastic#81077)
  [Time to Visualize] Update Library Text with Call to Action (elastic#81667)
  [docs] Resolving failed Kibana upgrade migrations (elastic#80999)
  [ftr/menuToggle] provide helper for enhanced menu toggle handling (elastic#81709)
  [Task Manager] adds basic observability into Task Manager's runtime operations (elastic#77868)
  Doc changes for stack management and grouped feature privileges (elastic#80486)
  Added functional test for alerts list filters by status, alert type and action type. Did a code refactoring and splitting for alerts tests. (elastic#81422)
  [Security Solution][Endpoint][Admin] Malware Protections Notify User Version (elastic#81415)
  Revert "[Actions] Adding `hasAuth` to Webhook Configuration to avoid confusing UX (elastic#81390)"
  [Maps] Use default format when proxying EMS-files (elastic#79760)
  [Discover] Deangularize context.html (elastic#81442)
  Use the displayName property in default editor (elastic#73311)
  adds bug label to Bug report for Security Solution template (elastic#81643)
  [ML] Transforms: Remove index field limitation for custom query. (elastic#81467)
  [Actions] Adding `hasAuth` to Webhook Configuration to avoid confusing UX (elastic#81390)
  [Task Manager] Mark task as failed if maxAttempts has been met. (elastic#80681)
  ...
gmmorris added a commit that referenced this pull request Oct 27, 2020
…perations (#77868) (#81808)

This PR adds an internal monitoring mechanism in Task Manager, which keeps track of a variety of metrics, and a health API endpoint, which makes the monitored statistics accessible.
# Conflicts:
#	x-pack/test/plugin_api_integration/test_suites/task_manager/index.ts
@gmmorris
Contributor Author

Documented as part of #89997

Labels
Feature:Task Manager release_note:enhancement Team:APM All issues that need APM UI Team support Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v7.11.0 v8.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Task Manager] we don't have sufficient observability into Task Manager's runtime operations
7 participants