[Elastic Agent] Collect Elastic Agent metrics and send to Elasticsearch #22394

Closed · ph opened this issue Nov 3, 2020 · 28 comments

ph (Contributor) commented Nov 3, 2020

Acceptance criteria (AC):

  • I should be able to see the CPU usage on the running machine. Am I using too much CPU?
  • I should be able to see disk usage on the running machine. Am I running out of disk space?
  • I should be able to see the memory usage of Elastic Agent. Is the Elastic Agent using too much memory?
  • I should be able to see the system memory. Am I running out of memory?
  • I should be able to see fd usage. Am I keeping too many files open?
  • Collected metrics should be sent to a specific agent data stream.
  • Data should follow ECS fields.

Implementation proposal:

??

elasticmachine (Collaborator) commented

Pinging @elastic/ingest-management (Team:Ingest Management)

ph (Contributor, Author) commented Nov 3, 2020

@michalpristas Can you write a proposal here? More specifically, can we reuse libbeat's code to collect and expose the metrics?

@ruflin Were you considering doing this the same way as other Beats: expose an HTTP endpoint and use the Metricbeat module to collect that information?
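For illustration, a minimal sketch of what that Metricbeat-side collection could look like, assuming the agent exposes a /stats endpoint over HTTP (the host/port and namespace below are placeholders, not a confirmed design):

metricbeat.modules:
  # Poll the agent's hypothetical /stats endpoint with the http module's
  # json metricset and forward whatever JSON it returns.
  - module: http
    metricsets: ["json"]
    period: 10s
    hosts: ["http://localhost:6791"]   # placeholder address for the agent endpoint
    path: "/stats"
    namespace: "agent_stats"           # illustrative namespace for the json metricset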

ph added the v7.11.0 label on Nov 3, 2020
ruflin (Member) commented Nov 5, 2020

@ph I think in the future, Elastic Agent will ship all the metrics data directly, also for its processes. So this is probably a good start to try it out.

ph (Contributor, Author) commented Nov 5, 2020

@ruflin Agreed, but can it be done in two steps?

  • Step 1: Expose HTTP and use Metricbeat.
  • Step 2: Have an internal pipeline for sending metrics.

ruflin (Member) commented Nov 6, 2020

Sure, if this simplifies things. Let's make sure the HTTP endpoint is not "officially" supported so we can remove it later.

ph (Contributor, Author) commented Nov 12, 2020

@ravikesarwani FYI, forgot to ping you on this one.

@michalpristas Can you take a look at #22394 (comment) ?

michalpristas (Contributor) commented Nov 25, 2020

@ph Can you elaborate on this one: "I should be able to query a data_stream query the collected metric"?
As a first step, I will use libbeat code to expose metrics stats using a socket/npipe. Then Metricbeat will collect these.

I also have two options in mind:

  • The first one is that we don't allow disabling monitoring, and we monitor all processes and agent metrics at all times.
  • The second one results in monitoring everything if enabled, and monitoring just the agent when disabled.

Or do we want to provide an option of not monitoring agent metrics?

EDIT: Exposing an HTTP endpoint for Metricbeat to use will only report data from the beat module. I think uptime and the number of goroutines may be valuable.
The other options I managed to accomplish by correctly configuring Metricbeat so it watches Elastic Agent and reports CPU, fd, memory... all of these resulting in a single index.
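Roughly, the kind of Metricbeat configuration this implies (only a sketch; the process name pattern and period are placeholders, not the final monitoring config the agent generates):

metricbeat.modules:
  # Watch the elastic-agent process itself and report CPU, memory, and
  # (on Linux) open file descriptors for it.
  - module: system
    metricsets: ["process"]
    period: 10s
    processes: ["elastic-agent"]   # regex matched against process names; placeholder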

ph (Contributor, Author) commented Nov 25, 2020

@michalpristas I have reworded the AC to "Collected metrics should be sent to a specific agent data stream."

Concerning the behavior you are describing, isn't this just an implementation detail? I mean, we have an option in the configuration file, which is:

# agent.monitoring:
#   # enabled turns on monitoring of running processes
#   enabled: true
#   # enables log monitoring
#   logs: true
#   # enables metrics monitoring
#   metrics: true

I would like to hear what others think.

michalpristas (Contributor) commented

@ph Yeah, I was wondering whether, when these options are used to disable monitoring, we still want to collect metrics of the agent itself. E.g. is there any reason to continue collecting agent metrics if we set agent.monitoring.metrics: false? There probably isn't; I just wanted to double-check.

I think I have something we can work with as a draft. I will test it on all OSes to see what gets reported where and then sum it up.

ravikesarwani commented

Apart from memory, CPU, throughput, system load, and open file handles, the other key metrics are:

  • Event rate: Am I keeping up or falling behind? This is really critical for logs, where the generated data can fluctuate a lot.
  • Any failure information (failure rates, output errors).

Here's an image from the Beats overview page in Stack Monitoring that can potentially be used as a data point.
[Image: BeatsOverview]

@ph Should we modify the original issue comments with some of this information?

blakerouse (Contributor) commented

@ravikesarwani The event rate would be great overall for an Elastic Agent, but it has a technical issue. At the moment, Elastic Agent doesn't send events to Elasticsearch directly; that is done by the Beats running under the Elastic Agent. We already collect these metrics from the Beats running under the Elastic Agent, but they are not metrics of the Elastic Agent itself.

ravikesarwani commented

The Agent metrics dashboard by default provides a view that shows accumulated information for the Agent and all the Beats metrics combined. If the user filters to see just Agent metrics, then the event rate graph would be empty. If the user filters to one or more Beats, they will see the event rate information.

michalpristas (Contributor) commented Dec 7, 2020

Example documents of metrics collected using the http input.
Note that I needed to place the collected metrics into a metrics field, as http did not let me put them at the root level.

Linux

{
	"_index": ".ds-metrics-elastic_agent.elastic_agent-default-000001",
	"_id": "f0V-PHYB92whPQxjg6Wk",
	"_version": 1,
	"_score": null,
	"_source": {
		"@timestamp": "2020-12-07T09:18:11.839Z",
		"host": {
			"os": {
				"kernel": "4.4.0-31-generic",
				"codename": "xenial",
				"platform": "ubuntu",
				"version": "16.04.1 LTS (Xenial Xerus)",
				"family": "debian",
				"name": "Ubuntu"
			},
			"id": "c0cc2a7efa902a719ada8ab6584b6bcb",
			"containerized": false,
			"ip": [
				"172.17.0.1"
			],
			"mac": [
				"08:00:ab:08:00:ab"
			],
			"hostname": "vagrant",
			"architecture": "x86_64",
			"name": "vagrant"
		},
		"agent": {
			"type": "metricbeat",
			"version": "8.0.0",
			"ephemeral_id": "4470f14f-9bf1-4452-aee7-37e0db031c5c",
			"id": "95a0ff4e-e36e-4c0a-a552-844799011648",
			"name": "vagrant"
		},
		"event": {
			"duration": 5269657,
			"dataset": "elastic_agent.elastic_agent",
			"module": "http"
		},
		"elastic_agent": {
			"version": "8.0.0",
			"id": "f0eb529c-3512-429f-b6aa-37264dddb402",
			"process": "elastic-agent",
			"snapshot": false
		},
		"metricset": {
			"name": "json",
			"period": 10000
		},
		"data_stream": {
			"type": "metrics",
			"dataset": "elastic_agent.elastic_agent",
			"namespace": "default"
		},
		"system": {
			"process": {
				"cpu": {
					"system": {
						"ticks": 1745,
						"time": {
							"ms": 1745
						}
					},
					"total": {
						"ticks": 7291,
						"time": {
							"ms": 7291
						},
						"value": 7291
					},
					"user": {
						"time": {
							"ms": 5546
						},
						"ticks": 5546
					}
				},
				"memory": {
					"size": 74531072
				},
				"fd": {
					"open": 21
				},
				"cgroup": {
					"cpuacct": {
						"id": "elastic-agent.service",
						"total": {
							"ns": 5728663070
						}
					},
					"memory": {
						"mem": {
							"limit": {
								"bytes": 9223372036854772000
							},
							"usage": {
								"bytes": 462458880
							}
						},
						"id": "elastic-agent.service"
					}
				}
			}
		},
		"ecs": {
			"version": "1.7.0"
		},
		"service": {
			"address": "http://unix/stats",
			"type": "http"
		},
		"http": {}
	},
	"fields": {
		"@timestamp": [
			"2020-12-07T09:18:11.839Z"
		]
	},
	"sort": [
		1607332691839
	]
}

macOS

{
	"_index": ".ds-metrics-elastic_agent.elastic_agent-default-000001",
	"_id": "2QxlPHYBjGFDnaF_EkU-",
	"_version": 1,
	"_score": null,
	"_source": {
		"@timestamp": "2020-12-07T08:50:24.348Z",
		"event": {
			"dataset": "elastic_agent.elastic_agent",
			"module": "http",
			"duration": 3040126
		},
		"metricset": {
			"name": "json",
			"period": 10000
		},
		"system": {
			"process": {
				"cpu": {
					"system": {
						"ticks": 1745,
						"time": {
							"ms": 1745
						}
					},
					"total": {
						"ticks": 7291,
						"time": {
							"ms": 7291
						},
						"value": 7291
					},
					"user": {
						"time": {
							"ms": 5546
						},
						"ticks": 5546
					}
				},
				"memory": {
					"size": 74531072
				}
			}
		},
		"host": {
			"mac": [
				"ac:de:48:ac:de:48"
			],
			"name": "MacBook-Pro-2.local",
			"hostname": "MacBook-Pro-2.local",
			"architecture": "x86_64",
			"os": {
				"name": "Mac OS X",
				"kernel": "18.7.0",
				"build": "18G6032",
				"platform": "darwin",
				"version": "10.14.6",
				"family": "darwin"
			},
			"id": "FC609F24-07E1-54EA-8E33-56F9D5A7A97E",
			"ip": [
				"127.0.0.2"
			]
		},
		"agent": {
			"ephemeral_id": "0cf156d9-4398-4c29-a52d-596ec7a93f5f",
			"id": "e09c86a1-f5dd-4fe8-898c-70de832e2a9e",
			"name": "MacBook-Pro-2.local",
			"type": "metricbeat",
			"version": "8.0.0"
		},
		"service": {
			"address": "http://unix/stats",
			"type": "http"
		},
		"http": {},
		"data_stream": {
			"dataset": "elastic_agent.elastic_agent",
			"namespace": "default",
			"type": "metrics"
		},
		"elastic_agent": {
			"snapshot": false,
			"version": "8.0.0",
			"id": "02e6478a-72b9-4a5e-bd63-0f6be2ef4dba",
			"process": "elastic-agent"
		},
		"ecs": {
			"version": "1.6.0"
		}
	},
	"fields": {
		"@timestamp": [
			"2020-12-07T08:50:24.348Z"
		]
	},
	"sort": [
		1607331024348
	]
}

ruflin (Member) commented Dec 7, 2020

I guess you could use the rename processor to move the fields under root?

Why is there a beat object in the stats for memstats for example?

michalpristas (Contributor) commented Dec 8, 2020

Why is there a beat object in the stats for memstats for example?

There's beat.memstats because we're collecting beatMetrics using http. I can use rename to move it up if this is disturbing.
I tried rename for moving the fields to the root, but either I'm using it wrong or it's not possible; looking at the code of rename, it should not be possible.

ruflin (Member) commented Dec 9, 2020

Can you elaborate on the beatMetrics using http?

michalpristas (Contributor) commented Dec 9, 2020

What I mean by beatMetrics is a registry with the same name in the libbeat code. This registry is part of the default registry, under the stats namespace, which is exposed using the /stats endpoint. This endpoint I consume using the http input.

ruflin (Member) commented Dec 9, 2020

I think we should be careful not to reuse too much of libbeat here, as Elastic Agent is not a Beat. If we need the stats inside beatMetrics, could we get them from beatMetrics but put together our own event with the values? I think Elastic Agent must be in control of where and what metrics are exposed.

michalpristas (Contributor) commented Dec 9, 2020

I agree with the agent having control of what is being published, but even with that in mind I worked with the assumption that we agreed on using libbeat to collect metrics, at least at first.

At the same time, we need to take into account that we're walking towards dashboards where aggregated metrics of the agent and its processes are displayed. We probably need to have a common format in mind so aggregations are not cumbersome.

I updated the rules, which cleaned up the resulting agent document: mainly removing what's not needed from the beatMetrics registry and renaming metrics.beat to metrics.process. This should be universal for each process.
Additionally, the Beats can provide more information in this document, as Ravi mentioned further up (e.g. output events and RW errors). The agent can jump in and use those as soon as it has the ability to collect them.
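For illustration only, that rename could be expressed with a Beats rename processor along these lines (where exactly it runs is an implementation detail, and the field paths are taken from the description above):

processors:
  # Move the libbeat-style metrics under a process-agnostic key so every
  # monitored process reports into the same structure.
  - rename:
      fields:
        - from: "metrics.beat"
          to: "metrics.process"
      ignore_missing: true
      fail_on_error: false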

ruflin changed the title from "[Elastic Agent] Collect Elastic Agent metrics and send to Elastic Search" to "[Elastic Agent] Collect Elastic Agent metrics and send to Elasticsearch" on Dec 10, 2020
michalpristas (Contributor) commented

Updated the structure in #22394 (comment) again; the important part is system.process.

Collecting just memory, CPU, and fd (on Linux).
Total system metrics such as free/used disk space or free RAM would need a different approach, either the Go APM agent or agent logic itself. These can be added later.

ruflin (Member) commented Dec 10, 2020

This LGTM.

ph (Contributor, Author) commented Dec 10, 2020

LGTM, seems OK for a first version.

simitt (Contributor) commented Dec 10, 2020

@michalpristas would it be possible to also include cgroup metrics from the beginning? Similar to what has been done in #21113. This is important when running in containers.

michalpristas (Contributor) commented

@simitt Yes, I can do that.
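If the collection goes through the system/process metricset sketched earlier, cgroup data could be switched on with its cgroups option (a sketch; treating this option as the mechanism the agent would actually use is an assumption):

metricbeat.modules:
  - module: system
    metricsets: ["process"]
    period: 10s
    processes: ["elastic-agent"]       # placeholder pattern, as in the earlier sketch
    process.cgroups.enabled: true      # include cgroup CPU/memory accounting on Linux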

michalpristas (Contributor) commented

@ruflin @simitt I updated the PR to match the proposed doc schema. It would be great if you could take a look:
#22793

botelastic added the needs_team label (indicates that the issue/PR needs a Team:* label) on Oct 27, 2021
jsoriano added the Team:Elastic-Agent-Control-Plane label (label for the Agent Control Plane team) on Oct 29, 2021
elasticmachine (Collaborator) commented

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

botelastic removed the needs_team label on Oct 29, 2021
elasticmachine (Collaborator) commented

Pinging @elastic/stack-monitoring (Stack monitoring)

jlind23 (Collaborator) commented Oct 29, 2021

@michalpristas It seems that you have already opened the PR for this: #22793.
What should we do with this issue then? Is there still something to deliver?

jlind23 closed this as completed on Feb 3, 2022