Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add process summary metrics #4231

Merged
merged 1 commit into from May 12, 2017

Conversation

@tsg
Copy link
Collaborator

commented May 5, 2017

As we're enabling include_top_n by default, a few visualizations from the
Metricbeat-processes list were no longer correct (as they aggregate only a sample of the
data).

This PR adds a new process_summary metricset that adds these metrics. The fields are namespaced under process.summary.

Remaining TODOs:

  • system test for the new metricset.
  • changelog
  • update Kibana dashboards

This is required for #4112.

@ruflin
Copy link
Collaborator

left a comment

I'm on the fence if this should be it's own metricset or not. I feel like it should be possible for a user to only enable the summary which speaks for its own metricset. At the same time I see it valuable inside the process metricset as it can be "calculated" during the fetch.

If we put it in it's own metricset we could put it under the namespace processes. My worry is if we put it under process is that it mixes the summary with a single event.

Thinking more about it I tend to go in the direction of a separate metricset which is enabled by default. This also allows to set different periods for the summary.

@@ -487,3 +487,33 @@
description: >
Total number of I/O operations performed on all devices
by processes in the cgroup as seen by the throttling policy.
- name: summaries

This comment has been minimized.

Copy link
@ruflin

ruflin May 9, 2017

Collaborator

I would make this singular as it is only 1 summary and not multiple.

@@ -49,6 +49,7 @@ func New(base mb.BaseMetricSet) (mb.MetricSet, error) {
CPUTicks bool `config:"cpu_ticks"`
CacheCmdLine bool `config:"process.cmdline.cache.enabled"`
IncludeTop includeTopConfig `config:"process.include_top_n"`
Summaries bool `config:"process.summaries"`

This comment has been minimized.

Copy link
@ruflin

ruflin May 9, 2017

Collaborator

process.summary.enabled? In case we add more options to it in the future

@@ -373,6 +383,11 @@ func (procStats *ProcStats) GetProcStats() ([]common.MapStr, error) {
}
procStats.ProcsMap = newProcs

var summaries *processSummaries
if procStats.Summaries {
summaries = procStats.getSummaries(processes)

This comment has been minimized.

Copy link
@andrewkroh

andrewkroh May 9, 2017

Member

I'm a little concerned that users might expect the summary to unfiltered. As written it looks like the summary only includes processes that matched the given regular expressions.

If we keep it this way, I feel like we should include info about what we matched against in the summary. This would be needed to distinguish between two separate summary events if the system/process metricset is used more than once.

This comment has been minimized.

Copy link
@tsg

tsg May 9, 2017

Author Collaborator

Ah, you are right, it should be unfiltered, I didn't realize that the regexps are already applied at that point.

@tsg

This comment has been minimized.

Copy link
Collaborator Author

commented May 9, 2017

@ruflin I agree that a separate metricset would be somehow cleaner, although I don't like it when two metricsets have very similar names (process and processes) because it's unclear for the user which one does what. We already have that situation with fststat and filesystem, and I find it confusing.

Another concern is that with a separate metricset we'd need to poll the process data twice, from both metricsets. Or is there a way for two metricsets to work on the same data?

@tsg

This comment has been minimized.

Copy link
Collaborator Author

commented May 9, 2017

Ok, I have tried the approach of having its own metricset and I have to admit it feels a lot simpler, so I'm now also leaning towards that option. I named the metricset process_summary, but I'm open to suggestions :).

@andrewkroh
Copy link
Member

left a comment

I think having a separate metricset is simpler w.r.t. the overall implementation.

state := sigar.ProcState{}
err = state.Get(pid)
if err != nil {
return nil, fmt.Errorf("Error getting process state for pid=%d: %v", pid, err)

This comment has been minimized.

Copy link
@andrewkroh

andrewkroh May 9, 2017

Member

One bad apple will spoil the bunch. This will cause problems on Windows, because pid=0 and csrss.exe processes cannot be accessed. But this could also affect other OSes if a process exits in between the Pids() and the state.Get(pid) calls.

One solution would be to treat these processes as a new state, unknown, in the state summary.


pids, err := process.Pids()
if err != nil {
logp.Warn("Getting the list of pids: %v", err)

This comment has been minimized.

Copy link
@andrewkroh

andrewkroh May 9, 2017

Member

I think the error should only be handled once. I'm advocating against logging the error also returning it.

This comment has been minimized.

Copy link
@tsg

tsg May 10, 2017

Author Collaborator

Oh, that was copy pasting from the process metricset :) I'll fix it in both places.

case 'Z':
summary.zombie += 1
default:
logp.Err("Unknown state <%v> for process with pid %d", state.State, pid)

This comment has been minimized.

Copy link
@andrewkroh

andrewkroh May 9, 2017

Member

This could be another reason for having an unknown state in the summary. Without it, the sum of the counts of each state will not equal the total in this case.

This comment has been minimized.

Copy link
@tsg

tsg May 10, 2017

Author Collaborator

Initially I thought that we don't need unknown because it can be computed on the UI side, but I think we can be explicit about it.


config := struct{}{}

if err := base.Module().UnpackConfig(&config); err != nil {

This comment has been minimized.

Copy link
@andrewkroh

andrewkroh May 9, 2017

Member

These can be removed since there is no config.

pids, err := process.Pids()
if err != nil {
logp.Warn("Getting the list of pids: %v", err)
return nil, err

This comment has been minimized.

Copy link
@andrewkroh

andrewkroh May 9, 2017

Member

I recommend wrapping the returned error to provide some addition context, like errors.Wrap(err, "failed to fetch list of PIDs").

@ruflin

This comment has been minimized.

Copy link
Collaborator

commented May 10, 2017

Agree on the naming part, it's somehow ugly. I like the suggestion of process_summary and we could still put it under process.summary as the docs namespace. Will we allow the process_summary to also be filtered or exclude processes?

@ruflin

This comment has been minimized.

Copy link
Collaborator

commented May 10, 2017

Talking about naming: In case we agree on process_summary, we should probably rename fsstats to filesystem_summary for consitency. Agree that it is suboptimal / confusing at the moment.

@tsg

This comment has been minimized.

Copy link
Collaborator Author

commented May 10, 2017

Will we allow the process_summary to also be filtered or exclude processes?

I'd say no, for that people should use the process metricset and do the analytics in Kibana.

@tsg

This comment has been minimized.

Copy link
Collaborator Author

commented May 10, 2017

jenkins, retest it

@tsg tsg force-pushed the tsg:add_process_summaries branch from dfd5229 to 48c8fb7 May 10, 2017

"sleeping": 0,
"stopped": 0,
"total": 355,
"unknown": 130,

This comment has been minimized.

Copy link
@andrewkroh

andrewkroh May 10, 2017

Member

That's a lot of unknowns. Is that because of permissions on /proc/pid/*?

This comment has been minimized.

Copy link
@tsg

tsg May 10, 2017

Author Collaborator

I think so, it was generated on my mac without sudo.

@@ -0,0 +1,3 @@
=== system process_summary MetricSet

This comment has been minimized.

Copy link
@andrewkroh

andrewkroh May 10, 2017

Member

system should be capitalized and I think we should change process_summary to "Process Summary". I think it would fit in better on the metricset listing.

@tsg tsg referenced this pull request May 10, 2017
8 of 9 tasks complete
@andrewkroh
Copy link
Member

left a comment

LGTM

// init registers the MetricSet with the central registry.
// The New method will be called after the setup of the module and before starting to fetch data
func init() {
if err := mb.Registry.AddMetricSet("system", "process_summary", New, parse.EmptyHostParser); err != nil {

This comment has been minimized.

Copy link
@andrewkroh

andrewkroh May 10, 2017

Member

What if the metricset name were process.summary instead of process_summary, would that alleviate some of the concerns about the name differing from the data model?

This comment has been minimized.

Copy link
@ruflin

ruflin May 11, 2017

Collaborator

@andrewkroh I like the idea. Would this cause any issues somewhere else?

And interesting part of this could be that one could also enable process.* metricsets in the future. This becomes especially interesting in the kubernetes case.

This comment has been minimized.

Copy link
@tsg

tsg May 11, 2017

Author Collaborator

Interesting idea, but that currently doesn't work:

2017/05/11 10:25:00.264381 metricbeat.go:31: INFO Register [ModuleFactory:[docker, mongodb, mysql, postgresql, system], MetricSetFactory:[apache/status, audit/kernel, ceph/cluster_disk, ceph/cluster_health, ceph/monitor_health, ceph/pool_disk, couchbase/bucket, couchbase/cluster, couchbase/node, docker/container, docker/cpu, docker/diskio, docker/healthcheck, docker/image, docker/info, docker/memory, docker/network, dropwizard/collector, elasticsearch/node, elasticsearch/node_stats, golang/expvar, golang/heap, haproxy/info, haproxy/stat, http/json, jolokia/jmx, kafka/consumergroup, kafka/partition, kibana/status, kubernetes/container, kubernetes/node, kubernetes/pod, kubernetes/system, kubernetes/volume, memcached/stats, mongodb/dbstats, mongodb/status, mysql/status, nginx/stubstatus, php_fpm/pool, postgresql/activity, postgresql/bgwriter, postgresql/database, prometheus/collector, prometheus/stats, redis/info, redis/keyspace, system/core, system/cpu, system/diskio, system/filesystem, system/fsstat, system/load, system/memory, system/network, system/process, system/process.summary, vsphere/datastore, vsphere/host, vsphere/virtualmachine, zookeeper/mntr]]
panic: name process already used

I think this brings up the question (again), if we want to support metricset hierarchies.

This comment has been minimized.

Copy link
@ruflin

ruflin May 11, 2017

Collaborator

I would separate the two things. I think we should support dots in metricset names which does not necessarly mean there is a hierachy in the code structure. That it happens on the data side is nice of course.

This comment has been minimized.

Copy link
@andrewkroh

andrewkroh May 11, 2017

Member

So the panic is caused by a naming collision the happens when registering the metrics key for metricbeat.system.process.summary with libbeat monitoring. Since the process metricset is already registered we end up with a collision. We could work around the issue with a change to this line to remove dots from the key name. This would "flatten" the metrics reported about the metricsets and simplify consuming those metrics in any sort of UI.

key := fmt.Sprintf("metricbeat.%s.%s", module, strings.Replace(name, ".", "_", -1))

But @tsg's concern that the usage of dots is overloaded was spot on.

@tsg tsg force-pushed the tsg:add_process_summaries branch 2 times, most recently from 93f2798 to f6a79fe May 12, 2017

@tsg tsg removed the in progress label May 12, 2017

@tsg

This comment has been minimized.

Copy link
Collaborator Author

commented May 12, 2017

Updated to use process.summary in fields, but keep process_summary as the metricset name. Rebased and ready for new reviews/merging.

@ruflin
ruflin approved these changes May 12, 2017
@ruflin

This comment has been minimized.

Copy link
Collaborator

commented May 12, 2017

@tsg I think you have to use ["..."] in the tests there:

Traceback (most recent call last):
  File "/go/src/github.com/elastic/beats/metricbeat/tests/system/test_system.py", line 318, in test_process_summary
    summary = evt["system"]["process.summary"]
KeyError: 'process.summary'
Add process summary metrics
As we're enabling `include_top_n` by default, a few visualizations from the
Metricbeat-processes list were no longer correct (as they aggregate only a
sample of the data).

This adds a new `process_summary` metricset that adds these metrics. The fields
are namespaced under `process.summary`.

This PR adds summary metrics for the total number of processes and their state, as an
extra document created by the `system.process` metricset.

@tsg tsg force-pushed the tsg:add_process_summaries branch from f6a79fe to fbac667 May 12, 2017

@ruflin

This comment has been minimized.

Copy link
Collaborator

commented May 12, 2017

jenkins, retest it

state := sigar.ProcState{}
err = state.Get(pid)
if err != nil {
summary.unknown += 1

This comment has been minimized.

Copy link
@elasticmachine

elasticmachine May 12, 2017

Collaborator

[golint] reported by reviewdog 🐶
should replace summary.unknown += 1 with summary.unknown++


switch byte(state.State) {
case 'S':
summary.sleeping += 1

This comment has been minimized.

Copy link
@elasticmachine

elasticmachine May 12, 2017

Collaborator

[golint] reported by reviewdog 🐶
should replace summary.sleeping += 1 with summary.sleeping++

case 'S':
summary.sleeping += 1
case 'R':
summary.running += 1

This comment has been minimized.

Copy link
@elasticmachine

elasticmachine May 12, 2017

Collaborator

[golint] reported by reviewdog 🐶
should replace summary.running += 1 with summary.running++

case 'R':
summary.running += 1
case 'D':
summary.idle += 1

This comment has been minimized.

Copy link
@elasticmachine

elasticmachine May 12, 2017

Collaborator

[golint] reported by reviewdog 🐶
should replace summary.idle += 1 with summary.idle++

case 'D':
summary.idle += 1
case 'T':
summary.stopped += 1

This comment has been minimized.

Copy link
@elasticmachine

elasticmachine May 12, 2017

Collaborator

[golint] reported by reviewdog 🐶
should replace summary.stopped += 1 with summary.stopped++

case 'T':
summary.stopped += 1
case 'Z':
summary.zombie += 1

This comment has been minimized.

Copy link
@elasticmachine

elasticmachine May 12, 2017

Collaborator

[golint] reported by reviewdog 🐶
should replace summary.zombie += 1 with summary.zombie++

summary.zombie += 1
default:
logp.Err("Unknown state <%v> for process with pid %d", state.State, pid)
summary.unknown += 1

This comment has been minimized.

Copy link
@elasticmachine

elasticmachine May 12, 2017

Collaborator

[golint] reported by reviewdog 🐶
should replace summary.unknown += 1 with summary.unknown++

@andrewkroh
Copy link
Member

left a comment

LGTM

assert.Contains(t, event, "idle")
assert.Contains(t, event, "stopped")
assert.Contains(t, event, "zombie")
assert.Contains(t, event, "unknown")

This comment has been minimized.

Copy link
@andrewkroh

andrewkroh May 12, 2017

Member

Missing total.

And how about a test to ensure the sum of the various states equals total?

assert isinstance(summary["idle"], int)
assert isinstance(summary["stopped"], int)
assert isinstance(summary["zombie"], int)
assert isinstance(summary["unknown"], int)

This comment has been minimized.

Copy link
@andrewkroh

andrewkroh May 12, 2017

Member

Missing total.

@ruflin ruflin merged commit 86932ca into elastic:master May 12, 2017

3 of 4 checks passed

continuous-integration/appveyor/pr Waiting for AppVeyor build to complete
Details
CLA Commit author is a member of Elasticsearch
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
default Build finished.
Details
tsg added a commit to tsg/beats that referenced this pull request May 12, 2017
Follow up on comments from elastic#4231
This addresses comments in elastic#4231 that came just before the PR got merged.
andrewkroh added a commit that referenced this pull request May 13, 2017
Follow up on comments from #4231 (#4305)
This addresses comments in #4231 that came just before the PR got merged.
ramon-garcia added a commit to ramon-garcia/beats that referenced this pull request Dec 5, 2017
Add process summary metrics (elastic#4231)
As we're enabling `include_top_n` by default, a few visualizations from the
Metricbeat-processes list were no longer correct (as they aggregate only a
sample of the data).

This adds a new `process_summary` metricset that adds these metrics. The fields
are namespaced under `process.summary`.

This PR adds summary metrics for the total number of processes and their state, as an
extra document created by the `system.process` metricset.
ramon-garcia added a commit to ramon-garcia/beats that referenced this pull request Dec 5, 2017
Follow up on comments from elastic#4231 (elastic#4305)
This addresses comments in elastic#4231 that came just before the PR got merged.
athom added a commit to athom/beats that referenced this pull request Jan 25, 2018
Add process summary metrics (elastic#4231)
As we're enabling `include_top_n` by default, a few visualizations from the
Metricbeat-processes list were no longer correct (as they aggregate only a
sample of the data).

This adds a new `process_summary` metricset that adds these metrics. The fields
are namespaced under `process.summary`.

This PR adds summary metrics for the total number of processes and their state, as an
extra document created by the `system.process` metricset.
athom added a commit to athom/beats that referenced this pull request Jan 25, 2018
Follow up on comments from elastic#4231 (elastic#4305)
This addresses comments in elastic#4231 that came just before the PR got merged.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.