[Metricbeat][Kubernetes] Share watchers between metricsets #37332

Merged
merged 76 commits into from
Apr 3, 2024

Conversation

Contributor

@constanca-m constanca-m commented Dec 7, 2023

Proposed commit message

Details

The watchers needed for each resource are defined in the getExtraWatchers function. You can check which ones are required in the Expected watchers table of the linked issue.

Note: state_resourcequota is the only metricset that does not get its expected watchers with this change, because the implementation of that metricset needs to change first.
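
For illustration only, here is a minimal, hypothetical sketch of the kind of resource-to-extra-watchers mapping that getExtraWatchers implements. The pairs shown are assumptions inferred from the logs later in this description (e.g. the pod metricset creating the state_namespace watcher); the authoritative mapping is the getExtraWatchers implementation and the Expected watchers table in the linked issue.

package main

import "fmt"

// Illustrative resource names only; the real constants live in the kubernetes module.
const (
	podResource        = "pod"
	nodeResource       = "node"
	namespaceResource  = "state_namespace"
	deploymentResource = "state_deployment"
)

// extraWatchersFor mirrors the idea of getExtraWatchers: given a resource, return the
// additional resources whose watchers are needed to enrich its events.
func extraWatchersFor(resourceName string) []string {
	switch resourceName {
	case podResource:
		// Pod events are enriched with node and namespace metadata, so those
		// watchers are required as well (illustrative, not exhaustive).
		return []string{nodeResource, namespaceResource}
	case deploymentResource:
		return []string{namespaceResource}
	default:
		return nil
	}
}

func main() {
	fmt.Println(extraWatchersFor(podResource)) // [node state_namespace]
}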

We have a global map that saves all the watchers:

type watchers struct {
	watchersMap map[string]*watcherData
	lock        sync.RWMutex
}

The key to this map is the resource name, and the values are defined as:

type watcherData struct {
	metricsetsUsing []string // list of metricsets using this watcher

	watcher kubernetes.Watcher
	started bool // true if watcher has started, false otherwise

	enrichers map[string]*enricher // map of enrichers using this watcher. The key is the metricset name

	metadataObjects map[string]bool // map of ids of each object received by the handler functions
}
  • metricsetsUsing contains the list of metricsets that are using this watcher. We need it because the watchers start/stop when an enricher calls Start() or Stop(): a watcher cannot be started more than once, and we only stop a watcher when the list of metricsets using it is empty. We key on the metricset name to avoid conflicts between metricsets that use the same resource, like state_pod and pod (see the sketch after this list).
  • watcher is the kubernetes watcher for the resource.
  • started tells us whether the watcher has started. This is mainly needed for enricher.Start() and for testing purposes.
  • enrichers maps each metricset name to the enricher that uses this watcher.
  • metadataObjects holds the IDs of the objects received by the resource event handler functions. See point 6.2 of the algorithm below for why this is necessary.
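
Below is a minimal, self-contained sketch (not the PR's code; names and types are simplified) of the reference-counting idea behind metricsetsUsing: a shared watcher is started at most once and is only stopped when no metricset uses it anymore.

package main

import (
	"fmt"
	"sync"
)

type fakeWatcher struct{ running bool }

func (w *fakeWatcher) Start() { w.running = true }
func (w *fakeWatcher) Stop()  { w.running = false }

type watcherData struct {
	watcher         *fakeWatcher
	started         bool
	metricsetsUsing []string
}

type watchers struct {
	lock        sync.RWMutex
	watchersMap map[string]*watcherData
}

// startUsing registers a metricset on a shared watcher and starts it only once.
func (ws *watchers) startUsing(resource, metricset string) {
	ws.lock.Lock()
	defer ws.lock.Unlock()
	data, ok := ws.watchersMap[resource]
	if !ok {
		data = &watcherData{watcher: &fakeWatcher{}}
		ws.watchersMap[resource] = data
	}
	data.metricsetsUsing = append(data.metricsetsUsing, metricset)
	if !data.started {
		data.watcher.Start()
		data.started = true
	}
}

// stopUsing removes a metricset and stops the watcher only when no users remain.
func (ws *watchers) stopUsing(resource, metricset string) {
	ws.lock.Lock()
	defer ws.lock.Unlock()
	data, ok := ws.watchersMap[resource]
	if !ok {
		return
	}
	remaining := data.metricsetsUsing[:0]
	for _, m := range data.metricsetsUsing {
		if m != metricset {
			remaining = append(remaining, m)
		}
	}
	data.metricsetsUsing = remaining
	if len(remaining) == 0 && data.started {
		data.watcher.Stop()
		data.started = false
	}
}

func main() {
	ws := &watchers{watchersMap: map[string]*watcherData{}}
	ws.startUsing("pod", "pod")
	ws.startUsing("pod", "state_pod") // shared watcher, not started twice
	ws.stopUsing("pod", "pod")
	fmt.Println(ws.watchersMap["pod"].started) // true: state_pod still uses it
	ws.stopUsing("pod", "state_pod")
	fmt.Println(ws.watchersMap["pod"].started) // false: last user gone
}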

The algorithm goes like this when NewResourceMetadataEnricher is called:

  1. The configuration is validated. It will return a nil enricher if it fails.
  2. We create the configuration needed for the metadata generator. It will return a nil enricher if it fails.
  3. We create the K8s client. It will return a nil enricher if it fails.
  4. We start all the watchers:
    1. We first check that the resource exists. If it does not, we stop.
    2. We build the kubernetes.WatchOptions{} needed for the watcher. If that fails, we stop.
    3. We start the watcher for this specific resource:
      1. We first check if the watcher is already created.
      2. If it is, then we don't do anything.
      3. Otherwise, we create a new watcher and put it in the map with key = resource name.
      4. We add this metricset to the list of metricsets we have that are using this watcher.
    4. We get all needed extra resources for this resource, and repeat step 3.
  5. We create the metadata generators.
  6. Lastly, create the enricher.
    Considerations:
    1. Because each watcher has only one AddFunc/UpdateFunc and one DeleteFunc, we need to know which metricsets and respective enrichers need those handler functions. For this, we keep track of the enrichers in a map and iterate over that map whenever one of the functions is triggered.
    2. It is possible that AddFunc fires while only one metricset is running, so when another metricset starts later, AddFunc is not triggered again for the existing objects. To avoid losing that metadata, we keep the metadataObjects map, which saves the ID of every object that triggered a handler function. When an enricher is created, we iterate over this map, fetch each object from the watcher store by its saved ID, and call the update function with it, ensuring all enrichers have up-to-date metadata (see the sketch after this list).
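
The following is a minimal, self-contained sketch (illustrative only, with simplified types; not the PR's code) of the replay mechanism described in consideration 2: object IDs seen by the shared handler are remembered in metadataObjects, and an enricher created later re-processes those objects from the watcher store so it does not miss metadata for resources added before it existed.

package main

import "fmt"

type object struct{ id, label string }

// store stands in for the watcher's local store of Kubernetes objects.
type store map[string]object

type enricher struct {
	name     string
	metadata map[string]string // id -> generated metadata (simplified)
}

func (e *enricher) update(obj object) { e.metadata[obj.id] = obj.label }

type sharedWatcher struct {
	store           store
	enrichers       map[string]*enricher
	metadataObjects map[string]bool // ids already delivered by the handler
}

// onAdd is the single AddFunc/UpdateFunc shared by all enrichers of this watcher.
func (w *sharedWatcher) onAdd(obj object) {
	w.metadataObjects[obj.id] = true
	for _, e := range w.enrichers {
		e.update(obj)
	}
}

// addEnricher registers a new enricher and replays previously seen objects to it.
func (w *sharedWatcher) addEnricher(metricset string, e *enricher) {
	w.enrichers[metricset] = e
	for id := range w.metadataObjects {
		if obj, ok := w.store[id]; ok {
			e.update(obj)
		}
	}
}

func main() {
	w := &sharedWatcher{
		store:           store{},
		enrichers:       map[string]*enricher{},
		metadataObjects: map[string]bool{},
	}
	podEnricher := &enricher{name: "pod", metadata: map[string]string{}}
	w.addEnricher("pod", podEnricher)

	// An object arrives before the state_pod enricher exists.
	obj := object{id: "default/nginx", label: "app=nginx"}
	w.store[obj.id] = obj
	w.onAdd(obj)

	// state_pod starts later; the replay gives it the already-seen metadata.
	statePodEnricher := &enricher{name: "state_pod", metadata: map[string]string{}}
	w.addEnricher("state_pod", statePodEnricher)
	fmt.Println(statePodEnricher.metadata["default/nginx"]) // app=nginx
}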

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

  1. Clone this branch.
  2. Follow the steps of this README file to launch metricbeat with the changes.
  3. Check it is working as expected.

Related issues

Results

Metricbeat

These results come from running Metricbeat with the following configuration for the kubernetes module (all metricsets that launch watchers are enabled, the others are not).
metricbeat.autodiscover:
  providers:
    - type: kubernetes
      scope: cluster
      node: ${NODE_NAME}
      unique: true
      templates:
        - config:
            - module: kubernetes
              hosts: ["kube-state-metrics:8080"]
              period: 10s
            #  #add_metadata: true
              metricsets:
                - state_node
                - state_deployment
                - state_daemonset
                - state_replicaset
                - state_pod
                - state_container
                - state_cronjob
                - state_job
                #- state_resourcequota
                - state_statefulset
                - state_service
                - state_persistentvolume
                - state_persistentvolumeclaim
                - state_storageclass
                - state_namespace
            - module: kubernetes
              metricsets:
                - node
                - pod
                - container
              period: 10s
              host: ${NODE_NAME}
              hosts: ["https://${NODE_NAME}:10250"]
              bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
              ssl.verification_mode: "none"

The logs for the watcher initialization will look like this:

{"log.level":"debug","@timestamp":"2024-02-08T08:03:02.094Z","log.logger":"kubernetes","log.origin":{"function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.createAllWatchers","file.name":"util/kubernetes.go","file.line":354},"message":"Created watcher node successfully, created by node.","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-02-08T08:03:02.116Z","log.logger":"kubernetes","log.origin":{"function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.createAllWatchers","file.name":"util/kubernetes.go","file.line":354},"message":"Created watcher pod successfully, created by pod.","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-02-08T08:03:02.116Z","log.logger":"kubernetes","log.origin":{"function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.createAllWatchers","file.name":"util/kubernetes.go","file.line":368},"message":"Created watcher state_namespace successfully, created by pod.","service.name":"metricbeat","ecs.version":"1.6.0"}                <-------------------------------------
{"log.level":"debug","@timestamp":"2024-02-08T08:03:02.191Z","log.logger":"kubernetes","log.origin":{"function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.createAllWatchers","file.name":"util/kubernetes.go","file.line":354},"message":"Created watcher deployment successfully, created by state_deployment.","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-02-08T08:03:02.206Z","log.logger":"kubernetes","log.origin":{"function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.createAllWatchers","file.name":"util/kubernetes.go","file.line":354},"message":"Created watcher daemonset successfully, created by state_daemonset.","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-02-08T08:03:02.516Z","log.logger":"kubernetes","log.origin":{"function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.createAllWatchers","file.name":"util/kubernetes.go","file.line":354},"message":"Created watcher replicaset successfully, created by state_replicaset.","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-02-08T08:03:03.118Z","log.logger":"kubernetes","log.origin":{"function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.createAllWatchers","file.name":"util/kubernetes.go","file.line":354},"message":"Created watcher cronjob successfully, created by state_cronjob.","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-02-08T08:03:03.126Z","log.logger":"kubernetes","log.origin":{"function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.createAllWatchers","file.name":"util/kubernetes.go","file.line":354},"message":"Created watcher statefulset successfully, created by state_statefulset.","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-02-08T08:03:03.132Z","log.logger":"kubernetes","log.origin":{"function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.createAllWatchers","file.name":"util/kubernetes.go","file.line":354},"message":"Created watcher service successfully, created by state_service.","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-02-08T08:03:03.139Z","log.logger":"kubernetes","log.origin":{"function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.createAllWatchers","file.name":"util/kubernetes.go","file.line":354},"message":"Created watcher persistentvolume successfully, created by state_persistentvolume.","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-02-08T08:03:03.146Z","log.logger":"kubernetes","log.origin":{"function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.createAllWatchers","file.name":"util/kubernetes.go","file.line":354},"message":"Created watcher persistentvolumeclaim successfully, created by state_persistentvolumeclaim.","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-02-08T08:03:03.152Z","log.logger":"kubernetes","log.origin":{"function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.createAllWatchers","file.name":"util/kubernetes.go","file.line":354},"message":"Created watcher storageclass successfully, created by state_storageclass.","service.name":"metricbeat","ecs.version":"1.6.0"}

Notice the line marked with <----------: the pod metricset created the watcher for namespace, since namespace is one of its required extra resources and that watcher did not exist yet. This is also why we don't see the line "message":"Created watcher state_namespace successfully, created by state_namespace.": by the time state_namespace iterates over the watchers it needs, they are already created.

In Discover:
(screenshot)

Elastic Agent

These results come from running Elastic Agent with this standalone manifest, but with the custom image.

Logs:

These are the logs for starting the watchers (working as expected).
{"log.level":"debug","@timestamp":"2023-12-08T10:17:48.051Z","message":"Started watcher statefulset successfully, created by statefulset.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"ecs.version":"1.6.0","log.logger":"kubernetes","log.origin":{"file.line":307,"file.name":"util/kubernetes.go","function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.startAllWatchers"},"service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-12-08T10:17:48.051Z","message":"Started watcher state_namespace successfully, created by statefulset.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"log.logger":"kubernetes","log.origin":{"file.line":321,"file.name":"util/kubernetes.go","function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.startAllWatchers"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
...
{"log.level":"debug","@timestamp":"2023-12-08T10:17:48.257Z","message":"Started watcher node successfully, created by node.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"log.logger":"kubernetes","log.origin":{"file.line":307,"file.name":"util/kubernetes.go","function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.startAllWatchers"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
...
{"log.level":"debug","@timestamp":"2023-12-08T10:17:48.361Z","message":"Started watcher pod successfully, created by pod.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"ecs.version":"1.6.0","log.logger":"kubernetes","log.origin":{"file.line":307,"file.name":"util/kubernetes.go","function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.startAllWatchers"},"service.name":"metricbeat","ecs.version":"1.6.0"}
...
{"log.level":"debug","@timestamp":"2023-12-08T10:17:48.470Z","message":"Started watcher deployment successfully, created by deployment.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"log.logger":"kubernetes","log.origin":{"file.line":307,"file.name":"util/kubernetes.go","function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.startAllWatchers"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
...
{"log.level":"debug","@timestamp":"2023-12-08T10:17:48.577Z","message":"Started watcher persistentvolume successfully, created by persistentvolume.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"ecs.version":"1.6.0","log.logger":"kubernetes","log.origin":{"file.line":307,"file.name":"util/kubernetes.go","function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.startAllWatchers"},"service.name":"metricbeat","ecs.version":"1.6.0"}
...
{"log.level":"debug","@timestamp":"2023-12-08T10:17:48.679Z","message":"Started watcher persistentvolumeclaim successfully, created by persistentvolumeclaim.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"log.logger":"kubernetes","log.origin":{"file.line":307,"file.name":"util/kubernetes.go","function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.startAllWatchers"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
...
{"log.level":"debug","@timestamp":"2023-12-08T10:17:48.782Z","message":"Started watcher replicaset successfully, created by replicaset.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"log.logger":"kubernetes","log.origin":{"file.line":307,"file.name":"util/kubernetes.go","function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.startAllWatchers"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
...
{"log.level":"debug","@timestamp":"2023-12-08T10:17:48.887Z","message":"Started watcher service successfully, created by service.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"service.name":"metricbeat","ecs.version":"1.6.0","log.logger":"kubernetes","log.origin":{"file.line":307,"file.name":"util/kubernetes.go","function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.startAllWatchers"},"ecs.version":"1.6.0"}
...
{"log.level":"debug","@timestamp":"2023-12-08T10:17:49.017Z","message":"Started watcher storageclass successfully, created by storageclass.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"log.logger":"kubernetes","log.origin":{"file.line":307,"file.name":"util/kubernetes.go","function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.startAllWatchers"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
...
{"log.level":"debug","@timestamp":"2023-12-08T10:17:49.120Z","message":"Started watcher cronjob successfully, created by cronjob.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"log.logger":"kubernetes","log.origin":{"file.line":307,"file.name":"util/kubernetes.go","function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.startAllWatchers"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
...
{"log.level":"debug","@timestamp":"2023-12-08T10:17:49.225Z","message":"Started watcher daemonset successfully, created by daemonset.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"service.name":"metricbeat","ecs.version":"1.6.0","log.logger":"kubernetes","log.origin":{"file.line":307,"file.name":"util/kubernetes.go","function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.startAllWatchers"},"ecs.version":"1.6.0"}
...
{"log.level":"debug","@timestamp":"2023-12-08T10:17:49.329Z","message":"Started watcher job successfully, created by job.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"ecs.version":"1.6.0","log.logger":"kubernetes","log.origin":{"file.line":307,"file.name":"util/kubernetes.go","function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.startAllWatchers"},"service.name":"metricbeat","ecs.version":"1.6.0"}

The results for the dashboards (checking that they still work):

  • [Metrics Kubernetes] Cronjobs
  • [Metrics Kubernetes] StatefulSets
  • [Metrics Kubernetes] Pods
  • [Metrics Kubernetes] Deployments
  • [Metrics Kubernetes] DaemonSets
  • [Metrics Kubernetes] Jobs
  • [Metrics Kubernetes] Nodes
  • [Metrics Kubernetes] PV/PVC
  • [Metrics Kubernetes] Cluster Overview
  • [Metrics Kubernetes] Services - It is broken, but this is not related to the changes in this PR.

Note: only dashboards for resources that launch watchers are considered. There were no changes to the others.

Notes for testing

Important things to consider when testing this PR code changes:

  • The changes in this PR only affect the kubernetes module metricsets that use metadata enrichment. These are: state_namespace, state_node, state_deployment, state_daemonset, state_replicaset, state_pod, state_container, state_job, state_cronjob, state_statefulset, state_service, state_persistentvolume, state_persistentvolumeclaim, state_storageclass, pod, container, node.
  • Everything that worked before this PR should still work. The changes only reduce the number of watchers created by the different metricsets, thus reducing the number of Kubernetes API calls.
  • Thorough regression testing is needed. In more detail:
    a. All events coming from the affected metricsets in Kibana should be enriched with their own resource metadata (labels, annotations), plus Kubernetes node and namespace metadata when applicable.
    b. When new metadata (like a new label) is added to a resource (e.g. a pod), the new events from the related metricsets (pod, container, state_pod, state_container) should contain the new metadata.
    c. When a new label or annotation is added to a node or namespace, the events from the relevant metricsets (state_node, node or state_namespace) should include the new metadata.
    d. The events of the rest of the metricsets (e.g. state_pod or state_deployment) coming from resources in the updated node/namespace won't get the updated node or namespace metadata out of the box.
    e. For those events to be updated, there must first be an update to the metadata of these resources. For example, if a node is labeled, the pods on that node won't get the new node label immediately. To get it, we should also add a label to those pods to trigger a watcher event; the new events will then include both the new pod label and the node label.
    f. Test with addition/removal of metadata on pods that run on the leader node and also on non-leader nodes.

@constanca-m constanca-m added the Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team label Dec 7, 2023
@constanca-m constanca-m self-assigned this Dec 7, 2023
@constanca-m constanca-m requested a review from a team as a code owner December 7, 2023 09:16

mergify bot commented Dec 7, 2023

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @constanca-m? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

@elasticmachine
Collaborator

❕ Build Aborted

There is a new build on-going so the previous on-going builds have been aborted.


Build stats

  • Start Time: 2023-12-07T09:16:30.968+0000

  • Duration: 8 min 49 sec

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@elasticmachine
Collaborator

❕ Build Aborted

Either there was a build timeout or someone aborted the build.


Build stats

  • Duration: 22 min 52 sec


@elasticmachine
Collaborator

❕ Build Aborted

Either there was a build timeout or someone aborted the build.


Build stats

  • Duration: 18 min 18 sec


@elasticmachine
Collaborator

💚 Build Succeeded


Build stats

  • Duration: 50 min 17 sec

❕ Flaky test report

No test was executed to be analysed.


// NewResourceMetadataEnricher returns an Enricher configured for kubernetes resource events
// getExtraWatchers returns a list of the extra resources to watch based on some resource.
// The full list can be seen in https://github.com/elastic/beats/issues/37243, at Expected Watchers section.
func getExtraWatchers(resourceName string, config *kubernetesConfig) []string {
Contributor

Also, as a general comment, I think all these functions can move to the util folder and keep the same structure (as we have util.NewResourceMetadataEnricher), because right now we go back and forth.

Contributor Author

You mean in a different file inside util? They are in the util folder right now.

@gizas
Contributor

gizas commented Dec 7, 2023

  • This PR will also need testing with Agent for sure. We need to build the agent, repeat the same tests, and see that we don't break anything.

  • I would also need to run some E2E tests to verify that metadata enrichment is OK.

  • We will need to decide what a configuration like e.g. add_resource_metadata.namespace.enabled: false will do in our case.

Contributor

@MichaelKatsoulis MichaelKatsoulis left a comment


Without having checked the code line by line, I believe that this approach is not aligned with the kubeStateMetricsCache and kubeletStatsCache approach, where we try to solve a similar issue.

GetStateMetricsFamilies(prometheus p.Prometheus) ([]*p.MetricFamily, error)

The watchers need to be set at the module level. We need to take into account elastic-agent, where each metricset starts as a new kubernetes module. So when we need to share things between metricsets of the kubernetes module, we need to define a global cache at the Module's level, shared across all Module instances.
See @ChrsMark implementation in #25640 (comment)
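
For context, a minimal sketch (assumed names, not Beats code) of the module-level shared cache pattern being referenced here, in the spirit of kubeStateMetricsCache and MetricsRepo: one cache created once per process, with every kubernetes module instance receiving a pointer to it.

package main

import (
	"fmt"
	"sync"
)

type watchersCache struct {
	lock     sync.RWMutex
	watchers map[string]interface{} // resource name -> shared watcher (placeholder type)
}

var (
	cacheOnce   sync.Once
	sharedCache *watchersCache
)

// getWatchersCache lazily creates the process-wide cache exactly once.
func getWatchersCache() *watchersCache {
	cacheOnce.Do(func() {
		sharedCache = &watchersCache{watchers: map[string]interface{}{}}
	})
	return sharedCache
}

func main() {
	// Two "module instances" (e.g. started by elastic-agent for different
	// metricsets) end up sharing the exact same cache.
	m1, m2 := getWatchersCache(), getWatchersCache()
	fmt.Println(m1 == m2) // true
}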

@constanca-m
Contributor Author

We will need to decide what a configuration like e.g. add_resource_metadata.namespace.enabled: false will do in our case

I think maybe we should move those new decisions to a new PR @gizas

@constanca-m
Contributor Author

constanca-m commented Dec 7, 2023

Without having checked the code line by line, I believe that this approach is not aligned with kubeStateMetricsCache and kubeletStatsCache approach where we try to solve a similar issue.

I think this approach still works this way and does basically the same thing.

It is harder to use util/kubernetes with the kubernetes module in the parent folder, because we would hit Go's import cycle error at all times. The only workaround I could find would be to pass the functions as parameters, but the code is very hard to read that way.

I added unit tests for every function, and they work just fine. @MichaelKatsoulis

Edit: It is the same approach we are already using for state_metricset shared map.

@MichaelKatsoulis
Contributor

It is harder to use util/kubernetes with the kubernetes module in the parent folder, because we would hit Go's import cycle error at all times

You need to define a watchersCache in utils like we do with MetricsRepo

metricsRepo *util.MetricsRepo

Then its pointer can be passed to NewResourceMetadataEnricher, like metricsRepo.

Did you test how many watchers are created under the hood with e2e tests? Did you test elastic-agent and metricbeat?

@constanca-m
Contributor Author

constanca-m commented Dec 7, 2023

Did you test how many watchers are created under the hood with e2e tests? Did you test elastic-agent and metricbeat?

Yes, the number of watchers is correct. I posted the results from running Metricbeat under Results in the description. I also added the unit tests. Elastic Agent was also tested; those results are now in the description as well. @MichaelKatsoulis

Any specific test I should do? Any specific situation?

@elasticmachine
Collaborator

elasticmachine commented Mar 13, 2024

💔 Build Failed

Failed CI Steps

History

cc @constanca-m

@elasticmachine
Collaborator

💚 Build Succeeded

History

cc @constanca-m


Contributor

@tetianakravchenko tetianakravchenko left a comment


regarding the documentation:

  • I think it makes more sense to keep struct and function parameter details as comments in the code, otherwise there is a high chance this doc will not be kept in sync with code changes. I believe it would be helpful to keep implementation details in the code and use enrichers.md for a more high-level overview.
  • Could the documentation be restructured into sub-topics? It is hard to read 100+ lines of text as one piece.
  • Could a high-level architecture diagram be added? It would simplify the documentation a lot.


- The following description is irrelevant to the metadata enrichment that happens due to the `add_kubernetes_metadata` processor and Kubernetes provider.
- The `add_kubernetes_metadata` processor is skipped in cases of Kubernetes module for metrics collection. It is only used in the case of logs collection when the Kubernetes autodiscover provider is not used.
-The Kubernetes autodiscover provider enriches with metadata mainly when it comes to log collection when it is configured. It is by default in the `container_logs` integration in the Elastic Agent.
Contributor

why not only?

Suggested change
-The Kubernetes autodiscover provider enriches with metadata mainly when it comes to log collection when it is configured. It is by default in the `container_logs` integration in the Elastic Agent.
-The Kubernetes autodiscover provider enriches with metadata only when it comes to log collection when it is configured. It is by default in the `container_logs` integration in the Elastic Agent.

Maybe instead of referring to a different project (Elastic Agent), add a link to the filebeat manifest: https://github.com/elastic/beats/blob/main/deploy/kubernetes/filebeat-kubernetes.yaml#L133-L150?

Contributor

The kubernetes provider is also used with templates to start a metricbeat module (https://www.elastic.co/guide/en/beats/metricbeat/current/configuration-autodiscover.html#configuration-autodiscover) and enrich the metrics with kubernetes metadata

like state_pod, state_container, node, state_job etc.
- The shared watchers are stored in the module-level struct [resourceWatchers](https://github.com/constanca-m/beats/blob/8fcdfc205ebe0221ed7a1046c673c9babfb5280d/metricbeat/module/kubernetes/kubernetes.go#L102). The struct is initialized when the Kubernetes module [starts](https://github.com/constanca-m/beats/blob/8fcdfc205ebe0221ed7a1046c673c9babfb5280d/metricbeat/module/kubernetes/util/kubernetes.go#L124).
- Each time a watcher gets created it is added to the struct along with some watcher important data. The [struct](https://github.com/constanca-m/beats/blob/8fcdfc205ebe0221ed7a1046c673c9babfb5280d/metricbeat/module/kubernetes/util/kubernetes.go#L99) consists of a metaWatchersMap which has as key the name of the resource this watcher is watching(i.e. pod or node) and as value a [metaWatcher](https://github.com/constanca-m/beats/blob/8fcdfc205ebe0221ed7a1046c673c9babfb5280d/metricbeat/module/kubernetes/util/kubernetes.go#L86C6-L86C17) struct.
- The metaWatcher struct holds all the relevant info for the specific watcher. These info include
Contributor

I would move such a detailed explanation of the metaWatcher struct to the code; otherwise there is a high chance this doc will not be kept in sync with code changes.

Contributor Author

These details of the variables are already in the code like this:

	watcher kubernetes.Watcher
	started bool // true if watcher has started, false otherwise

	metricsetsUsing []string // list of metricsets using this watcher

	enrichers       map[string]*enricher // map of enrichers using this watcher. The key is the metricset name
	metadataObjects map[string]bool      // map of ids of each object received by the handler functions

	nodeScope      bool               // whether this watcher is only for current node
	restartWatcher kubernetes.Watcher // whether this watcher needs a restart

metricbeat/module/kubernetes/util/enrichers.md (outdated; resolved)
- If it does not exist, it creates it and [stores](https://github.com/constanca-m/beats/blob/8fcdfc205ebe0221ed7a1046c673c9babfb5280d/metricbeat/module/kubernetes/util/kubernetes.go#L288C3-L288C18) it in the `resourceWatchers` struct. When creating a new metaWatcher struct, the `started` option is set to false, as the watcher has not been started yet (just created). The `metadataObjects`, `enrichers` and `metricsetsUsing` are initialized to empty, the `restartWatcher` is set to nil, and the `nodeScope` is set according to the function input (except when it is an extra watcher, in which case it is hardcoded to false).
- If the watcher for that resource already exists, we check if it needs to be restarted. This situation can arise in specific cases. For instance, if a pod watcher has been created from a metricset like pod or container with nodeScope set to true (watching only from one node), and then another metricset like state_pod or state_container tries to create a pod watcher (which can happen only in the leader node), the watcher already exists. However, if we don't take any action in this scenario, the state_pod watcher would use a watcher that watches only from one node, leading to missing metadata, as described earlier. To address this, we need to update the watcher, mainly changing its watch options (removing options.Node). Unfortunately, a running watcher cannot be updated directly. Instead, we must stop it and create a new one with the correct watch options. The new restartWatcher must be identical to the old watcher, including the same handler function (more on that later), with the only difference being in the watch options. Consequently, the metaWatcher struct of the watcher gets updated with the restartWatcher. The process of stopping the old watcher and starting the new one is handled later on.
- After createAllWatchers function creates all the watchers that are needed and update the resourceWatchers struct, the code flow returns to NewResourceMetadataEnricher
- Now, let's delve into the creation of metadata generators and handler functions. Watchers, on their own, are responsible for subscribing to the Kubernetes API and monitoring changes in the resources they are configured to watch. Their primary function is to update their internal cache or store. However, to determine what actions to take when a change occurs, we rely on the so-called event handlers.
Contributor

Could this list maybe be structured with sub-titles? For example, could metadata generation be split into another list?

// Enrich the given list of events
Enrich([]mapstr.M)
}

type enricher struct {
Contributor

Can we add comments about this struct and its fields here, instead of in the documentation, to keep the implementation details in code and the high-level overview in https://github.com/elastic/beats/blob/a7f709a153a66bdb5369128342bd21c36da6439e/metricbeat/module/kubernetes/util/enrichers.md instead?

} else if resourceMetaWatcher.nodeScope != nodeScope && resourceMetaWatcher.nodeScope {
// It might happen that the resourceMetaWatcher already exists, but is only being used to monitor the resources
// of a single node. In that case, we need to check if we are trying to create a new resourceMetaWatcher that will track
// the resources of multiple nodes. If it is the case, then we need to update the resourceMetaWatcher.
Contributor

In that case, we need to check if we are trying to create a new resourceMetaWatcher that will track the resources of multiple nodes.

It is not multiple nodes, but rather the whole cluster/clusterScope, right?
Such behavior is expected only on the leader node, right?
What will happen in case of re-election? Is it needed in that case to switch from clusterScope to nodeScope (right now this condition only works for nodeScope: true)?

Contributor

It is not multiple nodes, but rather the whole cluster/clusterScope, right?

You are right

What will happen in case of re-election? Is it needed in that case to switch from clusterScope to nodeScope?

Under what conditions does re-election happen? I think it only happens if an agent is down/restarted, so in that case the whole metricset initialization will take place again and the ones with cluster scope will not start on this node.
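
To make the condition under discussion concrete, here is a minimal sketch (assumed names and simplified types, not the PR's code) of the scope check quoted above: a watcher created with nodeScope set to true gets a cluster-wide replacement prepared when a cluster-scoped metricset needs it, since a running watcher cannot change its watch options in place.

package main

import "fmt"

// watchOptions mimics kubernetes.WatchOptions: an empty Node means cluster scope.
type watchOptions struct{ Node string }

type metaWatcher struct {
	options        watchOptions
	nodeScope      bool
	restartWatcher *metaWatcher // replacement prepared when the scope must widen
}

// ensureScope widens an existing node-scoped watcher when a cluster-scoped
// consumer appears, mirroring the condition quoted in the diff above.
func ensureScope(existing *metaWatcher, nodeScope bool) {
	if existing.nodeScope != nodeScope && existing.nodeScope {
		// Prepare a new watcher with cluster-wide options; the old one is
		// stopped and swapped out later (it cannot be reconfigured while running).
		existing.restartWatcher = &metaWatcher{
			options:   watchOptions{Node: ""}, // no node filter => whole cluster
			nodeScope: false,
		}
	}
}

func main() {
	podWatcher := &metaWatcher{options: watchOptions{Node: "worker-1"}, nodeScope: true}
	ensureScope(podWatcher, false) // e.g. state_pod on the leader asks for cluster scope
	fmt.Println(podWatcher.restartWatcher != nil) // true: a restart is scheduled
}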

-The Kubernetes autodiscover provider enriches with metadata mainly when it comes to log collection when it is configured. It is by default in the `container_logs` integration in the Elastic Agent.
- Metadata enrichment from the enrichers happens for all the following Kubernetes metricsets: `state_namespace`, `state_node`, `state_deployment`, `state_daemonset`, `state_replicaset`, `state_pod`, `state_container`, `state_job`, `state_cronjob`, `state_statefulset`, `state_service`, `state_persistentvolume`, `state_persistentvolumeclaim`, `state_storageclass`, `pod`, `container`, `node`.
- The reason that these metricsets trigger the metadata enrichment is because of the way they start.
- All `state_metricsets` (except `state_container`) trigger the shared [kubernetes.Init](https://github.com/constanca-m/beats/blob/8fcdfc205ebe0221ed7a1046c673c9babfb5280d/metricbeat/module/kubernetes/state_daemonset/state_daemonset.go#L45) function when they get initialized.
Contributor

Almost all the links are from the private repo https://github.com/constanca-m/beats; they can become outdated quite soon or be removed when the feature branch is merged.

- The pod watcher event handler will only call the updateFunc of the one enricher it has, when triggered
- A new pod appears, the watcher handler function gets triggered and it executes the updatefunc of only the pod metricset's enricher
- So the pod metricset's enricher's metadata map will have the metadata for all pod resources
- The state_pod metricset gets initialized. This happens on the leader node. First the pod metricset starts and then state_pod
Contributor

First the pod metricset starts and then state_pod
Does it always happen like this, or is it just an example here?

Contributor

It always happens. The reason is that for the state_* metricsets to start, the leader election needs to take place first, which takes some time.

- A new enricher for state_pod gets created, it uses the same existing pod watcher. Inside buildMetadataEnricher the watcher.enrichers map gets updated to include the state_pod enricher as well
- So whenever a new pod appears or gets updated, both enricher's updateFunc will be triggered, so both enrichers' metadata map will be up to date
- But what happens with pods that triggered a watcher event in between the pod and state_pod initialization, i.e. after the pod metricset got initialized and before state_pod got initialized?
- This is very common. The moment a watcher starts watching (from the pod metricset) it immediately gets notified for all pods in the cluster and executes whatever its updateFunc says. In that case the pod enricher's updateFunc
Contributor

In that case the pod enricher's updateFunc

this sentence is not finished, no?

- This is very common. The moment a watcher starts watching (from the pod metricset) it immediately gets notified for all pods in the cluster and executes whatever its updateFunc says. In that case the pod enricher's updateFunc
- So when the watcher got updated with the state_pod enricher, the events for the existing pods had already arrived, and at that point the watcher did not call the state_pod enricher's updateFunc
- The outcome is that the state_pod enricher's metadata map won't have the existing pods' metadata, because the metadata generation method of that enricher was never called
- So we need a way to handle those cases
Contributor

Is this an explanation of the restartWatcher? It seems to be unfinished.

Co-authored-by: Tetiana Kravchenko <tanya.kravchenko.v@gmail.com>

mergify bot commented Mar 25, 2024

This pull request is now in conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b fix-multiple-watchers upstream/fix-multiple-watchers
git merge upstream/main
git push upstream fix-multiple-watchers

- The reason that these metricsets trigger the metadata enrichment is because of the way they start.
- All `state_metricsets` (except `state_container`) trigger the shared [kubernetes.Init](https://github.com/constanca-m/beats/blob/8fcdfc205ebe0221ed7a1046c673c9babfb5280d/metricbeat/module/kubernetes/state_daemonset/state_daemonset.go#L45) function when they get initialized.
- [kubernetes.Init](https://github.com/constanca-m/beats/blob/8fcdfc205ebe0221ed7a1046c673c9babfb5280d/metricbeat/helper/kubernetes/state_metricset.go#L44) calls the [New metricsets method](https://github.com/constanca-m/beats/blob/8fcdfc205ebe0221ed7a1046c673c9babfb5280d/metricbeat/helper/kubernetes/state_metricset.go#L80), which returns a new metricset with resource metadata enricher as part of it.
- Node, pod, container, and `state_container` metricsets do not trigger the `kubernetes.Init` rather they implement their own [New method](https://github.com/constanca-m/beats/blob/8fcdfc205ebe0221ed7a1046c673c9babfb5280d/metricbeat/module/kubernetes/node/node.go#L66) with the enricher as part of the metricsets returned.
Contributor Author

I think all points up until this one are not necessary in this file, and they might add confusion. We just need to say the enrichers are created in the metricset struct. We talk a lot about the kubernetes.Init, and I think that code might change in the future, and then this file won't be updated. Can we remove all these points and just add a sentence explaining where we start enrichers? @MichaelKatsoulis

go.mod (outdated; resolved)
// removeFromMetricsetsUsing removes the metricset from the list of resources using the shared watcher.
// It returns true if element was removed and new size of array.
// The cache should be locked when called.
func removeFromMetricsetsUsing(resourceName string, notUsingName string, resourceWatchers *Watchers) (bool, int) {
Contributor

The inverse of this function, addToMetricsetsUsing, grabs the lock, and this function doesn't. That inconsistency is not obvious. I would prefer to see the two functions be consistent.

Contributor Author

At the time we call this function, we are already holding the lock; that is why it is not consistent. I will add a comment to addToMetricsetsUsing mentioning the lock.
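
A minimal sketch (assumed shapes, not the PR's code) of the locking convention discussed in this thread: addToMetricsetsUsing takes the lock itself, while removeFromMetricsetsUsing expects the caller to already hold it because it runs as part of a larger locked update.

package main

import "sync"

type watchers struct {
	lock  sync.RWMutex
	using map[string][]string // resource -> metricsets using its watcher
}

// addToMetricsetsUsing locks internally; safe to call from anywhere.
func (w *watchers) addToMetricsetsUsing(resource, metricset string) {
	w.lock.Lock()
	defer w.lock.Unlock()
	w.using[resource] = append(w.using[resource], metricset)
}

// removeFromMetricsetsUsing must be called with w.lock already held.
// It reports whether the metricset was removed and the new size of the list.
func (w *watchers) removeFromMetricsetsUsing(resource, metricset string) (bool, int) {
	list := w.using[resource]
	out := list[:0]
	removed := false
	for _, m := range list {
		if m == metricset {
			removed = true
			continue
		}
		out = append(out, m)
	}
	w.using[resource] = out
	return removed, len(out)
}

func main() {
	w := &watchers{using: map[string][]string{}}
	w.addToMetricsetsUsing("pod", "pod")
	w.addToMetricsetsUsing("pod", "state_pod")

	w.lock.Lock() // the caller holds the lock for the whole stop sequence
	removed, size := w.removeFromMetricsetsUsing("pod", "pod")
	w.lock.Unlock()
	_, _ = removed, size
}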

Signed-off-by: constanca <constanca.manteigas@elastic.co>
Signed-off-by: constanca <constanca.manteigas@elastic.co>
Contributor

@blakerouse blakerouse left a comment


Thanks for the additional comment and cleanup of the go.mod file.

constanca-m and others added 2 commits April 3, 2024 09:06
Signed-off-by: constanca <constanca.manteigas@elastic.co>
@constanca-m constanca-m requested a review from a team as a code owner April 3, 2024 07:23
@constanca-m constanca-m merged commit 460b5c4 into elastic:main Apr 3, 2024
156 of 160 checks passed
@constanca-m constanca-m deleted the fix-multiple-watchers branch April 3, 2024 13:21
Labels
Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team Team:Elastic-Agent Label for the Agent team