
Feature/istio metrics #4253

Merged: 29 commits merged into elastic:main from feature/istio_metrics on Nov 18, 2022
Conversation

@gsantoro (Contributor) commented Sep 21, 2022

What does this PR do?

Provide an integration for Istio metrics, for both the Istiod and the sidecar proxy containers.

Checklist

  • I have reviewed tips for building integrations and this pull request is aligned with them.
  • I have verified that all data streams collect metrics or logs.
  • I have added an entry to my package's changelog.yml file.
  • I have verified that Kibana version constraints are current according to guidelines.

Author's Checklist

How to test this PR locally

Same steps as #3632

Related issues

Screenshots

@gsantoro added the enhancement, In Progress, and Team:Cloudnative-Monitoring labels on Sep 21, 2022
@gsantoro self-assigned this on Sep 21, 2022
@gsantoro requested a review from a team as a code owner on September 21, 2022, 16:15
@elasticmachine commented Sep 21, 2022

🚀 Benchmarks report

To see the full report, comment with /test benchmark fullreport

@elasticmachine commented Sep 21, 2022

💚 Build Succeeded


Build stats

  • Start Time: 2022-11-18T16:48:29.240+0000

  • Duration: 16 min 57 sec

Test stats 🧪

Test Results
Failed 0
Passed 14
Skipped 0
Total 14

🤖 GitHub comments


To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

@elasticmachine commented Sep 21, 2022

🌐 Coverage report

Name          Coverage (covered/total)
Packages      100.0%   (1/1) 💚
Files         100.0%   (3/3) 💚
Classes       100.0%   (3/3) 💚
Methods        90.0%   (27/30)
Lines          97.472% (347/356)
Conditionals  100.0%   (0/0) 💚

@gsantoro requested review from ChrsMark and a team on September 23, 2022, 13:37
@@ -15,3 +15,22 @@ The `access_logs` data stream collects Istio access logs.
{{event "access_logs"}}

{{fields "access_logs"}}

Contributor

Can you add a section on how to test it locally? Steps on how to install Istio and the config needed until you see metrics in ELK?

Contributor Author

I don't think all the instructions on how to test this locally should end up in the official docs. It doesn't seem to be standard practice in other packages. In order to test it "locally" you need to set up all sorts of VMs, a K8s cluster, and an Elastic stack with custom properties. I don't think this kind of information should end up in this readme. For the Istio-specific setup steps, I am using the getting started guide at https://istio.io/latest/docs/setup/getting-started/ which is already documented in the issue.

Contributor

We tried to change this a little: have a look here https://github.com/elastic/integrations/tree/main/packages/nginx_ingress_controller/_dev/build#how-to-setup-and-test-ingress-controller-locally

In general, of course, not instructions to install all components; you can assume the user already has ELK, k8s, etc. But a reference to the getting-started guide with some additional hints on what to install would be great, I think.

Contributor Author

Ok, I can add something similar, but fortunately there aren't many custom configs or manual steps needed for either logs or metrics.

Member

I think there are other packages that have additional readme files under testing that describe in detail how the package should be tested. Users of the package will not see them, but any package dev has easy access to them.

@gsantoro (Contributor Author), Oct 13, 2022

@ruflin reading your comment, I would suggest that a more appropriate place for testing docs is packages/istio/data_stream/istiod_metrics/_dev/test/system or anywhere else hidden in the test folders. If I understand correctly, adding these testing docs here (at packages/istio/_dev/build/docs/README.md) will modify packages/istio/docs/README.md, which is used to expose the package docs to the package users.

I'm not a fan of mixing testing docs with official user facing docs where we usually document only the fields exposed by the integration.

@gizas (Contributor) left a comment

Do we plan to include any tests?

@gsantoro (Contributor Author)

Hey @gizas, about the tests:
I would have loved to add some tests like the Istio access_logs ones. Unfortunately, tests with Prometheus input like https://github.com/elastic/beats/blob/main/x-pack/metricbeat/module/istio/istiod/_meta/testdata/istiod.v1.7.1.plain are not supported in integrations. You can only have JSON or text (one log per line) tests. If we were to add JSON tests for the metrics, we would be testing only the operations in the ingest pipelines (which are currently just a few renames), but not the logic of converting Prometheus metrics into JSON documents in Elasticsearch format.

@gsantoro (Contributor Author)

@ChrsMark can you confirm my assumption at #4253 (comment) that I can't test this PR with prometheus metrics as input similarly to how you do that in your Istio beat module?

@ChrsMark (Member) commented Sep 26, 2022

@ChrsMark can you confirm my assumption at #4253 (comment) that I can't test this PR with prometheus metrics as input similarly to how you do that in your Istio beat module?

If I understand the question correctly: in general you can have system/integration tests for metrics in packages. See for example #4253 (comment).

Under the hood this spins up a target container using the files at https://github.com/elastic/integrations/tree/main/packages/nats/_dev/deploy. The approach should be documented at https://github.com/elastic/elastic-package/blob/main/docs/howto/system_testing.md.

In the case of Istio though, since it's not quite straightforward to spin up an Istio target container, I would suggest playing with a mocked API that serves a specific sample as the response. This has been done for the containerd package for the same reasons (https://github.com/elastic/integrations/tree/main/packages/containerd/_dev/deploy/docker) and, as explained at #2522 (review), for the k8s package before the development of the k8s deployer in elastic-package: https://github.com/elastic/integrations/tree/b189a92444c4dfc33a3462fdad21b40a5bf8ae4e/packages/kubernetes/_dev/deploy/docker. (Long term, we could invest some time to support a complete Istio setup for integration tests under elastic-package.)

Would that do the trick here too?

@gsantoro (Contributor Author)

Hey @ChrsMark, thanks for the reply.
Just to clarify (from my new understanding now): since integrations don't directly support the *.plain format like they do JSON or text, you need to write an integration test instead, similarly to what we do for containerd at https://github.com/elastic/integrations/tree/main/packages/containerd/_dev/deploy/docker or kubernetes at https://github.com/elastic/integrations/tree/b189a92444c4dfc33a3462fdad21b40a5bf8ae4e/packages/kubernetes/_dev/deploy/docker.

Diving into the docker-compose.yml used in both integration tests I have a couple of questions though:

  1. we use the docker image chanjarster/prometheus-mock-data:latest to mock the prometheus metrics from a static file. As far as I can see at https://github.com/chanjarster/prometheus-mock-data, this image is only necessary if you want to provide different metrics over time.

You would have a file

<1st time scrape data>
---
<2nd time scrape data>
---
<3rd time scrape data>
---
<4th time scrape data>
...

and that image would provide a different set of metrics data every time it is queried. My question is: since all the metrics in both those integrations have only one set of metrics (there is no --- separator), wouldn't it be easier to just serve the static file from an nginx docker image?

  2. is there a reason why we use an nginx container if its only purpose seems to be to proxy-pass to the chanjarster/prometheus-mock-data image?

If my assumptions on both questions are correct, we could use just the nginx image and serve the static file with the prometheus metrics.

@MichaelKatsoulis, @ChrsMark, and @mtojek could you please explain why we decided to use the two docker images and this setup?

BTW, I agree with your reasoning: since the Istio setup is quite memory intensive (I can't run everything on my laptop), and for the Istio metrics integration we only have to mock the Prometheus metrics, which we can do with a static file and an nginx image (or 2 docker images), we should just do that for now.

@ChrsMark (Member)

Hey @ChrsMark, thanks for the reply. Just to clarify (from my new understanding now): since integrations don't directly support the *.plain format like they do JSON or text, you need to write an integration test instead, similarly to what we do for containerd at https://github.com/elastic/integrations/tree/main/packages/containerd/_dev/deploy/docker or kubernetes at https://github.com/elastic/integrations/tree/b189a92444c4dfc33a3462fdad21b40a5bf8ae4e/packages/kubernetes/_dev/deploy/docker.

Correct :) .

Diving into the docker-compose.yml used in both integration tests I have a couple of questions though:

  1. we use the docker image chanjarster/prometheus-mock-data:latest to mock the prometheus metrics from a static file. As far as I can see at https://github.com/chanjarster/prometheus-mock-data, this image is only necessary if you want to provide different metrics over time.

I think the mock server just responds with the same sample over time; at least that's what the containerd case was about.
In general, when elastic-package runs these tests, it waits until it finds stored documents inside the data_stream's index in ES. So here the goal is to have the data_stream consume from the endpoint at least once and ship the data to ES, so yes, only one scrape would work here.

You would have a file

<1st time scrape data>
---
<2nd time scrape data>
---
<3rd time scrape data>
---
<4th time scrape data>
...

and that image would provide a different set of metrics data every time it is queried. My question is: since all the metrics in both those integrations have only one set of metrics (there is no --- separator), wouldn't it be easier to just serve the static file from an nginx docker image?

Maybe in containerd we don't have the --- separator, and that's why we have the same response over time?

  2. is there a reason why we use an nginx container if its only purpose seems to be to proxy-pass to the chanjarster/prometheus-mock-data image?

@MichaelKatsoulis do you have the answer handy on this? Is it because we want to proxy the request to a specific endpoint internally 🤔?

If my assumptions on both questions are correct, we could use just the nginx image and serve the static file with the prometheus metrics.

@MichaelKatsoulis, @ChrsMark, and @mtojek could you please explain why we decided to use the two docker images and this setup?

From my side, if you have an accessible mock API to serve the response you are good to go :), so whatever works best for you. But let's wait for @MichaelKatsoulis' answer on your question no. 2 above.

BTW, I agree with your reasoning: since the Istio setup is quite memory intensive (I can't run everything on my laptop), and for the Istio metrics integration we only have to mock the Prometheus metrics, which we can do with a static file and an nginx image (or 2 docker images), we should just do that for now.

@MichaelKatsoulis (Contributor)

@MichaelKatsoulis do you have the answer handy on this? Is it because we want to proxy the request to a specific endpoint internally 🤔?

I really don't. Maybe it was easier for me to reuse an example with both containers than to figure out how to serve a text file with nginx. But just an nginx container that serves the static text file is all we need.

@gsantoro (Contributor Author)

@ChrsMark and @MichaelKatsoulis I am having some trouble understanding how the system tests should work. Here are a couple of questions:

  • the system test for the containerd package is failing for me. Is this normal? Are we ignoring these tests in Jenkins?
  • the system test for the istio package (using nginx + a static file) fails with could not add data stream config to policy. I will have to investigate what elastic-package test is doing under the hood.
  • even assuming that I can make the tests pass, looking at other system tests (e.g. containerd) we don't compare an expected output with what the system test produces. Is that system test supposed to only check that we can scrape the Prometheus endpoint, or am I missing something?

@MichaelKatsoulis (Contributor)

  • the system test for the containerd package is failing for me. Is this normal? Are we ignoring these tests in Jenkins?

That is weird. I just tried them out locally and they pass for all data streams. elastic-package test system --data-streams memory -v for example. The tests are not ignored in Jenkins.

  • even assuming that I can make the tests pass, looking at other system tests (e.g. containerd) we don't compare an expected output with what the system test produces. Is that system test supposed to only check that we can scrape the Prometheus endpoint, or am I missing something?

The way they work is that an Elastic Agent is spun up, a policy with the specific data stream is created, and it starts collecting logs from an actual or mock system. Then elastic-package just checks that a data stream (index) for that data set is created in Elasticsearch. That's it. It is assumed that data are being collected and shipped.

@ChrsMark (Member) commented Oct 12, 2022

Hey let me try to answer those inline.

@ChrsMark and @MichaelKatsoulis I am having some trouble understanding how the system tests should work. Here are a couple of questions:

  • the system test for the containerd package is failing for me. Is this normal? Are we ignoring these tests in Jenkins?

Hmm it should not fail 🤔 . What exactly do you run? What do you see?

I run from latest master/main:

elastic-package stack up -d -v --version=8.6.0-SNAPSHOT

eval "$(elastic-package stack shellinit)"

elastic-package test system -v 

Note that this should be executed under the github.com/elastic/integrations/packages/containerd directory.
On my end it runs without an issue:

....
--- Test results for package: containerd - START ---
╭────────────┬─────────────┬───────────┬───────────┬────────┬───────────────╮
│ PACKAGE    │ DATA STREAM │ TEST TYPE │ TEST NAME │ RESULT │  TIME ELAPSED │
├────────────┼─────────────┼───────────┼───────────┼────────┼───────────────┤
│ containerd │ blkio       │ system    │ default   │ PASS   │ 25.721272744s │
│ containerd │ cpu         │ system    │ default   │ PASS   │ 27.524840255s │
│ containerd │ memory      │ system    │ default   │ PASS   │ 29.275314138s │
╰────────────┴─────────────┴───────────┴───────────┴────────┴───────────────╯
--- Test results for package: containerd - END   ---
Done
  • the system test for the istio package (using nginx + a static file) fails with could not add data stream config to policy. I will have to investigate what elastic-package test is doing under the hood.

This looks like the policy is not able to be assigned to the Agent. I think it has to do with the policy itself and not the environment (nginx etc.). Is the package policy able to be assigned to an Agent manually? Any other logs that look suspicious? @jsoriano any ideas how we could better define a troubleshooting process in cases like this?

  • even assuming that I can make the tests pass, looking at other system tests (e.g. containerd) we don't compare an expected output with what the system test produces. Is that system test supposed to only check that we can scrape the Prometheus endpoint, or am I missing something?

From https://github.com/elastic/elastic-package/blob/main/docs/howto/system_testing.md:

6. Wait a reasonable amount of time for the Agent to collect data from the integration service and index it into the correct Elasticsearch data stream.
7. Query the first 500 documents based on @timestamp for validation.
8. Validate mappings are defined for the fields contained in the indexed documents.
9. Validate that the JSON data types contained in _source are compatible with mappings declared for the field.

So the idea is that you test if the package can collect data from the endpoint and hence produce events, and if the events' fields are covered by the package's mappings.
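
For illustration, a minimal sketch of what a system test config for the istiod metrics data stream could look like, following the conventions in the system_testing.md how-to linked above (the file name, variable names, and period are assumptions, not necessarily what this PR ends up using):

# packages/istio/data_stream/istiod_metrics/_dev/test/system/test-default-config.yml (hypothetical)
vars: ~
data_stream:
  vars:
    # {{Hostname}} and {{Port}} are resolved by elastic-package from the
    # docker-compose service started from _dev/deploy/docker
    hosts:
      - http://{{Hostname}}:{{Port}}
    period: 10s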

@gsantoro (Contributor Author)

For future reference,
there were a couple of things to fix for the system tests to pass:

  1. the prometheus metrics files (istio.txt and proxy.txt) needed an empty line at the end of the file
  2. the integration needed to be built and published to the package registry running as part of elastic-package for the test to pass.

The following commands were run from within the integrations/packages/istio directory:

elastic-package build
elastic-package stack up --services package-registry -d

@gsantoro (Contributor Author)

I have now added some system tests that simulate the Prometheus metrics using just a static file and an nginx container.
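
A rough sketch of what such a deploy definition can look like (the service name, file name, and port here are illustrative; the committed files may differ):

# packages/istio/_dev/deploy/docker/docker-compose.yml (hypothetical sketch)
version: '2.3'
services:
  istio_mock:
    image: nginx:1.23
    volumes:
      # static Prometheus exposition-format file served as the scrape target;
      # a small nginx config may be needed to return it as text/plain
      - ./istio.txt:/usr/share/nginx/html/metrics:ro
    ports:
      - 80

The system test config then points the data stream's hosts at http://{{Hostname}}:{{Port}} so the Agent scrapes the static file instead of a real istiod Pod.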

@ChrsMark (Member) left a comment

Nicely done! Code-wise it looks good to me! Only one major comment to address, regarding the connectivity to the istiod component.

multi: false
required: true
show_user: true
default: ${kubernetes.labels.app} == 'istiod' and ${kubernetes.annotations.prometheus.io/scrape} == 'true'
@ChrsMark (Member), Nov 18, 2022

Doing a rehearsal of https://www.elastic.co/blog/istio-monitoring-with-elastic-observability, I remember that istiod could be accessed directly through its Service. For example:

Metricbeat config:

- module: istio 
  metricsets: ['istiod'] 
  period: 10s 
  hosts: ['istiod.istio-system:15014']

So in that case you don't need to automatically discover the Pods using a condition, since you can directly access the endpoint through the k8s Service:
istiod.istio-system:15014, where istiod is the name of the Service and istio-system is the name of the Namespace it belongs to.

Is this still the case, or has something changed?

Contributor

Also, any update on this @gsantoro? The PR can be merged for now, and we can open a small enhancement for this if you think you need more time.

Contributor Author

I think I prefer to merge and then open a new issue for that.

Member

But is istiod exposed through a Service today? We need to clarify this before proceeding.

Also, how many istiod Pods are running per cluster? If it's only one, which is also exposed through a Service, then we should rather use istiod.istio-system:15014 as the host endpoint along with a leader election condition, similarly to what we have for cluster-level metrics at https://github.com/elastic/integrations/blob/main/packages/kubernetes/data_stream/apiserver/manifest.yml#L21.
If there is more than one istiod Pod per cluster (for high availability, maybe), then we should check whether they all keep and expose the same metrics, or whether each one keeps and provides different metrics (using a sharding mechanism, for example). In that case we could use the "autodiscovery" approach; otherwise, if the endpoint of the control plane metrics is unique, we should not use the "autodiscovery" approach. See below:

The way we have it now makes it more complicated and resource consuming, since we use an "autodiscovery" approach when the endpoint is unique per cluster and static through a Service. This means that every time the istiod Pod's state is updated, an event that this Pod was updated is triggered and the "autodiscovery" evaluation is re-launched. This happens at https://github.com/elastic/elastic-agent/blob/main/internal/pkg/composable/providers/kubernetes/pod.go#L218-L232 and if you follow the codebase you can see that the processing load is quite a lot.

Add to this the fact that "autodiscovery"-based inputs are harder to troubleshoot.
Consequently, it's not a suggested practice to use "autodiscovery" conditions if they're not really needed, and I would prefer fixing it in the first place here instead of having a follow-up.
Let me know if that makes sense or if I'm missing anything here.
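
For reference, a hedged sketch of the static-endpoint alternative being suggested here, modelled on the apiserver manifest linked above (the variable names and defaults are assumptions, not the final manifest):

# hypothetical defaults for the istiod_metrics data stream manifest
vars:
  - name: hosts
    type: text
    multi: true
    default:
      - http://istiod.istio-system:15014
  - name: condition
    type: text
    show_user: true
    # scrape only from the elected leader, since the istiod Service endpoint
    # is unique per cluster and no per-Pod autodiscovery is needed
    default: ${kubernetes_leaderelection.leader} == true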

packages/istio/_dev/build/docs/README.md (resolved review thread)
@gizas (Contributor) commented Nov 18, 2022

@gsantoro I remember an issue you had about broken dashboards because datastreams were missing? I cannot see a specific filter in the dashboards at the moment, so I guess we are OK, right?

@gsantoro (Contributor Author)

@gsantoro I remember an issue you had about broken dashboards because datastreams were missing? I cannot see a specific filter in the dashboards at the moment, so I guess we are OK, right?

The issue with broken dashboards was because of a missing data view, not data streams, and yes, there are filters on the dashboards called panel filters. They are everywhere, a bit annoying, but that seems to be the price to pay for using the default data view metrics-* with a filter on event.dataset: istio.istiod_metrics and event.dataset: istio.proxy_metrics.

@gsantoro mentioned this pull request on Nov 18, 2022
version: 0.1.0
release: beta
version: 0.2.0
release: ga
Member

Are we going to make it GA already? Usually we make integrations GA at some point after the initial release. For example, at #482 there is a list of items (https://github.com/elastic/beats/blob/master/.github/ISSUE_TEMPLATE/module-checklist.md) that needs to be verified before the integration becomes GA. I'm not sure if this has changed. @gizas do you know more about this?

Contributor Author

Sorry, that was a last-minute change. The title of the Epic for this iteration (8.6) clearly states that this should become GA.

@gizas (Contributor), Nov 21, 2022

@gsantoro the title of the Epic might be misleading, but to clarify, the main goal was to perform the work in order to be ready for GA. Putting the label on the package before letting the integration be tested by our customers can be tricky.

I would suggest, before promoting this 0.2.0 version to production, doing a hotfix and reverting back to the beta label.

Maybe this is the chance to deal with the above #4253 (comment) as well, in the same push.

Contributor Author

PR at #4684

@gsantoro merged commit 34889a1 into elastic:main on Nov 18, 2022
@gsantoro deleted the feature/istio_metrics branch on November 18, 2022, 17:07