Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add some dimensions to the kubernetes integration #2076

Merged
merged 15 commits into from
Nov 22, 2021

Conversation

jsoriano
Copy link
Member

@jsoriano jsoriano commented Nov 1, 2021

On time series databases, dimensions are the set of values that identifies a single time series. For efficiency this set should be minimal, but enough to avoid mixing unrelated data.
It is important to select a good set of dimensions, because changes on them would be breaking.

As an initial exercise before starting to use it more broadly in integrations, I have selected the dimensions for some kubernetes data stream:

  • For pods: the pod uid, the name and the namespace. The uid could be actually enough, but I have found that it is not present for all pods.
  • For containers: same as for pods, plus the container name.
  • For apiserver: most things that can be different between requests. I wonder if we should select all things that can be different between requests.

In other data streams I have set as dimensions the identifiers related to the monitored resource, and the namespace if the resource is at the namespace level.

I wonder if we should also add cluster identifiers as dimensions. On one side there can be pods with the same name and same namespace on different clusters, but on the other side I wonder how common this is, they usually have randomized suffixes.

@ChrsMark can you think of other fields that would make sense having as dimensions?

@nik9000 could you please take a look here to see if the selected dimensions make sense?

Related links:

@jsoriano jsoriano added Team:Integrations Label for the Integrations team Team:Ecosystem Label for the Packages Ecosystem team labels Nov 1, 2021
@jsoriano jsoriano self-assigned this Nov 1, 2021
@elasticmachine
Copy link

elasticmachine commented Nov 1, 2021

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2021-11-19T11:17:15.903+0000

  • Duration: 31 min 10 sec

  • Commit: 8f082c9

Test stats 🧪

Test Results
Failed 0
Passed 114
Skipped 0
Total 114

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

Copy link

@imotov imotov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks reasonable to me.

@ChrsMark
Copy link
Member

ChrsMark commented Nov 2, 2021

Very interesting @jsoriano ! I find your additions sane, I only left one comment/question about the container one.

@jsoriano jsoriano marked this pull request as ready for review November 15, 2021 15:41
@elasticmachine
Copy link

Pinging @elastic/integrations (Team:Integrations)

@jsoriano
Copy link
Member Author

I went through all the data streams in the kubernetes integration and marked dimensions accordingly, I am opening this for review.

In the proxy data stream I haven't marked any dimension, I was not sure which one to use, should we use one that identifies the node? I wonder if this data stream should collect the kubernetes.node.name then. @ChrsMark wdyt?

Regarding the use of the cluster and node identifiers as dimensions, I may not been very consistent, I have tried to identify the cases where having them as dimensions are more relevant. An extra eye/opinion regarding that would be interesting.

@ChrsMark
Copy link
Member

For proxy maybe https://github.com/elastic/beats/blob/95626b8f1690344312c0831ab2bdcbccffe4d089/metricbeat/module/kubernetes/proxy/proxy.go#L47 or some of those labels like host+handler?

Regarding cluster identifier I would avoid it since it's not a field that will always be there, we populate it in a best effort way.

Copy link
Member

@ChrsMark ChrsMark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, left just 2 comments:

  1. what it this -next tag for the version
  2. let's try to be consistent on the container's ecs fields and the place we locate them. If we need to move some of them to the ecs specific file we need to do it for all of them.

About the orchestator fields I saw that they are used in combination with node.name so I think we are just fine if those fields are missing since we can rely on the node.name? Would that somehow make things tricky if we have several clusters and only some of them have the identifiers (like some one AWS and some on GKE)?

packages/kubernetes/data_stream/container/fields/ecs.yml Outdated Show resolved Hide resolved
@@ -1,7 +1,7 @@
format_version: 1.0.0
name: kubernetes
title: Kubernetes
version: 1.4.1
version: 1.4.2-next
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you provide more background in what special this release has? It's first time I'm seeing this and wondering what is the difference and if it would require any special handling in the release process (promote etc).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With semver, any suffix added this way is considered a prerelease. With this we can accumulate changes for snapshot packages, and can decide later to release 1.4.2 (or 1.5.0).
This is something supported in the spec since elastic/package-spec#193.
In principle a "prerelease" version shouldn't be used in production environments, and shouldn't be promoted to production. We are discussing more about this in elastic/package-spec#225

But I can tag a proper release and promote it if prefered.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with both options, however if we have new releases while this one is still a pre-lease then we will have to manually make it an actual release by pushing directly to the registry, right?

@jsoriano
Copy link
Member Author

About the orchestator fields I saw that they are used in combination with node.name so I think we are just fine if those fields are missing since we can rely on the node.name? Would that somehow make things tricky if we have several clusters and only some of them have the identifiers (like some one AWS and some on GKE)?

I added them to have a way to distinguish nodes with the same name on different pods. It is ok if it is empty, it will only not be meaningful to identify timeseries. If a user needs to distinguish resources between clusters that don't provide this info automatically, they will need to manually set these fields.

I wonder if they should be dimensions in all cases, many resources can have the same name on different clusters and not all of them have randomized uids.

@jsoriano
Copy link
Member Author

jsoriano commented Nov 16, 2021

Well, thinking again about the orchestrator fields, I guess that it depends on how this data is used. If data from different clusters (that cannot be identified otherwise) is shown together, then I think that it will make sense. Otherwise, users can still filter their data per cluster. I am going to remove it.

@jsoriano
Copy link
Member Author

/test

Copy link
Member

@ChrsMark ChrsMark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good! Not sure if CI failure is relevant.

@jsoriano
Copy link
Member Author

Waiting for #2177, that will include checks for the dimensions fields.

@jsoriano
Copy link
Member Author

All good with CI after including the checks for dimensions in elastic-package, merging, thanks!

@jsoriano jsoriano merged commit f58f321 into elastic:master Nov 22, 2021
@jsoriano jsoriano deleted the kubernetes-dimensions branch November 22, 2021 10:38
eyalkraft pushed a commit to build-security/integrations that referenced this pull request Mar 30, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Team:Ecosystem Label for the Packages Ecosystem team Team:Integrations Label for the Integrations team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants