
increase time period for rate over cadvisor metrics #254

Merged
merged 2 commits into prometheus-operator:master on Oct 2, 2019

Conversation

paulfantom
Member

Workaround for an issue described in prometheus-operator/prometheus-operator#2677 (comment)

/cc @s-urbaniak
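
Concretely, this widens the range that rate() uses over the cAdvisor CPU metric in the adapter's containerQuery so that it spans several kubelet scrapes. A minimal sketch of the resulting rule, written in the same TypeScript/Helm-values shape quoted later in this thread (the asset in this repo is jsonnet, so only the 5m window is the point here; the surrounding structure is illustrative):

// Illustrative sketch only: the adapter's CPU resource rule with the rate
// window widened to 5m. The query string mirrors the one quoted later in
// this thread; the actual generated asset lives in this repo's jsonnet.
const cpuResourceRule = {
    containerQuery: "sum(rate(container_cpu_usage_seconds_total{<<.LabelMatchers>>,container!='',container!='POD',pod!=''}[5m])) by (<<.GroupBy>>)",
    resources: {
        overrides: {
            node: { resource: "node" },
            namespace: { resource: "namespace" },
            pod: { resource: "pod" }
        }
    },
    containerLabel: "container"
};

The adapter's window setting is expected to match the duration used inside rate(), which is what the rest of this conversation is about.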

@s-urbaniak
Contributor

I think you have to regenerate the assets too.

@s-urbaniak
Contributor

/lgtm

@paulfantom merged commit 21ace9b into prometheus-operator:master on Oct 2, 2019
@tbarrella

Hello, I was looking into applying this change and noticed this comment in the prometheus-adapter code, saying that window should match the duration in the rate queries. Should the nodeQuery and window be adjusted to 5m as well?

I was also curious about the trade-offs between using a larger window in prometheus-adapter and reducing the cAdvisor scraping interval, which also seems to address the issue.
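
For reference, the "reduce the cAdvisor scraping interval" alternative would mean shortening the interval on the kubelet ServiceMonitor endpoint that scrapes /metrics/cadvisor. A hypothetical sketch using pulumi-kubernetes; the field names follow the prometheus-operator ServiceMonitor CRD, but the object name, namespace, selector labels and 10s interval are assumptions, not values taken from this repo:

import * as k8s from "@pulumi/kubernetes";

// Hypothetical: scrape cAdvisor more often instead of widening the rate window.
const kubeletCadvisorMonitor = new k8s.apiextensions.CustomResource("kubelet-cadvisor", {
    apiVersion: "monitoring.coreos.com/v1",
    kind: "ServiceMonitor",
    metadata: { name: "kubelet-cadvisor", namespace: "monitoring" },   // assumed names
    spec: {
        selector: { matchLabels: { "k8s-app": "kubelet" } },           // assumed kubelet Service label
        namespaceSelector: { matchNames: ["kube-system"] },
        endpoints: [{
            port: "https-metrics",
            scheme: "https",
            path: "/metrics/cadvisor",
            interval: "10s",   // shorter interval so rate() has enough samples in a small window
            honorLabels: true,
            bearerTokenFile: "/var/run/secrets/kubernetes.io/serviceaccount/token",
            tlsConfig: { insecureSkipVerify: true }
        }]
    }
});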

@gitfool
Contributor

gitfool commented Oct 6, 2019

So nodeQuery should also be changed to use 5m, right?

Another question: is there a reason it uses a kubelet (cAdvisor) metric for containerQuery and a node-exporter metric for nodeQuery? Wouldn't it be better, or at least more consistent, to use the same metric source for both?

@s-urbaniak
Contributor

@gitfool AFAIK nodeQuery is not affected, as it is sourced from node-exporter, not from cAdvisor.

@gitfool
Contributor

gitfool commented Oct 7, 2019

@s-urbaniak Okay, but now I'm confused: my understanding was that, for the resource metrics config, any rate used in a containerQuery or nodeQuery should use the same duration as specified in window, regardless of other problems like jitter.

Also, could you enlighten me about the split between kubelet and node-exporter metrics? Is one preferred over the other for some reason?

@paulfantom
Member Author

Also, could you enlighten me about the split between kubelet and node-exporter metrics? Is one preferred over the other for some reason?

Node-exporter gives much more information about the node itself, as that is its main job. The kubelet, on the other hand, is aimed at pods/containers and thus gives better information about those. It's just a matter of selecting the best tool for the job.
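
As a rough summary of that split, these are the metric families behind the queries discussed in this thread (a summary only; both components expose far more than this):

// Where the resource metrics in this conversation come from.
const metricSources = {
    kubeletCadvisor: [
        "container_cpu_usage_seconds_total",   // per-container CPU counter, rate()d for usage
        "container_memory_working_set_bytes"   // per-container working-set memory
    ],
    nodeExporter: [
        "node_cpu_seconds_total",              // per-CPU mode counters, e.g. mode="idle"
        "node_memory_MemTotal_bytes",
        "node_memory_MemAvailable_bytes"
    ]
};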

Okay, but now I'm confused: my understanding was that, for the resource metrics config, any rate used in a containerQuery or nodeQuery should use the same duration as specified in window, regardless of other problems like jitter.

Yes, the rate window in nodeQuery should be fixed and changed to 5m as well. Do you want to create a PR with the fix? :)
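
A sketch of what that fix would look like, using the node-exporter-based nodeQuery quoted further down in this thread with its rate window widened to match window:

// Sketch only (not the merged change): the node-exporter based CPU nodeQuery
// with a 5m rate window so it matches `window` and the containerQuery.
const cpuNodeQuery =
    "sum(1 - rate(node_cpu_seconds_total{mode='idle'}[5m])" +
    " * on(namespace,pod) group_left(node)" +
    " node_namespace_pod:kube_pod_info:{<<.LabelMatchers>>}) by (<<.GroupBy>>)";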

@gitfool
Contributor

gitfool commented Oct 7, 2019

Node-exporter gives much more information about the node itself, as that is its main job. The kubelet, on the other hand, is aimed at pods/containers and thus gives better information about those. It's just a matter of selecting the best tool for the job.

@paulfantom I can appreciate that, but it does add to the confusion for Prometheus newbies like myself. For simple resource metrics, specifically CPU and memory usage, is there a difference in the quality of the values coming from the kubelet vs. the node exporter?

I'm currently using the kubelet for all resource metrics as follows (using pulumi-kubernetes):

// (import shown for completeness; setMonitoringNamespace, monitoringNamespace,
// prometheusOperatorChart and provider are defined elsewhere in this program)
import * as k8s from "@pulumi/kubernetes";

const prometheusAdapterChart = new k8s.helm.v2.Chart("pa", {
    repo: "stable",
    chart: "prometheus-adapter",
    version: "1.3.0",
    namespace: "monitoring",
    transformations: [ setMonitoringNamespace ],
    values: {
        prometheus: { url: "http://po-prometheus-operator-prometheus" },
        rules: {
            resource: {
                cpu: {
                    containerQuery: "sum(rate(container_cpu_usage_seconds_total{<<.LabelMatchers>>,container!='',container!='POD',pod!=''}[5m])) by (<<.GroupBy>>)",
                    nodeQuery: "sum(rate(container_cpu_usage_seconds_total{<<.LabelMatchers>>,id='/'}[5m])) by (<<.GroupBy>>)",
                    resources: {
                        overrides: {
                            node: { resource: "node" },
                            namespace: { resource: "namespace" },
                            pod: { resource: "pod" }
                        }
                    },
                    containerLabel: "container"
                },
                memory: {
                    containerQuery: "sum(container_memory_working_set_bytes{<<.LabelMatchers>>,container!='',container!='POD',pod!=''}) by (<<.GroupBy>>)",
                    nodeQuery: "sum(container_memory_working_set_bytes{<<.LabelMatchers>>,id='/'}) by (<<.GroupBy>>)",
                    resources: {
                        overrides: {
                            node: { resource: "node" },
                            namespace: { resource: "namespace" },
                            pod: { resource: "pod" }
                        }
                    },
                    containerLabel: "container"
                },
                window: "5m"
            }
        }
    }
}, {
    dependsOn: [ monitoringNamespace, prometheusOperatorChart ],
    provider: provider
});

Previously I was actually using a mix with the node exporter as follows:

const prometheusAdapterChart = new k8s.helm.v2.Chart("pa", {
    repo: "stable",
    chart: "prometheus-adapter",
    version: "1.3.0",
    namespace: "monitoring",
    transformations: [ setMonitoringNamespace ],
    values: {
        prometheus: { url: "http://po-prometheus-operator-prometheus" },
        rules: {
            resource: {
                cpu: {
                    containerQuery: "sum(rate(container_cpu_usage_seconds_total{<<.LabelMatchers>>,container!='',container!='POD',pod!=''}[5m])) by (<<.GroupBy>>)",
                    nodeQuery: "sum(1 - rate(node_cpu_seconds_total{mode='idle'}[5m]) * on(namespace,pod) group_left(node) node_namespace_pod:kube_pod_info:{<<.LabelMatchers>>}) by (<<.GroupBy>>)",
                    resources: {
                        overrides: {
                            node: { resource: "node" },
                            namespace: { resource: "namespace" },
                            pod: { resource: "pod" }
                        }
                    },
                    containerLabel: "container"
                },
                memory: {
                    containerQuery: "sum(container_memory_working_set_bytes{<<.LabelMatchers>>,container!='',container!='POD',pod!=''}) by (<<.GroupBy>>)",
                    nodeQuery: "sum((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) * on(namespace,pod) group_left(node) node_namespace_pod:kube_pod_info:{<<.LabelMatchers>>}) by (<<.GroupBy>>)",
                    resources: {
                        overrides: {
                            node: { resource: "node" },
                            namespace: { resource: "namespace" },
                            pod: { resource: "pod" }
                        }
                    },
                    containerLabel: "container"
                },
                window: "5m"
            }
        }
    }
}, {
    dependsOn: [ monitoringNamespace, prometheusOperatorChart ],
    provider: provider
});

It took me a while to work out what I needed to do to get the memory nodeQuery working with the node exporter. Examples, like the config used in this repo, didn't work until I realized I also needed to get the node from node_namespace_pod:kube_pod_info: as the instance label did not match the node name, presumably due to environmental differences when running on AWS EKS.

Anyway, this exercise taught me a lot about the Prometheus Adapter but left me with questions about what was the best way to do things and why, so I appreciate any insights. 😃
