add metrics for provisioner usage #1872

Merged
merged 5 commits into from Jun 1, 2022
7 changes: 5 additions & 2 deletions Makefile
@@ -42,7 +42,7 @@ deflake:
battletest: strongertests
go tool cover -html coverage.out -o coverage.html

-verify: codegen ## Verify code. Includes dependencies, linting, formatting, etc
+verify: codegen docgen ## Verify code. Includes dependencies, linting, formatting, etc
go mod tidy
go mod download
golangci-lint run
@@ -78,6 +78,9 @@ codegen: ## Generate code. Must be run if changes are made to ./pkg/apis/...
output:crd:artifacts:config=charts/karpenter/crds
hack/boilerplate.sh

+docgen: ## Generate docs
+go run hack/docs/metrics_gen_docs.go pkg/ website/content/en/preview/tasks/metrics.md

release: ## Generate release manifests and publish a versioned container image.
$(WITH_GOFLAGS) ./hack/release.sh

@@ -102,4 +105,4 @@ issues: ## Run GitHub issue analysis scripts
website: ## Serve the docs website locally
cd website && npm install && git submodule update --init --recursive && hugo server

-.PHONY: help dev ci release test battletest verify codegen apply delete toolchain release licenses issues website
+.PHONY: help dev ci release test battletest verify codegen docgen apply delete toolchain release licenses issues website nightly snapshot
2 changes: 2 additions & 0 deletions cmd/controller/main.go
@@ -50,6 +50,7 @@ import (
"github.com/aws/karpenter/pkg/controllers/counter"
metricsnode "github.com/aws/karpenter/pkg/controllers/metrics/node"
metricspod "github.com/aws/karpenter/pkg/controllers/metrics/pod"
+metricsprovisioner "github.com/aws/karpenter/pkg/controllers/metrics/provisioner"
"github.com/aws/karpenter/pkg/controllers/node"
"github.com/aws/karpenter/pkg/controllers/persistentvolumeclaim"
"github.com/aws/karpenter/pkg/controllers/provisioning"
@@ -117,6 +118,7 @@ func main() {
node.NewController(manager.GetClient()),
metricspod.NewController(manager.GetClient()),
metricsnode.NewController(manager.GetClient()),
+metricsprovisioner.NewController(manager.GetClient()),
counter.NewController(manager.GetClient()),
).Start(ctx); err != nil {
panic(fmt.Sprintf("Unable to start manager, %s", err))
216 changes: 216 additions & 0 deletions hack/docs/metrics_gen_docs.go
@@ -0,0 +1,216 @@
/*
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/

package main

import (
"flag"
"fmt"
"go/ast"
"go/parser"
"go/token"
"io/fs"
"log"
"os"
"path/filepath"
"sort"
"strings"
)

type metricInfo struct {
namespace string
subsystem string
name string
help string
}

func (i metricInfo) qualifiedName() string {
return fmt.Sprintf("%s_%s_%s", i.namespace, i.subsystem, i.name)
}

// metrics_gen_docs is used to parse the source code for Prometheus metrics and automatically generate markdown documentation
// based on the naming and help provided in the source code.

func main() {
flag.Parse()
if flag.NArg() != 2 {
log.Printf("Usage: %s path/to/metrics/controller path/to/markdown.md", os.Args[0])
os.Exit(1)
}
fset := token.NewFileSet()
var packages []*ast.Package
root := flag.Arg(0)

// walk our metrics controller directory
log.Println("parsing code in", root)
// WalkDir reports unreadable entries through err; stop instead of silently skipping them
walkErr := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
if err != nil {
return err
}
if !d.IsDir() {
return nil
}
// parse the packages that we find
pkgs, err := parser.ParseDir(fset, path, func(info fs.FileInfo) bool {
return true
}, parser.AllErrors)
if err != nil {
log.Fatalf("error parsing, %s", err)
}
for _, pkg := range pkgs {
if strings.HasSuffix(pkg.Name, "_test") {
continue
}
packages = append(packages, pkg)
}
return nil
})
if walkErr != nil {
log.Fatalf("error walking %s, %s", root, walkErr)
}

// metrics are all package global variables
var allMetrics []metricInfo
for _, pkg := range packages {
Contributor

It would be nice to have a Karpenter dry-run mode where we could just parse the metrics endpoint rather than making all these assumptions about the pkg structure.

Contributor Author

Ah, yeah, that could work too. I could go either way; parsing the source isn't too bad, and we walk through all of the controller packages.

Contributor

I'm fine with this approach for now. This is missing cloudprovider metrics: https://github.com/aws/karpenter/blob/main/pkg/cloudprovider/metrics/cloudprovider.go

Contributor Author

Good catch, fixed.

for _, file := range pkg.Files {
for _, decl := range file.Decls {
switch v := decl.(type) {
case *ast.FuncDecl:
// ignore
case *ast.GenDecl:
if v.Tok == token.VAR {
allMetrics = append(allMetrics, handleVariableDeclaration(v)...)
}
default:

}
}
}
}
sort.Slice(allMetrics, bySubsystem(allMetrics))

outputFileName := flag.Arg(1)
f, err := os.Create(outputFileName)
if err != nil {
log.Fatalf("error creating output file %s, %s", outputFileName, err)
}
// close the generated file once main returns
defer f.Close()

log.Println("writing output to", outputFileName)
fmt.Fprintf(f, `---
title: "Metrics"
linkTitle: "Metrics"
weight: 100

description: >
Inspect Karpenter Metrics
---
`)
fmt.Fprintf(f, "<!-- this document is generated from hack/docs/metrics_gen_docs.go -->\n")
fmt.Fprintf(f, "Karpenter exposes several metrics in Prometheus format to allow monitoring cluster provisioning status.\n")
previousSubsystem := ""
for _, metric := range allMetrics {
if metric.subsystem != previousSubsystem {
fmt.Fprintf(f, "## %s%s Metrics\n", strings.ToTitle(metric.subsystem[0:1]), metric.subsystem[1:])
previousSubsystem = metric.subsystem
fmt.Fprintln(f)
}
fmt.Fprintf(f, "### `%s`\n", metric.qualifiedName())
fmt.Fprintf(f, "%s\n", metric.help)
fmt.Fprintln(f)
}

}

func bySubsystem(metrics []metricInfo) func(i int, j int) bool {
subSystemSortOrder := map[string]int{}
subSystemSortOrder["provisioner"] = 1
subSystemSortOrder["nodes"] = 2
subSystemSortOrder["pods"] = 3
subSystemSortOrder["cloudprovider"] = 4
subSystemSortOrder["allocation_controller"] = 5
return func(i, j int) bool {
Contributor

If the subsystem isn't one of the keys above, it'll sort first (map lookup returns the zero value, 0), right? Can we make it the other way around?

lhs := metrics[i]
rhs := metrics[j]
if subSystemSortOrder[lhs.subsystem] != subSystemSortOrder[rhs.subsystem] {
return subSystemSortOrder[lhs.subsystem] < subSystemSortOrder[rhs.subsystem]
}
return lhs.qualifiedName() < rhs.qualifiedName()
}
}

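// handleVariableDeclaration extracts metric metadata from a single var declaration.
// It inspects each initializer that is a two-argument call into the prometheus package
// (for example prometheus.NewGaugeVec(opts, labels)) and reads the Namespace, Subsystem,
// Name and Help fields from the options literal passed as the first argument.
func handleVariableDeclaration(v *ast.GenDecl) []metricInfo {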
Contributor

Can you comment this function so it's more readable to newcomers?

var metrics []metricInfo
for _, spec := range v.Specs {
vs, ok := spec.(*ast.ValueSpec)
if !ok {
continue
}
for _, v := range vs.Values {
ce, ok := v.(*ast.CallExpr)
if !ok {
continue
}
funcPkg := getFuncPackage(ce.Fun)
if funcPkg != "prometheus" {
continue
}
if len(ce.Args) != 2 {
continue
}
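// the first argument must be the options literal, e.g. prometheus.GaugeOpts{...}
arg, ok := ce.Args[0].(*ast.CompositeLit)
if !ok {
continue
}
keyValuePairs := map[string]string{}
for _, el := range arg.Elts {
kv, ok := el.(*ast.KeyValueExpr)
if !ok {
// positional (unkeyed) fields carry no field name to match on
continue
}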
key := fmt.Sprintf("%s", kv.Key)
switch key {
case "Namespace", "Subsystem", "Name", "Help":
default:
// skip any keys we don't care about
continue
}
value := ""
switch val := kv.Value.(type) {
case *ast.BasicLit:
value = val.Value
case *ast.SelectorExpr:
if selector := fmt.Sprintf("%s.%s", val.X, val.Sel); selector == "metrics.Namespace" {
value = "karpenter"
} else {
log.Fatalf("unsupported selector %s", selector)
}
default:
log.Fatalf("unsupported value %T %v", kv.Value, kv.Value)
}
keyValuePairs[key] = strings.TrimFunc(value, func(r rune) bool {
return r == '"'
})
}
metrics = append(metrics, metricInfo{
namespace: keyValuePairs["Namespace"],
subsystem: keyValuePairs["Subsystem"],
name: keyValuePairs["Name"],
help: keyValuePairs["Help"],
})
}
}
return metrics
}

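// getFuncPackage returns the qualifier of a call target, e.g. "prometheus" for
// prometheus.NewGaugeVec. For an unqualified call it returns the function name itself.
func getFuncPackage(fun ast.Expr) string {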
if sel, ok := fun.(*ast.SelectorExpr); ok {
return fmt.Sprintf("%s", sel.X)
}
if ident, ok := fun.(*ast.Ident); ok {
return ident.String()
}
log.Fatalf("unsupported func expression %T, %v", fun, fun)
return ""
}
2 changes: 1 addition & 1 deletion pkg/cloudprovider/metrics/cloudprovider.go
@@ -40,7 +40,7 @@ var methodDurationHistogramVec = prometheus.NewHistogramVec(
Namespace: metrics.Namespace,
Subsystem: "cloudprovider",
Name: "duration_seconds",
-Help: "Duration of cloud provider method calls.",
+Help: "Duration of cloud provider method calls. Labeled by the controller, method name and provider.",
},
[]string{
metricLabelController,
12 changes: 6 additions & 6 deletions pkg/controllers/metrics/node/controller.go
@@ -56,7 +56,7 @@ var (
Namespace: "karpenter",
Subsystem: "nodes",
Name: "allocatable",
-Help: "Node allocatable",
+Help: "Node allocatable are the resources allocatable by nodes. Labeled by provisioner name, node name, zone, architecture, capacity type, instance type, node phase and resource type.",
},
labelNames(),
)
@@ -65,7 +65,7 @@ var (
Namespace: "karpenter",
Subsystem: "nodes",
Name: "total_pod_requests",
-Help: "Node total pod requests",
+Help: "Node total pod requests are the resources requested by non-DaemonSet pods bound to nodes. Labeled by provisioner name, node name, zone, architecture, capacity type, instance type, node phase and resource type.",
},
labelNames(),
)
@@ -74,7 +74,7 @@ var (
Namespace: "karpenter",
Subsystem: "nodes",
Name: "total_pod_limits",
-Help: "Node total pod limits",
+Help: "Node total pod limits are the resources specified by non-DaemonSet pod limits. Labeled by provisioner name, node name, zone, architecture, capacity type, instance type, node phase and resource type.",
},
labelNames(),
)
@@ -83,7 +83,7 @@ var (
Namespace: "karpenter",
Subsystem: "nodes",
Name: "total_daemon_requests",
-Help: "Node total daemon requests",
+Help: "Node total daemon requests are the resources requested by DaemonSet pods bound to nodes. Labeled by provisioner name, node name, zone, architecture, capacity type, instance type, node phase and resource type.",
},
labelNames(),
)
@@ -92,7 +92,7 @@ var (
Namespace: "karpenter",
Subsystem: "nodes",
Name: "total_daemon_limits",
-Help: "Node total daemon limits",
+Help: "Node total daemon limits are the resources specified by DaemonSet pod limits. Labeled by provisioner name, node name, zone, architecture, capacity type, instance type, node phase and resource type.",
},
labelNames(),
)
@@ -101,7 +101,7 @@ var (
Namespace: "karpenter",
Subsystem: "nodes",
Name: "system_overhead",
-Help: "Node system daemon overhead",
+Help: "Node system daemon overhead are the resources reserved for system overhead: the difference between the node's capacity and allocatable values as reported by the status. Labeled by provisioner name, node name, zone, architecture, capacity type, instance type, node phase and resource type.",
},
labelNames(),
)
2 changes: 1 addition & 1 deletion pkg/controllers/metrics/pod/controller.go
@@ -54,7 +54,7 @@ var (
Namespace: "karpenter",
Subsystem: "pods",
Name: "state",
-Help: "Pod state.",
+Help: "Pod state is the current state of pods. This metric can be used several ways as it is labeled by the pod name, namespace, owner, node, provisioner name, zone, architecture, capacity type, instance type and pod phase.",
},
labelNames(),
)