Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

custom calls: add new metrics to count skipped tail calls to custom programs #15475

Merged
merged 2 commits into from
Mar 29, 2021

Conversation

qmonnet
Copy link
Member

@qmonnet qmonnet commented Mar 26, 2021

Add a new metrics to count the number of skipped tail calls to custom programs in the datapath. This metrics provides an indicator of whether or not custom programs are effectively attached to the dedicated hooks.

However, note that when tail calls to custom programs are enabled, all endpoints attempt to use them, so the metrics will keep incrementing unless all hooks have a program attached (or if tail calls to custom programs are disabled).

Note that the CI test relies on the CLI command cilium cleanup to reset the values for the metrics. Given that the command must be run when the agent is not running, we cannot call it from the Cilium pods. Instead, we build it and call it from the pod we use to compile the custom program. Building the CLI takes time (maybe ~20 seconds on local runs), so this is not ideal. I plan to package the byte counter example as a dedicated image, and could include the CLI with it so it is part of the image, which would solve the problem. Before that, I wonder if there's an alternative solution to reset the metrics. Maybe just edit the eBPF map by running bpftool map delete pinned /sys/fs/bpf/tc/globals/cilium_metrics/key <...> directly on the node? But this looks less clean, maybe?

@qmonnet qmonnet added sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. release-note/minor This PR changes functionality that users may find relevant to operating Cilium. area/metrics Impacts statistics / metrics gathering, eg via Prometheus. labels Mar 26, 2021
@qmonnet qmonnet requested a review from pchaigno March 26, 2021 01:01
@qmonnet qmonnet requested a review from a team as a code owner March 26, 2021 01:01
@qmonnet qmonnet requested a review from a team March 26, 2021 01:01
@qmonnet qmonnet requested a review from a team as a code owner March 26, 2021 01:01
@maintainer-s-little-helper maintainer-s-little-helper bot added this to In progress in 1.10.0 Mar 26, 2021
@qmonnet qmonnet force-pushed the pr/custcall_failed_metrics branch 2 times, most recently from 1195894 to 6ec6b56 Compare March 26, 2021 10:09
@qmonnet qmonnet requested a review from kaworu March 26, 2021 10:10
@qmonnet qmonnet changed the title custom calls: add new metrics to count skipped tail calls to custom programs custom calls: add new metrics to count skipped tail calls to custom programs (and clean up code) Mar 26, 2021
Copy link
Member

@pchaigno pchaigno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only a question on the last commit. Other commits look fine to me.

test/k8sT/CustomCalls.go Outdated Show resolved Hide resolved
@qmonnet qmonnet removed the request for review from kaworu March 26, 2021 15:36
@qmonnet qmonnet changed the title custom calls: add new metrics to count skipped tail calls to custom programs (and clean up code) custom calls: add new metrics to count skipped tail calls to custom programs Mar 26, 2021
@qmonnet qmonnet marked this pull request as draft March 26, 2021 16:02
Add a new metrics to count the number of skipped tail calls to custom
programs in the datapath. This metrics provides an indicator of whether
or not custom programs are effectively attached to the dedicated hooks.

However, note that when tail calls to custom programs are enabled, all
endpoints attempt to use them, so the metrics will keep incrementing
unless all hooks have a program attached (or if tail calls to custom
programs are disabled).

Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Update the end-to-end tests for tail calls to custom programs to
validate that the new metrics is incremented when tail calls to custom
programs are enabled, but with no program attached.

Signed-off-by: Quentin Monnet <quentin@isovalent.com>
@qmonnet
Copy link
Member Author

qmonnet commented Mar 29, 2021

Latest update:

  • Removed first commits (moved to custom calls: cleanup and improve a few elements #15480)
  • Addressed Paul's suggestion: merge new test into existing tests for tail calls to custom programs. Cleaner, runs for per-endpoint routes as well, although it not longer checks that the metrics remains at zero when the hooks are disabled.
Incremental diff
diff --git a/test/k8sT/CustomCalls.go b/test/k8sT/CustomCalls.go
index a1e201452c75..2b35f2ba9830 100644
--- a/test/k8sT/CustomCalls.go
+++ b/test/k8sT/CustomCalls.go
@@ -241,6 +241,26 @@ var _ = Describe("K8sCustomCalls", func() {
 				fmt.Sprintf("Byte count (%d) differs from expected value (%d)", count, expectedCount))
 		}
 
+		getMissedCustomCallsCount := func(ciliumPod string,
+			direction string) int {
+
+			cmd := fmt.Sprintf("cilium bpf metrics list -o jsonpath='{$[?(@.reason==11)].values.%s.packets}'", direction)
+			res := kubectl.ExecPodCmd(helpers.KubeSystemNamespace, ciliumPod, cmd)
+			res.ExpectSuccess("Failed to lookup metrics for missed tail calls to custom programs")
+
+			// If the metrics is missing from the output, consider
+			// it is a zero value
+			output := strings.TrimSpace(res.Stdout())
+			if output == "" {
+				return 0
+			}
+
+			count, err := strconv.Atoi(output)
+			ExpectWithOffset(2, err).ToNot(HaveOccurred(),
+				fmt.Sprintf("Failed to convert metrics value: %s", err))
+			return count
+		}
+
 		cleanupByteCounter := func(endpointId int64, ciliumPod string,
 			serverIdentity string, direction customCallDirection) {
 
@@ -286,6 +306,11 @@ var _ = Describe("K8sCustomCalls", func() {
 			expectedCountIngress, expectedCountEgress uint,
 			runEgress bool) {
 
+			var metrics = map[string]int{
+				"ingress": 0,
+				"egress":  0,
+			}
+
 			// Deploy Cilium, enable tail calls to custom programs
 			deploymentManager.DeployCilium(ciliumOptions, DeployCiliumOptionsAndDNS)
 
@@ -310,6 +335,12 @@ var _ = Describe("K8sCustomCalls", func() {
 			// the byte-counter hash map.
 			identityKey := getIdentityKey("k8s:id=app1", ciliumPodK8s1)
 
+			// Collect initial value for metrics on skipped tail
+			// calls to custom programs
+			for direction := range metrics {
+				metrics[direction] = getMissedCustomCallsCount(ciliumPodK8s2, direction)
+			}
+
 			err = kubectl.WaitforPods(helpers.DefaultNamespace, "-l zgroup=testapp", helpers.HelperTimeout)
 			ExpectWithOffset(1, err).Should(BeNil())
 
@@ -330,6 +361,20 @@ var _ = Describe("K8sCustomCalls", func() {
 				podApp1.Status.PodIP, identityKey, EgressIPv4,
 				expectedCountEgress)
 			cleanupByteCounter(endpointId, ciliumPodK8s2, identityKey, EgressIPv4)
+
+			By("Making sure metrics for skipped calls to custom programs are incremented")
+
+			// We expect the value to have raised for both
+			// directions, even if we have a program attached. This
+			// is because the metrics is common to all tail calls
+			// to custom programs, for all endpoints (the only
+			// distinction is ingress/egress), and other endpoints
+			// in our network do not have custom programs attached.
+			for direction, current := range metrics {
+				metrics[direction] = getMissedCustomCallsCount(ciliumPodK8s2, direction) - current
+				ExpectWithOffset(1, metrics[direction]).To(BeNumerically(">", 0),
+					fmt.Sprintf("Value not incremented (delta: %d) for %s metrics for skipped calls", metrics[direction], direction))
+			}
 		}
 
 		It("Loads byte-counter and gets consistent values", func() {
@@ -355,83 +400,5 @@ var _ = Describe("K8sCustomCalls", func() {
 
 			checkByteCounter(options, expectedByteCount, 0, false)
 		})
-
-		getMissedCustomCallsCount := func(ciliumPod string,
-			direction string) int {
-
-			cmd := fmt.Sprintf("cilium bpf metrics list -o jsonpath='{$[?(@.reason==11)].values.%s.packets}'", direction)
-			res := kubectl.ExecPodCmd(helpers.KubeSystemNamespace, ciliumPod, cmd)
-			res.ExpectSuccess("Failed to lookup metrics for missed tail calls to custom programs")
-
-			// If the metrics is missing from the output, consider
-			// it is a zero value
-			output := strings.TrimSpace(res.Stdout())
-			if output == "" {
-				return 0
-			}
-
-			count, err := strconv.Atoi(output)
-			ExpectWithOffset(2, err).ToNot(HaveOccurred(),
-				fmt.Sprintf("Failed to convert metrics value: %s", err))
-			return count
-		}
-
-		checkMetrics := func(enableCustomCalls bool) {
-			var customCallsOption string
-
-			if enableCustomCalls {
-				customCallsOption = "true"
-			} else {
-				customCallsOption = "false"
-			}
-
-			deploymentManager.DeployCilium(map[string]string{
-				"customCalls.enabled": customCallsOption,
-			}, DeployCiliumOptionsAndDNS)
-
-			ciliumPodK8s2, err := kubectl.GetCiliumPodOnNode(helpers.K8s2)
-			Expect(err).ShouldNot(HaveOccurred(), "Cannot get cilium pod on k8s2")
-
-			getPodsInfo()
-
-			err = kubectl.WaitforPods(helpers.DefaultNamespace, "-l zgroup=testapp", helpers.HelperTimeout)
-			ExpectWithOffset(1, err).Should(BeNil())
-
-			// Send traffic between pods
-			cmd := helpers.PingWithCount(podApp1.Status.PodIP, 1)
-			res := kubectl.ExecPodCmd(helpers.DefaultNamespace, podApp2.Name, cmd)
-			res.ExpectSuccess(fmt.Sprintf("Failed to ping from %s to %s", podApp2.Name, podApp1.Status.PodIP))
-
-			// Check metrics
-			for _, direction := range []string{"ingress", "egress"} {
-				missedCalls := getMissedCustomCallsCount(ciliumPodK8s2, direction)
-				if enableCustomCalls {
-					ExpectWithOffset(1, missedCalls).To(BeNumerically(">", 0),
-						fmt.Sprintf("Zero value for metrics (%s) when tail calls to custom programs are enabled", direction))
-				} else {
-					ExpectWithOffset(1, missedCalls).To(Equal(0),
-						fmt.Sprintf("Non-zero value for metrics (%s) when tail calls to custom programs are disabled", direction))
-				}
-			}
-		}
-
-		It("Increments dedicated metrics on missed tail calls", func() {
-
-			By("Resetting metrics")
-
-			cmd := "./cilium/cilium cleanup --force"
-			res := kubectl.ExecPodCmd(helpers.DefaultNamespace, compilerPodName, cmd)
-			res.ExpectSuccess("Failed to clean Cilium state (and metrics)")
-
-			By("Checking metrics are at 0 when tail calls to custom programs are disabled")
-
-			checkMetrics(false)
-
-			deploymentManager.DeleteAll()
-
-			By("Checking metrics are non-nul when tail calls to custom programs are enabled")
-
-			checkMetrics(true)
-		})
 	})
 })

@qmonnet qmonnet marked this pull request as ready for review March 29, 2021 13:29
@qmonnet qmonnet requested a review from pchaigno March 29, 2021 13:29
@pchaigno
Copy link
Member

test-me-please

bpf/bpf_lxc.c Show resolved Hide resolved
bpf/bpf_lxc.c Show resolved Hide resolved
@pchaigno pchaigno removed their assignment Mar 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/metrics Impacts statistics / metrics gathering, eg via Prometheus. release-note/minor This PR changes functionality that users may find relevant to operating Cilium. sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

None yet

4 participants