fixes for policystatemetrics #2285

Merged
merged 3 commits into from
Apr 2, 2024

kkourt
Contributor

@kkourt kkourt commented Apr 2, 2024

No description provided.

policystatemetrics needs a reference to the sensor manager so that it
can collect metrics. Currently, this reference is passed using
observer.GetSensorManager() at initialization time.

In observer tests, we currently do not restart the metrics (see [1]),
which means that if we create a new observer, the metrics will
still reference the old sensor manager.

Fix this by having policystatemetrics call
observer.GetSensorManager() at collection time, so that it always uses
the latest sensor manager.

[1] https://github.com/cilium/tetragon/blob/22eb995b19207ac0ced2dd83950ec8e8aedd122d/pkg/observer/observertesthelper/observer_test_helper.go#L272-L276
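
The difference between capturing the sensor manager at initialization and looking it up at collection time can be sketched as follows. This is a minimal illustration, not the Tetragon code: `Manager`, `staleCollector`, `lazyCollector`, and the bodies of `SetSensorManager`/`GetSensorManager` are hypothetical stand-ins that only borrow the accessor names mentioned above.

```go
package main

import "sync"

// Manager is a hypothetical stand-in for the sensor manager.
type Manager struct{ name string }

var (
	mu      sync.Mutex
	current *Manager
)

// SetSensorManager and GetSensorManager mimic package-level accessors;
// their bodies here are assumptions made for the sake of the example.
func SetSensorManager(m *Manager) {
	mu.Lock()
	defer mu.Unlock()
	current = m
}

func GetSensorManager() *Manager {
	mu.Lock()
	defer mu.Unlock()
	return current
}

// staleCollector captures the manager once, when it is constructed.
// If the manager is replaced later, it keeps using the old one.
type staleCollector struct{ sm *Manager }

func newStaleCollector() *staleCollector {
	return &staleCollector{sm: GetSensorManager()}
}

func (c *staleCollector) managerName() string { return c.sm.name }

// lazyCollector holds no reference and looks the manager up on every
// collection, so it always sees the most recently installed manager.
type lazyCollector struct{}

func (lazyCollector) managerName() string {
	sm := GetSensorManager()
	if sm == nil {
		return "" // no manager installed yet
	}
	return sm.name
}
```

If a test installs a manager, builds both collectors, and then installs a second manager, `staleCollector` still reports the first one while `lazyCollector` reports the second.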

Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
We should also do the same in the other operations, but we leave that as
a followup.

Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>

@kkourt kkourt added the release-note/misc label Apr 2, 2024

This patch adds a timeout for ListTracingPolicies. The sensor manager
may be stuck or misbehaving. This patch (combined with the previous
one) ensures that metrics collection continues after a timeout.

Tested manually using:

```diff
diff --git a/pkg/metrics/policystatemetrics/policystatemetrics_test.go b/pkg/metrics/policystatemetrics/policystatemetrics_test.go
index 227306b65..fd581392b 100644
--- a/pkg/metrics/policystatemetrics/policystatemetrics_test.go
+++ b/pkg/metrics/policystatemetrics/policystatemetrics_test.go
@@ -9,6 +9,7 @@ import (
 	"io"
 	"strings"
 	"testing"
+	"time"

 	"github.com/cilium/tetragon/pkg/observer"
 	tus "github.com/cilium/tetragon/pkg/testutils/sensors"
@@ -57,3 +58,22 @@ tetragon_tracingpolicy_loaded{state="load_error"} %d
 	err = testutil.CollectAndCompare(collector, expectedMetrics(1, 0, 0, 0))
 	assert.NoError(t, err)
 }
+
+func TestTimeout(t *testing.T) {
+	reg := prometheus.NewRegistry()
+
+	manager := tus.GetTestSensorManager(context.TODO(), t).Manager
+	observer.SetSensorManager(manager)
+	t.Cleanup(observer.ResetSensorManager)
+
+	collector := newPolicyStateCollector()
+	reg.Register(collector)
+
+	go func() {
+		err := manager.SleepForTesting(context.TODO(), t, 1*time.Second)
+		assert.NoError(t, err)
+	}()
+
+	err := testutil.CollectAndCompare(collector, strings.NewReader(""))
+	assert.NoError(t, err)
+}
diff --git a/pkg/sensors/manager.go b/pkg/sensors/manager.go
index eaf908340..291a58c8f 100644
--- a/pkg/sensors/manager.go
+++ b/pkg/sensors/manager.go
@@ -8,6 +8,8 @@ import (
 	"errors"
 	"fmt"
 	"strings"
+	"testing"
+	"time"

 	"github.com/cilium/tetragon/api/v1/tetragon"
 	"github.com/cilium/tetragon/pkg/k8s/apis/cilium.io/v1alpha1"
@@ -96,6 +98,13 @@ func startSensorManager(
 				logger.GetLogger().Debugf("stopping sensor controller...")
 				done = true
 				err = nil
+
+			// NB(kkourt): for testing
+			case *sensorManagerSleep:
+				time.Sleep(op.d)
+				err = nil
+
 			default:
 				err = fmt.Errorf("unknown sensorOp: %v", op)
 			}
@@ -421,6 +430,13 @@ type sensorCtlStop struct {
 	retChan chan error
 }

+// sensorManagerSleep just sleeps. Intended only for testing.
+type sensorManagerSleep struct {
+	ctx     context.Context
+	retChan chan error
+	d       time.Duration
+}
+
 type LoadArg struct{}
 type UnloadArg = LoadArg

@@ -436,5 +452,18 @@ func (s *sensorEnable) sensorOpDone(e error)         { s.retChan <- e }
 func (s *sensorDisable) sensorOpDone(e error)        { s.retChan <- e }
 func (s *sensorList) sensorOpDone(e error)           { s.retChan <- e }
 func (s *sensorCtlStop) sensorOpDone(e error)        { s.retChan <- e }
+func (s *sensorManagerSleep) sensorOpDone(e error)   { s.retChan <- e }

 type sensorCtlHandle = chan<- sensorOp
+
+func (h *Manager) SleepForTesting(ctx context.Context, t *testing.T, d time.Duration) error {
+	retc := make(chan error)
+	op := &sensorManagerSleep{
+		ctx:     ctx,
+		retChan: retc,
+		d:       d,
+	}
+
+	h.sensorCtl <- op
+	return <-retc
+}
```

Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
@kkourt kkourt force-pushed the pr/kkourt/testing-metrix-fix branch from 5dc1065 to 5ce0f71 Compare April 2, 2024 16:13
@mtardy
Member

mtardy commented Apr 2, 2024

I have #2284, which reliably hangs because this test fails on arm64: https://github.com/cilium/tetragon/actions/runs/8523996161/job/23347576591?pr=2284. I could try rebasing on top of this to see if it fixes my issue!

Let's see if this works for my issue #2286.

Member

@mtardy mtardy left a comment

That's amazing, it actually fixes #2210. Let's merge this!

@mtardy mtardy linked an issue Apr 2, 2024 that may be closed by this pull request
@kkourt kkourt marked this pull request as ready for review April 2, 2024 17:42
@kkourt kkourt requested a review from a team as a code owner April 2, 2024 17:42
@kkourt kkourt requested a review from olsajiri April 2, 2024 17:42
@kkourt kkourt merged commit 911fe6d into main Apr 2, 2024
35 checks passed
@kkourt kkourt deleted the pr/kkourt/testing-metrix-fix branch April 2, 2024 17:42
Labels
release-note/misc This PR makes changes that have no direct user impact.
Development

Successfully merging this pull request may close these issues.

pkg/sensors/tracing tests are timing out on arm64
2 participants