fixes for policystatemetrics #2285

Merged
merged 3 commits into from
Apr 2, 2024

Commits on Apr 2, 2024

  1. policystatemetrics: use observer.GetSensorManager()

    policystatemetrics needs a reference to the sensor manager so that it
    can collect metrics. Currently, this reference is passed using
    observer.GetSensorManager() at initialization time.
    
    In observer tests, we currently do not restart the metrics (see [1]),
    which means that if we create a new observer, the metrics will still
    reference the old sensor manager.
    
    Fix this by having policystatemetrics call
    observer.GetSensorManager() to get the latest version of the sensor
    manager.
    
    [1] https://github.com/cilium/tetragon/blob/22eb995b19207ac0ced2dd83950ec8e8aedd122d/pkg/observer/observertesthelper/observer_test_helper.go#L272-L276
    
    Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
    kkourt committed Apr 2, 2024
    e26b385
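
    The stale-reference problem described in this commit message can be
    sketched as follows. This is an illustration only, not the code from the
    commit: Manager, cachedCollector, lazyCollector, and simulateRestart are
    hypothetical names, and the package-level getter and setter merely stand
    in for observer.GetSensorManager() and observer.SetSensorManager().

    ```go
    package main

    import "fmt"

    // Manager is a hypothetical stand-in for the sensor manager.
    type Manager struct{ name string }

    // current stands in for the state behind observer.GetSensorManager().
    var current *Manager

    func SetSensorManager(m *Manager) { current = m }
    func GetSensorManager() *Manager  { return current }

    // cachedCollector captures the manager once, at construction time.
    type cachedCollector struct{ m *Manager }

    func (c cachedCollector) managerName() string { return c.m.name }

    // lazyCollector looks the manager up every time it is asked.
    type lazyCollector struct{}

    func (lazyCollector) managerName() string { return GetSensorManager().name }

    // simulateRestart builds both collectors, then swaps the manager the way a
    // test that creates a new observer would.
    func simulateRestart() (cached, lazy string) {
    	SetSensorManager(&Manager{name: "old"})
    	c := cachedCollector{m: GetSensorManager()}
    	l := lazyCollector{}
    	SetSensorManager(&Manager{name: "new"})
    	return c.managerName(), l.managerName()
    }

    func main() {
    	cached, lazy := simulateRestart()
    	fmt.Println("cached:", cached) // prints "cached: old"
    	fmt.Println("lazy:", lazy)     // prints "lazy: new"
    }
    ```

    The cached collector keeps pointing at the manager that existed when it
    was built, while the lazy one always sees the current manager.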
  2. sensors: respect ctx in ListTracingPolicies

    We should also do the same in the other operations, but we leave that as
    a followup.
    
    Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
    kkourt committed Apr 2, 2024
    33b4014
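
    One way a request to a channel-driven manager can respect its context is
    sketched below. This is a minimal sketch under assumptions, not the
    change made in this commit: listOp, manager, startManager, list, and
    timedOut are hypothetical names, and the only thing borrowed from the
    source is the general pattern of sending an operation over a control
    channel and waiting for a reply on a return channel.

    ```go
    package main

    import (
    	"context"
    	"errors"
    	"fmt"
    	"time"
    )

    // listOp is a hypothetical request handed to the manager goroutine.
    type listOp struct {
    	// Buffered (size 1) so the manager can always reply, even if the
    	// caller has already given up.
    	retChan chan error
    }

    // manager is a hypothetical stand-in for a sensor manager that serves
    // requests from a control channel one at a time.
    type manager struct {
    	ctl chan *listOp
    }

    // startManager starts a manager that takes `busy` to serve each request.
    func startManager(busy time.Duration) *manager {
    	m := &manager{ctl: make(chan *listOp)}
    	go func() {
    		for op := range m.ctl {
    			time.Sleep(busy)
    			op.retChan <- nil
    		}
    	}()
    	return m
    }

    // list gives up as soon as ctx is done, both while handing the request
    // over and while waiting for the reply.
    func (m *manager) list(ctx context.Context) error {
    	op := &listOp{retChan: make(chan error, 1)}
    	select {
    	case m.ctl <- op:
    	case <-ctx.Done():
    		return ctx.Err()
    	}
    	select {
    	case err := <-op.retChan:
    		return err
    	case <-ctx.Done():
    		return ctx.Err()
    	}
    }

    // timedOut reports whether list hit the deadline (arguments in milliseconds).
    func timedOut(busyMs, timeoutMs int) bool {
    	m := startManager(time.Duration(busyMs) * time.Millisecond)
    	ctx, cancel := context.WithTimeout(context.Background(),
    		time.Duration(timeoutMs)*time.Millisecond)
    	defer cancel()
    	return errors.Is(m.list(ctx), context.DeadlineExceeded)
    }

    func main() {
    	fmt.Println("slow manager timed out:", timedOut(200, 20))
    	fmt.Println("fast manager timed out:", timedOut(1, 500))
    }
    ```

    Buffering the reply channel is a design choice in this sketch: it lets
    the manager finish the abandoned request without blocking forever on a
    caller that has already returned.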
  3. policystatemetrics: timeout for ListTracingPolicies

    This patch adds a timeout for ListTracingPolicies. The sensor manager
    may be stuck or misbehaving. This patch (combined with the previous
    one) ensures that metrics collection continues after a timeout.
    
    Tested manually using:
    
    ```diff
    diff --git a/pkg/metrics/policystatemetrics/policystatemetrics_test.go b/pkg/metrics/policystatemetrics/policystatemetrics_test.go
    index 227306b65..fd581392b 100644
    --- a/pkg/metrics/policystatemetrics/policystatemetrics_test.go
    +++ b/pkg/metrics/policystatemetrics/policystatemetrics_test.go
    @@ -9,6 +9,7 @@ import (
     	"io"
     	"strings"
     	"testing"
    +	"time"
    
     	"github.com/cilium/tetragon/pkg/observer"
     	tus "github.com/cilium/tetragon/pkg/testutils/sensors"
    @@ -57,3 +58,22 @@ tetragon_tracingpolicy_loaded{state="load_error"} %d
     	err = testutil.CollectAndCompare(collector, expectedMetrics(1, 0, 0, 0))
     	assert.NoError(t, err)
     }
    +
    +func TestTimeout(t *testing.T) {
    +	reg := prometheus.NewRegistry()
    +
    +	manager := tus.GetTestSensorManager(context.TODO(), t).Manager
    +	observer.SetSensorManager(manager)
    +	t.Cleanup(observer.ResetSensorManager)
    +
    +	collector := newPolicyStateCollector()
    +	reg.Register(collector)
    +
    +	go func() {
    +		err := manager.SleepForTesting(context.TODO(), t, 1*time.Second)
    +		assert.NoError(t, err)
    +	}()
    +
    +	err := testutil.CollectAndCompare(collector, strings.NewReader(""))
    +	assert.NoError(t, err)
    +}
    diff --git a/pkg/sensors/manager.go b/pkg/sensors/manager.go
    index eaf908340..291a58c8f 100644
    --- a/pkg/sensors/manager.go
    +++ b/pkg/sensors/manager.go
    @@ -8,6 +8,8 @@ import (
     	"errors"
     	"fmt"
     	"strings"
    +	"testing"
    +	"time"
    
     	"github.com/cilium/tetragon/api/v1/tetragon"
     	"github.com/cilium/tetragon/pkg/k8s/apis/cilium.io/v1alpha1"
    @@ -96,6 +98,13 @@ func startSensorManager(
     				logger.GetLogger().Debugf("stopping sensor controller...")
     				done = true
     				err = nil
    +
    +			// NB(kkourt): for testing
    +			case *sensorManagerSleep:
    +				time.Sleep(op.d)
    +				err = nil
    +
     			default:
     				err = fmt.Errorf("unknown sensorOp: %v", op)
     			}
    @@ -421,6 +430,13 @@ type sensorCtlStop struct {
     	retChan chan error
     }
    
    +// sensorManagerSleep just sleeps. Intended only for testing.
    +type sensorManagerSleep struct {
    +	ctx     context.Context
    +	retChan chan error
    +	d       time.Duration
    +}
    +
     type LoadArg struct{}
     type UnloadArg = LoadArg
    
    @@ -436,5 +452,18 @@ func (s *sensorEnable) sensorOpDone(e error)         { s.retChan <- e }
     func (s *sensorDisable) sensorOpDone(e error)        { s.retChan <- e }
     func (s *sensorList) sensorOpDone(e error)           { s.retChan <- e }
     func (s *sensorCtlStop) sensorOpDone(e error)        { s.retChan <- e }
    +func (s *sensorManagerSleep) sensorOpDone(e error)   { s.retChan <- e }
    
     type sensorCtlHandle = chan<- sensorOp
    +
    +func (h *Manager) SleepForTesting(ctx context.Context, t *testing.T, d time.Duration) error {
    +	retc := make(chan error)
    +	op := &sensorManagerSleep{
    +		ctx:     ctx,
    +		retChan: retc,
    +		d:       d,
    +	}
    +
    +	h.sensorCtl <- op
    +	return <-retc
    +}
    ```
    
    Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
    kkourt committed Apr 2, 2024
    5ce0f71
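
    The caller side of such a timeout might look like the sketch below. It is
    an assumption-laden illustration, not the code from this commit:
    listFunc, collectStates, stuckFor, and scrape are hypothetical names, the
    state strings are illustrative, and skipping the scrape on timeout is one
    possible policy (it matches the manual test above, which compares the
    collected output against an empty reader).

    ```go
    package main

    import (
    	"context"
    	"fmt"
    	"time"
    )

    // listFunc is a hypothetical stand-in for a ctx-aware list call that
    // returns one state string per policy.
    type listFunc func(ctx context.Context) ([]string, error)

    // collectStates bounds a single list call with a timeout. ok is false
    // when the call failed or timed out, in which case the scrape is skipped.
    func collectStates(list listFunc, timeout time.Duration) (counts map[string]int, ok bool) {
    	ctx, cancel := context.WithTimeout(context.Background(), timeout)
    	defer cancel()

    	states, err := list(ctx)
    	if err != nil {
    		return nil, false
    	}
    	counts = make(map[string]int)
    	for _, s := range states {
    		counts[s]++
    	}
    	return counts, true
    }

    // stuckFor returns a listFunc that answers only after d, unless ctx
    // ends first.
    func stuckFor(d time.Duration) listFunc {
    	return func(ctx context.Context) ([]string, error) {
    		select {
    		case <-time.After(d):
    			return []string{"enabled", "enabled", "load_error"}, nil
    		case <-ctx.Done():
    			return nil, ctx.Err()
    		}
    	}
    }

    func ms(n int) time.Duration { return time.Duration(n) * time.Millisecond }

    // scrape runs one collection and returns the "enabled" count and whether
    // the scrape produced data (arguments in milliseconds).
    func scrape(delayMs, timeoutMs int) (int, bool) {
    	counts, ok := collectStates(stuckFor(ms(delayMs)), ms(timeoutMs))
    	return counts["enabled"], ok
    }

    func main() {
    	n, ok := scrape(200, 20)
    	fmt.Println("stuck manager:", n, ok)
    	n, ok = scrape(1, 500)
    	fmt.Println("healthy manager:", n, ok)
    }
    ```

    Because each scrape creates its own context, a timeout in one scrape does
    not affect the next one.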