Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce flakiness of controlplane tests #30906

Merged

Commits on Feb 22, 2024

  1. controlplane: remove obsolete CNPNodeStatusGC test

    The control plane test for CNPNodeStatusGC was initially intended to
    check the behaviour of running with and without GC. With the deprecation
    of the CNP node status, the GC now amounts to unconditional cleanup.
    
    The commit mentioned below had broken the test insofar as it remove the
    differentiating factor: the configuration. In other words, we were
    running the same code twice, expecting a different outcome.
    
    Ideally, this would have broken the first variant - with GC "disabled",
    but unfortunately the test was racy in itself. I suspect that with
    progressing modularisation of the agent and operator, somewhere along
    the line we lost the guarantee that the GC happens before StartOperator
    returns. Hence we checked that the CNPs were unchanged while the
    operator was starting up concurrently - a classic race condition.
    
    Since the whole test is obsolete, simply remove it, and leave the second
    variant of the test around to check that we actually perform the
    deletion.
    
    Fixes: c15f8e4 (Remove skip-cnp-status-startup-clean)
    
    Co-authored-by: Fabian Fischer <fabian.fischer@isovalent.com>
    Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
    bimmlerd and glrf committed Feb 22, 2024
    Configuration menu
    Copy the full SHA
    353bdbb View commit details
    Browse the repository at this point in the history
  2. controlplane: fix panic: send on closed channel

    Rarely, the control plane test panics, due to a send on a closed
    channel. This can occur in a narrow race window in the filteringWatcher:
    
    1. Stop is called on the child watcher
    2. Child watcher calls stop on parent watcher
    3. Concurrently, an event is dequeued from the parent result chan, and
       we enter the filtering logic.
    4. The parent result chan is closed, and we close the child event channel
    5. The filter is matched, and we attempt to write on the closed channel,
       which causes the panic.
    
    Instead of closing the channel in the Stop method, close the channel
    from the writing goroutine (as is commonly considered best practice in
    Go.)
    
    Fixes: fa89802 (controlplane: Implement filtering of objects with field selectors)
    
    Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
    bimmlerd committed Feb 22, 2024
    Configuration menu
    Copy the full SHA
    966cd5f View commit details
    Browse the repository at this point in the history
  3. controlplane: add mechanism to wait for watchers

    We've recently learned that the fake k8s client set's object tracker do
    not respect the semantics of the real api-server when it comes to
    'Watch': since the object tracker does not care for ResourceVersions, it
    cannot respect the version from which it ought to replay events. As a
    result, the default informer (more precisely, its reflector) is racy: it
    uses a ListAndWatch approach, which relies on this resource version to
    avoid a race window between the end of list and the beginning of watch.
    Therefore, all informers used in cilium have a low chance of hitting
    this race when used with a k8s fake object tracker.
    
    This is somewhat known in the k8s community, see for example [1].
    However, the upstream response is that one simply shouldn't be using the
    fake infrastructure to test real informers. Unfortunately, this pattern
    is used somewhat pervasively inside the cilium tests, specifically so in
    the controlplane tests.
    
    This patch introduces a mechanism which reduces the likelihood of
    hitting the flake, under the assumption that we do not (often) establish
    multiple watchers for the same resource. In the following patch, we'll
    use the new infrastructure to reduce the flakiness of tests.
    
    [1]: kubernetes/kubernetes#95372
    
    Co-authored-by: Fabian Fischer <fabian.fischer@isovalent.com>
    Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
    bimmlerd and glrf committed Feb 22, 2024
    Configuration menu
    Copy the full SHA
    1f076f6 View commit details
    Browse the repository at this point in the history
  4. controlplane: wait for watcher establishment

    Use the infrastructure introduced in the previous commit to deflake
    control plane tests which update k8s state after starting the agent.
    
    Co-authored-by: Fabian Fischer <fabian.fischer@isovalent.com>
    Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
    bimmlerd and glrf committed Feb 22, 2024
    Configuration menu
    Copy the full SHA
    6839dc4 View commit details
    Browse the repository at this point in the history