TestLabelsDemoApp failures (flake?) #1954

Open
kkourt opened this issue Jan 10, 2024 · 6 comments
Assignees
Labels
area/ci (Related to CI), kind/ci-flake (A flake in CI)

Comments

@kkourt
Contributor

kkourt commented Jan 10, 2024

Hit a TestLabelsDemoApp failure (https://github.com/cilium/tetragon/actions/runs/7472506683/job/20334842561?pr=1948) in #1948. Seems like a flake.

Details:

 I0110 09:10:55.961095   14366 dumpinfo.go:240] contacting metrics serveraddrhttp://localhost:2112/metrics
--- FAIL: TestLabelsDemoApp (30.00s)
    --- FAIL: TestLabelsDemoApp/Run_Event_Checks (10.01s)
        --- FAIL: TestLabelsDemoApp/Run_Event_Checks/Run_Event_Checks (10.01s)
            rpcchecker.go:171: 
                	Error Trace:	/home/runner/work/tetragon/tetragon/go/src/github.com/cilium/tetragon/tests/e2e/checker/rpcchecker.go:171
                	            				/home/runner/work/tetragon/tetragon/go/src/github.com/cilium/tetragon/vendor/sigs.k8s.io/e2e-framework/pkg/env/env.go:422
                	            				/home/runner/work/tetragon/tetragon/go/src/github.com/cilium/tetragon/vendor/sigs.k8s.io/e2e-framework/pkg/env/env.go:453
                	Error:      	Received unexpected error:
                	            	failed to get events after 10 tries
                	Test:       	TestLabelsDemoApp/Run_Event_Checks/Run_Event_Checks
                	Messages:   	checks should pass
    --- FAIL: TestLabelsDemoApp/Run_Workload (30.00s)
        --- FAIL: TestLabelsDemoApp/Run_Workload/Wait_for_Checker (30.00s)
            rpcchecker.go:107: 
                	Error Trace:	/home/runner/work/tetragon/tetragon/go/src/github.com/cilium/tetragon/tests/e2e/checker/rpcchecker.go:107
                	            				/home/runner/work/tetragon/tetragon/go/src/github.com/cilium/tetragon/vendor/sigs.k8s.io/e2e-framework/pkg/env/env.go:422
                	            				/home/runner/work/tetragon/tetragon/go/src/github.com/cilium/tetragon/vendor/sigs.k8s.io/e2e-framework/pkg/env/env.go:453
                	Error:      	failed to wait for checker labelsEventChecker to start after 30s
                	Test:       	TestLabelsDemoApp/Run_Workload/Wait_for_Checker
I0110 09:10:55.961222   14366 dumpinfo.go:299] contacting gops agentaddr127.0.0.1:8118
FAIL
E0110 09:10:55.961342   14366 dumpinfo.go:303] "failed to dump heap profile" err="failed to dump heap profile: dial tcp 127.0.0.1:8118: connect: connection refused" addr="127.0.0.1:8118"
coverage: [no statements]
I0110 09:10:55.961425   14366 dumpinfo.go:48] "Dumping test data" dir="/tmp/tetragon.e2e.TestLabelsDemoApp.2402988556"
I0110 09:10:55.961435   14366 dumpinfo.go:233] No checker info to dump
E0110 09:10:55.961716   14366 dumpinfo.go:244] "failed to contact metrics server" err="Get \"http://localhost:2112/metrics\": dial tcp [::1]:2112: connect: connection refused" addr="http://localhost:2112/metrics"
E0110 09:10:56.101977   14366 dumpinfo.go:71] "Failed to extract previous tetragon logs" err="failed to run kubectl logs -c tetragon -n kube-system tetragon-4x65q --previous: exit status 1"
I0110 09:10:56.892134   14366 cluster.go:165] Deleting temporary kind cluster tetragon-ci-5c01
I0110 09:10:56.892181   14366 kind.go:149] Destroying kind cluster tetragon-ci-5c01
I0110 09:10:58.147558   14366 kind.go:159] Removing kubeconfig file /tmp/kind-cluser-tetragon-ci-5c01-kubecfg1496165204
I0110 09:10:58.147660   14366 portforward.go:142] "Test ended, stopping portforward" pod="tetragon-4x65q" namespace="kube-system" ports=["54321:54321","2112:2112","8118:8118"]
FAIL	github.com/cilium/tetragon/tests/e2e/tests/labels	118.314s
ok  	github.com/cilium/tetragon/tests/e2e/tests/policyfilter	127.217s	coverage: [no statements]
ok  	github.com/cilium/tetragon/tests/e2e/tests/skeleton	379.648s	coverage: [no statements]
FAIL
make: *** [Makefile:251: e2e-test] Error 1
Error: Process completed with exit code 2.
@kkourt added the area/ci (Related to CI) and kind/ci-flake (A flake in CI) labels Jan 10, 2024
@mtardy
Member

mtardy commented Jan 10, 2024

Another typical timeout example. I think one solution would be to move away from these deployments, which are flaky "by nature": they somehow fail to deploy on time even in an environment with enough resources. We have talked in the past about moving away from them and maybe using https://github.com/GoogleCloudPlatform/microservices-demo, especially since Tetragon is now independent of Cilium for those tests.

@willfindlay
Contributor

Let's make a good first issue to do the migration to the microservices demo. I think it makes a lot of sense.

@mtardy
Member

mtardy commented Jan 15, 2024

> Let's make a good first issue to do the migration to the microservices demo. I think it makes a lot of sense.

See #1976.

@jrfastab self-assigned this Feb 8, 2024
@jrfastab added the release-blocker (This PR or issue is blocking the next release.) label Feb 8, 2024
@lambdanis removed the release-blocker (This PR or issue is blocking the next release.) label Apr 26, 2024
@lambdanis
Contributor

lambdanis commented May 7, 2024

This might be fixed by #2345. Let's keep an eye on the Tetragon e2e tests for a couple of weeks; if they are stable, we can close the issue.

UPDATE: It seems the test is still flaky after switching to the otel-demo app. It failed in #2417: https://github.com/cilium/tetragon/actions/runs/8966724879/attempts/1

@Trung-DV
Contributor

Trung-DV commented May 8, 2024

Hi @lambdanis
https://github.com/cilium/tetragon/actions/runs/8966724879/job/24623050943#step:6:9683

time="2024-05-06T09:26:21Z" level=info msg="PROCESS_EXEC:894 => FINAL MATCH "
time="2024-05-06T09:26:21Z" level=info msg="DONE!"
--- FAIL: TestLabelsDemoApp (241.38s)
    --- FAIL: TestLabelsDemoApp/Run_Workload (118.15s)
        --- FAIL: TestLabelsDemoApp/Run_Workload/Run_Workload (118.10s)
            labels_test.go:53: failed to install demo app. run with `-args -v=4` for more context from helm: exit status 1
            labels_test.go:53: failed to install demo app. run with `-args -v=4` for more context from helm: exit status 1
            labels_test.go:53: failed to install demo app. run with `-args -v=4` for more context from helm: exit status 1
            labels_test.go:60: failed to install demo app after 3 tries
FAIL

The event checks seem to have succeeded (the log shows "FINAL MATCH" and "DONE!"), but the demo app failed to install. Maybe this is another flake?

Btw, I'm wondering why we have to install and check labels in parallel instead of installing the demo app successfully and then running the label checker test?

func TestLabelsDemoApp(t *testing.T) {
	// Must be called at the beginning of every test
	runner.SetupExport(t)
	labelsChecker := labelsEventChecker().WithEventLimit(5000).WithTimeLimit(5 * time.Minute)
	// This starts labelsChecker and uses it to run event checks.
	runEventChecker := features.New("Run Event Checks").
		Assess("Run Event Checks", labelsChecker.CheckInNamespace(1*time.Minute, namespace)).Feature()
	// This feature waits for labelsChecker to start then runs a custom workload.
	runWorkload := features.New("Run Workload").
		/* Wait up to 30 seconds for the event checker to start before continuing */
		Assess("Wait for Checker", labelsChecker.Wait(30*time.Second)).
		/* Run the workload */
		Assess("Run Workload", installDemoApp(labelsChecker)).
		Feature()
	uninstall := features.New("Uninstall Demo App").
		Assess("Uninstall", uninstallDemoApp()).Feature()
	// Spawn workload and run checker
	runner.TestInParallel(t, runEventChecker, runWorkload)
	runner.Test(t, uninstall)
}

@mtardy
Member

mtardy commented May 16, 2024

> Btw, I'm wondering why we have to install and check labels in parallel instead of installing the demo app successfully and then running the label checker test?

I'm not sure, indeed. The only reason I can think of is that it can potentially speed up the tests, because technically the checker can finish before all the deployments are ready. If changing it makes debugging easier, maybe we could consider that. Do you remember anything about this, @willfindlay?

Projects
None yet
Development

No branches or pull requests

6 participants