Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flaky TestPeriodicHourly and TestPeriodicMinutes #17054

Closed
ahrtr opened this issue Dec 3, 2023 · 16 comments · Fixed by #18608
Closed

Flaky TestPeriodicHourly and TestPeriodicMinutes #17054

ahrtr opened this issue Dec 3, 2023 · 16 comments · Fixed by #18608

Comments

@ahrtr
Copy link
Member

ahrtr commented Dec 3, 2023

Which github workflows are flaking?

https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/etcd-io_etcd/16822/pull-etcd-unit-test/1731327844841164800

{Failed  === RUN   TestPeriodicHourly
    logger.go:130: 2023-12-03T15:04:01.795Z	INFO	starting auto periodic compaction	{"revision": 1, "compact-period": "2h0m0s"}
    logger.go:130: 2023-12-03T15:04:01.806Z	INFO	completed auto periodic compaction	{"revision": 1, "compact-period": "2h0m0s", "took": "6m0s"}
    logger.go:130: 2023-12-03T15:04:01.975Z	INFO	starting auto periodic compaction	{"revision": 9, "compact-period": "2h0m0s"}
    logger.go:130: 2023-12-03T15:04:01.975Z	INFO	completed auto periodic compaction	{"revision": 9, "compact-period": "2h0m0s", "took": "0s"}
    periodic_test.go:77: compact request = revision:9 , want revision:10 
    logger.go:130: 2023-12-03T15:04:02.154Z	INFO	starting auto periodic compaction	{"revision": 19, "compact-period": "2h0m0s"}
    logger.go:130: 2023-12-03T15:04:02.154Z	INFO	completed auto periodic compaction	{"revision": 19, "compact-period": "2h0m0s", "took": "0s"}
    periodic_test.go:77: compact request = revision:19 , want revision:20 
    logger.go:130: 2023-12-03T15:04:02.375Z	INFO	starting auto periodic compaction	{"revision": 29, "compact-period": "2h0m0s"}
    logger.go:130: 2023-12-03T15:04:02.375Z	INFO	completed auto periodic compaction	{"revision": 29, "compact-period": "2h0m0s", "took": "0s"}
    periodic_test.go:77: compact request = revision:29 , want revision:30 
--- FAIL: TestPeriodicHourly (0.90s)
{Failed  === RUN   TestPeriodicMinutes
    logger.go:130: 2023-12-03T15:04:02.606Z	INFO	starting auto periodic compaction	{"revision": 1, "compact-period": "5m0s"}
    periodic_test.go:105: len(actions) = 0, expected >= 1
--- FAIL: TestPeriodicMinutes (0.27s)

Which tests are flaking?

TestPeriodicHourly and TestPeriodicMinutes

Github Action link

No response

Reason for failure (if possible)

No response

Anything else we need to know?

No response

@tjungblu
Copy link
Contributor

tjungblu commented Dec 6, 2023

To repro this add a time.Sleep(11 * time.Millisecond) sleep before this line:

The recorder (backed by an unbuffered channel) is supposed to act as a semaphore here, but the Wait(1) might miss a revision and the follow-up calls might go out of sync.

An easy trick might be to let the recorder advance the fake clock directly on the periodic, so it's serialized within one goroutine - similar with the compaction request. Seems the fake clock also introduces some asynchronicity when using pc.clock.After(retryInterval).

@moficodes
Copy link
Member

I can take a look.

@wenjiaswe
Copy link
Contributor

/assign @moficodes

@moficodes
Copy link
Member

Followed the instruction from @tjungblu and managed to reproduce the error.

client/pkg/testutil/recorder.go L131 has a case <-time.After(10 * time.Millisecond) so if after 10ms action is dropped and the test fails.

One thing we could do is add some additional wait there. But that still does not feel like a great solution. But might be good enough as a test util?

@ahrtr thoughts?

@ahrtr
Copy link
Member Author

ahrtr commented Dec 27, 2023

Thanks for looking into the issue.

client/pkg/testutil/recorder.go L131 has a case <-time.After(10 * time.Millisecond) so if after 10ms action is dropped and the test fails.

I do not get time to have a deep dive into the test case. But based on the quick reading on the source code, it seems that the failure is irrelevant to the case <-time.After(10 * time.Millisecond) you mentioned. The reason is simple, even you increase the timeout, and it runs into the case act := <-r.ch (line 129 and 130 in the recorder.go), it just adds one more item into the slice.

	case act := <-r.ch:
		acts = append(acts, act)

But the test case always reads the first item, (line 76 in periodic_test.go)

if !reflect.DeepEqual(a[0].Params[0], &pb.CompactionRequest{Revision: expectedRevision}) {

Did I miss anything?

@ahrtr
Copy link
Member Author

ahrtr commented Feb 1, 2024

@moficodes are you still working on this issue? any update?

@moficodes
Copy link
Member

@ahrtr I would like to work on it more. But I am not sure what would be the path to solving this. What should I look into more?

@ahrtr
Copy link
Member Author

ahrtr commented Feb 5, 2024

Just saw this failure again https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/etcd-io_etcd/17367/pull-etcd-unit-test/1754513529177640960

Please try to figure out the root cause in the first step, thanks.

@Rajalakshmi-Girish
Copy link
Contributor

Also observed the flakiness recently on our internal CI job https://testgrid.k8s.io/ibm-etcd-tests-ppc64le#Periodic%20etcd%20test%20suite%20on%20ppc64le
Below three tests in etcd/server/v3/etcdserver/api/v3compactor were flaking:
TestPeriodicHourly
TestPeriodicSkipRevNotChange
TestPeriodicMinutes

@Rajalakshmi-Girish
Copy link
Contributor

=== FAIL: etcdserver/api/v3compactor TestPeriodicHourly (0.60s)
    logger.go:130: 2024-02-19T14:03:12.455Z	INFO	starting auto periodic compaction	{"revision": 1, "compact-period": "2h0m0s"}
    logger.go:130: 2024-02-19T14:03:12.465Z	INFO	completed auto periodic compaction	{"revision": 1, "compact-period": "2h0m0s", "took": "6m0s"}
    logger.go:130: 2024-02-19T14:03:12.614Z	INFO	starting auto periodic compaction	{"revision": 10, "compact-period": "2h0m0s"}
    logger.go:130: 2024-02-19T14:03:12.614Z	INFO	completed auto periodic compaction	{"revision": 10, "compact-period": "2h0m0s", "took": "0s"}
    logger.go:130: 2024-02-19T14:03:12.746Z	INFO	starting auto periodic compaction	{"revision": 20, "compact-period": "2h0m0s"}
    periodic_test.go:72: len(actions) = 0, expected >= 1
=== FAIL: etcdserver/api/v3compactor TestPeriodicSkipRevNotChange (1.09s)
    logger.go:130: 2024-02-19T11:03:16.286Z	INFO	starting auto periodic compaction	{"revision": 101, "compact-period": "5m0s"}
    logger.go:130: 2024-02-19T11:03:16.296Z	INFO	completed auto periodic compaction	{"revision": 101, "compact-period": "5m0s", "took": "30s"}
    periodic_test.go:233: len(actions) = 0, expected >= 1
    logger.go:130: 2024-02-19T11:03:17.215Z	INFO	starting auto periodic compaction	{"revision": 102, "compact-period": "5m0s"}
=== FAIL: etcdserver/api/v3compactor TestPeriodicMinutes (0.96s)
    logger.go:130: 2024-02-14T17:03:22.202Z	INFO	starting auto periodic compaction	{"revision": 1, "compact-period": "5m0s"}
    logger.go:130: 2024-02-14T17:03:22.213Z	INFO	completed auto periodic compaction	{"revision": 1, "compact-period": "5m0s", "took": "30s"}
    logger.go:130: 2024-02-14T17:03:22.343Z	INFO	starting auto periodic compaction	{"revision": 10, "compact-period": "5m0s"}
    logger.go:130: 2024-02-14T17:03:22.343Z	INFO	completed auto periodic compaction	{"revision": 10, "compact-period": "5m0s", "took": "0s"}
    logger.go:130: 2024-02-14T17:03:22.536Z	INFO	starting auto periodic compaction	{"revision": 20, "compact-period": "5m0s"}
    logger.go:130: 2024-02-14T17:03:22.536Z	INFO	completed auto periodic compaction	{"revision": 20, "compact-period": "5m0s", "took": "0s"}
    logger.go:130: 2024-02-14T17:03:22.732Z	INFO	starting auto periodic compaction	{"revision": 29, "compact-period": "5m0s"}
    logger.go:130: 2024-02-14T17:03:22.732Z	INFO	completed auto periodic compaction	{"revision": 29, "compact-period": "5m0s", "took": "0s"}
    periodic_test.go:127: compact request = revision:29 , want revision:30 
    logger.go:130: 2024-02-14T17:03:22.871Z	INFO	starting auto periodic compaction	{"revision": 39, "compact-period": "5m0s"}
    logger.go:130: 2024-02-14T17:03:22.871Z	INFO	completed auto periodic compaction	{"revision": 39, "compact-period": "5m0s", "took": "0s"}
    periodic_test.go:127: compact request = revision:39 , want revision:40 
    logger.go:130: 2024-02-14T17:03:23.039Z	INFO	starting auto periodic compaction	{"revision": 49, "compact-period": "5m0s"}
    logger.go:130: 2024-02-14T17:03:23.039Z	INFO	completed auto periodic compaction	{"revision": 49, "compact-period": "5m0s", "took": "0s"}
    periodic_test.go:127: compact request = revision:49 , want revision:50 

@ahrtr
Copy link
Member Author

ahrtr commented Feb 20, 2024

@moficodes are you still working on this case?

@ahrtr
Copy link
Member Author

ahrtr commented Feb 20, 2024

@ahrtr I would like to work on it more. But I am not sure what would be the path to solving this. What should I look into more?

The first step should be definitely to figure out the root cause. Please feedback if you have any difficulty or do not have bandwidth to continue to work on this case.

@Rajalakshmi-Girish
Copy link
Contributor

Rajalakshmi-Girish commented Feb 27, 2024

This is one of the top flaking tests in our internal CI. Please see testgrid
Is there any update on this? @moficodes @ahrtr

@joshuazh-x
Copy link
Contributor

The root cause, as aforementioned, should be out-of-sync of revision getter or compactor in two loops. It would help to remove timeout in revision getter's wait to minimize such kind of failure as in PR #17513

@ahrtr
Copy link
Member Author

ahrtr commented Mar 1, 2024

To repro this add a time.Sleep(11 * time.Millisecond) sleep before this line:

The recorder (backed by an unbuffered channel) is supposed to act as a semaphore here, but the Wait(1) might miss a revision and the follow-up calls might go out of sync.

It took me some time to read the test case and this comment again, eventually I understood your point. It makes sense. Thanks @tjungblu

It would help to remove timeout in revision getter's wait to minimize such kind of failure as in PR #17513

Thanks for the fix. Overall looks good with a minor comment.

@moficodes moficodes removed their assignment May 20, 2024
@srivastav-abhishek
Copy link
Contributor

Observed recent flakiness for go.etcd.io/etcd/server/v3/etcdserver/api/v3compactor: TestPeriodicHourly : https://prow.ppc64le-cloud.cis.ibm.net/view/gs/ppc64le-kubernetes/logs/periodic-etcd-test-ppc64le/1815159905875660800

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging a pull request may close this issue.

7 participants