Flaky `TestPeriodicHourly` and `TestPeriodicMinutes` #17054

ahrtr · 2023-12-03T15:14:46Z

Which github workflows are flaking?

https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/etcd-io_etcd/16822/pull-etcd-unit-test/1731327844841164800

{Failed  === RUN   TestPeriodicHourly
    logger.go:130: 2023-12-03T15:04:01.795Z	INFO	starting auto periodic compaction	{"revision": 1, "compact-period": "2h0m0s"}
    logger.go:130: 2023-12-03T15:04:01.806Z	INFO	completed auto periodic compaction	{"revision": 1, "compact-period": "2h0m0s", "took": "6m0s"}
    logger.go:130: 2023-12-03T15:04:01.975Z	INFO	starting auto periodic compaction	{"revision": 9, "compact-period": "2h0m0s"}
    logger.go:130: 2023-12-03T15:04:01.975Z	INFO	completed auto periodic compaction	{"revision": 9, "compact-period": "2h0m0s", "took": "0s"}
    periodic_test.go:77: compact request = revision:9 , want revision:10 
    logger.go:130: 2023-12-03T15:04:02.154Z	INFO	starting auto periodic compaction	{"revision": 19, "compact-period": "2h0m0s"}
    logger.go:130: 2023-12-03T15:04:02.154Z	INFO	completed auto periodic compaction	{"revision": 19, "compact-period": "2h0m0s", "took": "0s"}
    periodic_test.go:77: compact request = revision:19 , want revision:20 
    logger.go:130: 2023-12-03T15:04:02.375Z	INFO	starting auto periodic compaction	{"revision": 29, "compact-period": "2h0m0s"}
    logger.go:130: 2023-12-03T15:04:02.375Z	INFO	completed auto periodic compaction	{"revision": 29, "compact-period": "2h0m0s", "took": "0s"}
    periodic_test.go:77: compact request = revision:29 , want revision:30 
--- FAIL: TestPeriodicHourly (0.90s)

{Failed  === RUN   TestPeriodicMinutes
    logger.go:130: 2023-12-03T15:04:02.606Z	INFO	starting auto periodic compaction	{"revision": 1, "compact-period": "5m0s"}
    periodic_test.go:105: len(actions) = 0, expected >= 1
--- FAIL: TestPeriodicMinutes (0.27s)

Which tests are flaking?

TestPeriodicHourly and TestPeriodicMinutes

Github Action link

No response

Reason for failure (if possible)

No response

Anything else we need to know?

No response

The text was updated successfully, but these errors were encountered:

tjungblu · 2023-12-06T14:38:54Z

To repro this add a time.Sleep(11 * time.Millisecond) sleep before this line:

etcd/server/etcdserver/api/v3compactor/periodic.go

Line 125 in 4ece556

rev := pc.revs[0]

The recorder (backed by an unbuffered channel) is supposed to act as a semaphore here, but the Wait(1) might miss a revision and the follow-up calls might go out of sync.

~~An easy trick might be to let the recorder advance the fake clock directly on the periodic, so it's serialized within one goroutine - similar with the compaction request.~~ Seems the fake clock also introduces some asynchronicity when using pc.clock.After(retryInterval).

moficodes · 2023-12-07T19:15:24Z

I can take a look.

wenjiaswe · 2023-12-07T19:15:33Z

/assign @moficodes

moficodes · 2023-12-12T16:27:19Z

Followed the instruction from @tjungblu and managed to reproduce the error.

client/pkg/testutil/recorder.go L131 has a case <-time.After(10 * time.Millisecond) so if after 10ms action is dropped and the test fails.

One thing we could do is add some additional wait there. But that still does not feel like a great solution. But might be good enough as a test util?

@ahrtr thoughts?

ahrtr · 2023-12-27T18:21:40Z

Thanks for looking into the issue.

client/pkg/testutil/recorder.go L131 has a case <-time.After(10 * time.Millisecond) so if after 10ms action is dropped and the test fails.

I do not get time to have a deep dive into the test case. But based on the quick reading on the source code, it seems that the failure is irrelevant to the case <-time.After(10 * time.Millisecond) you mentioned. The reason is simple, even you increase the timeout, and it runs into the case act := <-r.ch (line 129 and 130 in the recorder.go), it just adds one more item into the slice.

	case act := <-r.ch:
		acts = append(acts, act)

But the test case always reads the first item, (line 76 in periodic_test.go)

if !reflect.DeepEqual(a[0].Params[0], &pb.CompactionRequest{Revision: expectedRevision}) {

Did I miss anything?

ahrtr · 2024-02-01T14:49:04Z

@moficodes are you still working on this issue? any update?

moficodes · 2024-02-01T19:01:14Z

@ahrtr I would like to work on it more. But I am not sure what would be the path to solving this. What should I look into more?

ahrtr · 2024-02-05T14:43:11Z

Just saw this failure again https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/etcd-io_etcd/17367/pull-etcd-unit-test/1754513529177640960

Please try to figure out the root cause in the first step, thanks.

Rajalakshmi-Girish · 2024-02-20T09:30:11Z

Also observed the flakiness recently on our internal CI job https://testgrid.k8s.io/ibm-etcd-tests-ppc64le#Periodic%20etcd%20test%20suite%20on%20ppc64le
Below three tests in etcd/server/v3/etcdserver/api/v3compactor were flaking:
TestPeriodicHourly
TestPeriodicSkipRevNotChange
TestPeriodicMinutes

Rajalakshmi-Girish · 2024-02-20T09:35:56Z

=== FAIL: etcdserver/api/v3compactor TestPeriodicHourly (0.60s)
    logger.go:130: 2024-02-19T14:03:12.455Z	INFO	starting auto periodic compaction	{"revision": 1, "compact-period": "2h0m0s"}
    logger.go:130: 2024-02-19T14:03:12.465Z	INFO	completed auto periodic compaction	{"revision": 1, "compact-period": "2h0m0s", "took": "6m0s"}
    logger.go:130: 2024-02-19T14:03:12.614Z	INFO	starting auto periodic compaction	{"revision": 10, "compact-period": "2h0m0s"}
    logger.go:130: 2024-02-19T14:03:12.614Z	INFO	completed auto periodic compaction	{"revision": 10, "compact-period": "2h0m0s", "took": "0s"}
    logger.go:130: 2024-02-19T14:03:12.746Z	INFO	starting auto periodic compaction	{"revision": 20, "compact-period": "2h0m0s"}
    periodic_test.go:72: len(actions) = 0, expected >= 1

=== FAIL: etcdserver/api/v3compactor TestPeriodicSkipRevNotChange (1.09s)
    logger.go:130: 2024-02-19T11:03:16.286Z	INFO	starting auto periodic compaction	{"revision": 101, "compact-period": "5m0s"}
    logger.go:130: 2024-02-19T11:03:16.296Z	INFO	completed auto periodic compaction	{"revision": 101, "compact-period": "5m0s", "took": "30s"}
    periodic_test.go:233: len(actions) = 0, expected >= 1
    logger.go:130: 2024-02-19T11:03:17.215Z	INFO	starting auto periodic compaction	{"revision": 102, "compact-period": "5m0s"}

=== FAIL: etcdserver/api/v3compactor TestPeriodicMinutes (0.96s)
    logger.go:130: 2024-02-14T17:03:22.202Z	INFO	starting auto periodic compaction	{"revision": 1, "compact-period": "5m0s"}
    logger.go:130: 2024-02-14T17:03:22.213Z	INFO	completed auto periodic compaction	{"revision": 1, "compact-period": "5m0s", "took": "30s"}
    logger.go:130: 2024-02-14T17:03:22.343Z	INFO	starting auto periodic compaction	{"revision": 10, "compact-period": "5m0s"}
    logger.go:130: 2024-02-14T17:03:22.343Z	INFO	completed auto periodic compaction	{"revision": 10, "compact-period": "5m0s", "took": "0s"}
    logger.go:130: 2024-02-14T17:03:22.536Z	INFO	starting auto periodic compaction	{"revision": 20, "compact-period": "5m0s"}
    logger.go:130: 2024-02-14T17:03:22.536Z	INFO	completed auto periodic compaction	{"revision": 20, "compact-period": "5m0s", "took": "0s"}
    logger.go:130: 2024-02-14T17:03:22.732Z	INFO	starting auto periodic compaction	{"revision": 29, "compact-period": "5m0s"}
    logger.go:130: 2024-02-14T17:03:22.732Z	INFO	completed auto periodic compaction	{"revision": 29, "compact-period": "5m0s", "took": "0s"}
    periodic_test.go:127: compact request = revision:29 , want revision:30 
    logger.go:130: 2024-02-14T17:03:22.871Z	INFO	starting auto periodic compaction	{"revision": 39, "compact-period": "5m0s"}
    logger.go:130: 2024-02-14T17:03:22.871Z	INFO	completed auto periodic compaction	{"revision": 39, "compact-period": "5m0s", "took": "0s"}
    periodic_test.go:127: compact request = revision:39 , want revision:40 
    logger.go:130: 2024-02-14T17:03:23.039Z	INFO	starting auto periodic compaction	{"revision": 49, "compact-period": "5m0s"}
    logger.go:130: 2024-02-14T17:03:23.039Z	INFO	completed auto periodic compaction	{"revision": 49, "compact-period": "5m0s", "took": "0s"}
    periodic_test.go:127: compact request = revision:49 , want revision:50

ahrtr · 2024-02-20T09:48:40Z

@moficodes are you still working on this case?

ahrtr · 2024-02-20T09:51:02Z

@ahrtr I would like to work on it more. But I am not sure what would be the path to solving this. What should I look into more?

The first step should be definitely to figure out the root cause. Please feedback if you have any difficulty or do not have bandwidth to continue to work on this case.

Rajalakshmi-Girish · 2024-02-27T04:44:24Z

This is one of the top flaking tests in our internal CI. Please see testgrid
Is there any update on this? @moficodes @ahrtr

joshuazh-x · 2024-03-01T09:01:55Z

The root cause, as aforementioned, should be out-of-sync of revision getter or compactor in two loops. It would help to remove timeout in revision getter's wait to minimize such kind of failure as in PR #17513

ahrtr · 2024-03-01T10:17:08Z

To repro this add a time.Sleep(11 * time.Millisecond) sleep before this line:

etcd/server/etcdserver/api/v3compactor/periodic.go

Line 125 in 4ece556

rev := pc.revs[0]

The recorder (backed by an unbuffered channel) is supposed to act as a semaphore here, but the Wait(1) might miss a revision and the follow-up calls might go out of sync.

It took me some time to read the test case and this comment again, eventually I understood your point. It makes sense. Thanks @tjungblu

It would help to remove timeout in revision getter's wait to minimize such kind of failure as in PR #17513

Thanks for the fix. Overall looks good with a minor comment.

srivastav-abhishek · 2024-07-22T09:30:54Z

Observed recent flakiness for go.etcd.io/etcd/server/v3/etcdserver/api/v3compactor: TestPeriodicHourly : https://prow.ppc64le-cloud.cis.ibm.net/view/gs/ppc64le-kubernetes/logs/periodic-etcd-test-ppc64le/1815159905875660800

ahrtr added area/testing type/flake help wanted labels Dec 3, 2023

k8s-ci-robot assigned moficodes Dec 7, 2023

fuweid mentioned this issue Dec 29, 2023

server: fix learner metric incorrect issue #17176

Merged

Rajalakshmi-Girish mentioned this issue Feb 27, 2024

TestPeriodicSkipRevNotChange failed #17487

Closed

joshuazh-x mentioned this issue Mar 1, 2024

Use strict synchronization for revision getter to minimize flaky result caused by time racing. #17513

Closed

Fube mentioned this issue Apr 13, 2024

Flake TestRevision #17477

Open

moficodes removed their assignment May 20, 2024

srivastav-abhishek mentioned this issue Sep 19, 2024

Fixed periodic compaction tests #18608

Merged

ahrtr closed this as completed in #18608 Sep 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Flaky `TestPeriodicHourly` and `TestPeriodicMinutes` #17054

Flaky `TestPeriodicHourly` and `TestPeriodicMinutes` #17054

ahrtr commented Dec 3, 2023

tjungblu commented Dec 6, 2023 •

edited

Loading

moficodes commented Dec 7, 2023

wenjiaswe commented Dec 7, 2023

moficodes commented Dec 12, 2023

ahrtr commented Dec 27, 2023

ahrtr commented Feb 1, 2024

moficodes commented Feb 1, 2024

ahrtr commented Feb 5, 2024

Rajalakshmi-Girish commented Feb 20, 2024

Rajalakshmi-Girish commented Feb 20, 2024

ahrtr commented Feb 20, 2024

ahrtr commented Feb 20, 2024

Rajalakshmi-Girish commented Feb 27, 2024 •

edited

Loading

joshuazh-x commented Mar 1, 2024

ahrtr commented Mar 1, 2024

srivastav-abhishek commented Jul 22, 2024

Flaky TestPeriodicHourly and TestPeriodicMinutes #17054

Flaky TestPeriodicHourly and TestPeriodicMinutes #17054

Comments

ahrtr commented Dec 3, 2023

Which github workflows are flaking?

Which tests are flaking?

Github Action link

Reason for failure (if possible)

Anything else we need to know?

tjungblu commented Dec 6, 2023 • edited Loading

moficodes commented Dec 7, 2023

wenjiaswe commented Dec 7, 2023

moficodes commented Dec 12, 2023

ahrtr commented Dec 27, 2023

ahrtr commented Feb 1, 2024

moficodes commented Feb 1, 2024

ahrtr commented Feb 5, 2024

Rajalakshmi-Girish commented Feb 20, 2024

Rajalakshmi-Girish commented Feb 20, 2024

ahrtr commented Feb 20, 2024

ahrtr commented Feb 20, 2024

Rajalakshmi-Girish commented Feb 27, 2024 • edited Loading

joshuazh-x commented Mar 1, 2024

ahrtr commented Mar 1, 2024

srivastav-abhishek commented Jul 22, 2024

Flaky `TestPeriodicHourly` and `TestPeriodicMinutes` #17054

Flaky `TestPeriodicHourly` and `TestPeriodicMinutes` #17054

tjungblu commented Dec 6, 2023 •

edited

Loading

Rajalakshmi-Girish commented Feb 27, 2024 •

edited

Loading