etcd-tester: exit stresser only from cancel #4504

gyuho · 2016-02-12T16:09:05Z

Exit the stress goroutine only when the stresser.Cancel
method is called. Bad network connections also cause errors
in Stress; in such cases we should retry rather than exit,
because exiting on any error stops the whole stresser.

Fixes #4477.

(Also includes a minor spelling change: clientV2 to clientv2.)

gyuho · 2016-02-12T16:09:39Z

@xiang90 After adding the new test cases (leader failures), it was much easier to reproduce this issue.

- fileutil.Purge has no impact (it usually takes less than 1ms).
- Confirmed that it only happens when there is a leader failure and a bad network connection between two other

Last night, I deployed etcd with verbose output and found that etcd-tester was at fault: the stresser was not being triggered after bad network connections.

We added a sync.WaitGroup to make sure the cancel operation completes, and made the stress goroutine return when a PUT request fails (https://github.com/coreos/etcd/blob/master/tools/functional-tester/etcd-tester/stresser.go#L86-L94). But this is wrong, because bad network connections can produce `transport is closing` or `context deadline exceeded` errors. We only want to return when we cancel the stresser, not when there are bad network connections. Otherwise, we never retry PUT after the cluster recovers from bad network connections.

So I deployed code that returns only when the stresser's stopped flag is true and have been running only case #3 for several hours; the error no longer occurs. I will submit the PR with this code.

EDIT: Used wrong term (network partitions -> bad network connections)

gyuho · 2016-02-12T17:50:15Z

This needs more investigation in order to make sure etcd didn't cause the bad network connections.

gyuho · 2016-02-12T20:11:18Z

Will rework this after several other changes to the v3 APIs.

Exit the stress goroutine only when the stresser.Cancel method gets called. Network partitions also cause errors in Stress and in such cases, we should retry rather than exiting for any error (this will stop the whole stress). Fixes coreos#4477.

gyuho added the reviewed/needs more information label Feb 12, 2016

gyuho added the reviewed/needs rework label Feb 12, 2016

gyuho self-assigned this Feb 12, 2016

This was referenced Feb 12, 2016

etcd-tester: use Hash method to get both revision and hash #4513

Merged

functional-test: tester client error #4517

Closed

gyuho force-pushed the f0 branch from 17ec0ff to c7b3509 Compare February 15, 2016 06:54

gyuho mentioned this pull request Feb 17, 2016

etcd-tester: 10-second timeout for stressers #4547

Merged

gyuho closed this Feb 17, 2016

gyuho deleted the f0 branch February 18, 2016 00:35
