Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

etcd-tester: exit stresser only from cancel #4504

Closed
wants to merge 1 commit into from
Closed

Conversation

gyuho
Copy link
Contributor

@gyuho gyuho commented Feb 12, 2016

Exit the stress goroutine only when the stresser.Cancel
method gets called. Bad network connections also cause errors
in Stress and in such cases, we should retry rather than
exiting for any error (this will stop the whole stress).

Fixes #4477.

(+ minor spell change on clientV2 to clientv2)

@gyuho
Copy link
Contributor Author

gyuho commented Feb 12, 2016

@xiang90 After adding the new test cases (leader failures), it was much easier to reproduce this issue.

  1. fileutil.Purge has no impact (usually takes less than 1ms)
  2. Confirmed that it only happens when there is leader failure and network partition between two other

Last night, I deployed etcd with verbose outputs and found out that it was etcd-tester doing something wrong: the stresser was not being triggered after we have network partitions.

We added sync.WaitGroup to make sure about cancel operation and make it return when there's an error in PUT requests (https://github.com/coreos/etcd/blob/master/tools/functional-tester/etcd-tester/stresser.go#L86-L94). But this is wrong because there could be transport is closing or context deadline exceeded error with bad network connections. We only want to return when we cancel the stresser, not when there are bad network connections. Otherwise, we never retry PUT after the cluster recovers from bad network connections.

So I deployed the code that only returns when stresser stopped flag is true and have been running for several hours of only case #3, and such error is gone now. I will submit the PR with this code.
@xiang90 After adding the new test cases (leader failures), it was much easier to reproduce this issue.

  1. fileutil.Purge has no impact (usually takes less than 1ms)
  2. Confirmed that it only happens when there is leader failure and bad network connection between two other

Last night, I deployed etcd with verbose outputs and found out that it was etcd-tester doing something wrong: the stresser was not being triggered after we have bad network connections.

EDIT: Used wrong term (network partitions -> bad network connections)

@gyuho
Copy link
Contributor Author

gyuho commented Feb 12, 2016

This needs more investigation in order to make sure etcd didn't cause the bad network connections.

@gyuho
Copy link
Contributor Author

gyuho commented Feb 12, 2016

Will rework on this after several other changes to v3 apis.

Exit the stress goroutine only when the stresser.Cancel
method gets called. Network partitions also cause errors
in Stress and in such cases, we should retry rather than
exiting for any error (this will stop the whole stress).

Fixes coreos#4477.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant