Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

etcdserver: always remove member directory when bootstrap fails #4087

Merged
merged 1 commit into from
Dec 29, 2015

Conversation

gyuho
Copy link
Contributor

@gyuho gyuho commented Dec 29, 2015

This removes member directory when bootstrap fails including joining existing
cluster and forming a new cluster. This fixes coreos#3827.

@@ -272,6 +272,10 @@ func NewServer(cfg *ServerConfig) (*EtcdServer, error) {
var err error
str, err = discovery.JoinCluster(cfg.DiscoveryURL, cfg.DiscoveryProxy, m.ID, cfg.InitialPeerURLsMap.String())
if err != nil {
// It removes member directory when NewServer returns error.
// This prevents conflicts with 'proxy' directory when
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

etcdserver should have no idea about proxy. This is just a necessary cleanup step.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should also remove member dir in another !haveWAL case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

basically in the two !haveWAL case, we should clean up member dir if bootstrap fails.

@gyuho
Copy link
Contributor Author

gyuho commented Dec 29, 2015

@xiang90 I added more checks when deleting. Please take a look again and let me know.
Thanks,

@@ -241,6 +250,7 @@ func NewServer(cfg *ServerConfig) (*EtcdServer, error) {
}
existingCluster, err := GetClusterFromRemotePeers(getRemotePeerURLs(cl, cfg.Name), prt)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually we don't need this I think. What do you mean by another !haveWAL case?

@xiang90
Copy link
Contributor

xiang90 commented Dec 29, 2015

@gyuho I should be more clear. For every error in the wal not existing case, we should remove member dir. We should only keep member dir if the new member is successfully bootstrapped. We know for sure it is a new member when wal does not perviously exists.

@@ -229,6 +229,15 @@ func NewServer(cfg *ServerConfig) (*EtcdServer, error) {
if err != nil {
return nil, err
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest a simple way:

right after line 225

if !haveWAL {
    defer func() {
        if err != nil {
            // cleans up member directory if bootstrap fails (including forming or joining a new cluster)
            os.RemoveAll(cfg.MemberDir())
        }
    }
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You also need to change the func signature to

 func NewServer(cfg *ServerConfig) (srv *EtcdServer, err error)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it.

@gyuho gyuho force-pushed the delete_discovery_check branch 2 times, most recently from 12570c6 to aacad34 Compare December 29, 2015 06:08
@xiang90
Copy link
Contributor

xiang90 commented Dec 29, 2015

LGTM. Let's test it, at least manually. So we know this does fix the issue reported.

@gyuho
Copy link
Contributor Author

gyuho commented Dec 29, 2015

Fixing govet issues. And found that:

func (d *discovery) checkClusterRetry() ([]*client.Node, int, uint64, error) {
    if d.retries < nRetries {
        d.logAndBackoffForRetry("cluster status check")
        return d.checkCluster()
    }
    return nil, 0, 0, ErrTooManyRetries
}

is used for discovering. They just keep retrying and defer never get executed.
Will investigate more.

@xiang90
Copy link
Contributor

xiang90 commented Dec 29, 2015

@gyuho You are right. Basically you are fixing #3827 now. To totally fix the discovery issue, we have to make sure that we do not create member dir before finishing discovery.

@@ -274,8 +284,9 @@ func NewServer(cfg *ServerConfig) (*EtcdServer, error) {
if err != nil {
return nil, &DiscoveryError{Op: "join", Err: err}
}
urlsmap, err := types.NewURLsMap(str)
if err != nil {
urlsmap, e := types.NewURLsMap(str)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e -> uerr

@gyuho gyuho force-pushed the delete_discovery_check branch 2 times, most recently from ea19714 to 0841134 Compare December 29, 2015 06:45
@gyuho
Copy link
Contributor Author

gyuho commented Dec 29, 2015

@xiang90 I tested manually injecting error to make startEtcd return when !haveWAL and confirmed that the defer statement gets executed and remove member directory.

prt, err := rafthttp.NewRoundTripper(cfg.PeerTLSInfo, cfg.peerDialTimeout())
if err != nil {
return nil, err
prt, uerr := rafthttp.NewRoundTripper(cfg.PeerTLSInfo, cfg.peerDialTimeout())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should not call everything uerr. When there is a naming conflict for error, we usually add one prefix. For newRoundTripper, probably just rterr.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok got it.

existingCluster, err := GetClusterFromRemotePeers(getRemotePeerURLs(cl, cfg.Name), prt)
if err != nil {
return nil, fmt.Errorf("cannot fetch cluster info from peer urls: %v", err)
existingCluster, uerr := GetClusterFromRemotePeers(getRemotePeerURLs(cl, cfg.Name), prt)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably gerr

@gyuho
Copy link
Contributor Author

gyuho commented Dec 29, 2015

@xiang90 All variable name conflicts and govet shadowing issues are fixed. PTAL.

@xiang90
Copy link
Contributor

xiang90 commented Dec 29, 2015

@gyuho Can we also update the commit message?

etcdserver: always remove member directory when bootstrap fails (including joining existing cluster and forming a new cluster)

This removes member directory when bootstrap fails including joining existing
cluster and forming a new cluster. This fixes etcd-io#3827.
@gyuho
Copy link
Contributor Author

gyuho commented Dec 29, 2015

@xiang90 Just did. Thanks!

@xiang90
Copy link
Contributor

xiang90 commented Dec 29, 2015

LGTM

@gyuho gyuho changed the title etcdserver: delete member directory when discovery fails etcdserver: always remove member directory when bootstrap fails Dec 29, 2015
@gyuho
Copy link
Contributor Author

gyuho commented Dec 29, 2015

Thanks, will merge after CI passes.

gyuho added a commit that referenced this pull request Dec 29, 2015
etcdserver: always remove member directory when bootstrap fails
@gyuho gyuho merged commit cd42c91 into etcd-io:master Dec 29, 2015
@gyuho gyuho deleted the delete_discovery_check branch January 31, 2016 00:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants