restrict node start-up when cluster name in data path #36519

Merged
merged 15 commits into from
Jan 2, 2019

Conversation

talevy
Contributor

@talevy talevy commented Dec 12, 2018

A cluster created on 2.x stores all of a node's contents under path.data inside a directory named after the cluster. 5.x (#18554) removed that directory and moved its contents up a level, while still reading the 2.x structure correctly. 6.x removes this backwards-compatible behavior (#20433): a 6.x node started with a data directory that uses the old 2.x structure treats it as empty and ignores the existing data.

This PR makes a 6.x node refuse to start when a data path contains a directory named after the cluster.

relates: #32661 (comment)
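
To make the intended behavior concrete, here is a minimal, self-contained sketch of such a check. It is not the code from this PR: the class and method names are hypothetical, and it assumes that "cluster name in the data path" means a directory named after the cluster sitting directly under a configured data path. The wording of the exception only mirrors the error message quoted later in this thread.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; names are illustrative and not taken from Elasticsearch.
public class ClusterNameInDataPathSketch {

    /** Returns every "<dataPath>/<clusterName>" that exists as a directory. */
    static List<Path> findPathsWithClusterName(List<Path> dataPaths, String clusterName) {
        List<Path> offending = new ArrayList<>();
        for (Path dataPath : dataPaths) {
            Path candidate = dataPath.resolve(clusterName);
            if (Files.isDirectory(candidate)) {
                offending.add(candidate);
            }
        }
        return offending;
    }

    /** Throws if any data path still uses the old 2.x layout (assumed definition). */
    static void ensureNoClusterNameInDataPaths(List<Path> dataPaths, String clusterName) {
        List<Path> offending = findPathsWithClusterName(dataPaths, clusterName);
        if (!offending.isEmpty()) {
            throw new IllegalStateException(
                "Cluster name [" + clusterName + "] subdirectory exists in data paths " + offending
                    + ". All data under these paths must be moved up one directory to paths " + dataPaths);
        }
    }

    public static void main(String[] args) throws IOException {
        Path data = Files.createTempDirectory("data");
        List<Path> dataPaths = Arrays.asList(data);

        // New-style layout: no cluster-name directory, so the check passes silently.
        ensureNoClusterNameInDataPaths(dataPaths, "elasticsearch");

        // Simulate a 2.x layout by creating "<data>/elasticsearch".
        Files.createDirectory(data.resolve("elasticsearch"));
        try {
            ensureNoClusterNameInDataPaths(dataPaths, "elasticsearch");
        } catch (IllegalStateException e) {
            System.out.println("refused to start: " + e.getMessage());
        }
    }
}
```

Keeping the lookup (`findPathsWithClusterName`) separate from the throwing wrapper lets the detection logic be unit tested without starting a node.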

@talevy talevy added the WIP and :Core/Infra/Core labels Dec 12, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra

@talevy
Contributor Author

talevy commented Dec 12, 2018

cc/ @bleskes

It took me a while to get to this, only to realize from what the tests show that we don't necessarily have permission to check whether there is a cluster-name directory in the data path.

I haven't taken another look, so I will do so tomorrow.

@bleskes
Contributor

bleskes commented Dec 12, 2018

Thx @talevy . I'm inclined to implement this as a BootstrapCheck that's always enforced. That said, those are run after the security manager is installed as well. @danielmitterdorfer can you please advise?

@danielmitterdorfer
Member

A bootstrap check makes sense to me (although they will only issue a warning when bound to loopback, but I think that is ok?). I also think that we are safe with a bootstrap check because we set up the necessary permissions to access the data paths in Security.configure and then invoke the bootstrap checks. As the path in question is a subdirectory of the data path (to which we have full access), I think we should be fine w.r.t. access checks?

@talevy
Contributor Author

talevy commented Dec 12, 2018

thanks @danielmitterdorfer @bleskes, I will see how to make this a BootstrapCheck

@talevy talevy force-pushed the cluster-in-path branch 3 times, most recently from e38a3fe to 48adce9 Compare December 12, 2018 17:54
talevy added a commit to talevy/elasticsearch that referenced this pull request Dec 12, 2018
Certain BootstrapCheck implementations may need access to environment-specific
values. Watcher's EncryptSensitiveDataBootstrapCheck receives the node's environment
via its constructor to work around this shortcoming in BootstrapContext. This commit
pulls the node's environment into BootstrapContext.

Another case is found in elastic#36519, where it is useful to check the state of the
data-path. Since PathUtils.get and Paths.get are forbidden APIs, we rely on
the environment to retrieve references to things like node data paths.

This means that the BootstrapContext will have the same Settings used in the
Environment, which currently differs from the Node's settings.
@talevy
Contributor Author

talevy commented Dec 12, 2018

Update: I've learned more about BootstrapChecks and realized there are a few nice refactors to do that will make writing the ClusterNameInDataPathCheck a lot cleaner. Long story short: This check needs access to the node's Environment to work right.

PR that needs to be merged before continuing: #36573
related PR that would be a nice to have: #36574

talevy added a commit that referenced this pull request Dec 13, 2018
talevy added a commit to talevy/elasticsearch that referenced this pull request Dec 13, 2018
@talevy talevy removed the WIP label Dec 13, 2018
@talevy
Contributor Author

talevy commented Dec 13, 2018

Update: refactors to give access to Environment#pathFiles have made it in, so now this is ready!

Member

@danielmitterdorfer danielmitterdorfer left a comment

Looks fine overall. I left a couple of suggestions / questions about the error message.

@talevy
Contributor Author

talevy commented Dec 17, 2018

example run from a test cluster named elasticsearch

[2018-12-17T10:00:23,572][INFO ][o.e.n.Node               ] [1Xo7d5f] initialized
[2018-12-17T10:00:23,573][INFO ][o.e.n.Node               ] [1Xo7d5f] starting ...
[2018-12-17T10:00:23,796][INFO ][o.e.t.TransportService   ] [1Xo7d5f] publish_address {127.0.0.1:9300}, bound_addresses {[::1]:9300}, {127.0.0.1:9300}

ERROR: [1] bootstrap checks failed

[1]: Cluster name [elasticsearch] subdirectory exists in data paths [/distribution/archives/tar/build/distributions/elasticsearch-6.6.0-SNAPSHOT/data/elasticsearch]. All data under these paths must be moved up one directory to paths [/distribution/archives/tar/build/distributions/elasticsearch-6.6.0-SNAPSHOT/data]

[2018-12-17T10:00:23,850][INFO ][o.e.n.Node               ] [1Xo7d5f] stopping ...
[2018-12-17T10:00:23,865][INFO ][o.e.n.Node               ] [1Xo7d5f] stopped
[2018-12-17T10:00:23,866][INFO ][o.e.n.Node               ] [1Xo7d5f] closing ...
[2018-12-17T10:00:23,924][INFO ][o.e.n.Node               ] [1Xo7d5f] closed
[2018-12-17T10:00:23,928][INFO ][o.e.x.m.p.NativeController] [1Xo7d5f] Native controller process has stopped - no new native processes can be started

@talevy
Contributor Author

talevy commented Dec 17, 2018

run the default distro tests

@talevy
Contributor Author

talevy commented Dec 17, 2018

After discussing offline with the team, the decision is to move this check inline into Node initialization instead of implementing it as a bootstrap check. The reasoning is that bootstrap checks are primarily intended to be checks that can be enabled or disabled depending on the strictness of the environment. This data integrity check is one we always want to run, so it is a good candidate for hardcoding the exception into the startup code.
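
The distinction can be illustrated with a simplified model. The sketch below is not Elasticsearch's bootstrap check implementation; `Check`, `runBootstrapStyle`, and `runInline` are hypothetical names, and the `enforce` flag stands in for the environment-dependent strictness mentioned earlier in this thread (failures only produce warnings when bound to loopback).

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical model of the two enforcement styles discussed above.
public class EnforcementSketch {

    /** Stand-in for a check: returns a failure message, or null when it passes. */
    interface Check {
        String failureOrNull();
    }

    /**
     * Bootstrap-style: failures are fatal only when enforcement is on;
     * otherwise they are handed back so the caller can log them as warnings.
     */
    static List<String> runBootstrapStyle(List<Check> checks, boolean enforce) {
        List<String> failures = new ArrayList<>();
        for (Check check : checks) {
            String failure = check.failureOrNull();
            if (failure != null) {
                failures.add(failure);
            }
        }
        if (enforce && !failures.isEmpty()) {
            throw new IllegalStateException("bootstrap checks failed: " + failures);
        }
        return failures;
    }

    /** Inline style: a failure always aborts startup, whatever the environment. */
    static void runInline(Check check) {
        String failure = check.failureOrNull();
        if (failure != null) {
            throw new IllegalStateException(failure);
        }
    }
}
```

In this model, a data integrity check run through `runBootstrapStyle` with `enforce == false` would let the node start and silently ignore existing data, which is exactly the outcome the PR is trying to prevent; `runInline` has no such escape hatch.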

Member

@jasontedor jasontedor left a comment

The production code looks good, yet I would structure it a little differently?

@@ -739,6 +739,20 @@ public Node start() throws NodeValidationException {
} catch (IOException e) {
throw new UncheckedIOException(e);
}

final List<Path> existingPathsWithClusterName = Arrays.stream(environment.dataFiles())
Member

I don't think this code should be embedded directly in the node start method. Can you factor this into a dedicated method (e.g., see how we handled something similar in 8033c57)? Then you can test this method directly.

Contributor Author

for sure, I was 50/50 on doing this, but decided against it due to fear of adding too much overhead. makes sense though, that method is large enough. I'll update

@talevy
Contributor Author

talevy commented Dec 20, 2018

thanks for taking a look @jasontedor. I kept the node.start() test because I thought just testing the method would not be enough to check that the method is being called and used by the node's startup execution

Member

@jasontedor jasontedor left a comment

LGTM.

@talevy
Contributor Author

talevy commented Jan 2, 2019

thanks Jason!

@talevy talevy merged commit 8d36cf3 into elastic:6.x Jan 2, 2019
@talevy talevy deleted the cluster-in-path branch January 2, 2019 18:02
@talevy
Contributor Author

talevy commented Jan 2, 2019

and thanks @danielmitterdorfer for initial review and suggestions!

@ywelsch
Contributor

ywelsch commented Jan 4, 2019

This PR was missing version and type labels. I've added them based on the commit that was merged. Please adapt if necessary.

Labels
:Core/Infra/Core, >enhancement, v6.7.0
6 participants