
[Zen2] Write manifest file #35049

Merged
merged 61 commits into from
Nov 19, 2018

Conversation

andrershov
Contributor

@andrershov andrershov commented Oct 29, 2018

An Elasticsearch node is responsible for storing cluster metadata.
There are two types of metadata: global metadata and index metadata.
GatewayMetaState implements ClusterStateApplier, receives all
ClusterStateChanged events, and is responsible for storing modified
metadata to disk.

When a new ClusterStateChanged event is received, GatewayMetaState
checks whether the global metadata has changed and, if so, writes the new
global metadata to disk. After that, GatewayMetaState checks whether index
metadata has changed or new indices have been assigned to this node and,
if so, writes the new index metadata to disk. Atomicity of individual global
metadata and index metadata writes is ensured by the MetaDataStateFormat
class.

Unfortunately, there is no atomicity when more than one piece of metadata
changes (global and index, or the metadata of two indices), and atomicity
is important for Zen2 correctness.
This commit adds atomicity by introducing the notion of a manifest file,
represented by the MetaState class. MetaState contains pointers to the
current metadata.
More precisely, it stores the global state generation as a long and a map from
Index to index metadata generation (also a long). Atomicity of writes for the
manifest file is ensured by the MetaStateFormat class.
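The manifest described above boils down to a small value object holding generation pointers. A minimal sketch, with illustrative names and shape rather than the actual MetaState implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the manifest idea: generation "pointers" to the
// current global metadata state file and to each index's metadata state file.
// The real MetaState class in the PR differs in detail.
public class ManifestSketch {
    private final long globalGeneration;              // generation of the global metadata state file
    private final Map<String, Long> indexGenerations; // index name -> index metadata generation

    public ManifestSketch(long globalGeneration, Map<String, Long> indexGenerations) {
        this.globalGeneration = globalGeneration;
        this.indexGenerations = new HashMap<>(indexGenerations);
    }

    public long globalGeneration() {
        return globalGeneration;
    }

    public Map<String, Long> indexGenerations() {
        return indexGenerations;
    }
}
```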

The algorithm for writing changes to disk is the following:

  1. Write the global metadata state file to disk and remember
    its generation.
  2. For each new/changed index, write a state file to disk and remember
    its generation. For each unchanged index, use the generation from the
    previous manifest file. If an index is removed or this node is no longer
    responsible for it, forget about the index.
  3. Create a MetaState object using the previously remembered generations and
    write it to disk.
  4. Remove the old state files for global metadata, index metadata and the
    manifest.
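The four steps above can be sketched as follows. StateWriter and every method name here are hypothetical stand-ins, not the real GatewayMetaState/MetaStateService API:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Illustrative sketch of the manifest-based write algorithm described above.
public class ManifestWriteSketch {
    // Hypothetical abstraction over the on-disk state-file machinery.
    interface StateWriter {
        long writeGlobalState(Object globalMetaData) throws IOException;
        long writeIndexState(String index, Object indexMetaData) throws IOException;
        void writeManifest(long globalGen, Map<String, Long> indexGens) throws IOException;
        void cleanupOldGenerations();
    }

    static void applyClusterStateChange(StateWriter writer,
                                        Object newGlobalMetaData,
                                        Map<String, Object> changedIndices,
                                        Map<String, Long> previousManifestGens,
                                        Set<String> indicesStillAssigned) throws IOException {
        // Step 1: write global metadata and remember its generation.
        long globalGen = writer.writeGlobalState(newGlobalMetaData);

        // Step 2: write state for each new/changed index; reuse generations
        // from the previous manifest for unchanged indices; forget indices
        // this node is no longer responsible for.
        Map<String, Long> indexGens = new HashMap<>();
        for (Map.Entry<String, Object> e : changedIndices.entrySet()) {
            indexGens.put(e.getKey(), writer.writeIndexState(e.getKey(), e.getValue()));
        }
        for (Map.Entry<String, Long> e : previousManifestGens.entrySet()) {
            if (indicesStillAssigned.contains(e.getKey()) && !indexGens.containsKey(e.getKey())) {
                indexGens.put(e.getKey(), e.getValue());
            }
        }

        // Step 3: atomically write the manifest pointing at the remembered
        // generations. Until this succeeds, the old manifest stays authoritative.
        writer.writeManifest(globalGen, indexGens);

        // Step 4: only after the manifest commit is it safe to delete old files.
        writer.cleanupOldGenerations();
    }
}
```

The ordering matters: the manifest write is the commit point, so cleanup must come last.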

Additionally, the new implementation relies on enhanced MetaDataStateFormat
failure semantics: applyClusterState throws IOException, whose
subclass WriteStateException can be (and, in Zen2, should be)
handled explicitly.
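A rough illustration of these failure semantics, assuming a WriteStateException-like subclass of IOException carrying a flag for whether on-disk state may be partially written (the class shape here is illustrative, not the exact production code):

```java
import java.io.IOException;

// Illustrative sketch of an explicit-failure exception for state writes.
public class WriteStateExceptionSketch extends IOException {
    // Hypothetical flag: true if the failure may have left on-disk state
    // partially written, so the caller cannot treat the write as a no-op.
    private final boolean dirty;

    public WriteStateExceptionSketch(boolean dirty, String message) {
        super(message);
        this.dirty = dirty;
    }

    public boolean isDirty() {
        return dirty;
    }
}
```

A Zen2 caller would catch this explicitly instead of letting a generic IOException propagate, and react differently depending on whether the write was cleanly rolled back.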

@andrershov andrershov added >enhancement :Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. labels Oct 29, 2018
@andrershov andrershov self-assigned this Oct 29, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@andrershov andrershov mentioned this pull request Oct 29, 2018
6 tasks
# Conflicts:
#	server/src/main/java/org/elasticsearch/gateway/GatewayMetaState.java
Contributor

@DaveCTurner DaveCTurner left a comment


I've done a first pass and left a handful of thoughts.

@andrershov
Contributor Author

@DaveCTurner I've fixed the issues in 421caa1 and added some minor fixes not mentioned in your comments. Ready for the second pass.

@andrershov andrershov reopened this Nov 7, 2018
@andrershov
Contributor Author

@ywelsch I've reopened the PR and made the necessary fixes, except for not writing non-upgraded index metadata on startup in non-BWC mode, which I'll implement tomorrow. Could you please review it?

Contributor

@ywelsch ywelsch left a comment


I've done an initial pass; I have not looked at the tests in detail yet.

@andrershov
Contributor Author

@ywelsch I was working on the tests/bugfixes during the night, and it's now ready for the third pass.
Regarding the atomicity tests, I've implemented them in two commits:

  1. Allow subclasses to override formats used in MetaStateService (to prepare MetaStateService for failure injection) a6dddc8
  2. Add random atomicity test for Transaction 7936cc4

@@ -224,14 +241,17 @@ public final void write(final T state, final Path... locations) throws WriteStat
copyStateToExtraLocations(directories, tmpFileName);
performRenames(tmpFileName, fileName, directories);
performStateDirectoriesFsync(directories);
} catch (WriteStateException e) {
cleanupOldFiles(oldGenerationId, locations);
Contributor


I think this is dangerous and can lead to data loss. Assume that you've successfully written a cluster state that contains an index "test" with a state file of generation 1. Then you try to write an updated cluster state for the index with state file generation 2. Writing the state file for the index is successful, but there's a failure later when writing the state file for another index. Now the node crashes or the clean-up logic fails. When you then handle the next cluster state, you will try to write generation 3, which, if it fails, will clean up everything except generation 2.

I think that within MetaDataStateFormat, you can only clean up files within the write method that you know you have written. In particular, you can't assume (given the manifest-based approach) that the highest generation you see here is the file to keep around, as the manifest might not be pointing to the one with the highest generation.

This clean-up logic will have to be handled at a higher level (GatewayMetaState).
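The hazard can be shown with a toy simulation in which "files" are just generation numbers and the flawed cleanup keeps only the highest generation on disk, even though the manifest may still point at a lower one (everything here is hypothetical, for illustration only):

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the data-loss scenario: cleanup-by-highest-generation deletes
// the very file the (unchanged) manifest still references.
public class CleanupHazardSketch {
    public static void main(String[] args) {
        List<Long> stateFiles = new ArrayList<>(List.of(1L)); // generation 1 committed
        long manifestPointsTo = 1L;   // manifest still references generation 1

        stateFiles.add(2L);           // generation 2 written, but the overall
                                      // multi-file write fails before the
                                      // manifest is updated

        // Flawed cleanup after the failed write of generation 3:
        // keep only the highest generation found on disk.
        stateFiles.removeIf(gen -> gen < 2L);

        // The file the manifest needs (generation 1) is now gone.
        System.out.println(!stateFiles.contains(manifestPointsTo)); // prints "true"
    }
}
```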

Contributor Author

@andrershov andrershov Nov 13, 2018


This is a really good catch, thank you! The reason I implemented it this way is that I DO know the generations of the global metadata and index metadata from the manifest file, but I DON'T know the previous generation of the manifest file, which I would need in order to clean up a newly created manifest state file if the manifest write fails. I would really like to avoid returning the manifest generation from loadFullState, because it already returns a Tuple, and adding a generation field to Manifest itself also feels wrong. So instead I've done the work in 2 commits:

  1. MetaDataStateFormat.write should not perform cleanup at all, writeAndCleanup should always 326be60
  2. Use writeManifestAndCleanup, reorder rollbackActions before write 982e853

From my point of view, this should work.
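One way to picture the resulting split between write and writeAndCleanup; the signatures below are illustrative only, not the actual MetaDataStateFormat API:

```java
// Sketch of the write/cleanup split: write() never deletes pre-existing
// files (even on failure), so deciding what to delete can move to a higher
// level; writeAndCleanup() is a convenience for callers that simply want to
// keep only the latest successfully written generation.
public abstract class StateFormatSketch<T> {
    // Writes a new generation of the state file and returns its generation.
    public abstract long write(T state) throws java.io.IOException;

    // Deletes state files with a generation lower than keepGeneration.
    public abstract void cleanupOldFiles(long keepGeneration);

    // Cleans up only after a fully successful write; on failure, nothing
    // is deleted and the previous generation remains intact.
    public final long writeAndCleanup(T state) throws java.io.IOException {
        long generation = write(state);
        cleanupOldFiles(generation);
        return generation;
    }
}
```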

Contributor

@ywelsch ywelsch left a comment


I've left a few more comments

ywelsch added a commit that referenced this pull request Nov 14, 2018
…s are stopped (#35494)

Refactors and simplifies the logic around stopping nodes, making sure that for a full cluster restart
onNodeStopped is only called after the nodes are actually all stopped (and in particular not while
starting up some nodes again). This change also ensures that a closed node client is not being used
anymore (which required a small change to a test).

Relates to #35049
@andrershov
Contributor Author

@ywelsch I've made the required fixes and added comments on the 2 remaining issues. It's now ready for the next pass.

Contributor

@ywelsch ywelsch left a comment


I have made one more change request and noted some nits.

@@ -224,14 +244,23 @@ public final void write(final T state, final Path... locations) throws WriteStat
copyStateToExtraLocations(directories, tmpFileName);
performRenames(tmpFileName, fileName, directories);
performStateDirectoriesFsync(directories);
} catch (WriteStateException e) {
if (cleanup) {
cleanupOldFiles(oldGenerationId, locations);
Contributor


This change is not really needed for this PR (where we only call write, not writeAndCleanup, from GatewayMetaState, with one exception, see below). I think we should leave this extra clean-up out of this PR, because it might introduce regressions in other parts of the system that also use MetaDataStateFormat.

There is one exception: writeManifestAndCleanup. We can change that one as well to only writeManifest and do the clean-up in GatewayMetaState in the same way as for the other files.

Contributor Author


Please see my comment above for why I've done it this way. In short, the reason I'm using writeManifestAndCleanup is that I don't know the manifest file generation: if the manifest write succeeds, I get the current generation from the method's return value; if it fails, I don't have it.
In general, I don't think there is a problem with other callers calling writeAndCleanup, because they don't use the concept of a manifest and they want to keep the previously written latest generation.
If you're really afraid of introducing a regression, I can suggest adding "oldGenerationId" to WriteStateException, so that when writing the manifest fails we can learn which file to delete. IMHO, this is not needed, and keeping only the previous generation of state files for all uses of writeAndCleanup is the right thing to do.

Contributor

@ywelsch ywelsch left a comment


LGTM

@andrershov
Contributor Author

run gradle build tests please

@andrershov
Contributor Author

andrershov commented Nov 16, 2018

@ywelsch thanks for your reviewing efforts!
