
[Zen2] Write manifest file #35049

Merged
merged 61 commits into from
Nov 19, 2018

Conversation

andrershov
Contributor

@andrershov andrershov commented Oct 29, 2018

An Elasticsearch node is responsible for storing cluster metadata.
There are two types of metadata: global metadata and index metadata.
GatewayMetaState implements ClusterStateApplier, receives all
ClusterStateChanged events, and is responsible for storing modified
metadata to disk.

When a new ClusterStateChanged event is received, GatewayMetaState
checks whether the global metadata has changed and, if so, writes the new
global metadata to disk. After that, GatewayMetaState checks whether index
metadata has changed or new indices have been assigned to this node and,
if so, writes the new index metadata to disk. Atomicity of individual global
metadata and index metadata writes is ensured by the MetaDataStateFormat
class.

Unfortunately, there is no atomicity when more than one piece of metadata
changes (global and index, or the metadata of two indices), and atomicity
is important for Zen2 correctness.
This commit adds atomicity by introducing the notion of a manifest file,
represented by the MetaState class. MetaState contains pointers to the
current metadata.
More precisely, it stores the global state generation as a long and a map from
Index to index metadata generation (also a long). Atomicity of writes for the
manifest file is ensured by the MetaStateFormat class.
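The manifest described above boils down to a small value object holding generation pointers. A minimal sketch, with illustrative names and shape rather than the actual MetaState implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the manifest idea: generation "pointers" to the
// current global metadata state file and to each index's metadata state file.
// The real MetaState class in the PR differs in detail.
public class ManifestSketch {
    private final long globalGeneration;              // generation of the global metadata state file
    private final Map<String, Long> indexGenerations; // index name -> index metadata generation

    public ManifestSketch(long globalGeneration, Map<String, Long> indexGenerations) {
        this.globalGeneration = globalGeneration;
        this.indexGenerations = new HashMap<>(indexGenerations);
    }

    public long globalGeneration() {
        return globalGeneration;
    }

    public Map<String, Long> indexGenerations() {
        return indexGenerations;
    }
}
```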

The algorithm for writing changes to disk is the following:

  1. Write the global metadata state file to disk and remember
    its generation.
  2. For each new/changed index, write a state file to disk and remember
    its generation. For each unchanged index, use the generation from the
    previous manifest file. If an index is removed or this node is no longer
    responsible for it, forget about the index.
  3. Create a MetaState object using the previously remembered generations and
    write it to disk.
  4. Remove the old state files for global metadata, index metadata and the
    manifest.
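The four steps above can be sketched as follows. StateWriter and every method name here are hypothetical stand-ins, not the real GatewayMetaState/MetaStateService API:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Illustrative sketch of the manifest-based write algorithm described above.
public class ManifestWriteSketch {
    // Hypothetical abstraction over the on-disk state-file machinery.
    interface StateWriter {
        long writeGlobalState(Object globalMetaData) throws IOException;
        long writeIndexState(String index, Object indexMetaData) throws IOException;
        void writeManifest(long globalGen, Map<String, Long> indexGens) throws IOException;
        void cleanupOldGenerations();
    }

    static void applyClusterStateChange(StateWriter writer,
                                        Object newGlobalMetaData,
                                        Map<String, Object> changedIndices,
                                        Map<String, Long> previousManifestGens,
                                        Set<String> indicesStillAssigned) throws IOException {
        // Step 1: write global metadata and remember its generation.
        long globalGen = writer.writeGlobalState(newGlobalMetaData);

        // Step 2: write state for each new/changed index; reuse generations
        // from the previous manifest for unchanged indices; forget indices
        // this node is no longer responsible for.
        Map<String, Long> indexGens = new HashMap<>();
        for (Map.Entry<String, Object> e : changedIndices.entrySet()) {
            indexGens.put(e.getKey(), writer.writeIndexState(e.getKey(), e.getValue()));
        }
        for (Map.Entry<String, Long> e : previousManifestGens.entrySet()) {
            if (indicesStillAssigned.contains(e.getKey()) && !indexGens.containsKey(e.getKey())) {
                indexGens.put(e.getKey(), e.getValue());
            }
        }

        // Step 3: atomically write the manifest pointing at the remembered
        // generations. Until this succeeds, the old manifest stays authoritative.
        writer.writeManifest(globalGen, indexGens);

        // Step 4: only after the manifest commit is it safe to delete old files.
        writer.cleanupOldGenerations();
    }
}
```

The ordering matters: the manifest write is the commit point, so cleanup must come last.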

Additionally, the new implementation relies on enhanced MetaDataStateFormat
failure semantics: applyClusterState throws IOException, whose
subclass WriteStateException can be (and, in Zen2, should be)
handled explicitly.
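A rough illustration of these failure semantics, assuming a WriteStateException-like subclass of IOException carrying a flag for whether on-disk state may be partially written (the class shape here is illustrative, not the exact production code):

```java
import java.io.IOException;

// Illustrative sketch of an explicit-failure exception for state writes.
public class WriteStateExceptionSketch extends IOException {
    // Hypothetical flag: true if the failure may have left on-disk state
    // partially written, so the caller cannot treat the write as a no-op.
    private final boolean dirty;

    public WriteStateExceptionSketch(boolean dirty, String message) {
        super(message);
        this.dirty = dirty;
    }

    public boolean isDirty() {
        return dirty;
    }
}
```

A Zen2 caller would catch this explicitly instead of letting a generic IOException propagate, and react differently depending on whether the write was cleanly rolled back.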

@andrershov andrershov added >enhancement :Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. labels Oct 29, 2018
@andrershov andrershov self-assigned this Oct 29, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@andrershov andrershov mentioned this pull request Oct 29, 2018
6 tasks
# Conflicts:
#	server/src/main/java/org/elasticsearch/gateway/GatewayMetaState.java
Contributor

@DaveCTurner DaveCTurner left a comment


I've done a first pass and left a handful of thoughts.

@andrershov
Contributor Author

@DaveCTurner I've fixed the issues in 421caa1 and added some minor fixes not mentioned in your comments. Ready for the second pass.

@andrershov andrershov reopened this Nov 7, 2018
@andrershov
Contributor Author

@ywelsch I've reopened the PR and made the necessary fixes, except for not writing non-upgraded index metadata on startup in non-BWC mode, which I'll implement tomorrow. Could you please review it?

Contributor

@ywelsch ywelsch left a comment


I've done an initial pass; I have not looked at the tests in detail yet.

@andrershov
Contributor Author

@ywelsch I was working on the tests/bugfixes during the night, and it's now ready for the third pass.
Regarding the atomicity tests, I've implemented them in two commits:

  1. Allow subclasses to override formats used in MetaStateService (to prepare MetaStateService for failure injection) a6dddc8
  2. Add random atomicity test for Transaction 7936cc4

@@ -224,14 +241,17 @@ public final void write(final T state, final Path... locations) throws WriteStat
copyStateToExtraLocations(directories, tmpFileName);
performRenames(tmpFileName, fileName, directories);
performStateDirectoriesFsync(directories);
} catch (WriteStateException e) {
cleanupOldFiles(oldGenerationId, locations);
Contributor


I think this is dangerous and can lead to data loss. Assume that you've successfully written a cluster state that contains an index "test" with a state file of generation 1. Then you try to write an updated cluster state for the index with state file generation 2. Writing the state file for the index is successful, but there's a failure later when writing the state file for another index. Now the node crashes or the clean-up logic fails. When you then handle the next cluster state, you will try to write generation 3, which, if it fails, will clean up everything except generation 2.

I think that within MetaDataStateFormat, you can only clean up files within the write method that you know you have written. In particular, you can't assume (given the manifest-based approach) that the highest generation you see here is the file to keep around, as the manifest might not be pointing to the one with the highest generation.

This clean-up logic will have to be handled at a higher level (GatewayMetaState).
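The hazard can be shown with a toy simulation in which "files" are just generation numbers and the flawed cleanup keeps only the highest generation on disk, even though the manifest may still point at a lower one (everything here is hypothetical, for illustration only):

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the data-loss scenario: cleanup-by-highest-generation deletes
// the very file the (unchanged) manifest still references.
public class CleanupHazardSketch {
    public static void main(String[] args) {
        List<Long> stateFiles = new ArrayList<>(List.of(1L)); // generation 1 committed
        long manifestPointsTo = 1L;   // manifest still references generation 1

        stateFiles.add(2L);           // generation 2 written, but the overall
                                      // multi-file write fails before the
                                      // manifest is updated

        // Flawed cleanup after the failed write of generation 3:
        // keep only the highest generation found on disk.
        stateFiles.removeIf(gen -> gen < 2L);

        // The file the manifest needs (generation 1) is now gone.
        System.out.println(!stateFiles.contains(manifestPointsTo)); // prints "true"
    }
}
```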

Contributor Author

@andrershov andrershov Nov 13, 2018


This is a really good catch, thank you! The reason I implemented it this way is that I DO know the generations of the global metadata and index metadata from the manifest file, but I DON'T know the previous generation of the manifest file, which I would need in order to clean up a newly created manifest state file if the manifest write fails. I would really like to avoid returning the manifest generation from loadFullState, because it already returns a Tuple, and adding a generation field to Manifest itself also feels wrong. So instead I've done the work in 2 commits:

  1. MetaDataStateFormat.write should not perform cleanup at all, writeAndCleanup should always 326be60
  2. Use writeManifestAndCleanup, reorder rollbackActions before write 982e853

From my point of view, this should work.
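One way to picture the resulting split between write and writeAndCleanup; the signatures below are illustrative only, not the actual MetaDataStateFormat API:

```java
// Sketch of the write/cleanup split: write() never deletes pre-existing
// files (even on failure), so deciding what to delete can move to a higher
// level; writeAndCleanup() is a convenience for callers that simply want to
// keep only the latest successfully written generation.
public abstract class StateFormatSketch<T> {
    // Writes a new generation of the state file and returns its generation.
    public abstract long write(T state) throws java.io.IOException;

    // Deletes state files with a generation lower than keepGeneration.
    public abstract void cleanupOldFiles(long keepGeneration);

    // Cleans up only after a fully successful write; on failure, nothing
    // is deleted and the previous generation remains intact.
    public final long writeAndCleanup(T state) throws java.io.IOException {
        long generation = write(state);
        cleanupOldFiles(generation);
        return generation;
    }
}
```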

Contributor

@ywelsch ywelsch left a comment


I've left a few more comments

ywelsch added a commit that referenced this pull request Nov 14, 2018
…s are stopped (#35494)

Refactors and simplifies the logic around stopping nodes, making sure that for a full cluster restart
onNodeStopped is only called after the nodes are actually all stopped (and in particular not while
starting up some nodes again). This change also ensures that a closed node client is not being used
anymore (which required a small change to a test).

Relates to #35049
@andrershov
Contributor Author

@ywelsch I've made the required fixes and added comments on the 2 remaining issues. It's now ready for the next pass.

Contributor

@ywelsch ywelsch left a comment


I have made one more change request and noted some nits.

@@ -224,14 +244,23 @@ public final void write(final T state, final Path... locations) throws WriteStat
copyStateToExtraLocations(directories, tmpFileName);
performRenames(tmpFileName, fileName, directories);
performStateDirectoriesFsync(directories);
} catch (WriteStateException e) {
if (cleanup) {
cleanupOldFiles(oldGenerationId, locations);
Contributor


This change is not really needed for this PR (where we only call write, not writeAndCleanup, from GatewayMetaState, with one exception, see below). I think we should leave this extra clean-up out of this PR, because it might introduce regressions in other parts of the system that also use MetaDataStateFormat.

There is one exception: writeManifestAndCleanup. We can change that one as well to only writeManifest and do the clean-up in GatewayMetaState in the same way as for the other files.

Contributor Author


Please see my comment above for why I've done it this way. In short, the reason I'm using writeManifestAndCleanup is that I don't know the manifest file generation: if the manifest write succeeds, I get the current generation from the method's return value; if it fails, I don't have it.
In general, I don't think there is a problem with other callers calling writeAndCleanup, because they don't use the concept of a manifest and they want to keep the previously written latest generation.
If you're really afraid of introducing a regression, I can suggest adding "oldGenerationId" to WriteStateException, so that when writing the manifest fails we can learn which file to delete. IMHO, this is not needed, and keeping only the previous generation of state files for all uses of writeAndCleanup is the right thing to do.

Contributor

@ywelsch ywelsch left a comment


LGTM

@andrershov
Contributor Author

run gradle build tests please

@andrershov
Contributor Author

andrershov commented Nov 16, 2018

@ywelsch thanks for your reviewing efforts!
