Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Archive unknown or invalid settings on updates #28888

Merged

Conversation

jasontedor
Copy link
Member

Today we can end up in a situation where the cluster state contains unknown or invalid settings. This can happen easily during a rolling upgrade. For example, consider two nodes that are on a version that considers the setting foo.bar to be known and valid. Assume one of these nodes is restarted on a higher version that considers foo.bar to now be either unknown or invalid, and then the second node is restarted too. Now, both nodes will be on a version that consider foo.bar to be unknown or invalid yet this setting will still be contained in the cluster state. This means that if a cluster settings update is applied and we validate the settings update with the existing settings then validation will fail. In such a state, the offending setting can not even be removed. This commit helps out with this situation by archiving any settings that are unknown or invalid at the time that a settings update is applied. This allows the setting update to go through, and the archived settings can be removed at a later time.

Relates #28609

Today we can end up in a situation where the cluster state contains
unknown or invalid settings. This can happen easily during a rolling
upgrade. For example, consider two nodes that are on a version that
considers the setting foo.bar to be known and valid. Assume one of these
nodes is restarted on a higher version that considers foo.bar to now be
either unknown or invalid, and then the second node is restarted
too. Now, both nodes will be on a version that consider foo.bar to be
unknown or invalid yet this setting will still be contained in the
cluster state. This means that if a cluster settings update is applied
and we validate the settings update with the existing settings then
validation will fail. In such a state, the offending setting can not
even be removed. This commit helps out with this situation by archiving
any settings that are unknown or invalid at the time that a settings
update is applied. This allows the setting update to go through, and the
archived settings can be removed at a later time.
Copy link
Contributor

@s1monw s1monw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some minors. newlines are out of control here.

@@ -48,15 +54,35 @@ synchronized Settings getPersistentUpdate() {
return persistentUpdates.build();
}

synchronized ClusterState updateSettings(final ClusterState currentState, Settings transientToApply, Settings persistentToApply) {
synchronized ClusterState updateSettings(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get your newlines under control my friend

* - validate the incoming settings update combined with the existing known and valid settings
* - merge in the archived unknown or invalid settings
*/
final Tuple<Settings, Settings> partitionedTransientSettings =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we have 2 methods Settings getValidSettings(Settings) and Settings getInvalidSetting(Settings) that would make it simpler to see what is valid and not instead of a tuple?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed a change. I kept the tuple but clarified its usage. I hope this helps.

final Map.Entry<String, String> e,
final IllegalArgumentException ex,
final Logger logger) {
logger.warn(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

man newlines?!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about painting it yellow. 😛

@jasontedor jasontedor requested a review from s1monw March 6, 2018 17:02
@jasontedor
Copy link
Member Author

@s1monw I pushed some changes addressing your feedback; would you take another look?

@jasontedor
Copy link
Member Author

test this please

* master:
  [TEST] AwaitsFix QueryRescorerIT.testRescoreAfterCollapse
  Decouple XContentType from StreamInput/Output (elastic#28927)
  Remove BytesRef usage from XContentParser and its subclasses (elastic#28792)
  [DOCS] Correct typo in configuration (elastic#28903)
  Fix incorrect datemath example (elastic#28904)
  Add a usage example of the JLH score (elastic#28905)
  Wrap stream passed to createParser in try-with-resources (elastic#28897)
  Rescore collapsed documents (elastic#28521)
  Fix (simple)_query_string to ignore removed terms (elastic#28871)
  [Docs] Fix typo in composite aggregation (elastic#28891)
  Try if tombstone is eligable for pruning before locking on it's key (elastic#28767)
(e, ex) -> logInvalidSetting(settingsType, e, ex, logger));
return Tuple.tuple(
Settings.builder()
.put(settingsWithUnknownOrInvalidArchived.filter(k -> k.startsWith(ARCHIVED_SETTINGS_PREFIX) == false))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this distinction a bit confusing - why do we keep the "old" invalid settings but remove the new ones? In the next iteration will just be including the newly archived setting in the previous ones. I'm inclined to say - either remove all archived setting (both new and old) when validating or keep them all. WDYT?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bleskes Consider a settings update to remove archived persistent settings:

{
  "persistent": {
    "archived.*": null
  }
}

If there are existing persistent archived settings in the cluster state, these should be removed by this update. The strategy that we use to handle unknown or invalid settings is to add them at the end with the archived prefix. If we do the same for existing archived settings, the application of the above settings update will not remove them and they will remain in the cluster state. Does that help?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I understand you to say that you want the invalid settings not to be removed by the incoming request you gave. If so, nothing we do here will be "clean" (for example, after you called that command there may be archived settings in the cluster state, which is not what people expect). Not sure it's worth all the complexity. If you prefer it this way (I presume you do), I'm fine with keeping as is but please write an explicit test to demonstrate this behavior.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for example, after you called that command there may be archived settings in the cluster state, which is not what people expect

I disagree with this. If I list my cluster settings and I see archived.foo, archived.bar, and invalid.setting, I do not expect PUT archived.*: null (lazy) to remove invalid.setting. Hence, the choice that I made here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed a test in 3298664.

final Settings toApply = Settings.builder().put("dynamic.setting", "value").build();
final boolean applyTransient = randomBoolean();
final ClusterState clusterStateAfterUpdate;
if (applyTransient) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took me quite a while to understand what's going on - on the one hand the persistent vs transient question is handled by parameters to this method (please rename the function parameter if we keep it ;)) and on the other hand we have a boolean here that deals with it directly which may be inconsistent. On top of it all, the rest API and the transport layer allows changing both at once - something we don't test. I think the functional programming got a bit out of hand here, probably due to multiple iterations. Can we have a straight forward test that works with both persistent and transient (randomly combining them)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed d2bc7fd.

(e, ex) -> logInvalidSetting(settingsType, e, ex, logger));
return Tuple.tuple(
Settings.builder()
.put(settingsWithUnknownOrInvalidArchived.filter(k -> k.startsWith(ARCHIVED_SETTINGS_PREFIX) == false))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I understand you to say that you want the invalid settings not to be removed by the incoming request you gave. If so, nothing we do here will be "clean" (for example, after you called that command there may be archived settings in the cluster state, which is not what people expect). Not sure it's worth all the complexity. If you prefer it this way (I presume you do), I'm fine with keeping as is but please write an explicit test to demonstrate this behavior.

* master: (28 commits)
  Maybe die before failing engine (elastic#28973)
  Remove special handling for _all in nodes info
  Remove Booleans use from XContent and ToXContent (elastic#28768)
  Update Gradle Testing Docs (elastic#28970)
  Make primary-replica resync failures less lenient (elastic#28534)
  Remove temporary file 10_basic.yml~
  Use different pipeline id in test. (pipelines do not get removed between tests extending from ESIntegTestCase)
  Use fixture to test the repository-gcs plugin (elastic#28788)
  Use String.join() to describe a list of tasks (elastic#28941)
  Fixed incorrect test try-catch statement
  Plugins: Consolidate plugin and module loading code (elastic#28815)
  percolator: Take `matchAllDocs` and `verified` of the sub result into account when analyzing a function_score query.
  Build: Remove rest tests on archive distribution projects (elastic#28952)
  Remove FastStringReader in favor of vanilla StringReader (elastic#28944)
  Remove FastCharArrayReader and FastCharArrayWriter (elastic#28951)
  Continue registering pipelines after one pipeline parse failure. (elastic#28752)
  Build: Fix ability to ignore when no tests are run (elastic#28930)
  [rest-api-spec] update doc link for /_rank_eval
  Switch XContentBuilder from BytesStreamOutput to ByteArrayOutputStream (elastic#28945)
  Factor UnknownNamedObjectException into its own class (elastic#28931)
  ...
Copy link
Contributor

@bleskes bleskes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx @jasontedor for the extra iteration. I left optional comments.

for (final Setting<String> invalidSetting : invalidSettings) {
if (existingPersistentSettings.keys().contains(invalidSetting.getKey())) {
assertThat(
clusterStateAfterUpdate.metaData().persistentSettings().keySet(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check that it doesn't exist in the non archived form? alternatively check the total count of settings?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed e85ad01.

final Settings.Builder transientToApply = Settings.builder();
for (final Setting<String> dynamicSetting : dynamicSettings) {
if (randomBoolean()) {
persistentToApply.put(dynamicSetting.getKey(), "value");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we skip some and see they don't change?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed e85ad01.

@jasontedor jasontedor merged commit 4dc3ada into elastic:master Mar 13, 2018
jasontedor added a commit that referenced this pull request Mar 13, 2018
Today we can end up in a situation where the cluster state contains
unknown or invalid settings. This can happen easily during a rolling
upgrade. For example, consider two nodes that are on a version that
considers the setting foo.bar to be known and valid. Assume one of these
nodes is restarted on a higher version that considers foo.bar to now be
either unknown or invalid, and then the second node is restarted
too. Now, both nodes will be on a version that consider foo.bar to be
unknown or invalid yet this setting will still be contained in the
cluster state. This means that if a cluster settings update is applied
and we validate the settings update with the existing settings then
validation will fail. In such a state, the offending setting can not
even be removed. This commit helps out with this situation by archiving
any settings that are unknown or invalid at the time that a settings
update is applied. This allows the setting update to go through, and the
archived settings can be removed at a later time.
@jasontedor jasontedor removed the v6.1.4 label Mar 13, 2018
@jasontedor
Copy link
Member Author

Thanks @bleskes and @s1monw.

jasontedor added a commit that referenced this pull request Mar 13, 2018
Today we can end up in a situation where the cluster state contains
unknown or invalid settings. This can happen easily during a rolling
upgrade. For example, consider two nodes that are on a version that
considers the setting foo.bar to be known and valid. Assume one of these
nodes is restarted on a higher version that considers foo.bar to now be
either unknown or invalid, and then the second node is restarted
too. Now, both nodes will be on a version that consider foo.bar to be
unknown or invalid yet this setting will still be contained in the
cluster state. This means that if a cluster settings update is applied
and we validate the settings update with the existing settings then
validation will fail. In such a state, the offending setting can not
even be removed. This commit helps out with this situation by archiving
any settings that are unknown or invalid at the time that a settings
update is applied. This allows the setting update to go through, and the
archived settings can be removed at a later time.
@jasontedor jasontedor deleted the unknown-or-invalid-settings-updates branch March 13, 2018 21:36
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Mar 13, 2018
* master:
  Add search slowlog level to docs (elastic#29040)
  Add docs for error file configuration (elastic#29032)
  Archive unknown or invalid settings on updates (elastic#28888)
@jasontedor jasontedor added v6.2.4 and removed v6.2.3 labels Mar 18, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants