Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix IndexAuditTrail rolling restart on rollover edge #35988

Merged

Conversation

albertzaharovits
Copy link
Contributor

@albertzaharovits albertzaharovits commented Nov 28, 2018

This fixes two independent bugs , both tripping integ test failures.
They are both facilitated by the rolling nature of the audit index. Moreover, they
will both manifest only during a rolling upgrade executed while the audit index
rolls over.

The bugs:

  1. While rolling upgrading, the master has the responsibility to install the new index template. Non-master new nodes (upgraded before the master) should hold off from creating new indices, because of rolling over, until the new template is installed. Because of a backwards condition, onOrAfter, they will fail to do that and use the index template for the old nodes with the event data of the new ones.

  2. Also while rolling upgrading, non-master new nodes can wait indefinitely for a mapping update. The mapping update for the audit index is also the responsibility of the master node. However the master can fail to update the mapping for a foregoing index ( for which other new nodes might have events) and only upgrade it for the following index. Precisely on clock edge, a master with no events pipelined will only try to update the new index.

Closes #33867

@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-security

@@ -277,7 +277,7 @@ private boolean canStart(ClusterState clusterState) {
}

if (TemplateUtils.checkTemplateExistsAndVersionMatches(INDEX_TEMPLATE_NAME, SECURITY_VERSION_STRING,
clusterState, logger, Version.CURRENT::onOrAfter) == false) {
clusterState, logger, Version.CURRENT::onOrBefore) == false) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bug 1.
new nodes, with a greater version, will not be stopped to create a rolled over index using an older template.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, maybe I am misunderstanding but I don't follow the logic.

  1. template does not exist so we should not start - this makes sense
  2. template exists but is onOrBefore the current version of the node. On a new node, say 6.5.1, this means that if the version in the template is <= 6.5.1 that we start. Why would we start with a older template?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the scenario you're describing is just the scenario for onOrAfter:

public boolean onOrAfter(Version version) {
  return version.id <= id;
}

In this case id is 6.5.1 and version.id is the version of the template, because the predicate is Version.CURRENT::onOrAfter .

As another anchor, further down in the function:

Version.fromString(versionString).onOrAfter(Version.CURRENT)

is correct, because it has the template version on the left of CURRENT.

You don't want to know for how long I had been banging my head on it....

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gah! This is not easy to understand. For my sanity:

Version.CURRENT on new node = 6.5.1
template version = 6.5.0

Using onOrAfter would be:
6.5.0 <= 6.5.1 = true

Using onOrBefore would be:
6.5.0 >= 6.5.1 = false

@@ -381,7 +390,7 @@ void updateCurrentIndexMappingsIfNecessary(ClusterState state) {
IndexMetaData indexMetaData = indices.get(0);
MappingMetaData docMapping = indexMetaData.mapping("doc");
if (docMapping == null) {
if (indexToRemoteCluster || state.nodes().isLocalNodeElectedMaster()) {
if (indexToRemoteCluster || state.nodes().isLocalNodeElectedMaster() || hasStaleMessage()) {
putAuditIndexMappingsAndStart(index);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bug 2.
master will only update the next rolled over index, if there are no events (on master) for the old index. However, there could be events of new nodes for the old index, whose template will never be updated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather have the master issue a mapping update for the previous index or indices. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been entertaining this idea too. I've chosen this alternative because the code is much simpler. In the obverse, the previous index might not exist or the rollover setting might have changed during the restart, which I think is too complex to code in an already complex mayhem.
I think the surface for the dreaded cluster update storm is acceptable, as there is a very good chance the non-master will not be in sync during the repeated start retrials.

@albertzaharovits albertzaharovits merged commit 5eb7040 into elastic:master Nov 29, 2018
@albertzaharovits albertzaharovits deleted the fix_index_audit_upgrade_it branch November 29, 2018 14:12
@albertzaharovits
Copy link
Contributor Author

albertzaharovits commented Nov 29, 2018

If you're here because of the silly commit message I blame github for failing to keep the merge message on retrying a failed squash and merge. 🙏

albertzaharovits added a commit that referenced this pull request Nov 29, 2018
This fixes two independent bugs , both tripping integ test failures.
They are both facilitated by the rolling nature of the audit index.
Moreover, they will both manifest only during a rolling upgrade executed
while the audit index rolls over.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CI] IndexAuditUpgradeIT.testAuditLogs fails
4 participants