Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove NoopGatewayAllocator in favor of a more realistic mock #20637

Merged
merged 20 commits into from
Sep 25, 2016

Conversation

bleskes
Copy link
Contributor

@bleskes bleskes commented Sep 23, 2016

Many of our unit tests instantiate an AllocationService, which requires having a GatewayAllocator. Today almost all of our test use a class called NoopGatewayAllocator which does nothing, effectively leaving all shard assignments to the balanced allocator. This is sad as it means we test a system that behaves differently than our production logic in very basic things. For example, a started primary that is lost will be assigned to a node that didn't use to have it.

This PR removes NoopGatewayAllocator in favor of a new TestGatewayAllocator that inherits the standard GatewayAllocator and overrides shard information fetching to return information based on historical assignments the allocator has done. The only exception is BalanceConfigurationTests which does test only the balancer and I opted to not have it work around the GatewayAllocator being in it's way.

The PR also fixes all resulting failures plus one production failure for which I will open a dedicated PR (note - that fix is included here as well to make sure it works).

…RecoverySource

When initializing a new index routing table, we make a decision where the primary shards should be recovered from. This can be an empty folder for new indices, a set of specific allocation ids for old indices or a snapshot. We currently allow callers of  IndexRoutingTable.initializeEmpty to supply the source but also set it automatically if null is given. Sadly the current logic is reusing the supplied parameter to store the result of the automatic decision. This is flawed if some of the decision should be *different* between the different index shard (as the first decision that is maid sticks).

 This commit fixes this but also simplifies the API to always make an automatic decision
the recovery source bug prevented propper coverage
@bleskes bleskes added >enhancement >test Issues or PRs that are addressing/adding tests v6.0.0-alpha1 v5.1.1 v5.0.0 labels Sep 23, 2016
bleskes added a commit that referenced this pull request Sep 23, 2016
… RecoverySource (#20638)

When initializing a new index routing table, we make a decision where the primary shards should be recovered from. This can be an empty folder for new indices, a set of specific allocation ids for old indices or a snapshot. We currently allow callers of `IndexRoutingTable.initializeEmpty` to supply the source but also set it automatically if null is given. Sadly the current logic is reusing the supplied parameter to store the result of the automatic decision. This is flawed if some of the decision should be *different* between the different index shard (as the first decision that is maid sticks).

 This commit fixes this but also simplifies the API to always make an automatic decision.

This was discovered while working on #20637 which strengthens the testing infra and caused this to bubble up. I put it as a separate commit to make sure it is not lost as part of a bigger test only PR.
bleskes added a commit that referenced this pull request Sep 23, 2016
… RecoverySource (#20638)

When initializing a new index routing table, we make a decision where the primary shards should be recovered from. This can be an empty folder for new indices, a set of specific allocation ids for old indices or a snapshot. We currently allow callers of `IndexRoutingTable.initializeEmpty` to supply the source but also set it automatically if null is given. Sadly the current logic is reusing the supplied parameter to store the result of the automatic decision. This is flawed if some of the decision should be *different* between the different index shard (as the first decision that is maid sticks).

 This commit fixes this but also simplifies the API to always make an automatic decision.

This was discovered while working on #20637 which strengthens the testing infra and caused this to bubble up. I put it as a separate commit to make sure it is not lost as part of a bigger test only PR.
bleskes added a commit that referenced this pull request Sep 23, 2016
… RecoverySource (#20638)

When initializing a new index routing table, we make a decision where the primary shards should be recovered from. This can be an empty folder for new indices, a set of specific allocation ids for old indices or a snapshot. We currently allow callers of `IndexRoutingTable.initializeEmpty` to supply the source but also set it automatically if null is given. Sadly the current logic is reusing the supplied parameter to store the result of the automatic decision. This is flawed if some of the decision should be *different* between the different index shard (as the first decision that is maid sticks).

 This commit fixes this but also simplifies the API to always make an automatic decision.

This was discovered while working on #20637 which strengthens the testing infra and caused this to bubble up. I put it as a separate commit to make sure it is not lost as part of a bigger test only PR.
@bleskes
Copy link
Contributor Author

bleskes commented Sep 23, 2016

@ywelsch I've merged master into this so #20638 will be remove from it.

Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left minor comments / questions. It will be interesting t o combine this with IndicesClusterStateServiceRandomUpdatesTests by introducing random failures into the TestGatewayAllocator to see how the system reacts to random failures during the shard fetching ;-)

@@ -290,7 +290,7 @@ public String prettyPrint() {
for (int shard = 0; shard < indexMetaData.getNumberOfShards(); shard++) {
sb.append(TAB).append(TAB).append(shard).append(": ");
sb.append("p_term [").append(indexMetaData.primaryTerm(shard)).append("], ");
sb.append("a_ids ").append(indexMetaData.inSyncAllocationIds(shard)).append("\n");
sb.append("sa_ids ").append(indexMetaData.inSyncAllocationIds(shard)).append("\n");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is "sa" supposed to mean? Maybe insync_ids or isa_ids (inSyncAllocationIds).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

synced allocations, but I agree that isa_ids are better.

@@ -138,7 +138,8 @@ public UnassignedShardDecision makeAllocationDecision(final ShardRouting unassig

if (inSyncAllocationIds.isEmpty()) {
assert Version.indexCreated(indexMetaData.getSettings()).before(Version.V_5_0_0_alpha1) :
"trying to allocated a primary with an empty allocation id set, but index is new";
"trying to allocated a primary with an empty in sync allocation id set, but index is new. index: "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trying to allocate a primary ...

Copy link
Contributor Author

@bleskes bleskes Sep 23, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will change

if (primaryShard.initializing() &&
RecoverySource.isInitialRecovery(primaryShard.recoverySource().getType()) == false ) {
// simulate a primary was initialized based on aid
insyncAids.add(primaryShard.allocationId().getId());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be interesting to add this as assertion to IndexRoutingTable#validate after the line

if (shardRouting.active() && indexMetaData.inSyncAllocationIds(shardRouting.id()).contains(shardRouting.allocationId().getId()) == false) {

(essentially to validate whether this reverse engineering is correct)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good idea. added.

}

private ClusterState createInitialClusterState(AllocationService service, Settings settings) {
boolean shrinkIndex = randomBoolean();
RecoverySource.Type recoveryType = randomFrom(RecoverySource.Type.EMPTY_STORE,
RecoverySource.Type.LOCAL_SHARDS, RecoverySource.Type.SNAPSHOT);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you make RecoverySource.INITIAL_RECOVERY_TYPES public, you can use that directly here instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opted to do it like it is now for two reasons - it's more explicit so people reading can know what's going on and it offers an external validation of what's in INITIAL_RECOVERY_TYPES.

// with old version, we can't know if a shard was allocated before or not
allocation = routingAllocationWithOnePrimaryNoReplicas(yesAllocationDeciders(),
INDEX_REOPENED, Version.CURRENT);
//randomFrom(INDEX_CREATED, CLUSTER_RECOVERED, INDEX_REOPENED)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

???

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch. left over from debugging. I put it back into it's place.

*/
public class TestGatewayAllocator extends GatewayAllocator {

Map<String, Map<ShardId,ShardRouting>> knownAllocations = new HashMap<>();
Copy link
Contributor

@ywelsch ywelsch Sep 23, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ShardId, space ShardRouting ;-)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way this map is queried, it seems more efficient to make it a
Map<ShardId, Map<String /* node id */, ShardRouting>>

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you have been spending too much time with the TLA model :) . Of course, you are right but I opted to go for clarity and simplicity here. I think it's the right thing to optimize for here.


primaryShardAllocator.allocateUnassigned(allocation);
replicaShardAllocator.processExistingRecoveries(allocation);
replicaShardAllocator.allocateUnassigned(allocation);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replace this method by

@Override
public void allocateUnassigned(RoutingAllocation allocation) {
     updateNodes(allocation);
     super.allocateUnassigned(allocation);
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. done.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I now remember why I did it this way - I couldn't find a good way to get all the inner classes to work as they can not be constructed and passed around as a parameter. I took another approach with 13b5c5c

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the cleaner way would be to add a constructor

public GatewayAllocator(Settings settings, final TransportNodesListGatewayStartedShards startedAction, final TransportNodesListShardStoreMetaData storeAction, PrimaryShardAllocator primaryShardAllocator, ReplicaShardAllocator replicaShardAllocator) {
    ...
    }

to GatewayAllocator.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes,the problem is that all implementations of ReplicaShardAllocator & PrimaryShardAllocator are non-static inner classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, that's unfortunate :-/

public class TestGatewayAllocator extends GatewayAllocator {

Map<String, Map<ShardId,ShardRouting>> knownAllocations = new HashMap<>();
Map<String, DiscoveryNode> knownNodes = new HashMap<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's simpler just to store DiscoveryNodes directly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's a great idea.

}
};

public TestGatewayAllocator() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move the constructor up before the first method

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is the first method. the rest are fields. I personally don't care. let me know what you prefer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, I missed that. Ignore my comment.


/**
* A gateway allocator implementation that keeps an in memory list of started shard allocation
* that are used as replies to the, normally async, fetch data requests.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a few sentences how it behaves under nodes that disappear / reappear.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure.

abeyad pushed a commit to abeyad/elasticsearch that referenced this pull request Sep 23, 2016
GatewayAllocator#applyFailedShards to take both a RoutingAllocation
and a list of shards to apply. This allows better mock allocators
to be created as being done in elastic#20637.
abeyad pushed a commit that referenced this pull request Sep 23, 2016
Changes the API of GatewayAllocator#applyStartedShards and 
GatewayAllocator#applyFailedShards to take both a RoutingAllocation
and a list of shards to apply. This allows better mock allocators
to be created as being done in #20637.

Closes #20642
@abeyad
Copy link

abeyad commented Sep 23, 2016

@bleskes in #20626 and #20642, StartedRerouteAllocation and FailedRerouteAllocation were removed, the consequence to your PR being that GatewayAllocator#applyStartedShards now takes (RoutingAllocation, List<ShardRouting> startedShards) and GatewayAllocator#applyFailedShards now takes (RoutingAllocation, List<FailedShard> failedShards)

@bleskes
Copy link
Contributor Author

bleskes commented Sep 23, 2016

@abeyad yep saw it. Thanks for the extra ping.

abeyad pushed a commit that referenced this pull request Sep 23, 2016
Changes the API of GatewayAllocator#applyStartedShards and 
GatewayAllocator#applyFailedShards to take both a RoutingAllocation
and a list of shards to apply. This allows better mock allocators
to be created as being done in #20637.

Closes #20642
@bleskes
Copy link
Contributor Author

bleskes commented Sep 23, 2016

@ywelsch thanks for the review. I merged from master (to get all the changes in) and pushed some more commits as a fall out of the changes you suggested. Can you take another look?

Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @bleskes!

@bleskes
Copy link
Contributor Author

bleskes commented Sep 23, 2016

@ywelsch I'm holding on pushing this. The extra assertion in IndexRoutingTable#validate has surfaced another issue with shadow replicas that I want to talk about (around not reusing aid with reinitlizing a replica as a primary - see [1]). I'm leaning towards removing this assertion for now so we can get this in and follow up after discussion.

[1] look for the only caller of RoutingChangesObserver#startedPrimaryReinitialized

@ywelsch
Copy link
Contributor

ywelsch commented Sep 23, 2016

I'm good if you add && IndexMetaData.isIndexUsingShadowReplicas(indexMetaData.getSettings()) == false to the assertion, get this in and open an issue for the shadow replica case.

@bleskes bleskes merged commit ee76c1a into elastic:master Sep 25, 2016
@bleskes bleskes deleted the test_gateway_allocator branch September 25, 2016 18:15
bleskes added a commit that referenced this pull request Sep 25, 2016
Many of our unit tests instantiate an `AllocationService`, which requires having a `GatewayAllocator`. Today almost all of our test use a class called `NoopGatewayAllocator` which does nothing, effectively leaving all shard assignments to the balanced allocator. This is sad as it means we test a system that behaves differently than our production logic in very basic things. For example, a started primary that is lost will be assigned to a node that didn't use to have it.

This PR removes `NoopGatewayAllocator` in favor of a new `TestGatewayAllocator` that inherits the standard `GatewayAllocator` and overrides shard information fetching to return information based on historical assignments the allocator has done. The only exception is `BalanceConfigurationTests` which does test only the balancer and I opted to not have it work around the `GatewayAllocator` being in it's way.
bleskes added a commit that referenced this pull request Sep 25, 2016
Many of our unit tests instantiate an `AllocationService`, which requires having a `GatewayAllocator`. Today almost all of our test use a class called `NoopGatewayAllocator` which does nothing, effectively leaving all shard assignments to the balanced allocator. This is sad as it means we test a system that behaves differently than our production logic in very basic things. For example, a started primary that is lost will be assigned to a node that didn't use to have it.

This commit removes `NoopGatewayAllocator` in favor of a new `TestGatewayAllocator` that inherits the standard `GatewayAllocator` and overrides shard information fetching to return information based on historical assignments the allocator has done. The only exception is `BalanceConfigurationTests` which does test only the balancer and I opted to not have it work around the `GatewayAllocator` being in it's way.
jasontedor added a commit that referenced this pull request Sep 28, 2016
* master: (1199 commits)
  [DOCS] Remove non-valid link to mapping migration document
  Revert "Default `include_in_all` for numeric-like types to false"
  test: add a test with ipv6 address
  docs: clearify that both ip4 and ip6 addresses are supported
  Include complex settings in settings requests
  Add production warning for pre-release builds
  Clean up confusing error message on unhandled endpoint
  [TEST] Increase logging level in testDelayShards()
  change health from string to enum (#20661)
  Provide error message when plugin id is missing
  Document that sliced scroll works for reindex
  Make reindex-from-remote ignore unknown fields
  Remove NoopGatewayAllocator in favor of a more realistic mock (#20637)
  Remove Marvel character reference from guide
  Fix documentation for setting Java I/O temp dir
  Update client benchmarks to log4j2
  Changes the API of GatewayAllocator#applyStartedShards and (#20642)
  Removes FailedRerouteAllocation and StartedRerouteAllocation
  IndexRoutingTable.initializeEmpty shouldn't override supplied primary RecoverySource (#20638)
  Smoke tester: Adjust to latest changes (#20611)
  ...
@lcawl lcawl added :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. and removed :Allocation labels Feb 13, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. >test Issues or PRs that are addressing/adding tests v5.0.0-rc1 v6.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants