Commits on Jul 26, 2015
  1. No need to find replica copy when index is created

    There is no need to try to fetch replica copies for best allocation when the index is created
    kimchy committed Jul 24, 2015
Commits on Jul 24, 2015
  1. Cancel replica recovery when a sync id matched copy is found on another node

    When a replica is initializing from the primary and we find a better node that has a full sync id match, it is better to cancel the existing replica allocation and (eventually) allocate it to the node with the sync id match
    kimchy committed Jul 23, 2015
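A minimal sketch of the cancel-and-reallocate decision described in the commit above. All names here (SyncedCopySketch, NodeCopy, betterCopy) are illustrative, not the actual Elasticsearch classes; the idea is simply: if the node the replica currently initializes on has no full sync id match but another node does, cancel the recovery and reallocate there.

    import java.util.Arrays;
    import java.util.List;

    // Hypothetical sketch of the cancel-and-reallocate decision; names are
    // illustrative, not the actual Elasticsearch ones.
    final class SyncedCopySketch {

        static final class NodeCopy {
            final String nodeId;
            final boolean fullSyncIdMatch;

            NodeCopy(String nodeId, boolean fullSyncIdMatch) {
                this.nodeId = nodeId;
                this.fullSyncIdMatch = fullSyncIdMatch;
            }
        }

        /**
         * Returns the node to move the initializing replica to, or null if the
         * current allocation should be kept.
         */
        static String betterCopy(String currentNodeId, List<NodeCopy> copies) {
            for (NodeCopy copy : copies) {
                if (copy.nodeId.equals(currentNodeId) && copy.fullSyncIdMatch) {
                    return null; // current target already matches, keep it
                }
            }
            for (NodeCopy copy : copies) {
                if (copy.fullSyncIdMatch) {
                    return copy.nodeId; // cancel the recovery and reallocate here
                }
            }
            return null; // no better copy, keep the ongoing recovery
        }

        public static void main(String[] args) {
            List<NodeCopy> copies = Arrays.asList(
                new NodeCopy("node_a", false),  // where the replica currently initializes
                new NodeCopy("node_b", true));  // holds a fully sync id matched copy
            System.out.println(betterCopy("node_a", copies)); // prints node_b
        }
    }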
Commits on Jul 23, 2015
  1. Simplify Replica Allocator

    Simplify the codebase of replica allocator and add more unit tests for it
    kimchy committed Jul 22, 2015
Commits on Jul 21, 2015
  1. Replica allocator unit tests

    First batch of unit tests to verify the behavior of replica allocator
    kimchy committed Jul 21, 2015
  2. Replace primaryPostAllocated flag and use UnassignedInfo

    There is no need to maintain additional state on the index routing table as to whether a primary was allocated after the index was created through the API; we already hold all this information in the UnassignedInfo class.
    closes #12374
    kimchy committed Jul 21, 2015
Commits on Jul 20, 2015
  1. Simplify handling of ignored unassigned shards

    Fold ignored unassigned shards into UnassignedShards for simpler handling of them. Also remove the error-prone way of adding ignored unassigned shards directly to the list, and add dedicated methods for it (a sketch follows this list).
    
    This change also removes the useless moving of unassigned shards to the end, since first, we sort those unassigned shards anyhow, and second, we now have persistent "store exceptions" that should not cause "dead letter" shard allocation.
    kimchy committed Jul 20, 2015
  2. Initial Refactor Gateway Allocator

    Break it into more manageable code by separating the allocation of primaries from the allocation of replicas. Start adding basic unit tests for the primary shard allocator.
    kimchy committed Jul 19, 2015
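To illustrate the dedicated-method handling mentioned in the "Simplify handling of ignored unassigned shards" commit above, here is a minimal sketch; UnassignedShardsSketch and its methods are hypothetical names, not the real API.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    // Hypothetical sketch: a single holder that owns both the active unassigned
    // shards and the ones ignored for this allocation round, with dedicated
    // methods instead of callers mutating lists directly.
    final class UnassignedShardsSketch {
        private final List<String> unassigned = new ArrayList<>();
        private final List<String> ignored = new ArrayList<>();

        void add(String shardId) {
            unassigned.add(shardId);
        }

        /** Take a shard out of this round's allocation without losing track of it. */
        void ignoreShard(String shardId) {
            if (unassigned.remove(shardId)) {
                ignored.add(shardId);
            }
        }

        /** Hand the ignored shards back so the next round can consider them again. */
        List<String> drainIgnored() {
            List<String> copy = new ArrayList<>(ignored);
            ignored.clear();
            return copy;
        }

        List<String> unassigned() {
            return Collections.unmodifiableList(unassigned);
        }
    }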
Commits on Jul 15, 2015
  1. Unique allocation id

    Add a unique allocation id for a shard, helping to uniquely identify a specific allocation taking place to a node.
    A special case is relocation, where a transient relocationId is kept around to make sure the target initializing shard (when using RoutingNodes) uses it for its id; when relocation is done, the transient relocationId becomes the shard's actual id (a sketch follows this list).
    closes #12242
    kimchy committed Jul 14, 2015
  2. Carry over shard failure exception to master node

    Don't lose the shard failure exception when sending a shard failure to the master node
    kimchy committed Jul 15, 2015
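A rough sketch of the allocation id lifecycle from the "Unique allocation id" commit above, assuming a simplified model; the class and method names are illustrative, not the actual implementation. The source mints a transient relocation id when relocation starts, the initializing target carries it, and it becomes the target's real id when relocation completes.

    import java.util.UUID;

    // Hypothetical sketch of the allocation id / relocation id handoff.
    final class AllocationIdSketch {
        final String id;           // unique per allocation of a shard to a node
        final String relocationId; // only set while a relocation is in flight

        private AllocationIdSketch(String id, String relocationId) {
            this.id = id;
            this.relocationId = relocationId;
        }

        /** A freshly assigned shard gets a brand new allocation id. */
        static AllocationIdSketch newAllocation() {
            return new AllocationIdSketch(UUID.randomUUID().toString(), null);
        }

        /** Source side: start relocating and mint the id the target will carry. */
        AllocationIdSketch startRelocation() {
            return new AllocationIdSketch(id, UUID.randomUUID().toString());
        }

        /** The initializing target is identified by the transient relocation id. */
        AllocationIdSketch targetOfRelocation() {
            return new AllocationIdSketch(relocationId, id);
        }

        /** Relocation finished: the transient id becomes the target's actual id. */
        AllocationIdSketch finishRelocationOnTarget() {
            return new AllocationIdSketch(id, null);
        }
    }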
Commits on Jul 14, 2015
  1. Simplify assignToNode to only do initializing

    The method really should only do the move from unassigned to initializing; all the other moves have explicit methods like relocate
    kimchy committed Jul 14, 2015
  2. Default delayed allocation timeout to 1m from 0

    Change the default delayed allocation timeout from 0 (no delayed allocation) to 1m. The value came from a test of having a node with 50 shards being indexed into (so a beefy translog requiring a flush on shutdown), then shutting it down, starting it back up, and waiting for it to join the cluster. This took, on a slow machine, about 30s.
    The value is conservatively low and does not try to address a virtual machine / OS restart for now, in order not to have the effect of a node going away and users being concerned that shards are not being allocated to the rest of the cluster as a result. The setting can always be changed in order to increase the delayed allocation if needed.
    closes #12166
    kimchy committed Jul 9, 2015
Commits on Jul 9, 2015
  1. Merge pull request #12147 from kimchy/remove_double_elect

    Remove double call to elect primaries
    kimchy committed Jul 9, 2015
  2. Remove double call to elect primaries

    There is no need to call the elect logic twice; we used to need it, but no longer, since we now handle dangling replicas for unassigned primaries properly
    kimchy committed Jul 9, 2015
Commits on Jul 8, 2015
  1. Consolidate ShardRouting construction

    Simplify and consolidate ShardRouting construction. Make sure that there is really only one place it gets created, when a shard is first created in unassigned state, and from there on, it is either copy constructed or built internally as a target for relocation.
    This change helps make sure that data carried by a ShardRouting within our codebase is not lost as the shard goes through transitions, and can help simplify the addition of more data on it (like a uuid).
    For testing, a centralized TestShardRouting allows creating testable versions of ShardRouting that do not need to be as strict as the non-test codebase. This can be cleaned up more later on, but it is a good start.
    closes #12125
    kimchy committed Jul 8, 2015
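A sketch of the single-construction-point idea from the commit above, under simplified assumptions; ShardRoutingSketch and TestShardRoutingSketch are hypothetical stand-ins, not the real ShardRouting or TestShardRouting classes.

    // Hypothetical sketch of funnelling routing construction through a single
    // entry point; names and states are illustrative.
    final class ShardRoutingSketch {
        enum State { UNASSIGNED, INITIALIZING, STARTED }

        final String shardId;
        final String nodeId; // null while unassigned
        final State state;

        private ShardRoutingSketch(String shardId, String nodeId, State state) {
            this.shardId = shardId;
            this.nodeId = nodeId;
            this.state = state;
        }

        /** The only public way to create a routing: every shard starts unassigned. */
        static ShardRoutingSketch newUnassigned(String shardId) {
            return new ShardRoutingSketch(shardId, null, State.UNASSIGNED);
        }

        /** Every other state is reached by deriving from the previous routing. */
        ShardRoutingSketch initialize(String nodeId) {
            if (state != State.UNASSIGNED) {
                throw new IllegalStateException("only an unassigned shard can initialize");
            }
            return new ShardRoutingSketch(shardId, nodeId, State.INITIALIZING);
        }

        ShardRoutingSketch moveToStarted() {
            if (state != State.INITIALIZING) {
                throw new IllegalStateException("only an initializing shard can start");
            }
            return new ShardRoutingSketch(shardId, nodeId, State.STARTED);
        }
    }

    // Test-only helper in the spirit of TestShardRouting: builds a routing in a
    // given state without each test walking the full lifecycle by hand.
    final class TestShardRoutingSketch {
        static ShardRoutingSketch started(String shardId, String nodeId) {
            return ShardRoutingSketch.newUnassigned(shardId).initialize(nodeId).moveToStarted();
        }
    }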
Commits on Jun 23, 2015
  1. Remove scheduled routing

    Today, we have a scheduled reroute that kicks in every 10 seconds and checks if a
    reroute is needed. We use it when adding nodes, since we don't reroute right
    away once a node is added, and it gives a time window for additional nodes to join.
    
    We do have the recover-after-nodes setting and such in order to wait for enough
    nodes to be added, and also, it really depends where in the 10s window
    you end up; sometimes it might not be effective at all. In general, it is a
    historical leftover from the times before we had recover-after-nodes and such.
    
    This change removes the 10s scheduling, simplifies RoutingService, and adds
    an explicit reroute when a node is added to the system. It also adds unit tests
    to RoutingService (a sketch of the node-join reroute follows this list).
    
    closes #11776
    kimchy committed Jun 18, 2015
  2. Set randomized node/index settings in the right place

    Don't set node settings in the index template, and try to set fewer index settings in the node settings
    closes #11767
    kimchy committed Jun 18, 2015
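A minimal sketch of reacting to node joins with an explicit reroute, as described in the "Remove scheduled routing" commit above; RerouteOnNodeJoinSketch and RerouteService are hypothetical names, and the real logic lives in RoutingService driven by cluster state changes.

    import java.util.Set;

    // Hypothetical sketch: react to node joins instead of a 10s polling schedule.
    final class RerouteOnNodeJoinSketch {

        interface RerouteService {
            void reroute(String reason);
        }

        private final RerouteService rerouteService;
        private Set<String> knownNodes = Set.of();

        RerouteOnNodeJoinSketch(RerouteService rerouteService) {
            this.rerouteService = rerouteService;
        }

        /** Called whenever a new cluster state is applied. */
        void onClusterStateChanged(Set<String> nodesInNewState) {
            boolean nodesAdded = !knownNodes.containsAll(nodesInNewState);
            knownNodes = Set.copyOf(nodesInNewState);
            if (nodesAdded) {
                // a single explicit reroute replaces the old 10s scheduled check
                rerouteService.reroute("node joined");
            }
        }

        public static void main(String[] args) {
            RerouteOnNodeJoinSketch svc =
                new RerouteOnNodeJoinSketch(reason -> System.out.println("reroute: " + reason));
            svc.onClusterStateChanged(Set.of("node_1"));            // prints reroute: node joined
            svc.onClusterStateChanged(Set.of("node_1"));            // no-op
            svc.onClusterStateChanged(Set.of("node_1", "node_2"));  // prints reroute: node joined
        }
    }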
Commits on Jun 22, 2015
  1. Remove reroute with no reassign

    It's not used in our codebase anymore, so there is no need for it
    closes #11804
    kimchy committed Jun 22, 2015
Commits on Jun 18, 2015
  1. [TEST] Use the correct renamed setting

    and make the default value setting private
    kimchy committed Jun 18, 2015
  2. [TEST] assertBusy on hasUnassigned

    On fast machines, a node leaving might not move shards to unassigned right away; wait for it
    kimchy committed Jun 18, 2015
  3. Reset registeredNextDelaySetting on reroute

    Need to reset the registered setting in order to make sure the next round will capture the right delay interval
    
    Also randomize the setting and name the setting properly
    
    closes #11759
    kimchy committed Jun 18, 2015
  4. Optional Delayed Allocation on Node leave

    Allow setting a delayed allocation timeout on shards that become unassigned when a node leaves the cluster. This allows waiting for the node to come back for a specific period, in order to try to assign the shards back to it and reduce shard movements and unnecessary relocations.
    
    The setting is an index level setting under `index.unassigned.node_left.delayed_timeout` and defaults to 0 (== no delayed allocation). We might want to change the default, but let's do it in a different change to come up with the best value for it. The setting can be updated dynamically.
    
    When shards are delayed, a log message with "info" level will notify how many shards are being delayed.
    
    An implementation note: we really only need to care about delaying allocation for unassigned replica shards. If the primary shard is unassigned, we are going to wait for a copy of it anyhow, so really the only case where delaying allocation matters is for replicas (a sketch follows this list).
    
    close #11712
    kimchy committed Jun 17, 2015
  5. remove 1.7 version check

    kimchy committed Jun 18, 2015
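A minimal sketch of the delayed-allocation check described in the "Optional Delayed Allocation on Node leave" commit above, assuming a simplified model; the names and the Reason enum are illustrative, not the actual implementation. In this model an unassigned primary never waits, matching the implementation note above.

    import java.util.concurrent.TimeUnit;

    // Hypothetical sketch of the check behind the
    // index.unassigned.node_left.delayed_timeout setting described above.
    final class DelayedAllocationSketch {

        enum Reason { INDEX_CREATED, NODE_LEFT, ALLOCATION_FAILED }

        /**
         * A replica that became unassigned because its node left stays delayed
         * until the per-index timeout expires; primaries are never delayed.
         */
        static boolean allocationDelayed(boolean primary,
                                         Reason reason,
                                         long unassignedTimeMillis,
                                         long nowMillis,
                                         long delayedTimeoutMillis) {
            if (primary || reason != Reason.NODE_LEFT || delayedTimeoutMillis <= 0) {
                return false;
            }
            return nowMillis - unassignedTimeMillis < delayedTimeoutMillis;
        }

        public static void main(String[] args) {
            long timeout = TimeUnit.MINUTES.toMillis(1); // e.g. delayed_timeout: 1m
            long becameUnassigned = System.currentTimeMillis();
            System.out.println(allocationDelayed(false, Reason.NODE_LEFT,
                becameUnassigned, becameUnassigned + 30_000, timeout)); // true, still waiting
            System.out.println(allocationDelayed(true, Reason.NODE_LEFT,
                becameUnassigned, becameUnassigned + 30_000, timeout)); // false, primaries never delayed
        }
    }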
Commits on Jun 16, 2015
  1. Add Unassigned meta data

    Unassigned meta includes additional information as to why a shard is unassigned; this is especially handy when a shard moves to unassigned due to a node leaving or a shard failure.
    
    The additional data is provided as part of the cluster state, and as part of `_cat/shards` API.
    
    The additional meta includes the timestamp at which the shard moved to unassigned, allowing us in the future to build functionality such as delaying allocation due to a node leaving until a copy of the shard is found.
    closes #11653
    kimchy committed Jun 15, 2015
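A rough sketch of the unassigned metadata described above, under simplified assumptions; UnassignedInfoSketch and its fields are illustrative stand-ins for the real UnassignedInfo.

    import java.time.Instant;

    // Hypothetical sketch of the extra metadata attached when a shard becomes
    // unassigned: the reason, the time it happened, and optional details.
    final class UnassignedInfoSketch {

        enum Reason { INDEX_CREATED, NODE_LEFT, ALLOCATION_FAILED, REPLICA_ADDED }

        final Reason reason;
        final long unassignedTimeMillis; // lets later logic compute how long it has waited
        final String details;            // e.g. the failure message, if any

        UnassignedInfoSketch(Reason reason, String details) {
            this.reason = reason;
            this.details = details;
            this.unassignedTimeMillis = System.currentTimeMillis();
        }

        @Override
        public String toString() {
            return "[reason=" + reason
                + ", at=" + Instant.ofEpochMilli(unassignedTimeMillis)
                + (details != null ? ", details=" + details : "") + "]";
        }
    }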
Commits on Jun 14, 2015
  1. Add version 1.7.0

    it was added in 1.x, but not in master
    kimchy committed Jun 14, 2015
Commits on Jun 12, 2015
  1. Simplify ShardRouting and centralize move to unassigned

    Make sure there is a single place where a shard routing moves to unassigned, so we can add additional metadata when it does; also, simplify the shard routing implementations a bit
    closes #11634
    kimchy committed Jun 12, 2015
Commits on Jun 2, 2015
  1. Fail shard if search execution uncovers corruption

    If, as part of the search execution, a corruption is uncovered, we should fail the shard
    relates to #11419
    kimchy committed Jun 1, 2015
Commits on May 29, 2015
  1. Reduce cluster update reroutes with async fetch

    When using async fetch, we can end up with a number of cluster updates and reroutes proportional to the number of shards. While not disastrous, we can optimize it, since a single reroute is enough to apply all the async fetch results that arrived during that time.
    kimchy committed May 29, 2015
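A minimal sketch of coalescing many async fetch responses into a single reroute, as described above; the class and method names are hypothetical.

    import java.util.concurrent.atomic.AtomicBoolean;

    // Hypothetical sketch: many async fetch responses may arrive while a reroute
    // is already pending; only one reroute needs to be scheduled for all of them.
    final class BatchedRerouteSketch {
        private final AtomicBoolean reroutePending = new AtomicBoolean();
        private final Runnable reroute;

        BatchedRerouteSketch(Runnable reroute) {
            this.reroute = reroute;
        }

        /** Called for every async fetch response. */
        void onFetchResponse() {
            if (reroutePending.compareAndSet(false, true)) {
                // first response since the last reroute: schedule exactly one
                reroute.run();
            } // otherwise a reroute is already pending and will see this result too
        }

        /** Called when the cluster state update that performs the reroute runs. */
        void onRerouteApplied() {
            reroutePending.set(false);
        }
    }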
Commits on May 22, 2015
  1. Merge pull request #11304 from kimchy/upgrade_netty_3_10_3

    Upgrade to Netty 3.10.3
    kimchy committed May 22, 2015
  2. Upgrade to Netty 3.10.3

    kimchy committed May 22, 2015
Commits on May 21, 2015
Commits on May 20, 2015
  1. Async fetch of shard started and store during allocation

    Today, when a primary shard is not allocated we go to all the nodes to find where it is allocated (listing its started state). When we allocate a replica shard, we head to all the nodes and list its store to allocate the replica on a node that holds the closest matching index files to the primary.
    
    Those two operations today execute synchronously within the GatewayAllocator, which means they execute on the cluster update thread. For large clusters, or environments with very slow disks, those operations will stall the cluster update thread, making it seem like it's stuck.
    
    Worse, if the FS is really slow, we time out the operation after 30s (to not stall the cluster update thread completely). This means that we will need another run for the primary shard if we didn't find one, or we won't find the best node to place a shard since the listing might have timed out (listing stores needs to list all files and read the checksum at the end of each file).
    
    On top of that, this sync operation happens one shard at a time, so it effectively compounds the problem in a serial manner the more shards we have and the slower the FS is...
    
    This change moves both listing the shard started states and listing the shard stores to an async manner. During allocation by the GatewayAllocator, if data needs to be fetched from a node, it is done in an async fashion, with the response triggering a reroute to make sure the results will be taken into account. Also, if there are ongoing operations happening, the relevant shard data will not be taken into account until all the ongoing listing operations are done executing.
    
    The execution of listing shard states and stores has been moved to their own respective thread pools (scaling, so they will go down to 0 when not needed anymore, with an unbounded queue, since we don't want to time out, just let it execute based on how fast the local FS is). This is needed since we are going to blast nodes with a lot of requests and we need to make sure there is no thread explosion.
    
    This change also improves the handling of shard failures coming from a specific node. Until now, those nodes were ignored for allocation only for a single reroute round. Now, since fetching is async, we need to keep those failures around at least until a single successful fetch without the node is done, to make sure we don't keep allocating to the failed node.
    
    Note, where before the indication of slow allocation was a high number of pending tasks (since the allocator was waiting for responses), now the pending tasks will be much smaller. In order to still indicate that the cluster is in the middle of fetching shard data, 2 attributes were added to the cluster health API, indicating the number of ongoing fetches of both started shards and shard stores.
    
    closes #9502
    closes #11101
    kimchy committed May 10, 2015
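A rough sketch of the async fetch pattern described in the commit above, under simplified assumptions; AsyncShardFetchSketch and its methods are illustrative, not the real AsyncShardFetch API. The allocator asks for shard data; if it is not there yet, at most one fetch is started off the cluster update thread, the shard is skipped for this round, and a reroute is triggered once the response arrives.

    import java.util.Map;
    import java.util.Optional;
    import java.util.Set;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.function.Supplier;

    // Hypothetical sketch of async fetching of shard data during allocation.
    final class AsyncShardFetchSketch<T> {
        private final Supplier<T> fetchFromNodes; // the slow listing operation
        private final Runnable reroute;           // re-runs allocation when data is in
        private final Map<String, T> cache = new ConcurrentHashMap<>();
        private final Set<String> inFlight = ConcurrentHashMap.newKeySet();

        AsyncShardFetchSketch(Supplier<T> fetchFromNodes, Runnable reroute) {
            this.fetchFromNodes = fetchFromNodes;
            this.reroute = reroute;
        }

        /**
         * Returns the shard data if already fetched; otherwise starts at most one
         * async fetch and returns empty, so the allocator skips this shard for now
         * instead of blocking the cluster update thread.
         */
        Optional<T> fetchData(String shardId) {
            T cached = cache.get(shardId);
            if (cached != null) {
                return Optional.of(cached);
            }
            if (inFlight.add(shardId)) { // only the first caller starts the fetch
                CompletableFuture.supplyAsync(fetchFromNodes).whenComplete((result, error) -> {
                    if (error == null) {
                        cache.put(shardId, result);
                    }
                    inFlight.remove(shardId);
                    reroute.run(); // results (or the failure) arrived: reroute to use them
                });
            }
            return Optional.empty();
        }
    }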
Commits on May 19, 2015
  1. [TEST] add await fix for #11226

    kimchy committed May 19, 2015
  2. [TEST] Add a corrupted replica test verifying it's still allocated

    Add a test that verifies that even though all replicas are corrupted on all available nodes, and listing of shard stores failed, the replica still gets allocated and properly recovered from the primary shard
    kimchy committed May 19, 2015