Commits on Jan 28, 2013
  1. move more control of copy transfers to vnode and add basic forwarding

    jrwest committed Jan 28, 2013
    * Instead of the vnode manager triggering a transfer for each source
      index individually, it triggers one "copy" transfer per source
      index. The copy transfer contains the list of target indexes to
      "copy" to. The vnode then triggers an outbound ownership_copy one
      at a time until all transfers for the list of indexes are
      complete. Once complete, it notifies the vnode manager like
      regular handoff.
    * Added (barely tested) support for forwarding.
    * This approach more closely resembles typical ownership
      transfer/hinted handoff for a vnode. The primary differences are: 1)
      data is not deleted after handoff completes  (this needs to be
      addressed -- at some point some data needs to be deleted, see
      comments). 2) in the case that an index exists in both old & new
      rings it may copy its data to target indexes and then keep
      running. In this case data also needs to be deleted (also punted on)
      but some data must still remain (referred to as rehash in Core 2.0
      doc). 3) the same vnodes that are affected by #2 also differ in that
      after they begin forwarding they may stop forwarding and continue
      running in their regular state. In addition, when forwarding, these
      indexes will forward some requests while others will still be
      handled by the local vnode (not forwarded). What to do with a
      request during explicit forwarding (when the vnode returns
      {forward, X} from handle_handoff_command), when forwarding that
      message would result in it being delivered to the same vnode, still
      needs to be addressed (see comments).
    * This commit adds a vnode callback, request_hash, required only if
      supporting changing ring sizes. We probably need something better
      than this, but it's sufficient for a prototype. The function's
      argument is the request to be handled by the vnode and the return
      value is the hashed value of the key from the request. This is
      necessary because the request is opaque to riak_core_vnode. One
      obvious issue, for example, is that in the case of FOLD_REQ there
      is no key to hash -- even though we probably shouldn't, and in some
      cases don't, forward this type of request.
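    The request_hash idea above can be sketched in a few lines. This is a
    hypothetical Python model, not the riak_core API: the callback and
    routing-helper names (request_hash, route, resize_target_for_hash)
    are illustrative, and it assumes requests are simple dicts with an
    optional "key" field.

    ```python
    import hashlib

    def request_hash(request):
        # vnode-module callback (illustrative): return the hashed key for
        # this request, or None when there is no single key to hash
        # (the FOLD_REQ case described above)
        key = request.get("key")
        if key is None:
            return None
        return int(hashlib.sha1(key).hexdigest(), 16)

    def route(request, resize_target_for_hash, local_handler):
        # the routing layer cannot inspect the opaque request itself,
        # so it asks the callback where the request's key hashes
        h = request_hash(request)
        if h is None:
            # no key to hash: handle locally rather than forward
            return local_handler(request)
        target = resize_target_for_hash(h)
        if target is None:
            return local_handler(request)
        return ("forward", target)
    ```

    A keyed request gets forwarded to whatever target the resize logic
    names for its hash; a keyless request (a fold) falls through to the
    local vnode, which is exactly the open question noted above.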
Commits on Jan 9, 2013
  1. move copy logic down to vnode from vnode manager

    jrwest committed Jan 9, 2013
    WIP, begin pushing logic for "ownership_copy" down to the vnode layer.
    This will be necessary to implement forwarding (and is a better implementation)
Commits on Jan 6, 2013
  1. a rough sketch of an expandable ring

    jrwest committed Jan 6, 2013
    (not including much of the work necessary in riak_core_vnode)
    a basic implementation of the "get the data where it needs to be"
    portion of an expandable ring. The other portion, modifying vnode
    forwarding to properly forward operations to multiple indexes during
    ring expansion, is not addressed here (coming soon).
    Many edges are rough and several temporary punts were made. What is
    implemented, as well as what needs to be improved/addressed/completed,
    is detailed below.
    what is implemented:
      * riak_core_claimant:
        * expanding, or more generally changing the size of, the ring is
          a cluster operation that is planned/staged/committed.
        * the plan generated by expanding the ring is created by making a
          chash with more indexes, assigning the existing indexes to
          their current owners and the new indexes to a dummy owner. The
          larger ring is then run through claim. The resulting ring and
          original ring are used as inputs to determine the outbound
          handoffs necessary for each existing index. Each existing index
          will hand off a portion of its data to itself (if it has been
          relocated during claim) and to its `OrigRingSize *
          ((NewRingSize/OrigRingSize) - 1)` predecessors. This is only
          valid when (NewRingSize/OrigRingSize =< OrigRingSize). The
          derivation and proof of this function are not outlined here,
          but it is possible to prove via brute force (due to the limited
          nature of the input set). It is also possible to prove it more
          formally, although at this time that has not been done. The
          restriction on the inputs, (NewRingSize/OrigRingSize =<
          OrigRingSize), is reasonable given that it allows a ring of 64
          partitions to expand to 4096, and for 128 it's even larger
          (OrigRingSize^2, specifically). It is also possible to
          alleviate this restriction at the cost of some complexity. In
          addition, it should be noted that there can be a considerable
          number of transfers. For example, transitioning from 64 to 128
          partitions incurs 65 transfers, and from 64 to 256, 193
          transfers. These transfers do not send the entire keyspace held
          by an existing vnode, but in the case of riak_kv, for example,
          they still require that many folds across the whole keyspace
          held by the existing vnode.
        * after being committed, the ring size change causes the claimant
          to *not* be run until all transfers complete (see needs improvement
          for more on what needs to change here). The goal here is to
          delay the installation of the larger ring until all transfers
          have completed. This differs from typical ownership transfers
          where new owners are installed as soon as all transfers (for all
          vnode modules) have completed.
      * riak_core_ring:
        * the next member of chstate_v2 has been changed to accommodate
          storing the new ring size and a modified transfer list necessary
          to schedule transfers for the operation
        * other changes necessary to facilitate change in ring size
      * riak_core_vnode_manager:
        * similar to typical ownership transfer, the vnode_manager determines
          if a local vnode should participate in a transfer involved in
          changing the ring size based on the next list
        * new indexes in the larger ring will not initially have vnode
          proxies started for them. The vnode_manager starts the proxies
          for those new indexes as part of the management tick
      * riak_core_handoff_*:
        * added a new type of outbound handoff, "ownership_copy". It is
          similar to ownership_transfer/hinted_handoff, except that it
          may be between two different indexes and only part of the
          keyspace held by the source index is sent to the target. The
          determination of which keys are sent is made by looking at the
          position of the source and target indexes in the preflist for
          the key in the old and new rings, respectively. If the
          positions are the same, the key is sent.
        * the handoff type is now passed explicitly to the handoff
          manager. Not sure this was the best decision, but at the time
          it seemed necessary because the handoff type determined by the
          handoff manager might be ambiguous between the new
          "ownership_copy" and existing "repair" types.
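      The preflist-position rule above can be modeled in a short sketch.
      This is an illustrative Python model, not the Erlang
      implementation: it assumes evenly spaced partition indexes around a
      2^160 hash space and SHA-1 key hashing (as in riak_core's chash),
      and the helper names (ring, preflist, send_key) are made up for the
      example.

      ```python
      import hashlib

      HASH_SPACE = 2 ** 160

      def ring(size):
          # evenly spaced partition indexes around the hash space
          inc = HASH_SPACE // size
          return [i * inc for i in range(size)]

      def key_hash(key):
          return int(hashlib.sha1(key).hexdigest(), 16)

      def preflist(indexes, khash, n):
          # the first n partition indexes clockwise from the key's hash
          size = len(indexes)
          inc = HASH_SPACE // size
          first = (khash // inc + 1) % size
          return [indexes[(first + i) % size] for i in range(n)]

      def send_key(key, source_idx, target_idx, old_ring, new_ring, n=3):
          # copy a key from source to target only when the source's
          # position in the old-ring preflist matches the target's
          # position in the new-ring preflist
          kh = key_hash(key)
          old_pl = preflist(old_ring, kh, n)
          new_pl = preflist(new_ring, kh, n)
          if source_idx not in old_pl or target_idx not in new_pl:
              return False
          return old_pl.index(source_idx) == new_pl.index(target_idx)
      ```

      Under this rule a source vnode sends only the slice of its keyspace
      whose preflist position survives the resize, which is why the
      transfers described above move partial keyspaces rather than whole
      partitions.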
    needs improvement:
      * riak_core_claimant:
        * the prevention of running the claimant entirely during a ring
          size change is too broad in scope. It prevents operations like
          force-* and the proper handling of down nodes. The current
          implementation is the path of least resistance, but it should
          be modified to narrow the scope of what runs in the claimant
          during a size change
        * It is a bit confusing that the functions that determine what
          transfers are necessary to change the ring size operate on the
          next list format for typical ownership transfer while building
          the new next list format for "ownership_copy" transfer
        * change_size operations are not properly validated or filtered
        * staged join/remove/leave operations planned along with an
          expansion currently have untested/undefined behaviour
      * riak_core_ring:
        * the ring reconciliation logic is not entirely correct/complete
        * the changes to chstate_v2 will break mixed version clusters
          because they are not backwards compatible. Need a v3 or v2.1 or
          something, or a smarter next list
        * Other necessary changes have not been made. Known
          issues/incompletion are marked with TODOs
      * riak_core_vnode_manager:
        * Currently, the vnode manager's use of the next list to determine
          outbound transfers is less than ideal. It will schedule many
          transfers outbound from the same index during expansion
          while leaving other indexes that have scheduled copy transfers
          idle. This becomes more or less obvious depending on whether
          max_concurrency is lower or higher.
        * a portion of the work should actually be performed by the
          vnode, not the vnode manager. see comments in vnode_manager
        * the management tick is not the ideal place to start vnode
          proxies for new indexes
      * riak_core_handoff_*:
        * when "ownership_copy" completes, the handoff_sender circumvents
          the vnode and goes directly to the vnode manager to notify it of
          completion. this should work more like typical handoff completion
          going through the vnode (which generates a vnode event)
      * if an index that exists in both the old and new rings is not
        moved to a new node, then it will hold data it is no longer
        responsible for. This data can either be left as is or, as a
        final part of the ring resizing, it can be removed.
      * a capability should be registered for dynamic ring sizing. All
        nodes must have the capability for the operation to be allowed
      * this is just the ability to expand the ring. It does share a good
        amount with what will be necessary to allow shrinking as well
    other notes:
      * I'm not a huge fan of some of the naming chosen. It's unclear and
        inconsistent in places and requires cleanup.
      * obviously, this does not include changes necessary in
        riak_kv/search (will certainly crash with AAE enabled).
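    The size restriction stated above (NewRingSize/OrigRingSize =<
    OrigRingSize) can be expressed as a small check. This is a
    hypothetical helper, not the actual riak_core_claimant validation
    (which the message notes is not yet properly implemented); it also
    assumes an expansion must be an integer multiple of the original
    size, as the formula's use of NewRingSize/OrigRingSize implies.

    ```python
    def valid_resize(orig_size, new_size):
        # expansion only, and the new size must be an integer multiple
        # of the original (the NewRingSize/OrigRingSize formula above
        # assumes this divides evenly)
        if new_size <= orig_size or new_size % orig_size != 0:
            return False
        # the stated restriction: NewRingSize/OrigRingSize <= OrigRingSize,
        # so a ring can grow by at most a factor of its own size in one
        # operation (64 partitions can reach 64 * 64 = 4096)
        return new_size // orig_size <= orig_size
    ```

    Per the message, this bound is generous in practice: a 64-partition
    ring may expand up to 4096 partitions, and a 128-partition ring up to
    OrigRingSize^2 = 16384.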
Commits on Jan 3, 2013
  1. Merge pull request #264 from basho/eas-fix-vnode-terminate-abnormal-r…

    engelsanchez committed Jan 3, 2013
    Make vnode terminate backend for any exit reason
  2. Merge pull request #251 from basho/readme-rewrite

    seancribbs committed Jan 3, 2013
    rewriting revised readme in .md and removing .org version.
  3. Line wrap stuff.

    seancribbs committed Jan 3, 2013
Commits on Jan 2, 2013
  1. Wrapping pool shutdown in try/catch

    engelsanchez committed Jan 2, 2013
    Making sure that we really always call terminate to avoid other issues
    like basho/riak_test#137
  2. Make vnode terminate backend for any exit reason

    engelsanchez committed Jan 2, 2013
    This should fix issue basho/riak_test#137, where partition repair was
    killing vnodes with reason kill_for_test and experiencing sporadic
    bitcask data corruption.
Commits on Dec 20, 2012
  1. Fixate lager dependency on 1.2.1

    Jared Morrow
    Jared Morrow committed Dec 20, 2012
  2. Merge pull request #259 from basho/jdb-supervisor-order

    jtuple committed Dec 20, 2012
    Adjust riak_core_sup child order for cleaner shutdown
  3. add enable_health_checks config option

    jrwest committed Dec 18, 2012
    prevents registration of health checks in the case that
    {enable_health_checks, false} is set for the riak_core application
  4. health check changes

    jrwest committed Dec 17, 2012
    * change the checking processes to use gen_server:cast instead of exit
      for message passing between them and the node_watcher process for
      valid return values (true/false). Invalid return values are still
      handled via exit.
    * provide functions (resume_health_checks/suspend_health_checks) on
      riak_core_node_watcher to enable/disable all health checks. This will also
      toggle the new healths_enabled flag in the node_watcher's state which is
      used to prevent re-starting checks when a node goes down and comes back up
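    The suspend/resume flag described above can be sketched as a small
    state machine. This is an illustrative Python model of the behaviour,
    not the Erlang riak_core_node_watcher; the class and method names
    mirror the commit message but are otherwise made up.

    ```python
    class NodeWatcher:
        # models the healths_enabled flag: suspending stops all checks,
        # and the flag prevents them from being restarted when a node
        # goes down and comes back up
        def __init__(self, checks):
            self.checks = set(checks)
            self.running = set(checks)
            self.healths_enabled = True

        def suspend_health_checks(self):
            self.healths_enabled = False
            self.running.clear()

        def resume_health_checks(self):
            self.healths_enabled = True
            self.running = set(self.checks)

        def node_up(self):
            # checks only restart on node-up while health checks are enabled
            if self.healths_enabled:
                self.running = set(self.checks)
    ```

    With the flag cleared, a node bouncing (node_up after a down event)
    does not resurrect suspended checks, which is the behaviour the
    commit adds.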
Commits on Dec 19, 2012
  1. Increased timer:sleep in vnode stop

    Joe DeVivo
    Joe DeVivo committed Dec 19, 2012
  2. Increased timeout to the bloom test

    Joe DeVivo
    Joe DeVivo committed Dec 19, 2012
Commits on Dec 18, 2012
  1. Fix vnode eqc test regression

    rzezeski committed Dec 18, 2012
    The commit d769cfb broke the test
    because now instead of always going through the mgr it will
    shortcut and hit the ets tab directly.  This means there can be a
    slice of time where the supervisor reports a vnode pid but the ets tab
    hasn't indexed it yet.  Calling `get_vnode_pid` after starting the
    vnode ensures that all previous msgs in vnode mgr mailbox have been
    handled and the ets tab is up-to-date for all subsequent calls.
Commits on Dec 15, 2012
  1. Merge pull request #257 from basho/jdb-health-check

    jtuple committed Dec 15, 2012
    Enable riak_core apps to provide a health check callback at
    registration, allowing registered apps to take advantage of
    the new health check functionality added to riak_core_node_watcher.
  2. Change vnode manager API to read from ETS when possible

    jtuple committed Dec 5, 2012
    Various parts of Riak regularly call the all_vnodes/0, all_vnodes/1,
    and all_index_pid/1 functions provided by riak_core_vnode_manager.
    Prior to this commit, all three calls performed gen_server calls to
    the vnode manager. This forces unnecessary serialization and can be
    bad news if the vnode manager ever ends up with a large message queue,
    which happens frequently during cluster transitions.
    This commit changes the existing private ETS table that the vnode manager
    uses to keep track of vnodes, and makes the table protected. The three
    API calls above are then changed to read directly from this ETS table
    when possible, and to only fall back to gen_server calls to the vnode
    manager when necessary. For example, if an API call comes in while the
    vnode manager is still starting up, the ETS table may not yet exist,
    and the calling process will fall back to a call to the vnode manager,
    which is guaranteed to be handled after the ETS table has been created
    and populated.
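    The read path described here follows a general pattern: a shared
    read-only fast path with a serialized fallback. Below is a Python
    analogue of that pattern (not the Erlang code; a lock stands in for
    the gen_server mailbox and a dict for the protected ETS table, and
    all names are illustrative).

    ```python
    import threading

    class VnodeRegistry:
        # analogue of the vnode manager's protected ETS table: readers hit
        # the shared table directly and only fall back to the serialized
        # manager path when the table is not yet available
        def __init__(self):
            self._lock = threading.Lock()  # stands in for the gen_server mailbox
            self._table = None             # None until the manager finishes init

        def _manager_call(self, index):
            # slow path: every caller queues behind this lock, like a
            # gen_server:call to the vnode manager
            with self._lock:
                if self._table is None:
                    self._table = {}       # manager creates the table on demand
                return self._table.get(index)

        def lookup(self, index):
            table = self._table
            if table is not None:
                return table.get(index)    # fast path: no serialization
            return self._manager_call(index)

        def register(self, index, pid):
            # writes stay on the serialized path, as in the original design
            with self._lock:
                if self._table is None:
                    self._table = {}
                self._table[index] = pid
    ```

    The payoff is the same as described above: readers no longer queue
    behind a busy manager, which matters most when its message queue (or
    here, lock contention) grows during cluster transitions.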
  3. Enable riak_core apps to provide a health_check callback

    jtuple committed Dec 5, 2012
    Applications built on riak_core can now provide {health_check, MFA} as
    an option during registration. The provided health check will then be
    passed on to riak_core_node_watcher:service_up when the service is
    marked as up. This enables riak_core applications to take advantage of
    the new health check functionality added to riak_core_node_watcher.
  4. Merge pull request #240 from basho/issue_388

    jtuple committed Dec 5, 2012
    Extend riak_core_node_watcher to support the registration of health
    check logic that monitors a given service, automatically marking the
    service as down when unhealthy and back-up when healthy.
Commits on Dec 7, 2012
Commits on Dec 6, 2012
Commits on Dec 5, 2012
  1. Fix long lines and comment styles in riak_core_node_watcher

    jtuple committed Dec 5, 2012
    Wrap extraordinarily long lines in riak_core_node_watcher to enable
    viewing code / commits on Github without horizontal scrolling. Typically,
    we aim to limit lines to 79 characters for optimal terminal viewing as
    well, but we've all been guilty of longer lines here and there so I only
    modified lines that were easily changed and/or too long for Github.
    Changed a few comment-only lines from '%' to '%%' to match Riak convention,
    and make Emacs erlang-mode auto-indent happy.
  2. Fix bugs in node watcher health check code

    jtuple committed Dec 5, 2012
    Change incorrect use of #health_check.interval_tref to correct
    use of #health_check.check_interval.
    Fix badmatch error by changing handle_fsm_exit to return a 2-tuple as
    expected at the call-site rather than a 3-tuple.
    Changed determine_time to return an integer, as Erlang requires for a
    timeout value. Previously, the function could return a float and
    trigger a badarg.
Commits on Nov 29, 2012
  1. Merge pull request #255 from basho/sv-fix-legacy-upgrade-msg

    vinoski committed Nov 29, 2012
    upgrade legacy ring only if needed
  2. upgrade legacy ring only if needed

    vinoski committed Nov 29, 2012
    Check for a legacy ring before upgrading it; for non-legacy rings, this
    eliminates the confusing unconditional "Upgrading legacy ring" message
    users mentioned on the riak-core mailing list.
Commits on Nov 27, 2012
  1. Merge pull request #254 from branch 'bwf-dialyzer'

    Bryan Fink
    Bryan Fink committed Nov 27, 2012