Metadata Classification
During the operation of a kvrocks cluster, the following four types of metadata are important and need to be stored in the etcd cluster, so that when the controller switches between leader and follower, the new leader can load the metadata from etcd and continue running.
- cluster topo: topology information of the cluster (shards, slot ranges, nodes, etc.)
- migrate info: metadata related to migration (source node, target node, migrated slot ranges, etc.)
- failover info: metadata related to failover (node, node role, etc.)
- config info: the configuration of namespace, cluster, migrate, failover, etc.
 
Memory and Remote Consistency
The difficulty of the etcd + controller architecture is the inconsistency between the controller's in-memory data and the persistent data in the remote etcd, which causes a leader-follower switch of the controller to have a great impact on the availability of the cluster. Updates to the metadata mainly come from the following two sources:
- Devops operations that need to update metadata, such as cluster initialization, cluster expansion and contraction, and manual failover of specified nodes. These update the metadata first and then operate on the cluster.
- Operations initiated by the internal logic of the leader controller, such as auto failover and slot migration. These operate on the cluster first and then update the metadata.
 
For the first case above, all operations that need to update metadata can be forced to be sent to the leader controller. When the leader controller receives such a request, it first verifies it against its local in-memory data. Once the verification passes, the remote etcd is updated, and after that update succeeds, the leader controller's local memory is updated (after verification, the memory update is assumed to always succeed).
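A minimal sketch of this ordering is shown below, assuming a hypothetical Controller type; the request and helper names (validate, persistToEtcd, applyToMemory) are illustrative, not the actual controller code.

package controller

import "sync"

// Hypothetical types for illustration only.
type UpdateRequest struct{} // devops operation payload

type Controller struct {
    mu sync.Mutex
    // in-memory copy of the cluster metadata lives here
}

func (c *Controller) validate(req *UpdateRequest) error      { return nil } // check against memory
func (c *Controller) persistToEtcd(req *UpdateRequest) error { return nil } // write to remote etcd
func (c *Controller) applyToMemory(req *UpdateRequest)       {}             // update local memory

// HandleMetadataUpdate follows the order described above:
// verify against memory -> update etcd -> update memory.
func (c *Controller) HandleMetadataUpdate(req *UpdateRequest) error {
    c.mu.Lock()
    defer c.mu.Unlock()
    if err := c.validate(req); err != nil {
        return err
    }
    if err := c.persistToEtcd(req); err != nil {
        return err
    }
    c.applyToMemory(req) // assumed to succeed once validation has passed
    return nil
}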
For the second case above, in order to ensure the availability of the cluster, the memory is updated first, then synchronized to all cluster nodes, and finally etcd is updated. If the etcd update fails, a background goroutine periodically retries against etcd until it succeeds, with correctness guaranteed by comparing cluster versions. If a leader-follower switch happens during this period, the new leader first collects the cluster information from the nodes and performs integrity verification, then compares it with the metadata in etcd, and updates etcd based on the collected node information.
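The version comparison could look like the following sketch; it relies on the Cluster.Version field defined later in this document, while the store type and helper names are assumptions made here for illustration.

package controller

import "time"

// Hypothetical wrapper around the etcd client and the in-memory cluster.
type clusterStore struct{}

type versionedCluster struct{ Version int64 }

func (s *clusterStore) loadFromEtcd() (*versionedCluster, error) { return &versionedCluster{}, nil }
func (s *clusterStore) local() *versionedCluster                 { return &versionedCluster{} }
func (s *clusterStore) storeToEtcd(c *versionedCluster) error    { return nil }

// syncLoop keeps retrying until etcd holds a cluster version that is
// at least as new as the one in memory.
func (s *clusterStore) syncLoop(interval time.Duration) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()
    for range ticker.C {
        remote, err := s.loadFromEtcd()
        if err != nil {
            continue // etcd unreachable, retry on the next tick
        }
        // Only push when memory holds a newer version than etcd.
        if local := s.local(); local.Version > remote.Version {
            _ = s.storeToEtcd(local) // a failure here is retried on the next tick
        }
    }
}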
Metadata Organization
Metadata in Etcd
/*
 * etcd data format design principles:
 * 1. cluster data (topo and meta) under the same namespace is stored contiguously
 * 2. minimize the number of etcd accesses needed for one upper-layer operation
 *
 * namespace format:
 *    `/namespace/${ns}:${ns}`
 *    all namespace names have the same prefix `/namespace/`, convenient for listing namespaces
 *
 * cluster topo format:
 *    `/${ns}/cluster/${cl}:${Cluster-json}`
 *    all clusters (topo and meta) have the same prefix /${ns}/
 *    all cluster topo data has the same prefix /${ns}/cluster/
 *
 * migrate metadata format:
 *    `/${ns}/${cl}/migrate/tasks/${timestamp}_${sub_id}:${MigrateTask-json}`
 *    `/${ns}/${cl}/migrate/doing:${MigrateTask-json}`
 *    `/${ns}/${cl}/migrate/history/${timestamp}_${sub_id}:${MigrateTask-json}`
 *    all migrate metadata under a specified cluster has the same prefix /${ns}/${cl}/migrate/
 *
 * failover metadata format:
 *    `/${ns}/${cl}/failover/tasks/${nodeid}:${timestamp}`
 *    `/${ns}/${cl}/failover/doing:${FailoverTask-json}`
 *    `/${ns}/${cl}/failover/history/${timestamp}_${nodeid}:${FailoverTask-json}`
 *    all failover metadata under a specified cluster has the same prefix /${ns}/${cl}/failover/
 */
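For illustration, building these keys in Go might look like the sketch below; the helper function names are assumptions, not the controller's actual API.

package store

import "fmt"

// Hypothetical key builders for the etcd layout described above.
func namespaceKey(ns string) string        { return fmt.Sprintf("/namespace/%s", ns) }
func clusterTopoKey(ns, cl string) string  { return fmt.Sprintf("/%s/cluster/%s", ns, cl) }
func migrateDoingKey(ns, cl string) string { return fmt.Sprintf("/%s/%s/migrate/doing", ns, cl) }

// Pending migrate tasks are named ${timestamp}_${sub_id} under migrate/tasks/.
func migrateTaskKey(ns, cl string, timestamp uint64, subID int) string {
    return fmt.Sprintf("/%s/%s/migrate/tasks/%d_%d", ns, cl, timestamp, subID)
}

// Pending failover tasks are keyed by node id under failover/tasks/.
func failoverTaskKey(ns, cl, nodeID string) string {
    return fmt.Sprintf("/%s/%s/failover/tasks/%s", ns, cl, nodeID)
}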
Metadata Logical Organization
${var} denotes a variable, ${var...} denotes multiple variables of the same type, and the part after ':' is the JSON format of the structure.
/*
 * /namespace/${ns_name}
 *    /${ns_name}/cluster/${cluster_name}: ${Cluster-json}
 *    /${ns_name}/${cluster_name}/migrate
 *                                       /tasks
 *                                             /${timestamp}_${sub_id...}:${MigrateTask-json}
 *                                       /doing:${MigrateTask-json}
 *                                       /history
 *                                              /${timestamp}_${sub_id...}:${MigrateTask-json}
 *    /${ns_name}/${cluster_name}/failover
 *                                       /tasks
 *                                             /${nodeid...}:${timestamp}
 *                                       /doing:${FailoverTask-json}
 *                                       /history
 *                                               /${timestamp}_${nodeid}:${FailoverTask-json}
 *     /${ns_name}/cluster/${cluster_name...}: ${Cluster-json}
 * /namespace/${ns_name...}
 */
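As an example of how the shared prefixes are used, listing all namespaces is a single prefixed read. The sketch below assumes the etcd v3 Go client (go.etcd.io/etcd/client/v3) and is illustrative rather than the controller's actual code.

package store

import (
    "context"
    "time"

    clientv3 "go.etcd.io/etcd/client/v3"
)

// listNamespaces returns all namespace names by scanning the shared
// "/namespace/" prefix; each value stored there is the namespace name itself.
func listNamespaces(cli *clientv3.Client) ([]string, error) {
    ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
    defer cancel()
    resp, err := cli.Get(ctx, "/namespace/", clientv3.WithPrefix())
    if err != nil {
        return nil, err
    }
    names := make([]string, 0, len(resp.Kvs))
    for _, kv := range resp.Kvs {
        names = append(names, string(kv.Value))
    }
    return names, nil
}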
Cluster Metadata
/* 
 * /namespace/${ns_name}
 *    /${ns_name}/cluster/${cluster_name}: ${Cluster-json}
 */
type ClusterConfig struct {
  Name              string `json:"name"`
  HeartBeatInterval uint64 `json:"heartbeatinterval"`
  HeartBeatRetrys   uint64 `json:"heartbeatretrys"`

  // Global migration configuration; the configuration inside a MigrateTask
  // overrides the cluster-level configuration, so it can be tuned per task.
  MigrateMeta  MigrateConfig  // TODO: migrate submodel
  // Global failover configuration; the configuration inside a FailoverTask
  // overrides the cluster-level configuration, so it can be tuned per task.
  FailoverMeta FailoverConfig // TODO: failover submodel
}
type Cluster struct {
   Version int64          `json:"version"`
   Shards  []Shard        `json:"shards"`
   Config  ClusterConfig  `json:"config"`
}
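For example, persisting the topology at its topo key could look like the following; this is a sketch that assumes the etcd v3 Go client and the Cluster struct above in the same package.

package store

import (
    "context"
    "encoding/json"
    "fmt"
    "time"

    clientv3 "go.etcd.io/etcd/client/v3"
)

// saveClusterTopo marshals the Cluster struct above and writes it to
// /${ns}/cluster/${cluster_name} (illustrative only).
func saveClusterTopo(cli *clientv3.Client, ns string, cluster *Cluster) error {
    data, err := json.Marshal(cluster)
    if err != nil {
        return err
    }
    ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
    defer cancel()
    _, err = cli.Put(ctx, fmt.Sprintf("/%s/cluster/%s", ns, cluster.Config.Name), string(data))
    return err
}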
Migrate Metadata
/* 
 *    /${ns_name}/${cluster_name}/migrate
 *                                       /tasks
 *                                             /${timestamp}_${sub_id...}:${MigrateTask-json}
 *                                       /doing:${MigrateTask-json}
 *                                       /history
 *                                              /${timestamp}_${sub_id...}:${MigrateTask-json}
 *
 * migrate_${timestamp}_${sub_id}: one expansion may involve multiple migration tasks (different nodes, different slots);
 * they share the same timestamp and are distinguished by sub_id
 */
type MigrateTask struct {
   Source      string    // source node id
   Target      string    // target node id
   MigrateSlot SlotRange // slot range to migrate
   // statistics
   CreateTime  uint64
   DoingTime   uint64
   DoneTime    uint64
   Status      int       // init, doing, success/failed
   Err         error     // if failed, Err is not nil
   MigrateMeta MigrateConfig
}
type MigrateConfig struct {
   ConcurrencyNodes int // max concurrent migrate tasks among nodes
   ConcurrencySlots int // max concurrent migrate slots between Source and Target
}
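The per-task override of these settings, described in the ClusterConfig comments, could be resolved as in the following sketch (the helper name and the zero-value fallback convention are assumptions):

// effectiveMigrateConfig returns the task-level configuration where it is set
// (non-zero), falling back to the cluster-wide defaults otherwise.
func effectiveMigrateConfig(cluster, task MigrateConfig) MigrateConfig {
    out := cluster
    if task.ConcurrencyNodes > 0 {
        out.ConcurrencyNodes = task.ConcurrencyNodes
    }
    if task.ConcurrencySlots > 0 {
        out.ConcurrencySlots = task.ConcurrencySlots
    }
    return out
}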
Failover Metadata
/*
 *    /${ns_name}/${cluster_name}/failover
 *                                       /tasks
 *                                             /${nodeid...}:${timestamp}
 *                                       /doing:${FailoverTask-json}
 *                                       /history
 *                                               /${timestamp}_${nodeid}:${FailoverTask-json}
 *
 * failover_${nodeid}: a node has at most one failover task in the failover queue at a time
 */
type FailoverTask struct {
   Source string // id of the failed node
   Target string // if Source is a master, the node id of the new master
   Role   string // node role
   // auto or manual; a manual failover probes whether Source is healthy
   // and forces Target to be promoted to the new master
   Type   int

   // statistics
   CreateTime   uint64
   DoingTime    uint64
   DoneTime     uint64
   Status       int   // init, doing, success/failed
   Err          error // if failed, Err is not nil
   FailoverMeta FailoverConfig
}
type FailoverConfig struct {
   FailoverTimeout uint64
   FailoverRetrys  int // after the retries are exhausted, the task fails
   // if failoverNodes > clusterNodes * FailoverRatio / 10, enter SafeMode
   // FailoverRatio is between 0 and 10
   FailoverRatio   int
}
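A small sketch of the SafeMode rule from the comment above; the function name is made up here, and the real controller may count nodes differently.

// shouldEnterSafeMode applies the rule: enter SafeMode once
// failoverNodes > clusterNodes * FailoverRatio / 10.
func shouldEnterSafeMode(failoverNodes, clusterNodes int, cfg FailoverConfig) bool {
    return failoverNodes > clusterNodes*cfg.FailoverRatio/10
}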