allreduce/broadcast cache #98
The goal of this PR is to implement an immutable cache in rabit to help a failed worker recover allreduce/broadcast results that were not yet synced at bootstrap time.
It's specifically designed to help a recovered node catch up with the rest of the nodes, with minimal overhead in the non-recovery path.
If I understand correctly, that's the same as failing all nodes and loading a checkpoint, which starts from the same code path. In this case the checkpoint has to be stored on a remote filesystem, since all nodes fail and restart. If we decide to go with this approach, we can get rid of allreducerobust and link against allreducebase instead. Checkpoint and loadcheckpoint can be done relatively easily with a dmlc::Stream pointing to a URL path.
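As a rough illustration of that idea (a Python sketch, not rabit's actual API — the `save_checkpoint`/`load_checkpoint` helpers and the in-memory stream are stand-ins for a `dmlc::Stream` opened on a remote URL):

```python
import io
import pickle

def save_checkpoint(stream, model_state):
    # Serialize model state to a stream; in rabit this stream would
    # point at a remote filesystem path so it survives full-cluster failure.
    pickle.dump(model_state, stream)

def load_checkpoint(stream):
    # Every restarted node reads the same checkpoint back, so all
    # nodes resume from an identical state on the same code path.
    return pickle.load(stream)

buf = io.BytesIO()
save_checkpoint(buf, {"iteration": 7, "weights": [0.1, 0.2]})
buf.seek(0)
restored = load_checkpoint(buf)
```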
That's a fair point. Another way is to eliminate the cache proposed in this PR and keep the resbuf and checkpoint payloads in RocksDB (on disk), always keeping a unique list of allreduce/broadcast call payloads identified by signature (hidden from the user). Taking fast hist as an example, the KV store keeps three categories of results on disk:
bootstrap section (before the first successful checkpoint) — essentially serves as the cache
iteration section (after the last checkpoint) — essentially what resbuf query and pushtemp did
last successful checkpoint — essentially the checkpoint payload
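A toy in-memory model of that layout (plain Python dicts standing in for RocksDB; the section names and `signature` keys here are assumptions for illustration only):

```python
# Three sections of the on-disk KV store, keyed by a call signature
# that is hidden from the user.
store = {
    "bootstrap": {},   # payloads before the first successful checkpoint
    "iteration": {},   # payloads after the last checkpoint
    "checkpoint": {},  # the last successful checkpoint payload
}

def put(section, signature, payload):
    # Entries are immutable: the first write for a signature wins,
    # mirroring the unique list of allreduce/broadcast call payloads.
    store[section].setdefault(signature, payload)

put("bootstrap", "allreduce:0", b"col-sizes")
put("iteration", "allreduce:1", b"grad-hist")
put("checkpoint", "model", b"model-bytes")
```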
So checkpointing becomes: any node takes an async snapshot of all three sections and uploads it to HDFS, while the next iterations keep executing and mutating the table contents. https://rocksdb.org/blog/2015/11/10/use-checkpoints-for-efficient-snapshots.html
An example PR with the actual work is in progress.
Same idea as the RocksDB-like approach; I'm inclined to use RocksDB as a library to deal with consistency and snapshot generation. XGB-Spark might be able to integrate by uploading the local binary to remote storage and vice versa.
No, my proposal is actually inspired by Flink's rollback mechanism: http://www.vldb.org/pvldb/vol10/p1718-carbone.pdf (section 3.2.2).
Yes, I've read about it before. Two key points, IMO:
this command instructs the workers to load the checkpoint (this step actually rolls back the model to the last iteration), and then the workers are to run
but I found the current code structure makes it very difficult to pull workers back when they are blocked on an allreduce/broadcast (<= if we can do this....hmmm....)
Just to double-confirm: allreduce and broadcast are synchronous blocking calls, and the results are the same across all nodes, unlike a partitioned operator that keys by some partition key and holds part of the state of a logical operation. Rabit's state is much simpler and more naive. So if we want to apply what the paper describes, we can also do the checkpoint on any rank, really.
Yeah, we can instrument init at the rabit level, but the logic actually runs in a framework like XGBoost. A certain amount of in-memory objects ("user state" in Flink terms) in XGBoost would need to be recomputed from the last checkpoint, and then we're just back to the restart strategy.
According to this comment #98 (comment),
maxseqno/diff_seq is not used for now. What we can use it for is to check whether everyone has the same max cache seq; if so, there is no need to run cache restore at all.
I guess the question is why we introduce this and how it functions.
The ActionSummary reducer takes the minimal seqno across all nodes.
Two requirements here are slightly different from the seqno min / OR operations in the ActionSummary reducer function:
Since the cache spans seqno reset lifecycles, we only want to restore the cache if not all nodes are calling getcache. That translates to a bitwise AND over per-node flags: AND(1,1,1,1) = 1, AND(1,0,1,1) = 0.
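In other words, each node contributes a 1-bit "I am calling getcache" flag and the reducer ANDs them; restore runs only when the result is 0. A Python sketch of that reduction (illustrative only — the helper name is made up, this is not rabit's ActionSummary code):

```python
from functools import reduce

def need_cache_restore(getcache_flags):
    # AND all per-node flags: a result of 1 means every node is in
    # getcache, so no restore is needed; any 0 means some node
    # (e.g. a freshly recovered one) lacks the cache.
    all_in_getcache = reduce(lambda a, b: a & b, getcache_flags)
    return all_in_getcache == 0

need_cache_restore([1, 1, 1, 1])  # all nodes have the cache -> False
need_cache_restore([1, 0, 1, 1])  # one recovered node -> True
```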
Since we want to bring a node without any cache up to the full set of cache entries, all nodes need to know the "max" number of cache entries. The nodes holding that max value offer data to those that don't. If nodes end up with different cache entries, the node with the largest cache wins (the cache is immutable), and the rest get reset to the same cache entries as the winner.
Another question is why we need this max instead of using min. The answer lies in an edge case: when a node recovers, its cache is empty, so it is always the min with cache_seq = 0. That doesn't mean the rest of the nodes actually have anything to offer (they may also be at 0). With "max" we can decide whether we need to do cache recovery at all.
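The max-versus-min point fits in a few lines (a hypothetical helper, just to illustrate the edge case):

```python
def should_restore(cache_seqs):
    # A recovered node always reports cache_seq = 0, so min() alone is
    # uninformative: it is 0 whether or not anyone has data to offer.
    # max() tells us whether any node actually holds cache entries,
    # and a min/max mismatch means someone needs to catch up.
    return max(cache_seqs) > 0 and min(cache_seqs) < max(cache_seqs)

should_restore([0, 0, 0, 0])  # fresh cluster, nothing to offer -> False
should_restore([5, 0, 5, 5])  # recovered node must catch up   -> True
```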
In the future, we might recover selectively based on some partitioning or hashing scheme. (This is one feature in mind; there are still lots of questions about how to implement it. But at a high level: if we can partition and replicate the DMatrix, we can recover a host without reloading data from HDFS per se.)
On top of everything we've talked about, we want to optimize this consensus allreduce down to a minimal payload in one call, versus multiple ad-hoc tryallreduce calls to get the "max" or "OR" flags. That leads to the two separate integers plus flags in SeqType. I admit the naming is confusing at this point.
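A sketch of packing the consensus into a single reduction instead of several ad-hoc calls (the field names here are assumptions for illustration, not the actual SeqType layout):

```python
from functools import reduce

def reduce_summary(a, b):
    # One combined reducer: min over seqno, max over cache seq,
    # AND over the getcache flag -- one allreduce instead of three.
    return {
        "min_seqno": min(a["min_seqno"], b["min_seqno"]),
        "max_cache_seq": max(a["max_cache_seq"], b["max_cache_seq"]),
        "flags": a["flags"] & b["flags"],
    }

nodes = [
    {"min_seqno": 3, "max_cache_seq": 5, "flags": 1},
    {"min_seqno": 0, "max_cache_seq": 0, "flags": 0},  # recovered node
    {"min_seqno": 3, "max_cache_seq": 5, "flags": 1},
]
summary = reduce(reduce_summary, nodes)
# summary == {"min_seqno": 0, "max_cache_seq": 5, "flags": 0}
```

The single payload carries all three answers: the recovered node pulls min_seqno to 0, the healthy nodes keep max_cache_seq at 5 (so there is cache to offer), and the AND-ed flag going to 0 signals that a restore is needed.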