// Create peer-to-peer connections with all peers for the round
await this.establishPeerConnections()
// Exchange weight updates with peers and return aggregated weights
return await this.exchangeWeightUpdates(weights)

Decentralized learning sometimes stalls after the first few rounds.
This appears to be caused by differences in how quickly nodes establish connections with their peers in each round.
Currently, once a node finishes establishing connections via `establishPeerConnections()`, it immediately proceeds to `exchangeWeightUpdates()` and starts sending weight updates to other peers. However, if some peers are slower to complete `establishPeerConnections()`, they may not yet be ready to receive these updates.
As a result, slower nodes can miss incoming weight updates and eventually time out while waiting for them, causing the entire training process to stall.
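
One possible direction, sketched here only to illustrate the idea, is a readiness barrier: after connecting, each node announces that it is ready and waits until every peer of the round has done the same before sending any weights. `ReadyBarrier`, `markReady`, `waitAll`, and the "ready" message are hypothetical names and not part of the existing client. The sketch also assumes that the handler calling `markReady` is registered before connections are established, otherwise the "ready" signal itself could be missed in the same way.

```typescript
// Hypothetical helper (not part of disco): tracks which peers have signalled
// that they finished connecting and are ready to receive weight updates.
class ReadyBarrier {
  private readonly pending: Set<string>
  private readonly waiters: Array<() => void> = []

  constructor(peerIds: Iterable<string>) {
    this.pending = new Set(peerIds)
  }

  // True once every expected peer has signalled readiness.
  get allReady(): boolean {
    return this.pending.size === 0
  }

  // Call when a "ready" message arrives from a peer.
  // Unknown or duplicate peer ids are ignored.
  markReady(peerId: string): void {
    if (!this.pending.delete(peerId)) return
    if (this.pending.size === 0) {
      for (const wake of this.waiters.splice(0)) wake()
    }
  }

  // Resolves once all peers are ready; rejects if that takes longer than timeoutMs.
  waitAll(timeoutMs: number): Promise<void> {
    if (this.pending.size === 0) return Promise.resolve()
    return new Promise<void>((resolve, reject) => {
      const timer = setTimeout(() => {
        // Drop this waiter so a late "ready" does not call a stale callback.
        const i = this.waiters.indexOf(wake)
        if (i >= 0) this.waiters.splice(i, 1)
        const missing = Array.from(this.pending).join(', ')
        reject(new Error(`peers not ready after ${timeoutMs} ms: ${missing}`))
      }, timeoutMs)
      const wake = (): void => {
        clearTimeout(timer)
        resolve()
      }
      this.waiters.push(wake)
    })
  }
}

// How a round might use it (all names other than the two existing methods are assumptions):
//   const barrier = new ReadyBarrier(peerIdsForThisRound)
//   onMessage('ready', (from) => barrier.markReady(from))  // register first
//   await this.establishPeerConnections()
//   broadcast('ready')
//   await barrier.waitAll(10000)
//   return await this.exchangeWeightUpdates(weights)
```

An alternative with the same goal would be to buffer weight updates that arrive before the receiver starts waiting for them, which avoids an extra message per round at the cost of holding early updates in memory.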
disco/discojs/src/client/decentralized/decentralized_client.ts
Lines 153 to 156 in c2c2111